GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries

dc.contributor.author: Fachada, Nuno
dc.contributor.author: Fernandes, Daniel
dc.contributor.author: Fernandes, Carlos M.
dc.contributor.author: Ferreira-Saraiva, Bruno D.
dc.contributor.author: Matos-Carvalho, João P.
dc.contributor.institution: COPELABS - Cognitive and People-centric Computing
dc.contributor.institution: CICANT - Centre for Research in Applied Communication, Culture, and New Technologies
dc.contributor.institution: ECATI - School of Communication, Architecture, Arts and Information Technologies
dc.date.accessioned: 2025-09-17T13:50:08Z
dc.date.available: 2025-09-17T13:50:08Z
dc.date.issued: 2025
dc.description: Publisher Copyright: © 2025 by the authors.
dc.description.abstract: Large Language Models (LLMs) have advanced rapidly as tools for automating code generation in scientific research, yet their ability to interpret and use unfamiliar Python APIs for complex computational experiments remains poorly characterized. This study systematically benchmarks a selection of state-of-the-art LLMs in generating functional Python code for two increasingly challenging scenarios: conversational data analysis with the ParShift library, and synthetic data generation and clustering using pyclugen and scikit-learn. Both experiments use structured, zero-shot prompts specifying detailed requirements but omitting in-context examples. Model outputs are evaluated quantitatively for functional correctness and prompt compliance over multiple runs, and qualitatively by analyzing the errors produced when code execution fails. Results show that only a small subset of models consistently generate correct, executable code, with GPT-4.1 standing out as the only model to always succeed in both tasks. In addition to benchmarking LLM performance, this approach helps identify shortcomings in third-party libraries, such as unclear documentation or obscure implementation bugs. Overall, these findings highlight current limitations of LLMs for end-to-end scientific automation and emphasize the need for careful prompt design, comprehensive library documentation, and continued advances in language model capabilities.
dc.description.sponsorship: This research was partially funded by the Fundação para a Ciência e a Tecnologia (FCT, https://ror.org/00snfqn58) under Grants UIDB/04111/2020, UIDB/00066/2020, UIDB/00408/2020, UID/00408/2025, and CEECINST/00002/2021/CP2788/CT0001, as well as by the Instituto Lusófono de Investigação e Desenvolvimento (ILIND), Portugal, under Project COFAC/ILIND/COPELABS/1/2024.
dc.format: application/pdf
dc.identifier.citation: Fachada, N, Fernandes, D, Fernandes, C M, Ferreira-Saraiva, B D & Matos-Carvalho, J P 2025, 'GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries', Future Internet, vol. 17, no. 9, 412. https://doi.org/10.3390/fi17090412
dc.identifier.doi: https://doi.org/10.3390/fi17090412
dc.identifier.issn: 1999-5903
dc.identifier.uri: http://hdl.handle.net/10437/15547
dc.identifier.url: https://www.scopus.com/pages/publications/105017426095
dc.language.iso: eng
dc.peerreviewed: yes
dc.publisher: Multidisciplinary Digital Publishing Institute (MDPI)
dc.relation.ispartof: Future Internet
dc.rights: openAccess
dc.subject: COMPUTER SCIENCE
dc.subject: PYTHON
dc.subject: CODE GENERATION
dc.subject: INFORMÁTICA
dc.subject: PYTHON
dc.subject: CRIAÇÃO DE CÓDIGO
dc.title: GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries
dc.type: article

Files

Main
Name: sets-the-standard-in-automated.pdf
Size: 734.44 KB
Format: Adobe Portable Document Format