GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries

Fachada, Nuno; Fernandes, Daniel; Fernandes, Carlos M.; Ferreira-Saraiva, Bruno D.; Matos-Carvalho, João P.

doi:https://doi.org/10.3390/fi17090412

dc.contributor.author	Fachada, Nuno
dc.contributor.author	Fernandes, Daniel
dc.contributor.author	Fernandes, Carlos M.
dc.contributor.author	Ferreira-Saraiva, Bruno D.
dc.contributor.author	Matos-Carvalho, João P.
dc.contributor.institution	COPELABS - Cognitive and People-centric Computing
dc.contributor.institution	CICANT - Centre for Research in Applied Communication, Culture, and New Technologies
dc.contributor.institution	ECATI - School of Communication, Architecture, Arts and Information Technologies
dc.date.accessioned	2025-09-17T13:50:08Z
dc.date.available	2025-09-17T13:50:08Z
dc.date.issued	2025
dc.description	Publisher Copyright: © 2025 by the authors.
dc.description.abstract	Large Language Models (LLMs) have advanced rapidly as tools for automating code generation in scientific research, yet their ability to interpret and use unfamiliar Python APIs for complex computational experiments remains poorly characterized. This study systematically benchmarks a selection of state-of-the-art LLMs in generating functional Python code for two increasingly challenging scenarios: conversational data analysis with the ParShift library, and synthetic data generation and clustering using pyclugen and scikit-learn. Both experiments use structured, zero-shot prompts specifying detailed requirements but omitting in-context examples. Model outputs are evaluated quantitatively for functional correctness and prompt compliance over multiple runs, and qualitatively by analyzing the errors produced when code execution fails. Results show that only a small subset of models consistently generate correct, executable code, with GPT-4.1 standing out as the only model to always succeed in both tasks. In addition to benchmarking LLM performance, this approach helps identify shortcomings in third-party libraries, such as unclear documentation or obscure implementation bugs. Overall, these findings highlight current limitations of LLMs for end-to-end scientific automation and emphasize the need for careful prompt design, comprehensive library documentation, and continued advances in language model capabilities.	en
dc.description.sponsorship	This research was partially funded by the Fundação para a Ciência e a Tecnologia (FCT, https://ror.org/00snfqn58) under Grants UIDB/04111/2020, UIDB/00066/2020, UIDB/00408/2020, UID/00408/2025, and CEECINST/00002/2021/CP2788/CT0001, as well as by the Instituto Lusófono de Investigação e Desenvolvimento (ILIND), Portugal, under Project COFAC/ILIND/COPELABS/1/2024.
dc.format	application/pdf
dc.identifier.citation	Fachada, N, Fernandes, D, Fernandes, C M, Ferreira-Saraiva, B D & Matos-Carvalho, J P 2025, 'GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries', Future Internet, vol. 17, no. 9, 412. https://doi.org/10.3390/fi17090412
dc.identifier.doi	https://doi.org/10.3390/fi17090412
dc.identifier.issn	1999-5903
dc.identifier.uri	https://hdl.handle.net/10437/15547
dc.identifier.url	https://www.scopus.com/pages/publications/105017426095
dc.identifier.url	http://hdl.handle.net/10437/15547
dc.language.iso	eng
dc.peerreviewed	yes
dc.publisher	Multidisciplinary Digital Publishing Institute (MDPI)
dc.relation.ispartof	Future Internet
dc.rights	openAccess
dc.subject	COMPUTER SCIENCE
dc.subject	PYTHON
dc.subject	CODE GENERATION
dc.subject	INFORMÁTICA
dc.subject	PYTHON
dc.subject	CRIAÇÃO DE CÓDIGO
dc.title	GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries	en
dc.type	article

Files

Name: sets-the-standard-in-automated.pdf
Size: 734.44 KB
Format: Adobe Portable Document Format

Collections

pure-collection
COPELABS - Artigos de Revistas Internacionais com Arbitragem Científica