TextCL: a Python package for NLP preprocessing tasks

dc.contributor.author: Petukhova, Alina
dc.contributor.author: Fachada, Nuno
dc.contributor.institution: Faculdade de Engenharia
dc.date.issued: 2022-07-01
dc.description: SoftwareX 19 (2022) 101122
dc.description.abstract: Preprocessing text data sets for use in Natural Language Processing tasks is usually a time-consuming and expensive effort. Text data, commonly obtained from sources such as web scraping, scanned documents, or PDF files, is typically unstructured and prone to artifacts and other types of noise. The goal of the TextCL package is to simplify this process by providing multiple methods suited for text data preprocessing. It includes functionality for splitting texts into sentences, filtering sentences by language, perplexity filtering, and removing duplicate sentences. The package also offers an outlier detection module, which identifies and filters out texts that differ from the main topic distribution of the data set. This module lets the user select one of several unsupervised outlier detection algorithms, such as TONMF (block coordinate descent framework), RPCA (robust principal component analysis), or SVD (singular value decomposition), and apply it to the text data. Keywords: Natural language processing; Text filtering; Outlier detection
dc.description.status: Non peer reviewed
dc.format: application/pdf
dc.identifier.citation: Petukhova, A & Fachada, N 2022, 'TextCL: a Python package for NLP preprocessing tasks', SoftwareX 19 (2022) 101122.
dc.identifier.issn: 2352-7110
dc.language.iso: eng
dc.publisher: Elsevier
dc.relation.ispartof: SoftwareX 19 (2022) 101122
dc.rights: openAccess
dc.subject: INFORMÁTICA
dc.subject: PROCESSAMENTO DE DADOS
dc.subject: PROCESSAMENTO DE TEXTO
dc.subject: LINGUAGEM NATURAL
dc.subject: LINGUAGEM PYTHON
dc.subject: COMPUTER SCIENCE
dc.subject: DATA PROCESSING
dc.subject: WORD PROCESSING
dc.subject: NATURAL LANGUAGE
dc.subject: PYTHON PROGRAMMING LANGUAGE
dc.title: TextCL: a Python package for NLP preprocessing tasks
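
The preprocessing steps named in the abstract can be illustrated with a minimal, dependency-free Python sketch. It does not use or reproduce TextCL's API: every function name below is hypothetical, the sentence splitter is a naive regular expression, and the outlier score is the reconstruction error of a rank-1 truncated SVD computed by power iteration, assumed here as a simple stand-in for the SVD option the abstract mentions.

```python
import math
import re


def split_into_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def drop_duplicate_sentences(sentences):
    """Keep the first occurrence of each sentence, ignoring case and spacing."""
    seen, unique = set(), []
    for sentence in sentences:
        key = " ".join(sentence.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


def bag_of_words(texts):
    """Build a dense document-term count matrix as a list of rows."""
    tokens = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocab = sorted({w for doc in tokens for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for doc in tokens:
        row = [0.0] * len(vocab)
        for w in doc:
            row[index[w]] += 1.0
        rows.append(row)
    return rows


def rank1_outlier_scores(rows, iterations=100):
    """Distance of each row from its projection on the top right singular vector.

    Assumes at least one non-zero row; larger scores mean the text fits the
    dominant word pattern of the collection less well.
    """
    n_cols = len(rows[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # Power-iteration step on A^T A: v <- A^T (A v), then normalise.
        av = [sum(a * b for a, b in zip(row, v)) for row in rows]
        w = [sum(av[i] * rows[i][j] for i in range(len(rows)))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = []
    for row in rows:
        proj = sum(a * b for a, b in zip(row, v))
        residual = [a - proj * b for a, b in zip(row, v)]
        scores.append(math.sqrt(sum(r * r for r in residual)))
    return scores


if __name__ == "__main__":
    raw = "Hello world. Hello world. Bye!"
    print(drop_duplicate_sentences(split_into_sentences(raw)))

    texts = [
        "The cat sat on the mat.",
        "The cat lay on the mat.",
        "The dog sat on the mat.",
        "Quarterly revenue exceeded analyst forecasts.",
    ]
    scores = rank1_outlier_scores(bag_of_words(texts))
    print(max(range(len(texts)), key=scores.__getitem__))
```

In this toy input the fourth text shares no vocabulary with the other three, so it receives the highest score. A practical pipeline would likely use a weighted representation such as TF-IDF and a higher-rank decomposition; the rank-1 version is kept only to make the idea visible.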

Files

Main
Name: 1-s2.0-S2352711022000802-main.pdf
Size: 444.62 KB
Format: Adobe Portable Document Format

License
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission