TextCL: a Python package for NLP preprocessing tasks

Petukhova, Alina; Fachada, Nuno

Files

1-s2.0-S2352711022000802-main.pdf (444.62 KB)

Date

2022-07-01

Authors

Petukhova, Alina

Fachada, Nuno

Publisher

Elsevier B.V.

Abstract

Preprocessing text data sets for use in Natural Language Processing tasks is usually a time-consuming and expensive effort. Text data, normally obtained from sources such as, but not limited to, web scraping, scanned documents or PDF files, is typically unstructured and prone to artifacts and other types of noise. The goal of the TextCL package is to simplify this process by providing multiple methods suited to text data preprocessing. It includes functionality for splitting texts into sentences, filtering sentences by language, perplexity filtering, and removing duplicate sentences. The package also offers an outlier detection module, which identifies and filters out texts that differ from the main topic distribution of the data set. This module lets the user select one of several unsupervised outlier detection algorithms, such as TONMF (block coordinate descent framework), RPCA (robust principal component analysis), or SVD (singular value decomposition), and apply it to the text data.

Keywords: Natural language processing; Text filtering; Outlier detection
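As a rough illustration of two of the preprocessing steps named in the abstract (sentence splitting and duplicate-sentence removal), the following is a minimal standard-library sketch. It does not use or reproduce the TextCL API: the function names `split_into_sentences` and `remove_duplicate_sentences`, the punctuation-based splitting rule, and the case-insensitive comparison are all assumptions made for this example.

```python
import re

# Hypothetical helpers for illustration only; these are NOT TextCL functions.
# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_sentences(text):
    """Naively split text into sentences at terminal punctuation."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def remove_duplicate_sentences(sentences):
    """Keep the first occurrence of each sentence.

    Assumption: two sentences are duplicates when they match after
    lowercasing and collapsing runs of whitespace.
    """
    seen = set()
    unique = []
    for sentence in sentences:
        key = " ".join(sentence.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


text = "Text data is noisy. Text data is noisy. It needs cleaning!  text data is NOISY."
sentences = split_into_sentences(text)
cleaned = remove_duplicate_sentences(sentences)
print(cleaned)
```

A punctuation-based splitter like this mishandles abbreviations and decimal numbers, so it should be read only as a stand-in for whatever sentence segmentation the package actually performs.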

Description

SoftwareX 19 (2022) 101122

Keywords

COMPUTER SCIENCE, DATA PROCESSING, WORD PROCESSING, NATURAL LANGUAGE, PYTHON PROGRAMMING LANGUAGE

Citation

Petukhova, A & Fachada, N 2022, 'TextCL: a Python package for NLP preprocessing tasks', SoftwareX 19 (2022) 101122.

URI

https://hdl.handle.net/10437/12937

Collections

FE - Articles in International Peer-Reviewed Journals