SciELO - Scientific Electronic Library Online

 


Polibits

Online version ISSN 1870-9044

Abstract

NOVAK, Attila. Improving Corpus Annotation Quality Using Word Embedding Models. Polibits [online]. 2016, n.53, pp.49-53. ISSN 1870-9044. https://doi.org/10.17562/PB-53-5.

Web-crawled corpora contain a significant amount of noise. Automatic corpus annotation tools introduce even more noise by performing erroneous language identification or encoding detection, by introducing tokenization and lemmatization errors, and by adding erroneous tags or analyses to the original words. The goal of the methods presented in this article is to use word embedding models to reveal such errors and to provide correction procedures. The evaluation focuses on analyzing and validating noun compounds, identifying bogus compound analyses, recognizing and concatenating fragmented words, detecting erroneously encoded text, restoring accents, and handling combinations of these errors in a Hungarian web-crawled corpus.
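
As a rough illustration of the general idea for one of these tasks, accent restoration, the following minimal sketch pairs each unaccented word form with its most similar accented variant in an embedding space. It is not the procedure described in the article: the function names, the similarity threshold, and the three-dimensional toy vectors are all assumptions invented for this example, and it assumes that an accent-stripped spelling ends up close to its correctly accented form in the embedding space.

```python
import math
import unicodedata


def strip_accents(word):
    """Remove combining diacritics, e.g. 'körte' -> 'korte'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def accent_restoration_candidates(embeddings, threshold=0.8):
    """Map unaccented forms to their most similar accented variant.

    `embeddings` is a dict of word -> vector. The threshold is a
    hypothetical value chosen for this sketch, not taken from the article.
    """
    # Group accented vocabulary items by their accent-stripped spelling.
    by_stripped = {}
    for word in embeddings:
        stripped = strip_accents(word)
        if stripped != word:
            by_stripped.setdefault(stripped, []).append(word)

    suggestions = {}
    for plain, accented_forms in by_stripped.items():
        if plain not in embeddings:
            continue
        best = max(
            accented_forms,
            key=lambda w: cosine(embeddings[plain], embeddings[w]),
        )
        # Only suggest a restoration when the two forms are used alike.
        if cosine(embeddings[plain], embeddings[best]) >= threshold:
            suggestions[plain] = best
    return suggestions


# Toy vectors invented for illustration; a real model would be trained
# on the corpus itself.
toy_embeddings = {
    "körte": [0.90, 0.10, 0.00],
    "korte": [0.88, 0.12, 0.01],   # accent-stripped spelling of "körte"
    "kor": [0.00, 0.20, 0.90],     # a legitimate word in its own right
    "kór": [0.70, 0.00, -0.30],    # a different word, not a variant of "kor"
}

candidates = accent_restoration_candidates(toy_embeddings)
```

With these toy vectors, "korte" is paired with "körte", while "kor" is left alone because its vector is dissimilar to that of "kór". This is why a similarity check is used here rather than a plain spelling lookup.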

Keywords: Corpus linguistics; lexical resources; corpus annotation; word embeddings.
