Exact and Approximate Prefix Search under Access Locality Requirements for Morphological Analysis and Spelling Correction

Gelbukh, Alexander

Computación y Sistemas

On-line version ISSN 2007-9737; print version ISSN 1405-5546

Comp. y Sist. vol. 6 no. 3, Ciudad de México, Jan./Mar. 2003

Article

La Búsqueda Exacta y Aproximada de Prefijos Bajo los Requerimientos del Acceso Local, para el Análisis Morfológico y Corrección de Ortografía

Alexander Gelbukh

Centro de Investigación en Computación–IPN, Av. Juan de Dios Bátiz s/n esq. Miguel Othón de Mendizábal, Unidad Profesional Adolfo López Mateos, Col. Sn Pedro Zacatenco, Del. Gustavo A. Madero, México D.F., C.P. 07738. E–mail: gelbukh@cic.ipn.mx, gelbukh@gelbukh.com

Article received on December 12, 2000
Accepted on March 18, 2003

Abstract

This paper discusses a data structure for prefix search in a very large dictionary, where the query string is of unlimited length. The problem is important for morphological analysis of inflective languages, including particularly difficult cases such as German word concatenation or the Japanese writing system, which does not use spaces; similar tasks arise in DNA computing. The data structure is optimized for locality of access: all necessary records are guaranteed to be found with access to only one block (page) of the main data storage, which significantly improves performance. To illustrate its usefulness, algorithms for exact and approximate search are described, with application to morphological analysis and spelling correction. Algorithms for building, exporting, and updating the data structure are also explained.

Keywords: prefix search, approximate prefix search, approximate string matching, morphological analysis, spelling correction, natural language processing, DNA computing.
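The one-block guarantee stated in the abstract can be illustrated with a minimal sketch. It assumes one plausible way to obtain the guarantee, which this page does not confirm as the article's actual design: the sorted dictionary is cut into blocks, and each block also carries the earlier words that are prefixes of its first key. Any dictionary word that is a prefix of the query and sorts before the block's first key must then also be a prefix of that first key, so it is found among the carried words. The names `build_blocks`, `prefix_search`, and `carried` are hypothetical.

```python
from bisect import bisect_right


def build_blocks(words, block_size):
    """Cut the sorted dictionary into blocks (illustrative layout only).

    Each block also stores the words from earlier blocks that are
    prefixes of its first key, so one block answers a prefix query.
    """
    words = sorted(set(words))
    blocks = []
    for start in range(0, len(words), block_size):
        keys = words[start:start + block_size]
        first = keys[0]
        # Earlier dictionary words that are prefixes of this block's first key.
        carried = [w for w in words[:start] if first.startswith(w)]
        blocks.append({"first": first, "carried": carried, "keys": keys})
    return blocks


def prefix_search(blocks, index, query):
    """Return all dictionary words that are prefixes of `query`.

    `index` is the small in-memory list of first keys; only one entry
    of `blocks` (standing in for one page on disk) is read.
    """
    pos = bisect_right(index, query) - 1
    if pos < 0:
        return []  # the query sorts before every dictionary word
    block = blocks[pos]  # the single block access
    return [w for w in block["carried"] + block["keys"] if query.startswith(w)]


# Usage: a toy dictionary split into blocks of three words.
words = ["a", "ab", "abc", "abd", "b", "ba", "bab", "c"]
blocks = build_blocks(words, block_size=3)
index = [b["first"] for b in blocks]
print(prefix_search(blocks, index, "abdomen"))  # ['a', 'ab', 'abd']
```

In this sketch the carried words duplicate a small amount of data in exchange for never touching a second block; a real implementation would also have to handle updates, which this example leaves out.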

Resumen

Se presenta una estructura de datos que es útil para la búsqueda de prefijos en un diccionario muy grande con una petición de entrada no limitada. Este problema es importante para el análisis morfológico de los lenguajes flexivos, incluyendo los casos particularmente difíciles tales como encadenamiento de palabras en el alemán o el sistema de la escritura japonés que no utiliza espacios; las tareas similares se presentan en el procesamiento computacional de ADN. La estructura de datos es optimizada para el acceso local: para encontrar todos los registros necesarios, se garantiza el acceso a sólo un bloque (página) del dispositivo principal de almacenamiento de datos, lo que significativamente mejora el rendimiento. Para ilustrar su utilidad, se describen los algoritmos de la búsqueda exacta y aproximada, aplicados al análisis morfológico y la corrección de ortografía. Se explican los algoritmos para la construcción, exportación y actualización de la estructura de datos.

Palabras clave: búsqueda de prefijos, búsqueda aproximada de prefijos, comparación aproximada de cadenas, análisis morfológico, corrección de ortografía, procesamiento de lenguaje natural, computación de ADN.