<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1405-5546</journal-id>
<journal-title><![CDATA[Computación y Sistemas]]></journal-title>
<abbrev-journal-title><![CDATA[Comp. y Sist.]]></abbrev-journal-title>
<issn>1405-5546</issn>
<publisher>
<publisher-name><![CDATA[Instituto Politécnico Nacional, Centro de Investigación en Computación]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1405-55462006000100007</article-id>
<title-group>
<article-title xml:lang="es"><![CDATA[Algoritmos y Métodos para el Reconocimiento de Voz en Español Mediante Sílabas]]></article-title>
<article-title xml:lang="en"><![CDATA[Algorithms and Methods for the Automatic Speech Recognition in Spanish Language using Syllables]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Oropeza Rodríguez]]></surname>
<given-names><![CDATA[José Luis]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Suárez Guerra]]></surname>
<given-names><![CDATA[Sergio]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[Centro de Investigación en Computación, IPN]]></institution>
<addr-line><![CDATA[México D. F.]]></addr-line>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>03</month>
<year>2006</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>03</month>
<year>2006</year>
</pub-date>
<volume>9</volume>
<numero>3</numero>
<fpage>270</fpage>
<lpage>286</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_arttext&amp;pid=S1405-55462006000100007&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_abstract&amp;pid=S1405-55462006000100007&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_pdf&amp;pid=S1405-55462006000100007&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="es"><p><![CDATA[Actualmente el uso de los fonemas tiene implícitas varias dificultades, debido a que las fronteras entre ellos por lo regular son difíciles de identificar en las representaciones acústicas de la voz. El presente trabajo plantea una alternativa a la forma en la que el reconocimiento de voz se ha venido implementando desde hace ya bastante tiempo, analizando la forma en la cual el paradigma de la sílaba responde a tal labor dentro del español. Durante los experimentos realizados se examinaron para la tarea de segmentación tres elementos esenciales: a) la Función de Energía Total en Corto Tiempo, b) la Función de Energía de altas frecuencias Cepstrales (conocida como Energía del parámetro RO), y c) un Sistema Basado en Conocimiento. Tanto el Sistema Basado en Conocimiento como la Función de Energía Total en Corto Tiempo fueron usados en un corpus de dígitos, en donde los resultados alcanzados usando sólo la Función de Energía Total en Corto Tiempo fueron de 90.58%. Cuando se utilizaron la Función de Energía Total en Corto Tiempo y la Energía del parámetro RO se obtuvo un 94.70% de razón de reconocimiento, lo cual representa un incremento del 5% con relación al uso de palabras completas en un corpus de voz dependiente de contexto.
Por otro lado, cuando se utilizó un corpus de laboratorio de habla continua, al usar la Función de Energía Total en Corto Tiempo y el Sistema Basado en Conocimiento se alcanzó un 78.5% de razón de reconocimiento, y un 80.5% al usar los tres parámetros anteriores. El modelo del lenguaje utilizado en este caso fue el bigrama y se utilizaron Cadenas Ocultas de Markov de densidad continua con tres y cinco estados, con 3 mixturas Gaussianas por estado.]]></p></abstract>
<abstract abstract-type="short" xml:lang="en"><p><![CDATA[This work examines the results of incorporating syllable units into Automatic Speech Recognition for the Spanish language. Because the boundaries between phoneme-like units are often difficult to locate in acoustic representations of speech, phoneme-based units have not reached good performance in Automatic Speech Recognition. In the course of the experiments, three approaches to the segmentation task were examined: a) the Short Term Total Energy Function, b) the Energy Function of the Cepstral High Frequency (known as the RO parameter), and c) a Knowledge Based System. These represent the most important contributions of this work; they showed good results for the continuous and discontinuous speech corpora developed in the laboratory. The Knowledge Based System and the Short Term Total Energy Function were used on a digit corpus, where the Short Term Total Energy Function alone reached a 90.58% recognition rate. When the Short Term Total Energy Function and the RO parameter were used together, a 94.70% recognition rate was achieved. In the continuous speech corpus created in the laboratory, a 78.5% recognition rate was achieved using the Short Term Total Energy Function and the Knowledge Based System, and an 80.5% recognition rate using the three approaches mentioned above. A bigram language model and Continuous Density Hidden Markov Models with three and five states, with three Gaussian mixtures per state, were implemented. By further including a larger number of digital filters and Artificial Intelligence techniques in the training and recognition stages, respectively, the results can be improved even more. This research showed the potential of the syllabic unit paradigm for Automatic Speech Recognition in the Spanish language. Finally, inference rules associated with the rules for splitting words into syllables in Spanish were created for the Knowledge Based System.]]></p></abstract>
<kwd-group>
<kwd lng="es"><![CDATA[Reconocimiento de voz]]></kwd>
<kwd lng="es"><![CDATA[reconocimiento de sílabas]]></kwd>
<kwd lng="es"><![CDATA[sistemas expertos]]></kwd>
<kwd lng="es"><![CDATA[procesamiento de voz]]></kwd>
<kwd lng="en"><![CDATA[Speech recognition]]></kwd>
<kwd lng="en"><![CDATA[Syllables recognition]]></kwd>
<kwd lng="en"><![CDATA[Expert System]]></kwd>
<kwd lng="en"><![CDATA[Speech processing]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[ <p align="justify"><font face="verdana" size="4">Resumen de tesis doctoral</font></p>     <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>     <p align="center"><font face="verdana" size="4"><b>Algoritmos y M&eacute;todos para el Reconocimiento de Voz en Espa&ntilde;ol Mediante S&iacute;labas</b></font></p>     <p align="center"><font face="verdana" size="2">&nbsp;</font></p>     <p align="center"><font face="verdana" size="4"><i>Algorithms and Methods for the Automatic Speech Recognition in Spanish Language using </i><i>Syllables</i></font></p>     <p align="center"><font face="verdana" size="2">&nbsp;</font></p>     <p align="justify"><font face="verdana" size="2"><b>Graduated: Jos&eacute; Luis Oropeza Rodr&iacute;guez    <br> </b><i>Centro de Investigaci&oacute;n en Computaci&oacute;n&#150;IPN    <br> Av. Juan de Dios B&aacute;tiz s/n esq. Miguel Oth&oacute;n Mendiz&aacute;bal C. P. 07738 M&eacute;xico D. F.</i>    <br> <a href="mailto:j_orope@yahoo.com.mx">j_orope@yahoo.com.mx</a>    ]]></body>
<body><![CDATA[<br> <u>Graduado el 15 de diciembre de 2006</u></font></p>     <p align="justify"><font face="verdana" size="2"><b>Advisor: Sergio Su&aacute;rez Guerra    <br> </b><i>Centro de Investigaci&oacute;n en Computaci&oacute;n&#150;IPN    <br> Av. Juan de Dios B&aacute;tiz s/n esq. Miguel Oth&oacute;n Mendiz&aacute;bal C. P. 07738 M&eacute;xico D. F.</i>    <br> <a href="mailto:ssuare@cic.ipn.mx">ssuare@cic.ipn.mx</a></font></p>     <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>     <p align="justify"><font face="verdana" size="2"><b>Resumen</b></font></p>     <p align="justify"><font face="verdana" size="2">Actualmente el uso de los fonemas tiene impl&iacute;citas varias dificultades, debido a que las fronteras entre ellos por lo regular son dif&iacute;ciles de identificar en las representaciones ac&uacute;sticas de la voz. El presente trabajo plantea una alternativa a la forma en la que el reconocimiento de voz se ha venido implementando desde hace ya bastante tiempo, analizando la forma en la cual el paradigma de la s&iacute;laba responde a tal labor dentro del espa&ntilde;ol. Durante los experimentos realizados se examinaron para la tarea de segmentaci&oacute;n tres elementos esenciales: a) la Funci&oacute;n de Energ&iacute;a Total en Corto Tiempo, b) la Funci&oacute;n de Energ&iacute;a de altas frecuencias Cepstrales (conocida como Energ&iacute;a del par&aacute;metro RO), y c) un Sistema Basado en Conocimiento. Tanto el Sistema Basado en Conocimiento como la Funci&oacute;n de Energ&iacute;a Total en Corto Tiempo fueron usados en un corpus de d&iacute;gitos, en donde los resultados alcanzados usando s&oacute;lo la Funci&oacute;n de Energ&iacute;a Total en Corto Tiempo fueron de 90.58%. Cuando se utilizaron la Funci&oacute;n de Energ&iacute;a Total en Corto Tiempo y la Energ&iacute;a del par&aacute;metro RO se obtuvo un 94.70% de raz&oacute;n de reconocimiento. 
Lo cual representa un incremento del 5% con relaci&oacute;n al uso de palabras completas en un corpus de voz dependiente de contexto. Por otro lado, cuando se utiliz&oacute; un corpus de laboratorio de habla continua, al usar la Funci&oacute;n de Energ&iacute;a Total en Corto Tiempo y el Sistema Basado en Conocimiento se alcanz&oacute; un 78.5% de raz&oacute;n de reconocimiento, y un 80.5% al usar los tres par&aacute;metros anteriores. El modelo del lenguaje utilizado en este caso fue el bigrama y se utilizaron Cadenas Ocultas de Markov de densidad continua con tres y cinco estados, con 3 mixturas Gaussianas por estado.</font></p>     <p align="justify"><font face="verdana" size="2"><b>Palabras clave: </b>Reconocimiento de voz, reconocimiento de s&iacute;labas, sistemas expertos, procesamiento de voz.</font></p>     <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>     ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2"><b>Abstract</b></font></p>     <p align="justify"><font face="verdana" size="2">This work examines the results of incorporating syllable units into Automatic Speech Recognition for the Spanish language. Because the boundaries between phoneme&#150;like units are often difficult to locate in acoustic representations of speech, phoneme&#150;based units have not reached good performance in Automatic Speech Recognition. In the course of the experiments, three approaches to the segmentation task were examined: a) the Short Term Total Energy Function, b) the Energy Function of the Cepstral High Frequency (known as the RO parameter), and c) a Knowledge Based System. These represent the most important contributions of this work; they showed good results for the continuous and discontinuous speech corpora developed in the laboratory.</font></p>     <p align="justify"><font face="verdana" size="2">The Knowledge Based System and the Short Term Total Energy Function were used on a digit corpus, where the Short Term Total Energy Function alone reached a 90.58% recognition rate. When the Short Term Total Energy Function and the RO parameter were used together, a 94.70% recognition rate was achieved. In the continuous speech corpus created in the laboratory, a 78.5% recognition rate was achieved using the Short Term Total Energy Function and the Knowledge Based System, and an 80.5% recognition rate using the three approaches mentioned above. A bigram language model and Continuous Density Hidden Markov Models with three and five states, with three Gaussian mixtures per state, were implemented.</font></p>     <p align="justify"><font face="verdana" size="2">By further including a larger number of digital filters and Artificial Intelligence techniques in the training and recognition stages, respectively, the results can be improved even more. 
This research showed the potential of the syllabic unit paradigm for Automatic Speech Recognition in the Spanish language. Finally, inference rules associated with the rules for splitting words into syllables in Spanish were created for the Knowledge Based System.</font></p>     <p align="justify"><font face="verdana" size="2"><b>Keywords: </b>Speech recognition, Syllables recognition, Expert System, Speech processing.</font></p>     <p align="justify">&nbsp;</p>     <p align="justify"><font face="verdana" size="2"><a href="/pdf/cys/v9n3/v9n3a7.pdf" target="_blank">DESCARGA ARTICULO EN FORMATO PDF</a></font></p>     <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>     <p align="justify"><font face="verdana" size="2"><b>Referencias</b></font></p>     <!-- ref --><p align="justify"><font face="verdana" size="2">1. Feal (2000). Feal L., "Sobre el uso de la s&iacute;laba como unidad de s&iacute;ntesis en el espa&ntilde;ol", Informe T&eacute;cnico, Departamento de Inform&aacute;tica, Universidad de Valladolid, 2000.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040314&pid=S1405-5546200600010000700001&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">2. Fosler et al. (1999). Fosler&#150;Lussier E., Greenberg S., Morgan N., "Incorporating Contextual Phonetics into Automatic Speech Recognition". XIV International Congress of Phonetic Sciences, pp. 
611&#150;614, San Francisco, 1999.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040315&pid=S1405-5546200600010000700002&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">3. Giarratano and Riley (2001). Giarratano Joseph y Riley Gary, International Thompson Editores,    Sistemas expertos, principios y programaci&oacute;n 2001.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040316&pid=S1405-5546200600010000700003&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">4. Hauenstein (1996). Hauenstein A., "The syllable Re&#150;revisited", Technical Report, Siemens AG, Corporate Research and Development, M&uuml;nchen Alemania, 1996.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040317&pid=S1405-5546200600010000700004&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">5. Jackson (1986). Jackson L. B. "Digital Filters and Signal Processing". Kluwer Academic Publishers. 
University of Louisville, Department of Electrical and Computer Engineering, U.S.A., 1986</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040318&pid=S1405-5546200600010000700005&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">6. Jones et al. (1999). Jones R., Downey S., Mason J., "Continuous Speech Recognition Using Syllables", Proceedings of Eurospeech, Vol. 3, pp. 1171&#150;1174, Rhodes, Grecia 1999.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040319&pid=S1405-5546200600010000700006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">7. Kamakshi et al.  (2002). Kamakshi V. Prasad, Nagarajan T. and Murthy Hema A.. "Continuous Speech Recognition Using Automatically Segmented Data at Syllabic Units". Department of Computer Science and Engineering. Indian Institute of Technology, Madras, Chennai 600&#150;036. 2002.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040320&pid=S1405-5546200600010000700007&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">8. Kirschning (1998). 
Kirschning Albers Ingrid, "Automatic Speech Recognition with the Parallel Cascade Neural Network", PhD Thesis, Tokyo, Japan, March 1998.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040321&pid=S1405-5546200600010000700008&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">9. Kosko (1992). Kosko B., "Neural Networks for Signal Processing", Prentice Hall, U.S.A., 1992.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040322&pid=S1405-5546200600010000700009&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">10. Meneido et al. (1999). Meneido Hugo, Neto Jo&atilde;o P. and Almeida Lu&iacute;s B., INESC&#150;IST, "Syllable Onset Detection Applied to the Portuguese Language". Sixth European Conference on Speech Communication and Technology (EUROSPEECH'99), Budapest, Hungary, September 5&#150;9, 1999.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040323&pid=S1405-5546200600010000700010&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">11. Meneido and Neto (2000). Meneido H., Neto J., "Combination of Acoustic Models in Continuous Speech Recognition Hybrid Systems". 
INESC, Rua Alves Redol, 9, 1000&#150;029 Lisbon, Portugal, 2000.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040324&pid=S1405-5546200600010000700011&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">12. Mermelstein (1975). Mermelstein Paul "Automatic Segmentation of Speech into Syllabic Units". Haskins Laboratories, New Haven, Connecticut 06510, pp. 880&#150;883,58 (4), June 1975.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040325&pid=S1405-5546200600010000700012&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">13. Oropeza (2000). Oropeza Rodr&iacute;guez Jos&eacute; Luis, "Reconocimiento de Comandos Verbales usando HMM". Tesis de maestr&iacute;a, Centro de Investigaci&oacute;n en Computaci&oacute;n, Noviembre 2000.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040326&pid=S1405-5546200600010000700013&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">14. Rabiner and Biing&#150;Hwang (1993). 
Lawrence Rabiner and Biing&#150;Hwang Juang, "Fundamentals of Speech Recognition", Prentice Hall, 1993.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040327&pid=S1405-5546200600010000700014&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">15. Resch (2001a). Resch Barbara. "Gaussian Statistics and Unsupervised Learning". A tutorial for the Course Computational      Intelligence      Signal      Processing      and      Speech      Communication      Laboratory. <a href="http://www.igi.tugraz.at/lehre/CI/" target="_blank">www.igi.tugraz.at/lehre/CI</a>, November 15, 2001.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040328&pid=S1405-5546200600010000700015&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">16. Resch (2001b). Resch Barbara. "Hidden Markov Models". A Tutorial for the Course Computational Laboratory. Signal Processing and Speech Communication Laboratory. <a href="http://www.igi.tugraz.at/lehre/CI/" target="_blank">www.igi.turgaz.at/lehre/CI</a>, November 15, 2001.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040329&pid=S1405-5546200600010000700016&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">17. Russell and Norvig (1996). 
Russell Stuart and Norvig Peter, Inteligencia Artificial un enfoque moderno, Prentice Hall, 1996.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040330&pid=S1405-5546200600010000700017&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">18. Savage (1995). Savage Carmona Jesus, "A Hybrid Systems with Symbolic AI and Statistical Methods for Speech Recognition". PhD Thesis, University of Washington, 1995.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040331&pid=S1405-5546200600010000700018&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">19. Su&aacute;rez (2005). Su&aacute;rez Guerra Sergio, &iquest;100% de reconocimiento de voz?. Trabajo in&eacute;dito, no publicado.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040332&pid=S1405-5546200600010000700019&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">20. Sydral et al. (1995). Sydral A., Bennet R., Greenspan S., "Applied Speech Technology", Eds (1995). 
CRC Press, ISBN 0&#150;8493&#150;9456&#150;2, U.S.A., 1995.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040333&pid=S1405-5546200600010000700020&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">21. Weber (2000). Weber K., "Multiple Timescale Feature Combination Towards Robust Speech Recognition". Konferenz zur Verarbeitung nat&uuml;rlicher Sprache KOVENS2000, Ilmenau, Alemania, 2000.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040334&pid=S1405-5546200600010000700021&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">22. Wu (1998).     Wu,  S., "Incorporating information from syllable&#150;length time scales into automatic speech recognition", PhD Thesis, Berkeley University, 1998.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040335&pid=S1405-5546200600010000700022&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">23. Wu et al. (1997). Wu S., Shire M., Greenberg S., Morgan N., "Integrating Syllable Boundary Information into Automatic Speech Recognition ". ICASSP&#150;97, Vol. 1, Munich Germany, vol.2 pp. 
987&#150;990, 1997.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040336&pid=S1405-5546200600010000700023&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p align="justify"><font face="verdana" size="2">24. Zhang (1999). Zhang Jialu, "On the syllable structures of Chinese relating to speech recognition", Institute of Acoustics, Academia Sinica Beijing, China, 1999.</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=2040337&pid=S1405-5546200600010000700024&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> ]]></body><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Feal]]></surname>
<given-names><![CDATA[L]]></given-names>
</name>
</person-group>
<source><![CDATA[Sobre el uso de la sílaba como unidad de síntesis en el español: Informe Técnico]]></source>
<year>2000</year>
<publisher-name><![CDATA[Departamento de Informática, Universidad de Valladolid]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Fosler-Lussier]]></surname>
<given-names><![CDATA[E]]></given-names>
</name>
<name>
<surname><![CDATA[Greenberg]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Morgan]]></surname>
<given-names><![CDATA[N]]></given-names>
</name>
</person-group>
<source><![CDATA[Incorporating Contextual Phonetics into Automatic Speech Recognition]]></source>
<year>1999</year>
<conf-name><![CDATA[ XIV International Congress of Phonetic Sciences]]></conf-name>
<conf-loc> </conf-loc>
<page-range>611-614</page-range><publisher-loc><![CDATA[San Francisco ]]></publisher-loc>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Giarratano]]></surname>
<given-names><![CDATA[Joseph]]></given-names>
</name>
<name>
<surname><![CDATA[Riley]]></surname>
<given-names><![CDATA[Gary]]></given-names>
</name>
</person-group>
<source><![CDATA[Sistemas expertos, principios y programación]]></source>
<year>2001</year>
<publisher-name><![CDATA[International Thompson Editores]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Hauenstein]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
</person-group>
<source><![CDATA[The syllable Re-revisited: Technical Report]]></source>
<year>1996</year>
<publisher-loc><![CDATA[München ]]></publisher-loc>
<publisher-name><![CDATA[Siemens AG, Corporate Research and Development]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Jackson]]></surname>
<given-names><![CDATA[L. B]]></given-names>
</name>
</person-group>
<source><![CDATA[Digital Filters and Signal Processing]]></source>
<year>1986</year>
<publisher-name><![CDATA[Kluwer Academic Publishers; University of Louisville, Department of Electrical and Computer Engineering]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Jones]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
<name>
<surname><![CDATA[Downey]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Mason]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
</person-group>
<source><![CDATA[Continuous Speech Recognition Using Syllables: Proceedings of Eurospeech]]></source>
<year>1999</year>
<volume>3</volume>
<page-range>1171-1174</page-range><publisher-loc><![CDATA[Rhodes ]]></publisher-loc>
</nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Prasad]]></surname>
<given-names><![CDATA[Kamakshi V]]></given-names>
</name>
<name>
<surname><![CDATA[Nagarajan]]></surname>
<given-names><![CDATA[T]]></given-names>
</name>
<name>
<surname><![CDATA[Murthy]]></surname>
<given-names><![CDATA[Hema A]]></given-names>
</name>
</person-group>
<source><![CDATA[Continuous Speech Recognition Using Automatically Segmented Data at Syllabic Units]]></source>
<year>2002</year>
<publisher-loc><![CDATA[Madras ]]></publisher-loc>
<publisher-name><![CDATA[Department of Computer Science and Engineering. Indian Institute of Technology]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kirschning Albers]]></surname>
<given-names><![CDATA[Ingrid]]></given-names>
</name>
</person-group>
<source><![CDATA[Automatic Speech Recognition with the Parallel Cascade Neural Network]]></source>
<year>1998</year>
<month>March</month>
<publisher-loc><![CDATA[Tokyo ]]></publisher-loc>
</nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kosko]]></surname>
<given-names><![CDATA[B]]></given-names>
</name>
</person-group>
<source><![CDATA[Neural Networks for Signal Processing]]></source>
<year>1992</year>
<publisher-name><![CDATA[Prentice Hall]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Meneido]]></surname>
<given-names><![CDATA[Hugo]]></given-names>
</name>
<name>
<surname><![CDATA[Neto]]></surname>
<given-names><![CDATA[João P]]></given-names>
</name>
<name>
<surname><![CDATA[Almeida]]></surname>
<given-names><![CDATA[Luís B]]></given-names>
</name>
</person-group>
<source><![CDATA[Syllable Onset Detection Applied to the Portuguese Language]]></source>
<year>1999</year>
<conf-name><![CDATA[ Sixth European Conference on Speech Communication and Technology (EUROSPEECH'99)]]></conf-name>
<conf-loc><![CDATA[Budapest]]></conf-loc>
</nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Meinedo]]></surname>
<given-names><![CDATA[H]]></given-names>
</name>
<name>
<surname><![CDATA[Neto]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
</person-group>
<source><![CDATA[Combination of Acoustic Models in Continuous Speech Recognition Hybrid Systems]]></source>
<year>2000</year>
<volume>9</volume>
<page-range>1000-029</page-range><publisher-loc><![CDATA[Lisbon ]]></publisher-loc>
<publisher-name><![CDATA[INESC, Rua Alves Redol]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Mermelstein]]></surname>
<given-names><![CDATA[Paul]]></given-names>
</name>
</person-group>
<source><![CDATA[Automatic Segmentation of Speech into Syllabic Units]]></source>
<year>1975</year>
<month>June</month>
<volume>58</volume><issue>4</issue>
<page-range>880-883</page-range><publisher-loc><![CDATA[New Haven ]]></publisher-loc>
<publisher-name><![CDATA[Haskins Laboratories]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Oropeza Rodríguez]]></surname>
<given-names><![CDATA[José Luis]]></given-names>
</name>
</person-group>
<source><![CDATA[Reconocimiento de Comandos Verbales usando HMM]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Rabiner]]></surname>
<given-names><![CDATA[Lawrence]]></given-names>
</name>
<name>
<surname><![CDATA[Juang]]></surname>
<given-names><![CDATA[Biing-Hwang]]></given-names>
</name>
</person-group>
<source><![CDATA[Fundamentals of Speech Recognition]]></source>
<year>1993</year>
<publisher-name><![CDATA[Prentice Hall]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Resch]]></surname>
<given-names><![CDATA[Barbara]]></given-names>
</name>
</person-group>
<source><![CDATA[Gaussian Statistics and Unsupervised Learning: A Tutorial for the Course Computational Intelligence. Signal Processing and Speech Communication Laboratory]]></source>
<year>2001</year>
</nlm-citation>
</ref>
<ref id="B16">
<label>16</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Resch]]></surname>
<given-names><![CDATA[Barbara]]></given-names>
</name>
</person-group>
<source><![CDATA[Hidden Markov Models: A Tutorial for the Course Computational Intelligence. Signal Processing and Speech Communication Laboratory]]></source>
<year>2001</year>
</nlm-citation>
</ref>
<ref id="B17">
<label>17</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Russell]]></surname>
<given-names><![CDATA[Stuart]]></given-names>
</name>
<name>
<surname><![CDATA[Norvig]]></surname>
<given-names><![CDATA[Peter]]></given-names>
</name>
</person-group>
<source><![CDATA[Inteligencia Artificial: un enfoque moderno]]></source>
<year>1996</year>
<publisher-name><![CDATA[Prentice Hall]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B18">
<label>18</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Savage Carmona]]></surname>
<given-names><![CDATA[Jesus]]></given-names>
</name>
</person-group>
<source><![CDATA[A Hybrid System with Symbolic AI and Statistical Methods for Speech Recognition]]></source>
<year>1995</year>
<publisher-name><![CDATA[University of Washington]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B19">
<label>19</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Suárez Guerra]]></surname>
<given-names><![CDATA[Sergio]]></given-names>
</name>
</person-group>
<source><![CDATA[¿100% de reconocimiento de voz?]]></source>
<year></year>
</nlm-citation>
</ref>
<ref id="B20">
<label>20</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Syrdal]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
<name>
<surname><![CDATA[Bennett]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
<name>
<surname><![CDATA[Greenspan]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
</person-group>
<source><![CDATA[Applied Speech Technology]]></source>
<year>1995</year>
<publisher-name><![CDATA[CRC Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B21">
<label>21</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Weber]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
</person-group>
<source><![CDATA[Multiple Timescale Feature Combination Towards Robust Speech Recognition]]></source>
<conf-name><![CDATA[Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2000)]]></conf-name>
<year>2000</year>
<publisher-loc><![CDATA[Ilmenau ]]></publisher-loc>
</nlm-citation>
</ref>
<ref id="B22">
<label>22</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Wu]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
</person-group>
<source><![CDATA[Incorporating information from syllable-length time scales into automatic speech recognition]]></source>
<year>1998</year>
<publisher-name><![CDATA[University of California, Berkeley]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B23">
<label>23</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Wu]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Shire]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
<name>
<surname><![CDATA[Greenberg]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Morgan]]></surname>
<given-names><![CDATA[N]]></given-names>
</name>
</person-group>
<source><![CDATA[Integrating Syllable Boundary Information into Automatic Speech Recognition]]></source>
<year>1997</year>
<volume>2</volume>
<page-range>987-990</page-range><publisher-loc><![CDATA[Munich ]]></publisher-loc>
</nlm-citation>
</ref>
<ref id="B24">
<label>24</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Zhang]]></surname>
<given-names><![CDATA[Jialu]]></given-names>
</name>
</person-group>
<source><![CDATA[On the syllable structures of Chinese relating to speech recognition]]></source>
<year>1999</year>
<publisher-loc><![CDATA[Beijing ]]></publisher-loc>
<publisher-name><![CDATA[Institute of Acoustics, Academia Sinica]]></publisher-name>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
