Computación y Sistemas
On-line version ISSN 2007-9737; Print version ISSN 1405-5546
Comp. y Sist. vol.15 n.1 Ciudad de México Jul./Sep. 2011
Articles
Speaker Verification in Different Database Scenarios
Leibny Paola García Perera, Roberto Aceves López, and Juan Nolazco Flores
Departamento de Ciencias Computacionales, Tecnológico de Monterrey, Monterrey, Nuevo León, México. Email: paola.garcia@itesm.mx, aceves@itesm.mx, jnolazco@itesm.mx
Article received on July 30, 2010.
Accepted on January 15, 2011.
Abstract
This paper presents the results of our speaker verification system under two scenarios: the Face and Speaker Verification Evaluation organized by the MOBIO (MObile BIOmetric) consortium, and the Speaker Recognition Evaluation 2010 organized by NIST. The core of the system is a Gaussian Mixture Model (GMM) and maximum-likelihood (ML) framework. First, it extracts the relevant speech features by computing Mel Frequency Cepstral Coefficients (MFCC). The MFCCs then train gender-dependent GMMs that are later adapted to obtain target models. To obtain reliable performance statistics, those target models evaluate a set of trials and final scores are calculated; each score is then tagged as target or impostor. We tried several system configurations and found that each database requires specific tuning to improve performance. For the MOBIO database we obtained an average equal error rate (EER) of 16.43%. For the NIST 2010 database we achieved an average EER of 16.61%. The NIST 2010 database covers several conditions; among them, interview training with interview testing yielded the best EER, 10.94%, followed by phone-call training with phone-call testing at 13.35%.
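The pipeline the abstract describes (background GMM training, adaptation toward a target speaker, and log-likelihood-ratio scoring of a trial) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature arrays are random stand-ins for real MFCCs, the component count and relevance factor are arbitrary, and only means-only MAP adaptation (in the style of Reynolds et al., 2000) is shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-ins for MFCC frames (real systems use ~13-39 dims/frame):
# a background speech pool, the target's enrollment speech, and a test utterance.
ubm_feats = rng.normal(0.0, 1.0, size=(2000, 13))
target_feats = rng.normal(0.5, 1.0, size=(300, 13))
test_feats = rng.normal(0.5, 1.0, size=(200, 13))

# 1. Train a universal background model (UBM) on the pooled background data.
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(ubm_feats)

def map_adapt_means(ubm, feats, r=16.0):
    """Means-only MAP adaptation of a UBM toward enrollment features."""
    post = ubm.predict_proba(feats)                          # frame responsibilities
    n_k = post.sum(axis=0)                                   # soft counts per component
    ex_k = post.T @ feats / np.maximum(n_k[:, None], 1e-10)  # per-component data means
    alpha = (n_k / (n_k + r))[:, None]                       # adaptation coefficients
    return alpha * ex_k + (1.0 - alpha) * ubm.means_

# 2. Build the target model: copy the UBM, then shift its means toward the speaker.
target = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
target.fit(ubm_feats)  # initialize internals, then overwrite the parameters
target.weights_ = ubm.weights_
target.covariances_ = ubm.covariances_
target.precisions_cholesky_ = ubm.precisions_cholesky_
target.means_ = map_adapt_means(ubm, target_feats)

# 3. Trial score: average log-likelihood ratio of target model vs. UBM.
llr = target.score(test_feats) - ubm.score(test_feats)
print(f"LLR score: {llr:.3f}")  # accept as target if above a tuned threshold
```

Tying the target model to the UBM this way is what makes the log-likelihood ratio meaningful: both models share weights and covariances, so the score reflects only how far the test frames pull toward the adapted means.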
Keywords: Speaker verification and authentication.
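The equal error rate reported above is the operating point at which the false-rejection rate on target trials equals the false-acceptance rate on impostor trials. A small sketch of how it can be computed from tagged scores (the score distributions here are synthetic, not the paper's data):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Return the EER: the point where false rejections equal false acceptances."""
    best_gap, best_eer = 1.0, 0.0
    for t in np.sort(np.concatenate([target_scores, impostor_scores])):
        frr = np.mean(target_scores < t)     # targets wrongly rejected
        far = np.mean(impostor_scores >= t)  # impostors wrongly accepted
        if abs(frr - far) < best_gap:
            best_gap, best_eer = abs(frr - far), (frr + far) / 2.0
    return best_eer

rng = np.random.default_rng(1)
# Hypothetical trial scores: targets score higher on average than impostors.
tgt = rng.normal(1.0, 1.0, 1000)
imp = rng.normal(-1.0, 1.0, 1000)
print(f"EER = {100 * equal_error_rate(tgt, imp):.2f} %")
```

With two unit-variance score distributions separated by 2, the EER lands near 16%, which is the same order as the averages the paper reports for MOBIO and NIST 2010.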