1405-5546

S1405-55462013000200009

Canada

June 2013

Vol. 17, No. 2, pp. 187-196

Articles

A Knowledge-Base Oriented Approach for Automatic Keyword Extraction

Ludovic Jean-Louis¹, Michel Gagnon¹, and Eric Charton³

¹ École Polytechnique de Montréal, Montréal, QC, Canada ludovic.jean-louis@polymtl.ca

² Centre de Recherche Informatique de Montréal, Montréal, QC, Canada michel.gagnon@polymtl.ca

³ École Polytechnique de Montréal, Montréal, QC, Canada and Centre de Recherche Informatique de Montréal, Montréal, QC, Canada eric.charton@crim.ca

Article received on 08/12/2012; accepted on 17/01/2013.

Abstract

Automatic keyword extraction is an important subtask of information extraction. It is a difficult task for which numerous techniques and resources have been proposed. In this paper, we propose a generic approach to extracting keywords from documents using encyclopedic knowledge. Our two-step approach first applies a classification step to identify candidate keywords, then orders the candidates with a learning-to-rank method that depends on a user-defined keyword profile. The novelty of our approach lies in (i) the use of the keyword profile and (ii) generic features that are derived from Wikipedia categories and are not necessarily related to the document content. We evaluate our system on keyword datasets and corpora from standard evaluation campaigns and show that it improves the overall keyword extraction process.

Keywords: Automatic keyword extraction, encyclopedic knowledge.
