Computación y Sistemas

Print version ISSN 1405-5546

Comp. y Sist. vol.17 n.2 México Apr./Jun. 2013

 

Articles

 

Detecting Salient Events in Large Corpora by a Combination of NLP and Data Mining Techniques

 


Delphine Battistelli1, Thierry Charnois2, Jean-Luc Minel3, and Charles Teissèdre4

 

1 STIH, Université Paris Sorbonne, France delphine.battistelli@paris-sorbonne.fr

2 GREYC, Université de Caen, France and MoDyCo, UMR 7114, Université Paris Ouest Nanterre La Défense, France thierry.charnois@unicaen.fr

3 GREYC, Université de Caen, France jean-luc.minel@u-paris10.fr

4 STIH, Université Paris Sorbonne, France charles.teissedre@gmail.com

 

Article received on 05/12/2012
Accepted on 17/01/2013.

 

Abstract

In this paper, we present a framework and a system that extract "salient" events relevant to a query from a large collection of documents and place these events along a timeline. Each event is represented by a sentence extracted from the collection. The method combines linguistic modeling of the meaning of temporal adverbials, symbolic natural language processing techniques (cascades of morpho-lexical transducers), and data mining techniques (namely, sequential pattern mining under constraints). The system was applied to a corpus of French newswires provided by Agence France Presse (AFP), and our experiments suggest that the method is useful for this task. Evaluation was performed in partnership with French newswire agency journalists.

Keywords: Dates, temporal adverbials, event extraction, sequential pattern.
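
As a reading aid, the idea of sequential pattern mining under constraints can be sketched in a few lines of Python. This is a minimal illustration and not the authors' system: the abstract does not say which constraints are used, so the minimum support, maximum gap, and maximum pattern length below are assumptions, and the item labels (DATE, ADV, VERB, NOUN) and the function names `occurs` and `mine` are hypothetical.

```python
def occurs(pattern, sequence, max_gap):
    """True if `pattern` is a subsequence of `sequence` in which consecutive
    matched items are at most `max_gap` positions apart (assumed constraint)."""
    def match(p_idx, start, limit):
        if p_idx == len(pattern):
            return True
        end = len(sequence) if limit is None else min(len(sequence), limit)
        for i in range(start, end):
            # Try every admissible position and backtrack if the rest fails.
            if sequence[i] == pattern[p_idx] and match(p_idx + 1, i + 1, i + 1 + max_gap):
                return True
        return False
    return match(0, 0, None)


def mine(sequences, min_support, max_gap=2, max_len=3):
    """Level-wise search for patterns found in at least `min_support` sequences.
    A pattern is only extended if it is itself frequent, which is safe because
    any occurrence of an extended pattern contains an occurrence of its prefix."""
    items = sorted({item for seq in sequences for item in seq})
    frequent = {}
    frontier = [()]
    while frontier:
        next_frontier = []
        for prefix in frontier:
            if len(prefix) == max_len:
                continue
            for item in items:
                candidate = prefix + (item,)
                support = sum(occurs(candidate, seq, max_gap) for seq in sequences)
                if support >= min_support:
                    frequent[candidate] = support
                    next_frontier.append(candidate)
        frontier = next_frontier
    return frequent


# Toy "sentences" reduced to hypothetical category labels.
sequences = [
    ("DATE", "VERB", "NOUN"),
    ("DATE", "ADV", "VERB", "NOUN"),
    ("NOUN", "VERB"),
]
patterns = mine(sequences, min_support=2, max_gap=2, max_len=3)
for pattern, support in sorted(patterns.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(" ".join(pattern), support)
```

In this toy data the pattern DATE VERB NOUN is kept because it occurs in two sequences within the gap limit, while DATE NOUN is rejected: in the second sequence the two items are three positions apart, which exceeds the assumed maximum gap of 2.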

 



Acknowledgements

This work has been partially funded by ANR Chronolines and Ecos-Sud 28 80.

 

All contents of this journal, except where otherwise noted, are licensed under a Creative Commons Attribution License.