
Computación y Sistemas

Print version ISSN 1405-5546

Comp. y Sist. vol.17 no.2 México Apr./Jun. 2013




Detecting Salient Events in Large Corpora by a Combination of NLP and Data Mining Techniques




Delphine Battistelli (1), Thierry Charnois (2), Jean-Luc Minel (3), and Charles Teissèdre (4)


1 STIH, Université Paris Sorbonne, France

2 GREYC, Université de Caen, France and MoDyCo, UMR 7114, Université Paris Ouest Nanterre La Défense, France

3 GREYC, Université de Caen, France

4 STIH, Université Paris Sorbonne, France


Article received on 05/12/2012
Accepted on 17/01/2013.



In this paper, we present a framework and a system that extract "salient" events relevant to a query from a large collection of documents and that also enable these events to be placed along a timeline. Each event is represented by a sentence extracted from the collection. We conducted experiments that show the usefulness of the method for this task. Our method combines linguistic modeling (of the meanings of temporal adverbials), symbolic natural language processing techniques (cascades of morpho-lexical transducers), and data mining techniques (namely, sequential pattern mining under constraints). The system was applied to a corpus of French newswires provided by the Agence France Presse (AFP). Evaluation was performed in partnership with French newswire agency journalists.

Keywords: Dates, temporal adverbials, event extraction, sequential pattern.
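
To make the data mining component more concrete, the following is a minimal sketch of sequential pattern mining under constraints. It is not the authors' implementation: the toy English sentences, the DATE placeholder standing in for a tagged temporal adverbial, the function names occurs and mine_patterns, and the particular constraints (minimum support, maximum gap, length bounds, and a required item) are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Pattern = Tuple[str, ...]


def occurs(pattern: Pattern, sequence: Sequence[str], max_gap: int) -> bool:
    """True if `pattern` is a subsequence of `sequence` in which two
    consecutive matched items are separated by at most `max_gap` tokens."""

    def match(p_idx: int, last_pos: Optional[int]) -> bool:
        if p_idx == len(pattern):
            return True
        if last_pos is None:
            candidates = range(len(sequence))
        else:
            stop = min(len(sequence), last_pos + 2 + max_gap)
            candidates = range(last_pos + 1, stop)
        # Backtracking is needed: the leftmost match may violate the gap later.
        return any(
            sequence[i] == pattern[p_idx] and match(p_idx + 1, i)
            for i in candidates
        )

    return match(0, None)


def mine_patterns(
    sequences: List[Sequence[str]],
    min_support: int,
    max_gap: int,
    min_length: int = 2,
    max_length: int = 3,
    required_item: Optional[str] = None,
) -> Dict[Pattern, int]:
    """Grow frequent patterns one item at a time (prefix growth).

    Support is the number of sequences containing the pattern. A prefix of a
    gap-constrained occurrence is itself an occurrence, so infrequent
    prefixes can be pruned safely. The `required_item` constraint is not
    prunable this way and is applied only when reporting patterns.
    """

    def support(pattern: Pattern) -> int:
        return sum(occurs(pattern, seq, max_gap) for seq in sequences)

    items = sorted({tok for seq in sequences for tok in seq})
    frequent_items = [it for it in items if support((it,)) >= min_support]

    results: Dict[Pattern, int] = {}
    frontier: List[Pattern] = [(it,) for it in frequent_items]
    while frontier:
        next_frontier: List[Pattern] = []
        for pattern in frontier:
            long_enough = len(pattern) >= min_length
            has_item = required_item is None or required_item in pattern
            if long_enough and has_item:
                results[pattern] = support(pattern)
            if len(pattern) < max_length:
                for it in frequent_items:
                    candidate = pattern + (it,)
                    if support(candidate) >= min_support:
                        next_frontier.append(candidate)
        frontier = next_frontier
    return results


# Hypothetical toy corpus: one tokenized sentence per sequence, with
# temporal adverbials already replaced by the placeholder "DATE".
toy_sentences = [
    ["on", "DATE", "the", "court", "ruled"],
    ["on", "DATE", "a", "court", "ruled", "again"],
    ["the", "court", "met", "on", "DATE"],
]

patterns = mine_patterns(
    toy_sentences, min_support=2, max_gap=1, required_item="DATE"
)
for pattern, count in sorted(patterns.items()):
    print(" ".join(pattern), count)
```

In this sketch the support and gap constraints prune the search as patterns grow, while the required-item constraint only filters what is reported. A real system would more likely rely on a dedicated algorithm for large corpora rather than this exhaustive enumeration.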








This work has been partially funded by ANR Chronolines and Ecos-Sud 28 80.



