SciELO - Scientific Electronic Library Online

 

Computación y Sistemas

Print version ISSN 1405-5546

Comp. y Sist. vol.17 n.2 México Apr./Jun. 2013

 

Articles

 

Graph Mining under Linguistic Constraints for Exploring Large Texts

 

Minería de grafos bajo restricciones lingüísticas para exploración de textos grandes

 

Solen Quiniou¹, Peggy Cellier², Thierry Charnois³, and Dominique Legallois⁴

 

¹ LINA, LUNAM Université de Nantes, Nantes, France. solen.quiniou@univ-nantes.fr

² IRISA, INSA de Rennes, Rennes, France. peggy.cellier@irisa.fr

³ GREYC, Université de Caen Basse-Normandie, Caen, France, and MoDyCO, Université Paris-Ouest Nanterre La Défense, Paris, France. thierry.charnois@unicaen.fr

⁴ CRISCO, Université de Caen Basse-Normandie, Caen, France. dominique.legallois@unicaen.fr

 

Article received on 07/12/2012
Accepted on 11/01/2013.

 

Abstract

In this paper, we propose an approach for exploring large texts by highlighting their coherent sub-parts. The exploration method relies on a graph representation of the text built according to Hoey's linguistic model, which allows adjacent and non-adjacent sentences to be selected and bound together. Our main contribution is a method that combines Hoey's linguistic model with a specific graph mining technique, called CoHoP mining, to extract coherent sub-parts from the graph representation of the text. Experiments on several English texts show the relevance of the proposed approach.

Keywords: Text coherence, graph representation, graph mining, Hoey's linguistic model.
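
As a rough illustration of the graph representation mentioned in the abstract, the following minimal sketch links two sentences whenever they share enough content words, so that non-adjacent sentences can be bound together. The tokenizer, the stop-word list, the `min_shared` threshold, and the function names are all assumptions made for this example; they are not taken from the paper, and the sketch does not implement CoHoP mining itself.

```python
import re
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are", "by", "it"}


def content_words(sentence):
    """Lowercase the sentence and return its set of non-stop-word tokens."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def build_sentence_graph(sentences, min_shared=2):
    """Return an adjacency map {sentence index: set of linked sentence indices}.

    Two sentences are linked when they share at least `min_shared` content
    words. The threshold is an illustrative assumption, not a value from the paper.
    """
    words = [content_words(s) for s in sentences]
    graph = {i: set() for i in range(len(sentences))}
    for i, j in combinations(range(len(sentences)), 2):
        if len(words[i] & words[j]) >= min_shared:
            graph[i].add(j)
            graph[j].add(i)
    return graph


sentences = [
    "Graph mining extracts patterns from a text graph.",
    "The weather was pleasant in Caen.",
    "Patterns in the text graph reveal coherent parts.",
]
graph = build_sentence_graph(sentences)
print(graph)
```

In this toy input, the first and third sentences are linked even though they are not adjacent, while the second sentence stays isolated. A graph mining step would then operate on such an adjacency structure to extract densely connected groups of sentences.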

 


 


 

Acknowledgments

This work is partly supported by the French Région Basse-Normandie and by the Hybride project (ANR-11-BS02-002), funded by the ANR (French National Research Agency). The authors would also like to thank Pierre-Nicolas Mougel and Christophe Rigotti (LIRIS, Lyon) for making CoHoP Miner available.

 


All content of this journal, except where otherwise noted, is licensed under a Creative Commons Attribution License.