A Flexible Table Parsing Approach

Schilder, Frank; Kondadadi, Ravi; Kadiyska, Yana

Polibits

On-line version ISSN 1870-9044

Polibits n.45 México Jun. 2012

Frank Schilder*, Ravi Kondadadi**, and Yana Kadiyska***

Frank Schilder and Ravi Kondadadi are with Thomson Reuters Corporate R & D and Yana Kadiyska is with Thomson Reuters Fixed Income, USA (e-mail: *frank.schilder@thomsonreuters.com, **ravikumar.kondadadi@thomsonreuters.com, ***yana.kadiyska@thomsonreuters.com).

Manuscript received on October 31, 2011; accepted for publication on December 9, 2011.

Abstract

Relational data is often encoded in tables. Tables are easy for humans to read but difficult to interpret automatically. Previous table extraction approaches run into problems when table layout cues are not available (e.g., missing HTML tags) or when columns are distorted (e.g., by copying from a spreadsheet to text). This paper introduces a novel table parsing approach based on two simple assumptions: (a) every table can be split up into data cells and headers, and (b) every table can be parsed beginning from a data cell, utilizing the overall table structure. We define table parsing as "table flattening": the parsing starts with a data cell and pulls out all tokens (i.e., headers and sub-headers) associated with that data cell. The proposed parsing technique uses two simple heuristics: table headers are to the left of a data cell and above it. We experimented with trader emails that contained instrument information with bid-ask prices as data cells, and we developed a clustering and classifying method for finding prices reliably in this data set. The method is transferable to other data cell types and can be applied to other table content.
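The flattening idea can be illustrated with a minimal sketch. It is not the implementation described in the paper: the regular expression `PRICE` is an assumed stand-in for the clustering and classifying step, character-offset overlap is an assumed way to decide what counts as "above", and all names (`Token`, `tokenize`, `is_data_cell`, `flatten`, `EXAMPLE`) are hypothetical. The example table is made up for illustration.

```python
import re
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    text: str
    row: int    # index of the line the token appears on
    start: int  # character offset of the first character
    end: int    # character offset one past the last character


# Assumption: a data cell is a decimal number such as "99.50".
# The paper finds prices with a clustering/classification method instead.
PRICE = re.compile(r"^\d+\.\d+$")


def tokenize(lines: List[str]) -> List[Token]:
    """Split each line on whitespace and remember where each token sits."""
    tokens = []
    for row, line in enumerate(lines):
        for m in re.finditer(r"\S+", line):
            tokens.append(Token(m.group(), row, m.start(), m.end()))
    return tokens


def is_data_cell(tok: Token) -> bool:
    return bool(PRICE.match(tok.text))


def overlaps(a: Token, b: Token) -> bool:
    """True if the two tokens share at least one character column."""
    return a.start < b.end and b.start < a.end


def flatten(lines: List[str]) -> List[Tuple[str, List[str]]]:
    """For each data cell, collect the header tokens to its left and above it."""
    tokens = tokenize(lines)
    records = []
    for cell in tokens:
        if not is_data_cell(cell):
            continue
        # Heuristic 1: non-data tokens to the left on the same row.
        left = [t.text for t in tokens
                if t.row == cell.row and t.end <= cell.start
                and not is_data_cell(t)]
        # Heuristic 2: non-data tokens on earlier rows in overlapping columns.
        above = [t.text for t in tokens
                 if t.row < cell.row and overlaps(t, cell)
                 and not is_data_cell(t)]
        records.append((cell.text, left + above))
    return records


# Made-up plain-text table with bid and ask prices as data cells.
EXAMPLE = [
    "Bond        Bid     Ask",
    "ACME 5Y     99.50   99.75",
    "XYZ 10Y     101.25  101.50",
]
```

Calling `flatten(EXAMPLE)` yields one record per price; the first is `("99.50", ["ACME", "5Y", "Bid"])`. Strict column overlap is the weakest part of this sketch, since it breaks down on the distorted columns that the abstract names as a motivating case.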

Key words: Information retrieval, document processing, tables.
