SciELO - Scientific Electronic Library Online

 
 número45Content Extraction based on Hierarchical Relations in DOM StructuresString Distances for Near-duplicate Detection índice de autoresíndice de materiabúsqueda de artículos
Home Pagelista alfabética de revistas  

Servicios Personalizados

Revista

Articulo

Indicadores

Links relacionados

  • No hay artículos similaresSimilares en SciELO

Compartir


Polibits

versión On-line ISSN 1870-9044

Resumen

SCHILDER, Frank; KONDADADI, Ravi  y  KADIYSKA, Yana. A Flexible Table Parsing Approach. Polibits [online]. 2012, n.45, pp.13-19. ISSN 1870-9044.

Relational data is often encoded in tables. Tables are easy to read by humans, but difficult to interpret automatically. In cases where table layout cues are not obtainable (missing HTML tags) or where columns are distorted (by copying from a spreadsheet to text) previous table extraction approaches run into problems. This paper introduces a novel table parsing approach. Our approach is based on a set of simple assumptions: (a) every table can be split up in data cells and headers, and (b) every table can be parsed beginning from a data cell utilizing the overall table structure. The table parsing is defined as "table flattening" in this paper. That is, the parsing starts with a data cell and pulls out all token (i.e., headers and sub-headers) associated with a respective data cell. We propose a parsing technique that uses two simple parsing heuristics: table headers are to the left of and above a data cell. We experimented with trader emails that contained instrument information with bid-ask prices as data cells. We developed a clustering and classifying method for finding prices reliably in the data set we used. This method is transferable to other data cell types and can be applied to other table content.

Palabras llave : Information retrieval; document processing; tables.

        · texto en Inglés     · Inglés ( pdf )

 

Creative Commons License Todo el contenido de esta revista, excepto dónde está identificado, está bajo una Licencia Creative Commons