Scielo RSS <![CDATA[Polibits]]> http://www.scielo.org.mx/rss.php?pid=1870-904420120001&lang=pt vol. num. 45 lang. pt
<![CDATA[SciELO Logo]]> http://www.scielo.org.mx/img/en/fbpelogp.gif http://www.scielo.org.mx

<![CDATA[<b>Editorial</b>]]> http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1870-90442012000100001&lng=pt&nrm=iso&tlng=pt

<![CDATA[<b>Content Extraction based on Hierarchical Relations in DOM Structures</b>]]> http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1870-90442012000100002&lng=pt&nrm=iso&tlng=pt
This article introduces a new approach to content extraction that exploits the hierarchical interrelations of the elements in a webpage. Content extraction is a technique for extracting the main textual content from a webpage. It is useful for filtering out advertisements and any other information that is not part of the main content. The main idea behind our approach is to use the DOM tree as an explicit representation of the interrelations of the elements in a webpage. Using the information contained in the DOM tree, we can identify blocks of content and easily determine which of the blocks contains the most text. Thanks to this information, the technique achieves considerable recall and precision. Using the DOM structure for content extraction gives us the benefits of other approaches based on the syntax of the webpage (such as characters, words and tags), but it also gives us very precise information about the related components in a block, thus producing very cohesive blocks.

<![CDATA[<b>A Flexible Table Parsing Approach</b>]]> http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1870-90442012000100003&lng=pt&nrm=iso&tlng=pt
Relational data is often encoded in tables. Tables are easy for humans to read, but difficult to interpret automatically.
In cases where table layout cues are not obtainable (missing HTML tags) or where columns are distorted (by copying from a spreadsheet to text), previous table extraction approaches run into problems. This paper introduces a novel table parsing approach. Our approach is based on a set of simple assumptions: (a) every table can be split into data cells and headers, and (b) every table can be parsed beginning from a data cell, utilizing the overall table structure. Table parsing is defined as "table flattening" in this paper. That is, the parsing starts with a data cell and pulls out all tokens (i.e., headers and sub-headers) associated with that data cell. We propose a parsing technique that uses two simple parsing heuristics: table headers are to the left of a data cell and above it. We experimented with trader emails that contained instrument information with bid-ask prices as data cells. We developed a clustering and classification method for finding prices reliably in the data set we used. This method is transferable to other data cell types and can be applied to other table content.

<![CDATA[<b>String Distances for Near-duplicate Detection</b>]]> http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1870-90442012000100004&lng=pt&nrm=iso&tlng=pt
Near-duplicate detection is important when dealing with large, noisy databases in data mining tasks. In this paper, we present the results of applying the Rank distance and the Smith-Waterman distance, along with more popular string similarity measures such as the Levenshtein distance, together with a disjoint set data structure, to the problem of near-duplicate detection.
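
The combination named in the abstract above (a string distance plus a disjoint set structure) can be sketched roughly as follows. This is only an illustration, not the authors' implementation: it uses the Levenshtein distance alone, compares all pairs exhaustively, and the names `levenshtein`, `DisjointSet`, `near_duplicate_groups` and the threshold `max_distance` are hypothetical choices made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


class DisjointSet:
    """Minimal union-find structure over the indices 0..n-1."""

    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def near_duplicate_groups(records, max_distance=2):
    """Group records whose pairwise edit distance is at most max_distance.

    max_distance is an assumed threshold; the abstract does not state one.
    """
    ds = DisjointSet(len(records))
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if levenshtein(records[i], records[j]) <= max_distance:
                ds.union(i, j)
    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(ds.find(i), []).append(record)
    return list(groups.values())


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "John Smyth", "Alice Jones"]
    for group in near_duplicate_groups(names):
        print(group)
```

With the sample names, the three spellings of "John Smith" end up in one group and "Alice Jones" in a group of its own. The disjoint set makes the grouping transitive: two records are merged if a chain of close pairs connects them, even when they are not within the threshold of each other directly.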

<![CDATA[<b>Comparing Sanskrit Texts for Critical Editions</b>: <b>The Sequences Move Problem</b>]]> http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1870-90442012000100005&lng=pt&nrm=iso&tlng=pt
A critical edition takes into account various versions of the same text in order to show the differences between two distinct versions, in terms of words that are missing, changed, omitted or displaced. Traditionally, Sanskrit is written without spaces between words, and the word order can be changed without altering the meaning of a sentence. This paper describes the characteristics that make Sanskrit text comparison a special case. It presents two different methods for comparing Sanskrit texts, which can be used to develop a computer-assisted critical edition. The first method uses the longest common subsequence (LCS), while the second uses the global alignment algorithm. Comparing them, we see that the second method provides better results, but that neither method can detect when a word or a sentence fragment has been moved. We then present a method based on N-grams that can detect such a movement when the fragment is not too far from its original location. We show how the method behaves on several examples.

<![CDATA[<b>Spatial Reasoning for Determining the Domain of the Set of Tags that Represent Geographic Objects</b>]]> http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1870-90442012000100006&lng=pt&nrm=iso&tlng=pt
Actualmente, existe una gran cantidad de información geográfica, proveniente de diversas fuentes como imágenes de satélite, fotografías aéreas, mapas, bases de datos, entre otras. Estas fuentes proporcionan una descripción exhaustiva de los objetos geográficos.
Sin embargo, la tarea de identificar el dominio geográfico al que pertenecen involucra un procesamiento semántico, el cual está basado en la conceptualización de un dominio, lo que permite interpretarlo de una manera similar a como los seres humanos reconocen a las entidades geográficas y evitar así la vaguedad. Este trabajo propone un método para realizar un proceso de razonamiento espacial cualitativo en representaciones geográficas. El método se basa en el conocimiento a priori del dominio, el cual se encuentra explícitamente formalizado a través de una ontología. El conocimiento descrito en la ontología se valora de acuerdo con un conjunto de etiquetas que pertenecen a algún tipo de dominio geográfico para realizar el análisis semántico correspondiente y mapear esas etiquetas con los conceptos definidos en la ontología. Como resultado, se obtiene un conjunto de dominios geográficos ordenados por su relevancia, para proporcionar un concepto general relacionado directamente con las etiquetas de entrada, simulando la forma en que cognitivamente percibimos algún dominio geográfico en el mundo real.
<hr/>
Nowadays, there is a large amount of geospatial information from different sources such as satellite images, aerial photographs, maps, databases, and so on. These sources provide a comprehensive description of geographic objects. However, identifying the geographic domain to which the objects belong is not simple, because this task involves semantic processing based on a conceptualization of a domain. This conceptualization allows us to interpret the information in a way similar to how humans recognize geographic entities, and thus to avoid vagueness. We propose a method for qualitative spatial reasoning in geospatial representations. The method is based on a priori knowledge of the domain, which is explicitly formalized through an ontology.
The knowledge described in the ontology is assessed according to a set of tags that belong to some geographic domain, in order to perform the semantic analysis and map those tags to concepts defined in the ontology. As a result, a set of geographic domains ordered by relevance is obtained, providing a general concept directly related to the input tags and simulating the way in which humans cognitively perceive a geographic domain in the real world.

<![CDATA[<b>Tracking Emotions of Bloggers</b>: <b>A Case Study for Bengali</b>]]> http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1870-90442012000100007&lng=pt&nrm=iso&tlng=pt
The present paper describes the identification and tracking of bloggers' emotions over time from structured Bengali blog documents. Ekman's six basic emotions are assigned to the bloggers' comments at sentence-level and paragraph-level granularity. The Referential Informative Chain (RIC) developed for each blogger consists of nodes representing the emotional states of that blogger. Each node of a RIC contains the identification information of its associated blogger, a timestamp, a section and emotional sentences. The nodes in each RIC are arranged in ascending order of their timestamps. An affect scoring technique is employed to capture the emotions from each node of a blogger's RIC. Incorporating the blogger's own emotions and the influential emotions extracted from other bloggers plays a significant role in detecting the emotions of a blogger's present state. The extrinsic evaluation yields precision (P), recall (R) and F-measure of 61.05%, 69.81% and 65.13%, respectively, over a total of 193 emotional states of 20 bloggers. The intrinsic evaluation was conducted using a manual rater with the help of a statistical agreement coefficient, Krippendorff's alpha (α). Two types of alpha, namely nominal alpha and interval alpha, produce average scores of 0.67 and 0.72, respectively.
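
The chain structure described in the abstract above might be modelled along the following lines. This is a hedged sketch only: the class `RicNode`, the functions `build_ric` and `f_measure`, and all field names are hypothetical, and the affect scoring technique is not modelled because the abstract does not specify it. The F-measure helper assumes the usual balanced harmonic mean of precision and recall; under that assumption the reported P and R give about 65.14, which agrees with the reported 65.13% up to rounding.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass(order=True)
class RicNode:
    """One emotional state of a blogger; ordering compares timestamps only."""
    timestamp: datetime
    blogger_id: str = field(compare=False)
    section: str = field(compare=False)
    emotional_sentences: List[str] = field(compare=False, default_factory=list)


def build_ric(nodes, blogger_id):
    """Return the nodes of one blogger in ascending timestamp order."""
    return sorted(node for node in nodes if node.blogger_id == blogger_id)


def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean), assumed here rather than taken from the paper."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    nodes = [
        RicNode(datetime(2012, 1, 3), "blogger_a", "comment", ["sentence 2"]),
        RicNode(datetime(2012, 1, 2), "blogger_b", "comment", ["sentence 3"]),
        RicNode(datetime(2012, 1, 1), "blogger_a", "post", ["sentence 1"]),
    ]
    for node in build_ric(nodes, "blogger_a"):
        print(node.timestamp.date(), node.section)
    print(round(f_measure(61.05, 69.81), 2))
```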

<![CDATA[<b>Knowledge Vertices in XUNL</b>]]> http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1870-90442012000100008&lng=pt&nrm=iso&tlng=pt
This paper addresses some lexical issues in the development of XUNL, a knowledge representation language descended from, and proposed as an alternative to, the Universal Networking Language (UNL). We present the current structure and role of Universal Words (UWs) in UNL and claim that the syntax and semantics of UWs demand a thorough revision in order to meet the requirements of language, culture and human independence. We draw up some guidelines for XUNL and argue that its vertices should be represented by Arabic numerals; should be equivalent to sets of synonyms; should consist of generative lexical roots; should correspond to the elementary particles of meaning; and should not bear any non-relational meaning.

<![CDATA[<b>Automatic Extraction of Semantic Valences of Verbs from Explanatory Dictionaries</b>]]> http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1870-90442012000100009&lng=pt&nrm=iso&tlng=pt
En este trabajo se propone el uso de métodos simbólicos para la extracción de las valencias semánticas de verbos describiéndolas bajo el concepto de patrones de rección de la teoría Significado ⇔ Texto. El método se basa en el procesamiento automático de las definiciones de verbos contenidas en diccionarios explicativos y en el análisis de relaciones semánticas, principalmente de inclusión y de sinonimia, establecidas entre ellos. Partimos de la hipótesis de que las definiciones lexicográficas existentes en diccionarios explicativos deben proporcionar la suficiente información para identificar los actantes de verbos.
Los resultados obtenidos demuestran que, a pesar de que en muchas de las definiciones no es posible encontrar información relativa a la estructura argumental de los verbos, es posible deducirla identificando y analizando las definiciones con las que existan relaciones sinonímicas y de inclusión.
<hr/>
In this work we propose the application of symbolic methods to the extraction of the semantic valences of verbs, describing them under the Government Pattern concept of the Meaning ⇔ Text Theory. The method is based on the automatic processing of the definitions of verbs found in explanatory dictionaries and on the analysis of the semantic relationships, mainly inclusion and synonymy, that hold among them. Our hypothesis is that the lexicographic definitions in explanatory dictionaries supply enough information to identify verb actants. The results show that, even though many definitions contain no information about the argument structure of the verb, that structure can be deduced by identifying and analyzing the definitions with which relations of synonymy and inclusion exist.

<![CDATA[<b>Using Continuations to Account for Plural Quantification and Anaphora Binding</b>]]> http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1870-90442012000100010&lng=pt&nrm=iso&tlng=pt
In this paper we give an explicit formal account of plural semantics in the framework of continuation semantics introduced in [1] and extended in [4]. We deal with aspects of plural dynamic semantics such as plural quantification, plural anaphora, conjunction and disjunction, distributivity and maximality conditions. These phenomena need no extra stipulations to be accounted for in this framework, because continuation semantics provides a unified account of scope-taking.

<![CDATA[<b>Demodulation of Interferograms based on Particle Swarm Optimization</b>]]> http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1870-90442012000100011&lng=pt&nrm=iso&tlng=pt
A parametric method for fringe pattern demodulation by means of particle swarm optimization is presented. The phase is approximated by the parametric estimation of an nth-degree polynomial, so that no further unwrapping is required. A particle swarm is used to optimize the input parameters of the function that estimates the phase. A fitness function is established to evaluate the particles, which considers: (a) the closeness between the observed fringes and the recovered fringes, (b) the phase smoothness and (c) the prior knowledge of the object, such as its shape and size. The swarm of particles evolves until a fitness average threshold is obtained. We demonstrate that the method is able to successfully demodulate fringe patterns, even a one-image closed-fringe pattern.
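
The idea in the abstract above can be illustrated with a much-simplified, one-dimensional Python sketch. It is not the authors' method: it assumes normalized fringes of the form cos(phase), uses only term (a) of the fitness function (closeness between observed and recovered fringes), expresses it as a cost to be minimized, and stops after a fixed number of iterations instead of at a fitness average threshold. The names `phase`, `cost`, `pso` and all parameter values are hypothetical.

```python
import math
import random


def phase(coeffs, x):
    """Evaluate the polynomial phase c0 + c1*x + c2*x**2 + ... by Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def cost(coeffs, xs, observed):
    """Sum of squared differences between observed and recovered fringes."""
    return sum((math.cos(phase(coeffs, x)) - i_obs) ** 2
               for x, i_obs in zip(xs, observed))


def pso(cost_fn, dim, bounds, n_particles=30, iterations=200, seed=0,
        inertia=0.7, c_personal=1.5, c_global=1.5):
    """Basic global-best particle swarm; returns (best position, best cost, history)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_cost = [cost_fn(p) for p in pos]
    g = min(range(n_particles), key=best_cost.__getitem__)
    g_pos, g_cost = best_pos[g][:], best_cost[g]
    history = [g_cost]
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                # Clamp each coordinate to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            c = cost_fn(pos[i])
            if c < best_cost[i]:
                best_cost[i], best_pos[i] = c, pos[i][:]
                if c < g_cost:
                    g_cost, g_pos = c, pos[i][:]
        history.append(g_cost)
    return g_pos, g_cost, history


if __name__ == "__main__":
    xs = [i / 63 for i in range(64)]
    true_coeffs = [1.0, 4.0, 6.0]  # synthetic phase used to generate test fringes
    observed = [math.cos(phase(true_coeffs, x)) for x in xs]
    best, best_cost_value, _ = pso(lambda c: cost(c, xs, observed),
                                   dim=3, bounds=(-10.0, 10.0))
    print(best, best_cost_value)
```

Because the cosine is even and periodic, several coefficient sets reproduce the same fringes, so the recovered coefficients need not equal the synthetic ones; the smoothness and prior-knowledge terms mentioned in the abstract are the kind of extra information that helps resolve such ambiguities.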