Scielo RSS <![CDATA[Computación y Sistemas]]> vol. 21 num. 4 lang. en
<![CDATA[Introduction to the Special Issue on Human Language Technologies]]>
<![CDATA[EDGE2VEC: Edge Representations for Large-Scale Scalable Hierarchical Learning]]> Abstract: At the present front line of Big Data, prediction tasks over the nodes and edges of complex deep architectures need a careful representation of features when assigning hundreds of thousands, or even millions, of labels and samples for information access systems, especially for hierarchical extreme multi-label classification. We introduce edge2vec, an edge representation framework for learning discrete and continuous features of edges in deep architectures. In edge2vec, we learn a mapping of edges associated with nodes, where random samples are augmented by statistical and semantic representations of words and documents. We argue that infusing semantic representations of features for edges by exploiting word2vec and para2vec is the key to learning richer representations for exploring target nodes or labels in the hierarchy. Moreover, we design and implement a balanced stochastic dual coordinate ascent (DCA)-based support vector machine to speed up training. We introduce global decision-based top-down walks, instead of random walks, to predict the most likely labels in the deep architecture. We evaluate the efficiency of edge2vec against existing state-of-the-art techniques on extreme multi-label hierarchical as well as flat classification tasks. The empirical results show that edge2vec is very promising and computationally very efficient in fast learning and prediction tasks. In the deep learning workbench, edge2vec represents a new direction for statistical and semantic representations of features in task-independent networks.
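The abstract above mentions a balanced stochastic dual coordinate ascent (DCA)-based SVM without giving details. As a rough illustration only, the sketch below shows plain stochastic dual coordinate ascent for a linear hinge-loss SVM. It does not include the balancing scheme of edge2vec, and the names `train_svm_dca` and `predict`, the parameters `C` and `epochs`, and the toy data are assumptions made for this example.

```python
import random


def train_svm_dca(X, y, C=1.0, epochs=50, seed=0):
    """Plain stochastic dual coordinate ascent for a linear hinge-loss SVM.

    X is a list of feature vectors, y a list of labels in {-1, +1}.
    No explicit bias term is learned; append a constant feature if needed.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n                      # dual variables, one per sample
    w = [0.0] * d                          # primal weights, kept in sync with alpha
    sq_norm = [sum(v * v for v in x) for x in X]
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)                 # visit samples in random order
        for i in order:
            if sq_norm[i] == 0.0:
                continue
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            grad = margin - 1.0            # gradient of the dual objective in alpha_i
            # One-dimensional update, clipped to the box [0, C].
            new_alpha = min(max(alpha[i] - grad / sq_norm[i], 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                for j in range(d):
                    w[j] += delta * y[i] * X[i][j]
                alpha[i] = new_alpha
    return w


def predict(w, x):
    """Return +1 or -1 according to the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Toy, linearly separable data; the last component is a constant bias feature.
X = [[2.0, 2.0, 1.0], [3.0, 1.0, 1.0], [-2.0, -1.0, 1.0], [-1.0, -3.0, 1.0]]
y = [1, 1, -1, -1]
w = train_svm_dca(X, y)
```

Each update touches a single dual variable and adjusts the weight vector incrementally, which is the property that makes coordinate-wise dual solvers attractive when the number of samples is very large.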
<![CDATA[Text Analysis Using Different Graph-Based Representations]]> Abstract: This paper presents an overview of different graph-based representations proposed to solve text classification tasks. The core of this manuscript is to highlight the importance of enriched/non-enriched co-occurrence graphs as an alternative to traditional feature representation models such as vector representations, which most of the time cannot capture all the richness of text documents that come from the web (social media, blogs, personal web pages, news, etc.). For each text classification task, the type of graph created and the benefits of using it are presented and discussed. Specifically, the type of features/patterns extracted, the classification/similarity methods implemented, and the results obtained on datasets are explained. The theoretical and practical implications of using co-occurrence graphs are also discussed, pointing out the contributions and challenges of modeling text documents as graphs.
<![CDATA[Property Modifiers and Intensional Essentialism]]> Abstract: In this paper, I deal with property modifiers defined as functions that associate a given root property P with a modified property [M P]. Property modifiers typically divide into four kinds, namely intersective, subsective, privative and modal. Here I do not deal with modal modifiers like alleged, which appear to be well-nigh logically lawless, because, for instance, an alleged assassin is or is not an assassin. The goal of this paper is to logically define the three remaining kinds of modifiers, together with the rule of left subsectivity, which I introduce as the rule of pseudo-detachment and which replaces the modifier M in the premise by the property M* in the conclusion, and to prove that the rule of pseudo-detachment is valid for all kinds of modifiers. Furthermore, it is defined in a way that avoids paradoxes like a small elephant being smaller than a large mouse.
<![CDATA[Integrating CALL Systems with Chatbots as Conversational Partners]]> Abstract: Computer Assisted Language Learning (CALL) systems are used as a medium to teach a language without the need for a classroom or a teacher. CALL systems include language lessons and exercises to enhance the learners’ vocabulary, grammar, and writing skills, and to provide learners with immediate feedback on their achievements. However, there is still a need to concentrate on practicing conversation where the computer plays the role of a conversational partner. Using language learning environments, dialogue systems, and chatbots will fulfill this need. This paper presents CALL activities, the role of CALL, and its limitations. The paper discusses the need to integrate CALL systems with a conversational agent or a chatbot to enable learners to practice a language in a conversational manner. Different experiments and evaluations are illustrated in this paper, which show an improvement in learning outcomes when a chatbot is used as a conversational partner. The paper concludes that such integration between CALL and chatbots will lead to better results for language learners.
<![CDATA[Sentence Similarity Computation based on WordNet and VerbNet]]> Abstract: Sentence similarity computing is increasingly used in several applications, such as question answering, machine translation, information retrieval and automatic abstracting systems. This paper first sums up several methods for calculating similarity between sentences that consider semantic and syntactic knowledge. Second, it presents a new method for measuring sentence similarity that aggregates, in a linear function, three components: the lexical similarity LexSim, which covers common words; the semantic similarity SemSim, which uses synonymous words; and the syntactico-semantic similarity SynSemSim, which is based on common semantic arguments, notably thematic role and semantic class.
Concerning the word-based semantic similarity, a measure is computed to estimate the degree of semantic similarity between words by exploiting the WordNet "is a" taxonomy. Moreover, the semantic argument determination is based on the VerbNet database. The proposed method yielded competitive results compared to previously proposed measures and, with regard to Li’s benchmark, showed a high correlation with human ratings. Furthermore, experiments performed on the Microsoft Paraphrase Corpus showed the best F-measure values compared to other measures for high similarity thresholds.
<![CDATA[Less is More, More or Less... Finding the Optimal Threshold for Lexicalization in Chunking]]> Abstract: Lexicalization of the input of sequential taggers has come a long way since it was invented by Molina and Pla [4]. In this paper we thoroughly investigate the method introduced by Indig and Endrédy [2] to find the best lexicalization level for chunking and to explore the behavior of different IOB representations. Both tasks are applied to the CoNLL-2000 dataset. Our goal is to introduce a transformation method that adapts the parameters of the development set to the training set using their frequency distributions, from which other tasks like POS tagging or NER could benefit too.
<![CDATA[Parsing Arabic Nominal Sentences with Transducers to Annotate Corpora]]> Abstract: Studying Arabic nominal sentences is important for successfully analyzing and annotating Arabic corpora. This type of sentence is frequent in Arabic text and speech. Transducers can be used to implement local grammars and treat several linguistic phenomena. In this paper, we propose a parsing approach for Arabic nominal sentences using transducers. To do this, we first study the typology of the Arabic nominal sentence, indicating its different forms. Then, we develop a set of lexical and syntactic rules dealing with this type of sentence and respecting the specificity of the Arabic language.
After that, we present our parsing approach based on transducers and on certain principles. In fact, this approach allows the annotation not only of Arabic nominal sentences but also of Arabic verbal ones. Finally, we outline the implementation and experimentation of our approach on the NooJ platform. The obtained results are satisfactory.
<![CDATA[Subjectivity Detection in Nuclear Energy Tweets]]> Abstract: Subjectivity detection is an important binary classification task that aims at distinguishing natural language texts as opinionated (positive or negative) or non-opinionated (neutral). In this paper, we develop and apply recent subjectivity detection techniques to determine subjective and objective tweets on the hot topic of nuclear energy. This will further help us to detect the presence or absence of social media bias toward nuclear energy. In particular, significant network motifs of words and concepts were learned in dynamic Gaussian Bayesian networks, using Twitter as the source of information. We use reinforcement learning to update each weight based on a probabilistic reward function over all the weights and, hence, to regularize the sentence model. The proposed framework opens new avenues in helping government agencies manage online public opinion to decide and act according to the need of the hour.
<![CDATA[SNEIT: Salient Named Entity Identification in Tweets]]> Abstract: Social media is a rich source of information and opinion, with an exponential data growth rate. However, social media posts are difficult to analyze since they are brief, unstructured and noisy. Interestingly, many social media posts are about an entity or entities. Understanding which entity is central to a post (the Salient Entity) helps to better analyze the post. In this paper we propose a model that aids in such analysis by identifying the Salient Entity in a social media post, in particular in tweets.
We present a supervised machine-learning model to identify the Salient Entity in a tweet and propose that the tweet is most likely about that particular entity. To build a dataset of tweets and salient entities, we used the premise that, when an image accompanies a text, the text is most likely about the entity in that image. We trained our model using this dataset. Note that this does not restrict the applicability of our model in any way. We use tweets with images only to obtain objective ground truth data, while features for the model are derived from the tweet text. Our experiments show that the model identifies the Salient Named Entity with an F-measure of 0.63. We show the effectiveness of the proposed model for tweet-filtering and salience identification tasks. We have made the human-annotated dataset and the source code of this model publicly available.
<![CDATA[Named Entity Recognition on Code-Mixed Cross-Script Social Media Content]]> Abstract: Focusing on the current multilingual scenario in social media, this paper reports on the automatic extraction of named entities (NE) from code-mixed cross-script social media data. Our prime target is to extract NE for question answering. This paper also introduces a Bengali-English (Bn-En) code-mixed cross-script dataset for NE research and proposes domain-specific taxonomies for NE. We used formal as well as informal language-specific features to prepare the classification models and employed four machine learning algorithms (Conditional Random Fields, Margin Infused Relaxed Algorithm, Support Vector Machine and Maximum Entropy Markov Model) for the NE recognition (NER) task. In this study, Bengali is considered the native language while English is considered the non-native language. However, the approach presented in this paper is generic in nature and could be used for any other code-mixed dataset. The classification models based on CRF and SVM performed well among the classifiers.
<![CDATA[Complexity Metric for Code-Mixed Social Media Text]]> Abstract: An evaluation metric is an absolute necessity for measuring the performance of any system and the complexity of any data. In this paper, we discuss how to determine the level of complexity of code-mixed social media texts, which are growing rapidly due to multilingual interference. In general, texts written in multiple languages are often hard to comprehend and analyze. At the same time, in order to meet the demands of analysis, it is also necessary to determine the complexity of a particular document or text segment. Thus, in the present paper, we discuss the existing metrics for determining the code-mixing complexity of a corpus, together with their advantages and shortcomings, and propose several improvements to the existing metrics. The new index better reflects the variety and complexity of a multilingual document. Also, the index can be applied to a sentence and seamlessly extended to a paragraph or an entire document. We have employed two existing code-mixed corpora to suit the requirements of our study.
<![CDATA[A Supervised Method to Predict the Popularity of News Articles]]> Abstract: In this study, we identify the features of an article that encourage people to leave a comment on it. The volume of comments received by a news article shows its importance. It also indirectly indicates the amount of influence a news article has on the public. Leaving a comment on a news article indicates not only that the visitor has read the article but also that the article has been important to him/her. We propose a machine learning approach to predict the volume of comments using information extracted about the users’ activities on the web pages of news agencies. In order to evaluate the proposed method, several experiments were performed. The results reveal a salient improvement in comparison with the baseline methods.
<![CDATA[Using Linguistic Knowledge for Machine Translation Evaluation with Hindi as a Target Language]]> Abstract: Several proposed metrics for MT evaluation, like BLEU, have been criticized for their poor performance in evaluating machine translations. Languages like Hindi, which have relatively free word order and are morphologically rich, pose additional problems in such evaluation. We attempt here to make use of linguistic knowledge to evaluate machine translations with Hindi as a target language. We formulate the problem of MT evaluation as a minimum cost assignment problem between test and reference translations, with a cost function based on linguistic knowledge.
<![CDATA[Pre-Processing of English-Hindi Corpus for Statistical Machine Translation]]> Abstract: A corpus may be considered the fuel for data-driven approaches to machine translation. Parallel corpus building is a labour-intensive task, which makes a parallel corpus a costly and scarce resource. The full potential of available data needs to be exploited, and this can be ensured by removing the different types of inconsistencies faced throughout the NLP domain. This paper describes the experiments carried out on corpus text pre-processing for building a baseline Statistical Machine Translation (SMT) system. The text pre-processing performed here is classified into two stages: (i) the first relates to the handling of the orthographic representation of content, and (ii) the second relates to the handling of non-lexical words. The first stage covers punctuation symbols, casing, word spellings and their normalization, while the second stage covers the handling of numbers and named entities (NEs), applied on the best settings observed in the first stage. The motivation behind these experiments was to derive a relationship and gauge the extent of pre-processing the corpus, thereby building a considerably optimized baseline SMT system.
This baseline system would provide a platform for performing further experiments with different syntactic and semantic factors in the future. The findings presented here are for the English-Hindi language pair; however, the concept of pre-processing is language neutral and can be transferred to any other language pair. The best performance for English-to-Hindi translation is reported when retaining the punctuation symbols, with a lower-cased English corpus and a spell-normalized Hindi corpus. In addition, in the second stage of experiments, the handling of numbers and named entities is described, wherein these are mapped to unique class labels. The impact of these experiments is explained along with their appropriateness for the language pair concerned.
<![CDATA[Knowledge Representation and Phonological Rules for the Automatic Transliteration of Balinese Script on Palm Leaf Manuscript]]> Abstract: Ancient Balinese palm leaf manuscripts record much important knowledge about the histories of world civilization. They vary from ordinary texts to Bali’s most sacred writings. In reality, the majority of Balinese cannot read them because of language obstacles as well as a tradition which perceived doing so as a sacrilege. Palm leaf manuscripts attract historians, philologists, and archaeologists seeking to discover more about ancient ways of life. Unfortunately, there is only limited access to the content of these manuscripts because of the linguistic difficulties. The Balinese palm leaf manuscripts were written in Balinese script in the Balinese language, in ancient literary texts composed in the Old Javanese language of Kawi and Sanskrit. Balinese script is considered to be one of the most complex scripts from Southeast Asia. A transliteration engine for transliterating the Balinese script of palm leaf manuscripts into Latin-based script is one of the most needed systems that has to be developed for the collection of palm leaf manuscript images.
In this paper, we present an implementation of knowledge representation and phonological rules for the automatic transliteration of Balinese script on palm leaf manuscripts. In this system, a rule-based engine for performing transliterations is proposed. Our model is based on phonetics, which in turn is based on the traditional linguistic study of Balinese transliteration. This automatic transliteration system is needed to complete the optical character recognition (OCR) process on palm leaf manuscript images, to make the manuscripts more accessible and readable to a wider audience.
<![CDATA[Cause and Effect Extraction from Biomedical Corpus]]> Abstract: The objective of the present work is to automatically extract cause and effect from a discourse-analyzed biomedical corpus. Cause-effect is defined as a relation established between two events, where the first event acts as the cause of the second event and the second event is the effect of the first event. Any causative construction needs three components: a causal marker, a cause and an effect. In this study, we consider the automatic extraction of cause and effect realized by explicit discourse connective markers. We evaluated our system using BIONLP/NLPBA 2004 shared task test data and obtained encouraging results.
<![CDATA[Hybrid Attention Networks for Chinese Short Text Classification]]> Abstract: To improve the classification performance for Chinese short text with automatic semantic feature selection, in this paper we propose Hybrid Attention Networks (HANs), which combine word- and character-level selective attention. The model first applies an RNN and a CNN to extract the semantic features of texts. Then it captures class-related attentive representations from word- and character-level features. Finally, all of the features are concatenated and fed into the output layer for classification.
Experimental results on 32-class and 5-class datasets show that our model outperforms multiple baselines by combining not only the word- and character-level features of the texts, but also class-related semantic features obtained by the attentive mechanism.
<![CDATA[Content-based SMS Classification: Statistical Analysis for the Relationship between Number of Features and Classification Performance]]> Abstract: High dimensionality of the feature space is one of the difficulties that affect short message service (SMS) classification performance. Some studies used feature selection methods to pick a subset of features, while other studies used the full set of extracted features. In this work, we aim to analyse the relationship between feature set size and classification performance. To that end, a classification performance comparison was carried out between ten feature set sizes selected by various feature selection methods. The methods used were chi-square, Gini index and information gain (IG). A support vector machine was used as the classifier. The area under the ROC (Receiver Operating Characteristic) curve, which plots true positive rate against false positive rate, was used to measure classification performance. We used repeated measures ANOVA at the p < 0.05 level to analyse the performance. Experimental results showed that the IG method outperformed the other methods at all feature set sizes. The best result was obtained with 50% of the extracted features. Furthermore, the results explicitly showed that using a larger feature set in classification does not mean superior performance and sometimes leads to lower classification performance. Therefore, a feature selection step should be used. Reducing the features used for classification, without degrading classification performance, reduces memory usage and classification time.
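To make the information gain (IG) criterion named in the abstract above concrete, here is a minimal sketch of IG-based term selection. It assumes binary term-presence features and messages represented as sets of tokens; the function names, the toy messages, and the `fraction` parameter are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Reduction in label entropy from knowing whether `term` occurs in a message."""
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_top_terms(docs, labels, fraction=0.5):
    """Rank the vocabulary by IG and keep the given fraction of terms."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Toy messages as token sets (illustrative data only).
docs = [
    {"win", "prize", "now"},
    {"free", "prize", "call"},
    {"meeting", "at", "noon"},
    {"call", "me", "at", "noon"},
]
labels = ["spam", "spam", "ham", "ham"]
selected = select_top_terms(docs, labels, fraction=0.5)
```

In this toy data, terms that occur in only one class rank highest, while a term such as "call", which appears once in each class, carries no information about the label and is dropped.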
<![CDATA[Extractive Summarization: Limits, Compression, Generalized Model and Heuristics]]> Abstract: Due to its promise to alleviate information overload, text summarization has attracted the attention of many researchers. However, it has remained a serious challenge. Here, we first prove empirical limits on the recall (and F1-scores) of extractive summarizers on the DUC datasets under ROUGE evaluation for both the single-document and multi-document summarization tasks. Next we define the concept of compressibility of a document and present a new model of summarization, which generalizes existing models in the literature and integrates several dimensions of the summarization problem, viz., abstractive versus extractive, single versus multi-document, and syntactic versus semantic. Finally, we examine some new and some existing single-document summarization algorithms in a single framework and compare them with state-of-the-art summarizers on DUC data.
<![CDATA[Learning to Answer Questions by Understanding Using Entity-Based Memory Network]]> Abstract: This paper introduces a novel neural network model for question answering, the entity-based memory network. It enhances neural networks’ ability to represent and compute information over a long period by keeping records of the entities contained in text. The core component is a memory pool comprising the entities’ states. These states are continuously updated according to the input text. Questions about the input text are used to search the memory pool for related entities, and answers are then predicted based on the states of the retrieved entities. Entities in this model are regarded as the basic units that carry information and construct text. Information carried by text is encoded in the states of entities. Hence text can be best understood by analysing the entities it contains.
Compared with previous memory network models, the proposed model is capable of handling fine-grained information and more sophisticated relations based on entities. We formulated several different tasks as question answering problems and tested the proposed model. The experiments yielded satisfactory results.
<![CDATA[Automatic Analysis of Annual Financial Reports: A Case Study]]> Abstract: The main goal of reporting in the financial system is to ensure high-quality and useful information about the financial position of firms, and to make it available to a wide range of users, including existing and potential investors, financial institutions, employees, the government, etc. Formal reports contain both strictly regulated financial sections and unregulated narrative parts. Our research starts from the hypothesis that there is a relation between business performance and not only the content but also the linguistic properties of the unregulated parts of annual reports. In the paper we first present our dataset of financial reports and the techniques we used to extract the unregulated textual parts. Next, we introduce our approaches to differential content analysis and to the analysis of correlation with financial aspects. The differential content analysis is based on TF-IDF weighting and is aimed at finding the characteristic terms for each year (i.e., the terms which were not prevailing in the previous reports by the same firm). For the correlation of linguistic characteristics of reports with financial aspects, an array of linguistic features was considered and selected financial indicators were used. Linguistic features range from measurements, such as the personal/impersonal pronoun ratio, to assessments of characteristics like financial sentiment, trust, doubt, and discursive features expressing certainty, modality, etc.
While some features show strong correlation with industry (e.g., shorter and more personal reports by the IT industry compared to the automotive industry), doubt and communication words, and to some extent necessity and cognition words, are positively correlated with failure.
<![CDATA[Post-Processing for the Mask of Computational Auditory Scene Analysis in Monaural Speech Segregation]]> Abstract: Speech segregation is one of the most difficult tasks in speech processing. This paper uses computational auditory scene analysis, a support vector machine classifier, and post-processing of the binary mask to separate speech from background noise. Mel-frequency cepstral coefficients and pitch are the two features used for support vector machine classification. Connected Component Labeling, Hole Filling, and Morphology are applied to the resulting binary mask as post-processing. Experimental results show that our method separates speech from background noise effectively.
<![CDATA[Beyond Pairwise Similarity: Quantifying and Characterizing Linguistic Similarity between Groups of Languages by MDL]]> Abstract: We present a minimum description length-based algorithm for finding the regular correspondences between related languages and show how it can be used to quantify the similarity between not only pairs, but whole groups of languages directly from cognate sets. We employ a two-part code, which allows us to use the data and model complexity of the discovered correspondences as information-theoretic quantifications of the degree of regularity of cognate realizations in these languages. Unlike previous work, our approach is not limited to pairs of languages, does not limit the size of discovered correspondences, does not make assumptions about the shape or distribution of correspondences, and requires no expert knowledge or fine-tuning of parameters. We test our approach here on the Slavic languages.
In a pairwise analysis of 13 Slavic languages, we show that our algorithm replicates their linguistic classification exactly. In a four-language experiment, we demonstrate how our algorithm efficiently quantifies similarity between all subsets of the four analyzed languages and find that it is excellently suited to quantifying the orthographic regularity of closely related languages.
<![CDATA[How the Accuracy and Computational Cost of Spiking Neuron Simulation are Affected by the Time Span and Firing Rate]]> Abstract: It is known that, depending on the numerical method, the simulation accuracy of a spiking neuron increases monotonically and the computational cost grows according to a power law as the time step is reduced. Moreover, the mechanism responsible for generating the action potentials also affects the accuracy and computational cost. However, little attention has been paid to how the time span and firing rate influence the simulation. This study describes how the time span and firing rate variables affect the accuracy, computational cost, and efficiency. It was found that the simulation is significantly affected by these two variables.
<![CDATA[Proving Distributed Coloring of Forests in Dynamic Networks]]> Abstract: The design and the proof of correctness of distributed algorithms in dynamic networks are difficult tasks. These networks are characterized by frequent topology changes due to the unpredictable appearance and disappearance of mobile devices and/or communication links. In this paper, we propose a correct-by-construction approach for specifying and proving distributed algorithms in a forest topology. In the first stage, we specify a formal pattern using the Event-B method, based on the refinement technique. The proposed pattern relies on the Dynamicity Aware-Graph Relabeling Systems (DA-GRS), which is an existing model for building and maintaining a forest of spanning trees in dynamic networks.
It is based on evolving graphs as a powerful model to record the evolution of a network topology. In the second stage, we deal with distributed algorithms which can be applied to spanning trees of the forest. In fact, we use the proposed pattern to specify a tree-coloring algorithm. The proof statistics comparing the development of this algorithm with and without the pattern show the efficiency of our solution in terms of proof reduction.
<![CDATA[Remedies for the Inconsistences in the Times of Execution of the Unsorted Database Search Algorithm within the Wave Approach]]> Abstract: The typical semiclassical wave version of the unsorted database search algorithm, based on a system of coupled simple harmonic oscillators, does not consider an important ingredient of Grover’s original algorithm, namely quantum entanglement. The role of entanglement in the wave version of the unsorted database search algorithm is explored, and contradictions with the execution time of Grover’s algorithm are found. We remedy the contradictions by employing two arguments, one qualitative and the other quantitative. For the qualitative argument we employ the probabilistic nature of a legitimate quantum algorithm and remedy the above inconsistency. Within the quantitative argument we identify a parameter in the wave version of the unsorted database search algorithm which is related to entanglement. The contradiction with the execution time of Grover’s algorithm is resolved by choosing an appropriate value of this parameter, which incorporates entanglement into the wave version of the unsorted database search algorithm. The utility of the present arguments is evident if the wave version of the unsorted database search algorithm is experimentally implemented through a system of N quantum dots with a harmonic oscillator potential as the confinement potential for each of the quantum dots.
Each of the above N vibrating quantum dots must be coupled to an extra single vibrating quantum dot which is entangled with all of them. In order to obtain optimal results, the coupling constants of these quantum dots should be adjusted in the way described in the present work.
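For context on the execution time against which the abstract above compares the wave version, the standard textbook analysis of Grover's algorithm for a single marked item can be summarized as follows. This summary is not taken from the paper itself. Here N is the number of database items, k is the number of Grover iterations, and θ is an auxiliary angle introduced only for this summary:

```latex
\sin\theta = \frac{1}{\sqrt{N}}, \qquad
P_{\mathrm{success}}(k) = \sin^{2}\!\big((2k+1)\,\theta\big), \qquad
k_{\mathrm{opt}} \approx \frac{\pi}{4}\sqrt{N} \quad (N \gg 1).
```

The success probability oscillates with k rather than growing monotonically, so the measurement must be made after roughly k_opt iterations, which is the sense in which the algorithm's running time scales as the square root of N.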