Scielo RSS <![CDATA[Computación y Sistemas]]> vol. 22 num. 3 lang. en
<![CDATA[Introduction to the thematic issue on Natural Language Processing]]>
<![CDATA[Unsupervised Creation of Normalization Dictionaries for Micro-Blogs in Arabic, French and English]]>
Abstract: Text normalization is necessary to correct micro-blog messages and make more sense of them for information retrieval purposes. Unfortunately, text normalization tools and resources are rarely shared. This paper presents an unsupervised approach to text normalization that uses distributed representations of words, also known as "word embeddings", applied to Arabic, French and English. In addition, a tool will be supplied to create dictionaries for micro-blog normalization, in the form of pairs of a misspelled word and its standard form, for Arabic, French and English. The tool will be available as open source, including the resources: word embedding models (with a vocabulary size of 9 million words for the Arabic model, 5 million words for the English model and 683 thousand words for the French model), and three normalization dictionaries of 10 thousand pairs in Arabic, 3 thousand pairs in French and 18 thousand pairs in English. The evaluation of the tool shows an average normalization success of 96% for English, 89.5% for Arabic and 85% for French. Also, using the English normalization dictionary with a sentiment analysis tool for micro-blog messages increases the F-measure from 58.15 to 59.56.
<![CDATA[Semantic Role Labeling of English Tweets]]>
Abstract: Semantic role labeling (SRL) is the task of assigning conceptual roles to the arguments of a predicate in a sentence. This is an important task for a wide range of tweet-related applications associated with semantic information extraction.
SRL is a challenging task because of the difficulty of defining general semantic roles for all predicates. It is more challenging for Social Media Text (SMT), where the nature of the text is more casual. This paper presents an automatic SRL system for English tweets based on the Sequential Minimal Optimization (SMO) algorithm. The proposed system is evaluated through experiments and reports performance comparable with the prior state-of-the-art SRL system.
<![CDATA[A 5W1H Based Annotation Scheme for Semantic Role Labeling of English Tweets]]>
Abstract: Semantic Role Labeling (SRL) is a well-researched area of Natural Language Processing. State-of-the-art lexical resources have been developed for SRL on formal texts, but they involve a tedious annotation scheme and require linguistic expertise. The difficulties increase manifold when such a complex annotation scheme is applied to tweets for identifying predicates and role arguments. In this paper, we present a simplified approach for annotating English tweets to identify predicates and their corresponding semantic roles. For annotation purposes, we adopted the 5W1H (Who, What, When, Where, Why and How) concept, which is widely used in journalism. The 5W1H task seeks to extract the semantic information in a natural language sentence by distilling it into the answers to the 5W1H questions. The 5W1H approach is comparatively simple and convenient with respect to the PropBank Semantic Role Labeling task. We report the performance of our annotation scheme for SRL on tweets and show that non-expert annotators can produce quality SRL data for tweets. This paper also reports the difficulties and challenges involved in semantic role labeling on Twitter data and proposes solutions to them.
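To make the 5W1H idea concrete, the following is a minimal sketch of how one annotated predicate might be stored. The class name, field names and the example tweet are all invented for illustration; the abstract does not describe the authors' actual annotation format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The six question slots named by the 5W1H concept.
SLOTS = ("who", "what", "when", "where", "why", "how")


@dataclass
class PredicateAnnotation:
    """One predicate in a tweet plus the answers to the 5W1H questions.

    Hypothetical structure: an assumption for illustration only.
    """
    predicate: str
    roles: Dict[str, Optional[str]] = field(
        default_factory=lambda: {slot: None for slot in SLOTS}
    )

    def set_role(self, slot: str, text: str) -> None:
        # Reject anything that is not one of the six 5W1H questions.
        if slot not in SLOTS:
            raise ValueError(f"unknown 5W1H slot: {slot}")
        self.roles[slot] = text

    def filled(self) -> List[str]:
        """Slots the annotator answered, in the canonical 5W1H order."""
        return [slot for slot in SLOTS if self.roles[slot] is not None]


# Invented example tweet, not taken from the paper's data.
tweet = "Our team won the match in Kolkata yesterday"
ann = PredicateAnnotation(predicate="won")
ann.set_role("who", "Our team")
ann.set_role("what", "the match")
ann.set_role("where", "in Kolkata")
ann.set_role("when", "yesterday")
print(ann.filled())
```

Slots that stay `None` (here "why" and "how") simply record that the tweet gives no answer to that question, which is one way a non-expert annotator could leave a role unfilled.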
<![CDATA[Enhancing Deep Learning Gender Identification with Gated Recurrent Units Architecture in Social Text]]>
Abstract: Author profiling consists in inferring an author’s gender, age, native language, dialect or personality by examining his/her written text. This paper presents an extension of the recursive neural network that employs a variant of the Gated Recurrent Units (GRUs) architecture. Our study focuses on gender identification based on Arabic Twitter and Facebook texts by investigating the features of the examined texts. The introduced approach exploits a model that applies a mixture of unsupervised and supervised techniques to learn word vectors capturing the words’ syntactic and semantic properties. We applied our approach to two corpora from two social media varieties: Twitter texts, in which each author is assigned at least 100 tweets, and a Facebook corpus containing short texts with an average of 15.77 words per author. The experimental results obtained are comparable to the best results of the best-performing systems presented in the PAN Lab at CLEF 2017.
<![CDATA[Artificial Method for Building Monolingual Plagiarized Arabic Corpus]]>
Abstract: Plagiarism in textual documents is a widespread problem, given the large digital repository existing on the web. Moreover, it is difficult to evaluate and compare solutions because of the lack of publicly available plagiarized resources in Arabic. In this context, this paper describes the automatic construction of a paraphrased corpus in order to deal with these issues and conduct our experiments, as follows. First, we collected a large corpus containing more than 12 million sentences from different resources. Then, we cleaned it of unnecessary data by applying a set of preprocessing techniques. After that, we used the word2vec algorithm to create a vocabulary from the collected corpus. It efficiently extracted the semantic relationships between words to exploit.
Subsequently, we replaced each word of the source corpus with the most similar vocabulary word, based on a randomly used index, to eventually obtain a suspect corpus. We conducted different experiments, varying the vector dimensions and window sizes to predict the correct context of words and identify the words semantically closest to the target.
<![CDATA[Using Tweets and Emojis to Build TEAD: an Arabic Dataset for Sentiment Analysis]]>
Abstract: Our paper presents a distant supervision algorithm for automatically collecting and labeling 'TEAD', a dataset for Arabic Sentiment Analysis (SA), using emojis and sentiment lexicons. The data was gathered from Twitter between the 1st of June and the 30th of November 2017. Although the idea of using emojis to collect and label training data for SA is not novel, getting this approach to work for Arabic dialect was very challenging. We ended up with more than 6 million tweets labeled as Positive, Negative or Neutral. We present the algorithm used to deal with mixed-content tweets (Modern Standard Arabic, MSA, and Dialect Arabic, DA). We also provide properties and statistics of the dataset alongside experimental results. Our experiments covered a wide range of standard classifiers proven to be efficient for the sentiment classification problem.
<![CDATA[Stance and Sentiment in Czech]]>
Abstract: Sentiment analysis is a wide area with great potential and many research directions. One direction is stance detection, which is somewhat similar to sentiment analysis. We supplement a stance detection dataset with sentiment annotation and explore the similarities of these tasks. We show that stance detection and sentiment analysis can be mutually beneficial by using gold labels for one task as features for the other task. We analysed the presence of target entities for stance detection in the dataset.
We outperform the state-of-the-art results for stance detection in Czech and set new state-of-the-art results for the newly created sentiment analysis part of the extended dataset.
<![CDATA[The Big Five: Discovering Linguistic Characteristics that Typify Distinct Personality Traits across Yahoo! Answers Members]]>
Abstract: In psychology, it is widely believed that five big factors determine the different personality traits: Extraversion, Agreeableness, Conscientiousness, Neuroticism and Openness. In recent years, researchers have started to examine how these factors are manifested across several social networks like Facebook and Twitter. However, to the best of our knowledge, other kinds of social networks such as social/informational question-answering communities (e.g., Yahoo! Answers) have been left unexplored. Therefore, this work explores several predictive models to automatically recognize these factors across Yahoo! Answers members. As a means of devising powerful generalizations, these models were combined with assorted linguistic features. Since we could not ask community members to volunteer to take the personality test, we built a study corpus by conducting a discourse analysis based on deconstructing the test into 112 adjectives. Our results reveal that it is plausible to lessen the dependency upon answered tests and that effective models across distinct factors are sharply different. Also, sentiment analysis and dependency parsing proved to be fundamental for dealing with extraversion, agreeableness and conscientiousness. Furthermore, medium and low levels of neuroticism were found to be related to initial stages of depression and anxiety disorders.
<![CDATA[RESyS: Towards a Rule-based Recommender System based on Semantic Reasoning]]>
Abstract: The ability to be always available and connected for work or social matters has become a reality and a necessity in today’s Information Society.
Harnessing the potential of Semantic Technologies-based reasoning for intelligent redirection of voice calls and recommender systems has been gauged as a promising field to enhance the current voice phone calling experience. Such an experience might be fostered by a disruption based on rule-based recommendation and inference, leveraging current state-of-the-art technology in smartphone apps or fixed-line telecommunications standards to its full potential. In this paper, we present RESyS, a software platform hinging on Semantic Technologies and rule-based recommendations, which has been tested in both a research and an industry proof-of-concept business pilot, validating its major goals.
<![CDATA[A Formula Embedding Approach to Math Information Retrieval]]>
Abstract: Intricate math formulae, which constitute a major part of the content of scientific documents, add to the complexity of scientific document retrieval. Although modifications to conventional indexing and search mechanisms have eased the complexity and exhibited notable performance, the formula embedding approach to scientific document retrieval is equally appealing and promising. The Formula Embedding Module of the proposed system uses a Bit Position Information Table to transform the math formulae contained in scientific documents into binary formula vectors. Each set bit of a formula vector designates the presence of a specific mathematical entity. A mathematical user query is transformed into a query vector in a similar fashion, and the corresponding relevant documents are retrieved. The relevance of a search result is characterized by the extent of similarity between the indexed formula vector and the query vector. Promising performance in a moderately constrained situation substantiates the competence of the proposed approach.
<![CDATA[Unsupervised Sentence Embeddings for Answer Summarization in Non-factoid CQA]]>
Abstract: This paper presents a method for summarizing answers in Community Question Answering.
We explore a deep auto-encoder and a long short-term memory (LSTM) auto-encoder for sentence representation. The sentence representations are used to measure similarity in the Maximal Marginal Relevance algorithm for extractive summarization. Experimental results on a benchmark dataset show that our unsupervised method achieves state-of-the-art performance while requiring no annotated data.
<![CDATA[Discovering Continuous Multi-word Expressions in Czech]]>
Abstract: Multi-word expressions frequently cause incorrect annotations in corpora, since they often contain foreign words or syntactic anomalies. In the case of foreign material, the annotation quality depends on whether the correct language of the sequence is detected. In the case of inter-lingual homographs, this problem becomes difficult. In previous work, we created a dataset of Czech continuous multi-word expressions (MWEs). The candidates were discovered automatically from a Czech web corpus, considering their orthographic variability. The candidates were classified and annotated manually. Afterwards, the dataset was extended automatically by generating all word forms of those MWEs that were annotated as nouns. In this work, we used the dataset as positive examples and filtered out negative examples from the MWE candidates. We trained a classifier with a mean accuracy of 92.7%. We have shown that the combined approach slightly outperforms approaches that rely only on association measures, mainly on MWEs containing inter-lingual homographs and out-of-vocabulary words. The discovery methods can be applied to other languages that encounter orthographic variability in web corpora.
<![CDATA[Using BiLSTM in Dependency Parsing for Vietnamese]]>
Abstract: Recently, deep learning methods have achieved good results in dependency parsing for many natural languages. In this paper, we investigate the use of bidirectional long short-term memory network models for both transition-based and graph-based dependency parsing for the Vietnamese language.
We also report our contribution in building a Vietnamese dependency treebank whose tagset conforms to the Universal Dependency schema. Various experiments demonstrate the efficiency of this method, which achieves the best parsing accuracy in comparison to other existing approaches on the same corpus, with an unlabeled attachment score of 84.45% and a labeled attachment score of 78.56%.
<![CDATA[Arabic Dialect Identification based on Probabilistic-Phonetic Modeling]]>
Abstract: The identification of Arabic dialects is considered to be the first pre-processing component for any natural language processing problem. This task is useful for automatic translation, information retrieval, opinion mining and sentiment analysis. To this end, we propose a statistical approach based on phonetic modeling to identify the corresponding Arabic dialect for each input acoustic signal. The main idea is, first, to calculate a referenced phonetic model for each dialect. Second, for every input audio signal, we calculate an appropriate phonetic model. Third, we compare the latter to all referenced Arabic dialect models. Finally, we associate the input acoustic signal to the dialect where the referenced phonetic model minimizes the cosine similarity. The obtained results are satisfactory. Indeed, based on 117 audio sequences, we attain a classification rate of 93%. Given the achieved results and the coverage of most Arabic dialects, this study can be a reference for future work addressing dialectal speech processing applications.
<![CDATA[Word Sense Disambiguation Features for Taxonomy Extraction]]>
Abstract: Many NLP tasks, such as fact extraction, coreference resolution, etc., rely on existing lexical taxonomies or ontologies. One of the possible approaches to creating a lexical taxonomy is to extract taxonomic relations from a monolingual dictionary or encyclopedia: a semi-formalized resource designed to contain many relations of this kind.
Word-sense disambiguation (WSD) is a mandatory tool for such approaches. The quality of the extracted taxonomy greatly depends on WSD results. Most WSD approaches can be posed as machine learning tasks. For this purpose, feature representations range from collocation vectors, as in the Lesk algorithm, or neural network features, as in Word2Vec, to highly specialized word sense representation models such as AdaGram. In this work we apply several WSD algorithms to dictionary definitions. Our main focus is the influence of different approaches to extracting WSD features from dictionary definitions on WSD accuracy.
<![CDATA[Building a Nasa Yuwe Language Corpus and Tagging with a Metaheuristic Approach]]>
Abstract: Nasa Yuwe is the language of the Nasa indigenous community in Colombia. It is currently threatened with extinction. In this regard, a range of computer science solutions have been developed for the teaching and revitalization of the language. One of the most suitable approaches is the construction of a Part-Of-Speech Tagger (POST), which encourages the analysis and advanced processing of the language. Nevertheless, for Nasa Yuwe no tagged corpus exists, nor is there a POS tagger, and no related works have been reported. This paper therefore concentrates on building a tagged linguistic corpus for the Nasa Yuwe language and generating the first tagging application for Nasa Yuwe. The main results and findings are 1) the process of building the Nasa Yuwe corpus, 2) the tagsets and tagged sentences, as well as the statistics associated with the corpus, and 3) the results of two experiments to evaluate several POS taggers (a Random tagger; three versions of HSTAGger, a tagger based on the harmony search metaheuristic; and three versions of GBHS Tagger, a memetic algorithm based on Global-Best Harmony Search (GBHS), Hill Climbing and an explicit Tabu memory, which obtained the best results compared with the other methods considered on the Nasa Yuwe language corpus).
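As a rough illustration of how a harmony search metaheuristic can be applied to tagging, here is a simplified sketch with a global-best style pitch adjustment. It is not the HSTAGger or GBHS Tagger implementation: the function name, parameters, toy tagset and toy scoring function are assumptions, and the Hill Climbing and Tabu memory components mentioned in the abstract are omitted.

```python
import random
from typing import Callable, List, Sequence


def harmony_search_tagger(
    n_words: int,
    tagset: Sequence[str],
    score: Callable[[List[str]], float],
    memory_size: int = 10,
    hmcr: float = 0.9,  # harmony memory considering rate
    par: float = 0.3,   # pitch adjusting rate
    iterations: int = 500,
    seed: int = 0,
) -> List[str]:
    """Search for a high-scoring tag sequence (simplified, hypothetical sketch)."""
    rng = random.Random(seed)
    # Harmony memory: a pool of random candidate tag sequences.
    memory = [
        [rng.choice(tagset) for _ in range(n_words)]
        for _ in range(memory_size)
    ]
    for _ in range(iterations):
        best = max(memory, key=score)
        new = []
        for i in range(n_words):
            if rng.random() < hmcr:
                tag = rng.choice(memory)[i]  # reuse a tag stored in memory
                if rng.random() < par:
                    tag = best[i]            # pull toward the best harmony
            else:
                tag = rng.choice(tagset)     # random exploration
            new.append(tag)
        # Replace the worst stored harmony if the new one scores higher.
        worst = min(range(memory_size), key=lambda k: score(memory[k]))
        if score(new) > score(memory[worst]):
            memory[worst] = new
    return max(memory, key=score)


# Invented toy target and tags, not Nasa Yuwe data or the paper's tagset.
GOLD = ["N", "V", "N", "ADJ"]


def toy_score(tags: List[str]) -> float:
    # Counts positions that agree with the toy target sequence.
    return float(sum(t == g for t, g in zip(tags, GOLD)))


result = harmony_search_tagger(len(GOLD), ["N", "V", "ADJ"], toy_score)
print(result, toy_score(result))
```

In a real tagger the scoring function would come from corpus statistics rather than a known target; the toy score only makes the search loop easy to follow.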
<![CDATA[Rhetorical Relations in the Speech of Alzheimer’s Patients and Healthy Elderly Subjects: An Approach from the RST]]>
Abstract: The study aims to extract discourse relation patterns in the conversational speech of subjects with Alzheimer’s Disease (AD) and adults with healthy aging processes using the Rhetorical Structure Theory (RST). By means of the RST, we analyzed semi-structured interviews of native Spanish speakers. Seven subjects were in the mild, moderate or advanced stages of AD, and 6 were cognitively intact individuals. The procedure involved the segmentation of each conversational discourse into Semantic Dialog Units (SDUs), the labeling of their rhetorical relations and the construction of tree diagrams. We performed a correlation analysis to determine the significance of the use of rhetorical relations for each group. We found a significantly (p-value < .05) lower rhetorical relations production density in subjects with AD. We also observed that most rhetorical relations used by healthy older subjects were Elaboration, Concession, Interpretation, Non-Volitional Cause, Solutionhood and Volitional Result.
<![CDATA[Automated Lung Segmentation on Computed Tomography Image for the Diagnosis of Lung Cancer]]>
Abstract: Image processing techniques are widely used in several medical areas for early detection and treatment, especially in the detection of various cancer tumors such as Squamous, Adenocarcinoma, Large Cell Carcinomas and Small Cell Lung Cancer. Segmentation of lung tissues from Computed Tomography (CT) images is considered a pre-processing step in lung imaging. However, during lung segmentation, the Juxta-Pleural nodules (nodules attached to parenchymal walls) are missed because they have a similar appearance (intensity) to that of other non-pulmonary structures, which makes it a challenge to segment the lung region along with Juxta-Pleural nodules.
The complexity of segmenting the lung region is mainly due to its inhomogeneity (different structures and intensity values of the lungs). Thus, existing segmentation algorithms such as image thresholding, region growing, active contours, level sets, etc. fail to segment lung tissues including Juxta-Pleural nodules. So, in this paper, a new fully automated lung segmentation method that includes Juxta-Pleural nodules is proposed.
<![CDATA[A Statistical Background Modeling Algorithm for Real-Time Pixel Classification]]>
Abstract: This paper introduces a statistical background pixel classifier intended for real-time and low-resource implementation. The algorithm works within a smart video surveillance application aimed at detecting unattended objects in images with fixed backgrounds. The algorithm receives an input image and builds an initial background model based on image statistics. Using this information, the algorithm identifies new objects that do not belong to the original image. The algorithm categorizes image pixels into four possible classes: shadows, midtones, highlights and foreground pixels. The classification stage produces a binary mask where only objects of interest are shown. The pixel classifier processes Quarter VGA (320 x 240) gray-scale images at a nominal processing rate of 30 frames per second. Higher resolutions such as VGA (640 x 480) have also been tested. We compare results with traditional statistical background modeling methods. Our experiments demonstrate that our approach achieves successful background segmentation with minimal resource consumption while maintaining real-time execution.
<![CDATA[Offshore Wind Farm Layout Optimization via Differential Evolution]]>
Abstract: The Wind Farm Layout Problem (WFLP) consists in the placement of wind turbines (either in a grid, or at any position) within a delimited terrain.
Several factors are taken into account to solve the WFLP, including produced energy, costs (environmental, installation, maintenance, etc.) and the average useful life of turbines, among others. Likewise, optimization techniques involve the use of one or more objective functions, considering traditional as well as evolutionary approaches. Differential Evolution (DE) is an algorithm proposed for global optimization, whose operators are simple both to program and to use, while still providing good convergence properties. The original authors of DE suggested its first five variants, which are: best/1/bin, best/2/bin, current-to-best/1/bin, rand/1/bin, and rand/2/bin. This article compares these five DE variants when they are used to solve 25 different instances of the WFLP; experimental results show that DE/best/1/bin outperforms the remaining algorithms in terms of convergence velocity as well as the quality of the obtained wind farm.
<![CDATA[An Efficient Framework to Detect Cracks in Rail Tracks Using Neural Network Classifier]]>
Abstract: Objectives: The detection of defects or cracks in rail tracks plays an important role in railway management and helps prevent train accidents in both summer and rainy seasons. During summer, cracks form on the track, which can cause the train wheels to slip. In rainy environments, the rail tracks are affected by corrosion, which also produces cracks. Methods: In the present method, cracks or defects are detected using an echo image display device or semiconductor magnetism sensor devices, which consumes more time. The proposed method enhances the track image using the adaptive histogram equalization technique, and then features such as the Grey Level Co-occurrence Matrix (GLCM) and Local Binary Pattern (LBP) are extracted from the enhanced rail track image. These extracted features are trained and classified using a neural network classifier, which classifies the rail track image as either cracked or non-cracked.
The novelty of this work is the use of a soft computing approach for the detection of cracks in rail tracks. This methodology is trained on several crack images obtained from different environments. This method automatically classifies the current image based on the trained patterns, thus improving the classification accuracy. Findings: The proposed system achieves an accuracy of 94.9% with respect to images in which cracks were manually detected and segmented.
<![CDATA[SimulES-W: A Collaborative Game to Improve Software Engineering Teaching]]>
Abstract: There is empirical evidence concerning the effectiveness and benefits of game-based learning (GBL). Our main interest is to present a tool that can be used to complement the teaching of software engineering in a motivating and didactic way. This paper studies the use of a GBL tool called SimulES-W (Simulation in Software Engineering) to teach Software Engineering in an undergraduate engineering program. SimulES-W has three characteristics: it is based on real software cases, it can be customized during the learning process, and it is a collaborative game. These characteristics are important because they help us understand and propose a new learning scenario, and to research the learning processes in their environments. Accordingly, the first characteristic of SimulES-W makes it a motivating and engaging game that brings up cases usually present only in real software projects. Thanks to the second characteristic, educators can use SimulES-W to customize the education material and tune the game for specific software engineering courses. The third characteristic relates to the proposed game as an activity that involves group discussions and decision-making.
This paper presents SimulES-W, a digital version of SimulES, and reports the results of an evaluation from a pedagogical perspective, in which the game’s adequacy for teaching a subject and its potential positive impact on students’ academic performance are investigated.
<![CDATA[Adaptive Algorithm Based on Renyi’s Entropy for Task Mapping in a Hierarchical Wireless Network-on-Chip Architecture]]>
Abstract: This paper describes the use of Renyi’s entropy as a way to improve the convergence time of the Population-Based Incremental Learning (PBIL) optimization algorithm. As a case study, the algorithm was used in a hierarchical wireless network-on-chip (WiNoC) to perform the optimal task mapping of applications. Two versions of Renyi’s entropy are used and compared to the more traditional Shannon formulation. The obtained results are promising and suggest that Renyi’s entropy may help to reduce the PBIL convergence time without degrading the quality of the solutions found.
<![CDATA[Teletraffic Analysis for VoIP Services in WLAN Systems with Handoff Capabilities]]>
Abstract: In this work, a TDMA-based system is studied for VoIP services with and without mobility. The system is considered to be dedicated to voice-only services, such as a WLAN environment under the PCF mode or a cellular system. In such environments, mobile stations usually move, which introduces the need to study the system when a handoff procedure is enabled. The teletraffic analysis considers call arrivals, call completions, and ON/OFF activity processes for individual VoIP sessions. The VoIP system is simulated in order to verify the accuracy of the analytical results. Since many papers published in the literature consider a fixed number of VoIP users in the system, this analysis can be useful to estimate the system’s performance when the system is in statistical equilibrium.
As an additional feature of this work, a simple fluid model to calculate the number of active and inactive users in a VoIP system with and without handoff capabilities is proposed and developed.
<![CDATA[Continuous Testing and Solutions for Testing Problems in Continuous Delivery: A Systematic Literature Review]]>
Abstract: Continuous Delivery is a software development discipline in which quality software is built so that it can be released into production at any time. However, even though instructions on how to implement it can be found in the literature, it has been challenging to put it into practice. Testing is one of the biggest challenges. On the one hand, several testing problems related to Continuous Delivery are reported in the literature. On the other hand, some sources state that Continuous Testing is the missing element in Continuous Delivery. In this paper, we present a systematic literature review. We look at proposals, techniques, approaches, methods, frameworks, tools and solutions for testing problems. We also attempt to validate whether Continuous Testing is the missing component of Continuous Delivery by analyzing its different definitions and the testing stages and levels in Continuous Delivery. Finally, we look for open issues in Continuous Testing. We found 56 articles, and the results indicate that Continuous Testing is directly related to Continuous Delivery. We also describe how solutions have been proposed to address the testing problems. Lastly, we show that there are still open issues to solve.