Computación y Sistemas

On-line version ISSN 2007-9737; print version ISSN 1405-5546

Comp. y Sist. vol.23 no.3 Ciudad de México Jul./Sep. 2019  Epub 09-Aug-2021

https://doi.org/10.13053/cys-23-3-3282 

Articles of the Thematic Issue

Ontological Knowledge for Rhetorical Move Analysis

Mohammed Alliheedi 1, 3

Robert E. Mercer 2, 3

Sandor Haas-Neill 4

1 Al Baha University, Faculty of Computer Science and Information Technology, Al Baha, Kingdom of Saudi Arabia. mallihee@uwaterloo.ca

2 The University of Western Ontario, Dept. of Computer Science, London, Canada. mercer@csd.uwo.ca

3 University of Waterloo, David R. Cheriton School of Computer Science, Waterloo, ON, Canada.

4 McMaster University, Dept. of Medical Science, Hamilton, ON, Canada. haasneis@mcmaster.ca


Abstract

Scholarly writing in the experimental biomedical sciences follows the IMRaD (Introduction, Methods, Results, and Discussion) structure. Many Biomedical Natural Language Processing tasks take advantage of this structure. Recently, a new challenging information extraction task has been introduced as a means of obtaining these types of detailed information: identifying the argumentation structure in biomedical articles. Argumentation mining can be used to validate scientific claims and experimental methodology, and to plot deeper chains of scientific reasoning. One subtask in identifying the argumentation structure is the identification of rhetorical moves, text segments that are rhetorical and perform specific communicative goals, in the Methods section. Based on a descriptive taxonomy of rhetorical moves structured around IMRaD, the foundational linguistic knowledge needed for a computationally feasible model of the rhetorical moves is described: semantic roles. One goal is to provide FrameNet and VerbNet-like ontologies for the specialized domain of biochemistry. Using the observation that the structure of scholarly writing in the laboratory-based experimental sciences closely follows the laboratory procedures, we focus on the procedural verbs in the Methods section. Occasionally, the text does not contain fillers for all of the semantic role slots that are needed to perform an adequate analysis of a verb. To overcome this problem, an ontology of experimental procedures can be interrogated to provide a most likely candidate for the missing semantic role(s).

Keywords: Ontological knowledge; rhetorical move

1 Introduction

Scientists must routinely review the scholarly literature in their fields to keep abreast of current advances and to retrieve information relevant to their research. However, the volume of online scientific literature is immense and rapidly increasing. In the biomedical field, the National Center for Biotechnology Information (NCBI) developed a literature search engine, PubMed1, to access various databases such as MEDLINE (journal citations and abstracts for biomedical literature), full-text life science e-journals, and online books.

In 2010 PubMed repositories consisted of more than 20 million citations for biomedical literature [23]. By 2019 the number of citations had increased to more than 30 million2. As a consequence, it has become extremely challenging for biomedical scientists to keep current with information in their fields. This challenge has attracted Natural Language Processing (NLP) researchers to develop resources and automated tools for performing various tasks in Information Extraction (IE) and Text Mining (TM) using online corpora of biomedical articles, and thus enable biomedical researchers to better manage and exploit this volume of data [18].

These research activities have led to the development of a new field, Biomedical Natural Language Processing (BioNLP), a collaboration between the biomedical and computational linguistics/artificial intelligence communities [17]. The types of tasks currently handled by BioNLP systems have generally been aimed at extracting very specific and limited information, for example, protein and gene names and relations [11], and so have been able to rely on relatively simple forms of information extraction. BioNLP has adapted various standard information extraction techniques, including both rule-based (e.g., shallow parsing, syntactic pattern-matching) and Machine Learning (e.g., Support Vector Machines, k-nearest neighbour classification method), to address several text-mining tasks, including extracting: protein-protein interactions (PPI) [21], drug-drug interactions (DDI) [28], gene relationships [19], and protein-residue associations [25].

Although these approaches fulfil some information needs, information extraction systems based on them can only recognize and extract minimal and specific information from biomedical texts. But other, more in-depth and comprehensive, information contained in biomedical texts would be highly valuable to scientists because this type of information can enable validating scientific claims, tracing current research directions in their field, reproducing scientific procedures and so forth. Recently, a new and more challenging information extraction task has been introduced as a means of obtaining these types of detailed information: identifying the argumentation structure in biomedical articles (e.g., [15] and [16]). Argumentation mining can be used to validate scientific claims and experimental methodology, and to plot deeper chains of scientific reasoning. Unlike earlier simpler forms of information extraction, here the goal is to identify the structure of argumentative components within an entire text (for example, premises, evidence, conclusions) as well as the relationships between components.

To achieve this goal the text needs to be analyzed. Our approach to this analysis is based on a working hypothesis:

We hypothesize that recognizing and detecting rhetorical moves would provide important information to our argumentation analysis framework, and that the Method sections in biochemistry articles contain moves which can be correlated with the author's experimental procedures. These moves can be used to determine salient information about the elements of the article's argumentative structure (e.g., premises) and can contribute to the overall understanding of the author's scientific claims.

A key aspect of our hypothesis is that development of a frame-based knowledge representation can be based on the semantics of the verbs associated with these procedures. This representation can provide detailed knowledge for understanding these rhetorical moves, which will in turn facilitate analysis of argumentation structure. In other words, we propose that a procedurally rhetorical verb-centric frame semantics can be used to obtain a sufficiently deep analysis of sentence meaning.

While this approach seems straightforward enough, the writing style of biochemistry articles requires the reader to have knowledge about biochemistry and biochemistry laboratory techniques and practices. This paper first gives the semantic roles that can be used in the semantics of each verb. It then gives an example of how an ontology containing knowledge about biochemistry laboratory techniques and practices can be used to fill the semantic roles of verbs that cannot be filled by information in the text.

2 Related Work

Swales [29] proposed the Create-A-Research-Space (CARS) model that uses intuition about the argumentative structure of scientific research articles. Swales defined rhetorical moves as text segments that convey communicative goals. He reviewed the Introduction section in 48 articles from social and natural science and found common rhetorical structures among most of these articles. Swales identified three moves in these articles: establishing a research territory, establishing a niche, and occupying the niche.

However, despite the widespread influence of the CARS model, some researchers observed two problems: (i) the inconsistent assignment of rhetorical moves to text segments because the identification of the rhetorical moves relies on overall text comprehension, and (ii) a lack of empirical validation of moves in linguistic terms [20].

To overcome these problems, Kanoksilapatham [20] advanced Swales' approach to move analysis by developing a framework that combines his original CARS model with the use of Biber's multidimensional analysis [6] to enrich the model with additional information about linguistic characteristics. Biber's multidimensional analysis [6] is concerned with variation in the speaking and writing of English. Multidimensional analysis can be used to identify differences in linguistic characteristics between various text types at different levels of document structure (e.g., genre, internal section level). Although Kanoksilapatham extended Swales' move analysis study and attempted to validate these moves in biochemistry articles, she only provides a descriptive analysis of rhetorical moves without defining an explicit method for analyzing and recognizing these moves in texts.

Liakata et al. [22] developed an annotation scheme called Core Scientific Concepts (CoreSC) to classify sentences into scientific categories (e.g., related to author's other work). The CoreSC scheme consists of three layers: the first includes several categories to classify sentences; the second layer is concerned with properties of these categories; and the third layer creates a link to related instances of the same category. The authors use Machine Learning classifiers (i.e., Conditional Random Fields and Support Vector Machines) to automatically classify sentences into the CoreSC categories. The data set consisted of 265 biochemistry and chemistry articles. The authors were only able to achieve an accuracy of around 50% in assigning sentences to the appropriate CoreSC scientific categories, which is inadequate for such a task.

Green [15] proposed a plan for creating an annotated corpus of biomedical genetics research articles. Green emphasized that this corpus would be beneficial to the argumentation mining community since it would provide a fine-grained annotation of argumentative components. Also, since there are as yet few annotated corpora available, such a corpus would enrich research in the field of Computational Argumentation in general. The author stated that this corpus will be publicly available for further investigation by different research groups in various tasks of argumentation mining.

Green [16] specified a set of argumentation schemes for scientific claims in genetics research articles. The author used a corpus of unannotated genetics research articles, and identified the components (e.g., premises, conclusions) of an argument as well as its type of scheme. Based on the analyses of various genetics research articles, the author specified 10 argumentation schemes that are semantically different. These schemes were new and had not previously been proposed.

Furthermore, the specification of argumentation schemes was used to create annotation guidelines. Then, these guidelines were evaluated in a pilot study based on participants' ability to recognize these schemes by reading the guidelines. Overall, the author's ultimate goal for this initial study was to develop annotation guidelines for creating corpora for argumentation mining research. However, based on the pilot study, the results showed a variation in performance since there were two groups of participants (i.e., undergraduate students and researchers). The students performed poorly in recognizing argumentation schemes while the researchers were able to identify these schemes correctly in most cases.

3 Our Proposed Approach: Rhetorical Moves Mirror Scientific Experimental Procedures

Our intention is to develop a formal knowledge representation based on procedural verbs as a method for argumentation analysis. We introduced Swales' CARS model [29] in Section 2. We hypothesize that recognizing and detecting rhetorical moves would provide additional information to our framework of argumentation analysis. We also hypothesize that the Method sections in biochemistry articles contain moves which can be correlated with the author's experimental procedures. These moves can be used to determine salient information about the elements of the article's argumentative structure (e.g., premises) and can contribute to the overall understanding of the author's scientific claims. A key aspect of our hypothesis is that development of a frame-based knowledge representation can be based on the semantics of the verbs associated with these procedures. This representation can provide detailed knowledge for understanding these rhetorical moves, which will in turn facilitate analysis of argumentation structure. In other words, we propose that procedurally rhetorical verb-centric frame semantics can be used to obtain a deeper analysis of sentence meaning than is currently the case with simple methods of Information Extraction (e.g., shallow syntactic patterns) and in a computationally feasible manner.

Scientific argument3 is defined as a process that scientists follow by using certain procedures to obtain empirical data which will either support or defeat their claims, hence leading to the intended conclusion. The strength of a scientific argument depends on its reproducibility and consistency. For a scientific argument to be strong, a scientist should identify and explain all the procedures in their experiment, i.e., reproducibility, so that another researcher who follows the same procedures will reach the same conclusion, i.e., consistency. Thus, for a well-constructed scientific article, a scientist should expect the same conclusion if she follows the same procedures in the same sequence as described in the Method section.

Scientific writing in the biochemistry domain has certain characteristics that make it ideal for our purposes. In this domain, experimental procedures describe the sequence of actions the biochemist performs to carry out an experiment to derive verifiable scientific conclusions. The experimental procedures themselves can be verified because they are standard procedures described in detail in experimental manuals (e.g., Boyer [7] and Sambrook and Russell [26]). Verbs play an essential role as indicators of these experimental procedures.

These procedures can be viewed as corresponding to the elements of the scientific argumentation structure. For example, when examining a biological substance (e.g., a certain type of bacteria) in order to prove a hypothesis (e.g., this bacteria is correlated with a certain disease) the biochemist would perform a sequence of certain procedures to arrive at a conclusion. Essentially, biochemists create an argumentation framework through the scientific methodology they follow: how they perform their experiments is how they argue. We can observe that this genre (biochemistry articles) is procedure-oriented since the scientific procedures that are described are parallel to the scientific argumentation in the text. For example:

Example 1 "Beads with bound proteins were washed six times (for 10 min under rotation at 4 C) with pulldown buffer and proteins harvested in SDS-sample buffer, separated by SDS-PAGE, and analyzed by autoradiography." [12].

In this example, the verbs "washed", "harvested", "separated", and "analyzed" are used to illustrate the procedure steps in sequential order. Such an experiment can be reproduced if one follows these steps.
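As a rough illustration of this observation, the following Python sketch lists the procedural verbs of Example 1 in the order in which they occur. The lexicon `PROCEDURAL_VERBS` and the function name are assumptions made for this sketch and are not resources described in this paper; a realistic system would rely on a tagger or parser rather than plain string matching.

```python
import re

# Hypothetical mini-lexicon of procedural verbs (an assumption for this sketch).
PROCEDURAL_VERBS = {"washed", "harvested", "separated", "analyzed"}

def procedural_verbs_in_order(sentence: str) -> list[str]:
    """Return the lexicon verbs in the order they appear in the sentence."""
    tokens = re.findall(r"[A-Za-z]+", sentence.lower())
    return [tok for tok in tokens if tok in PROCEDURAL_VERBS]

example_1 = (
    "Beads with bound proteins were washed six times (for 10 min under "
    "rotation at 4 C) with pulldown buffer and proteins harvested in "
    "SDS-sample buffer, separated by SDS-PAGE, and analyzed by autoradiography."
)

steps = procedural_verbs_in_order(example_1)
print(steps)
```

The resulting list preserves textual order only; as discussed later for Table 4, textual order need not match the order of events in the laboratory.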

Fillmore [13] introduced the notion of frame semantics as a theory of meaning. A semantic frame is defined as "any coherent individuatable perception, memory, experience, action or object" by Fillmore [14]. In other words, coherently structured concepts that are related to each other represent a complete knowledge of world events or experiences. For example, to understand the word "buy", one would access the knowledge contained in the commercial transaction frame, which includes elements such as the person who buys the goods (buyer), the goods that are being sold (goods), the person who sells the goods (seller), and the currency that the buyer and seller agree on (money).
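The commercial transaction frame can be pictured as a small data structure with named slots. The sketch below is only an illustration: the `Frame` class, its method, and the filler values ("Kim", "a book", "Lee") are invented for this example and do not come from FrameNet or from this paper.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Frame:
    """A frame as a name plus role slots; None marks an unfilled slot."""
    name: str
    roles: dict[str, Optional[str]] = field(default_factory=dict)

    def unfilled(self) -> list[str]:
        """Return the roles that have no filler yet."""
        return [role for role, filler in self.roles.items() if filler is None]

# Invented example sentence: "Kim bought a book from Lee." (no price is stated)
buy = Frame(
    name="commercial transaction",
    roles={"buyer": "Kim", "goods": "a book", "seller": "Lee", "money": None},
)

print(buy.unfilled())
```

An unfilled slot such as `money` is the kind of gap that, in the procedural setting discussed later, background knowledge is expected to fill.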

Following Fillmore's theory of frame semantics, FrameNet [5] was developed to create an online lexical resource for English. This framework includes more than 170,000 manually annotated sentences and 10,000 words. The computational linguistics community has been attracted to the concept of frame semantics and has developed computational resources using this concept, such as VerbNet [27], an on-line verb lexicon for English, and PropBank [24], an annotated corpus with basic semantic propositions.

Following the notion of frame semantics, we propose to build a knowledge representation framework to analyze verbs in a procedure-oriented genre. Our concept of procedurally rhetorical verb-centric frame semantics is intended to address this gap by developing a computationally feasible knowledge representation that will enable argumentation analysis.

The knowledge contained in the frame semantics will facilitate the extraction of elements of arguments, i.e., argumentation mining. To reiterate, our hypothesis is that procedurally rhetorical verb-centric frame semantics can provide a knowledge representation framework for analyzing and representing the meanings of the verbs used in biochemistry articles. In turn, these frames will facilitate the identification of argumentation structure in the discourse describing experimental procedures.

4 Ontological Knowledge Sources

To provide the knowledge required to achieve the rhetorical move analysis discussed in the previous section, we propose two sources organized as ontologies. An ontology, as used here, is composed of the concepts and the relations between them. We discuss two ontologies below. The first, semantic roles, represents the knowledge about verbs that we argue is needed to analyze rhetorical moves. This information is organized in VerbNet-like [27] verb frames. The second knowledge source is composed of information about experimental procedures in the biochemistry domain. This information is organized in the familiar graph-based web of objects, classes of objects, and relations among these.

4.1 Semantic Roles

As described earlier, our experimental event scheme was inspired by the annotation scheme for bio-events [30]. We based our experimental event scheme for verb arguments on the inventory of semantic roles in VerbNet [27] and modified and added new semantic roles to define our scheme. Our experimental event scheme includes: Theme, Patient, Predicate, Agent, Location, Goal, etc. The complete set of semantic roles and their definitions in our experimental event scheme is presented in Table 2.

Table 1 Rhetorical Moves in the Method Section of Biochemistry Articles (from [20])

Move type | Definition
Description-of-method | Concerned with sentences that describe experimental events.
Appeal-to-authority | Concerned with sentences that discuss the use of well-established methods.
Background information | Concerned with all background information for the experimental events such as “method justification, comment, or observation, exclusion of data, approval of use of human tissue” as defined by Kanoksilapatham (2003).
Source-of-materials | Concerned with the use of certain biological materials in the experimental events.

Table 2 Semantic Roles in the Annotation Scheme of our Experimental Event 

Semantic role | Definition
Agent | Generally a human or an animate subject.
Patient | Participants that have undergone a process.
Theme | Participants in a location or undergoing a change of location.
Goal:
Physical | Identifies a thing toward which an action is directed or a place to which something moves.
Purpose | Identifies the stated purpose in a sentence for doing certain actions.
Factitive | A referent that results from the action or state identified by a verb.
Location | The physical place where the experiments took place.
Protocol-Detail:
Time | Identifies the time or a duration of an experimental process.
Temperature | Identifies the temperature of an experimental process.
Condition | Identifies the condition of how an experimental process is performed.
Repetition | Identifies the number of times an experimental process is repeated.
Buffer | Identifies the buffer that was used in an experimental process.
Cofactor | Identifies the cofactor that was used in an experimental process.
Instrument:
Change | Describes objects (or forces) that come in contact with an object and cause some change.
Measure | Describes an object or protocol that can measure another object(s).
Observe | Describes an object which can be used to observe another object(s).
Maintain | Describes an object or protocol which can be used to maintain the state of object(s).
Catalyst | Describes an object that can be used as a catalytic “facilitator” for an experimental event to occur.
Reference | Refers to a method or protocol that is being used.
Mathematical | Describes a mathematical or computational instrument.

We have extended the VerbNet definition of the semantic role Instrument from simply describing "an object or force that comes in contact with an object and causes some change in them" [27] to include a variety of subcategories that correspond to various types of biological and man-made instruments that are used in a biochemistry laboratory. The new semantic roles (with example text in boldface) are:

  1. Instruments used to change the state of an object. For example:

    Example 2 "Beads with bound proteins were washed six times (for 10 min under rotation at 4 C) with pulldown buffer…"[12].

    In this example, the pulldown buffer was used to wash (change the state of) the Beads with bound proteins. In this instance, the phrase "pulldown buffer" should be labeled as instrument (change).

  2. Instruments used to maintain the state of an object. For example:

    Example 3 "Once the samples were in EPR tubes, they were immediately frozen in liquid nitrogen, and stored in liquid nitrogen before using." [10].

    In this example, the liquid nitrogen was used to store (maintain the condition of) the samples which were in the EPR tubes. In this case, the phrase "liquid nitrogen" should be labeled as instrument (maintain).

  3. Instruments used to observe an object. For example:

    Example 4 The mitochondria were observed by spinning disk confocal microscopy.

    Spinning disk confocal microscopy is used to observe the mitochondria. We should label the phrase "spinning disk confocal microscopy" as instrument (observe).

  4. Instruments used as a catalyst for an experimental process to occur. For example:

    Example 5 "The ca. 900 bp PCR products were digested with NdeI and HindIII and ligated into pUC19." [9].

    In this example, NdeI and HindIII are enzymes used to facilitate the digestion (cutting) of the ca. (approximately) 900 bp PCR products. In this instance, the phrase "NdeI and HindIII" should be labeled as instrument (catalyst).

  5. Instruments used to measure an object. For example:

    Example 6 "Beads with bound proteins were washed six times (for 10 min under rotation at 4 C) with pulldown buffer and proteins harvested in SDS-sample buffer, separated by SDS-PAGE, and analyzed by autoradiography." [12].

    In this example, autoradiography was used to analyze (measure) the proteins. Here, the word "autoradiography" should be labeled as instrument (measure).

  6. Mathematical or computational instruments (e.g., simulation, algorithm, equation, and the use of software). For example:

    Example 7 "Simulations of these EPR spectra were accomplished with the computer program QPOWA [30, 31]." [10].

    The computer program QPOWA was used here as a computational instrument to perform simulations of the EPR spectra mentioned above. So, the phrase "the computer program QPOWA [30, 31]" should be labeled as instrument (computational instrument).

  7. Finally, references to a method or protocol that is being used. For example:

    Example 8 "The preparation of authentic vaccinia H5R protein and recombinant B1R protein kinase were as previously described [11]." [8]

    The phrase "as previously described [11]" indicates that the authors are referring to another method that they used in their current experimental process. We should label the phrase "as previously described [11]" as instrument (reference).

These sub-categories of the semantic role Instrument are not necessarily limited to the types mentioned above. However, based on our full-text analysis, this set of instrument types is the most comprehensive we have achieved to date. We will add or update these sub-categories if we encounter a new type (usage) of instrument.
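The Instrument sub-categories above can be written down as an enumeration, with the phrases from Examples 2 to 8 as labeled instances. This is a minimal sketch: the names `InstrumentType`, `LABELED_PHRASES`, and `label` are assumptions of this illustration, and the member `MATHEMATICAL` follows the name used in Table 2 for what Example 7 calls a computational instrument.

```python
from enum import Enum

class InstrumentType(Enum):
    """Instrument sub-categories as listed in Table 2."""
    CHANGE = "change"
    MEASURE = "measure"
    OBSERVE = "observe"
    MAINTAIN = "maintain"
    CATALYST = "catalyst"
    REFERENCE = "reference"
    MATHEMATICAL = "mathematical"

# Phrases and labels taken from Examples 2-8; the lookup table itself is hypothetical.
LABELED_PHRASES = {
    "pulldown buffer": InstrumentType.CHANGE,
    "liquid nitrogen": InstrumentType.MAINTAIN,
    "spinning disk confocal microscopy": InstrumentType.OBSERVE,
    "NdeI and HindIII": InstrumentType.CATALYST,
    "autoradiography": InstrumentType.MEASURE,
    "the computer program QPOWA [30, 31]": InstrumentType.MATHEMATICAL,
    "as previously described [11]": InstrumentType.REFERENCE,
}

def label(phrase: str) -> str:
    """Format a phrase's annotation in the 'instrument (subtype)' style used in the text."""
    return f"instrument ({LABELED_PHRASES[phrase].value})"

print(label("liquid nitrogen"))
```

Keeping the sub-categories in one enumeration makes it straightforward to add a new type if a new usage of instrument is encountered.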

We have also proposed a new semantic role protocol detail that identifies certain types of information about experimental processes. These new subcategories (with example text in boldface) are:

  1. Time or the duration of a process [27]. For example:

    Example 9 "Beads with bound proteins were washed six times (for 10 min under rotation at 4 C) with pulldown buffer…" [12].

  2. Temperature of an experimental process. For example:

    Example 10 "Beads with bound proteins were washed six times (for 10 min under rotation at 4 C) with pulldown buffer…" [12].

  3. Condition or manner in which an experimental process was carried out. For example:

    Example 11 "Beads with bound proteins were washed six times (for 10 min under rotation at 4 C) with pulldown buffer…" [12].

  4. Buffer which is "a solution containing either a weak acid and a conjugate base or a weak base and a conjugate acid, used to stabilize the pH of a liquid upon dilution."4 For example:

    Example 12 "For phosphorylation, three identical reactions contained H5R protein (70 pmol), B1R protein kinase (90 μl), Tris-HCl, pH 7.4 (20 mM), magnesium chloride (5 mM), ATP (50 μM), [γ-32P] ATP (50 μCi) and dithiothreitol (2 mM) in a total volume of 500 μl" [8].

  5. Cofactor is defined as "substances that are required for, or increase the rate of, catalysis."5 For example:

    Example 13 "For phosphorylation, three identical reactions contained H5R protein (70 pmol), B1R protein kinase (90 μl), Tris-HCl, pH 7.4 (20 mM), magnesium chloride (5 mM), ATP (50 μM), [γ-32P] ATP (50 μCi) and dithiothreitol (2 mM) in a total volume of 500 μl" [8].

  6. Repetition of a step in experimental processes. For example:

    Example 14 "Beads with bound proteins were washed six times (for 10 min under rotation at 4 C) with pulldown buffer…" [12].
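To make the Protocol-Detail sub-categories concrete, the sketch below pulls the Time, Temperature, Condition, and Repetition fillers out of the sentence shared by Examples 9, 10, 11, and 14. The regular expressions are assumptions tuned to this one sentence; they are not the annotation method of this paper and would need broadening for real text.

```python
import re

# Hypothetical surface patterns, one per Protocol-Detail sub-category.
PATTERNS = {
    "Repetition": re.compile(r"\b(\w+ times)\b"),
    "Time": re.compile(r"\bfor (\d+ (?:min|h|s))\b"),
    "Condition": re.compile(r"\b(under \w+)\b"),
    "Temperature": re.compile(r"\bat (\d+ ?°?C)\b"),
}

def extract_protocol_details(sentence: str) -> dict[str, str]:
    """Return the first match of each pattern, keyed by sub-category."""
    found = {}
    for role, pattern in PATTERNS.items():
        match = pattern.search(sentence)
        if match:
            found[role] = match.group(1)
    return found

sentence = (
    "Beads with bound proteins were washed six times "
    "(for 10 min under rotation at 4 C) with pulldown buffer"
)

details = extract_protocol_details(sentence)
print(details)
```

Buffer and Cofactor are left out on purpose: recognizing that Tris-HCl is a buffer or that magnesium chloride is a cofactor takes domain knowledge rather than a surface pattern.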

With these semantic roles we are able to provide the frames for procedural verbs. To illustrate, Fig. 1 contains the frame for the verb digest.

Fig. 1 The verb frame for the verb digest 

4.2 An Ontology of Biochemical Techniques and Laboratory Practices

Knowledge about how experiments are carried out in a biochemistry laboratory is absolutely essential to the understanding of much of the text found in biochemistry articles. We needed assistance from a biochemist to understand many of the sentences that are present in our corpus. With this in mind we have developed an ontology prototype to assist with a computational approach to analyzing the sentences found in the Methods section of a biochemistry article. Details of this prototype ontology are described elsewhere [3].

An example of a procedure, Alkaline Agarose Gel Electrophoresis, is given in text format in Fig. 2. This is a common procedure used to isolate the biological substance that is used in future procedures from the other substances found in the solution that results from the previous procedures. The knowledge about how this electrophoresis procedure is carried out has been implemented in the prototype ontology. Why this knowledge is important is discussed in the following section.

Fig. 2 Alkaline Agarose Gel Electrophoresis Ontology 

5 A Manual Annotation of a Portion of a Method Section

We have selected three articles from our corpus randomly to manually analyze and extract steps in experimental procedures (processes) from the method section. Table 3 shows some sentences from one of these articles [9]. The purpose of this analysis is to identify the semantic roles of experimental processes and the semantic frames of procedural verbs that occurred in these processes. Also, we want to demonstrate the usefulness of our approach by mapping the knowledge of frame semantics and the ontological knowledge to rhetorical moves.

Table 3 Some sentences from the article Biochem-3-_-77373 [9] 

No. | Sentence
1 | The over-expression plasmid for L1, pUB5832, was digested with NdeI and HindIII, and the resulting ca. 900 bp piece was gel purified and ligated using T4 ligase into pUC19, which was also digested with NdeI and HindIII, to yield the cloning plasmid pL1PUC19.
2 | Mutations were introduced into the L1 gene by using the overlap extension method of Ho et al. [60], as described previously [68].
3 | The oligonucleotides used for the preparation of the mutants are shown in Table 1.1.

The sentences in Table 3 are three contiguous sentences in a biochemistry article. They discuss cutting a DNA piece from a plasmid, which is "a small circular and double-stranded DNA molecule that is distinct from a cell's chromosomal DNA",6 and ligating (attaching) that piece to another plasmid to produce the desired protein. Table 4 shows five events from the sentences in Table 3. Events 1, 2, 3, and 4, which are illustrated in Fig. 3, are extracted from Sentence No. 1, and Sentence No. 2 contains only Event 5. Sentence No. 3 contains no actual experimental event: it simply refers to a table in the article's prior text. Each event in Table 4 represents one complete experimental procedure. Also, the actual sequence of experimental events in the lab does not necessarily follow the sequence in which these events appear in the text. Another important aspect to note is that not all the essential information about experimental processes is found in the text; some information is only implied. However, these implied pieces of information can be inferred from an ontology of standard biochemistry procedures, some of which we have developed. Taking a look at Events 1-4 in Table 4:

  1. Digestion of pUB5832: a 900 bp piece was cut out using two restriction enzymes (NdeI and HindIII).

  2. Then, the gel purification of the 900 bp piece: gel electrophoresis was used in this purification step. This is implied information derived from the ontology.

  3. At any time before Event 4, the digestion of pUC19 happens. This could happen before, after, between, or during Events 1 and 2.

  4. After Events 1, 2, and 3, ligation of the 900 bp piece into pUC19 occurs.
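The ordering constraints in the list above form a partial order rather than a single sequence. The following sketch, which uses the standard-library `graphlib` module, encodes each event with the events that must precede it and enumerates the strictly sequential orderings that respect those constraints. The event numbering follows Table 4; the variable and function names are assumptions of this sketch, and overlap in time (Event 3 happening during Events 1 and 2) is not modeled.

```python
from graphlib import TopologicalSorter
from itertools import permutations

# Each event maps to the set of events that must be completed before it.
PREDECESSORS = {
    1: set(),        # digestion of pUB5832
    2: {1},          # gel purification of the 900 bp piece
    3: set(),        # digestion of pUC19, any time before Event 4
    4: {1, 2, 3},    # ligation of the 900 bp piece into pUC19
}

def respects_constraints(order: tuple[int, ...]) -> bool:
    """Check that every event appears after all of its predecessors."""
    position = {event: i for i, event in enumerate(order)}
    return all(
        position[pred] < position[event]
        for event, preds in PREDECESSORS.items()
        for pred in preds
    )

# One admissible ordering, as produced by a topological sort.
one_order = tuple(TopologicalSorter(PREDECESSORS).static_order())

# All admissible sequential orderings of the four events.
valid_orders = [p for p in permutations(PREDECESSORS) if respects_constraints(p)]
print(len(valid_orders))
```

Three sequential orderings are admissible, differing only in where Event 3 falls relative to Events 1 and 2, which is why the text alone cannot settle the sequence.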

Table 4 Extracted events from two sentences in the article Biochem-3-_-77373 [9] 

Event 1 (Sentence No. 1)
  Patient: The over-expression plasmid for L1, pUB5832
  Predicate: digested
  Instrument (catalyst): NdeI and HindIII

Event 2 (Sentence No. 1)
  Patient: the resulting ca. 900 bp piece
  Predicate: gel purified
  Instrument (catalyst): Gel electrophoresis

Event 3 (Sentence No. 1)
  Patient: pUC19
  Predicate: digested
  Instrument (catalyst): NdeI and HindIII

Event 4 (Sentence No. 1)
  Patient: the resulting ca. 900 bp piece
  Predicate: ligated
  Instrument (catalyst): using T4 ligase
  Goal: into pUC19

Event 5 (Sentence No. 2)
  Patient: the L1 gene
  Predicate: introduced (mutated)
  Instrument (reference type): using the overlap extension method of Ho et al.

Sentence No. 3 does not contain experimental events.

Fig. 3 A sequence of Events 1, 2, 3, and 4 from Sentence No. 1

Much information can be derived from the text using knowledge about the verbs. As described earlier, the semantic roles of each verb together with syntactic information allow this information to be extracted from the text. Table 4 shows the extracted information. However, this is not enough to understand the information provided in the text.
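The extracted information in Table 4 can be pictured as one frame per event. The following Python sketch is only an illustration under stated assumptions: the `EventFrame` class, the `events_needing_ontology` helper, and the use of `None` for an unstated role are hypothetical choices made here, not the representation used in our framework. The fillers are copied from Table 4.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EventFrame:
    """One experimental event: a predicate plus its semantic role fillers."""

    sentence: int
    predicate: str
    # Role name -> filler text; None marks a role the text leaves unstated.
    roles: Dict[str, Optional[str]] = field(default_factory=dict)

    def missing_roles(self) -> List[str]:
        """Names of roles that still lack a filler."""
        return sorted(name for name, filler in self.roles.items() if filler is None)


# Fillers copied from Table 4. The Instrument of Event 2 is None because the
# text does not state it; Table 4 lists the value supplied by the ontology.
EVENTS = {
    1: EventFrame(1, "digested", {
        "Patient": "The over-expression plasmid for L1, pUB5832",
        "Instrument": "NdeI and HindIII"}),
    2: EventFrame(1, "gel purified", {
        "Patient": "the resulting ca. 900 bp piece",
        "Instrument": None}),
    3: EventFrame(1, "digested", {
        "Patient": "pUC19",
        "Instrument": "NdeI and HindIII"}),
    4: EventFrame(1, "ligated", {
        "Patient": "the resulting ca. 900 bp piece",
        "Instrument": "using T4 ligase",
        "Goal": "into pUC19"}),
    5: EventFrame(2, "introduced (mutated)", {
        "Patient": "the L1 gene",
        "Instrument": "using the overlap extension method of Ho et al."}),
}


def events_needing_ontology(events: Dict[int, EventFrame]) -> List[int]:
    """Event numbers with at least one role the text does not fill."""
    return [number for number in sorted(events) if events[number].missing_roles()]


if __name__ == "__main__":
    print(events_needing_ontology(EVENTS))
```

In this encoding only Event 2 has an unfilled role, which is the gap that the ontology is consulted to fill.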

A proper interpretation of the description of events in Sentence No. 1 cannot be completely derived from the text alone. An understanding of laboratory practice together with knowledge of what is involved in performing plasmid digestion, purification, and ligation is required. Some of the event sequencing can be derived from the text. For instance, the pragmatics of the conjunction "and" usually indicates that the second conjunct follows temporally after the first conjunct has completed. The phrase "the resulting" is also a key linguistic clue for determining this sequence. But determining when the third event happens requires knowledge of biochemistry and laboratory practice as well as knowledge of the complete method. The linguistic information provided by the use of a relative clause does not enable a complete understanding of this event, so the ontology is needed to supply the information for a proper interpretation. Another important aspect of the text is that all of the referents are described by singular nouns. However, knowing the biological processes that are carried out in the laboratory is important: solutions containing large numbers of the biological elements are used. Hence, one is not dealing with a single plasmid or a single piece from the plasmid. When the digestion occurs, all of the pieces from the plasmids are in the solution, including plasmids that were not digested. This is why the gel purification step, which separates the various biological elements, is needed.

An example of inferring implied information from the ontology can be given. Event 2 in Table 4 is gel purification. What is used to perform this task is not given in the text. The following SPARQL query extracts some domain knowledge about the experimental procedure of Alkaline Agarose Gel Electrophoresis from our framework, providing the missing instrument semantic role information.

Figure 4 shows all of the instruments involved in any state for all steps of the Alkaline Agarose Gel Electrophoresis procedure. Using this information and knowledge about the steps in the procedure, the instrument gel electrophoresis can be inferred.

Fig. 4 Result of Query1: Extract all devices involved in all steps of the Alkaline Agarose Gel Electrophoresis procedure 

SPARQL Query

Query1. Return all devices involved in a state of all steps (1.1, 1.2, 3)

SELECT ?step ?state ?item
WHERE { ?step rdf:type :Step .
        ?step :hasState ?state .
        ?state :involves ?item .
        ?item rdf:type :Device }
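To make the graph pattern in Query1 concrete, the following Python sketch evaluates the same four triple patterns over a tiny in-memory set of triples. It is a sketch under stated assumptions: every identifier in `TRIPLES` (the steps, states, items, and the `:Material` class) is a hypothetical placeholder rather than content of our ontology, and the nested loops only imitate the join that a real SPARQL engine would perform.

```python
# Toy triples (subject, predicate, object). All identifiers here are
# hypothetical placeholders, not content of the authors' ontology.
TRIPLES = {
    ("step1", "rdf:type", ":Step"),
    ("step1", ":hasState", "state1"),
    ("state1", ":involves", "gel_tank"),
    ("state1", ":involves", "agarose"),
    ("gel_tank", "rdf:type", ":Device"),
    ("agarose", "rdf:type", ":Material"),
    ("step2", "rdf:type", ":Step"),
    ("step2", ":hasState", "state2"),
    ("state2", ":involves", "power_supply"),
    ("power_supply", "rdf:type", ":Device"),
}


def objects(triples, subject, predicate):
    """Sorted objects o such that (subject, predicate, o) is in triples."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def devices_per_step(triples):
    """Return (step, state, item) rows matching the four patterns of Query1."""
    rows = []
    # ?step rdf:type :Step
    steps = sorted(s for s, p, o in triples if p == "rdf:type" and o == ":Step")
    for step in steps:
        # ?step :hasState ?state
        for state in objects(triples, step, ":hasState"):
            # ?state :involves ?item
            for item in objects(triples, state, ":involves"):
                # ?item rdf:type :Device
                if (item, "rdf:type", ":Device") in triples:
                    rows.append((step, state, item))
    return rows


if __name__ == "__main__":
    for row in devices_per_step(TRIPLES):
        print(row)
```

In the toy data the item typed as `:Material` is filtered out by the last pattern, so only items typed as `:Device` are returned, one row per step, state, and device.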

6 Conclusions and Future Work

In this research we have provided prototypes for two ontologies of the biochemistry domain. The first ontology, procedurally rhetorical frame semantics, provides semantic roles for procedural verbs. The second ontology provides information about biochemical techniques. This ontology can be used to give information that does not appear in the scientific article text. To the best of our knowledge, no research has proposed or incorporated the idea of a semantic frame based on verb analysis to assist in the analysis of argumentation in biochemistry articles. Nor has any attempt been made to build an ontology of biochemical techniques and laboratory practices.

Our future goal is an in-depth argumentation analysis of biochemistry articles. Access to the rhetorical moves that have been extracted using the two ontologies will make possible a computationally feasible technique for argumentation mining of more-detailed scientific knowledge than is currently available. This will be an important step towards providing researchers in Computational Argumentation who work in domains with similar discourse structure with a means of using and evaluating the metrics we will develop. We have begun conducting an annotation study for both semantic roles [1] and rhetorical moves [2]. In addition, we have built a prototype ontology that we described in other work [3].

The SPARQL Query and Fig. 4 show the power of using the ontological knowledge to obtain relevant information about specific experimental processes.7 We have also developed a set of frames for frequent procedural verbs (e.g., "digest") in our analyzed data set. Our aim is to extend the VerbNet project by providing syntactic and semantic information for these procedural verbs. Further details can be found in the first author's PhD thesis [4].

References

1. Alliheedi, M. & Mercer, R. E. (2019). Semantic roles: Towards rhetorical moves in writing about experimental procedures. Proceedings of the 32nd Canadian Conference on Artificial Intelligence, pp. 518-524.

2. Alliheedi, M., Mercer, R. E., & Cohen, R. (2019). Annotation of rhetorical moves in biochemistry articles. Proceedings of the 6th Workshop on Argument Mining, pp. 113-123.

3. Alliheedi, M., Wang, Y., & Mercer, R. E. (2019). Biochemistry procedure-oriented ontology: A case study. Proceedings of the 11th International Conference on Knowledge Engineering and Ontology Development, To appear in Science and Technology Publications (SCITEPRESS), pp.

4. Alliheedi, Mohammed (2019). Procedurally Rhetorical Verb-Centric Frame Semantics as a Knowledge Representation for Argumentation Analysis of Biochemistry Articles. Ph.D. thesis, University of Waterloo.

5. Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet project. Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, pp. 86-90.

6. Biber, D. (1991). Variation Across Speech and Writing. Cambridge University Press.

7. Boyer, R. F. (2012). Biochemistry Laboratory: Modern Theory and Techniques. Prentice Hall.

8. Brown, N. G., Morrice, D. N., Beaud, G., Hardie, G., & Leader, D. P. (2000). Identification of sites phosphorylated by the vaccinia virus B1R kinase in viral protein H5R. BMC Biochemistry, Vol. 1, No. 2.

9. Carenbauer, A. L., Garrity, J. D., Periyannan, G., Yates, R. B., & Crowder, M. W. (2002). Probing substrate binding to Metallo-β-Lactamase L1 from Stenotrophomonas maltophilia by using site-directed mutagenesis. BMC Biochemistry, Vol. 3, No. 4.

10. Chen, W. & Guidotti, G. (2001). The metal coordination of sCD39 during ATP hydrolysis. BMC Biochemistry, Vol. 2, No. 9.

11. Cohen, K. B. & Demner-Fushman, D. (2014). Biomedical Natural Language Processing. Natural Language Processing 11. John Benjamins Publishing Company.

12. Ester, C. & Uetz, P. (2008). The FF domains of yeast U1 snRNP protein Prp40 mediate interactions with Luc7 and Snu71. BMC Biochemistry, Vol. 9, No. 29.

13. Fillmore, C. J. (1976). Frame semantics and the nature of language. Annals of the New York Academy of Sciences, Vol. 280, No. 1, pp. 20-32.

14. Fillmore, C. J. (1977). Topics in lexical semantics. In Cole, R. W., editor, Current Issues in Linguistic Theory. Indiana University Press, pp. 76-138.

15. Green, N. (2014). Towards creation of a corpus for argumentation mining the biomedical genetics research literature. Proceedings of the First Workshop on Argumentation Mining, pp. 11-18.

16. Green, N. (2015). Identifying argumentation schemes in genetics research articles. Proceedings of the 2nd Workshop on Argumentation Mining, pp. 12-21.

17. Huang, C.-C. & Lu, Z. (2016). Community challenges in biomedical text mining over 10 years: Success, failure and the future. Briefings in Bioinformatics, Vol. 17, No. 1, pp. 132-144.

18. Hunter, L. & Cohen, K. B. (2006). Biomedical language processing: What's beyond PubMed? Molecular Cell, Vol. 21, No. 5, pp. 589-594.

19. Hur, J., Özgür, A., Xiang, Z., & He, Y. (2012). Identification of fever and vaccine-associated gene interaction networks using ontology-based literature mining. Journal of Biomedical Semantics, Vol. 3, No. 18.

20. Kanoksilapatham, B. (2005). A corpus-based investigation of scientific research articles: Linking move analysis with multidimensional analysis. Ph.D. thesis, Georgetown University.

21. Krallinger, M., Leitner, F., Rodriguez-Penagos, C., & Valencia, A. (2008). Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biology, Vol. 9, No. 2, pp. S4.

22. Liakata, M., Saha, S., Dobnik, S., Batchelor, C., & Rebholz-Schuhmann, D. (2012). Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics, Vol. 28, No. 7, pp. 991-1000.

23. Lu, Z. (2011). PubMed and beyond: A survey of web tools for searching biomedical literature. Database, Vol. 2011.

24. Palmer, M., Gildea, D., & Kingsbury, P. (2005). The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, Vol. 31, No. 1, pp. 71-106.

25. Ravikumar, K., Liu, H., Cohn, J. D., Wall, M. E., & Verspoor, K. (2012). Literature mining of protein-residue associations with graph rules learned through distant supervision. Journal of Biomedical Semantics, Vol. 3, No. 3, pp. S2.

26. Sambrook, J. & Russell, D. W. (2001). Molecular Cloning: A Laboratory Manual. Cold Spring Harbor Laboratory Press.

27. Schuler, K. K. (2005). VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. thesis, University of Pennsylvania.

28. Segura-Bedmar, I., Martinez, P., & de Pablo-Sanchez, C. (2010). Extracting drug-drug interactions from biomedical texts. BMC Bioinformatics, Vol. 11(Suppl5), No. P9.

29. Swales, J. (1990). Genre Analysis: English in Academic and Research Settings. Cambridge University Press.

30. Thompson, P., Nawaz, R., McNaught, J., & Ananiadou, S. (2011). Enriching a biomedical event corpus with meta-knowledge annotation. BMC Bioinformatics, Vol. 12, No. 393.

4 Buffer - Biology-Online Dictionary. (n.d.). Retrieved September 23, 2017, from http://www.biologyonline.org/dictionary/Buffer.

5 coenzymes and cofactors. (n.d.). Retrieved September 23, 2017, from http://academic.brooklyn.cuny.edu/biology/bio4fv/page/coenzy_.htm.

6 plasmid / plasmids — Learn Science at Scitable. (n.d.). Retrieved December 22, 2017, from https://www.nature.com/scitable/definition/plasmid-plasmids-28.

7 For additional detail about this prototype ontology, please see [3].

Received: January 29, 2019; Accepted: March 04, 2019

* Corresponding author is Mohammed Alliheedi. mallihee@uwaterloo.ca

This is an open-access article distributed under the terms of the Creative Commons Attribution License