1 Introduction
In recent years, the number of LLMs available to the public has grown considerably. Among such models we find the Pathways Language Model (PaLM) [4], BLOOM [28], Bidirectional Encoder Representations from Transformers (BERT) [6], BART [14], the Robustly Optimized BERT Pretraining Approach (RoBERTa) [15], the Generative Pre-trained Transformer (GPT) [23], GPT-2 [24], GPT-3 [3], and, more recently, ChatGPT and GPT-4 [22].
Although these models achieve great performance in text generation, they raise ethical issues that need to be addressed, for instance: the lack of validation for the information they produce, since they suffer from hallucinations and may return incorrect information [13]; the creation of fake news [34, 3]; and academic cheating [1].
Since these models are publicly available, there is a high risk of their misuse; therefore, algorithms and systems capable of differentiating human-generated text from text generated by LLMs are required. To this end, both LLMs and traditional Machine Learning (ML) methods have been tested on this classification task. Several LLMs have been leveraged for it (including BERT, RoBERTa, and GPT-2), while traditional ML approaches have been used less often (see Section 2).
In this paper, we present our approach to the human vs. machine-generated text classification problem, using the AuTexTification dataset [26]. This dataset is divided into two subtasks: 1) human vs. machine-generated text detection and 2) model attribution, i.e., determining which model created a given machine-generated sequence.
Our proposal includes a preprocessing stage, common in traditional NLP tasks, consisting of tokenization, multi-word token expansion, part-of-speech tagging and lemmatization.
This stage is followed by Logistic Regression and Support Vector Machine (SVM) models with 10-fold cross validation. Due to the particularities of both subtasks, special attention was paid to the feature extraction stage (prior to providing inputs to the models), creating combinations of unigrams, bigrams and trigrams of different nature, from PoS tags to lemmas.
The rest of the paper is structured as follows: Section 2 provides a literature review of the present task, highlighting the most relevant research. Section 3 introduces the dataset, the preprocessing and the models employed in our study.
Section 4 presents the empirical results of our experiments, along with a discussion of their meaning. Lastly, Section 5 offers the conclusions drawn from our results and observations.
2 Related Work
The task of detecting machine-generated text is closely related to, and a direct consequence of, the task of text generation, whose earliest systems were developed by authors such as Mann and McKeown [17, 18].
However, only recently has the problem of detecting whether a given text was machine-generated, or of identifying the model that produced it, become challenging even for humans [5]. Most efforts to detect or classify generated content apply Large Language Models such as BERT [10, 32], DistilBERT [21], RoBERTa [30, 7, 31, 12, 25, 9, 16, 19, 35], and others [2].
Classical approaches to detecting generated text have also been attempted. Solaiman et al. [30] utilized classical machine learning classifiers, such as logistic regression, to identify content generated by GPT-2, finding that they do not perform worse than language models.
Unfortunately, this line of work has not been widely pursued in the subsequent literature. Statistical criteria have also been considered to address the task of detecting generated text [8, 20], as have feature-extraction approaches [33, 29].
In the AuTexTification task [27], a preliminary evaluation of the detection of generated English text, logistic regression achieved an F1-macro score of 0.6578, surpassing the language model DeBERTa V3 (0.571), whereas in the model attribution task logistic regression reached a score of 0.3998, while DeBERTa V3 reached 0.6042.
3 Methodology
3.1 Dataset Description
The AuTexTification dataset [26] consists of data for two different subtasks: human vs. machine-generated text detection and model attribution. The dataset contains text produced either by humans or by one of the following models: BLOOM-1b7, BLOOM-3b, BLOOM-7b1, GPT-3 Babbage, GPT-3 Curie and GPT-3 DaVinci-003.
The training dataset for the first subtask consists of 33,845 text samples labeled as either machine-generated or human; for the second subtask, the training set includes 22,416 machine-generated text samples, each labeled with one of the letters A to F, corresponding to the model that generated the text.
The maximum text length is 98 and 97 words for subtasks one and two, respectively. Subtask 1 has a fixed testing set of 21,832 samples, whereas the testing set of subtask 2 has 5,605 samples. AuTexTification provides separate datasets for English and Spanish, but our study focuses only on English. Figure 1 shows the structure of the dataset, highlighting in red the data used in this paper.
Fig. 1
The AuTexTification subtasks and their corresponding datasets for each language
3.2 Preprocessing and Feature Extraction
3.2.1 Machine Learning Algorithms
We utilized a preprocessing pipeline for both tasks with the aim of extracting more significant syntactic and lexical features. This preprocessing sequence encompassed the following stages:
- Tokenization: the raw text was broken down into individual tokens (words, in this case).
- Multi-Word Token (MWT) Expansion: combinations of words that function as a single unit were identified and treated as cohesive entities.
- Part of Speech (PoS) Tagging: to capture the grammatical roles and relationships within the text, each token was assigned a specific tag, such as noun, verb or adjective. PoS tagging helps capture syntactic structures and provides valuable context for the subsequent steps.
- Lemmatization: a text normalization technique that reduces words to their base or root form.
These preprocessing steps contribute significantly to refining the raw text, making it more suitable for analysis and improving classification accuracy. For our experiments, we used the sequences of lemmas and Part of Speech (PoS) tags as features.
The incorporation of PoS sequences was particularly important for both tasks, as it provided valuable insight into writing style and enabled us to identify patterns and combinations of PoS tags that shape the structure of a text. Overall, utilizing PoS sequences enhanced our ability to extract syntactic information.
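To make the pipeline concrete, the following is a minimal sketch of the four stages. The choice of the Stanza library is our assumption (the paper does not name a toolkit); its processors map onto the stages above and also produce the UPoS and XPoS tags used in Section 3.2.2.

```python
# Minimal sketch of the preprocessing pipeline, assuming the Stanza toolkit.
import stanza

# The four processors mirror the four stages described above:
# tokenization, MWT expansion, PoS tagging and lemmatization.
nlp = stanza.Pipeline(lang="en", processors="tokenize,mwt,pos,lemma")

def preprocess(text):
    """Return the lemma, UPoS and XPoS sequences for a raw text."""
    doc = nlp(text)
    words = [w for sent in doc.sentences for w in sent.words]
    lemmas = [w.lemma for w in words]
    upos = [w.upos for w in words]   # universal PoS tags
    xpos = [w.xpos for w in words]   # language-specific (English) PoS tags
    return lemmas, upos, xpos

lemmas, upos, xpos = preprocess("The generated texts weren't easy to detect.")
```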
3.2.2 LSTM Ensemble
For this approach, we worked with two different PoS tag configurations, UPoS and XPoS: the former provides a PoS analysis that is universal to most natural languages, while the latter provides a language-specific analysis, in this case for English.
This proposal was tested only on subtask 1 of the dataset (see Section 3.3). It involves training two LSTM cells that simultaneously analyse the UPoS and XPoS sequences, respectively, with each tag converted to a one-hot representation. The outputs of both LSTMs are concatenated and become the input of a Multi-Layer Perceptron (MLP) with the following characteristics:
- A linear layer with a number of neurons equal to the sum of the hidden states of both LSTMs, for a total of 64 units.
- A dropout layer for regularization, with a probability of 0.5 of each neuron being zeroed out.
- A ReLU activation function layer.
- A single output neuron with a Sigmoid activation function for classification.
Figure 2 shows the pipeline used for this ensemble proposal.
Fig. 2
LSTM ensemble proposal pipeline
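The PyTorch sketch below captures our reading of this architecture; the per-branch hidden size of 32 is an assumption chosen so that the concatenated representation matches the stated 64 units, and batching and padding details are omitted.

```python
# Sketch of the LSTM ensemble; hidden sizes are assumptions (32 + 32 = 64).
import torch
import torch.nn as nn

class LSTMEnsemble(nn.Module):
    def __init__(self, n_upos, n_xpos, hidden=32):
        super().__init__()
        # One LSTM per tag set; inputs are sequences of one-hot tag vectors.
        self.upos_lstm = nn.LSTM(n_upos, hidden, batch_first=True)
        self.xpos_lstm = nn.LSTM(n_xpos, hidden, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, 2 * hidden),  # linear layer with 64 units
            nn.Dropout(p=0.5),                  # regularization
            nn.ReLU(),
            nn.Linear(2 * hidden, 1),           # single output neuron
            nn.Sigmoid(),                       # binary classification
        )

    def forward(self, upos_onehot, xpos_onehot):
        # Take the final hidden state of each LSTM and concatenate them.
        _, (h_u, _) = self.upos_lstm(upos_onehot)
        _, (h_x, _) = self.xpos_lstm(xpos_onehot)
        joint = torch.cat([h_u[-1], h_x[-1]], dim=-1)
        return self.mlp(joint).squeeze(-1)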
3.3 Subtask 1: Automatically Generated Text Identification
We deemed feature extraction crucial for this task, prompting us to experiment with various combinations of features:
- a) Unigrams of lemmas with frequency counts.
- b) Unigrams of lemmas with binary counts.
- c) Unigrams of PoS with frequency counts.
- d) Unigrams of lemmas with binary counts along with unigrams of PoS and their frequency counts.
- e) Unigrams, bigrams and trigrams of lemmas with binary counts.
- f) Unigrams, bigrams and trigrams of PoS with frequency counts.
- g) Unigrams, bigrams and trigrams of lemmas with binary counts along with unigrams, bigrams and trigrams of PoS with their frequency counts.
We employed Logistic Regression and Support Vector Machines (SVMs) for classification, training both algorithms with Stochastic Gradient Descent. In the case of the Support Vector Machine, a linear kernel was applied; a sketch of this setup is shown below.
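The following sketch implements feature combination g) and both classifiers. The use of scikit-learn and the placeholder variables (train_lemmas, train_pos, y_train, holding space-joined token sequences from the preprocessing stage and the labels) are our assumptions.

```python
# Sketch of feature combination g) and the two SGD-trained classifiers,
# assuming scikit-learn; train_lemmas/train_pos/y_train are placeholders.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

lemma_vec = CountVectorizer(ngram_range=(1, 3), binary=True)  # binary counts
pos_vec = CountVectorizer(ngram_range=(1, 3))                 # frequency counts

X_train = hstack([lemma_vec.fit_transform(train_lemmas),
                  pos_vec.fit_transform(train_pos)])

# Under SGD, the log loss yields Logistic Regression and the hinge loss
# yields a linear-kernel SVM.
log_reg = SGDClassifier(loss="log_loss").fit(X_train, y_train)
svm = SGDClassifier(loss="hinge").fit(X_train, y_train)
```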
3.4 Subtask 2: Model Attribution
For this task, we experimented with the following features:
- a) Unigrams of lemmas with binary counts.
- b) Unigrams of PoS with frequency counts.
- c) Unigrams, bigrams and trigrams of lemmas with binary counts.
- d) Unigrams, bigrams and trigrams of PoS with frequency counts.
- e) Unigrams, bigrams and trigrams of lemmas with binary counts along with unigrams, bigrams and trigrams of PoS with their frequency counts.
As in the previous subtask, we employed Logistic Regression and Support Vector Machines trained with Stochastic Gradient Descent.
4 Results
4.1 Validation Methods and Evaluation
As stated in Section 3.1, the AuTexTification data is available for two subtasks, each with four datasets: a training set and a testing set for each of the English and Spanish languages (see Figure 1).
For this study, we focused exclusively on the English training and testing data of both subtasks.
We employed the training set as a development set. This involved partitioning the training data further to create a validation subset.
For the conventional machine learning techniques, we adopted cross-validation as the validation method; the validation results presented in this study are the averages obtained from a 10-fold cross-validation process.
For our LSTM ensemble proposal, on the other hand, we used a hold-out validation procedure on the training set, where 80% was kept as training data and the remaining 20% became the validation set.
Figures 3 and 4 illustrate the 10-fold cross-validation and hold-out validation procedures, respectively; a short sketch of both follows the figures.
Fig. 3
10-fold cross validation procedure
Fig. 4
Hold out validation procedure
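A minimal sketch of both validation procedures, assuming scikit-learn; X_train_full and y_train_full are placeholders for the vectorized training data and its labels.

```python
# Sketch of the two validation procedures used in this study.
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# 10-fold cross validation for the classical models; the reported
# validation metrics are averages over the ten folds.
scores = cross_val_score(SGDClassifier(loss="log_loss"),
                         X_train_full, y_train_full,
                         cv=10, scoring="f1_macro")
print(scores.mean())

# Hold-out validation for the LSTM ensemble: 80% training, 20% validation.
X_tr, X_val, y_tr, y_val = train_test_split(X_train_full, y_train_full,
                                            test_size=0.2)
```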
Subsequently, we evaluated our models on the designated test set. The outcomes of this testing phase serve as the basis for our comparison against both the state-of-the-art methods and the baselines established in the contest; the reported metrics can be computed as sketched below.
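A sketch, assuming scikit-learn metrics and placeholder X_test/y_test; macro averaging matches the contest's F1-macro measure.

```python
# Sketch of the evaluation on the test set; svm is a fitted classifier and
# X_test/y_test are placeholders for the vectorized test data and labels.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_pred = svm.predict(X_test)
recall = recall_score(y_test, y_pred, average="macro")
precision = precision_score(y_test, y_pred, average="macro")
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="macro")
```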
4.2 Classification Results: Subtask 1
4.2.1 LSTM Ensemble
For the LSTM ensemble, the selected loss function was Binary Cross Entropy, with a learning rate of 3×10⁻⁴; the model was trained for 80 epochs using the Adam optimizer [11]. Additionally, a hold-out validation procedure was performed on the provided training set, as described in Section 4.1. This configuration is sketched below.
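A sketch of this training configuration, reusing the LSTMEnsemble class from Section 3.2.2; the tag-set sizes and the train_loader batching are placeholders.

```python
# Training setup for the LSTM ensemble; vocabulary sizes and train_loader
# are placeholders, not values from the paper.
import torch

model = LSTMEnsemble(n_upos=18, n_xpos=50)
criterion = torch.nn.BCELoss()                             # Binary Cross Entropy
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # lr = 3e-4

for epoch in range(80):                                    # 80 epochs
    for upos_batch, xpos_batch, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(upos_batch, xpos_batch), labels)
        loss.backward()
        optimizer.step()
```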
To test the model's performance, the corresponding testing set was used. The model achieved a macro-averaged F1 score of only 0.62.
4.2.2 Machine Learning Algorithms
Initially, we tested the various features suggested in the preceding section using the development set. The results are presented in Table 1.
Table 1
Results of the experiments for subtask 1
| Model | Feat. | Recall | Precision | Accuracy | F1 score |
| LR    | a | 0.7636 | 0.763  | 0.7633 | 0.0763  |
|       | b | 0.774  | 0.7735 | 0.7732 | 0.7731  |
|       | c | 0.6215 | 0.5793 | 0.5811 | 0.53332 |
|       | d | 0.7831 | 0.7807 | 0.7801 | 0.7797  |
|       | e | 0.7405 | 0.7405 | 0.7405 | 0.7404  |
|       | f | 0.761  | 0.7546 | 0.7538 | 0.7538  |
|       | g | 0.8057 | 0.8055 | 0.8055 | 0.8054  |
| SVM   | a | 0.754  | 0.7552 | 0.7516 | 0.7512  |
|       | b | 0.7664 | 0.7616 | 0.7625 | 0.7587  |
|       | c | 0.6425 | 0.6076 | 0.6063 | 0.573   |
|       | d | 0.774  | 0.7616 | 0.7625 | 0.7587  |
|       | e | 0.7301 | 0.7301 | 0.73   | 0.7299  |
|       | f | 0.7558 | 0.7516 | 0.7518 | 0.7506  |
|       | g | 0.8016 | 0.8014 | 0.8014 | 0.8013  |
Table 1 shows that the optimal feature combination for both classification algorithms is combination "g", encompassing unigrams, bigrams and trigrams of lemmas, as well as unigrams, bigrams and trigrams of the PoS sequence.
As a result, we chose these features for the test set. The results are presented in Table 2.
Table 2
Results in the testing set with the selected features (subtask 1)
|          | Recall | Precision | Accuracy | F1 score |
| LR       | 0.7296 | 0.6942    | 0.699    | 0.6852   |
| SVM      | 0.7319 | 0.7258    | 0.7277   | 0.7252   |
| Baseline |        |           |          | 0.6578   |
| SotA     |        |           |          | 0.8091   |
4.3 Classification Results: Subtask 2
For subtask 2, we adhered to a similar approach, commencing with experimentation on the development set. The results are presented in Table 3.
Table 3
Results of the experiments for subtask 2
| Model | Feat. | Recall | Precision | Accuracy | F1 score |
| LR    | a | 0.2897 | 0.2777 | 0.2793 | 0.2277 |
|       | b | 0.3844 | 0.3916 | 0.3924 | 0.3847 |
|       | c | 0.4274 | 0.4399 | 0.4399 | 0.4302 |
|       | d | 0.3332 | 0.3416 | 0.3415 | 0.3444 |
|       | e | 0.459  | 0.4649 | 0.4646 | 0.4594 |
| SVM   | a | 0.2814 | 0.2513 | 0.2537 | 0.1945 |
|       | b | 0.3643 | 0.3701 | 0.3798 | 0.3647 |
|       | c | 0.4182 | 0.4284 | 0.4284 | 0.4176 |
|       | d | 0.3321 | 0.3208 | 0.3219 | 0.3088 |
|       | e | 0.4585 | 0.4514 | 0.4517 | 0.4443 |
Table 3 shows that the most effective feature combination for both classification algorithms is combination "e", which comprises unigrams, bigrams and trigrams of lemmas, along with unigrams, bigrams and trigrams of the PoS sequence. We therefore used these features for the test set. The outcomes are shown in Table 4.
Table 4
Results for the test set with the selected features
|          | Recall | Precision | Accuracy | F1 score |
| LR       | 0.4785 | 0.4854    | 0.4869   | 0.4797   |
| SVM      | 0.462  | 0.4844    | 0.4639   | 0.4602   |
| Baseline |        |           |          | 0.3998   |
| SotA     |        |           |          | 0.625    |
5 Conclusions
We introduced an approach addressing two distinct tasks that share similarities with the Natural Language Processing task of author attribution.
Our approach places a stronger emphasis on feature extraction than on the model itself, employing traditional machine learning algorithms. Notably, our Logistic Regression models surpass the baseline set by the AuTexTification contest organizers on both tasks.
While our approach does not reach state-of-the-art performance, it highlights the enduring importance of preprocessing and feature selection.
Tables 1 and 3 show the significant impact of these factors, revealing substantial variations in outcome even with the exact same model.
In the first task, our results with the Support Vector Machine approach state-of-the-art performance without resorting to large models, indicating that notable progress can be achieved. Furthermore, our findings underscore the pivotal role of syntactic information as a feature when discerning authorship.
It is important to note that, although a Deep Learning (DL) architecture was proposed, we consider that its lower performance stems from the lack of detailed text preprocessing tailored strictly to the tasks to be solved.
Therefore, special care must be taken when proposing DL models, since the preprocessing must consider both the characteristics of the text detection and model attribution tasks, as well as the type of preprocessing that works well with a given DL architecture.
We aspire to build upon our current approach by integrating these features into more sophisticated machine learning algorithms, such as Neural Networks. This progressive step aims to further enhance the quality of our outcomes.
Acknowledgments
The authors gratefully acknowledge the Instituto Politécnico Nacional (Secretaría Académica, Secretaría de Investigación y Posgrado, Centro de Investigación en Computación) and the Consejo Nacional de Humanidades Ciencias y Tecnologías (CONAHCYT) for their economic support to develop this work.
References
1. Alexander, K., Savvidou, C., Alexander, C. (2023). Who wrote this essay? Detecting AI-generated writing in second language education in higher education. Teaching English with Technology, Vol. 23, No. 2, pp. 25–43. DOI: 10.56297/BUKA4060/XHLD5365.
2. Antoun, W., Baly, F., Hajj, H. (2020). AraBERT: Transformer-based model for arabic language understanding. Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pp. 9–15.
3. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C. et al. (2020). Language models are few-shot learners. 34th Conference on Neural Information Processing Systems, Vol. 33, No. 159, pp. 1877–1901.
4. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N. et al. (2022). PaLM: Scaling language modeling with pathways. DOI: 10.48550/arXiv.2204.02311.
5. Clark, E., August, T., Serrano, S., Haduong, N., Gururangan, S., Smith, N. A. (2021). All that’s ’human’ is not gold: Evaluating human evaluation of generated text. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Vol. 1, pp. 7282–7296. DOI: 10.18653/v1/2021.acl-long.565.
6. Devlin, J., Chang, M. W., Lee, K., Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, pp. 4171–4186.
7. Fagni, T., Falchi, F., Gambini, M., Martella, A., Tesconi, M. (2021). TweepFake: About detecting deepfake tweets. PLoS One, Vol. 16, No. 5. DOI: 10.1371/journal.pone.0251415.
8. Gehrmann, S., Harvard, S., Strobelt, H., Rush, A. M. (2019). GLTR: Statistical detection and visualization of generated text. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 111–116. DOI: 10.18653/v1/P19-3019.
9. Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., Yue, J., Wu, Y. (2023). How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. DOI: 10.48550/arXiv.2301.07597.
10. Ippolito, D., Duckworth, D., Callison-Burch, C., Eck, D. (2019). Automatic detection of generated text is easiest when humans are fooled. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1808–1822. DOI: 10.18653/v1/2020.acl-main.164.
11. Kingma, D. P., Ba, J. (2014). ADAM: A method for stochastic optimization. International Conference on Learning Representations. DOI: 10.48550/arXiv.1412.6980.
12. Kumarage, T., Garland, J., Bhattacharjee, A., Trapeznikov, K., Ruston, S., Liu, H. (2023). Stylometric detection of AI-generated text in twitter timelines. DOI: 10.48550/arXiv.2303.03697.
13. Leiser, F., Eckhardt, S., Knaeble, M., Mädche, A., Schwabe, G., Sunyaev, A. (2023). From ChatGPT to FactGPT: A participatory design study to mitigate the effects of large language model hallucinations on users. Proceedings of Mensch und Computer 2023, pp. 81–90. DOI: 10.1145/3603555.3603565.
14. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L. (2019). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880. DOI: 10.18653/v1/2020.acl-main.703.
15. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. DOI: 10.48550/arXiv.1907.11692.
16. Ma, Y., Liu, J., Yi, F., Cheng, Q., Huang, Y., Lu, W., Liu, X. (2023). AI vs. human–differentiation analysis of scientific content generation. DOI: 10.48550/arXiv.2301.10416.
17. Mann, W. C., Matthiessen, C. M. (1983). Nigel: A systemic grammar for text generation. Technical report, University of Southern California, Marina del Rey, Information Sciences Institute.
18. McKeown, K. R. (1992). Text generation (Studies in natural language processing). Cambridge University Press.
19. Mireshghallah, F., Mattern, J., Gao, S., Shokri, R., Berg-Kirkpatrick, T. (2023). Smaller language models are better black-box machine-generated text detectors. DOI: 10.48550/arXiv.2305.09859.
20. Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., Finn, C. (2023). DetectGPT: Zero-shot machine-generated text detection using probability curvature. DOI: 10.48550/arXiv.2301.11305.
21. Mitrović, S., Andreoletti, D., Ayoub, O. (2023). ChatGPT or human? Detect and explain. Explaining decisions of machine learning model for detecting short ChatGPT-generated text. DOI: 10.48550/arXiv.2301.13852.
22. OpenAI (2023). GPT-4 technical report. DOI: 10.48550/arXiv.2303.08774.
23. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI.
24. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI, Vol. 1, No. 8, pp. 9.
25. Rodríguez, J. D., Hay, T., Gros, D., Shamsi, Z., Srinivasan, R. (2022). Cross-domain detection of GPT-2-generated technical text. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1213–1233. DOI: 10.18653/v1/2022.naacl-main.88.
26. Sarvazyan, A., González, J. Á., Franco, M., Rangel, F. M., Chulvi, M. A., Rosso, P. (2023). AuTexTification dataset (full data). Zenodo. DOI: 10.5281/zenodo.7956207.
27. Sarvazyan, A. M., González, J. Á., Franco-Salvador, M., Rangel, F., Chulvi, B., Rosso, P. (2023). Overview of AuTexTification at IberLEF 2023: Detection and attribution of machine-generated text in multiple domains. Procesamiento del Lenguaje Natural, No. 71, pp. 275–288.
28. Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., Tow, J., Rush, A. M., Biderman, S., Webson, A., Ammanamanchi, P. S., Wang, T., Sagot, B., Muennighoff, N., Villanova-Moral, A., Ruwase, O. et al. (2022). BLOOM: A 176B-parameter open-access multilingual language model. BigScience Workshop. DOI: 10.48550/arXiv.2211.05100.
29. Shijaku, R., Canhasi, E. (2023). ChatGPT generated text detection. Unpublished.
30. Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert-Voss, A., Wu, J., Radford, A., Krueger, G., Kim, J. W., Kreps, S., McCain, M., Newhouse, A., Blazakis, J., McGuffie, K., Wang, J. (2019). Release strategies and the social impacts of language models. OpenAI Report.
31. Tourille, J., Sow, B., Popescu, A. (2022). Automatic detection of bot-generated tweets. Proceedings of the 1st International Workshop on Multimedia AI against Disinformation, pp. 44–51. DOI: 10.1145/3512732.3533584.
32. Uchendu, A., Le, T., Shu, K., Lee, D. (2020). Authorship attribution for neural text generation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8384–8395. DOI: 10.18653/v1/2020.emnlp-main.673.
33. Zaitsu, W., Jin, M. (2023). Distinguishing ChatGPT (-3.5,-4)-generated and human-written papers through Japanese stylometric analysis. PLoS One, Vol. 18, No. 8. DOI: 10.1371/journal.pone.0288453.
34. Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., Choi, Y. (2019). Defending against neural fake news. Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vol. 32, No. 812, pp. 9054–9065.
35. Zhan, H., He, X., Xu, Q., Wu, Y., Stenetorp, P. (2023). G3Detector: General GPT-generated text detector.