Leveraging Open-Source Large Language Models for Native Language Identification

Yee Man Ng, Ilia Markov

Research output: Chapter in Book / Report / Conference proceeding, Conference contribution, Academic, peer-review

Abstract

Native Language Identification (NLI) – the task of identifying the native language (L1) of a person based on their writing in a second language (L2) – has applications in forensics, marketing, and second language acquisition. Historically, conventional machine learning approaches that rely heavily on extensive feature engineering have outperformed transformer-based language models on this task. Recently, closed-source generative large language models (LLMs), e.g., GPT-4, have demonstrated remarkable performance on NLI in a zero-shot setting, including promising results in open-set classification. However, closed-source LLMs have many disadvantages, such as high costs and the undisclosed nature of their training data. This study explores the potential of using open-source LLMs for NLI. Our results indicate that open-source LLMs do not reach the accuracy levels of closed-source LLMs when used out-of-the-box. However, when fine-tuned on labeled training data, open-source LLMs can achieve performance comparable to that of commercial LLMs.

Original language: English
Title of host publication: VarDial 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings of the Workshop
Editors: Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Publisher: Association for Computational Linguistics (ACL)
Pages: 20-28
Number of pages: 9
ISBN (Electronic): 9798891762084
Publication status: Published - 2025
Event: 12th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2025 - co-located with the 31st International Conference on Computational Linguistics, COLING 2025 - Abu Dhabi, United Arab Emirates
Duration: 19 Jan 2025 → …

Publication series

Name: VarDial 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings of the Workshop

Conference

Conference: 12th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2025 - co-located with the 31st International Conference on Computational Linguistics, COLING 2025
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 19/01/25 → …

Bibliographical note

Publisher Copyright:
© 2025 Association for Computational Linguistics.
