Leveraging Open-Source Large Language Models for Native Language Identification

Yee Man Ng, Ilia Markov

Research output: Working paper / Preprint › Preprint › Academic

Abstract

Native Language Identification (NLI), the task of identifying the native language (L1) of a person based on their writing in the second language (L2), has applications in forensics, marketing, and second language acquisition. Historically, conventional machine learning approaches that rely heavily on extensive feature engineering have outperformed transformer-based language models on this task. Recently, closed-source generative large language models (LLMs), e.g., GPT-4, have demonstrated remarkable performance on NLI in a zero-shot setting, including promising results in open-set classification. However, closed-source LLMs have many disadvantages, such as high costs and the undisclosed nature of their training data. This study explores the potential of using open-source LLMs for NLI. Our results indicate that open-source LLMs do not reach the accuracy levels of closed-source LLMs when used out-of-the-box. However, when fine-tuned on labeled training data, open-source LLMs can achieve performance comparable to that of commercial LLMs.
Original language: Undefined/Unknown
Publication status: Published - 15 Sept 2024

Keywords

  • cs.CL
  • Leveraging Open-Source Large Language Models for Native Language Identification

    Ng, Y. M. & Markov, I., 2025, VarDial 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings of the Workshop. Scherrer, Y., Jauhiainen, T., Ljubesic, N., Nakov, P., Tiedemann, J. & Zampieri, M. (eds.). Association for Computational Linguistics (ACL), p. 20-28 9 p. (VarDial 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings of the Workshop).

    Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review