No longer lost in translation: Evidence that Google Translate works for comparative Bag-of-Words Text Applications

Erik De Vries, Martijn Schoonvelde, Gijs Schumacher

Research output: Contribution to Journal > Article > Academic > peer-review

Abstract

Automated text analysis allows researchers to analyze large quantities of text. Yet, comparative researchers are presented with a big challenge: across countries people speak different languages. To address this issue, some analysts have suggested using Google Translate to convert all texts into English before starting the analysis (Lucas et al. 2015). But in doing so, do we get lost in translation? This paper evaluates the usefulness of machine translation for bag-of-words models, such as topic models. We use the Europarl dataset and compare term-document matrices (TDMs) as well as topic model results from gold-standard translated text and machine-translated text. We evaluate results at both the document and the corpus level. We first find TDMs for both text corpora to be highly similar, with minor differences across languages. What is more, we find considerable overlap in the set of features generated from human-translated and machine-translated texts. With regard to LDA topic models, we find topical prevalence and topical content to be highly similar, with again only small differences across languages. We conclude that Google Translate is a useful tool for comparative researchers when using bag-of-words text models.
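
As a rough illustration of the document-level comparison described in the abstract, the sketch below builds bag-of-words counts for two English versions of the same sentence and scores how close they are. The abstract does not say which similarity measure the authors use, so cosine similarity and Jaccard feature overlap are assumptions here, and the sentence pair and the function names (`bag_of_words`, `cosine_similarity`, `feature_overlap`) are hypothetical, invented for this example only.

```python
from collections import Counter
import math
import re


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word tokens (one row of a TDM)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    # Counter returns 0 for terms missing from b, so absent terms add nothing.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_overlap(a: Counter, b: Counter) -> float:
    """Share of distinct terms present in both documents (Jaccard index)."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


# Hypothetical sentence pair standing in for a human translation and a
# machine translation of the same source sentence (not data from the paper).
human = bag_of_words("The committee approved the budget proposal")
machine = bag_of_words("The committee adopted the budget proposal")

print(round(cosine_similarity(human, machine), 3))
print(round(feature_overlap(human, machine), 3))
```

Because bag-of-words models ignore word order, a single differing word choice ("approved" versus "adopted") lowers both scores only slightly, which is the intuition behind testing machine translation on this class of models.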

Original language: English
Pages (from-to): 417-430
Number of pages: 14
Journal: Political Analysis
Volume: 26
Issue number: 4
Early online date: 11 Sep 2018
DOIs: 10.1017/pan.2018.26
Publication status: Published - Oct 2018

Fingerprint

search engine
evidence
language
text analysis
gold standard

Keywords

  • automated content analysis
  • bag-of-words models
  • Google Translate
  • LDA
  • statistical analysis of texts

Cite this

De Vries, Erik; Schoonvelde, Martijn; Schumacher, Gijs. / No longer lost in translation: Evidence that Google Translate works for comparative Bag-of-Words Text Applications. In: Political Analysis. 2018; Vol. 26, No. 4. pp. 417-430.
@article{ce76de42fb8f46a8bb942eef88859ed9,
title = "No longer lost in translation: Evidence that Google Translate works for comparative Bag-of-Words Text Applications",
abstract = "Automated text analysis allows researchers to analyze large quantities of text. Yet, comparative researchers are presented with a big challenge: across countries people speak different languages. To address this issue, some analysts have suggested using Google Translate to convert all texts into English before starting the analysis (Lucas et al. 2015). But in doing so, do we get lost in translation? This paper evaluates the usefulness of machine translation for bag-of-words models, such as topic models. We use the Europarl dataset and compare term-document matrices (TDMs) as well as topic model results from gold-standard translated text and machine-translated text. We evaluate results at both the document and the corpus level. We first find TDMs for both text corpora to be highly similar, with minor differences across languages. What is more, we find considerable overlap in the set of features generated from human-translated and machine-translated texts. With regard to LDA topic models, we find topical prevalence and topical content to be highly similar, with again only small differences across languages. We conclude that Google Translate is a useful tool for comparative researchers when using bag-of-words text models.",
keywords = "automated content analysis, bag-of-words models, Google Translate, LDA, statistical analysis of texts",
author = "{De Vries}, Erik and Martijn Schoonvelde and Gijs Schumacher",
year = "2018",
month = "10",
doi = "10.1017/pan.2018.26",
language = "English",
volume = "26",
pages = "417--430",
journal = "Political Analysis",
issn = "1047-1987",
publisher = "Cambridge University Press",
number = "4",

}

No longer lost in translation: Evidence that Google Translate works for comparative Bag-of-Words Text Applications. / De Vries, Erik; Schoonvelde, Martijn; Schumacher, Gijs.

In: Political Analysis, Vol. 26, No. 4, 10.2018, p. 417-430.

TY - JOUR

T1 - No longer lost in translation

T2 - Evidence that Google Translate works for comparative Bag-of-Words Text Applications

AU - De Vries, Erik

AU - Schoonvelde, Martijn

AU - Schumacher, Gijs

PY - 2018/10

Y1 - 2018/10

AB - Automated text analysis allows researchers to analyze large quantities of text. Yet, comparative researchers are presented with a big challenge: across countries people speak different languages. To address this issue, some analysts have suggested using Google Translate to convert all texts into English before starting the analysis (Lucas et al. 2015). But in doing so, do we get lost in translation? This paper evaluates the usefulness of machine translation for bag-of-words models, such as topic models. We use the Europarl dataset and compare term-document matrices (TDMs) as well as topic model results from gold-standard translated text and machine-translated text. We evaluate results at both the document and the corpus level. We first find TDMs for both text corpora to be highly similar, with minor differences across languages. What is more, we find considerable overlap in the set of features generated from human-translated and machine-translated texts. With regard to LDA topic models, we find topical prevalence and topical content to be highly similar, with again only small differences across languages. We conclude that Google Translate is a useful tool for comparative researchers when using bag-of-words text models.

KW - automated content analysis

KW - bag-of-words models

KW - Google Translate

KW - LDA

KW - statistical analysis of texts

UR - http://www.scopus.com/inward/record.url?scp=85054482177&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85054482177&partnerID=8YFLogxK

U2 - 10.1017/pan.2018.26

DO - 10.1017/pan.2018.26

M3 - Article

VL - 26

SP - 417

EP - 430

JO - Political Analysis

JF - Political Analysis

SN - 1047-1987

IS - 4

ER -