Abstract
Automated text analysis allows researchers to analyze large quantities of text. Yet, comparative researchers are presented with a big challenge: across countries people speak different languages. To address this issue, some analysts have suggested using Google Translate to convert all texts into English before starting the analysis (Lucas et al. 2015). But in doing so, do we get lost in translation? This paper evaluates the usefulness of machine translation for bag-of-words models, such as topic models. We use the Europarl dataset and compare term-document matrices (TDMs) as well as topic model results from gold standard translated text and machine-translated text. We evaluate results at both the document and the corpus level. We first find TDMs for both text corpora to be highly similar, with minor differences across languages. What is more, we find considerable overlap in the set of features generated from human-translated and machine-translated texts. With regard to LDA topic models, we find topical prevalence and topical content to be highly similar, again with only small differences across languages. We conclude that Google Translate is a useful tool for comparative researchers when using bag-of-words text models.
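As a rough illustration of the document-level comparison described in the abstract, the sketch below builds bag-of-words count vectors for a human translation and a machine translation of the same text and compares them. It is not the authors' replication code: the regex tokenizer, the choice of cosine similarity and the Jaccard index as measures, the function names, and the two example sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count word tokens (a crude bag-of-words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_overlap(a, b):
    """Share of distinct terms found in both vocabularies (Jaccard index)."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return len(a.keys() & b.keys()) / len(union)


# Hypothetical example sentences, not taken from the Europarl corpus.
human = "The committee approved the budget proposal for next year"
machine = "The committee has approved the budget proposal for the next year"

# Each Counter plays the role of one row of a term-document matrix.
human_tdm_row = term_counts(human)
machine_tdm_row = term_counts(machine)

doc_similarity = cosine_similarity(human_tdm_row, machine_tdm_row)
vocab_overlap = feature_overlap(human_tdm_row, machine_tdm_row)
print(f"cosine similarity: {doc_similarity:.3f}")
print(f"feature overlap:   {vocab_overlap:.3f}")
```

Because word order is discarded, small differences in function words between the two translations change the vectors only slightly, which is the intuition behind testing machine translation specifically for bag-of-words models.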
Field | Value |
---|---|
Original language | English |
Pages (from-to) | 417-430 |
Number of pages | 14 |
Journal | Political Analysis |
Volume | 26 |
Issue number | 4 |
Early online date | 11 Sept 2018 |
DOIs | |
Publication status | Published - Oct 2018 |
Funding
Authors: Erik de Vries (1), Martijn Schoonvelde (2), Gijs Schumacher (3)
ORCID: http://orcid.org/0000-0002-6503-4514

1. Department of Media and Social Sciences, University of Stavanger, Stavanger, Norway
2. Department of Political Science and Public Administration, Vrije Universiteit, Amsterdam, The Netherlands
3. Department of Political Science, University of Amsterdam, Amsterdam, The Netherlands

Authors' note: Replication code and data are available at the Political Analysis Dataverse (De Vries, Schoonvelde, and Schumacher 2018), while the supplementary materials for this article are available on the Political Analysis web site. The authors would like to thank James Cross, Aki Matsuo, Christian Rauh, Damian Trilling, Mariken van der Velden and Barbara Vis for helpful comments and suggestions. GS and MS acknowledge funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 649281, EUENGAGE. EdV acknowledges funding for a research assistantship from the Access Europe (since 2018: UVAccess Europe) research center at the University of Amsterdam.

Contributing Editor: Jonathan N. Katz

Copyright © The Author(s) 2018. Published by Cambridge University Press on behalf of the Society for Political Methodology.
Funders | Funder number |
---|---|
Access Europe | |
European Union’s Horizon 2020 | |
Horizon 2020 Framework Programme | 649281 |
Universiteit van Amsterdam | |
Keywords
- automated content analysis
- bag-of-words models
- Google Translate
- LDA
- statistical analysis of texts