Incorporating Prior Knowledge into Word Embedding for Chinese Word Similarity Measurement

  • Degen Huang
  • Jiahuan Pei
  • Cong Zhang
  • Kaiyu Huang
  • Jianjun Ma

Research output: Contribution to Journal, Article, Academic, peer-review

Abstract

Word embedding-based methods have received increasing attention for their flexibility and effectiveness in many natural language processing (NLP) tasks, including Word Similarity (WS). However, these approaches rely on high-quality corpora and neglect prior knowledge. Lexicon-based methods concentrate on the human knowledge contained in semantic resources, e.g., Tongyici Cilin, HowNet, and Chinese WordNet, but they cannot handle unknown words. This article proposes a three-stage framework for measuring Chinese word similarity by incorporating prior knowledge obtained from lexicons and statistics into word embedding. In the first stage, we use retrieval techniques to crawl the contexts of word pairs from web resources to extend the context corpus. In the second stage, we investigate three types of single similarity measurements: lexicon similarities, statistical similarities, and embedding-based similarities. In the final stage, we exploit simple combination strategies based on mathematical operations and a counter-fitting combination strategy based on an optimization method. To demonstrate our system's effectiveness, comparative experiments are conducted on the PKU-500 dataset. Our final results are Spearman/Pearson correlation coefficients of 0.561/0.516, which, to the best of our knowledge, outperform the state of the art. Experimental results on the Chinese MC-30 and SemEval-2012 datasets show that our system also performs well on other Chinese datasets, which demonstrates its transferability. In addition, our system is not language-specific and can be applied to other languages, e.g., English.
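As a rough illustration of the last two steps described in the abstract (combining single similarity scores with a simple math operation, then scoring the result against human ratings with Spearman and Pearson correlation), a minimal Python sketch follows. It is not the authors' implementation: the function names (`combine`, `ranks`, `pearson`, `spearman`), the choice of mean/max/min as combination operations, and all numbers are assumptions made up for illustration. The counter-fitting strategy is not shown.

```python
from math import sqrt
from statistics import mean


def combine(scores, strategy="mean"):
    """Combine several single similarity scores for one word pair.

    The available strategies are illustrative assumptions, not the
    paper's exact set of operations.
    """
    if strategy == "mean":
        return mean(scores)
    if strategy == "max":
        return max(scores)
    if strategy == "min":
        return min(scores)
    raise ValueError(f"unknown strategy: {strategy}")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


# Made-up (lexicon, statistical, embedding) scores for four
# hypothetical word pairs, plus made-up human ratings.
single_scores = [
    (0.9, 0.8, 0.7),
    (0.2, 0.4, 0.3),
    (0.6, 0.5, 0.7),
    (0.1, 0.2, 0.1),
]
human = [9.0, 3.5, 6.0, 1.0]

combined = [combine(s, "mean") for s in single_scores]
rho = spearman(combined, human)  # rank agreement with human ratings
r = pearson(combined, human)     # linear agreement with human ratings
```

Spearman compares only the ordering of word pairs, while Pearson also depends on how linearly the scores track the ratings, which is why evaluations of this kind often report both.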
Original language: English
Article number: 3182622
Pages (from-to): 1-21
Number of pages: 21
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 17
Issue number: 3
Early online date: 2 Apr 2018
DOIs: https://doi.org/10.1145/3182622
Publication status: Published - Sept 2018
Externally published: Yes

Funding

This work is supported by the National Natural Science Foundation of China (Nos. 61672127, 61173100) and the National Social Science Foundation of China (No. 15BYY175). We also wish to thank NVIDIA Corporation for their donation of a Tesla K40c GPU device. Authors' addresses: D. Huang, J. Pei, C. Zhang, K. Huang, and J. Ma, Innovation Park Building A0933, School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, China; emails: [email protected], {p_sunrise, cccaaag, loverainbow}@mail.dlut.edu.cn, [email protected]. © 2018 ACM 2375-4699/2018/04-ART23 https://doi.org/10.1145/3182622

Funders and funder numbers:
  • Nvidia
  • National Natural Science Foundation of China: 61173100, 61672127
  • National Office for Philosophy and Social Sciences: 15BYY175
