Abstract
Tables are used for representing similarly structured data in an enormous variety of documents and formats. Over the past 30 years, the huge increase in the volume of published documents on the web has opened up the potential for automatically extracting information from human-readable tables in those documents. Such tables may express information that is useful for many applications (such as web search and question answering), but they often have diverse contents and structures. To deal with this diversity, these applications can benefit from automatically integrating the information from such tables into coherent, machine-readable Knowledge Bases (KBs).
The first step of this integration process is table interpretation, which consists of finding associations between table contents and some background knowledge. However, effectively leveraging this background knowledge introduces several challenges. One such challenge is that, on existing domains, tables are not only interpreted using a KB as background knowledge, but are also used to extend the same KB with new information, which may cause a bias towards redundant extractions. A further issue is that some tables cannot be interpreted in isolation, because the KB does not provide enough background knowledge about them. Additionally, on new domains this background knowledge must be specified manually, which may require considerable human effort. Existing approaches are often ill-suited to dealing with such real-world phenomena.
In this thesis, we investigate several real-world phenomena that impede the integration of tables with KBs, and we present approaches for overcoming these challenges in order to extend KBs with new information extracted from human-readable tables.
First, we describe the challenge of extracting novel information from tables that is not yet present in a KB, when the same KB is used for interpreting them. To address this problem, we develop a new evaluation approach, as well as a novel table interpretation method.
Second, we describe the challenge of extending KBs with n-ary facts extracted from tables, that is, facts that involve more than two entities or values. To address this, we present and evaluate a data extraction pipeline for Wikipedia tables that overcomes the diversity of table layouts and the sparsity of n-ary information in the Wikidata KB.
Third, we investigate the problem of creating a KB on a new domain from tables with minimal human supervision. We propose a solution that combines weakly supervised machine learning and logical reasoning, and we evaluate it on tables from scientific papers.
Fourth, we describe a new system for supporting research into data extraction and KB extension pipelines from tables.
In conclusion, the techniques we have developed effectively leverage background knowledge to overcome the variety of table contents and structures, thereby contributing towards the extension of KBs with new information from human-readable tables.
Nevertheless, there is still much potential for research into KB extension from tables.
We close this thesis with suggestions for future research, and reflections on research in this field.
Original language | English |
---|---|
Qualification | PhD |
Awarding Institution | |
Supervisors/Advisors | |
Award date | 8 Dec 2021 |
Publication status | Published - 8 Dec 2021 |
Keywords
- knowledge bases
- web tables
- data integration
- knowledge graphs
- table interpretation