Combining information on structure and content to automatically annotate natural science spreadsheets

Martine de Vos, Jan Wielemaker, Hajo Rijgersberg, Guus Schreiber, Bob Wielinga, Jan Top

Research output: Contribution to Journal, Article, Academic, peer-review

Abstract

In this paper we propose several approaches for the automatic annotation of natural science spreadsheets that combine structural properties of the tables with external vocabularies. When designing their spreadsheets, domain scientists implicitly embed their domain model in the content and structure of the spreadsheet tables. This domain model is nevertheless essential for interpreting the spreadsheet data unambiguously. The overall objective of this research is to make the underlying domain model explicit in order to facilitate evaluation and reuse of these data. We present our annotation approaches by describing five structural properties of natural science spreadsheets that may pose challenges to annotation and, at the same time, provide additional information on the content. For example, the main property we describe is that, within a spreadsheet table, semantically related terms are grouped in rectangular blocks. For each of the five structural properties we suggest an annotation approach that combines heuristics on the property with knowledge from external vocabularies. We evaluate our approaches in a case study on a set of existing natural science spreadsheets by comparing the annotation results with a baseline based on purely lexical matching. The case study results show that combining information on structural properties of spreadsheet tables with lexical matching to external vocabularies yields higher precision and recall in the annotation of individual terms. We also show that the semantic characterization of blocks of spreadsheet terms is an essential first step in identifying relations between cells in a table. As such, the annotation approaches presented in this study provide the basic information needed to construct the domain model of scientific spreadsheets.
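
The abstract's idea of using rectangular blocks to support lexical matching can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy vocabulary, the function names, the "annotate only unambiguous terms" baseline, and the majority-vote heuristic are all assumptions made for illustration. The sketch compares a purely lexical baseline with a block-level annotation on one hypothetical block of column headers.

```python
from collections import Counter

# Hypothetical toy vocabulary (an assumption, not from the paper):
# each known term maps to its candidate semantic types.
VOCABULARY = {
    "n": {"substance", "unit"},   # nitrogen or newton
    "p": {"substance"},           # phosphorus
    "k": {"substance", "unit"},   # potassium or kelvin
    "kg": {"unit"},
}


def lexical_candidates(term):
    """Return all candidate types found by plain string lookup."""
    return VOCABULARY.get(term.strip().lower(), set())


def annotate_lexical(terms):
    """Baseline: annotate a term only when lookup is unambiguous."""
    result = {}
    for term in terms:
        candidates = lexical_candidates(term)
        result[term] = next(iter(candidates)) if len(candidates) == 1 else None
    return result


def annotate_block(terms):
    """Block heuristic: give every term the block's most frequent type."""
    votes = Counter()
    for term in terms:
        votes.update(lexical_candidates(term))
    if not votes:
        return {term: None for term in terms}
    block_type, _ = votes.most_common(1)[0]  # ties are broken arbitrarily
    return {term: block_type for term in terms}


def precision_recall(predicted, gold):
    """Precision and recall of term annotations against a gold standard."""
    made = {t: a for t, a in predicted.items() if a is not None}
    correct = sum(1 for t, a in made.items() if gold.get(t) == a)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# One hypothetical block of semantically related header cells.
block = ["N", "P", "K", "S"]
gold = {term: "substance" for term in block}

baseline_scores = precision_recall(annotate_lexical(block), gold)
block_scores = precision_recall(annotate_block(block), gold)
```

In this toy setting the block vote resolves the ambiguous terms and also covers "S", which is missing from the vocabulary, so recall rises while precision is unchanged. The numbers only illustrate the mechanism and say nothing about the results reported in the paper.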

Language: English
Pages: 63-76
Number of pages: 14
Journal: International Journal of Human-Computer Studies
Volume: 103
DOI: 10.1016/j.ijhcs.2017.02.006
State: Published - 1 Jul 2017

Fingerprint

Natural sciences
Spreadsheets
vocabulary
Structural properties
heuristics
semantics
evaluation

Keywords

  • Domain Model
  • Implicit knowledge
  • Methodology
  • Spreadsheets
  • Units of measure
  • Vocabularies

Cite this

@article{3b62144c7c004131b7f97b9c6094f8de,
title = "Combining information on structure and content to automatically annotate natural science spreadsheets",
keywords = "Domain Model, Implicit knowledge, Methodology, Spreadsheets, Units of measure, Vocabularies",
author = "{de Vos}, Martine and Jan Wielemaker and Hajo Rijgersberg and Guus Schreiber and Bob Wielinga and Jan Top",
year = "2017",
month = "7",
day = "1",
doi = "10.1016/j.ijhcs.2017.02.006",
language = "English",
volume = "103",
pages = "63--76",
journal = "International Journal of Human-Computer Studies",
issn = "1071-5819",
publisher = "Academic Press Inc.",

}

Combining information on structure and content to automatically annotate natural science spreadsheets. / de Vos, Martine; Wielemaker, Jan; Rijgersberg, Hajo; Schreiber, Guus; Wielinga, Bob; Top, Jan.

In: International Journal of Human-Computer Studies, Vol. 103, 01.07.2017, p. 63-76.

TY - JOUR

T1 - Combining information on structure and content to automatically annotate natural science spreadsheets

AU - de Vos, Martine

AU - Wielemaker, Jan

AU - Rijgersberg, Hajo

AU - Schreiber, Guus

AU - Wielinga, Bob

AU - Top, Jan

PY - 2017/7/1

Y1 - 2017/7/1

KW - Domain Model

KW - Implicit knowledge

KW - Methodology

KW - Spreadsheets

KW - Units of measure

KW - Vocabularies

UR - http://www.scopus.com/inward/record.url?scp=85014078190&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85014078190&partnerID=8YFLogxK

U2 - 10.1016/j.ijhcs.2017.02.006

DO - 10.1016/j.ijhcs.2017.02.006

M3 - Article

VL - 103

SP - 63

EP - 76

JO - International Journal of Human-Computer Studies

T2 - International Journal of Human-Computer Studies

JF - International Journal of Human-Computer Studies

SN - 1071-5819

ER -