Don't Annotate, but Validate: a Data-to-Text Method for Capturing Event Data

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

In this paper, we present a new method to obtain large volumes of high-quality text corpora with event data for studying identity and reference relations. We report on the current methods to create event reference data by annotating texts and deriving the event data a posteriori. Our method starts from event registries in which event data is defined a priori. From this data, we extract so-called Microworlds of referential data with the Reference Texts that report on these events. This makes it possible to easily establish referential relations with high precision and at a large scale. In a pilot, we successfully obtained data from these resources with extreme ambiguity and variation, while maintaining the identity and reference relations and without having to annotate large quantities of texts word-by-word. The data from this pilot was annotated using an annotation tool created specifically in order to validate our method and to enrich the reference texts with event coreference annotations. This annotation process resulted in the Gun Violence Corpus, whose development process and outcome are described in this paper.
Original language: English
Title of host publication: [Proceedings of the] Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Editors: Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
Place of Publication: Miyazaki
Publisher: LREC
Pages: 3034-3042
Number of pages: 9
ISBN (Electronic): 9791095546009
Publication status: Published - 2018
Event: 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: 7 May 2018 to 12 May 2018

Conference

Conference: 11th International Conference on Language Resources and Evaluation, LREC 2018
Country: Japan
City: Miyazaki
Period: 7/05/18 to 12/05/18

Fingerprint

event
violence
Annotation
resources
Referential

Keywords

  • Event coreference
  • Structured data
  • Text corpora

Cite this

Vossen, P., Ilievski, F., Postma, M., & Segers, R. H. (2018). Don't Annotate, but Validate: a Data-to-Text Method for Capturing Event Data. In H. Isahara, B. Maegaard, S. Piperidis, C. Cieri, T. Declerck, K. Hasida, H. Mazo, K. Choukri, S. Goggi, J. Mariani, A. Moreno, N. Calzolari, J. Odijk, ... T. Tokunaga (Eds.), [Proceedings of the] Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 3034-3042). Miyazaki: LREC.
@inproceedings{4ad527e286804cd1aeef15921662a614,
title = "Don't Annotate, but Validate: a Data-to-Text Method for Capturing Event Data",
abstract = "In this paper, we present a new method to obtain large volumes of high-quality text corpora with event data for studying identity and reference relations. We report on the current methods to create event reference data by annotating texts and deriving the event data a posteriori. Our method starts from event registries in which event data is defined a priori. From this data, we extract so-called Microworlds of referential data with the Reference Texts that report on these events. This makes it possible to easily establish referential relations with high precision and at a large scale. In a pilot, we successfully obtained data from these resources with extreme ambiguity and variation, while maintaining the identity and reference relations and without having to annotate large quantities of texts word-by-word. The data from this pilot was annotated using an annotation tool created specifically in order to validate our method and to enrich the reference texts with event coreference annotations. This annotation process resulted in the Gun Violence Corpus, whose development process and outcome are described in this paper.",
keywords = "Event coreference, Structured data, Text corpora",
author = "Piek Vossen and Filip Ilievski and Marten Postma and R.H. Segers",
year = "2018",
language = "English",
pages = "3034--3042",
editor = "Hitoshi Isahara and Bente Maegaard and Stelios Piperidis and Christopher Cieri and Thierry Declerck and Koiti Hasida and Helene Mazo and Khalid Choukri and Sara Goggi and Joseph Mariani and Asuncion Moreno and Nicoletta Calzolari and Jan Odijk and Takenobu Tokunaga",
booktitle = "[Proceedings of the] Eleventh International Conference on Language Resources and Evaluation (LREC 2018)",
publisher = "LREC",

}

TY  - GEN
T1  - Don't Annotate, but Validate: a Data-to-Text Method for Capturing Event Data
AU  - Vossen, Piek
AU  - Ilievski, Filip
AU  - Postma, Marten
AU  - Segers, R.H.
PY  - 2018
Y1  - 2018
AB  - In this paper, we present a new method to obtain large volumes of high-quality text corpora with event data for studying identity and reference relations. We report on the current methods to create event reference data by annotating texts and deriving the event data a posteriori. Our method starts from event registries in which event data is defined a priori. From this data, we extract so-called Microworlds of referential data with the Reference Texts that report on these events. This makes it possible to easily establish referential relations with high precision and at a large scale. In a pilot, we successfully obtained data from these resources with extreme ambiguity and variation, while maintaining the identity and reference relations and without having to annotate large quantities of texts word-by-word. The data from this pilot was annotated using an annotation tool created specifically in order to validate our method and to enrich the reference texts with event coreference annotations. This annotation process resulted in the Gun Violence Corpus, whose development process and outcome are described in this paper.
KW  - Event coreference
KW  - Structured data
KW  - Text corpora
UR  - http://www.scopus.com/inward/record.url?scp=85059912685&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85059912685&partnerID=8YFLogxK
UR  - http://lrec2018.lrec-conf.org/en/
M3  - Conference contribution
SP  - 3034
EP  - 3042
BT  - [Proceedings of the] Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
A2  - Isahara, Hitoshi
A2  - Maegaard, Bente
A2  - Piperidis, Stelios
A2  - Cieri, Christopher
A2  - Declerck, Thierry
A2  - Hasida, Koiti
A2  - Mazo, Helene
A2  - Choukri, Khalid
A2  - Goggi, Sara
A2  - Mariani, Joseph
A2  - Moreno, Asuncion
A2  - Calzolari, Nicoletta
A2  - Odijk, Jan
A2  - Tokunaga, Takenobu
PB  - LREC
CY  - Miyazaki
ER  -
