With the progress of the Web, significant amounts of information have become available online. This information spans data types such as news articles, tweets, cultural heritage objects, and audio-visual archives, and is distributed across various channels such as traditional and social media. This democratization of information poses several challenges for search engines, information retrieval systems, and natural language processing systems, as they need to (1) extract meaningful information from any data modality (i.e., text, image, video) and (2) synthesize streams of data and information from various channels into concise pieces of information that answer the needs and requests of end-users. Events play an important role in understanding and contextualizing information, as well as in shaping human interpretation. By definition, events are complex entities, yet they are essential for querying, perceiving, and consuming the meaning of the information surrounding us. Therefore, we need to understand what an event is, how to describe an event, and to what extent a document is meaningful or relevant for a given event or topic. Typically, events create context by introducing related properties or entities, such as the participants involved, the locations where the event takes place, or the time period in which it occurs. The event space is typically represented in different data streams and channels: an event is likely mentioned both in news articles and tweets, as well as in textual and audio-visual media. Hence, besides relevance, we need to extend event understanding with salience and novelty features to minimize redundancy, and with subjective semantics such as sentiments and sentiment intensities to account for the multitude of perspectives. The information extraction community acknowledges the importance of events.
However, the accuracy of identifying events is still not optimal, as events (1) are vague, (2) carry multiple perspectives, and (3) have different granularities. The mainstream procedure for event annotation relies on experts. However, even experts disagree to a large extent. Crowdsourcing has emerged as a reliable, time- and cost-efficient approach for gathering semantic annotations. A major bottleneck, however, is that most crowdsourcing practices are neither systematic nor sustainable, while state-of-the-art methods exist only for a specific domain or input. Typically, solutions for assessing the quality of crowdsourced data are based on the hypothesis that there is only one right answer, which contradicts the three characteristics of events, i.e., vagueness, multiple perspectives, and granularities, as well as the latent ambiguity of natural language. The recently proposed CrowdTruth methodology addresses these issues. More precisely, experiments performed in the context of the CrowdTruth methodology showed that disagreement between workers and diversity between annotations are signals for identifying low-quality workers and for better understanding data ambiguity. In this thesis, we investigate how diversity in crowdsourcing can improve the machine understanding of events and their characteristics. More precisely, we explore how events are perceived and represented across data modalities (e.g., text, image, video) and sources (e.g., news articles, tweets, video broadcasts). We integrate machines and humans in a systematic way, i.e., with a focus on experimental methodologies and replicability, and in a sustainable way, i.e., with a focus on the reusability of data, code, and results.
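The disagreement-as-signal idea can be sketched as follows. This is a minimal illustration with invented data: the function names and the simple cosine-based score are assumptions for illustration only, not the official CrowdTruth metrics.

```python
# Illustrative sketch of a disagreement-aware worker quality signal,
# loosely inspired by the CrowdTruth idea that agreement between a worker
# and the rest of the crowd reveals low-quality workers. NOT the official
# CrowdTruth implementation; the data below is invented.
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two annotation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def worker_quality(annotations, worker):
    """Average cosine similarity between a worker's annotation vector on
    each unit and the aggregated vector of all OTHER workers on that unit."""
    scores = []
    for unit, by_worker in annotations.items():
        if worker not in by_worker:
            continue
        others = [v for w, v in by_worker.items() if w != worker]
        if not others:
            continue
        aggregate = [sum(col) for col in zip(*others)]
        scores.append(cosine(by_worker[worker], aggregate))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical example: three workers annotate two units with binary
# vectors over three candidate event labels.
annotations = {
    "unit1": {"w1": [1, 0, 0], "w2": [1, 0, 0], "w3": [0, 0, 1]},
    "unit2": {"w1": [0, 1, 0], "w2": [0, 1, 0], "w3": [1, 0, 0]},
}

# w3 consistently disagrees with the crowd and therefore scores lower
# than w1 and w2.
```

Under this (simplified) view, consistently low agreement flags a low-quality worker, while partial disagreement spread across many workers signals genuine ambiguity in the unit itself.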
The research novelty is two-fold: (1) a context-sensitive approach to studying and understanding events, i.e., we do not study events in isolation, but also their properties (participating actors, locations, relevant information, salient information); (2) a diversity-driven methodology for gathering event and event-related ground truth, generalizable across domains and data modalities. We conduct this research in the context of the CrowdTruth methodology and metrics.
Award date: 9 Mar 2022
Publication status: Published - 9 Mar 2022
- events, event properties, diversity, human computation, crowdsourcing, human-machine approach, human interpretation, disagreement, CrowdTruth