A widespread use of linked data for information extraction is distant supervision, in which relation tuples from a data source are found in sentences in a text corpus, and those sentences are treated as training data for relation extraction systems. Distant supervision is a cheap way to acquire training data, but that data can be quite noisy, which limits the performance of a system trained with it. Human annotators can be used to clean the data, but in some domains, such as medical NLP, it is widely believed that only medical experts can do this reliably. We have been investigating the use of crowdsourcing as an affordable alternative to using experts to clean noisy data, and have found that with the proper analysis, crowds can rival and even outperform the precision and recall of experts, at a much lower cost. We have further found that the crowd, by virtue of its diversity, can help us find evidence of ambiguous sentences that are difficult to classify, and we have hypothesized that such sentences are likely just as difficult for machines to classify. In this paper we outline CrowdTruth, a previously presented method for scoring ambiguous sentences that suggests that existing modes of truth are inadequate, and we present for the first time a set of weighted metrics for evaluating the performance of experts, the crowd, and a trained classifier in light of ambiguity. We show that our theory of truth and our metrics are a more powerful way to evaluate NLP performance than traditional unweighted metrics like precision and recall, because they allow us to account for the rather obvious fact that some sentences express the target relations more clearly than others.
|Number of pages||13|
|Journal||CEUR Workshop Proceedings|
|Publication status||Published - 2015|