Abstract
Urban legends are the original offline “viral” stories: stories that spread widely and spontaneously from person to person. These legends have a weak factual basis while retaining an air of plausibility (Mullen, 1972). Their topics reflect specific anxieties about modern life, from “Stranger Danger” (Fine, 1985) to food safety (Fine, 1980; Best & Horiuchi, 1985). The Meertens Institute holds an extensive collection of urban legends from Dutch culture in its “Volksverhalenbank” database.
The database uses the Brunvand type index as metadata for the urban legends (Brunvand, 2002), in order to categorize the individual story versions into types. For instance, “BRUN 03000” is “The Babysitter and the Man Upstairs” (Brunvand, 2002; Nguyen et al., 2013). The Brunvand categorization has 10 main urban legend types (“HORROR”, “ANIMAL”, etc.), each consisting of 2-15 subtypes (e.g. “BABYSITTER”), which in turn each consist of 2-20 BRUN types (such as “The Babysitter and the Man Upstairs”). This last layer contains 175 BRUN labels in total. Currently, employees (mostly interns) manually assess which BRUN type fits a story. The aim of the current research is to automatically predict the BRUN type numbers of new, out-of-database urban legends. We cast this as a classification problem and build on earlier work with the same database (Meder et al., 2016; Nguyen et al., 2013).
During the research project, we found that some types were strongly associated with a single textual genre in the database (newspaper article, interview, email, etc.). Because of this genre confound, texts needed to be normalized for genre; otherwise a model could end up predicting “text genre” rather than “urban legend type”.
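One way to make such a genre confound visible is to compute, for each legend type, the share of its texts that come from its most common genre. The following is a minimal sketch of that check, not the procedure used in the project; the function name `dominant_genre` and the toy records (`type_a`, `type_b`) are hypothetical placeholders.

```python
from collections import Counter, defaultdict


def dominant_genre(records):
    """For each legend type, return its most common genre and that genre's share.

    `records` is an iterable of (legend_type, genre) pairs.
    """
    by_type = defaultdict(Counter)
    for legend_type, genre in records:
        by_type[legend_type][genre] += 1

    result = {}
    for legend_type, counts in by_type.items():
        genre, n = counts.most_common(1)[0]
        result[legend_type] = (genre, n / sum(counts.values()))
    return result


# Toy records; the type names are placeholders, not real BRUN types.
records = [
    ("type_a", "newspaper article"),
    ("type_a", "newspaper article"),
    ("type_a", "email"),
    ("type_b", "interview"),
    ("type_b", "interview"),
]
shares = dominant_genre(records)
```

A share close to 1.0 for a type means that a classifier could separate that type from the others by genre cues alone, which is the situation the normalization is meant to prevent.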
Our current solution is based on a hierarchical classification model. We exploit the hierarchical structure of the urban legend type index by training three separate layers of models in Python. All models are, for now, Support Vector Machines (SVMs) that use character n-grams (n = 1-5), word n-grams (n = 1-5), and word lemmas as features. The first layer is a single model that classifies the text into one of the 10 main types: “ACCIDENT”, “HORROR”, etc. The second layer has 10 models, one for each of the main types. Each estimates the probability that the text belongs to each of the 2-20 subcategories in that type.
The final layer has a separate SVM model for each of these 2-20 subcategories. Each model predicts which BRUN type number in that subcategory has the highest probability given the text, assuming that the text belongs to that subtype.
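The three-layer setup can be sketched as a top-down classifier that trains one model per internal node of the label tree. This is a minimal illustration under several assumptions, not the project's implementation: it assumes scikit-learn, uses `LinearSVC` with a hard decision at each level instead of probabilities, omits the lemma features (which would require a Dutch lemmatizer), and runs on invented English toy texts. Apart from “HORROR”, “ANIMAL”, “BABYSITTER” and “BRUN 03000”, all labels are placeholders.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC


def make_model():
    """One linear SVM over character and word n-grams (lemmas omitted here)."""
    features = FeatureUnion([
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(1, 5))),
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 5))),
    ])
    return make_pipeline(features, LinearSVC())


class HierarchicalClassifier:
    """Top-down classifier with one model per internal node of the label tree."""

    def __init__(self, depth=3):
        self.depth = depth
        # Maps a label prefix (tuple) to a fitted model, or to a constant
        # label (str) when that node has only one child in the training data.
        self.models = {}

    def fit(self, texts, paths):
        # Collect, for every prefix of a label path, the texts below that
        # node together with the label they take at the next level.
        groups = defaultdict(lambda: ([], []))
        for text, path in zip(texts, paths):
            for level in range(self.depth):
                xs, ys = groups[tuple(path[:level])]
                xs.append(text)
                ys.append(path[level])
        for prefix, (xs, ys) in groups.items():
            if len(set(ys)) == 1:
                self.models[prefix] = ys[0]
            else:
                self.models[prefix] = make_model().fit(xs, ys)
        return self

    def predict_one(self, text):
        # Walk down the tree, letting each level's model pick the next node.
        prefix = ()
        for _ in range(self.depth):
            node = self.models[prefix]
            label = node if isinstance(node, str) else str(node.predict([text])[0])
            prefix += (label,)
        return prefix


# Toy data: (main type, subtype, BRUN type). Mostly placeholder labels.
texts = [
    "the babysitter got a phone call from the man upstairs",
    "a babysitter heard that the caller was inside the house",
    "the babysitter misunderstood the instructions of the parents",
    "the poodle was dried in the microwave",
    "a wet dog was put in the microwave oven",
    "a rat was found in the fried chicken",
]
paths = [
    ("HORROR", "BABYSITTER", "BRUN 03000"),
    ("HORROR", "BABYSITTER", "BRUN 03000"),
    ("HORROR", "BABYSITTER", "placeholder-type-1"),
    ("ANIMAL", "placeholder-subtype-1", "placeholder-type-2"),
    ("ANIMAL", "placeholder-subtype-1", "placeholder-type-2"),
    ("ANIMAL", "placeholder-subtype-2", "placeholder-type-3"),
]

clf = HierarchicalClassifier(depth=3).fit(texts, paths)
predicted = clf.predict_one("the man upstairs called the babysitter")
```

Because every lower-level model is trained only on texts below its own node, `predict_one` always returns a path that exists in the label tree. The trade-off is that a mistake in the first layer cannot be corrected further down; combining per-level probabilities, as the description above suggests, is one way to soften this.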
Early results (training on 1055 legends with a random 20% development set) indicate that this approach improves on earlier, non-hierarchical solutions to this BRUN classification problem (Nguyen et al., 2013), as well as on our own non-hierarchical model.
Interestingly, the evaluation shows that not all classes are created equal. Some classes have more than 50 examples in the training set, while others have only a handful. However, not all frequent classes are recognized well (e.g. “Masturbating into Food”, 37 examples, in-class F1 = .33), while some less frequent classes perform surprisingly well (e.g. “Poodle in Microwave”, 16 examples, in-class F1 = .86). This suggests that recognizability depends less on the number of examples than on how similar the versions of a story type are to one another and how distinct they are from other types.
We are currently refining our model and testing it on out-of-database urban legends. The result will help future employees and others working with the “Volksverhalenbank” urban legends to add new legends to the system more swiftly and accurately.
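The in-class F1 scores mentioned here can be reported next to the number of examples per class. Below is a minimal sketch assuming scikit-learn's `f1_score`; the gold and predicted labels are invented placeholders, not results from the project.

```python
from collections import Counter

from sklearn.metrics import f1_score

# Placeholder gold labels and predictions for a development set.
gold = ["type_a", "type_a", "type_a", "type_b", "type_b", "type_c"]
pred = ["type_a", "type_a", "type_b", "type_b", "type_b", "type_c"]

labels = sorted(set(gold))
support = Counter(gold)

# average=None returns one F1 score per label, in the order given by `labels`.
scores = f1_score(gold, pred, labels=labels, average=None)

per_class = {label: (support[label], float(score))
             for label, score in zip(labels, scores)}
for label, (n, f1) in per_class.items():
    print(f"{label}: {n} examples, in-class F1 = {f1:.2f}")
```

Listing the support beside each score makes it easy to spot classes that are frequent but poorly recognized, or rare but well recognized.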
References
Best, J. and Horiuchi, G. 1985. The Razor Blade in the Apple: The Social Construction of Urban Legends. Social Problems 32(5), 488-499.
Brunvand, J.H. 2002. Encyclopedia of Urban Legends. W.W. Norton & Company.
Fine, G. 1980. The Kentucky fried rat. Journal of the Folklore Institute 17, 222-43.
Fine, G. 1985. The Goliath effect. Journal of American Folklore 98, 63-84.
Meder, T., Karsdorp, F., Nguyen, D., Theune, M., Trieschnigg, D. and Muiser, I. 2016. Automatic Enrichment and Classification of Folktales. Journal of American Folklore 129, 76–94.
Mullen, P. 1972. Modern Legends and Rumor Theory. Journal of the Folklore Institute, 92(3), 95-10
Nguyen, D., Trieschnigg, D. and Theune, M. 2013. Folktale classification using learning to rank. In 35th European Conference on IR Research, ECIR 2013, 195-206.
Original language | English |
---|---|
Publication status | Published - 2019 |
Event | DHBenelux 2019 - University of Liege, Liege, Belgium Duration: 11 Sept 2019 → 13 Sept 2019 |
Conference
Conference | DHBenelux 2019 |
---|---|
Country/Territory | Belgium |
City | Liege |
Period | 11/09/19 → 13/09/19 |