Applied Text Mining 1: Methods

Course

URL study guide

https://studiegids.vu.nl/en/courses/2024-2025/L_PAMATLW004

Course Objective

In this short and intensive course, students will gain hands-on experience conducting a machine learning experiment to solve an NLP task in a group work setting. In the process, students will also gain experience with all the following:
• Defining the steps needed to solve an NLP task.
• Designing machine learning experiments to solve a task, taking informed decisions based on existing literature.
• Working with existing datasets that can be used to train and evaluate a machine learning system.
• Using existing machine learning packages to train a supervised classifier using, for example, Masked Language Models.
• Setting up and performing classification experiments.
• Evaluating and interpreting the results of a classifier.
• Performing error analysis and analyzing the weaknesses and strengths of a system.
• Reporting on NLP experiments.

Course Content

In this course, students work closely as a group to conduct an NLP experiment. This is a full-time course: students are expected to commit ~40h/week and to work progressively towards the completion of the experiment. Each week covers a different phase of the experiment, with deliverables. Please note that the course is designed to be practical: while some content will be reviewed, students are expected to already have the necessary background to conduct the NLP experiment independently (as a group).
• Week 1: Students work on understanding the task definition and formulating an experiment design. They also work with existing datasets, and come to understand how the data was annotated and why it needs to be preprocessed and formatted before it can be used for the experiment.
• Weeks 2 and 3: Students build supervised classifiers for the experiment at hand, following standard experimental protocols. They also learn how to describe the following elements: the data, the data partition, the baseline system, the classifiers used and the motivation for choosing them, features, evaluation, and error analysis. These experimental protocols are relevant for both traditional and transformer-based classifiers, and both can be part of the course; however, not all runs of the course will include the implementation of traditional machine learning algorithms. Students learn to report the results of their classification experiments using standard quantitative metrics (e.g., Precision, Recall, F1 measure).
• Week 4: Students conduct an error analysis for the task at hand. This includes understanding the linguistic, and potentially other, dimensions of the data that may be sources of error and/or important for a more in-depth evaluation of the results.
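As an illustration of the quantitative metrics named above (Precision, Recall, F1 measure) and of comparing a system against a baseline, the following is a minimal sketch in plain Python. The function name, the toy labels, and the choice of a majority-class baseline are assumptions made for this example only; they are not part of the course materials, and the actual task, data, and tooling differ per run of the course.

```python
from collections import Counter


def precision_recall_f1(gold, predicted, positive_label):
    """Return (precision, recall, F1) for one label, given parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p == positive_label and g == positive_label:
            tp += 1  # correctly predicted positive
        elif p == positive_label:
            fp += 1  # predicted positive, gold says otherwise
        elif g == positive_label:
            fn += 1  # missed a gold positive
    # Fall back to 0.0 when a denominator is zero (no predictions or no gold items).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy data: gold annotations and the output of some classifier.
gold = ["POS", "NEG", "POS", "POS", "NEG", "NEG", "NEG", "NEG"]
system = ["POS", "POS", "NEG", "POS", "NEG", "NEG", "NEG", "NEG"]

# Assumed baseline: always predict the most frequent gold label.
majority_label = Counter(gold).most_common(1)[0][0]
baseline = [majority_label] * len(gold)

system_scores = precision_recall_f1(gold, system, "POS")
baseline_scores = precision_recall_f1(gold, baseline, "POS")
print("system  :", system_scores)
print("baseline:", baseline_scores)
```

On this toy data the majority baseline never predicts the POS label, so its scores for that label are all zero. Reporting scores per label in this way, rather than accuracy alone, makes that kind of weakness visible.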

Teaching Methods

This course is mostly practical. The lectures will review the basic theory related to each methodological step, which the students will apply to solve the task. A few sessions will be devoted to theoretical aspects, while other sessions will be interactive and devoted to discussing progress, doubts, and problems encountered. Students will work in groups and are expected to commit 40 hours per week (including both lectures and group work). All students should be present during the two weekly lectures/practical sessions. In addition, mandatory group monitoring sessions will be scheduled between each group and the lecturer. These monitoring sessions can take place at any time during the week and are expected to happen at least once per week. Contact hours: 2x2 hours per week + 1 group monitoring session per week.

Method of Assessment

The course includes a group assignment component (with weekly deliverables) and an individual (potentially oral) exam related to the weekly assignments. Each graded component must receive a grade of at least 5.5.

Literature

The literature of the course is tied to the NLP task chosen for each run of the course. It will mostly comprise academic papers describing a specific NLP task, and academic papers proposing systems to address that same task.

Target Audience

The target audience of this course is mainly students in the Text Mining track of the MA in Linguistics. Other students are welcome insofar as they meet the entry requirements.

Additional Information

This is a concise course requiring 40 hours of weekly commitment. Assignments are mostly group assignments and students need to contribute to these assignments equally. Students who are not able to commit the required time should not register for this course.

Entry Requirements

Students need solid Machine Learning and NLP knowledge in order to successfully complete this course. As mentioned above, students are expected to have the necessary background to independently conduct an NLP experiment. To enroll in this course students must have participated in:
• Introduction to Human Language Technology (L_AAMPALG016) OR Natural Language Processing Technology (L_AAMAALG005) AND
• Language as Data (L_PAMATLW001) OR Subjectivity Mining (L_AAMPLIN018) AND
• Machine Learning for NLP (L_AAMPLIN024) OR Machine Learning for the Quantified Self (XM_40012) OR Advanced Machine Learning (XM_0010)

Recommended background knowledge

See Entry requirements.

Explanation Canvas

Course materials will be made available on Canvas.
Academic year: 1/09/24 to 31/08/25
Course level: 6.00 EC

Language of Tuition

  • English

Study type

  • Master