- This event has passed.
ACLC Seminar | Lauren Fonteyn: Methods of semi-automatic data annotation with contextualized word embeddings
June 9 @ 4:15 pm - 5:30 pm
Organized by the Amsterdam Center for Language and Communication (ACLC)
- Speaker: Lauren Fonteyn (Meertens Instituut)
- Location: P.C. Hoofthuis
- Room: 1.15
- Title: Methods of semi-automatic data annotation with contextualized word embeddings
- Abstract: In corpus linguistics, the collection and annotation of data commonly involve a relatively balanced combination of computer-aided and manual labour. It is still common practice, for instance, to first retrieve data representing a particular linguistic phenomenon from an electronic corpus (e.g. by means of a concordancer tool or query script) and subsequently manually categorize the collected examples into different groups (e.g. animate/inanimate; literal/figurative; agent/patient/instrument/…). However, as the range of research questions that linguists aim to address by means of corpus data has expanded in complexity, there is a growing need for larger data samples, which is difficult to meet when we continue to approach data annotation manually. As such, it has become an important practical challenge in corpus linguistics to determine how data annotation practices can evolve along with the needs of researchers. In this talk, I suggest one way of approaching corpus data annotation (semi-)automatically by relying on Large Language Models (LLMs). More specifically, I will present a number of example case studies to highlight how an LLM like BERT can be employed to annotate corpus data. The presentation focusses on the BERT-based models MacBERTh (Manjavacas & Fonteyn 2022a) and GysBERT (Manjavacas & Fonteyn 2022b), which – unlike the vast majority of available models – have been pre-trained to process historical English and Dutch respectively (date range: 1500-1950). What makes the approach I will discuss appealing is that it is fully customizable to the researcher’s needs. Of course, some corpora have been enriched with part-of-speech tags, or, more exceptionally, syntactic parsing and semantic tagging. Yet, not only are high-quality parsed (historical) corpora quite rare and limited in size, the extent to which pre-set tags map onto the categories a researcher is interested in may also vary.
The procedure presented in my case studies, then, offers a means of automatically classifying morphosyntactic structures in large, unparsed (and/or untagged) corpora following a custom annotation scheme. As such, the procedure can help to scale up the data set for (historical) corpus studies where a small portion of the data has been manually annotated, or to replicate a data annotation scheme adopted in prior work and apply it to new data.
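The abstract does not specify which classifier is used, so the following is only a minimal sketch of the general idea of scaling up a small manually annotated sample. It assumes that each corpus example has already been turned into a contextualized embedding vector (in practice by a model such as MacBERTh or GysBERT), and it uses a nearest-centroid rule as a stand-in for whatever classifier a study actually adopts. The three-dimensional vectors, the labels, and the function names `cosine`, `centroids` and `annotate` are all invented for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroids(labelled):
    """Average the vectors of each manually assigned label."""
    grouped = defaultdict(list)
    for vector, label in labelled:
        grouped[label].append(vector)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def annotate(vector, label_centroids):
    """Assign the label whose centroid is most similar to the vector."""
    return max(
        label_centroids,
        key=lambda label: cosine(vector, label_centroids[label]),
    )


# Toy stand-ins for embeddings of manually annotated examples
# (assumption: real vectors would come from a BERT-style model).
seed = [
    ([0.9, 0.1, 0.0], "literal"),
    ([0.8, 0.2, 0.1], "literal"),
    ([0.1, 0.9, 0.2], "figurative"),
    ([0.0, 0.8, 0.3], "figurative"),
]

# Toy stand-ins for embeddings of not-yet-annotated examples.
unlabelled = [[0.85, 0.15, 0.05], [0.05, 0.7, 0.4]]

label_centroids = centroids(seed)
predicted = [annotate(vector, label_centroids) for vector in unlabelled]
print(predicted)  # ['literal', 'figurative']
```

Because the labels are taken directly from the manually annotated seed set, the same sketch works for any custom scheme (animate/inanimate, agent/patient/instrument, and so on) without changing the code.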