BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//SCIO Studievereniging - ECPv6.0.13.1//NONSGML v1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-WR-CALNAME:SCIO Studievereniging
X-ORIGINAL-URL:https://sciostudievereniging.nl
X-WR-CALDESC:Events for SCIO Studievereniging
REFRESH-INTERVAL;VALUE=DURATION:PT1H
X-Robots-Tag:noindex
X-PUBLISHED-TTL:PT1H
BEGIN:VTIMEZONE
TZID:Europe/Amsterdam
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:20230326T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:20231029T030000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=Europe/Amsterdam:20230609T161500
DTEND;TZID=Europe/Amsterdam:20230609T173000
DTSTAMP:20260418T172746
CREATED:20230606T073232Z
LAST-MODIFIED:20230606T073232Z
UID:1894-1686327300-1686331800@sciostudievereniging.nl
SUMMARY:ACLC Seminar | Lauren Fonteyn: Methods of semi-automatic data annotation with contextualized word embeddings
DESCRIPTION:Organized by the Amsterdam Center for Language and Communication (ACLC)!\n\nSpeaker: Lauren Fonteyn (Meertens Instituut)\nLocation: P.C. Hoofthuis\nRoom: 1.15\nTitle: Methods of semi-automatic data annotation with contextualized word embeddings\nAbstract: In corpus linguistics\, the collection and annotation of data commonly involve a relatively balanced combination of computer-aided and manual labour. It is still common practice\, for instance\, to first retrieve data representing a particular linguistic phenomenon from an electronic corpus (e.g. by means of a concordancer tool or query script) and subsequently manually categorize the collected examples into different groups (e.g. animate/inanimate\; literal/figurative\; agent/patient/instrument/…). However\, as the range of research questions that linguists aim to address by means of corpus data has expanded in complexity\, there is a growing need for larger data samples\, which is difficult to meet when we continue to approach data annotation manually. As such\, it has become an important practical challenge in corpus linguistics to determine how data annotation practices can evolve along with the needs of researchers. In this talk\, I suggest one way of approaching corpus data annotation (semi-)automatically by relying on Large Language Models (LLMs). More specifically\, I will present a number of example case studies to highlight how an LLM like BERT can be employed to annotate corpus data. The presentation focusses on the BERT-based models MacBERTh (Manjavacas & Fonteyn 2022a) and GysBERT (Manjavacas & Fonteyn 2022b)\, which – unlike the vast majority of available models – have been pre-trained to process historical English and Dutch respectively (date range: 1500-1950). What makes the approach I will discuss appealing is that it is fully customizable to the researcher’s needs.\n\nOf course\, some corpora have been enriched with part-of-speech tags\, or\, more exceptionally\, syntactic parsing and semantic tagging. Yet\, not only are high-quality parsed (historical) corpora quite rare and limited in size\, the extent to which pre-set tags map onto the categories a researcher is interested in may also vary. The procedure presented in my case studies\, then\, offers a means of automatically classifying morphosyntactic structures in large\, unparsed (and/or untagged) corpora following a custom annotation scheme. As such\, the procedure can help to scale up the data set for (historical) corpus studies where a small portion of the data has been manually annotated\, or to replicate a data annotation scheme adopted in prior work and apply it to new data.
URL:https://sciostudievereniging.nl/event/aclc-seminar-lauren-fonteyn-methods-of-semi-automatic-data-annotation-with-contextualized-word-embeddings/
LOCATION:P.C. Hoofthuis\, Spuistraat 134\, Amsterdam\, Netherlands
CATEGORIES:Activities Outside SCIO
END:VEVENT
END:VCALENDAR