Title: Recent methodological insights for word frequency data: keywords and lexical diversity
Speaker: Stefan Evert (Friedrich-Alexander-Universität Erlangen-Nürnberg)
Word frequency data play a central role in applied corpus linguistics, especially in the form of keywords, collocations and lexical diversity. Keywords are characterised by their unusually high frequency in a given text or subcorpus, when compared against a reference corpus. They capture the aboutness of a text, highlight domain- or genre-specific vocabulary, and have been used for systematic corpus comparison. Collocations are unusually frequent co-occurrences of words, often in a direct syntactic relation such as verb-object or adjective-noun. They are a key concept in studies of phraseology and formulaic language, form the basis for distributional accounts of word meaning, and enable advanced second-language learners to become truly fluent. In the form of word sketches, they are omnipresent in modern computational lexicography. Measures of lexical diversity quantify the type-richness of word frequency distributions. They have been used to assess the size of an author's vocabulary, the stylometric complexity of literary texts, and the productivity of morphological and syntactic patterns.
For the identification of collocations, a plethora of quantitative techniques and statistical measures have been suggested, discussed, and evaluated thoroughly in empirical studies. However, appropriate methodological approaches to keywords and lexical diversity are far less well established, not widely known among corpus linguists, and often have little empirical support. In this talk, I will present recent methodological research on keywords and lexical diversity, including an overview and assessment of state-of-the-art approaches as well as preliminary results from ongoing empirical studies.
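
To make the two notions concrete, here is a minimal Python sketch of how keywords and lexical diversity are often operationalised. It is an illustration only and not necessarily one of the approaches assessed in the talk: it assumes the log-likelihood ratio (G2) on a 2x2 contingency table as the keyness measure and the plain type-token ratio as the diversity measure, and it assumes texts are already tokenised. The function names (`log_likelihood`, `keywords`, `type_token_ratio`) are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(f_target, n_target, f_ref, n_ref):
    """G2 statistic for one word (assumed keyness measure, not the talk's).

    f_target, f_ref: frequency of the word in the target / reference corpus.
    n_target, n_ref: total token counts of the two corpora (both > 0).
    """
    n_total = n_target + n_ref
    row_word = f_target + f_ref          # all occurrences of the word
    row_other = n_total - row_word       # all other tokens
    observed = [
        (f_target, row_word, n_target),
        (f_ref, row_word, n_ref),
        (n_target - f_target, row_other, n_target),
        (n_ref - f_ref, row_other, n_ref),
    ]
    g2 = 0.0
    for obs, row_sum, col_sum in observed:
        if obs > 0:  # a cell with zero observations contributes nothing
            expected = row_sum * col_sum / n_total
            g2 += obs * math.log(obs / expected)
    return 2.0 * g2


def keywords(target_tokens, ref_tokens):
    """Words over-represented in the target corpus, sorted by descending G2."""
    target, ref = Counter(target_tokens), Counter(ref_tokens)
    n_target, n_ref = len(target_tokens), len(ref_tokens)
    scored = []
    for word, f_target in target.items():
        f_ref = ref[word]  # Counter returns 0 for unseen words
        # keep only words whose relative frequency is higher in the target
        if f_target / n_target > f_ref / n_ref:
            scored.append((word, log_likelihood(f_target, n_target, f_ref, n_ref)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)
```

A call such as `keywords(subcorpus_tokens, reference_tokens)` returns candidate keywords ranked by G2, and `type_token_ratio(tokens)` returns a value between 0 and 1. The type-token ratio is known to decrease as texts get longer, so values from samples of different sizes are not directly comparable.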