Course Syllabus

Welcome to ENC2036. This is a graduate-level course devoted to Corpus Linguistics, one of the most active subfields of applied linguistics. If you have decided to embark on a digital journey toward your future career, the Department of English at NTNU, Taiwan, offers a series of courses that provide the interdisciplinary skills and knowledge needed for quantitative and computational analysis of language.

This course requires basic coding skills as a prerequisite. Students are expected to have taken ENC2055 or an equivalent introductory programming course before taking this course. Please see the FAQ on the course website for more information about the prerequisite.

Course Objective

This course aims to introduce the theories and practices of Corpus Linguistics as a scientific discipline in its own right. Corpus Linguistics is now considered an interdisciplinary subject, requiring knowledge of linguistic theory, quantitative statistics, and data processing. This course therefore aims to provide the necessary foundation and computational skills for students who are interested in conducting corpus-based linguistic or language-related research. Students are expected to learn:

  • the methodological foundations of Corpus Linguistics
  • the theoretical bases of Corpus Linguistics
  • the technical designs and configuration of standard corpora
  • how to adopt corpus linguistics as a scientific method in terms of:
    • corpus creation
    • operationalization
    • data retrieval
    • quantifying research questions
    • significance testing
  • the common applications of corpus-linguistic methodology:
    • concordances
    • frequency lists
    • collocations
    • keywords
    • lexical bundles
    • word clouds
    • vector-space representation of words and texts

This course is extremely hands-on and will guide students through classic examples of these corpus-based applications via in-class tutorial sessions and take-home assignments. The main objective is to provide students with enough computational skills to perform similar corpus-based analyses on their own data or research questions. It will also provide hands-on tutorials to equip students with the necessary text-processing and statistical skills. This course will be a prerequisite for more advanced courses such as Computational Linguistics and Quantitative Corpus Linguistics.

This course introduces corpus-based processing techniques. Because of this focus, the corpus data used in exercises and exam questions does not always come from English. Some tasks use Chinese corpus data, which requires basic knowledge of Chinese. If Chinese is not your native language, please consider this point carefully before enrolling.

Reading Materials

The course lectures will follow the materials provided on the course website. Please preview the lecture notes before the class.

Also, for each topic, please review the assigned readings on your own. These readings will be part of the midterm and final exams as well.

In particular, we will use two reference books as our main reading materials: Stefanowitsch (2019) and Gries (2016).

In addition, there are a few more reference books listed at the end of the section (See References), which we will refer to for specific topics, including Gries (2021), Baayen (2008), Brezina (2018), McEnery & Hardie (2011), Wynne (2006), Winter (2020), Hunston (2022), Bird et al. (2009).

The assigned readings for each chapter are listed below.

Chapter  Assigned Readings
1        Gries (2021), ch. 1; McEnery & Hardie (2011), ch. 1; Stefanowitsch (2019), ch. 1-2
2        Check ENC2055; Wickham & Grolemund (2017)
3        Wynne (2006), ch. 1-2; Munzert et al. (2014), ch. 2 and 9
4        Hunston (2022), ch. 2-3; Stefanowitsch (2019), ch. 4; Gries (2016), ch. 3; Evert (2007)
5        Bird et al. (2009), ch. 3 and 5; Gries (2016), ch. 3; Stefanowitsch (2019), ch. 7-8
6        Stefanowitsch (2019), ch. 10
7        Lecture Notes Only
8        Stefanowitsch & Gries (2003); Stefanowitsch & Gries (2005); Gries & Stefanowitsch (2004)
9        Lecture Notes Only
10       Wynne (2006), ch. 3, 4, and 6
11       Gries (2016), ch. 3; Munzert et al. (2014), ch. 3-4
12       Jurafsky & Martin (2022), ch. 6

Grading Policy

Exams

The midterm and final exams will be timed, in-class, open-book examinations. Questions will include both coding exercises and essay questions on the course materials. All exams are individual work, and no communication with other people is allowed. Please see Academic Integrity and AI Policy below.

Course Website

We have a course website. You may need a password to access some of the course materials (i.e., the data sets). If you are an officially enrolled student, please ask the instructor for the passcode.

Please read the FAQ of the course website before course registration.

Contributing to the Lecture Notes

Although I have done my best to make sure that the contents are correct, I may still make mistakes in the materials. If you spot any errors or would like to suggest better solutions, I would really appreciate it.

To contribute your ideas, you can annotate the lecture notes using Hypothes.is, which is an amazing tool for website annotations.

  • To add an annotation, highlight the target text in the lecture notes and then click the icon in the pop-up menu to leave your comments or corrections.
  • To review annotations from others, click the icon located in the upper right-hand corner of the page.

Course Demo Data

Dropbox Demo Data Directory (Password protected)

AI Policy

Students may use AI tools to help them complete the exercises in each unit. However, midterm and final exams require independent work, so AI tools are not allowed.

Academic Integrity

All coding assignments must be individual work. Discussion or collaboration with others is allowed, but directly copying others’ code is absolutely forbidden. Any incidents of academic misconduct such as cheating, plagiarism, copying others’ work, or other inappropriate assistance on projects or examinations will be treated with zero tolerance and will result in a grade of “F” for the course.

In particular, the midterm and final exams are taken seriously as individual work with absolutely no outside help or assistance. Breaches of academic integrity may also result in other action being taken by the University.

House-keeping Guidelines

  1. Please direct all your course-related questions to the Discussion Forum on Moodle. Do not send your coding questions to the TA via email. Posting all questions on Moodle ensures that students with similar questions get proper assistance as well.
  2. If you need to consult the TA (Howard Su) for technical help, please make an appointment with the TA first. A recommended time slot is the hour after each week’s meeting.
  3. Please submit your coding assignments on time. Late submissions are subject to a point deduction. Assignments are typically due one week after we have completed the corresponding chapter.
  4. All assignments must be submitted via Moodle.
  5. Please make sure that your access to Moodle is active. It is the student’s responsibility to stay informed of the most recent announcements.

Necessary Packages

In this course, we will need the following R packages for tutorials and exercises.

library(dplyr)
library(ggplot2)
library(ggpubr)
library(ggrepel)
library(gutenbergr)
library(htmlwidgets)
library(jiebaR)
library(kableExtra)
library(parallel)
library(purrr)
library(quanteda)
library(RColorBrewer)
library(readtext)
library(reticulate)
library(Rtsne)
library(rvest)
library(showtext)
library(spacyr)
library(stringr)
library(text2vec)
library(textdata)
library(tidyr)
library(tidytext)
library(tidyverse)
library(wordcloud)
library(wordcloud2)
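
Before the first tutorial, it may help to check which of these packages are not yet installed on your machine. The following is a minimal sketch, not part of the official course materials: the helper name `missing_packages` and the variable `course_pkgs` are made up for illustration, and the sketch assumes that the packages can be installed from CRAN.

```r
# Packages listed in this section
course_pkgs <- c(
  "dplyr", "ggplot2", "ggpubr", "ggrepel", "gutenbergr", "htmlwidgets",
  "jiebaR", "kableExtra", "parallel", "purrr", "quanteda", "RColorBrewer",
  "readtext", "reticulate", "Rtsne", "rvest", "showtext", "spacyr",
  "stringr", "text2vec", "textdata", "tidyr", "tidytext", "tidyverse",
  "wordcloud", "wordcloud2"
)

# Hypothetical helper: return the packages in `pkgs` that are not installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(course_pkgs)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment the next line to install them (requires an internet connection)
  # install.packages(to_install)
} else {
  message("All listed packages are installed.")
}
```

Note that `parallel` ships with base R, and that `reticulate` and `spacyr` additionally rely on a working Python installation, so installing the R packages alone may not be enough for those two.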

Environment Settings

The R Version to produce the lecture notes: R version 4.3.1 (2023-06-16).

The RStudio Version to produce the lecture notes: RStudio 2026.01.1+403
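
To compare your own installation with the R version listed above, a quick check along the following lines may help. This is only a sketch; the variable names `notes_r_version` and `r_is_current` are made up for illustration.

```r
# R version used to produce the lecture notes (as stated above)
notes_r_version <- "4.3.1"

# getRversion() returns the running R version as a comparable version object
my_r_version <- getRversion()
r_is_current <- my_r_version >= notes_r_version

message("Your R version: ", as.character(my_r_version))
if (!r_is_current) {
  message("This is older than R ", notes_r_version, "; consider updating.")
}
```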

If your R version is older than the one above, please consider updating R. Details about updating R can be found in:

References

Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge University Press.
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. O’Reilly Media, Inc. https://www.nltk.org/book/
Brezina, V. (2018). Statistics in corpus linguistics: A practical guide. Cambridge University Press.
Evert, S. (2007). Corpora and collocations. https://stephanie-evert.de/PUB/Evert2007HSK_extended_manuscript.pdf
Gries, S. T. (2016). Quantitative corpus linguistics with R: A practical introduction (2nd ed.). Routledge.
Gries, S. T. (2021). Statistics for linguistics with R: A practical introduction (3rd ed.). Walter de Gruyter.
Gries, S. T., & Stefanowitsch, A. (2004). Extending collostructional analysis: A corpus-based perspective on ‘alternations’. International Journal of Corpus Linguistics, 9(1), 97–129.
Hunston, S. (2022). Corpora in applied linguistics (2nd ed.). Cambridge University Press.
Jurafsky, D., & Martin, J. H. (2022). Speech and language processing (3rd ed. draft). https://web.stanford.edu/~jurafsky/slp3/
McEnery, T., & Hardie, A. (2011). Corpus linguistics: Method, theory and practice. Cambridge University Press.
Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2014). Automated data collection with R: A practical guide to web scraping and text mining. John Wiley & Sons.
Stefanowitsch, A. (2019). Corpus linguistics: A guide to the methodology. Language Science Press. http://langsci-press.org/catalog/book/148
Stefanowitsch, A., & Gries, S. T. (2003). Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics, 8(2), 209–243.
Stefanowitsch, A., & Gries, S. T. (2005). Covarying collexemes. Corpus Linguistics and Linguistic Theory, 1(1), 1–43.
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data (1st ed.). O’Reilly Media, Inc.
Winter, B. (2020). Statistics for linguists: An introduction using R. Routledge.
Wynne, M. (Ed.). (2006). Developing linguistic corpora: A guide to good practice. EADH: The European Association for Digital Humanities. https://users.ox.ac.uk/~martinw/dlc/index.htm