This site was last updated on 2022-01-05.


Important course information will be posted on this web page and announced in class. You are responsible for all material that appears here and should check this page for updates frequently.

Course Description

This course introduces the theories and practices of Corpus Linguistics as a scientific discipline in its own right. Corpus Linguistics is now considered an interdisciplinary subject, requiring knowledge of linguistic theory, quantitative statistics, and data processing. This course therefore provides not only the necessary theoretical foundation but also the practical computational skills for students interested in conducting corpus-based linguistic or language-related research. Students are expected to learn:

This course is extremely hands-on and will lead students through classic examples of these corpus-based applications via in-class, theme-based tutorial sessions. Unlike most corpus linguistics courses offered at other universities, we do not introduce ready-made applications or software packages (e.g., Wordsmith, AntConc, Sketch Engine, LancBox). Instead, we hope to show you the power of coding in computational text analytics. By the end of this course, students are expected to have achieved the necessary computational proficiency to perform comparable corpus-based analyses on their own data or research questions. In particular, the weekly tutorials will emphasize computational text analytics (e.g., web crawling, regular expressions, tokenization, segmentation, concordances, collocations, keyword analysis) and statistical processing (e.g., parametric statistical analyses) via hands-on implementation in the R language. This course will be a prerequisite for more advanced courses such as ENC2045 Computational Linguistics and ENC2056 Topics on Quantitative Corpus Linguistics.

Course Schedule

(The schedule is tentative and subject to change. Please pay attention to the announcements made during the class.)

Week Date Topic
Week 1 2022-02-18 Quantitative Corpus Linguistics: A Primer
Week 2 2022-02-25 R Fundamentals
Week 3 2022-03-04 Data Collection and Corpus Creation: HTML and Web Crawler
Week 4 2022-03-11 Concordances, Frequency Lists, N-grams, Collocations
Week 5 2022-03-18 Tokenization and Part-of-Speech Tagging
Week 6 2022-03-25 Data Retrieval and Coding: Regular Expressions
Week 7 2022-04-01 Keyword Analysis
Week 8 2022-04-08 Midterm
Week 9 2022-04-15 Chinese Text Analytics + Chinese Word Segmentation
Week 10 2022-04-22 Constructions and Idioms
Week 11 2022-04-29 Structured Corpora Processing
Week 12 2022-05-06 XML
Week 13 2022-05-13 Near-Synonyms and Distinctive Collexemes
Week 14 2022-05-20 Vector-space Representation
Week 15 2022-05-27 Final Exam
Week 16 2022-06-03 Holiday

Course Requirement

Course Materials

All course materials are available on the course website; please consult the instructor for the direct link. They will be provided as a series of online packets (i.e., handouts, script source code, etc.). These teaching materials are based on the following recommended readings.

  1. Stefanowitsch, A. (2019). Corpus Linguistics: A Guide to the Methodology. (In Press)
  2. Gries, Stefan T. (2017). Quantitative Corpus Linguistics with R: A Practical Introduction. Second Edition. Routledge.
  3. Winter, Bodo. (2019). Statistics for Linguists: An Introduction Using R. New York: Routledge.
  4. Desagulier, Guillaume. (2018). Corpus Linguistics and Statistics with R. Cham, Switzerland: Springer.
  5. Brezina, Vaclav. (2018). Statistics in Corpus Linguistics: A Practical Guide. Cambridge: Cambridge University Press.

Course Website

Email Address

You can reach me at

Office Hours

By appointment.


If you have any further questions related to the course, please consult the FAQ on our course website or write to me at any time at

Disclaimer & Agreement

While I have made every attempt to ensure that the information contained on the Website is correct, I am not responsible for any errors or omissions, or for the results obtained from the use of this information. All information on the Website is provided “as is”, with no guarantee of completeness, accuracy, timeliness or of the results obtained from the use of this information, and without warranty of any kind, express or implied.

You may print a copy of any part of this website for your personal or non-commercial use. Without the author’s prior written consent, you may not disclose confidential information of the website (e.g., log-in username and password) to any third party.