This site was last updated on 2022-01-05.
Important course information will be posted on this web page and announced in class. You are responsible for all material that appears here and should check this page for updates frequently.
This course introduces the theories and practices of Corpus Linguistics as a scientific discipline in its own right. Corpus Linguistics is now considered an interdisciplinary subject, requiring knowledge of linguistic theory, quantitative statistics, and data processing. Therefore, this course provides not only the necessary theoretical foundation but also practical computational skills for students who are interested in conducting corpus-based linguistic or language-related research. Students are expected to learn:
This course is extremely hands-on and will lead students through classic examples of these corpus-based applications via in-class, theme-based tutorial sessions. Unlike most corpus linguistics courses offered at other universities, we do not introduce you to ready-made applications or software packages (e.g., Wordsmith, AntConc, Sketch Engine, LancBox). Instead, we hope to show you the power of coding in computational text analytics. By the end of this course, students are expected to have achieved the necessary computational proficiency and to be able to perform comparable corpus-based analyses on their own data or research questions. In particular, the weekly tutorials will stress the issues of computational text analytics (e.g., web crawling, regular expressions, tokenization, segmentation, concordances, collocations, keyword analysis) and statistical processing (e.g., parametric statistical analyses) via hands-on implementation with the language . This course will be a prerequisite for more advanced courses such as ENC2045 Computational Linguistics and ENC2056 Topics on Quantitative Corpus Linguistics.
(The schedule is tentative and subject to change. Please pay attention to the announcements made during the class.)
Week | Date | Topic |
---|---|---|
Week 1 | 2022-02-18 | Quantitative Corpus Linguistics: A Primer |
Week 2 | 2022-02-25 | R Fundamentals |
Week 3 | 2022-03-04 | Data Collection and Corpus Creation: HTML and Web Crawler |
Week 4 | 2022-03-11 | Concordances, Frequency Lists, N-grams, Collocations |
Week 5 | 2022-03-18 | Tokenization and Parts-of-speech Tagging |
Week 6 | 2022-03-25 | Data Retrieval and Coding: Regular Expressions |
Week 7 | 2022-04-01 | Keywords Analysis |
Week 8 | 2022-04-08 | Midterm |
Week 9 | 2022-04-15 | Chinese Text Analytics + Chinese Word Segmentation |
Week 10 | 2022-04-22 | Constructions and Idioms |
Week 11 | 2022-04-29 | Structured Corpora Processing |
Week 12 | 2022-05-06 | XML |
Week 13 | 2022-05-13 | Near-Synonyms and Distinctive Collexemes |
Week 14 | 2022-05-20 | Vector-space Representation |
Week 15 | 2022-05-27 | Final Exam |
Week 16 | 2022-06-03 | Holiday |
All course materials are available on the course website. Please consult the instructor for the direct link to the course materials. They will be provided as a series of online packets (e.g., handouts and script source code) on the course website. These teaching materials are based on the following recommended readings.
You can reach me at alvinchen@ntnu.edu.tw.
By appointment.
If you have any further questions related to the course, please consult the FAQ on our course website or write to me at any time at alvinchen@ntnu.edu.tw.
While I have made every attempt to ensure that the information contained on the Website is correct, I am not responsible for any errors or omissions, or for the results obtained from the use of this information. All information on the Website is provided “as is”, with no guarantee of completeness, accuracy, timeliness or of the results obtained from the use of this information, and without warranty of any kind, express or implied.
You may print a copy of any part of this website for your personal or non-commercial use. Without the author’s prior written consent, you may not disclose confidential information of the website (e.g., log-in username and password) to any third party.