ENC2045 Computational Linguistics#

Last Updated: 2024-02-08

Attention

The course materials are being updated as we proceed with the course schedule in Spring 2024. Usually, the lecture notes for each topic will be finalized, with revisions, in its scheduled week in the syllabus.

Contributing to the Lecture Notes!

This website now supports Hypothes.is, an excellent tool for website annotations. If you would like to contribute to the lecture notes (e.g., error corrections), please leave your comments and suggestions as annotations. Corrections to the materials will be made on a regular basis, and your contributions are always greatly appreciated :)

Course Syllabus#

Introduction#

Computational Linguistics (CL) is now a very active sub-discipline in applied linguistics. Its main focus is computational text analytics, which is essentially about leveraging computational tools, techniques, and algorithms to process and understand natural language data (in spoken, textual, or even multimodal formats). This course therefore aims to introduce the fundamental concepts, useful strategies, and common workflows that data scientists have widely adopted to extract useful insights from natural language data. In this course, we will focus mainly on textual data.

Course Objectives#

A selective collection of potential topics may include:

A Pipeline for Natural Language Processing
  • Text Normalization

  • Text Tokenization

  • Parsing and Chunking

Chinese Processing
  • Issues for Chinese Language Processing (Word Segmentation)

  • Different word segmentation tools/packages

Machine Learning Basics
  • Feature Engineering and Text Vectorization

  • Traditional Machine Learning

  • Classification Models (Naive Bayes, SVM, Logistic Regression)

Common Computational Tasks
  • Sentiment Analysis

  • Text Clustering and Topic Modeling

  • Ensemble Learning

Deep Learning NLP
  • Neural Language Model

  • Sequence Models (RNN, LSTM, GRU)

  • Sequence-to-sequence Model

  • Attention-based Models

  • Explainable Artificial Intelligence and Computational Linguistics

Large Language Models
  • Applications of LLM (e.g., BERT, ChatGPT)

  • Transfer Learning

  • Retrieval-Augmented Generation

  • Multimodal Data Processing

This course is extremely hands-on and will guide students through classic examples of many task-oriented implementations via in-class, theme-based tutorial sessions. The main coding language used in this course is Python, and we will make extensive use of it. It is assumed that you know, or will quickly learn, how to code in Python. In fact, this course assumes that every enrolled student has considerable working knowledge of Python. (If you are not sure whether you fulfill the prerequisite, please contact the instructor first.)

To be more specific, you are assumed to already have working knowledge of all the concepts covered in the book Lean Python: Learn Just Enough Python to Build Useful Tools. If most of the material covered in the book is new to you, I would strongly advise you not to take this course.

On the other hand, please note that this course is designed specifically for linguistics majors in the humanities. For computer science majors, this course will not feature a thorough description of the mathematical operations behind the algorithms, nor will we focus on the most recent developments in deep-learning-based NLP. We focus more on the fundamentals and their practical implementations.

Course Website#

All the course materials will be available on the course website ENC2045. You may need a password to access the course materials/data. If you are an officially enrolled student, please ask the instructor for the passcode.

We will also have a Moodle course site for assignment submission.

Please read the FAQ of the course website before course registration.

Readings#

Major Readings#

The course materials are mainly based on the following readings.

ntlk
ntlk
text-analytics
sklearn
dl

Online Resources#

In addition to books, there are many wonderful on-line resources, esp. professional blogs, providing useful tutorials and intuitive understanding of many complex ideas in NLP and AI development. Among them, here is a list of my favorites:

YouTube Channels#

Python Setup#

Conda Environment#

In this class, I will be using Jupyter Notebook for code demonstrations and tutorials. It is therefore hoped that you can create the same Python environments as illustrated below.

Please note that many alternative IDE applications support Jupyter notebook files as well. Another recommended application for Python developers is Visual Studio Code.

In this class, we will be working with Jupyter notebook files (*.ipynb). You can run the notebook files either with the default Jupyter Notebook started from the terminal or with a third-party IDE that supports notebook files (e.g., Visual Studio Code).

Attention

The following installation commands are meant to be run in a terminal.

  1. We assume that you have created and set up your Python environment as follows:

    • Install Python with Anaconda.

    • Create a conda environment named python-notes with Python 3.9:

      conda create --name python-notes python=3.9
      
    • Activate the conda environment:

      conda activate python-notes
      
    • Install Jupyter notebook in the newly created conda environment:

      pip install notebook
      
    • Install the ipykernel package in the newly created conda environment:

      pip install ipykernel
      
    • Add this new conda environment to the IPython kernel list:

      python -m ipykernel install --user --name=python-notes
      
    • Run the notebooks provided in the course materials in this newly created conda environment:

      jupyter notebook
      
  2. We will use Jupyter Notebook for Python scripting in this course. All assignments have to be submitted as Jupyter notebooks. (See the Jupyter Notebook installation documentation.)

  3. We assume you have created a conda virtual environment named python-notes for all the scripting, and that you are able to run the notebooks with this conda environment's kernel in Jupyter.

  4. You can also run the notebooks directly in Google Colab, or alternatively download the .ipynb notebook files onto your hard drive or Google Drive for further changes.

Tip

If you cannot find the conda environment kernel in your Jupyter Notebook, please install the ipykernel module. For more details on how to use specific conda environments in Jupyter Notebook, please see this article.
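To confirm that a notebook is actually running in the intended environment, a quick check along the following lines may help. This is only a minimal sketch: the helper name `describe_environment` is hypothetical, and it assumes that the environment name appears in the interpreter path and that `CONDA_DEFAULT_ENV` is set, which is typical after `conda activate` but not guaranteed.

```python
import os
import sys


def describe_environment(expected_env="python-notes"):
    """Return basic facts about the interpreter that runs this code."""
    executable = sys.executable or ""
    return {
        "executable": executable,
        "version": "{}.{}".format(*sys.version_info[:2]),
        # Usually set by `conda activate`; None if it is not available.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Heuristic only: conda interpreters usually live under envs/<name>/.
        "path_mentions_env": expected_env in executable,
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `path_mentions_env` is False inside Jupyter, the notebook is probably using a different kernel than the one registered for python-notes.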

Module Requirements#

Not all modules used in this course come with the default installation of the Python environment. Please remember to install the missing package if you get a module-not-found error.

The following is a list of packages we will use (non-exhaustive):

  • beautifulsoup4

  • ckip-transformers

  • gensim

  • jieba

  • jupyter

  • jupyter-nbextensions-configurator

  • Keras

  • matplotlib

  • nltk

  • numpy

  • pandas

  • scikit-learn

  • seaborn

  • spacy

  • scipy

  • tensorflow

  • transformers

Please consult the official documentation of these packages for details on their installation.
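To see which of the packages listed above are still missing from the current environment, something like the following sketch could be used. It relies on the standard-library `importlib.metadata` module (Python 3.8+); the helper name `find_missing` is hypothetical, and the distribution names are assumed to match the names in the list above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the (non-exhaustive) list above.
COURSE_PACKAGES = [
    "beautifulsoup4", "ckip-transformers", "gensim", "jieba", "jupyter",
    "jupyter-nbextensions-configurator", "Keras", "matplotlib", "nltk",
    "numpy", "pandas", "scikit-learn", "seaborn", "spacy", "scipy",
    "tensorflow", "transformers",
]


def find_missing(packages):
    """Return the names in `packages` that have no installed distribution."""
    missing = []
    for name in packages:
        try:
            version(name)  # raises PackageNotFoundError if not installed
        except PackageNotFoundError:
            missing.append(name)
    return missing


missing = find_missing(COURSE_PACKAGES)
if missing:
    print("Not installed:", ", ".join(missing))
    print("Try: pip install " + " ".join(missing))
else:
    print("All listed packages are installed.")
```

Checking distribution names rather than import names avoids mismatches such as beautifulsoup4, which is imported as `bs4`.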

Coding Assignments Reminders#

  1. You have to submit your assignments via Moodle.

  2. Please name your files in the following format: Assignment-X-NAME.ipynb and Assignment-X-NAME.html.

  3. Please always submit both the Jupyter notebook file and its HTML version.
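The naming rule above can be sketched as a small helper. The function name `assignment_filenames` is hypothetical and only for illustration. The `jupyter nbconvert` command in the comment is one common way to produce the HTML version, not necessarily the workflow required in class.

```python
def assignment_filenames(number, name):
    """Build the notebook and HTML file names for one assignment."""
    stem = f"Assignment-{number}-{name}"
    return f"{stem}.ipynb", f"{stem}.html"


notebook_file, html_file = assignment_filenames(1, "NAME")
print(notebook_file)
print(html_file)

# One way to create the HTML version from a terminal:
#   jupyter nbconvert --to html Assignment-1-NAME.ipynb
```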

Attention

Unless otherwise specified in class, all assignments will be due on the date/time given on Moodle. The due date is usually a week after the topic is finished. Late work submitted within 7 calendar days of the original due date will be accepted at the instructor’s discretion. After that, no late work will be accepted.

Google Colab#

  • The course materials are being revised so that all the code can be executed directly on Google Colab. Ideally, students should not have to go through the painful experience of installing Python as their first step into NLP!

  • Right now, we can check the packages available on Google Colab with the following commands:

!pip list | wc
!pip list -v
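The same information can also be gathered from within Python, which works in Colab and in a local notebook alike. This is a minimal sketch using the standard-library `importlib.metadata` module; the helper name `installed_packages` is hypothetical.

```python
from importlib.metadata import distributions


def installed_packages():
    """Map each installed distribution name to its version string."""
    packages = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with missing metadata
            packages[name] = dist.version
    return packages


packages = installed_packages()
print(f"{len(packages)} packages found")

# Show the first few entries in alphabetical order.
for name in sorted(packages, key=str.lower)[:10]:
    print(name, packages[name])
```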

References#

1

Paul Gerrard. Lean Python: Learn Just Enough Python to Build Useful Tools. Apress, 2016.

2

Wes McKinney. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.

3

Aurélien Géron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. Third Edition. O'Reilly Media, 2023.

4

Dipanjan Sarkar. Text analytics with Python: a practitioner's guide to natural language processing. Apress, 2019.

5

Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit (See https://www.nltk.org/book/). O'Reilly Media, Inc., 2009.

6

Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana. Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems. O'Reilly Media, 2020.

7

Jacob Perkins. Python 3 text processing with NLTK 3 cookbook. Packt Publishing Ltd, 2014.

8

Bhargav Srinivasa-Desikan. Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras. Packt Publishing Ltd, 2018.

9

François Chollet. Deep learning with Python. Manning Publications Company, 2017.

10

Ruslan Mitkov. The Oxford handbook of computational linguistics. Second edition. Oxford University Press, 2022.