ENC2045 Computational Linguistics

Last Updated: 2021-07-20

Contributing to the Lecture Notes!

This website now supports Hypothes.is, an excellent tool for web annotation. If you would like to contribute to the lecture notes (e.g., error corrections), please leave your comments and suggestions as annotations. Corrections will be made on a regular basis, and your contributions will be greatly appreciated :)

Introduction

Computational Linguistics (CL) is now a very active sub-discipline in applied linguistics. Its main focus is on computational text analytics, that is, leveraging computational tools, techniques, and algorithms to process and understand natural language data (in spoken or textual formats). This course therefore introduces useful strategies and common workflows that data scientists have widely adopted to extract insights from natural language data. In this course, we will focus on textual data.

Course Objectives

A selective collection of potential topics may include:

A Pipeline for Natural Language Processing

  • Text Normalization

  • Text Tokenization

  • Parsing and Chunking

Chinese Processing

  • Issues for Chinese Language Processing (Word Segmentation)

Machine Learning Basics

  • Feature Engineering and Text Vectorization

  • Traditional Machine Learning

  • Classification Models (Naive Bayes, SVM, Logistic Regression)

Common Computational Tasks

  • Sentiment Analysis

  • Text Clustering and Topic Modeling

Deep Learning NLP

  • Neural Language Model

  • Sequence Models

  • RNN

  • LSTM/GRU

  • Sequence-to-sequence Model

  • Attention-based Models

  • Explainable Artificial Intelligence and Computational Linguistics

This course is extremely hands-on and will guide students through classic examples of many task-oriented implementations in theme-based, in-class tutorial sessions. The main coding language used in this course is Python, and we will make extensive use of it. The course assumes that every enrolled student has a working knowledge of Python. (If you are not sure whether you fulfill the prerequisite, please contact the instructor first.)

A test on Python basics will be conducted in the second week of class to ensure that every enrolled student fulfills the prerequisite. (More specifically, you are assumed to already have a working knowledge of all the concepts covered in the book Lean Python: Learn Just Enough Python to Build Useful Tools.) Those who fail the Python basics test are NOT advised to take this course.

Please note that this course is designed specifically for linguistics majors in the humanities. Computer science majors should be aware that this course will not feature a thorough description of the mathematical operations behind the algorithms; we focus more on practical implementation.

Course Website

All the course materials will be available on the course website ENC2045. You may need a password to access the course materials/data. If you are an officially enrolled student, please ask the instructor for the password.

We will also have a Moodle course site for assignment submission.

Please read the FAQ of the course website before course registration.

Major Readings

The course materials are mainly based on the following readings.

[Book cover images: nltk, nltk, text-analytics, sklearn, dl]

Online Resources

In addition to books, there are many wonderful online resources, especially professional blogs, that provide useful tutorials and intuitive explanations of many complex ideas in NLP and AI development. Among them, here is a list of my favorites:

Environment Setup

  1. We assume that you have created and set up your Python environment as follows:

    • Install Python with Anaconda

    • Create a conda environment named python-notes with Python 3.7

    • Run the notebooks provided in the course materials in this self-defined conda environment

  2. We will use Jupyter Notebook for Python scripting in this course. All assignments have to be submitted as Jupyter notebooks. (See the Jupyter Notebook installation documentation.)

  3. We assume that you use the python-notes conda environment for all the scripting, and that you are able to run the notebooks with this environment's kernel in Jupyter.

  4. You can also run the notebooks directly in Google Colab, or download the .ipynb notebook files to your hard drive or Google Drive for further changes.

When you install Python modules and other packages in the terminal, please remember to replace the placeholder arguments in the instructions you follow online with your own values.

For example, when you create the conda environment:

conda create --name XXX

You need to specify the conda environment name XXX on your own.

Similarly, when you add the self-defined conda environment to the notebook kernel list:

python -m ipykernel install --user --name=XXX

You need to specify the conda environment name XXX.

Warning

There are several important things here:

  • You need to install the relevant modules AFTER you activate the conda environment in the terminal.

  • You also need to register the kernel name with python -m ipykernel install --user --name=XXX from within the conda environment.

  • In other words, you need to install the module ipykernel in the target conda environment as well.

  • After some trial and error, I think the best setup is to register the kernel name (conda environment) with ipykernel only from within the conda environment. Do not add the conda environment again from your base Python environment.

  • Even better, install jupyter in your conda environment (python-notes) and run your notebooks from python-notes as well.

For Windows users, it is highly recommended to run the installation of python-related modules in Anaconda Prompt instead of cmd.

Windows Users

Some Windows users have run into the problem that the notebook cannot find the correct path of the python-notes environment. If this happens to you, please try the following steps.

Tip

We assume you run the following steps in Anaconda Powershell Prompt.

  • Create the new conda environment, python-notes (if you have done this, ignore)

$ conda create --name python-notes python=3.7

  • Activate the newly-created python-notes conda environment

$ conda activate python-notes

  • Within the python-notes environment, install the module ipykernel

$ conda install ipykernel

  • Deactivate the conda environment

$ conda deactivate

  • Install the module nb_conda_kernels in your Anaconda base python (This is crucial to the use of self-defined conda environments in notebook.)

$ conda install nb_conda_kernels

  • Then check if the system detects your python-notes environment

$ python -m nb_conda_kernels list

Note

Ideally, you should see your python-notes popping up in the results like:

conda-env-python-notes-py    c:\users\alvinchen\anaconda3\envs\python-notes\share\jupyter\kernels\python3

  • If you see the new conda environment name, then it should work. Next, start your Jupyter notebook

$ jupyter notebook

  • Now you should be able to see your conda environment python-notes in your notebook, but the kernel name would be something like Python [conda env:python-notes].
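Once a notebook is open, a quick check from a notebook cell can help confirm that the kernel really runs inside python-notes. The following is only a minimal sketch: the helper names env_name_from_prefix and describe_kernel are hypothetical, and it assumes a named conda environment stored under .../envs/<name>, as in the path shown above.

```python
import os
import re
import sys


def env_name_from_prefix(prefix):
    """Return the last path component of an environment prefix.

    Assumption: named conda environments live under .../envs/<name>,
    so the last component of the prefix is the environment name.
    """
    # Accept both Windows (\) and POSIX (/) path separators.
    parts = re.split(r"[\\/]+", prefix.rstrip("\\/"))
    return parts[-1]


def describe_kernel():
    """Collect where the running Python interpreter comes from."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "env_name": env_name_from_prefix(sys.prefix),
        # Set by `conda activate`; may be missing in some kernel setups.
        "conda_default_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_kernel()
for key, value in info.items():
    print(f"{key}: {value}")
```

If env_name does not read python-notes, the notebook is probably still using the base Python, and the kernel registration steps above are worth repeating.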

Module Requirements

Not all modules used in this course come with the default installation of the Python environment. Please remember to install the missing packages if you get a module-not-found error (ModuleNotFoundError).

The following is a list of packages we will use (non-exhaustive):

  • beautifulsoup4

  • ckip-transformers

  • gensim

  • jieba

  • jupyter

  • jupyter-nbextensions-configurator

  • Keras

  • matplotlib

  • nltk

  • numpy

  • pandas

  • scikit-learn

  • seaborn

  • spacy

  • scipy

  • tensorflow

  • transformers

Please google these packages for more details on their installation.
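To see which of the packages above are still missing from the active environment, a short check like the following may help. It is a sketch under assumptions: the helper find_missing is hypothetical, and the mapping from package names to import names reflects common usage rather than anything stated in the course materials (for example, beautifulsoup4 is imported as bs4 and scikit-learn as sklearn). The jupyter tools are left out because they are normally run from the terminal rather than imported.

```python
from importlib.util import find_spec

# Package name (as installed) -> module name (as imported).
# These import names are assumptions based on common usage.
IMPORT_NAMES = {
    "beautifulsoup4": "bs4",
    "ckip-transformers": "ckip_transformers",
    "gensim": "gensim",
    "jieba": "jieba",
    "Keras": "keras",
    "matplotlib": "matplotlib",
    "nltk": "nltk",
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "seaborn": "seaborn",
    "spacy": "spacy",
    "scipy": "scipy",
    "tensorflow": "tensorflow",
    "transformers": "transformers",
}


def find_missing(import_names):
    """Return the package names whose import module cannot be located."""
    return [
        package
        for package, module in import_names.items()
        # find_spec only locates the module; it does not import it.
        if find_spec(module) is None
    ]


missing = find_missing(IMPORT_NAMES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages were found.")
```

Run the cell with the python-notes kernel selected; the result only describes the environment that the current kernel belongs to.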

Coding Assignments Reminders

  1. You have to submit your assignments via Moodle.

  2. Please name your files in the following format: Assignment-X-NAME.ipynb and Assignment-X-NAME.html.

  3. Please always submit both the Jupyter notebook file and its HTML version.
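The naming rule can be illustrated with a small sketch. It assumes that X is the assignment number and NAME is your own name; the helpers submission_names and is_valid_name and the example name JaneDoe are hypothetical.

```python
import re

# Assumed reading of the rule: Assignment-<number>-<name>.<ipynb|html>
NAME_PATTERN = re.compile(r"Assignment-(\d+)-(.+)\.(ipynb|html)")


def submission_names(number, name):
    """Build the two file names expected for one assignment."""
    stem = f"Assignment-{number}-{name}"
    return [f"{stem}.ipynb", f"{stem}.html"]


def is_valid_name(filename):
    """Check whether a file name follows the assumed naming pattern."""
    return NAME_PATTERN.fullmatch(filename) is not None


for filename in submission_names(1, "JaneDoe"):
    print(filename, is_valid_name(filename))
```

One common way to produce the HTML version is jupyter nbconvert --to html followed by the notebook file name, although the course does not prescribe a particular method.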

References

1

Paul Gerrard. Lean Python: Learn Just Enough Python to Build Useful Tools. Apress, 2016.

2

Wes McKinney. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.

3

Aurélien Géron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, 2019.

4

Dipanjan Sarkar. Text analytics with Python: a practitioner's guide to natural language processing. Apress, 2019.

5

Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.

6

Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana. Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems. O'Reilly Media, 2020.

7

Jacob Perkins. Python 3 text processing with NLTK 3 cookbook. Packt Publishing Ltd, 2014.

8

Bhargav Srinivasa-Desikan. Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras. Packt Publishing Ltd, 2018.

9

François Chollet. Deep learning with Python. Manning Publications Company, 2017.