# ENC2045 Computational Linguistics

In [1]:
from datetime import datetime
print('Last Updated: '+ datetime.today().strftime('%Y-%m-%d'))

Last Updated: 2024-02-23


:::{attention}
We're updating the course materials as we progress through the Spring 2024 schedule. Typically, the lecture notes for each topic will be finalized and revised during its scheduled week on the syllabus.
:::

```{admonition} Contributing to the Lecture Notes!
:class: note
This website now supports [Hypothes.is](https://web.hypothes.is/), an excellent tool for web annotation. If you would like to contribute to the lecture notes (e.g., error corrections), please leave your comments and suggestions as annotations. The materials will be corrected on a regular basis, and your contributions are always greatly appreciated. :)
```

```{contents}
```

## Course Syllabus

### Introduction

Computational Linguistics (CL) is now a very active sub-discipline in applied linguistics. Its main focus is computational text analytics, that is, leveraging computational tools, techniques, and algorithms to process and understand natural language data (in spoken, textual, or even multimodal formats). This course therefore introduces the fundamental concepts, useful strategies, and common workflows that data scientists have widely adopted to extract useful insights from natural language data. In this course, we will mainly focus on **textual** data.

### Course Objectives

A selection of potential topics includes:

::::{grid}
:gutter: 3

:::{grid-item-card} <i class="fa fa-check fa-1x" style="color:DarkTurquoise;margin-right:5px"></i> A Pipeline for Natural Language Processing
:columns: 6

-   Text Normalization
-   Text Tokenization
-   Parsing and Chunking
:::


:::{grid-item-card} <i class="fa fa-check fa-1x" style="color:DarkTurquoise;margin-right:5px"></i> Chinese Processing
:columns: 6

-   Issues for Chinese Language Processing (Word Segmentation)
-   Different word segmentation tools/packages
:::


:::{grid-item-card} <i class="fa fa-check fa-1x" style="color:DarkTurquoise;margin-right:5px"></i> Machine Learning Basics
:columns: 6

-   Feature Engineering and Text Vectorization
-   Traditional Machine Learning
-   Classification Models (Naive Bayes, SVM, Logistic Regression)
:::

:::{grid-item-card} <i class="fa fa-check fa-1x" style="color:DarkTurquoise;margin-right:5px"></i> Common Computational Tasks
:columns: 6

-   Sentiment Analysis
-   Text Clustering and Topic Modeling
-   Ensemble Learning
:::

:::{grid-item-card} <i class="fa fa-check fa-1x" style="color:DarkTurquoise;margin-right:5px"></i> Deep Learning NLP
:columns: 6

-   Neural Language Model
-   Sequence Models (RNN, LSTM, GRU)
-   Sequence-to-sequence Model
-   Attention-based Models
-   Explainable Artificial Intelligence and Computational Linguistics
:::

:::{grid-item-card} <i class="fa fa-check fa-1x" style="color:DarkTurquoise;margin-right:5px"></i> Large Language Models
:columns: 6

-   Generative AI
-   Applications of LLM (e.g., BERT, ChatGPT)
-   Transfer Learning and Fine-Tuning
-   Prompt Engineering
-   Retrieval-Augmented Generation
-   Multimodal Data Processing
:::
::::

This course is extremely hands-on and will guide the students through classic examples of many task-oriented implementations via in-class theme-based tutorial sessions. The main coding language used in this course is Python. 

It is assumed that you know or will quickly learn how to code in **Python** <i class="fa-brands fa-python"></i>. In fact, this course assumes that every enrolled student has considerable working knowledge of Python. (If you are not sure whether you fulfill the prerequisite, please contact the instructor first).

To be more specific, you are assumed to have already had working knowledge of all the concepts included in the book, *Lean Python: Learn Just Enough Python to Build Useful Tools*. If the materials covered in the book are all very novel to you, I would strongly advise you not to take this course.

On the other hand, please note that this course is designed specifically for linguistics majors in the humanities. For computer science majors, this course will not feature a thorough description of the mathematical operations behind the algorithms. Nor will we focus on the most recent developments in deep-learning-based NLP. We focus more on the fundamentals and their practical implementations.

### Course Website

All the course materials will be available on the [course website ENC2045](https://alvinntnu.github.io/NTNU_ENC2045/). You may need a password to access the course materials/data. If you are an officially enrolled student, please ask the instructor for the passcode.

We will also have a Moodle course site for assignment submission.

Please read the FAQ of the course website before course registration.

## Readings

### Major Readings

The course materials are mainly based on the following readings.

::::{grid}

:::{grid-item-card}
:columns: 4
```{image} images/book-nltk.jpg
:alt: nltk
:width: 100%
:align: center
```
:::

:::{grid-item-card}
:columns: 4
```{image} images/book-pnlp.jpg
:alt: pnlp
:width: 100%
:align: center
```
:::

:::{grid-item-card}
:columns: 4
```{image} images/book-text-analytics.jpg
:alt: text-analytics
:width: 100%
:align: center
```
:::
::::


::::{grid}


:::{grid-item-card}
:columns: 4
```{image} images/book-sklearn.jpg
:alt: sklearn
:width: 100%
:align: center
```
:::

:::{grid-item-card}
:columns: 4
```{image} images/book-dl.jpg
:alt: dl
:width: 100%
:align: center
```
:::
::::

### Recommended Books

There are also many other useful reference books, listed below by category:

-   Python Basics {cite}`gerrard2016lean`
-   Data Analysis with Python {cite}`mckinney2012python,geron2019hands`
-   Natural Language Processing with Python {cite}`sarkar2019text,bird2009natural,vajjala2020,perkins2014python,srinivasa2018natural`
-   Deep Learning {cite}`francois2017deep`
-   The Oxford Handbook of Computational Linguistics (Second edition) {cite}`mitkov2022oxford`

### Online Resources

In addition to books, there are many wonderful online resources, especially professional blogs, that provide useful tutorials and intuitive explanations of many complex ideas in NLP and AI development. Here is a list of my favorites:

-   [Towards Data Science](https://towardsdatascience.com/)
-   [LeeMeng](https://leemeng.tw/)
-   [Dipanjan Sarkar’s articles](https://towardsdatascience.com/@dipanzan.sarkar)
-   [Python Graph Gallery](https://python-graph-gallery.com/)
-   [KGPTalkie NLP](https://kgptalkie.com/category/natural-language-processing-nlp/)
-   [Jason Brownlee’s Blog: Machine Learning Mastery](https://machinelearningmastery.com/)
-   [Jay Alammar’s Blog](https://jalammar.github.io/)
-   [Chris McCormick’s Blog](https://mccormickml.com/)
-   [GLUE: General Language Understanding Evaluation Benchmark](https://gluebenchmark.com/)
-   [SuperGLUE](https://super.gluebenchmark.com/)

### YouTube Channels <i class="fa-brands fa-square-youtube"></i>

-   [Chris McCormick AI](https://www.youtube.com/channel/UCoRX98PLOsaN8PtekB9kWrw/videos)
-   [Corey Schafer](https://www.youtube.com/channel/UCCezIgC97PvUuR4_gbFUs5g)
-   [Edureka!](https://www.youtube.com/channel/UCkw4JCwteGrDHIsyIIKo4tQ)
-   [Deeplearning.ai](https://www.youtube.com/channel/UCcIXc5mJsHVYTZR1maL5l9w)
-   [Free Code Camp](https://www.youtube.com/c/Freecodecamp/videos)
-   [Giant Neural Network](https://www.youtube.com/watch?v=ZzWaow1Rvho&list=PLxt59R_fWVzT9bDxA76AHm3ig0Gg9S3So)
-   [Tech with Tim](https://www.youtube.com/channel/UC4JX40jDee_tINbkjycV4Sg)
-   [PyData](https://www.youtube.com/user/PyDataTV/featured)
-   [Python Programmer](https://www.youtube.com/channel/UC68KSmHePPePCjW4v57VPQg)
-   [Keith Galli](https://www.youtube.com/channel/UCq6XkhO5SZ66N04IcPbqNcw)
-   [Data Science Dojo](https://www.youtube.com/c/Datasciencedojo/featured)
-   [Calculus: Single Variable by Professor Robert Ghrist](https://www.youtube.com/playlist?list=PLKc2XOQp0dMwj9zAXD5LlWpriIXIrGaNb)

## Python Setup

### Notebooks

In this class, I will use Jupyter Notebook for code demonstrations and tutorials. It's important for you to set up Python environments similarly.

Please note that many alternative IDE applications also support Jupyter notebook files. Another great option for Python developers is [Visual Studio Code](https://code.visualstudio.com/).

In this class, we will mainly use Jupyter notebook files (`*.ipynb`), including for all your coding assignments. You can run these files either with the default Jupyter Notebook launched from the terminal or with other third-party IDEs that support notebook files, such as Visual Studio Code. Additionally, running the course materials on Google Colab is recommended.

### Conda Environment

:::{attention}
The following installation commands are meant to be run in the terminal.
:::

1.  We assume that you have created and set up your Python environment as follows:

    -   Install Python with [Anaconda](https://www.anaconda.com/products/individual).

    -   Create a conda environment named `python-notes` with Python 3.9:

        ```bash
        conda create --name python-notes python=3.9
        ```

    -   Activate the conda environment:

        ```bash
        conda activate python-notes
        ```

    -   Install Jupyter notebook in the newly created conda environment:

        ```bash
        pip install notebook
        ```

    -   Install the `ipykernel` package in the newly created conda environment:

        ```bash
        pip install ipykernel
        ```

    -   Add this new conda environment to IPython kernel list:

        ```bash
        python -m ipykernel install --user --name=python-notes
        ```

    -   Run the notebooks provided in the course materials in this newly created conda environment:

        ```bash
        jupyter notebook
        ```   

2. In this course, we'll use `jupyter notebook` for Python scripting, and all assignments must be submitted in this format. (Refer to the [Jupyter Notebook installation documentation](https://jupyter.org/install) if needed).

3. You're expected to have set up a Conda virtual environment named `python-notes` for scripting and be able to run notebooks using this environment's kernel in Jupyter.

4. Alternatively, you can run the notebooks directly in Google Colab, or download the `.ipynb` files to your hard drive or Google Drive for further modifications.

```{tip}
If you cannot find the conda environment kernel in your Jupyter Notebook, please install the module `ipykernel`. For more details on how to use specific conda environments in Jupyter Notebook, please see [this article](https://medium.com/@nrk25693/how-to-add-your-conda-environment-to-your-jupyter-notebook-in-just-4-steps-abeab8b8d084).
```
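To confirm that a notebook is actually running on the `python-notes` kernel, a quick check from inside the notebook can help. The following is a minimal sketch: the helper name `describe_environment` is hypothetical, and it assumes the environment name can be read from the last component of `sys.prefix`, which holds for standard conda layouts but may not hold for custom installations.

```python
import os
import sys


def describe_environment(prefix=None, executable=None):
    """Summarize which Python interpreter the current kernel is using.

    `prefix` and `executable` default to the running interpreter's values;
    they are parameters only so the function can be checked with sample paths.
    """
    prefix = prefix or sys.prefix
    executable = executable or sys.executable
    return {
        "executable": executable,
        "prefix": prefix,
        # Assumption: conda environments live in .../envs/<name>,
        # so the last path component is the environment name.
        "env_name": os.path.basename(os.path.normpath(prefix)),
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `env_name` is not `python-notes`, the notebook is most likely running on a different kernel; switch kernels from the notebook's kernel menu and run the cell again.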

:::{admonition} Set your parameters!!
:class: warning, dropdown

When you install Python modules and other packages in the terminal, please remember to replace the placeholder argument values with your own when you follow instructions found online.

For example, when you create the conda environment:

```bash
conda create --name XXX python=3.9
```

You need to specify the conda environment name `XXX` on your own.

Similarly, when you add the self-defined conda environment to the notebook kernel list:

```bash
python -m ipykernel install --user --name=XXX
```

You need to specify the conda environment name `XXX`.
:::


:::{admonition} Installation tips!
:class: tip, dropdown

There are several important things here:

- You need to activate the conda environment (`python-notes`) in the terminal before installing the necessary modules.
- You also need to register the kernel with `python -m ipykernel install --user --name=XXX` from within the conda environment (i.e., activate the conda environment first).
- In summary, activate the Conda environment `python-notes`, install the `ipykernel` module, and add the `python-notes` kernel to the IPython kernel list.
- After some trial and error, I think the best setup is to add the kernel name (conda environment) to `ipykernel` **only** from within the `python-notes` conda environment. **Do not** add the conda environment **again** from your base Python environment.
- Make sure to install `jupyter` in your Conda environment (`python-notes`) and run your notebook from this `python-notes` environment.
- Rule of thumb: Activate the newly created conda environment first!!
- For Windows users, it's strongly advised to install python-related modules in `Anaconda Prompt` rather than `cmd`.

:::



::::{admonition} Notes for Windows Users
:class: info, dropdown
Some Windows users have reported that the notebook cannot find the correct path to the `python-notes` environment. If you run into this problem, please try the following steps.

```{attention}
We assume you run the following steps in Anaconda Powershell Prompt.
```

- Create the new conda environment, `python-notes` (skip this step if you have already done so)

    ```bash
    $ conda create --name python-notes python=3.9
    ```

- Activate the newly-created `python-notes` conda environment

    ```bash
    $ conda activate python-notes
    ```

- Within the `python-notes` environment, install the module `ipykernel`

    ```bash
    $ conda install -c conda-forge ipykernel
    ```

- Within the `python-notes` environment, install Jupyter notebook:

    ```bash
    conda install -c conda-forge notebook
    ```

-   Add this new conda environment to IPython kernel list:

    ```bash
    python -m ipykernel install --user --name=python-notes
    ```

- Deactivate the conda environment

    ```bash
    $ conda deactivate
    ```

- Install the module `nb_conda_kernels` in your Anaconda base python (This is crucial to the use of self-defined conda environments in notebook.)

    ```bash
    $ conda install nb_conda_kernels
    ```

- Then check if the system detects your `python-notes` environment

    ```bash
    $ python -m nb_conda_kernels list
    ```

:::{note}
Ideally, you should see your `python-notes` popping up in the results like:
```
conda-env-python-notes-py    c:\users\alvinchen\anaconda3\envs\python-notes\share\jupyter\kernels\python3
```
:::


- If you see the new conda environment name, then it should work. Next, initiate your jupyter notebook

    ```bash
    $ jupyter notebook
    ```

- Now you should be able to see your conda environment `python-notes` in your notebook, but the kernel name will be something like `python [conda env: python-notes]`.

::::

### Module Requirements

Not all modules used in this course come with the default installation of the Python environment. If you get a module-not-found error, please install the missing package.

The following is a list of packages we will use (non-exhaustive):

-   `beautifulsoup4`
-   `ckip-transformers`
-   `gensim`
-   `jieba`
-   `jupyter`
-   `jupyter-nbextensions-configurator`
-   `Keras`
-   `matplotlib`
-   `nltk`
-   `numpy`
-   `pandas`
-   `scikit-learn`
-   `seaborn`
-   `spacy`
-   `scipy`
-   `tensorflow`
-   `transformers`

Please google these packages for more details on their installation.
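To see at a glance which of these packages are already installed in the active environment, something like the following sketch may help. It uses the standard-library `importlib.metadata` module (Python 3.8+), which looks packages up by their distribution (pip) names. The helper name `installed_versions` is hypothetical, and the lowercase spelling `keras` is an assumption about how the `Keras` distribution is registered.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the list above (non-exhaustive).
COURSE_PACKAGES = [
    "beautifulsoup4", "ckip-transformers", "gensim", "jieba", "jupyter",
    "jupyter-nbextensions-configurator", "keras", "matplotlib", "nltk",
    "numpy", "pandas", "scikit-learn", "seaborn", "spacy", "scipy",
    "tensorflow", "transformers",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


report = installed_versions(COURSE_PACKAGES)
missing = [name for name, ver in report.items() if ver is None]

for name, ver in report.items():
    print(f"{name:35s} {ver if ver is not None else 'NOT INSTALLED'}")
print(f"\n{len(missing)} package(s) missing: {', '.join(missing) or 'none'}")
```

Run this in a notebook that uses the `python-notes` kernel; a package installed in another environment will still show up as missing here.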

## Coding Assignments Reminders

1.  You have to submit your assignments via Moodle.
2.  Please name your files in the following format: `Assignment-X-NAME.ipynb` and `Assignment-X-NAME.html`.
3.  Please always submit both the Jupyter notebook file and its HTML version.
4. On the assignment due date, students will present their solutions to the exercises, with each presentation limited to 20-30 minutes. 
5. The class will be divided into small groups for these exercise/assignment presentations, with the group size determined by the final enrollment. (Each group is expected to make two presentations throughout the semester.)
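As a small illustration of the naming format above, the following sketch builds and checks submission file names. The helper names are hypothetical, and the pattern encodes two assumptions that the syllabus does not state: that `X` is the assignment number, and that `NAME` contains no dots or path separators.

```python
import re

# Assumed pattern: Assignment-<number>-<NAME>.ipynb or .html
SUBMISSION_PATTERN = re.compile(r"^Assignment-(\d+)-([^./\\]+)\.(ipynb|html)$")


def submission_names(number, name):
    """Return the two file names expected for one assignment submission."""
    stem = f"Assignment-{number}-{name}"
    return [f"{stem}.ipynb", f"{stem}.html"]


def is_valid_submission(filename):
    """Check a file name against the assumed naming pattern."""
    return SUBMISSION_PATTERN.match(filename) is not None


for filename in submission_names(1, "NAME") + ["assignment1.ipynb"]:
    status = "ok" if is_valid_submission(filename) else "does not match the format"
    print(f"{filename}: {status}")
```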

```{attention}
Unless otherwise specified in class, all assignments will be due on the date/time given on Moodle. The due date is usually one week after the topic is finished. Late work submitted within **7 calendar days** of the original due date may be accepted at the instructor's discretion. After that, no late work will be accepted.
```

## Annotated Bibliography

1. Find a minimum of THREE articles published in the [Computation and Language section of arXiv](https://arxiv.org/list/cs.CL/recent) over the last five years and compile an annotated bibliography.
2. For each source reference in the annotated bibliography, provide a succinct summary and evaluation.
3. The annotations should offer a compact overview of the content, as well as assess the relevance and contributions of the study to a specific topic or field, along with its quality.
4. Each annotation should be concise yet critical, comprising no more than 250 words for each source. Your annotations should provide a comprehensive overview of the study's contributions, enabling readers who haven't read the paper to grasp its significance. Remember, plagiarism is strictly prohibited.
5. It's advisable to choose articles related to a specific topic. For inspiration on potential topics, please refer to Part 3 Language Processing Tasks and/or Part 4 NLP Applications in the Oxford Handbook of Computational Linguistics {cite}`mitkov2022oxford`.
6. The deadline for the annotated bibliography will be announced after the midterm exam.
7. Submission guidelines:
    - Please submit your document (i.e., annotated bibliography) along with the full-text PDFs of the articles reviewed.
    - You can include your notes and highlights in the PDFs.
    - All submissions need to be made online through Moodle. No printed copies are needed.
    - Please check the deadline on Moodle.

## Google Colab

- The course materials are being revised so that all the code can be executed directly on Google Colab. Ideally, students should not have to go through the painful experience of installing Python as their first step into NLP <i class="fa-regular fa-face-laugh-squint"></i>!
- Right now, we can check the packages available on Google Colab with the following commands:

In [None]:
## -------------------- ##
## Colab Only           ##
## -------------------- ##

!pip list | wc
!pip list -v

- To execute the code provided in the course materials on Google Colab, you might have to install certain packages not present in the default Colab environment.

- Installing packages or modules in Google Colab is simple and can be done with `!pip install`; the `!` prefix lets you execute shell commands directly within a Colab notebook.

In [None]:
## -------------------- ##
## Colab Only           ##
## -------------------- ##

! pip install requests_html
! pip install pytesseract
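Since cells marked "Colab Only" should be skipped when a notebook runs locally, it can be handy to detect the runtime first. The following is a minimal sketch: it assumes that the `google.colab` module can be imported only inside a Colab runtime, and the helper name `in_colab` is hypothetical.

```python
def in_colab():
    """Return True if this code appears to be running on Google Colab."""
    try:
        # Assumption: this module is available only in Colab runtimes.
        import google.colab  # noqa: F401
    except ImportError:
        return False
    return True


if in_colab():
    print("Running on Google Colab: the 'Colab Only' cells apply.")
else:
    print("Not on Google Colab: skip the 'Colab Only' cells and "
          "install packages in your conda environment instead.")
```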

- We expect everyone to create a directory named `ENC2045_demo_data` at the root of their own Google Drive. All datasets intended for future tutorials should be downloaded and stored in this designated folder. This setup allows easy access to the datasets on Google Colab using the provided code snippet:


In [None]:
## -------------------- ##
## Colab Only           ##
## -------------------- ##

from google.colab import drive
drive.mount('/content/drive')

- Once you've mounted Google Drive on Google Colab, the path to access the course materials will be: `/content/drive/MyDrive/ENC2045_demo_data/`.
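After mounting, it is worth checking that the data folder is where the notebooks expect it to be. The sketch below lists the files in the folder and returns an empty list when the folder is missing (for example, when Google Drive has not been mounted yet). The helper name `list_data_files` is hypothetical.

```python
from pathlib import Path

# Path given above for the course data folder on a mounted Google Drive.
DATA_DIR = Path("/content/drive/MyDrive/ENC2045_demo_data")


def list_data_files(data_dir=DATA_DIR):
    """Return the sorted names of the files directly inside `data_dir`.

    Returns an empty list if the directory does not exist.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        return []
    return sorted(p.name for p in data_dir.iterdir() if p.is_file())


files = list_data_files()
if files:
    print(f"Found {len(files)} file(s) in {DATA_DIR}:")
    for name in files:
        print(" -", name)
else:
    print(f"No files found in {DATA_DIR}. Is Google Drive mounted, "
          "and does the folder exist?")
```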

## Other Logistics

Please refer to [Course Website ENC2045](https://alvinntnu.github.io/NTNU_ENC2045/) for a tentative course schedule and the grading policies.

## Disclaimer

While I strive to cover the basics of computational linguistics to the best of my ability, please note that I am not an expert in Python. Therefore, there may be more advanced packages or more efficient algorithms available that can accomplish the tasks discussed in class. However, it is our intention that the fundamentals covered in this course will provide you with a solid foundation to delve into the latest advancements in the field.

## References

```{bibliography}
:filter: docname in docnames
:style: unsrt
```