Google Colab#

  • Google Colab is a free tool by Google where beginners can write and run Python code without needing to set up anything on their computer. It gives access to powerful computers for running code and allows easy collaboration with others.

  • As we are working with more and more data, we may need GPU computing for quicker processing.

  • This lecture note shows how we can capitalize on the free GPU computing provided by Google Colab and speed up the Chinese word segmentation of ckip-transformers.

Prepare Google Drive#

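  • Create a working directory under your Google Drive, named ENC2045_demo_data. (Paths on the mounted Drive are case-sensitive, so the name must match the one used in the code below.)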

  • Save the corpus files needed in that Google Drive directory.

  • We can access the files on our Google Drive from Google Colab. This can be useful when you need to load your own data in Google Colab.

Note

You can of course name the directory whichever way you like. The key is that the data files must be on Google Drive so that we can access them through Google Colab.

Run Notebook in Google Colab#

  • Click on the button on top of the lecture notes website to open this notebook in Google Colab.

Setting Google Colab Environment#

  • Important Steps for Google Colab Environment Setting

    • Change the Runtime for GPU

    • Install Modules

    • Mount Google Drive

    • Set Working Directory

Change Runtime for GPU#

  • [Runtime] -> [Change runtime type]

  • For [Hardware accelerator], choose [GPU]

!nvidia-smi
Wed Feb 21 01:23:27 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
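
The same check can be done from Python, which is handy when a later cell should behave differently depending on whether a GPU was actually assigned. The following is a minimal sketch, not part of the original notes: the helper name `gpu_names` is hypothetical, and it simply calls `nvidia-smi` with its `--query-gpu` option.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none is found."""
    # nvidia-smi is absent on CPU-only runtimes
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    # A non-zero exit code usually means no driver or no GPU is attached
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = gpu_names()
print(names if names else "No GPU detected; check the runtime type.")
```

If the list is empty, go back to [Runtime] -> [Change runtime type] and confirm that [GPU] is selected.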

Install Modules#

  • Google Colab comes pre-installed with several popular modules for machine learning and deep learning (e.g., nltk, sklearn, tensorflow, pytorch, numpy, spacy).

  • We can check the pre-installed modules here.

!pip list
Package                          Version
-------------------------------- ---------------------
absl-py                          1.4.0
aiohttp                          3.9.3
aiosignal                        1.3.1
alabaster                        0.7.16
albumentations                   1.3.1
altair                           4.2.2
annotated-types                  0.6.0
anyio                            3.7.1
appdirs                          1.4.4
argon2-cffi                      23.1.0
argon2-cffi-bindings             21.2.0
array-record                     0.5.0
arviz                            0.15.1
astropy                          5.3.4
astunparse                       1.6.3
async-timeout                    4.0.3
atpublic                         4.0
attrs                            23.2.0
audioread                        3.0.1
autograd                         1.6.2
Babel                            2.14.0
backcall                         0.2.0
beautifulsoup4                   4.12.3
bidict                           0.23.0
bigframes                        0.21.0
bleach                           6.1.0
blinker                          1.4
blis                             0.7.11
blosc2                           2.0.0
bokeh                            3.3.4
bqplot                           0.12.42
branca                           0.7.1
build                            1.0.3
CacheControl                     0.14.0
cachetools                       5.3.2
catalogue                        2.0.10
certifi                          2024.2.2
cffi                             1.16.0
chardet                          5.2.0
charset-normalizer               3.3.2
chex                             0.1.85
click                            8.1.7
click-plugins                    1.1.1
cligj                            0.7.2
cloudpathlib                     0.16.0
cloudpickle                      2.2.1
cmake                            3.27.9
cmdstanpy                        1.2.1
colorcet                         3.0.1
colorlover                       0.3.0
colour                           0.1.5
community                        1.0.0b1
confection                       0.1.4
cons                             0.4.6
contextlib2                      21.6.0
contourpy                        1.2.0
cryptography                     42.0.3
cufflinks                        0.17.3
cupy-cuda12x                     12.2.0
cvxopt                           1.3.2
cvxpy                            1.3.3
cycler                           0.12.1
cymem                            2.0.8
Cython                           3.0.8
dask                             2023.8.1
datascience                      0.17.6
db-dtypes                        1.2.0
dbus-python                      1.2.18
debugpy                          1.6.6
decorator                        4.4.2
defusedxml                       0.7.1
distributed                      2023.8.1
distro                           1.7.0
dlib                             19.24.2
dm-tree                          0.1.8
docutils                         0.18.1
dopamine-rl                      4.0.6
duckdb                           0.9.2
earthengine-api                  0.1.390
easydict                         1.12
ecos                             2.0.13
editdistance                     0.6.2
eerepr                           0.0.4
en-core-web-sm                   3.7.1
entrypoints                      0.4
et-xmlfile                       1.1.0
etils                            1.7.0
etuples                          0.3.9
exceptiongroup                   1.2.0
fastai                           2.7.14
fastcore                         1.5.29
fastdownload                     0.0.7
fastjsonschema                   2.19.1
fastprogress                     1.0.3
fastrlock                        0.8.2
filelock                         3.13.1
fiona                            1.9.5
firebase-admin                   5.3.0
Flask                            2.2.5
flatbuffers                      23.5.26
flax                             0.8.1
folium                           0.14.0
fonttools                        4.49.0
frozendict                       2.4.0
frozenlist                       1.4.1
fsspec                           2023.6.0
future                           0.18.3
gast                             0.5.4
gcsfs                            2023.6.0
GDAL                             3.6.4
gdown                            4.7.3
geemap                           0.31.0
gensim                           4.3.2
geocoder                         1.38.1
geographiclib                    2.0
geopandas                        0.13.2
geopy                            2.3.0
gin-config                       0.5.0
glob2                            0.7
google                           2.0.3
google-ai-generativelanguage     0.4.0
google-api-core                  2.11.1
google-api-python-client         2.84.0
google-auth                      2.27.0
google-auth-httplib2             0.1.1
google-auth-oauthlib             1.2.0
google-cloud-aiplatform          1.42.1
google-cloud-bigquery            3.12.0
google-cloud-bigquery-connection 1.12.1
google-cloud-bigquery-storage    2.24.0
google-cloud-core                2.3.3
google-cloud-datastore           2.15.2
google-cloud-firestore           2.11.1
google-cloud-functions           1.13.3
google-cloud-iam                 2.14.1
google-cloud-language            2.13.1
google-cloud-resource-manager    1.12.1
google-cloud-storage             2.8.0
google-cloud-translate           3.11.3
google-colab                     1.0.0
google-crc32c                    1.5.0
google-generativeai              0.3.2
google-pasta                     0.2.0
google-resumable-media           2.7.0
googleapis-common-protos         1.62.0
googledrivedownloader            0.4
graphviz                         0.20.1
greenlet                         3.0.3
grpc-google-iam-v1               0.13.0
grpcio                           1.60.1
grpcio-status                    1.48.2
gspread                          3.4.2
gspread-dataframe                3.3.1
gym                              0.25.2
gym-notices                      0.0.8
h5netcdf                         1.3.0
h5py                             3.9.0
holidays                         0.42
holoviews                        1.17.1
html5lib                         1.1
httpimport                       1.3.1
httplib2                         0.22.0
huggingface-hub                  0.20.3
humanize                         4.7.0
hyperopt                         0.2.7
ibis-framework                   7.1.0
idna                             3.6
imageio                          2.31.6
imageio-ffmpeg                   0.4.9
imagesize                        1.4.1
imbalanced-learn                 0.10.1
imgaug                           0.4.0
importlib-metadata               7.0.1
importlib-resources              6.1.1
imutils                          0.5.4
inflect                          7.0.0
iniconfig                        2.0.0
intel-openmp                     2023.2.3
ipyevents                        2.0.2
ipyfilechooser                   0.6.0
ipykernel                        5.5.6
ipyleaflet                       0.18.2
ipython                          7.34.0
ipython-genutils                 0.2.0
ipython-sql                      0.5.0
ipytree                          0.2.2
ipywidgets                       7.7.1
itsdangerous                     2.1.2
jax                              0.4.23
jaxlib                           0.4.23+cuda12.cudnn89
jeepney                          0.7.1
jieba                            0.42.1
Jinja2                           3.1.3
joblib                           1.3.2
jsonpickle                       3.0.2
jsonschema                       4.19.2
jsonschema-specifications        2023.12.1
jupyter-client                   6.1.12
jupyter-console                  6.1.0
jupyter_core                     5.7.1
jupyter-server                   1.24.0
jupyterlab_pygments              0.3.0
jupyterlab_widgets               3.0.10
kaggle                           1.5.16
kagglehub                        0.1.9
keras                            2.15.0
keyring                          23.5.0
kiwisolver                       1.4.5
langcodes                        3.3.0
launchpadlib                     1.10.16
lazr.restfulclient               0.14.4
lazr.uri                         1.0.6
lazy_loader                      0.3
libclang                         16.0.6
librosa                          0.10.1
lightgbm                         4.1.0
linkify-it-py                    2.0.3
llvmlite                         0.41.1
locket                           1.0.0
logical-unification              0.4.6
lxml                             4.9.4
malloy                           2023.1067
Markdown                         3.5.2
markdown-it-py                   3.0.0
MarkupSafe                       2.1.5
matplotlib                       3.7.1
matplotlib-inline                0.1.6
matplotlib-venn                  0.11.10
mdit-py-plugins                  0.4.0
mdurl                            0.1.2
miniKanren                       1.0.3
missingno                        0.5.2
mistune                          0.8.4
mizani                           0.9.3
mkl                              2023.2.0
ml-dtypes                        0.2.0
mlxtend                          0.22.0
more-itertools                   10.1.0
moviepy                          1.0.3
mpmath                           1.3.0
msgpack                          1.0.7
multidict                        6.0.5
multipledispatch                 1.0.0
multitasking                     0.0.11
murmurhash                       1.0.10
music21                          9.1.0
natsort                          8.4.0
nbclassic                        1.0.0
nbclient                         0.9.0
nbconvert                        6.5.4
nbformat                         5.9.2
nest-asyncio                     1.6.0
networkx                         3.2.1
nibabel                          4.0.2
nltk                             3.8.1
notebook                         6.5.5
notebook_shim                    0.2.4
numba                            0.58.1
numexpr                          2.9.0
numpy                            1.25.2
oauth2client                     4.1.3
oauthlib                         3.2.2
opencv-contrib-python            4.8.0.76
opencv-python                    4.8.0.76
opencv-python-headless           4.9.0.80
openpyxl                         3.1.2
opt-einsum                       3.3.0
optax                            0.1.9
orbax-checkpoint                 0.4.4
osqp                             0.6.2.post8
packaging                        23.2
pandas                           1.5.3
pandas-datareader                0.10.0
pandas-gbq                       0.19.2
pandas-stubs                     1.5.3.230304
pandocfilters                    1.5.1
panel                            1.3.8
param                            2.0.2
parso                            0.8.3
parsy                            2.1
partd                            1.4.1
pathlib                          1.0.1
patsy                            0.5.6
peewee                           3.17.1
pexpect                          4.9.0
pickleshare                      0.7.5
Pillow                           9.4.0
pins                             0.8.4
pip                              23.1.2
pip-tools                        6.13.0
platformdirs                     4.2.0
plotly                           5.15.0
plotnine                         0.12.4
pluggy                           1.4.0
polars                           0.20.2
pooch                            1.8.0
portpicker                       1.5.2
prefetch-generator               1.0.3
preshed                          3.0.9
prettytable                      3.9.0
proglog                          0.1.10
progressbar2                     4.2.0
prometheus_client                0.20.0
promise                          2.3
prompt-toolkit                   3.0.43
prophet                          1.1.5
proto-plus                       1.23.0
protobuf                         3.20.3
psutil                           5.9.5
psycopg2                         2.9.9
ptyprocess                       0.7.0
py-cpuinfo                       9.0.0
py4j                             0.10.9.7
pyarrow                          14.0.2
pyarrow-hotfix                   0.6
pyasn1                           0.5.1
pyasn1-modules                   0.3.0
pycocotools                      2.0.7
pycparser                        2.21
pyct                             0.5.0
pydantic                         2.6.1
pydantic_core                    2.16.2
pydata-google-auth               1.8.2
pydot                            1.4.2
pydot-ng                         2.0.0
pydotplus                        2.0.2
PyDrive                          1.3.1
PyDrive2                         1.6.3
pyerfa                           2.0.1.1
pygame                           2.5.2
Pygments                         2.16.1
PyGObject                        3.42.1
PyJWT                            2.3.0
pymc                             5.7.2
pymystem3                        0.2.0
PyOpenGL                         3.1.7
pyOpenSSL                        24.0.0
pyparsing                        3.1.1
pyperclip                        1.8.2
pyproj                           3.6.1
pyproject_hooks                  1.0.0
pyshp                            2.3.1
PySocks                          1.7.1
pytensor                         2.14.2
pytest                           7.4.4
python-apt                       0.0.0
python-box                       7.1.1
python-dateutil                  2.8.2
python-louvain                   0.16
python-slugify                   8.0.4
python-utils                     3.8.2
pytz                             2023.4
pyviz_comms                      3.0.1
PyWavelets                       1.5.0
PyYAML                           6.0.1
pyzmq                            23.2.1
qdldl                            0.1.7.post0
qudida                           0.0.4
ratelim                          0.1.6
referencing                      0.33.0
regex                            2023.12.25
requests                         2.31.0
requests-oauthlib                1.3.1
requirements-parser              0.5.0
rich                             13.7.0
rpds-py                          0.18.0
rpy2                             3.4.2
rsa                              4.9
safetensors                      0.4.2
scikit-image                     0.19.3
scikit-learn                     1.2.2
scipy                            1.11.4
scooby                           0.9.2
scs                              3.2.4.post1
seaborn                          0.13.1
SecretStorage                    3.3.1
Send2Trash                       1.8.2
sentencepiece                    0.1.99
setuptools                       67.7.2
shapely                          2.0.3
six                              1.16.0
sklearn-pandas                   2.2.0
smart-open                       6.4.0
sniffio                          1.3.0
snowballstemmer                  2.2.0
sortedcontainers                 2.4.0
soundfile                        0.12.1
soupsieve                        2.5
soxr                             0.3.7
spacy                            3.7.4
spacy-legacy                     3.0.12
spacy-loggers                    1.0.5
Sphinx                           5.0.2
sphinxcontrib-applehelp          1.0.8
sphinxcontrib-devhelp            1.0.6
sphinxcontrib-htmlhelp           2.0.5
sphinxcontrib-jsmath             1.0.1
sphinxcontrib-qthelp             1.0.7
sphinxcontrib-serializinghtml    1.1.10
SQLAlchemy                       2.0.27
sqlglot                          19.9.0
sqlparse                         0.4.4
srsly                            2.4.8
stanio                           0.3.0
statsmodels                      0.14.1
sympy                            1.12
tables                           3.8.0
tabulate                         0.9.0
tbb                              2021.11.0
tblib                            3.0.0
tenacity                         8.2.3
tensorboard                      2.15.2
tensorboard-data-server          0.7.2
tensorflow                       2.15.0
tensorflow-datasets              4.9.4
tensorflow-estimator             2.15.0
tensorflow-gcs-config            2.15.0
tensorflow-hub                   0.16.1
tensorflow-io-gcs-filesystem     0.36.0
tensorflow-metadata              1.14.0
tensorflow-probability           0.23.0
tensorstore                      0.1.45
termcolor                        2.4.0
terminado                        0.18.0
text-unidecode                   1.3
textblob                         0.17.1
tf-keras                         2.15.0
tf-slim                          1.1.0
thinc                            8.2.3
threadpoolctl                    3.3.0
tifffile                         2024.2.12
tinycss2                         1.2.1
tokenizers                       0.15.2
toml                             0.10.2
tomli                            2.0.1
toolz                            0.12.1
torch                            2.1.0+cu121
torchaudio                       2.1.0+cu121
torchdata                        0.7.0
torchsummary                     1.5.1
torchtext                        0.16.0
torchvision                      0.16.0+cu121
tornado                          6.3.2
tqdm                             4.66.2
traitlets                        5.7.1
traittypes                       0.2.1
transformers                     4.37.2
triton                           2.1.0
tweepy                           4.14.0
typer                            0.9.0
types-pytz                       2024.1.0.20240203
types-setuptools                 69.1.0.20240217
typing_extensions                4.9.0
tzlocal                          5.2
uc-micro-py                      1.0.3
uritemplate                      4.1.1
urllib3                          2.0.7
vega-datasets                    0.9.0
wadllib                          1.3.6
wasabi                           1.1.2
wcwidth                          0.2.13
weasel                           0.3.4
webcolors                        1.13
webencodings                     0.5.1
websocket-client                 1.7.0
Werkzeug                         3.0.1
wheel                            0.42.0
widgetsnbextension               3.6.6
wordcloud                        1.9.3
wrapt                            1.14.1
xarray                           2023.7.0
xarray-einstats                  0.7.0
xgboost                          2.0.3
xlrd                             2.0.1
xxhash                           3.4.1
xyzservices                      2023.10.1
yarl                             1.9.4
yellowbrick                      1.5
yfinance                         0.2.36
zict                             3.0.0
zipp                             3.17.0
  • We only need to install modules that are not pre-installed in Google Colab (e.g., ckip-transformers).

  • This installation has to be done every time we work with Google Colab. But don’t worry. It’s quick.

  • This is how we install the package on Google Colab, exactly the same as we do in our terminal.

## Install ckip-transformers (not pre-installed on Colab)
!pip install ckip-transformers
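
Since the installation is repeated in every new session, it can help to install a module only when it is missing. The sketch below is an illustration under stated assumptions: `is_installed` and `ensure_installed` are hypothetical helper names, and the check relies on the standard library's `importlib.util.find_spec`.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if a top-level module can be imported in this runtime."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(module_name, pip_name):
    """Install `pip_name` with pip only if `module_name` is missing.

    Returns True if an installation was triggered, False otherwise.
    """
    if is_installed(module_name):
        return False
    subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])
    return True


# The import name uses an underscore; the pip name uses a hyphen.
print("ckip_transformers available:", is_installed("ckip_transformers"))
```

In a notebook, calling `ensure_installed("ckip_transformers", "ckip-transformers")` would then have the same effect as the `!pip install` line, but skip the work when the module is already there.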

Mount Google Drive#

  • To mount our Google Drive to the current Google Colab server, we need the following code.

  • The default directory of Google Colab is /content/. (There is a sub-directory by default, i.e., /content/sample_data.)

  • We specify the mount point as /content/drive, under which you can find the root directory of your Google Drive (i.e., /content/drive/MyDrive).

from google.colab import drive
drive.mount("/content/drive")
Mounted at /content/drive
  • After we run the above code, we need to click on the link presented, log in with our Google account in the new window, and get the authorization code.

  • Then copy the authorization code from the new window and paste it back into the text box in the notebook window.
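
If the same notebook is also run outside Colab (for example, on a local machine), the mount step can be guarded so that it does not fail there. This is a minimal sketch; `mount_drive_if_colab` is a hypothetical helper name, and it assumes that the `google.colab` module is importable only inside Colab.

```python
def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return False elsewhere."""
    try:
        # This import is expected to succeed only inside Google Colab
        from google.colab import drive
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


print("Drive mounted:", mount_drive_if_colab())
```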

Set Working Directory#

  • Change the Colab working directory to the ENC2045_demo_data directory on Google Drive.

import os
os.chdir('/content/drive/MyDrive/ENC2045_demo_data')
print(os.getcwd())
/content/drive/MyDrive/ENC2045_demo_data
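
A mistyped directory name (including a difference in letter case) makes `os.chdir()` fail with a fairly terse error. The sketch below shows one way to produce a more helpful message; `enter_workdir` is a hypothetical helper, not something used in the original notes.

```python
import os


def enter_workdir(path):
    """Change into `path`, or raise an error that lists what the parent contains."""
    if not os.path.isdir(path):
        parent = os.path.dirname(path.rstrip("/")) or "."
        # Listing the parent makes spelling and case mismatches easy to spot
        entries = sorted(os.listdir(parent)) if os.path.isdir(parent) else []
        raise FileNotFoundError(
            f"{path!r} not found; entries in {parent!r}: {entries}"
        )
    os.chdir(path)
    return os.getcwd()


# Usage in Colab, after mounting Google Drive:
# enter_workdir('/content/drive/MyDrive/ENC2045_demo_data')
```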

Try ckip-transformers with GPU#

Initialize the ckip-transformers#

import ckip_transformers
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger

# Initialize drivers; device=0 selects the first GPU
ws_driver = CkipWordSegmenter(model="bert-base", device=0)
pos_driver = CkipPosTagger(model="bert-base", device=0)


def my_tokenizer(doc):
    # `doc`: a list of corpus documents (each element is one long document string)
    # Word segmentation, treating line breaks as sentence delimiters
    cur_ws = ws_driver(doc, use_delim=True, delim_set='\n')
    # POS tagging on the segmented words
    cur_pos = pos_driver(cur_ws)
    # Pair each word with its tag, document by document
    doc_seg = [list(zip(w, p)) for (w, p) in zip(cur_ws, cur_pos)]
    return doc_seg
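
The last step of `my_tokenizer()` pairs each word with its POS tag, document by document. The toy example below reproduces only that pairing step without loading any model; the function name `pair_words_with_tags` and the two small input lists are made up for illustration.

```python
def pair_words_with_tags(words_per_doc, tags_per_doc):
    """Zip words and tags together, one list of (word, tag) pairs per document."""
    return [
        list(zip(words, tags))
        for words, tags in zip(words_per_doc, tags_per_doc)
    ]


# Two toy "documents": segmented words and their tags in parallel lists
toy_ws = [["部分", "回應", "在"], ["謝謝", "各位"]]
toy_pos = [["Neqa", "VC", "P"], ["VJ", "Nh"]]

print(pair_words_with_tags(toy_ws, toy_pos))
# [[('部分', 'Neqa'), ('回應', 'VC'), ('在', 'P')], [('謝謝', 'VJ'), ('各位', 'Nh')]]
```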

Tokenizing Chinese Texts#

import pandas as pd

df = pd.read_csv('dcard-top100.csv')
df.head()
corpus = df['content']
corpus[:10]
0    部分回應在B117 \n謝謝各位的留言,我都有看完\n好的不好的,我都接受謝謝大家🙇‍♀️\...
1    https://i.imgur.com/REIEzSd.jpg\n\n身高195公分的男大生...
2    看過這麼多在Dcard、PTT上的感情渣事和創作文\n從沒想過如此荒謬像八點檔的事情居然會發...
3    剛剛吃小火鍋,跟店員說不要金針菇(怕卡牙縫),於是店員幫我換其他配料..…\n\n沒想到餐一...
4    已經約好見面,到了當天晚上七點半才回,我是被耍了嗎 \n如下圖\n\n\nhttps://i...
5    嗨!巨砲哥 答應你的文來了😆\n這是一段與約砲小哥哥談心的奇幻旅程\n\n可憐的我情人節當天...
6    https://i.imgur.com/HCTwyAH.jpg\n(圖片非本人)\n今天逛街...
7    https://i.imgur.com/RWJLK2v.jpg\n\n因為馬鞍很寬\n想請問...
8    手機排版請見諒😖🙏🏻(圖多)\n先說這不是我第一次訂購訂製蛋糕\n也了解訂製蛋糕不可能跟圖上...
9    https://i.imgur.com/6Yk9etg.jpg\n想在這裡問大家有沒有接到這...
Name: content, dtype: object
%%time
corpus_seg = my_tokenizer(corpus)
Tokenization: 100%|██████████| 100/100 [00:00<00:00, 400.33it/s]
Inference: 100%|██████████| 16/16 [01:54<00:00,  7.18s/it]
Tokenization: 100%|██████████| 100/100 [00:00<00:00, 422.44it/s]
Inference: 100%|██████████| 10/10 [01:13<00:00,  7.38s/it]
CPU times: user 3min 8s, sys: 498 ms, total: 3min 9s
Wall time: 3min 10s
corpus_seg[0][:50]
[('部分', 'Neqa'),
 ('回應', 'VC'),
 ('在', 'P'),
 ('B117 \n', 'FW'),
 ('謝謝', 'VJ'),
 ('各位', 'Nh'),
 ('的', 'DE'),
 ('留言', 'Na'),
 (',', 'COMMACATEGORY'),
 ('我', 'Nh'),
 ('都', 'D'),
 ('有', 'D'),
 ('看完', 'VC'),
 ('\n', 'WHITESPACE'),
 ('好', 'VH'),
 ('的', 'DE'),
 ('不', 'D'),
 ('好', 'VH'),
 ('的', 'T'),
 (',', 'COMMACATEGORY'),
 ('我', 'Nh'),
 ('都', 'D'),
 ('接受', 'VC'),
 ('謝謝', 'VJ'),
 ('大家', 'Nh'),
 ('🙇', 'FW'),
 ('\u200d♀️\n', 'DASHCATEGORY'),
 ('\n', 'WHITESPACE'),
 ('\n', 'WHITESPACE'),
 ('(', 'PARENTHESISCATEGORY'),
 ('第三', 'Neu'),
 ('次', 'Nf'),
 ('更新', 'VC'),
 ('在', 'P'),
 ('這邊', 'Ncd'),
 (')', 'PARENTHESISCATEGORY'),
 ('\n', 'WHITESPACE'),
 ('B258 ', 'FW'),
 ('這邊', 'Ncd'),
 ('也', 'D'),
 ('有', 'V_2'),
 ('講到', 'VE'),
 ('怎麼', 'D'),
 ('逃生', 'VA'),
 ('\n', 'WHITESPACE'),
 ('很多', 'Neqa'),
 ('人', 'Na'),
 ('好奇', 'VH'),
 ('我', 'Nh'),
 ('是', 'SHI')]
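
The output above still contains line breaks and punctuation as tokens. A common follow-up is to filter them out by tag. The sketch below is one possible way to do this; `drop_space_and_punct` is a hypothetical helper, and it assumes (based only on the tags visible in the output above) that whitespace is tagged `WHITESPACE` and that punctuation tags end in `CATEGORY`.

```python
def drop_space_and_punct(tagged_doc):
    """Keep only (word, tag) pairs that are neither whitespace nor punctuation."""
    return [
        (word, tag)
        for (word, tag) in tagged_doc
        # Assumption: punctuation tags end in "CATEGORY", as in the output above
        if tag != "WHITESPACE" and not tag.endswith("CATEGORY")
    ]


# A few (word, tag) pairs in the same shape as the elements of corpus_seg[0]
sample = [
    ("留言", "Na"),
    (",", "COMMACATEGORY"),
    ("我", "Nh"),
    ("\n", "WHITESPACE"),
    ("(", "PARENTHESISCATEGORY"),
    ("第三", "Neu"),
]

print(drop_space_and_punct(sample))
# [('留言', 'Na'), ('我', 'Nh'), ('第三', 'Neu')]
```

Applied to the whole corpus, this would be `[drop_space_and_punct(d) for d in corpus_seg]`.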