Google Colab#
Google Colab is a free tool by Google where beginners can write and run Python code without needing to set up anything on their computer. It gives access to powerful computers for running code and allows easy collaboration with others.
As we work with more and more data, we may need GPU computing for faster processing.
This lecture note shows how we can take advantage of the free GPU computing provided by Google Colab to speed up Chinese word segmentation with ckip-transformers.
Prepare Google Drive#
Create a working directory under your Google Drive, named ENC2045_demo_data. Save the corpus files needed in that Google Drive directory.
We can access the files on our Google Drive from Google Colab. This can be useful when you need to load your own data in Google Colab.
Note
You can of course name the directory however you like. The key is to put the data files on Google Drive so that we can access them from Google Colab.
Run Notebook in Google Colab#
Click on the button on top of the lecture notes website to open this notebook in Google Colab.
Setting Google Colab Environment#
Important Steps for Google Colab Environment Setting
Change the Runtime for GPU
Install Modules
Mount Google Drive
Set Working Directory
Change Runtime for GPU#
[Runtime] -> [Change runtime type]
For [Hardware accelerator], choose [GPU]
!nvidia-smi
Wed Feb 21 01:23:27 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 36C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
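The same check can be scripted, which is handy when a notebook should stop early if no GPU runtime is attached. The following is a minimal sketch using only the standard library. The helper name gpu_visible is made up for illustration, and the sketch assumes that a working nvidia-smi on the PATH means a GPU is attached.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # no NVIDIA driver tools on this machine
    try:
        # `nvidia-smi -L` prints one line per GPU, e.g. "GPU 0: Tesla T4 (...)"
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU runtime detected:", gpu_visible())
```

If this prints False in Colab, the runtime type is probably still set to CPU.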
Install Modules#
Google Colab comes with several popular machine learning and deep learning modules pre-installed (e.g., nltk, sklearn, tensorflow, pytorch, numpy, spacy). We can list the pre-installed modules with the following command.
!pip list
Package Version
-------------------------------- ---------------------
absl-py 1.4.0
aiohttp 3.9.3
aiosignal 1.3.1
alabaster 0.7.16
albumentations 1.3.1
altair 4.2.2
annotated-types 0.6.0
anyio 3.7.1
appdirs 1.4.4
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
array-record 0.5.0
arviz 0.15.1
astropy 5.3.4
astunparse 1.6.3
async-timeout 4.0.3
atpublic 4.0
attrs 23.2.0
audioread 3.0.1
autograd 1.6.2
Babel 2.14.0
backcall 0.2.0
beautifulsoup4 4.12.3
bidict 0.23.0
bigframes 0.21.0
bleach 6.1.0
blinker 1.4
blis 0.7.11
blosc2 2.0.0
bokeh 3.3.4
bqplot 0.12.42
branca 0.7.1
build 1.0.3
CacheControl 0.14.0
cachetools 5.3.2
catalogue 2.0.10
certifi 2024.2.2
cffi 1.16.0
chardet 5.2.0
charset-normalizer 3.3.2
chex 0.1.85
click 8.1.7
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.16.0
cloudpickle 2.2.1
cmake 3.27.9
cmdstanpy 1.2.1
colorcet 3.0.1
colorlover 0.3.0
colour 0.1.5
community 1.0.0b1
confection 0.1.4
cons 0.4.6
contextlib2 21.6.0
contourpy 1.2.0
cryptography 42.0.3
cufflinks 0.17.3
cupy-cuda12x 12.2.0
cvxopt 1.3.2
cvxpy 1.3.3
cycler 0.12.1
cymem 2.0.8
Cython 3.0.8
dask 2023.8.1
datascience 0.17.6
db-dtypes 1.2.0
dbus-python 1.2.18
debugpy 1.6.6
decorator 4.4.2
defusedxml 0.7.1
distributed 2023.8.1
distro 1.7.0
dlib 19.24.2
dm-tree 0.1.8
docutils 0.18.1
dopamine-rl 4.0.6
duckdb 0.9.2
earthengine-api 0.1.390
easydict 1.12
ecos 2.0.13
editdistance 0.6.2
eerepr 0.0.4
en-core-web-sm 3.7.1
entrypoints 0.4
et-xmlfile 1.1.0
etils 1.7.0
etuples 0.3.9
exceptiongroup 1.2.0
fastai 2.7.14
fastcore 1.5.29
fastdownload 0.0.7
fastjsonschema 2.19.1
fastprogress 1.0.3
fastrlock 0.8.2
filelock 3.13.1
fiona 1.9.5
firebase-admin 5.3.0
Flask 2.2.5
flatbuffers 23.5.26
flax 0.8.1
folium 0.14.0
fonttools 4.49.0
frozendict 2.4.0
frozenlist 1.4.1
fsspec 2023.6.0
future 0.18.3
gast 0.5.4
gcsfs 2023.6.0
GDAL 3.6.4
gdown 4.7.3
geemap 0.31.0
gensim 4.3.2
geocoder 1.38.1
geographiclib 2.0
geopandas 0.13.2
geopy 2.3.0
gin-config 0.5.0
glob2 0.7
google 2.0.3
google-ai-generativelanguage 0.4.0
google-api-core 2.11.1
google-api-python-client 2.84.0
google-auth 2.27.0
google-auth-httplib2 0.1.1
google-auth-oauthlib 1.2.0
google-cloud-aiplatform 1.42.1
google-cloud-bigquery 3.12.0
google-cloud-bigquery-connection 1.12.1
google-cloud-bigquery-storage 2.24.0
google-cloud-core 2.3.3
google-cloud-datastore 2.15.2
google-cloud-firestore 2.11.1
google-cloud-functions 1.13.3
google-cloud-iam 2.14.1
google-cloud-language 2.13.1
google-cloud-resource-manager 1.12.1
google-cloud-storage 2.8.0
google-cloud-translate 3.11.3
google-colab 1.0.0
google-crc32c 1.5.0
google-generativeai 0.3.2
google-pasta 0.2.0
google-resumable-media 2.7.0
googleapis-common-protos 1.62.0
googledrivedownloader 0.4
graphviz 0.20.1
greenlet 3.0.3
grpc-google-iam-v1 0.13.0
grpcio 1.60.1
grpcio-status 1.48.2
gspread 3.4.2
gspread-dataframe 3.3.1
gym 0.25.2
gym-notices 0.0.8
h5netcdf 1.3.0
h5py 3.9.0
holidays 0.42
holoviews 1.17.1
html5lib 1.1
httpimport 1.3.1
httplib2 0.22.0
huggingface-hub 0.20.3
humanize 4.7.0
hyperopt 0.2.7
ibis-framework 7.1.0
idna 3.6
imageio 2.31.6
imageio-ffmpeg 0.4.9
imagesize 1.4.1
imbalanced-learn 0.10.1
imgaug 0.4.0
importlib-metadata 7.0.1
importlib-resources 6.1.1
imutils 0.5.4
inflect 7.0.0
iniconfig 2.0.0
intel-openmp 2023.2.3
ipyevents 2.0.2
ipyfilechooser 0.6.0
ipykernel 5.5.6
ipyleaflet 0.18.2
ipython 7.34.0
ipython-genutils 0.2.0
ipython-sql 0.5.0
ipytree 0.2.2
ipywidgets 7.7.1
itsdangerous 2.1.2
jax 0.4.23
jaxlib 0.4.23+cuda12.cudnn89
jeepney 0.7.1
jieba 0.42.1
Jinja2 3.1.3
joblib 1.3.2
jsonpickle 3.0.2
jsonschema 4.19.2
jsonschema-specifications 2023.12.1
jupyter-client 6.1.12
jupyter-console 6.1.0
jupyter_core 5.7.1
jupyter-server 1.24.0
jupyterlab_pygments 0.3.0
jupyterlab_widgets 3.0.10
kaggle 1.5.16
kagglehub 0.1.9
keras 2.15.0
keyring 23.5.0
kiwisolver 1.4.5
langcodes 3.3.0
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
lazy_loader 0.3
libclang 16.0.6
librosa 0.10.1
lightgbm 4.1.0
linkify-it-py 2.0.3
llvmlite 0.41.1
locket 1.0.0
logical-unification 0.4.6
lxml 4.9.4
malloy 2023.1067
Markdown 3.5.2
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.7.1
matplotlib-inline 0.1.6
matplotlib-venn 0.11.10
mdit-py-plugins 0.4.0
mdurl 0.1.2
miniKanren 1.0.3
missingno 0.5.2
mistune 0.8.4
mizani 0.9.3
mkl 2023.2.0
ml-dtypes 0.2.0
mlxtend 0.22.0
more-itertools 10.1.0
moviepy 1.0.3
mpmath 1.3.0
msgpack 1.0.7
multidict 6.0.5
multipledispatch 1.0.0
multitasking 0.0.11
murmurhash 1.0.10
music21 9.1.0
natsort 8.4.0
nbclassic 1.0.0
nbclient 0.9.0
nbconvert 6.5.4
nbformat 5.9.2
nest-asyncio 1.6.0
networkx 3.2.1
nibabel 4.0.2
nltk 3.8.1
notebook 6.5.5
notebook_shim 0.2.4
numba 0.58.1
numexpr 2.9.0
numpy 1.25.2
oauth2client 4.1.3
oauthlib 3.2.2
opencv-contrib-python 4.8.0.76
opencv-python 4.8.0.76
opencv-python-headless 4.9.0.80
openpyxl 3.1.2
opt-einsum 3.3.0
optax 0.1.9
orbax-checkpoint 0.4.4
osqp 0.6.2.post8
packaging 23.2
pandas 1.5.3
pandas-datareader 0.10.0
pandas-gbq 0.19.2
pandas-stubs 1.5.3.230304
pandocfilters 1.5.1
panel 1.3.8
param 2.0.2
parso 0.8.3
parsy 2.1
partd 1.4.1
pathlib 1.0.1
patsy 0.5.6
peewee 3.17.1
pexpect 4.9.0
pickleshare 0.7.5
Pillow 9.4.0
pins 0.8.4
pip 23.1.2
pip-tools 6.13.0
platformdirs 4.2.0
plotly 5.15.0
plotnine 0.12.4
pluggy 1.4.0
polars 0.20.2
pooch 1.8.0
portpicker 1.5.2
prefetch-generator 1.0.3
preshed 3.0.9
prettytable 3.9.0
proglog 0.1.10
progressbar2 4.2.0
prometheus_client 0.20.0
promise 2.3
prompt-toolkit 3.0.43
prophet 1.1.5
proto-plus 1.23.0
protobuf 3.20.3
psutil 5.9.5
psycopg2 2.9.9
ptyprocess 0.7.0
py-cpuinfo 9.0.0
py4j 0.10.9.7
pyarrow 14.0.2
pyarrow-hotfix 0.6
pyasn1 0.5.1
pyasn1-modules 0.3.0
pycocotools 2.0.7
pycparser 2.21
pyct 0.5.0
pydantic 2.6.1
pydantic_core 2.16.2
pydata-google-auth 1.8.2
pydot 1.4.2
pydot-ng 2.0.0
pydotplus 2.0.2
PyDrive 1.3.1
PyDrive2 1.6.3
pyerfa 2.0.1.1
pygame 2.5.2
Pygments 2.16.1
PyGObject 3.42.1
PyJWT 2.3.0
pymc 5.7.2
pymystem3 0.2.0
PyOpenGL 3.1.7
pyOpenSSL 24.0.0
pyparsing 3.1.1
pyperclip 1.8.2
pyproj 3.6.1
pyproject_hooks 1.0.0
pyshp 2.3.1
PySocks 1.7.1
pytensor 2.14.2
pytest 7.4.4
python-apt 0.0.0
python-box 7.1.1
python-dateutil 2.8.2
python-louvain 0.16
python-slugify 8.0.4
python-utils 3.8.2
pytz 2023.4
pyviz_comms 3.0.1
PyWavelets 1.5.0
PyYAML 6.0.1
pyzmq 23.2.1
qdldl 0.1.7.post0
qudida 0.0.4
ratelim 0.1.6
referencing 0.33.0
regex 2023.12.25
requests 2.31.0
requests-oauthlib 1.3.1
requirements-parser 0.5.0
rich 13.7.0
rpds-py 0.18.0
rpy2 3.4.2
rsa 4.9
safetensors 0.4.2
scikit-image 0.19.3
scikit-learn 1.2.2
scipy 1.11.4
scooby 0.9.2
scs 3.2.4.post1
seaborn 0.13.1
SecretStorage 3.3.1
Send2Trash 1.8.2
sentencepiece 0.1.99
setuptools 67.7.2
shapely 2.0.3
six 1.16.0
sklearn-pandas 2.2.0
smart-open 6.4.0
sniffio 1.3.0
snowballstemmer 2.2.0
sortedcontainers 2.4.0
soundfile 0.12.1
soupsieve 2.5
soxr 0.3.7
spacy 3.7.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
Sphinx 5.0.2
sphinxcontrib-applehelp 1.0.8
sphinxcontrib-devhelp 1.0.6
sphinxcontrib-htmlhelp 2.0.5
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.7
sphinxcontrib-serializinghtml 1.1.10
SQLAlchemy 2.0.27
sqlglot 19.9.0
sqlparse 0.4.4
srsly 2.4.8
stanio 0.3.0
statsmodels 0.14.1
sympy 1.12
tables 3.8.0
tabulate 0.9.0
tbb 2021.11.0
tblib 3.0.0
tenacity 8.2.3
tensorboard 2.15.2
tensorboard-data-server 0.7.2
tensorflow 2.15.0
tensorflow-datasets 4.9.4
tensorflow-estimator 2.15.0
tensorflow-gcs-config 2.15.0
tensorflow-hub 0.16.1
tensorflow-io-gcs-filesystem 0.36.0
tensorflow-metadata 1.14.0
tensorflow-probability 0.23.0
tensorstore 0.1.45
termcolor 2.4.0
terminado 0.18.0
text-unidecode 1.3
textblob 0.17.1
tf-keras 2.15.0
tf-slim 1.1.0
thinc 8.2.3
threadpoolctl 3.3.0
tifffile 2024.2.12
tinycss2 1.2.1
tokenizers 0.15.2
toml 0.10.2
tomli 2.0.1
toolz 0.12.1
torch 2.1.0+cu121
torchaudio 2.1.0+cu121
torchdata 0.7.0
torchsummary 1.5.1
torchtext 0.16.0
torchvision 0.16.0+cu121
tornado 6.3.2
tqdm 4.66.2
traitlets 5.7.1
traittypes 0.2.1
transformers 4.37.2
triton 2.1.0
tweepy 4.14.0
typer 0.9.0
types-pytz 2024.1.0.20240203
types-setuptools 69.1.0.20240217
typing_extensions 4.9.0
tzlocal 5.2
uc-micro-py 1.0.3
uritemplate 4.1.1
urllib3 2.0.7
vega-datasets 0.9.0
wadllib 1.3.6
wasabi 1.1.2
wcwidth 0.2.13
weasel 0.3.4
webcolors 1.13
webencodings 0.5.1
websocket-client 1.7.0
Werkzeug 3.0.1
wheel 0.42.0
widgetsnbextension 3.6.6
wordcloud 1.9.3
wrapt 1.14.1
xarray 2023.7.0
xarray-einstats 0.7.0
xgboost 2.0.3
xlrd 2.0.1
xxhash 3.4.1
xyzservices 2023.10.1
yarl 1.9.4
yellowbrick 1.5
yfinance 0.2.36
zict 3.0.0
zipp 3.17.0
We only need to install modules that are not pre-installed in Google Colab (e.g., ckip-transformers). This installation has to be repeated in every new Colab session, but don't worry: it's quick.
This is how we install the package on Google Colab. The command is the same as the one we use in a terminal, prefixed with an exclamation mark.
## Install ckip-transformers
!pip install ckip-transformers
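To decide whether a module needs installing at all, we can ask Python whether it is importable in the current session. This is a minimal sketch, not part of the original notebook. The helper name is_installed is hypothetical, and note that the import name (ckip_transformers) differs from the pip package name (ckip-transformers).

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be imported in this session."""
    return importlib.util.find_spec(module_name) is not None


# Import names, not pip package names
for name in ["nltk", "ckip_transformers"]:
    status = "available" if is_installed(name) else "missing (install with pip)"
    print(name, "->", status)
```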
Mount Google Drive#
To mount our Google Drive on the current Google Colab server, we need the following code.
The default directory of Google Colab is /content/. (There is a sub-directory by default, i.e., /content/sample_data.) We specify the mount point as /content/drive, under which you can find the root directory of your Google Drive (i.e., /content/drive/MyDrive).
from google.colab import drive
drive.mount("/content/drive")
Mounted at /content/drive
After we run the above code, we need to click on the link presented, log in with our Google Account in the new window, and get the authorization code.
Then we copy the authorization code from the new window and paste it into the text box in the notebook window.
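The google.colab module only exists on Colab, so the mounting cell fails if the same notebook is run locally. One possible guard is sketched below. The helper name in_colab is an assumption made for illustration, and the sketch simply skips the mount outside Colab.

```python
import importlib.util


def in_colab():
    """Return True if the google.colab package is importable."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ImportError:
        # Raised when the parent package `google` is not installed at all
        return False


if in_colab():
    from google.colab import drive

    drive.mount("/content/drive")
else:
    print("Not running in Google Colab; skipping the Drive mount.")
```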
Set Working Directory#
Change the Colab working directory to the ENC2045_demo_data directory on Google Drive.
import os
os.chdir('/content/drive/MyDrive/ENC2045_demo_data')
print(os.getcwd())
/content/drive/MyDrive/ENC2045_demo_data
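If the directory name is misspelled or the Drive is not mounted, os.chdir raises an error that can be hard to interpret. The sketch below wraps the step in a hypothetical helper, enter_data_dir, which checks the path first and returns the file names found there. Directory names on the mounted Drive are matched with exact letter case.

```python
import os
from pathlib import Path


def enter_data_dir(path):
    """Change into `path` and return the sorted names of its entries."""
    data_dir = Path(path)
    if not data_dir.is_dir():
        raise FileNotFoundError(
            f"{data_dir} not found. Is Google Drive mounted, and does the "
            "directory name match exactly, including letter case?"
        )
    os.chdir(data_dir)
    return sorted(p.name for p in data_dir.iterdir())


# Usage in Colab (commented out so the sketch also runs elsewhere):
# print(enter_data_dir("/content/drive/MyDrive/ENC2045_demo_data"))
```

The returned listing is a quick way to confirm that the corpus files are where the later cells expect them.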
Try ckip-transformers with GPU#
Initialize ckip-transformers#
import ckip_transformers
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger
# Initialize drivers
ws_driver = CkipWordSegmenter(model="bert-base", device=0)
pos_driver = CkipPosTagger(model="bert-base", device=0)
def my_tokenizer(doc):
    # `doc`: a list of corpus documents (each element is one document as a long string)
    cur_ws = ws_driver(doc, use_delim=True, delim_set='\n')
    cur_pos = pos_driver(cur_ws)
    # Pair every word with its POS tag, document by document
    doc_seg = [list(zip(w, p)) for (w, p) in zip(cur_ws, cur_pos)]
    return doc_seg
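The drivers above are initialized with device=0, which selects the first GPU and fails when the runtime has none. A small helper can choose the device at run time instead. This is a sketch under two assumptions: the name pick_device is hypothetical, and -1 is passed to ckip-transformers to request the CPU.

```python
def pick_device():
    """Return 0 (first GPU) if PyTorch sees a CUDA device, else -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1  # without PyTorch there is nothing to run on a GPU
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("ckip-transformers device argument:", device)

# The drivers could then be created with the selected device:
# ws_driver = CkipWordSegmenter(model="bert-base", device=device)
# pos_driver = CkipPosTagger(model="bert-base", device=device)
```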
Tokenizing Chinese Texts#
import pandas as pd
df = pd.read_csv('dcard-top100.csv')
df.head()
corpus = df['content']
corpus[:10]
0 部分回應在B117 \n謝謝各位的留言,我都有看完\n好的不好的,我都接受謝謝大家🙇♀️\...
1 https://i.imgur.com/REIEzSd.jpg\n\n身高195公分的男大生...
2 看過這麼多在Dcard、PTT上的感情渣事和創作文\n從沒想過如此荒謬像八點檔的事情居然會發...
3 剛剛吃小火鍋,跟店員說不要金針菇(怕卡牙縫),於是店員幫我換其他配料..…\n\n沒想到餐一...
4 已經約好見面,到了當天晚上七點半才回,我是被耍了嗎 \n如下圖\n\n\nhttps://i...
5 嗨!巨砲哥 答應你的文來了😆\n這是一段與約砲小哥哥談心的奇幻旅程\n\n可憐的我情人節當天...
6 https://i.imgur.com/HCTwyAH.jpg\n(圖片非本人)\n今天逛街...
7 https://i.imgur.com/RWJLK2v.jpg\n\n因為馬鞍很寬\n想請問...
8 手機排版請見諒😖🙏🏻(圖多)\n先說這不是我第一次訂購訂製蛋糕\n也了解訂製蛋糕不可能跟圖上...
9 https://i.imgur.com/6Yk9etg.jpg\n想在這裡問大家有沒有接到這...
Name: content, dtype: object
%%time
corpus_seg = my_tokenizer(corpus)
Tokenization: 100%|██████████| 100/100 [00:00<00:00, 400.33it/s]
Inference: 100%|██████████| 16/16 [01:54<00:00, 7.18s/it]
Tokenization: 100%|██████████| 100/100 [00:00<00:00, 422.44it/s]
Inference: 100%|██████████| 10/10 [01:13<00:00, 7.38s/it]
CPU times: user 3min 8s, sys: 498 ms, total: 3min 9s
Wall time: 3min 10s
corpus_seg[0][:50]
[('部分', 'Neqa'),
('回應', 'VC'),
('在', 'P'),
('B117 \n', 'FW'),
('謝謝', 'VJ'),
('各位', 'Nh'),
('的', 'DE'),
('留言', 'Na'),
(',', 'COMMACATEGORY'),
('我', 'Nh'),
('都', 'D'),
('有', 'D'),
('看完', 'VC'),
('\n', 'WHITESPACE'),
('好', 'VH'),
('的', 'DE'),
('不', 'D'),
('好', 'VH'),
('的', 'T'),
(',', 'COMMACATEGORY'),
('我', 'Nh'),
('都', 'D'),
('接受', 'VC'),
('謝謝', 'VJ'),
('大家', 'Nh'),
('🙇', 'FW'),
('\u200d♀️\n', 'DASHCATEGORY'),
('\n', 'WHITESPACE'),
('\n', 'WHITESPACE'),
('(', 'PARENTHESISCATEGORY'),
('第三', 'Neu'),
('次', 'Nf'),
('更新', 'VC'),
('在', 'P'),
('這邊', 'Ncd'),
(')', 'PARENTHESISCATEGORY'),
('\n', 'WHITESPACE'),
('B258 ', 'FW'),
('這邊', 'Ncd'),
('也', 'D'),
('有', 'V_2'),
('講到', 'VE'),
('怎麼', 'D'),
('逃生', 'VA'),
('\n', 'WHITESPACE'),
('很多', 'Neqa'),
('人', 'Na'),
('好奇', 'VH'),
('我', 'Nh'),
('是', 'SHI')]
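The output shows that line breaks are kept as tokens tagged WHITESPACE and that some words carry trailing whitespace (e.g., 'B117 \n'). For frequency counts we may want to clean these up first. The sketch below works on a handful of (word, tag) pairs copied from the output above as a stand-in for corpus_seg[0]. The helper name content_tokens and the choice of tags to drop are assumptions for illustration.

```python
from collections import Counter

# A few (word, tag) pairs copied from the output above
sample_doc = [
    ("部分", "Neqa"), ("回應", "VC"), ("在", "P"), ("B117 \n", "FW"),
    ("謝謝", "VJ"), ("各位", "Nh"), ("的", "DE"), ("留言", "Na"),
    (",", "COMMACATEGORY"), ("我", "Nh"), ("\n", "WHITESPACE"),
]


def content_tokens(doc_seg, drop_tags=("WHITESPACE",)):
    """Strip whitespace around words; drop empty words and unwanted tags."""
    cleaned = []
    for word, tag in doc_seg:
        word = word.strip()
        if word and tag not in drop_tags:
            cleaned.append((word, tag))
    return cleaned


tokens = content_tokens(sample_doc)
tag_counts = Counter(tag for _, tag in tokens)
print(tokens[:4])
print(tag_counts.most_common(3))
```

The same function can be mapped over every document in corpus_seg, and punctuation tags such as COMMACATEGORY can be added to drop_tags if only words are of interest.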