Unicode#
Dealing with unicode texts can be tedious sometimes.
It is good to have a basic understanding of the Unicode Character Database
In particular, this notebook focuses on the Python module
unicodedata
.
Character Name#
import unicodedata
print(unicodedata.name('A'))
print(unicodedata.name('我'))
LATIN CAPITAL LETTER A
CJK UNIFIED IDEOGRAPH-6211
Characrer to Numbers#
print(unicodedata.numeric('四')) # any character
print(unicodedata.numeric('壹')) # any character
#print(unicodedata.digit('四')) # digits only
#print(unicodedata.decimal('六'))
4.0
1.0
Look-up By Name#
print(unicodedata.lookup('CJK UNIFIED IDEOGRAPH-6211'))
print(unicodedata.lookup('LEFT CURLY BRACKET'))
我
{
Unicode Category#
print(unicodedata.category('a'))
print(unicodedata.category('A'))
print(unicodedata.category('{'))
print(unicodedata.category('。'))
print(unicodedata.category('$'))
print(unicodedata.category('我'))
Ll
Lu
Ps
Po
Sc
Lo
Normalization#
Ways of normalization: NFD, NFC, NFKD, NFKC
Suggested use:NFKC
Meaning:
D = Decomposition (will change the length of the original form)
C = Composition
K = Compatibility (will change the original form)
## Chinese characters with full-width English letters and punctuations
text = '中英文abc,,。..ABC123'
print(unicodedata.normalize('NFKD', text))
print(unicodedata.normalize('NFKC', text)) # recommended method
print(unicodedata.normalize('NFC', text))
print(unicodedata.normalize('NFD', text))
中英文abc,,。..ABC123
中英文abc,,。..ABC123
中英文abc,,。..ABC123
中英文abc,,。..ABC123
text = 'English characters with full-wdiths ABC。'
## Encode the string in ASCII and find compatible characters
print(
unicodedata.normalize('NFKC',
text).encode('ascii',
'ignore').decode('utf-8', 'ignore'))
print(
unicodedata.normalize('NFKD',
text).encode('ascii',
'ignore').decode('utf-8', 'ignore'))
## Encode the string in ASCII and but remove ASCII-incompatible chars
print(
unicodedata.normalize('NFC',
text).encode('ascii',
'ignore').decode('utf-8', 'ignore'))
print(
unicodedata.normalize('NFD',
text).encode('ascii',
'ignore').decode('utf-8', 'ignore'))
English characters with full-wdiths ABC
English characters with full-wdiths ABC
English characters with full-wdiths
English characters with full-wdiths
text = 'Klüft skräms inför på fédéral électoral große'
unicodedata.normalize('NFKD', text).encode('ascii',
'ignore').decode('utf-8', 'ignore')
'Kluft skrams infor pa federal electoral groe'
Normalizing Texts#
text = "中文CHINESE。!==.= ^o^ 2020/5/20 alvin@gmal.cob@%&*"
# remove puncs/symbols
print(''.join(
[c for c in text if unicodedata.category(c)[0] not in ["P", "S"]]))
# select letters
print(''.join([c for c in text if unicodedata.category(c)[0] in ["L"]]))
# remove alphabets
print(''.join(
[c for c in text if unicodedata.category(c)[:2] not in ["Lu", 'Ll']]))
# select Chinese chars?
print(''.join([c for c in text if unicodedata.category(c)[:2] in ["Lo"]]))
中文CHINESE o 2020520 alvingmalcob
中文CHINESEoalvingmalcob
中文。!==.= ^^ 2020/5/20 @.@%&*
中文
Note
It seems that the unicode catetory Lo is good to identify Chinese characters?