Unicode#

  • Working with Unicode text can be tedious.

  • A basic understanding of the Unicode Character Database (UCD) helps.

  • This notebook focuses on Python's built-in unicodedata module, which provides access to that database.

Character Name#

import unicodedata

print(unicodedata.name('A'))
print(unicodedata.name('我'))
LATIN CAPITAL LETTER A
CJK UNIFIED IDEOGRAPH-6211
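
unicodedata.name() raises ValueError for characters that have no name in the database, such as control characters. A minimal sketch of passing a default value instead; describe is a hypothetical helper name introduced here for illustration, not part of the module:

```python
import unicodedata

def describe(ch):
    """Return 'U+XXXX NAME' for one character (hypothetical helper)."""
    # The second argument is returned instead of raising ValueError
    # when the character has no name in the database.
    name = unicodedata.name(ch, '<unnamed>')
    return f'U+{ord(ch):04X} {name}'

print(describe('A'))    # U+0041 LATIN CAPITAL LETTER A
print(describe('我'))   # U+6211 CJK UNIFIED IDEOGRAPH-6211
print(describe('\n'))   # U+000A <unnamed>
```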

Character to Numbers#

print(unicodedata.numeric('四'))  # works for any character that has a numeric value
print(unicodedata.numeric('壹'))
#print(unicodedata.digit('四'))  # raises ValueError: '四' has no digit value
#print(unicodedata.decimal('六'))  # raises ValueError: '六' has no decimal value
4.0
1.0
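
The three functions differ in how strict they are: decimal() accepts only decimal digits, digit() also accepts digit-like characters such as superscripts, and numeric() accepts anything with a numeric value. A small sketch that compares them side by side, using the optional default argument to avoid ValueError; numeric_values is a hypothetical helper name:

```python
import unicodedata

def numeric_values(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

for ch in ['9', '²', '½', '四']:
    print(ch, numeric_values(ch))
# 9 (9, 9, 9.0)
# ² (None, 2, 2.0)
# ½ (None, None, 0.5)
# 四 (None, None, 4.0)
```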

Look-up By Name#

print(unicodedata.lookup('CJK UNIFIED IDEOGRAPH-6211'))
print(unicodedata.lookup('LEFT CURLY BRACKET'))
我
{
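
unicodedata.lookup() raises KeyError when the name is unknown. A minimal sketch of a forgiving wrapper, plus the equivalent \N{...} string escape; safe_lookup is a hypothetical helper name, not part of the module:

```python
import unicodedata

def safe_lookup(name, default=None):
    """Return the character with this Unicode name, or default if unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return default

print(safe_lookup('LEFT CURLY BRACKET'))        # {
print(safe_lookup('NO SUCH CHARACTER NAME'))    # None

# The \N{...} escape resolves the same names inside string literals.
print('\N{LEFT CURLY BRACKET}')                 # {

# name() and lookup() are inverses for named characters.
ch = '我'
print(unicodedata.lookup(unicodedata.name(ch)) == ch)  # True
```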

Unicode Category#

print(unicodedata.category('a'))
print(unicodedata.category('A'))
print(unicodedata.category('{'))
print(unicodedata.category('。'))
print(unicodedata.category('$'))
print(unicodedata.category('我'))
Ll
Lu
Ps
Po
Sc
Lo
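
The first letter of each category code gives the major class (L = letter, P = punctuation, S = symbol, Z = separator), and the second letter the subclass. As a rough sketch, the categories in a string can be tallied with collections.Counter; category_counts is a hypothetical helper name:

```python
import unicodedata
from collections import Counter

def category_counts(text):
    """Count how many characters of text fall in each Unicode category."""
    return Counter(unicodedata.category(c) for c in text)

counts = category_counts('Hi, 我!')
print(counts['Lu'], counts['Ll'], counts['Lo'])  # 1 1 1
print(counts['Po'], counts['Zs'])                # 2 1
```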

Normalization#

  • Normalization forms: NFD, NFC, NFKD, NFKC

  • Suggested use: NFKC

  • Meaning:

    • D = Decomposition (may change the length of the string)

    • C = Composition

    • K = Compatibility (may replace characters with their compatibility equivalents, e.g. full-width letters with ASCII letters)
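
A minimal sketch of what the three letters mean in practice, using an accented letter and a ligature written as explicit escapes so the code points are unambiguous. The sample characters are chosen here for illustration only:

```python
import unicodedata

composed = '\u00e9'      # 'é' as a single code point
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT

# D splits the accented letter into base + combining mark; C joins them again.
nfd = unicodedata.normalize('NFD', composed)
nfc = unicodedata.normalize('NFC', decomposed)
print(len(composed), len(nfd))   # 1 2
print(len(decomposed), len(nfc)) # 2 1

# K additionally replaces compatibility characters such as ligatures.
ligature = '\ufb01'              # LATIN SMALL LIGATURE FI
print(unicodedata.normalize('NFC', ligature) == ligature)  # True
print(unicodedata.normalize('NFKC', ligature))             # fi
```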

## Chinese characters mixed with full-width English letters and punctuation
text = '中英文abc,,。..ABC123'
print(unicodedata.normalize('NFKD', text))
print(unicodedata.normalize('NFKC', text))  # recommended method
print(unicodedata.normalize('NFC', text))
print(unicodedata.normalize('NFD', text))
中英文abc,,。..ABC123
中英文abc,,。..ABC123
中英文abc,,。..ABC123
中英文abc,,。..ABC123
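
Full-width and ASCII characters can be hard to tell apart on screen. The following sketch uses a hypothetical sample written with explicit escapes (full-width 'ABC', a full-width comma, then ASCII 'abc') and reports which forms actually change the string:

```python
import unicodedata

# Hypothetical sample: U+FF21..U+FF23 are full-width A, B, C;
# U+FF0C is the full-width comma.
sample = '\uff21\uff22\uff23\uff0cabc'

results = {
    form: unicodedata.normalize(form, sample)
    for form in ('NFC', 'NFD', 'NFKC', 'NFKD')
}
for form, result in results.items():
    # The last column shows whether the form left the string unchanged.
    print(form, result, result == sample)
```

Only the compatibility forms (NFKC and NFKD) map the full-width characters to 'ABC,'; NFC and NFD return the string unchanged.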
text = 'English characters with full-widths ABC。'

## Compatibility forms (NFKC/NFKD) map full-width letters to ASCII first,
## so the letters survive the ASCII encoding step
print(unicodedata.normalize('NFKC', text).encode('ascii', 'ignore').decode('ascii'))
print(unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii'))

## Canonical forms (NFC/NFD) leave full-width letters unchanged,
## so the ASCII encoding step drops them as ASCII-incompatible
print(unicodedata.normalize('NFC', text).encode('ascii', 'ignore').decode('ascii'))
print(unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode('ascii'))
English characters with full-widths ABC
English characters with full-widths ABC
English characters with full-widths 
English characters with full-widths 
text = 'Klüft skräms inför på fédéral électoral große'

unicodedata.normalize('NFKD', text).encode('ascii',
                                           'ignore').decode('utf-8', 'ignore')
'Kluft skrams infor pa federal electoral groe'
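
Note that the ASCII round trip above silently drops 'ß', and it would drop every non-Latin character as well. A hedged alternative sketch that removes only the combining marks and keeps everything else; strip_accents is a hypothetical helper name, and this is one possible approach rather than the notebook's method:

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks (accents) but keep all base characters."""
    # NFD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize('NFD', text)
    # combining() returns 0 for base characters and non-zero for marks.
    stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
    # Recompose anything that is still decomposed.
    return unicodedata.normalize('NFC', stripped)

print(strip_accents('Klüft skräms inför på fédéral électoral große'))
# Kluft skrams infor pa federal electoral große
print(strip_accents('fédéral 中文'))
# federal 中文
```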

Normalizing Texts#

text = "中文CHINESE。!==.= ^o^ 2020/5/20 alvin@gmal.cob@%&*"

# Remove punctuation (P*) and symbols (S*)
print(''.join(c for c in text if unicodedata.category(c)[0] not in ('P', 'S')))

# Keep letters only (L*)
print(''.join(c for c in text if unicodedata.category(c)[0] == 'L'))

# Remove cased letters (Lu and Ll), e.g. the Latin alphabet
print(''.join(c for c in text if unicodedata.category(c) not in ('Lu', 'Ll')))

# Keep "other letters" (Lo), which here selects the Chinese characters
print(''.join(c for c in text if unicodedata.category(c) == 'Lo'))
中文CHINESE o 2020520 alvingmalcob
中文CHINESEoalvingmalcob
中文。!==.= ^^ 2020/5/20 @.@%&*
中文

Note

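The Unicode category Lo (Letter, other) picks out the Chinese characters in this example, but it is not specific to Chinese: it also covers Japanese kana, Korean Hangul, Arabic and Hebrew letters, and other scripts without case.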
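
A minimal sketch of a narrower test, assuming that matching the character-name prefix 'CJK UNIFIED IDEOGRAPH' is an acceptable definition of a Chinese character (it ignores compatibility ideographs and radicals). is_cjk_ideograph is a hypothetical helper name, and the sample text is made up for illustration:

```python
import unicodedata

def is_cjk_ideograph(ch):
    """Return True if ch is named as a CJK unified ideograph."""
    # name() gets an empty default so unnamed characters simply fail the test.
    return unicodedata.name(ch, '').startswith('CJK UNIFIED IDEOGRAPH')

sample = '中文 かな 한글 ABC'

# Category Lo keeps Chinese, kana and Hangul alike.
by_category = ''.join(c for c in sample if unicodedata.category(c) == 'Lo')
# The name-based test keeps only the ideographs.
by_name = ''.join(c for c in sample if is_cjk_ideograph(c))

print(by_category)  # 中文かな한글
print(by_name)      # 中文
```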