Python: solving unicode hell with unidecode
I have been working on ways to flatten text into ASCII: ā -> a, ñ -> n, etc.
unidecode has been fantastic for this.
    # -*- coding: utf-8 -*-
    from unidecode import unidecode

    print(unidecode(u"ā, ī, ū, ś, ñ"))
    print(unidecode(u"estado de são paulo"))

produces:

    a, i, u, s, n
    estado de sao paulo

However, I can't duplicate this result with data from an input file.
Content of the test.txt file:

    ā, ī, ū, ś, ñ
    estado de são paulo

Script:

    # -*- coding: utf-8 -*-
    from unidecode import unidecode

    with open("test.txt", 'r') as inf:
        for line in inf:
            print unidecode(line.strip())

produces:
    a, a<<, a<<, a, a+-
    estado de sapso paulo

and:

    RuntimeWarning: Argument <type 'str'> is not an unicode object. Passing an encoded string will likely have unexpected results.
Question: how can I read these lines in as unicode so that I can pass them to unidecode?
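Use codecs.open. It decodes the file as UTF-8 while reading, so each line arrives as a unicode object instead of an encoded byte string:

    import codecs
    with codecs.open("test.txt", 'r', 'utf-8') as inf:

The rest of the loop stays the same.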
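For anyone who wants to try this end to end, here is a minimal sketch of the codecs.open approach. The helper name read_unicode_lines and the temporary copy of test.txt are my own additions for illustration, not part of unidecode or the original script. The unidecode call itself is left as a comment so the snippet runs with only the standard library.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def read_unicode_lines(path, encoding="utf-8"):
    """Return the stripped lines of `path`, decoded to unicode text.

    Hypothetical helper for illustration; not part of unidecode.
    """
    with codecs.open(path, "r", encoding) as inf:
        return [line.strip() for line in inf]


# Build a throwaway copy of test.txt so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "test.txt")
with codecs.open(path, "w", "utf-8") as out:
    out.write(u"ā, ī, ū, ś, ñ\nestado de são paulo\n")

lines = read_unicode_lines(path)

# The lines are decoded text now, so they compare equal to unicode
# literals instead of being raw UTF-8 bytes.
assert lines[0] == u"ā, ī, ū, ś, ñ"
assert lines[1] == u"estado de são paulo"

# With unidecode installed, each line can then be flattened:
#     from unidecode import unidecode
#     for line in lines:
#         print(unidecode(line))
```

On Python 3 the built-in open("test.txt", encoding="utf-8") should give the same decoded lines, so codecs.open is mainly needed on Python 2.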