Python: solving Unicode hell with unidecode
I have been working on ways to flatten text into ASCII: ā -> a, ñ -> n, etc. unidecode has been fantastic for this.
    # -*- coding: utf-8 -*-
    from unidecode import unidecode

    print(unidecode(u"ā, ī, ū, ś, ñ"))
    print(unidecode(u"estado de são paulo"))
Produces:

    a, i, u, s, n
    estado de sao paulo
However, I can't duplicate this result with data read from an input file.
Content of the test.txt file:
    ā, ī, ū, ś, ñ
    estado de são paulo
    # -*- coding: utf-8 -*-
    from unidecode import unidecode

    with open("test.txt", 'r') as inf:
        for line in inf:
            print unidecode(line.strip())
Produces:

    A, A<<, A<<, A, A+-
    estado de sAPSo paulo
And:

    RuntimeWarning: Argument <type 'str'> is not an unicode object. Passing an encoded string will likely have unexpected results.
Question: how can I read these lines in as Unicode so that I can pass them to unidecode?
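For context, the garbled output is consistent with the file's UTF-8 bytes being treated one byte at a time, as if each byte were its own Latin-1 character. The sketch below reproduces that mis-reading for ñ using only the standard library; it illustrates the likely cause and does not show unidecode's internals. The variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce the mis-decoding without any third-party library.
text = u"\u00f1"                     # ñ
raw = text.encode("utf-8")           # two bytes: 0xC3 0xB1
misread = raw.decode("latin-1")      # each byte taken as its own character
reread = raw.decode("utf-8")         # bytes decoded with the right codec

print(len(raw))          # number of bytes used for one character
print(repr(misread))     # two characters instead of one
print(reread == text)    # decoding as UTF-8 recovers the original
```

The two mis-read characters are Ã and ±, which matches the "A+-" that appeared in place of "n" in the output above.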
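Use codecs.open, which decodes the file as UTF-8 while reading, so each line arrives as a Unicode string:

    import codecs
    with codecs.open("test.txt", 'r', 'utf-8') as inf: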
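Put together with the loop from the question, the fix might look like the sketch below. It writes its own temporary copy of test.txt so it can run anywhere. The `flattened` list and the temporary path are illustrative, and the `except ImportError` branch is an assumption added only so the sketch runs without unidecode installed: it is a rough stand-in that strips combining accents, not a replacement for the library.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile
import unicodedata

try:
    from unidecode import unidecode  # third-party package
except ImportError:
    # Assumption: rough stdlib stand-in, only strips combining accents.
    def unidecode(text):
        decomposed = unicodedata.normalize("NFKD", text)
        return u"".join(ch for ch in decomposed
                        if not unicodedata.combining(ch))

# Write a throwaway UTF-8 file with the same content as test.txt.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with codecs.open(path, "w", "utf-8") as outf:
    outf.write(u"ā, ī, ū, ś, ñ\nestado de são paulo\n")

# Read it back as Unicode text, then flatten each line to ASCII.
flattened = []
with codecs.open(path, "r", "utf-8") as inf:
    for line in inf:
        flattened.append(unidecode(line.strip()))

for line in flattened:
    print(line)
```

On Python 3, the built-in `open("test.txt", encoding="utf-8")` should serve the same purpose as `codecs.open` here.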