Python: solving unicode hell with unidecode


I have been working on ways to flatten text to ASCII: ā -> a, ñ -> n, etc.

unidecode has been fantastic for this.

    # -*- coding: utf-8 -*-
    from unidecode import unidecode

    print(unidecode(u"ā, ī, ū, ś, ñ"))
    print(unidecode(u"estado de são paulo"))

produces:

    a, i, u, s, n
    estado de sao paulo

However, I can't duplicate this result with data read from an input file.

Content of the test.txt file:

    ā, ī, ū, ś, ñ
    estado de são paulo

    # -*- coding: utf-8 -*-
    from unidecode import unidecode

    with open("test.txt", 'r') as inf:
        for line in inf:
            print unidecode(line.strip())

produces:

    A, A<<, A<<, A, A+-
    estado de sAPSo paulo

and:

    RuntimeWarning: Argument <type 'str'> is not an unicode object. Passing an encoded string will likely have unexpected results.

Question: how can I read these lines in as unicode so that I can pass them to unidecode?
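The garbled output is consistent with unidecode receiving the file's raw bytes rather than decoded characters. The sketch below illustrates that, assuming test.txt is saved as UTF-8 (the post does not state the encoding, so this is an assumption) and using illustrative variable names. In UTF-8, ñ is stored as two bytes, and when each byte is treated as a character of its own you get Ã and ±, which lines up with the "+-" in the output above.

```python
# -*- coding: utf-8 -*-
# Sketch: what a byte-oriented reader sees for one accented character.
# Assumption: the file is UTF-8 encoded.
text = u"ñ"

# The bytes that would be stored on disk for this character.
raw = text.encode("utf-8")

# Treating every byte as its own character (Latin-1 maps bytes 1:1 to code points).
as_single_bytes = raw.decode("latin-1")

print(len(text))              # 1 character
print(len(raw))               # 2 bytes
print(repr(as_single_bytes))  # two unrelated characters instead of one
```

Decoding the bytes as UTF-8 before calling unidecode avoids this, which is what the warning is asking for.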

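Use codecs.open in place of open, leaving the loop body as it was. It decodes each line to unicode as it is read:

    import codecs

    with codecs.open("test.txt", 'r', 'utf-8') as inf: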
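For completeness, here is one way the whole script might look once the file is decoded on read. This is a sketch rather than the answer's exact code: it uses io.open instead of codecs.open (both decode while reading, and on Python 3 the built-in open is io.open), and the helper names read_unicode_lines and flatten_file are made up for illustration. It again assumes the file is UTF-8.

```python
# -*- coding: utf-8 -*-
import io


def read_unicode_lines(path, encoding="utf-8"):
    """Return the stripped lines of `path`, decoded to unicode text."""
    # io.open decodes while reading, so each line is text rather than raw bytes.
    with io.open(path, "r", encoding=encoding) as inf:
        return [line.strip() for line in inf]


def flatten_file(path, encoding="utf-8"):
    """Return each line of `path` transliterated to ASCII by unidecode."""
    # unidecode is a third-party package; it is imported here so that
    # read_unicode_lines stays usable without it.
    from unidecode import unidecode

    return [unidecode(line) for line in read_unicode_lines(path, encoding)]
```

With unidecode installed, printing each item of flatten_file("test.txt") should give the same two lines as the first example, with no RuntimeWarning.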
