- Cannot write text as UTF-8 to file using Python
- >>> print r.info()
- Content-Type: text/html; charset=ISO-8859-1
- Connection: close
- Cache-Control: no-cache
- Date: Sun, 20 Feb 2011 15:16:31 GMT
- Server: Apache/2.0.40 (Red Hat Linux)
- X-Accel-Cache-Control: no-cache
- <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
- with open('../results/1.html','r') as f:
-     page = f.read()
- ...
- with open('../parsed.txt','w') as f:
-     for key in fieldD:
-         f.write(key+'\t'+fieldD[key]+'\n')
- import codecs
- with codecs.open('../results/1.html','r','utf-8') as f:
-     page = f.read()
- ...
- with codecs.open('../parsed.txt','w','utf-8') as f:
-     for key in fieldD:
-         f.write(key+'\t'+fieldD[key]+'\n')
- with codecs.open('../results/1.html','r','iso_8859_1') as f:
-     page = f.read()
- ...
- with codecs.open('../parsed.txt','w','utf-8') as f:
-     for key in fieldD:
-         f.write(key+'\t'+fieldD[key]+'\n')
- >>> from unicodedata import name
- >>> oacute = u"\xf3"
- >>> print name(oacute)
- LATIN SMALL LETTER O WITH ACUTE
- >>> guff = oacute.encode('utf8').decode('latin1').encode('utf8')
- >>> guff
- '\xc3\x83\xc2\xb3'
- >>> for c in guff.decode('macroman'):
- ...     print name(c)
- ...
- SQUARE ROOT
- LATIN CAPITAL LETTER E WITH ACUTE
- NOT SIGN
- GREATER-THAN OR EQUAL TO
- >>>
- >>> data = open('g0.htm', 'rb').read()
- >>> uc = data.decode('utf8')
- Traceback (most recent call last):
-   File "<stdin>", line 1, in <module>
-   File "c:\python27\lib\encodings\utf_8.py", line 16, in decode
-     return codecs.utf_8_decode(input, errors, True)
- UnicodeDecodeError: 'utf8' codec can't decode byte 0xb7 in position 1130: invalid start byte
- >>> pos = data.find("Iglesia Cat")
- >>> data[pos:pos+20]
- 'Iglesia Cat\xf3lica</a>'
- >>> # Looks like one of ISO-8859-1 and its cousins to me.
- >>> import urllib2
- >>> url = 'http://213.97.164.119/ABSYS/abwebp.cgi/X5104/ID31295/G0?ACC=DCT1'
- >>> data = urllib2.urlopen(url).read()[4016:4052]; data
- 'Iglesia+Cat%f3lica">Iglesia Cat\xf3lica'
- >>> data.decode('latin-1')
- u'Iglesia+Cat%f3lica">Iglesia Cat\xf3lica'
- >>> data.decode('latin-1').encode('utf-8')
- 'Iglesia+Cat%f3lica">Iglesia Cat\xc3\xb3lica'
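
Putting the pieces above together, here is a minimal sketch of the whole round trip: decode the raw bytes with the encoding the server actually uses, then write the text out as UTF-8. It assumes the page really is Latin-1, as the HTTP header and the `\xf3` byte suggest. The helper names `transcode_to_utf8` and `write_fields`, the `title` key, and the temporary output path are made up for illustration and are not from the original code. It uses `io.open` so that it runs on both Python 2 and Python 3.

```python
import io
import os
import tempfile


def transcode_to_utf8(raw_bytes, source_encoding="latin-1"):
    """Decode bytes with the assumed source encoding, return UTF-8 bytes."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


def write_fields(path, fields):
    """Write key<TAB>value lines as UTF-8. Values must already be unicode text."""
    # newline="\n" keeps the line ending identical on every platform.
    with io.open(path, "w", encoding="utf-8", newline="\n") as f:
        for key in sorted(fields):
            f.write(u"%s\t%s\n" % (key, fields[key]))


# Bytes as the site appears to serve them (Latin-1, assumed).
raw = b"Iglesia Cat\xf3lica"
text = raw.decode("latin-1")          # unicode text from here on
utf8_bytes = transcode_to_utf8(raw)   # same text, re-encoded as UTF-8

# Hypothetical output location; the original wrote to '../parsed.txt'.
out_path = os.path.join(tempfile.mkdtemp(), "parsed.txt")
write_fields(out_path, {u"title": text})

# Read back in binary mode to inspect the bytes that landed on disk.
with open(out_path, "rb") as f:
    written = f.read()
```

The point of the sketch is that decoding happens once, at the boundary where the bytes come in, and encoding happens once, where the text goes out. Decoding with the encoding named in the `<meta>` tag instead of the real one is what leads to the `UnicodeDecodeError` or to double-encoded output.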