evandrix

get_encoding.py

Sep 12th, 2011
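#!/usr/bin/env python3
# Report URLs whose response body contains non-ASCII bytes, together with
# the encoding declared in the document (e.g. in an XML prolog).
import re

import requests

# Matches an encoding="..." declaration in the raw response bytes.
ENCODING_RE = re.compile(rb'encoding="([^"]+)"')

with open('/media/data/rss.csv/wrong_encoding_urls.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        content = requests.get(line).content  # raw bytes, not decoded text
        match = ENCODING_RE.search(content)
        # None when the body carries no encoding declaration.
        encoding = match.group(1).decode('ascii', 'replace') if match else None
        # Iterating over bytes yields ints; default=0 covers an empty body.
        max_value = max(content, default=0)
        if max_value > 127:
            print(line, encoding, max_value)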
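
The per-URL check above can be tried without network access by factoring it into a function that works on raw bytes. This is only a sketch: `check_content`, `ENCODING_RE`, and the sample byte string are hypothetical names and data introduced here for illustration, and the code assumes Python 3.

```python
import re

# Matches an encoding="..." declaration, as in an XML prolog.
ENCODING_RE = re.compile(rb'encoding="([^"]+)"')


def check_content(content):
    """Return (declared_encoding, max_byte) for a raw response body.

    declared_encoding is None when no declaration is found;
    max_byte is 0 for an empty body.
    """
    match = ENCODING_RE.search(content)
    encoding = match.group(1).decode('ascii', 'replace') if match else None
    max_byte = max(content, default=0)
    return encoding, max_byte


# Hypothetical sample: declares UTF-8 but contains a lone 0xE9 byte
# (e-acute in Latin-1), which is not valid UTF-8 on its own.
sample = b'<?xml version="1.0" encoding="utf-8"?><t>caf\xe9</t>'
encoding, max_byte = check_content(sample)
if max_byte > 127:
    print(encoding, max_byte)  # prints: utf-8 233
```

A maximum byte value above 127 only shows that the body is not pure ASCII; it does not by itself prove that the declared encoding is wrong.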