reversekun

★リバース辞郎★ A tool for converting 英辞郎 [Logophile binary format] files back to the conventional text format

Aug 19th, 2015
# coding=utf-8
#
# NOTE: Unfortunately the 英辞郎 maintainer seems to have stopped selling the
# Logophile version, apparently right after they found out that this
# conversion script had appeared... Hopefully it'll still be useful for those
# who already bought that version. I wish I could do something about this
# person's misguided way of thinking, but I can't.
#
# ★リバース辞郎 by リバースくん★
#
# Eijiro/英辞郎 Logophile binary to plain text converter.
# A tool for converting 英辞郎 [Logophile binary format] files back to the conventional text format.
# Version: 0.2
#
# Requirements: Python 2.7
#
# Usage: python lpb2txt.py [-h] [-v] lpbdir [outfile]
# Examples:
#   python lpb2txt.py C:\temp\eijiro-sample out.txt
#   python lpb2txt.py -v C:\temp\eijiro > eijiro.txt
#   python lpb2txt.py C:\Users\myname\Documents\eijiro | grep "reverse"
#   python lpb2txt.py "C:\Documents and Settings\me\My Documents\ei" temp.txt
#
# This is free software.
# You can use, modify and distribute it however you like.
#
#
# VERSION HISTORY
#
# V0.2 (08/20/2015)
# Fixed a bug where headwords containing '\' messed up a regex
# that wasn't really necessary anyway.
# Processing time is over 3 times faster now.
#
# V0.1 (08/19/2015)
# Initial version
#
#
# WHERE IS THE JAPANESE EXPLANATION?
# Unfortunately, a technical explanation like this is a bit too complicated
# to write in Japanese, so please bear with me.
# You should be able to make sense of it with the help of Google Translate
# or 英辞郎 itself. m(_ _)m
#
# WHAT IS THIS?
# This is a proof-of-concept Python 2.x script for converting
# Eijiro/英辞郎 dictionary data back to the original text format
# from the Logophile binary format it is now exclusively
# distributed in.
#
# WHY MAKE THIS?
# No offense to the 英辞郎 maintainer, whose hard work we all very
# much appreciate, but once someone buys your dictionary data,
# they should be able to use it for personal use however they like,
# without getting locked into a specific software program.
#
# IS THIS A PIRACY TOOL?
# This script has absolutely nothing to do with piracy. It is
# meant for those who have legally purchased their 英辞郎 data in
# the new binary format, but for example want to convert it to
# EPWING for use with EBPocket or a similar program.
#
# IS THIS WELL-TESTED?
# No, this has NOT been extensively tested, but it seems to work OK
# with the freely-provided eijiro-sample files.
# Tested ONLY with Python 2.7 under Windows, and at this point
# NOT tested AT ALL with the full version of binary Eijiro&Ryakujiro/
# 英辞郎&略辞郎, nor with binary Reijiro/例辞郎 or Waeijiro/和英辞郎
# (which isn't even available yet).
# Thus, it is entirely possible that some data may accidentally get
# substituted out during conversion, so USE AT YOUR OWN RISK.
# And, if you're at all capable, please fix the bugs you find and
# share the improved code.
#
# HOW CAN I KNOW THIS WORKS BEFORE I BUY THE BINARY 英辞郎?
# If you have an earlier version of 英辞郎, you can take the a-c
# section of your EIJI-???.TXT, append to it the A-C and a-c
# sections of your RYAKU???.TXT and run the resulting file through
#   python -c "import sys, re;data=sys.stdin.read();data = re.sub(u'｛.*?｝(?! : )', '', data, flags=re.U);sys.stdout.write(data)" < input.txt > output.txt
# on the command line to remove the yomigana (by the way, those are
# full-width ｛｝ braces in this regex, not ASCII {}).
# Compare the result (using WinMerge, for example) with what you get
# using this here script on eijiro-sample, and you should see no
# differences other than the deliberate edits the maintainer has
# made since your version.
#
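As a quick sanity check of that one-liner, the same substitution can be tried on a single line. This is only a sketch; the sample line is invented, and only the regex itself comes from the command above:

```python
# -*- coding: utf-8 -*-
import re

# Invented old-format sample line: the reading ｛じしょ｝ uses full-width
# braces and is not followed by ' : ', so the regex removes it.
line = u'■dictionary : 辞書｛じしょ｝'
stripped = re.sub(u'｛.*?｝(?! : )', u'', line, flags=re.U)
# stripped is now u'■dictionary : 辞書'
```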
# HOW LONG DOES THE CONVERSION TAKE? HOW MUCH MEMORY?
# Converting the a-c sample takes roughly 20 seconds on my test PC,
# so you can probably estimate 3-4 minutes for the full 英辞郎
# on most PCs made within the last 6-7 years.
# Available memory requirement should be around the combined sizes
# of your entry.dat and data.lpb plus ~15-20 MB, depending on
# your Python interpreter.
#
#
# TECHNICAL STUFF
#
# The 英辞郎 Logophile binary distribution consists of two files:
# 'entry.dat', which is a sort of index file, and 'data.lpb',
# which contains the dictionary data entries, each one
# individually zlib compressed. Aside from "hiding" a few initial
# bytes of the compressed data streams via hard coding and the
# index file, there is no data protection used. To reiterate,
# NO ENCRYPTION of any kind is used, so this can be considered
# a straightforward conversion from an undocumented format.
#
#
# entry.dat contains N 14-byte index records:
#
# Offset Bytes Contents
#  0     4     data record offset in data.lpb (little-endian)
#  4     4     P, data record packed size (LE)
#  8     4     U, data record unpacked size (LE)
# 12     2     extra bytes to prepend for zlib decompression
#
# To decompress, we need to prepend 78h, 9Ch and these two extra
# bytes to the data record.
#
#
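The index layout above can be read with `struct` like so (a minimal sketch with a fabricated record; the variable names are mine):

```python
import struct

# Fabricated 14-byte entry.dat record: offset 0x100, packed size 0x40,
# unpacked size 0x80, plus the two "hidden" zlib bytes.
record = struct.pack('<III', 0x100, 0x40, 0x80) + b'\x01\x02'

offset, packed_size, unpacked_size = struct.unpack('<III', record[:12])
extra = record[12:14]
zlib_prefix = b'\x78\x9c' + extra  # prepend this to the data.lpb record
# offset, packed_size, unpacked_size == 256, 64, 128
```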
# data.lpb contains N P-byte data records that decompress to
# U bytes in size.
#
# Offset Bytes  Contents
#  0     4      01000000 (LE, there might be other types)
#  4     4      Ilen, length of text for indexing (LE)
#  8     Ilen   headword(s) text for indexing (don't need this)
#  ...   4      00000000 (LE, there might be other types)
#  ...   4      Hlen, length of entry headword(s) (LE)
#  ...   Hlen   headword(s) text for display
#  ...   4      Clen, length of entry content (LE)
#  ...   Clen   entry content text
#
# Unlike with the 英辞郎 text format, there is only one
# multi-line HTML-ified entry per headword. The examples are on
# their own lines instead of being separated by '■・'. Fortunately the
# '{名-1} :'-style identifiers remain intact. The text is in UTF-8.
#
# Long story short, we need to strip the HTML, reattach the examples,
# prepend the '■' headword(s) to each line and convert to Shift-JIS.
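Combining the two layouts, one record's decompress-and-parse step can be sketched end to end. The record here is fabricated (headword, content and the simulated byte-hiding are illustrative only; the '<I' fields and the 78h/9Ch prefix follow the notes above):

```python
# -*- coding: utf-8 -*-
import struct
import zlib

# Build a fabricated data.lpb record and compress it.
head = u'example'.encode('utf-8')
body = u'{名-1} : 例\r\n・This is an example.\r\n'.encode('utf-8')
raw = (struct.pack('<I', 1) +
       struct.pack('<I', len(head)) + head +  # indexing text (ignored)
       struct.pack('<I', 0) +
       struct.pack('<I', len(head)) + head +  # display headword(s)
       struct.pack('<I', len(body)) + body)   # entry content
packed = zlib.compress(raw)

# The real files store the stream minus its first four bytes: 78h 9Ch are
# hard coded, and the next two live in entry.dat. Simulate that split.
extra, stream = packed[2:4], packed[4:]
item = zlib.decompress(b'\x78\x9c' + extra + stream)

pos = 4                                              # skip type field
(ilen,) = struct.unpack('<I', item[pos:pos+4]); pos += 4 + ilen
pos += 4                                             # skip second type field
(hlen,) = struct.unpack('<I', item[pos:pos+4]); pos += 4
headword = item[pos:pos+hlen].decode('utf-8'); pos += hlen
(clen,) = struct.unpack('<I', item[pos:pos+4]); pos += 4
content = item[pos:pos+clen].decode('utf-8')
# headword == u'example'
```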

from __future__ import print_function
import argparse
import os
import re
import struct
import sys

__version__ = '0.2'

# check that we're properly pointed to a dictionary
def valid_lpb_dir(lpb_dir):
  if not os.path.isdir(lpb_dir):
    msg = '{0} is not a directory'.format(lpb_dir)
    raise argparse.ArgumentTypeError(msg)
  global entryfile
  entryfile = os.path.join(lpb_dir, 'entry.dat')
  if not os.path.exists(entryfile):
    msg = 'No file entry.dat in directory {0}'.format(lpb_dir)
    raise argparse.ArgumentTypeError(msg)
  global datafile
  datafile = os.path.join(lpb_dir, 'data.lpb')
  if not os.path.exists(datafile):
    msg = 'No file data.lpb in directory {0}'.format(lpb_dir)
    raise argparse.ArgumentTypeError(msg)
  return lpb_dir

# parse command line parameters
desc = u'★リバース辞郎 by リバースくん★ V{0}\n'\
       u'英辞郎 Logophile binary to plain '\
       u'text converter'.format(__version__)
parser = argparse.ArgumentParser(description=desc,
                                 formatter_class=argparse.RawTextHelpFormatter)
parser.add_argument('-v', '--verbose', action='store_true',
                    help='print some informational messages (to stderr)')
parser.add_argument('lpbdir', type=valid_lpb_dir,
                    help='directory with Logophile binary dictionary files')
parser.add_argument('outfile', nargs='?', type=argparse.FileType('wb'),
                    default=sys.stdout,
                    help='output file (defaults to stdout)')
if len(sys.argv) < 2:
  parser.print_help()
  sys.exit(1)
args = parser.parse_args()

if args.outfile is sys.stdout:
  if sys.platform == 'win32':
    import msvcrt
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)

with open(entryfile, 'rb') as f:
  entry = f.read()
with open(datafile, 'rb') as f:
  data = f.read()

prepend1 = '\x78\x9C'  # zlib stream header bytes, hard coded by Logophile
entrypos = 0
nument = len(entry) // 14  # 14-byte index records
if args.verbose:
  print('Found Logophile binary dictionary in '\
        '\"{0}\"'.format(args.lpbdir), file=sys.stderr)
  print('Processing {0} records\n'.format(nument), file=sys.stderr)

for num in range(nument):
  datapos = struct.unpack('I', entry[entrypos:entrypos+4])[0]
  packedsz = struct.unpack('I', entry[entrypos+4:entrypos+8])[0]
  unpacksz = struct.unpack('I', entry[entrypos+8:entrypos+12])[0]
  prepend2 = entry[entrypos+12:entrypos+14] # for zlib
  entrypos += 14

  packed = data[datapos:datapos+packedsz]
  item = ''.join([prepend1, prepend2, packed]).decode('zlib') # decompress

  itempos = 0
  # itemdummy1 = struct.unpack('I', item[itempos:itempos+4])[0] # don't need
  itempos += 4
  itemidxlen = struct.unpack('I', item[itempos:itempos+4])[0]
  itempos += 4
  # itemidxtxt = item[itempos:itempos+itemidxlen] # don't need
  itempos += itemidxlen
  # itemdummy2 = struct.unpack('I', item[itempos:itempos+4])[0] # don't need
  itempos += 4
  itemheadlen = struct.unpack('I', item[itempos:itempos+4])[0]
  itempos += 4
  itemheadtxt = item[itempos:itempos+itemheadlen]
  itempos += itemheadlen
  itemtxtlen = struct.unpack('I', item[itempos:itempos+4])[0]
  itempos += 4
  itemtxt = item[itempos:itempos+itemtxtlen]
  itempos += itemtxtlen
  if itempos != unpacksz:
    print('Error unpacking entry', num, file=sys.stderr)
    break

  itemheadtxt = itemheadtxt.decode('utf-8')
  itemtxt = itemtxt.decode('utf-8')

  # reattach the examples
  itemtxt = itemtxt.replace(u'\r\n・', u'■・')
  # strip HTML tags
  itemtxt = itemtxt.replace('<p>\r\n', '')
  itemtxt = itemtxt.replace('</p>\r\n', '')
  itemtxt = itemtxt.replace('<br />', '')
  itemtxt = re.sub('<a href=.*?>', '', itemtxt)
  itemtxt = itemtxt.replace('</a>', '')
  # re-convert HTML entities
  itemtxt = itemtxt.replace('&quot;', '\"')
  itemtxt = itemtxt.replace('&lt;', '<')
  itemtxt = itemtxt.replace('&gt;', '>')
  itemtxt = itemtxt.replace('&amp;', '&')

  for line in itemtxt.split('\r\n'):
    if line:
      # prepend the headword(s) to each line
      # the {名-1}-style lines already have the colon separator
      if line[0] == '{':
        line = ''.join([u'■', itemheadtxt, '  ', line, '\r\n'])
      else:
        line = ''.join([u'■', itemheadtxt, ' : ', line, '\r\n'])

      while True:
        try:
          ln = line.encode('cp932').decode('shift_jis').encode('shift_jis')
          # pass through cp932 or characters like '~' will have problems.
          # (this is more convenient than doing individual substitutions)
          # see http://tanakahisateru.hatenablog.jp/entry/20080728/1217216409
          #     http://d.hatena.ne.jp/hirothin/20080819/1219123920
        except UnicodeDecodeError as ude:
          # turns out the 英辞郎 source contains a few characters not in
          # JIS X 0208 that make the shift_jis decoder choke.
          # this code assumes we only get problems during shift_jis decoding,
          # so start/end are based on ude.object being cp932 from above.
          prob = ude.object
          repl = (ude.object[ude.start:ude.end]).decode('cp932')
          repl = repl.encode('unicode_escape').replace('\\u','U+').upper()
          prob = '{0}[{1}]{2}'.format(prob[:ude.start], repl, prob[ude.end:])
          line = prob.decode('cp932')
          if args.verbose:
            # print('Encoding problem:', ude, file=sys.stderr)
            print('Replacing gaiji \'{0}\' with Unicode notation placeholder'\
                  ' \'[{1}]\''.format(ude.object[ude.start:ude.end], repl),
                  file=sys.stderr)
            print('Text:', ude.object, file=sys.stderr)
          continue
        else:
          # should be JIS X 0208 compliant now
          args.outfile.write(ln)
          break
  #
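The cp932/shift_jis round-trip in the inner loop above can be illustrated in isolation. This sketch (the sample character is chosen by me, not taken from the dictionary data) shows why passing through cp932 first matters:

```python
# FULLWIDTH TILDE (U+FF5E) is what Windows/cp932 text actually contains
# where strict Shift-JIS/JIS X 0208 has WAVE DASH (U+301C). Encoding via
# cp932 first and then re-decoding as shift_jis normalizes such
# characters; anything with no JIS X 0208 reading raises an error instead.
s = u'\uff5e'                   # FULLWIDTH TILDE, '～'
b = s.encode('cp932')           # maps to bytes 81 60
normalized = b.decode('shift_jis')
# normalized == u'\u301c' (WAVE DASH)

# Encoding straight to strict shift_jis would fail instead:
try:
    s.encode('shift_jis')
    straight_ok = True
except UnicodeEncodeError:
    straight_ok = False
# straight_ok is False
```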