★リバース辞郎★ 英辞郎の【Logophileバイナリ形式ファイル】を従来のテキスト形式に戻すためのツール

# coding=utf-8
#
# NOTE: Unfortunately the 英辞郎 maintainer seems to have stopped selling the
# Logophile version, apparently right after they found out that this
# conversion script had appeared... Hopefully it'll still be useful for those
# who already bought that version. I wish I could do something about this
# person's misguided way of thinking, but I can't.
#
# ★リバース辞郎 by リバースくん★
#
# Eijiro/英辞郎 Logophile binary to plain text converter.
# 英辞郎の【Logophileバイナリ形式ファイル】を従来のテキスト形式に戻すためのツール。
# Version: 0.2
#
# Requirements: Python 2.7
#
# Usage: python lpb2txt.py [-h] [-v] lpbdir [outfile]
# Examples:
#   python lpb2txt.py C:\temp\eijiro-sample out.txt
#   python lpb2txt.py -v C:\temp\eijiro > eijiro.txt
#   python lpb2txt.py C:\Users\myname\Documents\eijiro | grep "reverse"
#   python lpb2txt.py "C:\Documents and Settings\me\My Documents\ei" temp.txt
#
# This is free software.
# You can use, modify and distribute it however you like.
# これはフリーソフトです。自由に使ったり、配布したり、変更したりすることができます。
#
#
# VERSION HISTORY
#
# V0.2 (08/20/2015)
# Fixed a bug with headwords containing '\' messing up regex
# that wasn't really necessary anyway.
# Processing time is over 3 times faster now.
#
# V0.1 (08/19/2015)
# Initial version
#
#
# 日本語の説明は？
# 残念ながら、こんな技術的な説明を日本語で書くのはちょっと複雑すぎるので、ご了承よろしくお願い致します。
# Google Translateとか英辞郎とかを利用して理解することができなくはないと思います。m(＿ ＿)m
#
# WHAT IS THIS?
# This is a proof-of-concept Python 2.x script for converting
# Eijiro/英辞郎 dictionary data back to the original text format
# from the Logophile binary format it is now exclusivaly
# distributed in.
#
# WHY MAKE THIS?
# No offense to the 英辞郎 maintainer, whose hard work we all very
# much appreciate, but once someone buys your dictionary data,
# they should be able to use it for personal use however they like,
# without getting locked into a specific software program.
#
# IS THIS A PIRACY TOOL?
# This script has absolutely nothing to do with piracy. It is
# meant for those who have legally purchased their 英辞郎 data in
# the new binary format, but for example want to convert it to
# EPWING for use with EBPocket or a similar program.
#
# IS THIS WELL-TESTED?
# No, this has NOT been extensively tested, but it seems to work OK
# with the freely-provided eijiro-sample files.
# Tested ONLY with Python 2.7 under Windows, and at this point
# NOT tested AT ALL with the full version of binary Eijiro&Ryakujiro/
# 英辞郎&略辞郎, nor with binary Reijiro/例辞郎 or Waeijiro/和英辞郎
# (which isn't even available yet).
# Thus, it is entirely possible that some data may accidentally get
# substituted out during conversion, so USE AT YOUR OWN RISK.
# And, if you're at all capable, please fix the bugs you find and
# share the improved code.
#
# HOW CAN I KNOW THIS WORKS BEFORE I BUY THE BINARY 英辞郎?
# If you have an earlier version of 英辞郎, you can take the a-c
# section of your EIJI-???.TXT, append to it the A-C and a-c
# sections of your RYAKU???.TXT and run the resulting file through
#   python -c "import sys, re;data=sys.stdin.read();data = re.sub(u'｛.*?｝(?! : )', '', data, flags=re.U);sys.stdout.write(data)" < input.txt > output.txt
# on the command line to remove the yomigana (by the way, those are
# full-width ｛｝　braces in this regex, not ASCII {}).
# Compare the result (using WinMerge, for example) with what you get
# using this here script on eijiro-sample, and you should see no
# differences other than the deliberate edits the maintainer has
# made since your version.
#
# HOW LONG DOES THE CONVERSION TAKE? HOW MUCH MEMORY?
# Converting the a-c sample takes roughly 20 seconds on my test PC,
# so you can probably estimate 3-4 minutes for the full 英辞郎
# on most PCs made within the last 6-7 years.
# Available memory requirement should be around the combined sizes
# of your entry.dat and data.lpb plus ~15-20 MB, depending on
# your Python interpreter.
#
#
# TECHNICAL STUFF
#
# 英辞郎 Logophile binary distribution consists of two files,
# 'entry.dat', which is a sort of an index file, and 'data.lpb',
# which contains the dictionary data entries, each one
# individually zlib compressed. Aside from "hiding" a few initial
# bytes of the compressed data streams via hard coding and the
# index file, there is no data protection used. To reiterate,
# NO ENCRYPTION of any kind is used, so this can be considered
# a straight-forward conversion from an undocumented format.
#
#
# entry.dat contains N 14-byte index records
#
# Offset Bytes Contents
#  0     4     data record offset in data.lpb (little-endian)
#  4     4     P, data record packed size (LE)
#  8     4     U, data record unpacked size (LE)
# 12     2     extra bytes to prepend for zlib decompression
#
# To decompress, we need to prepend 78h, 9Ch and these two extra
# bytes to the data record.
#
#
# data.lpb contains N P-byte data records that decompress to
# U bytes in size.
#
# Offset Bytes  Contents
#  0     4      01000000 (LE, there might be other types)
#  4     4      Ilen, length of text for indexing (LE)
#  8     Ilen   headword(s) text for indexing (don't need this)
#  ...   4      00000000 (LE, there might be other types)
#  ...   4      Hlen, length of entry headword(s) (LE)
#  ...   Hlen   headword(s) text for display
#  ...   4      Clen, length of entry content (LE)
#  ...   Clen   entry content text
#
# Unlike with the 英辞郎 text format, there is only one
# multi-line HTML-ified entry per headword. The examples are on
# their own lines instead of being separated by '■・'. Fortunately the
# '{名-1} :'-style identifiers remain intact. The text is in UTF-8.
#
# Long story short, we need to strip the HTML, reattach the examples,
# prepend the '■' headword(s) to each line and convert to Shift-JIS.

from __future__ import print_function
import argparse
import os
import re
import struct
import sys

__version__ = '0.2'

# check that we're properly pointed to a dictionary
def valid_lpb_dir(lpb_dir):
  if not os.path.isdir(lpb_dir):
    msg = '{0} is not a directory'.format(lpb_dir)
    raise argparse.ArgumentTypeError(msg)
  global entryfile
  entryfile = os.path.join(lpb_dir, 'entry.dat')
  if not os.path.exists(entryfile):
    msg = 'No file entry.dat in directory {0}'.format(lpb_dir)
    raise argparse.ArgumentTypeError(msg)
  global datafile
  datafile = os.path.join(lpb_dir, 'data.lpb')
  if not os.path.exists(datafile):
    msg = 'No file data.lpb in directory {0}'.format(lpb_dir)
    raise argparse.ArgumentTypeError(msg)
  return lpb_dir

# parse command line parameters
desc = u'★リバース辞郎 by リバースくん★ V{0}\n'\
       u'英辞郎 Logophile binary to plain '\
       u'text converter'.format(__version__)
parser = argparse.ArgumentParser(description=desc,
                               formatter_class=argparse.RawTextHelpFormatter)
parser.add_argument('-v', '--verbose', action='store_true',
                    help='print some informational messages (to stderr)')
parser.add_argument('lpbdir', type=valid_lpb_dir,
                    help='directory with Logophile binary dictionary files')
parser.add_argument('outfile', nargs='?', type=argparse.FileType('wb'),
                    default=sys.stdout,
                    help='output file (defaults to stdout)')
if len(sys.argv) < 2:
  parser.print_help()
  sys.exit(1)
args = parser.parse_args()

if args.outfile is sys.stdout:
  if sys.platform == 'win32':
      import msvcrt
      msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)

with open(entryfile, 'rb') as f:
  entry = f.read()
with open(datafile, 'rb') as f:
  data = f.read()

prepend1 = '\x78\x9C' # for zlib
entrypos = 0
nument = len(entry)/14
if args.verbose:
  print('Found Logophile binary dictionary in '\
        '\"{0}\"'.format(args.lpbdir), file=sys.stderr)
  print('Processing {0} records\n'.format(nument), file=sys.stderr)

for num in range(nument):
  datapos = struct.unpack('I', entry[entrypos:entrypos+4])[0]
  packedsz = struct.unpack('I', entry[entrypos+4:entrypos+8])[0]
  unpacksz = struct.unpack('I', entry[entrypos+8:entrypos+12])[0]
  prepend2 = entry[entrypos+12:entrypos+14] # for zlib
  entrypos += 14

  packed = data[datapos:datapos+packedsz]
  item = ''.join([prepend1, prepend2, packed]).decode('zlib') # decompress

  itempos = 0
  # itemdummy1 = struct.unpack('I', item[itempos:itempos+4])[0] # don't need
  itempos += 4
  itemidxlen = struct.unpack('I', item[itempos:itempos+4])[0]
  itempos += 4
  # itemidxtxt = item[itempos:itempos+itemidxlen] # don't need
  itempos += itemidxlen
  # itemdummy2 = struct.unpack('I', item[itempos:itempos+4])[0] # don't need
  itempos += 4
  itemheadlen = struct.unpack('I', item[itempos:itempos+4])[0]
  itempos += 4
  itemheadtxt = item[itempos:itempos+itemheadlen]
  itempos += itemheadlen
  itemtxtlen = struct.unpack('I', item[itempos:itempos+4])[0]
  itempos += 4
  itemtxt = item[itempos:itempos+itemtxtlen]
  itempos += itemtxtlen
  if itempos <> unpacksz:
    print('Error unpacking entry', num, file=sys.stderr)
    break

  itemheadtxt = itemheadtxt.decode('utf-8')
  itemtxt = itemtxt.decode('utf-8')

  # reattach the examples
  itemtxt = itemtxt.replace(u'\r\n・', u'■・')
  # strip HTML tags
  itemtxt = itemtxt.replace('<p>\r\n', '')
  itemtxt = itemtxt.replace('</p>\r\n', '')
  itemtxt = itemtxt.replace('<br />', '')
  itemtxt = re.sub('<a href=.*?>', '', itemtxt)
  itemtxt = itemtxt.replace('</a>', '')
  # re-convert HTML entities
  itemtxt = itemtxt.replace('&quot;', '\"')
  itemtxt = itemtxt.replace('&lt;', '<')
  itemtxt = itemtxt.replace('&gt;', '>')
  itemtxt = itemtxt.replace('&amp;', '&')

  for line in itemtxt.split('\r\n'):
    if line:
      # prepend the headword(s) to each line
      # the {名-1}-style lines already have the colon separator
      if line[0] == '{':
        line = ''.join([u'■', itemheadtxt, '  ', line, '\r\n'])
      else:
        line = ''.join([u'■', itemheadtxt, ' : ', line, '\r\n'])

      while (1):
        try:
          ln = line.encode('cp932').decode('shift_jis').encode('shift_jis')
          # pass through cp932 or characters like '～' will have problems.
          # (this is more convenient than doing individual substitutions)
          # see http://tanakahisateru.hatenablog.jp/entry/20080728/1217216409
          #     http://d.hatena.ne.jp/hirothin/20080819/1219123920
        except UnicodeDecodeError as ude:
          # turns out the 英辞郎 source contains a few characters not in
          # JIS X 0208 that make the shift_jis decoder choke.
          # this code assumes we only get problems duting shift_jis decoding,
          # so start/end are based on ude.object being cp932 from above.
          prob = ude.object
          repl = (ude.object[ude.start:ude.end]).decode('cp932')
          repl = repl.encode('unicode_escape').replace('\\u','U+').upper()
          prob = '{0}[{1}]{2}'.format(prob[:ude.start], repl, prob[ude.end:])
          line = prob.decode('cp932')
          if args.verbose:
            # print('Encoding problem:', ude, file=sys.stderr)
            print('Replacing gaiji \'{0}\' with Unicode notation placeholder'\
                  ' \'[{1}]\''.format(ude.object[ude.start:ude.end], repl),
                  file=sys.stderr)
            print('Text:', ude.object, file=sys.stderr)
          continue
        else:
          # should be JIS X 0208 compliant now
          args.outfile.write(ln)
          break
  #