# coding=utf-8
#
# NOTE: Unfortunately the 英辞郎 maintainer seems to have stopped selling the
# Logophile version, apparently right after they found out that this
# conversion script had appeared... Hopefully it'll still be useful for those
# who already bought that version. I wish I could do something about this
# person's misguided way of thinking, but I can't.
#
# ★リバース辞郎 by リバースくん★
#
# Eijiro/英辞郎 Logophile binary to plain text converter.
# A tool for turning 英辞郎's Logophile binary format files back into the
# conventional text format.
# Version: 0.2
#
# Requirements: Python 2.7
#
# Usage: python lpb2txt.py [-h] [-v] lpbdir [outfile]
# Examples:
#   python lpb2txt.py C:\temp\eijiro-sample out.txt
#   python lpb2txt.py -v C:\temp\eijiro > eijiro.txt
#   python lpb2txt.py C:\Users\myname\Documents\eijiro | grep "reverse"
#   python lpb2txt.py "C:\Documents and Settings\me\My Documents\ei" temp.txt
#
# This is free software.
# You can use, modify and distribute it however you like.
# You may freely use, distribute, and modify it.
#
#
# VERSION HISTORY
#
# V0.2 (08/20/2015)
#   Fixed a bug where headwords containing '\' messed up a regex
#   that wasn't really necessary anyway.
#   Processing time is over 3 times faster now.
#
# V0.1 (08/19/2015)
#   Initial version
#
#
# AN EXPLANATION IN JAPANESE?
# Unfortunately, a technical explanation like this is a bit too complicated
# for me to write in Japanese, so please bear with the English.
# You should still be able to make sense of it with the help of Google
# Translate, 英辞郎 and the like. m(_ _)m
#
# WHAT IS THIS?
# This is a proof-of-concept Python 2.x script for converting
# Eijiro/英辞郎 dictionary data back to the original text format
# from the Logophile binary format it is now exclusively
# distributed in.
#
# WHY MAKE THIS?
# No offense to the 英辞郎 maintainer, whose hard work we all very
# much appreciate, but once someone buys your dictionary data,
# they should be able to use it for personal use however they like,
# without getting locked into a specific software program.
#
# IS THIS A PIRACY TOOL?
# This script has absolutely nothing to do with piracy. It is
# meant for those who have legally purchased their 英辞郎 data in
# the new binary format, but for example want to convert it to
# EPWING for use with EBPocket or a similar program.
#
# IS THIS WELL-TESTED?
# No, this has NOT been extensively tested, but it seems to work OK
# with the freely provided eijiro-sample files.
# Tested ONLY with Python 2.7 under Windows, and at this point
# NOT tested AT ALL with the full version of binary Eijiro&Ryakujiro/
# 英辞郎&略辞郎, nor with binary Reijiro/例辞郎 or Waeijiro/和英辞郎
# (which isn't even available yet).
# Thus, it is entirely possible that some data may accidentally get
# substituted out during conversion, so USE AT YOUR OWN RISK.
# And, if you're at all capable, please fix the bugs you find and
# share the improved code.
#
# HOW CAN I KNOW THIS WORKS BEFORE I BUY THE BINARY 英辞郎?
# If you have an earlier version of 英辞郎, you can take the a-c
# section of your EIJI-???.TXT, append to it the A-C and a-c
# sections of your RYAKU???.TXT and run the resulting file through
#   python -c "import sys, re;data=sys.stdin.read();data = re.sub(u'｛.*?｝(?! : )', '', data, flags=re.U);sys.stdout.write(data)" < input.txt > output.txt
# on the command line to remove the yomigana (by the way, those are
# full-width ｛｝ braces in this regex, not ASCII {}).
# Compare the result (using WinMerge, for example) with what you get
# using this script on eijiro-sample, and you should see no
# differences other than the deliberate edits the maintainer has
# made since your version.
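#
# If that one-liner is hard to read, here is the same yomigana-stripping step
# in longhand form, kept commented out because it is only an illustration and
# not part of this script. The file names are placeholders, and it assumes the
# input text file is Shift-JIS/CP932 encoded like the regular 英辞郎 text
# distribution:
#
#   import codecs
#   import re
#   with codecs.open('input.txt', 'r', 'cp932') as f:
#       text = f.read()
#   # same regex as above: full-width ｛...｝ blocks not followed by ' : '
#   text = re.sub(u'｛.*?｝(?! : )', u'', text, flags=re.U)
#   with codecs.open('output.txt', 'w', 'cp932') as f:
#       f.write(text)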
#
# HOW LONG DOES THE CONVERSION TAKE? HOW MUCH MEMORY?
# Converting the a-c sample takes roughly 20 seconds on my test PC,
# so you can probably estimate 3-4 minutes for the full 英辞郎
# on most PCs made within the last 6-7 years.
# The available memory requirement should be around the combined sizes
# of your entry.dat and data.lpb plus ~15-20 MB, depending on
# your Python interpreter.
#
#
# TECHNICAL STUFF
#
# The 英辞郎 Logophile binary distribution consists of two files:
# 'entry.dat', which is a sort of index file, and 'data.lpb',
# which contains the dictionary data entries, each one
# individually zlib compressed. Aside from "hiding" a few initial
# bytes of the compressed data streams via hard coding and the
# index file, there is no data protection used. To reiterate,
# NO ENCRYPTION of any kind is used, so this can be considered
# a straightforward conversion from an undocumented format.
#
#
# entry.dat contains N 14-byte index records:
#
#   Offset  Bytes  Contents
#   0       4      data record offset in data.lpb (little-endian)
#   4       4      P, data record packed size (LE)
#   8       4      U, data record unpacked size (LE)
#   12      2      extra bytes to prepend for zlib decompression
#
# To decompress, we need to prepend 78h, 9Ch and these two extra
# bytes to the data record.
#
#
# data.lpb contains N P-byte data records that decompress to
# U bytes in size:
#
#   Offset  Bytes  Contents
#   0       4      01000000 (LE, there might be other types)
#   4       4      Ilen, length of text for indexing (LE)
#   8       Ilen   headword(s) text for indexing (don't need this)
#   ...     4      00000000 (LE, there might be other types)
#   ...     4      Hlen, length of entry headword(s) (LE)
#   ...     Hlen   headword(s) text for display
#   ...     4      Clen, length of entry content (LE)
#   ...     Clen   entry content text
#
# Unlike with the 英辞郎 text format, there is only one
# multi-line HTML-ified entry per headword. The examples are on
# their own lines instead of being separated by '■・'. Fortunately the
# '{名-1} :'-style identifiers remain intact. The text is in UTF-8.
#
# Long story short, we need to strip the HTML, reattach the examples,
# prepend the '■' headword(s) to each line and convert to Shift-JIS.
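#
# For reference, the decompression recipe above boils down to something like
# this commented-out sketch. It is just an illustration under a couple of
# assumptions: entry.dat and data.lpb sit in the current directory, and only
# the first index record is read. The main loop below does the equivalent for
# every record and then parses the decompressed item as laid out in the
# data.lpb table above.
#
#   import struct
#   import zlib
#   with open('entry.dat', 'rb') as f:
#       rec = f.read(14)                      # one 14-byte index record
#   offset, packedsz, unpacksz = struct.unpack('<III', rec[:12])
#   extra = rec[12:14]                        # the two bytes to prepend
#   with open('data.lpb', 'rb') as f:
#       f.seek(offset)
#       blob = zlib.decompress('\x78\x9C' + extra + f.read(packedsz))
#   assert len(blob) == unpacksz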

from __future__ import print_function
import argparse
import os
import re
import struct
import sys

__version__ = '0.2'


# check that we're properly pointed to a dictionary
def valid_lpb_dir(lpb_dir):
    if not os.path.isdir(lpb_dir):
        msg = '{0} is not a directory'.format(lpb_dir)
        raise argparse.ArgumentTypeError(msg)
    global entryfile
    entryfile = os.path.join(lpb_dir, 'entry.dat')
    if not os.path.exists(entryfile):
        msg = 'No file entry.dat in directory {0}'.format(lpb_dir)
        raise argparse.ArgumentTypeError(msg)
    global datafile
    datafile = os.path.join(lpb_dir, 'data.lpb')
    if not os.path.exists(datafile):
        msg = 'No file data.lpb in directory {0}'.format(lpb_dir)
        raise argparse.ArgumentTypeError(msg)
    return lpb_dir


# parse command line parameters
desc = u'★リバース辞郎 by リバースくん★ V{0}\n'\
       u'英辞郎 Logophile binary to plain '\
       u'text converter'.format(__version__)
parser = argparse.ArgumentParser(description=desc,
                                 formatter_class=argparse.RawTextHelpFormatter)
parser.add_argument('-v', '--verbose', action='store_true',
                    help='print some informational messages (to stderr)')
parser.add_argument('lpbdir', type=valid_lpb_dir,
                    help='directory with Logophile binary dictionary files')
parser.add_argument('outfile', nargs='?', type=argparse.FileType('wb'),
                    default=sys.stdout,
                    help='output file (defaults to stdout)')
if len(sys.argv) < 2:
    parser.print_help()
    sys.exit(1)
args = parser.parse_args()

if args.outfile is sys.stdout:
    if sys.platform == 'win32':
        # set stdout to binary mode so the explicit '\r\n' line endings
        # written below aren't mangled by Windows newline translation
        import msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)

with open(entryfile, 'rb') as f:
    entry = f.read()
with open(datafile, 'rb') as f:
    data = f.read()

prepend1 = '\x78\x9C'  # for zlib
entrypos = 0
nument = len(entry) / 14

if args.verbose:
    print('Found Logophile binary dictionary in '
          '\"{0}\"'.format(args.lpbdir), file=sys.stderr)
    print('Processing {0} records\n'.format(nument), file=sys.stderr)

for num in range(nument):
    datapos = struct.unpack('I', entry[entrypos:entrypos+4])[0]
    packedsz = struct.unpack('I', entry[entrypos+4:entrypos+8])[0]
    unpacksz = struct.unpack('I', entry[entrypos+8:entrypos+12])[0]
    prepend2 = entry[entrypos+12:entrypos+14]  # for zlib
    entrypos += 14

    packed = data[datapos:datapos+packedsz]
    item = ''.join([prepend1, prepend2, packed]).decode('zlib')  # decompress

    itempos = 0
    # itemdummy1 = struct.unpack('I', item[itempos:itempos+4])[0]  # don't need
    itempos += 4
    itemidxlen = struct.unpack('I', item[itempos:itempos+4])[0]
    itempos += 4
    # itemidxtxt = item[itempos:itempos+itemidxlen]  # don't need
    itempos += itemidxlen
    # itemdummy2 = struct.unpack('I', item[itempos:itempos+4])[0]  # don't need
    itempos += 4
    itemheadlen = struct.unpack('I', item[itempos:itempos+4])[0]
    itempos += 4
    itemheadtxt = item[itempos:itempos+itemheadlen]
    itempos += itemheadlen
    itemtxtlen = struct.unpack('I', item[itempos:itempos+4])[0]
    itempos += 4
    itemtxt = item[itempos:itempos+itemtxtlen]
    itempos += itemtxtlen
    if itempos != unpacksz:
        print('Error unpacking entry', num, file=sys.stderr)
        break

    itemheadtxt = itemheadtxt.decode('utf-8')
    itemtxt = itemtxt.decode('utf-8')

    # reattach the examples
    itemtxt = itemtxt.replace(u'\r\n・', u'■・')

    # strip HTML tags
    itemtxt = itemtxt.replace('<p>\r\n', '')
    itemtxt = itemtxt.replace('</p>\r\n', '')
    itemtxt = itemtxt.replace('<br />', '')
    itemtxt = re.sub('<a href=.*?>', '', itemtxt)
    itemtxt = itemtxt.replace('</a>', '')

    # re-convert HTML entities
    itemtxt = itemtxt.replace('&quot;', '\"')
    itemtxt = itemtxt.replace('&lt;', '<')
    itemtxt = itemtxt.replace('&gt;', '>')
    itemtxt = itemtxt.replace('&amp;', '&')
    for line in itemtxt.split('\r\n'):
        if line:
            # prepend the headword(s) to each line
            # the {名-1}-style lines already have the colon separator
            if line[0] == '{':
                line = ''.join([u'■', itemheadtxt, ' ', line, '\r\n'])
            else:
                line = ''.join([u'■', itemheadtxt, ' : ', line, '\r\n'])
            while (1):
                try:
                    ln = line.encode('cp932').decode('shift_jis').encode('shift_jis')
                    # pass through cp932 or characters like '~' will have problems.
                    # (this is more convenient than doing individual substitutions)
                    # see http://tanakahisateru.hatenablog.jp/entry/20080728/1217216409
                    #     http://d.hatena.ne.jp/hirothin/20080819/1219123920
                except UnicodeDecodeError as ude:
                    # turns out the 英辞郎 source contains a few characters not in
                    # JIS X 0208 that make the shift_jis decoder choke.
                    # this code assumes we only get problems during shift_jis decoding,
                    # so start/end are based on ude.object being cp932 from above.
                    prob = ude.object
                    repl = (ude.object[ude.start:ude.end]).decode('cp932')
                    repl = repl.encode('unicode_escape').replace('\\u', 'U+').upper()
                    prob = '{0}[{1}]{2}'.format(prob[:ude.start], repl, prob[ude.end:])
                    line = prob.decode('cp932')
                    if args.verbose:
                        # print('Encoding problem:', ude, file=sys.stderr)
                        print('Replacing gaiji \'{0}\' with Unicode notation placeholder'
                              ' \'[{1}]\''.format(ude.object[ude.start:ude.end], repl),
                              file=sys.stderr)
                        print('Text:', ude.object, file=sys.stderr)
                    continue
                else:
                    # should be JIS X 0208 compliant now
                    args.outfile.write(ln)
                    break
#