Wordlist

pyClean: Python way to clean up large password dictionaries.

I created pyClean.py to help me manage the 800+ gigabytes of password dictionaries I have collected and obtained by cracking password hashes. A lot of the passwords were password!@#, password123, 1@3password4@%, etc.: the same base word, just with numbers and special characters at the beginning and end. Not only does this take a lot of disk space, it's also very inefficient. Hashcat rules can add numbers and special characters far more efficiently and with better results. Adding them in the middle and randomly throughout the word is harder. Rules do exist for this, but I am leaving that part alone.

So how do you use pyClean.py?

You need at least Python 2.7.x. I think I programmed it right to support Python 3, but I haven’t tested it.

user@host$: git clone https://github.com/initiate6/pyClean.git

user@host$: cd pyClean/

user@host$: python pyClean.py -h

pyClean.py by INIT_6
Cleans large dictionaries removing configured chars via regex.
For more information check out my blog: https://blog.init6.me/?p=63

--help Prints this help
--file File containing words you want to clean up. 1 word per line
--threads Number of processors
--output Output filename. Default: input file name plus .out
--lines Lines per chuck to read Default [10000]
--startRegex Regex to remove at start of line. Default [^[\d|\W]+]
--endRegex Regex to remove at end of line. Default [[\d|\W]+$]

Pretty self-explanatory. The only required item is file. Threads defaults to the number of CPU cores. Output defaults to the input file name with .out appended. Lines defaults to 10,000 lines per chunk. However, this is just an approximate number; the actual line count will differ greatly depending on the input file. It will always read to the end of a line and fill the buffer. startRegex and endRegex let you customize what is stripped out of the words. The default is numbers and special chars at the beginning and end of each word.
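To make the default stripping concrete, here is a minimal sketch of what the two default patterns do to a single word. It is not pyClean's actual code; the function name clean_word is made up for illustration, and only the two regex strings come from the help text above.

```python
import re

# Default patterns taken from the help text above.
# Inside a character class "|" is a literal pipe, so the class matches
# digits, non-word characters, and "|" (which is itself a non-word char).
# Note that "_" counts as a word character, so it is NOT stripped.
START_RE = re.compile(r"^[\d|\W]+")
END_RE = re.compile(r"[\d|\W]+$")


def clean_word(word):
    """Strip leading and trailing digits/special chars (hypothetical helper)."""
    word = START_RE.sub("", word)
    return END_RE.sub("", word)


for raw in ["password!@#", "password123", "1@3password4@%", "pass123word"]:
    print(raw, "->", clean_word(raw))
```

Characters in the middle of a word are left alone, which matches the intent of leaving that part to hashcat rules.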

List of dictionaries I can remember:

Pretty much everything from here: https://wiki.skullsecurity.org/Passwords
crackStation
The recent 10-million username/password release
95% of the LinkedIn passwords, which I cracked with a team I was on.
All passwords I have cracked in the Crack Me If You Can contest over the years.
Many others; I’ll update this list when I remember them.

Additional notes:

My built-in de-duper isn’t very good; it only removes dups within each chunk I process. It also removes any words shorter than 3 chars. It’s better than nothing, but if you really want to get the job done, use the following sort foo.

user@host$ LC_ALL=C sort --parallel=8 -f -u -S 30G -T /passwords/tmp/ -o SortedAllPasswords.wl allPasswords.wl

LC_ALL=C is to make sure the sorting order is based on the byte values. More info here: http://unix.stackexchange.com/a/87763

--parallel sets how many sorts run concurrently; typically just the number of CPU cores you have.

-f ignores case. Again, hashcat is better at toggling case.

-u unique, to remove duplicates.

-S buffer-size. SIZE may be followed by the following multiplicative suffixes: % 1% of memory, b 1, K 1024 (default), and so on for M, G, T

-T Temporary-directory. My default temp directory is small so I have to relocate it.

-o output file name

Last item is the input file name

The collection went from 1TB to 500GB, or 103GB compressed.
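For anyone wondering why a per-chunk de-duper still leaves duplicates that a global sort -u pass catches, here is a small sketch. It is an illustration under assumptions, not pyClean's actual code; the function name dedupe_chunk and the sample words are made up.

```python
def dedupe_chunk(words, min_len=3):
    """Drop short words and duplicates within ONE chunk, keeping first-seen order."""
    seen = set()
    kept = []
    for word in words:
        if len(word) < min_len or word in seen:
            continue
        seen.add(word)
        kept.append(word)
    return kept


# Two chunks processed independently, as if read from one big file.
chunk_a = ["password", "letmein", "password", "ab"]
chunk_b = ["dragon", "password"]

cleaned = dedupe_chunk(chunk_a) + dedupe_chunk(chunk_b)

# "password" is removed once inside chunk_a, but the copy in chunk_b
# survives because each chunk only knows about its own words.
print(cleaned)
```

Keeping one global set across all chunks would fix this, but with hundreds of gigabytes of words that set would not fit in memory, which is why an external sort is the practical choice.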

After everything was cleaned up, I organized the passwords by character length. This helps in the cracking process, as 8 to 12 char passwords are the most common. So by using the 6 to 12 char password files with hashcat rules to expand and contract the base words, you get pretty good coverage.
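Splitting by length can be sketched in a few lines. This is only an in-memory illustration and not how the published files were actually produced; the function name group_by_length and its min_len/max_len parameters are hypothetical, with the 6 to 12 range taken from the paragraph above.

```python
from collections import defaultdict


def group_by_length(words, min_len=6, max_len=12):
    """Bucket words by character length, keeping only min_len..max_len."""
    buckets = defaultdict(list)
    for word in words:
        if min_len <= len(word) <= max_len:
            buckets[len(word)].append(word)
    return dict(buckets)


sample = ["abc", "dragon", "letmein", "password", "averyveryverylongpassword"]
for length, words in sorted(group_by_length(sample).items()):
    print(length, words)
```

For a real multi-gigabyte list you would stream the input line by line and append each word to a per-length output file rather than holding the buckets in memory.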

Note, these lists are mostly good for a wide-net first pass on a password dump. After you get more information about the dump, it’s good to use a targeted word list.

Torrent: https://box.init6.me/data/public/2042a9

I’ll be collecting more and more passwords and sorting through them. If you have any lists you would like to share, please let me know at init6@init6.me