- The ebx files are obviously weakly compressed, whereas they used to be uncompressed throughout bf3 and in the bf4 alpha. Compare the keyword sections:
- levellistreport.ebx:
- alpha:
- DataContainer.Asset.$.Name.LevelReportingAsset.array.member.BuiltLevels
- beta:
- DataContainer.Asset.$.Name.LevelReporting....array.member.Built&..s.
- In detail with the dots replaced with the actual hex:
- LevelReporting 1B00F103 array
- => 1B00F103 equals Asset\x00
- It's just simple substitution stuff, so try the common weak compressions: lzo or snappy, or maybe zlib on very low compression (or rather not lol, very unlikely)
- Offset of Asset: 0x78 (0x78 refers to 78 in hex, i.e. 120 decimal)
- Offset of where Asset would start after LevelReporting (before 1b): 0x93
- Difference: 0x1b
- dataversion.ebx:
- alpha:
- DataContainer.Asset.$.Name.VersionData.disclaimer.Version.DateTime.BranchId.GameName
- beta:
- DataContainer.Asset.$.Name.Version"...disclaimer...:...eTime.BranchId.Game:..]
- In detail:
- Version 2200B400 disclaimer
- => 2200B400 can mean either:
- 2200b4 (which somehow equals Data) and a nullbyte
- or 2200b400 on its own (which somehow equals Data\x00).
- Note that levellistreport required 4 bytes too but already got the nullbyte from the Asset\x00 string.
- Offset of Data: 0x3f
- Offset of where Data would start after Version: 0x61
- Difference: 0x22
- So it might be 2200B400 which means:
- move back 0x22 bytes and grab 5 bytes
- or it could be 2200B4 which means:
- move back 0x22 bytes and grab 4 bytes (with the final byte given directly)
- snappy compression algorithm says:
- Copies are references back into previous decompressed data, telling
- the decompressor to reuse data it has previously decoded.
- They encode two values: The _offset_, saying how many bytes back
- from the current position to read, and the _length_, how many bytes
- to copy.
- =>looks promising
- snappy also mentions (zlib, LZO, LZF, FastLZ, and QuickLZ), might be worth checking out the various LZs
- LZO says:
- LZO is a block compression algorithm - it compresses and decompresses
- a block of data. Block size must be the same for compression
- and decompression.
- LZO compresses a block of data into matches (a sliding dictionary)
- and runs of non-matching literals.
- This is basically the entire documentation regarding how it actually works, which is probably(?) similar to snappy (I don't understand the documentation).
- Ignore the compressed parts for a moment and consider the header (the very first few bytes in the file).
- The payload is compressed and has a header of a few bytes placed in front of it.
- When compression is involved the header is pretty much guaranteed to contain both the decompressed size and the compressed size.
- Compression header (i.e. before the actual compressed payload):
- 4 bytes: 0000 01d0, decompressed size?
- 2 bytes: 09 70??
- 2 bytes: 015b, size of everything after it till EOF (end-of-file) for small files
- 2 bytes: f102?
- 0970 was the same in a couple of files, check for constancy of 0970 @offset 4:
- import os
- #go through all files in the ebx folder
- for dir0, dirs, ff in os.walk("ebx"):
-     for fnames in ff:
-         fname=dir0+"\\"+fnames
-         f=open(fname,"rb")
-         f.read(4)
-         if f.read(2)!="\x09\x70":
-             print fname
- NO HIT => CONSTANT FOR ALL BETA EBX
- The sum of the 2 integers after the ebx magic (ced1b20f) is the (decompressed) file size. Usually the ints
- are untouched as the files are only weakly compressed. The sum of them is equal to the first 4 bytes in the file, confirming that this indeed is the decompressed size.
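- A quick sketch to verify that on one small file (the two ints sit close enough to the start that they are stored uncompressed here; the filename is just an example):
- from struct import unpack
- f=open("levellistreport.ebx","rb")
- decompressedSize=unpack(">I",f.read(4))[0] #first 4 bytes of the compression header, big endian
- data=f.read() #rest of the file
- magicPos=data.find("\xce\xd1\xb2\x0f") #locate the ebx magic
- int1,int2=unpack("<II",data[magicPos+4:magicPos+12]) #the two little endian ints right after the magic
- print decompressedSize==int1+int2 #True when the ints are untouched by the compression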
- Compression header in big endian (big endian is when a hex number is written in a normal way from left to right, little endian inverts the order of the bytes):
- 4 bytes: decompressed size
- 2 bytes: 0970
- 2 bytes: 015b, size of everything after it till EOF (for small files)
- 2 bytes: f102, possibly part of the compressed payload
- What happens when a file is too large for a 2 byte size?
- Two possibilities:
- 1) Some varint stuff (with pairs of 2 bytes?), so when the first bit is 1, then read two more bytes. Rather unlikely, never seen any varints working with pairs of 2 bytes before.
- 2) Compressed in small blocks with max ffff bytes, one block after another. Could be that 0970 is the start of one package which would also align the start of the first section to a multiple of 4.
- materialgrid contains 0970 eight times, spaced apart by the block size => option 2.
- This is very similar to the fb2 zlib format (or maybe it is even zlib with low compression).
- The last two bytes are really part of the payload and not part of the header. Some files only have one byte there before the ebx magic.
- => The file consists of several blocks, with no global metadata.
- The blocks are set to have a size of 0x010000 when decompressed, except for the last one which is usually smaller.
- Compressed block (big endian):
- 4 bytes: decompressed size (0x10000 or less)
- 2 bytes: 0970
- 2 bytes: compressed size
- compressed payload
- Decompress each block and glue the decompressed parts together to obtain the file.
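- A rough sketch that just walks the block headers of one file to check this layout (the filename is an example; it assumes every block is a 0970 block with a nonzero compressed size):
- from struct import unpack
- f=open("materialgrid.ebx","rb")
- f.seek(0,2)
- fileSize=f.tell()
- f.seek(0)
- while f.tell()<fileSize:
-     decompressedSize,constant,compressedSize=unpack(">IHH",f.read(8))
-     print hex(decompressedSize),hex(constant),hex(compressedSize)
-     f.seek(compressedSize,1) #skip the compressed payload to reach the next block header
- f.close()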
- Maybe it is zlib at weak compression, try compressing a string at the various compression levels (from 0 to 9) to get an idea what it looks like:
- import zlib
- from binascii import hexlify
- string="adgfasdfavasdfasdf00000000"
- for i in xrange(10):
-     hexlify(zlib.compress(string,i))
- '7801011a00e5ff6164676661736466617661736466617364663030303030303030858708c4'
- '78014b4c494f4b2c4e494b2c0393409601140000858708c4'
- '785e4b4c494f4b2c4e494b2c0393406c000500858708c4'
- '785e4b4c494f4b2c4e494b2c0393406c000500858708c4'
- '785e4b4c494f4b2c4e494b2c0393406c000500858708c4'
- '785e4b4c494f4b2c4e494b2c0393406c000500858708c4'
- '789c4b4c494f4b2c4e494b2c0393406c000500858708c4'
- '78da4b4c494f4b2c4e494b2c0393406c000500858708c4'
- '78da4b4c494f4b2c4e494b2c0393406c000500858708c4'
- '78da4b4c494f4b2c4e494b2c0393406c000500858708c4'
- Nope, zlib starts with 78 (usually 78da because default compression is set pretty high). You may want to connect 78da with zlib in your mind. It's used in many archives.
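- Side note on why 78 keeps showing up: the zlib header has a built-in check (RFC 1950), so a quick test can tell whether two bytes could even be the start of a zlib stream:
- def looksLikeZlib(twoBytes):
-     #compression method must be 8 (deflate) and the two bytes, read as one big endian number, must be divisible by 31
-     cmf,flg=ord(twoBytes[0]),ord(twoBytes[1])
-     return (cmf&0xf)==8 and (cmf*256+flg)%31==0
- print looksLikeZlib("\x78\xda") #True
- print looksLikeZlib("\x09\x70") #False, so the block constant is not a zlib header either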
- Try to decompress anyway, run through all possible slices of a small file and see if it can decompress. If it cannot then the file is probably not zlib.
- import zlib
- f=open("levellistreport.ebx","rb")
- size=355 #size of the file
- for i in xrange(size):
-     for j in xrange(size):
-         f.seek(i)
-         data=f.read(j) #when j is greater than the number of the remaining bytes in the file,
-                        #it doesn't cause an error but just gives back everything till the end of the file
-         try:
-             data2=zlib.decompress(data) #try to decompress it (usually it will complain about the format being invalid)
-             if len(data2)!=0: #make sure that there's actually something there when decompressed
-                 print i,j,len(data2)
-         except: continue
- No output at all, so disregard zlib.
- snappy (has only one compression level):
- Grabbed the libraries from http://www.lfd.uci.edu/~gohlke/pythonlibs/
- Take the script from before and replace the "zlib" with "snappy" (Python is simple), yielding
- 15 3 1
- 22 5 2
- 62 3 1
- 216 7 3
- 256 4 2
- 319 35 32
- 347 3 1
- So it only gives back small random segments out of it with a size of max 32 bytes. No snappy.
- lzo:
- Exactly the same script as before but with lzo instead. Always fails. No lzo.
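- For reference, a module-agnostic version of that brute force (assuming the python-snappy and python-lzo packages, which both expose a decompress function):
- import zlib,snappy,lzo
- def bruteForce(fname,module):
-     data=open(fname,"rb").read()
-     for i in xrange(len(data)):
-         for j in xrange(i+1,len(data)+1):
-             try:
-                 out=module.decompress(data[i:j])
-                 if len(out)!=0: print module.__name__,i,j-i,len(out)
-             except: continue
- for module in (zlib,snappy,lzo):
-     bruteForce("levellistreport.ebx",module)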
- wiki mentions some lossless algorithms
- Run-length encoding (RLE) – a simple scheme that provides good compression of data containing lots of runs of the same value.
- Lempel-Ziv 1978 (LZ78), Lempel-Ziv-Welch (LZW) – used by GIF images and compress among many other applications
- DEFLATE – used by gzip, ZIP (since version 2.0), and as part of the compression process of Portable Network Graphics (PNG), Point-to-Point Protocol (PPP), HTTP, SSH
- bzip2 – using the Burrows–Wheeler transform, this provides slower but higher compression than DEFLATE
- Lempel–Ziv–Markov chain algorithm (LZMA) – used by 7zip, xz, and other programs; higher compression than bzip2 as well as much faster decompression.
- Lempel–Ziv–Oberhumer (LZO) – designed for compression/decompression speed at the expense of compression ratios
- Statistical Lempel Ziv – a combination of statistical method and dictionary-based method; better compression ratio than using single method.
- Run-length encoding (RLE) => nope, too simple
- Lempel-Ziv 1978 (LZ78), Lempel-Ziv-Welch (LZW) => probably LZ77; LZ78 and LZW are different again
- DEFLATE – Each block is preceded by a 3-bit header. => nope, there is no 3 bit header.
- bzip2 => unlikely (high compression)
- Lempel–Ziv–Markov chain algorithm (LZMA) => unlikely (even higher compression)
- Lempel–Ziv–Oberhumer (LZO) => nope, just tried it
- Statistical Lempel Ziv => very novel and definitely uncommon; nope
- wiki on LZ77: In the implementation used for many games by Electronic Arts,[4] the size in bytes of a length-distance pair can be specified inside
- the first byte of the length-distance pair itself; depending on if the first byte begins with a 0, 10, 110, or 111 (when read in big-endian bit orientation),
- the length of the entire length-distance pair can be 1 to 4 bytes large.
- [4]: http://wiki.niotso.org/QFS_compression (Niotso is a semi-collaborative effort to re-implement the engine used in The Sims Online.)
- Googling ea and LZ77 got me here http://www.vgleaks.com/world-exclusive-durangos-move-engines/
- "The Xbox One (Durango) GPU includes a number of fixed-function accelerators. Move engines are one of them.
- Xbox One (Durango) hardware has four move engines for fast direct memory access (DMA)
- This accelerators are truly fixed-function, in the sense that their algorithms are embedded in hardware.
- They can usually be considered black boxes with no intermediate results that are visible to software.
- When used for their designed purpose, however, they can offload work from the rest of the system and obtain useful results at minimal cost."
- The Xbox One has one move engine for encoding and one for decoding LZ77.
- So, some LZ77 variant was probably chosen in preparation for the Xbox One. The Xbox finally gets rid of that proprietary XMA audio codec implemented via hardware
- that made it impossible for a long time for anyone to decode Xbox audio (until some russians managed to get a hold of some code IIRC), but now it has an
- LZ77 variant that is done via hardware and will most likely never be documented anywhere. meh.
- Apply the info from niotso to the levellistreport.ebx:
- Recall that the string is: DataContainer.Asset.$.Name.LevelReporting....array.member.Built&..s.
- Detail:
- LevelReporting 1B00F103 array
- => 1B00F103 equals Asset\x00
- 1b00f103 is a 4 bytes opcode.
- in binary: 00011011 00000000 11110001 00000011
- niotso says for the individual bits in a 4byte opcode:
- 110ORRPP OOOOOOOO OOOOOOOO RRRRRRRR
- Note the O in the first byte. Looks like some attempt at obfuscation to me (would make more sense to have it on the right in the first bit close to the other Os).
- O: Offset, move backwards by this amount of bytes and start copying a certain number of bytes following that position.
- R: Length, how many bytes to copy. If the length is larger than the offset, start at the offset again and copy the same values again.
- P: The engine needs a way to know if what it sees is data (that may happen to look similar to an opcode) and what is opcode. So this value
- tells the distance to the next opcode, with everything in between being ordinary uncompressed data. Proceed this distance.
- This can only be a value up to 3, which is far too small. Therefore it's possible to add just one more byte after the opcode to increase that distance.
- As the offset 1b is in fact on the left in the code, it does not match the niotso 4byte opcode which requires the offset in the middle.
- Maybe it is a 3byte opcode with an extra byte. Will investigate this later.
- What's more, the offset is the very first thing to appear, so the engine must have some idea how many bytes to read.
- I'm not sure if the niotso format is used. It's more likely that it is indeed some custom format for the Xbox One. Still, most of the info here applies in either case.
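- For comparison, the quoted 4byte opcode taken at face value (the actual QFS spec may add small constants to these fields, so treat the values as approximate):
- def decodeNiotso4Byte(b0,b1,b2,b3):
-     #110ORRPP OOOOOOOO OOOOOOOO RRRRRRRR, read literally
-     assert b0>>5==0b110 #the top bits identify the opcode class
-     offset =((b0>>4)&1)<<16 | b1<<8 | b2
-     length =((b0>>2)&3)<<8 | b3
-     proceed=b0&3
-     return offset,length,proceed
- #1b00f103 starts with 0x1b = 00011011, which lacks the 110 prefix entirely,
- #another hint that the beta format does not follow this layout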
- Well, a custom LZ77 it is then. Not awfully surprising. Initially I thought the binary XML used
- by the game was some already existing format and spent hours to find nothing (of course).
- So let's not waste any time and get going.
- It might be a good idea to consider the distance between one opcode and the next, i.e. the distance between the positions after LevelReporting and after Built.
- Offset after LevelReporting: 0x93
- Offset after Built: 0xA9
- Difference: 0x16, could be up to 4 bytes lower because I'm not sure if it starts counting before or after the opcode. The value may also be constantly shifted by a small value.
- It's still not conclusive.
- Note that the 1b in 1b00f103 is probably part of a 2 byte sequence, 1b00, written in little endian (i.e. 001b in big endian).
- Anyway, go to the start of the file after the header. There are two bytes before the ebx magic, f102, while other times just one byte is enough.
- This indicates a varint, so the first byte has a bit to indicate that the number has reached its end or if the next byte is part of the number too.
- There are many other ebx files having two bytes before the magic and the first byte being f1, f2, etc.
- As a rough estimate then, when the first half of the first byte is f, then another byte follows.
- One half of a byte is 4 bits, so only the 4 remaining bits actually contain information about the number.
- Anyway, check this theory by adjusting the script. As the header structure is already known the script could read every block and not just the first one.
- However, the first block is known to contain the ebx magic which can be used as a landmark, so it's not necessary yet to implement that.
- import os
- from struct import unpack,pack #convert a sequence of bytes into ints or floats
- #create some sets, they can contain every element only once; perfect for this kind of analysis
- oneset=set()
- twoset1=set()
- twoset2=set()
- #go through all files in the ebx folder
- for dir0, dirs, ff in os.walk("ebx"):
-     for fnames in ff:
-         fname=dir0+"\\"+fnames
-         f=open(fname,"rb")
-         #grab the header
-         decompressedSize=unpack(">I",f.read(4)) #read big endian unsigned int
-         constant=f.read(2) #0970
-         compressedSize=unpack(">H",f.read(2)) #read big endian unsigned half int
-         #check the bytes after the header, using the ebx magic as a reference
-         ebxMagic="\xce\xd1\xb2\x0f"
-         sample = f.read(10) #read 10 bytes, even the smallest file is several times larger than that
-         #now find the position of the magic in it and then analyze the bytes before it
-         magicPos=sample.find(ebxMagic)
-         if magicPos==-1: asdf #could not find the magic at all (undefined name, crashes on purpose so I can investigate)
-         elif magicPos>2: fdsa #more than 2 bytes before the magic
-         if len(sample)<2: tooshorttoanalyze
-         if magicPos==1:
-             oneset.add(ord(sample[0])) #this set will contain all possible bytes that appear when there is only one byte before the magic
-         if magicPos==2:
-             twoset1.add(ord(sample[0]))
-             twoset2.add(ord(sample[1]))
- print oneset
- print twoset1
- print twoset2
- results in:
- set([224, 192, 162, 161, 209, 194, 144, 177, 128])
- set([240, 241, 242, 243, 244])
- set([2, 4, 40, 41, 42, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 10])
- Later on it might be useful to collect the specific combinations of the 2 bytes that are possible. For now however, there are some useful results already:
- There are always either one or two bytes before the magic.
- When there are two bytes before the magic, the first byte can be 240 to 244, which is 0xf0 to 0xf4.
- When there is only one byte the values range from 0x80 to 0xe0.
- Now comes the time to directly compare between small alpha and beta files that remain unchanged.
- As a rough indicator for that, use the two ints after the magic and make sure they are the same. As they appear so early in the file and the compression
- can only look behind to copy stuff from there, and not into the future, these ints should always be written out.
- levellistreport.ebx and many others fail the test.
- import os
- from struct import unpack,pack
- #go through all files in the ebx folder
- for dir0, dirs, ff in os.walk("ebx"):
-     for fnames in ff:
-         fname=dir0+"\\"+fnames
-         f=open(fname,"rb")
-         #grab the header
-         decompressedSize=unpack(">I",f.read(4))
-         constant=f.read(2) #0970
-         compressedSize=unpack(">H",f.read(2))
-         #check the bytes after the header, using the ebx magic as a reference
-         ebxMagic="\xce\xd1\xb2\x0f"
-         sample = f.read(10)
-         magicPos=sample.find(ebxMagic)
-         f.seek(-10+magicPos,1) #move back to where the ebx magic starts, then grab 12 bytes and compare to alpha
-         betabytes=f.read(12)
-         try: f2=open("D:/hexing/bf4 alpha dump/bundles/"+dir0+"/"+fnames,"rb")
-         except: continue #some files do not exist in the alpha
-         alphabytes=f2.read(12) #not compressed, so no header or other trouble
-         if alphabytes==betabytes:
-             print fname
- which gives back a whopping two files that satisfy the condition:
- ebx\sound\mixers\impairedhearing_soundstate_mixer.ebx
- ebx\sound\mixers\mandown_soundstate_mixer.ebx
- Both of these still differ directly after those 12 bytes, so ignore them.
- Keep working with levellistreport.ebx instead, trying to match the metadata:
- from my ebx script (these 11 ints appear directly after the ebx magic, they are little endian):
- class Header:
-     def __init__(self,varList): ##all 4byte unsigned integers
-         self.absStringOffset = varList[0] ## absolute offset for string section start
-         self.lenStringToEOF = varList[1] ## length from string section start to EOF
-         self.numGUID = varList[2] ## number of external GUIDs
-         self.null = varList[3] ## 00000000
-         self.numInstanceRepeater = varList[4]
-         self.numComplex = varList[5] ## number of complex entries
-         self.numField = varList[6] ## number of field entries
-         self.lenName = varList[7] ## length of name section including padding
-         self.lenString = varList[8] ## length of string section including padding
-         self.numArrayRepeater = varList[9]
-         self.lenPayload = varList[10] ## length of normal payload section; the start of the array payload section is absStringOffset+lenString+lenPayload
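- A minimal sketch of reading those 11 ints with the class above from a decompressed alpha file (the filename is just an example):
- from struct import unpack
- f=open("levellistreport_alpha.ebx","rb") #an uncompressed alpha ebx
- assert f.read(4)=="\xce\xd1\xb2\x0f" #ebx magic
- header=Header(unpack("<11I",f.read(44))) #11 little endian uints right after the magic
- print hex(header.absStringOffset), hex(header.lenStringToEOF)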
- alpha (from here on, all nums in hex even without 0x prefix):
- self.absStringOffset = 180
- self.lenStringToEOF = 50
- self.numGUID = 1
- self.null = null, lol
- self.numInstanceRepeater = 1
- self.numComplex = 4
- self.numField = 5
- self.lenName = 50
- self.lenString = 10
- self.numArrayRepeater = 2
- self.lenPayload = 30
- beta (all nums in hex; with some guessing involved):
- self.absStringOffset = 190
- self.lenStringToEOF = 40
- self.numGUID = 2
- self.null = null, maybe? It's compressed; it seems that it was removed
- self.numInstanceRepeater = 1
- self.numComplex = 4
- self.numField = 5
- self.lenName = 50
- self.lenString = 10
- self.numArrayRepeater = maybe 6? unlikely
- self.lenPayload = 20
- The null part is the first occurrence of compression I think.
- The file starts with f102, then reads 10 bytes, and then compresses the nulls with
- 01020073? Or the null was removed from the header altogether. Skim through some larger files
- as larger files mean larger values in the metadata so they aren't compressed as easily.
- Consider the beta materialgrid:
- CED1B20F 70250200 10D20400 BF000000
- CD05CD05 10002000 5B007006 30000000
- A9280000 A012010063
- self.absStringOffset = 22570
- self.lenStringToEOF = 4d210
- self.numGUID = bf
- self.null = null, maybe?
- self.numInstanceRepeater = 210?
- self.numComplex = 75b?
- self.numField = 306?
- self.lenName = ?
- self.lenString = ??
- self.numArrayRepeater = ???
- self.lenPayload = ????
- Just make sure that the header indeed remains the same, and figure out if the null entry is still there or not.
- The first two entries were already checked. Move on to numGUID.
- Get the length of the guid section. It is made up of guid pairs with a length of 20 each.
- The guids are random bytes and thus should be almost impossible to compress.
- The size is roughly 17f5, starting @32 and ending before the section containing the
- string keywords.
- 17f5/20 = bf, good
- The keyword section (which I called Names for some reason) starts @1827 and has a size of about 3bd
- The next sections in the file are:
- fieldDescriptors #10 (i.e. sixteen) bytes long, the 9th byte is pretty much always null
- complexDescriptors #look the same as fieldDescriptors, but that byte is not null
- instanceRepeaters #consist of three ints, the first int used to always be null
- arrayRepeaters #look the same as instanceRepeaters, but the first int is not null
- These characteristics can be used to identify the length of the sections and thereby the number of entries.
- fieldDescriptors and complexDescriptors: the section for both is 500 bytes
- instanceRepeaters: about d80 bytes; d80/c = 120
- arrayRepeaters: about a788 bytes; a788/c = df6
- The length of the string section is about 30 bytes compressed.
- The length of the non-array payload after that section is c080.
- Can't figure out anything, return to a simpler file.
- Analyze the keyword sections of various files to get an idea of how the offset works.
- More files = more accurate results.
- levellayerinclusion:
- alpha:
- DataContainer.Asset.$.Name.SubWorldInclusion.array.member.Criteria.WorldPartInclusion.SubWorldInclusionCriterion.Options.WorldPartInclusionCriterion
- beta:
- DataContainer.Asset.$.Name.SubWorldInclusion.array.member.Criteria.%.FPart)..;..-..on.Options6
- in detail:
- WorldPartInclusion.SubWorldInclusionCriterion
- vs
- 250046 Part 29000D 3B0003 2D00AF on
- moving 25 backwards gives back the offset of WorldInclusion
- 46 means to copy 5 bytes, then read Part as uncompressed data and then read
- another opcode.
- So 46 => copy 5 and proceed by 4?
- With the 29 byte afterwards I end up at the ldInclusion, two bytes too early.
- At least 2d00af refers back to Inclusion. But it should be SubWorldInclusion... meh
- uiawardsoverlaylogic:
- alpha:
- DataContainer.GameDataContainer.$.DataBusPeer.Flags.
- beta:
- DataContainer.Game.. $....BusPeer.Flags&.`
- in detail:
- GameDataContainer.$.DataBusPeer.Flags.
- vs
- Game 120020 $ 00 1000D1 BusPeer.Flags 260060
- 120020: move back by 12 and copy 0e bytes, then proceed by 2 bytes.
- 1000d1: move back by 10 and copy 4, then proceed 0d.
- Ah, so it does it sequentially.
- It first converts Game 120020 to GameDataContainer.
- then when it reaches the next code it uses this replacement already,
- so when it moves back by 10 bytes it actually arrives at the Data from
- GameDataContainer, not the normal DataContainer. This is then copied again, etc.
- It only makes sense, it would be odd if it copied opcodes too. Silly me.
- 20 in binary: 00100000, copy 0e, proceed 2
- d1 in binary: 11010001, copy 4, proceed 0d
- Manually analyze a whole lot more, I'll only jot down the results; long keyword sections are ideal for this:
- Using sound/master.ebx. Made a copy of the compressed file. After every step I figured out, I replaced
- that part with the uncompressed data so the subsequent compressed parts make sense.
- 0F0057, move 0f, copy 5, proceed 5
- 370057, move 37, copy 0b, proceed 5
- Well isn't that interesting. The same 57 is used for two different things.
- This might be an indicator that there is some obfuscation. E.g. it could be implemented
- that the program adds the current offset in the file to some number to obtain 57
- when compressing.
- So I would need to subtract that offset before analyzing the number. That of course
- requires an even greater number of samples, with their offset documented.
- Let's ignore that for a moment and go on.
- 100040, move 10, copy 0b, proceed 4
- 630091, move 63, copy 4, proceed 9
- 0D00F205, move 0d, copy 5, proceed 14
- Note that the keywords are ordered slightly differently, which hopefully explains how
- the same code meant two different things above. The words all sound the same here,
- so it's hard to recognize a different order. Go on anyway, keeping that in mind.
- 220002, move 22, copy 6, proceed 0 (the next opcode is directly after this one)
- 7C00F306, move 7c, copy 6, proceed 15
- 6C0002, move 6c, copy 7, proceed 0. How is that even possible? Is this some sort of ruse?
- The first two bytes (the move distance) seem pretty reliable, so from here on I won't type out how far it moves.
- 1D0002, copy 6, proceed 0
- 0C0062, copy 6, proceed 6
- 120001, copy 6, proceed 0
- 550004, copy 4, proceed 0
- 6E0006, copy 9 or 8, proceed 0
- C80000, copy 0a or 9, proceed 0
- 8100F101, copy 4, proceed 10
- Not conclusive at all.
- However, it seems that that rule about f0 to f4 is true even later in the file.
- So identifying compressed parts is not that difficult. Look out for 1 byte with a value
- (move back by this amount), then one byte is almost always null because the last time the
- string appeared is usually closer. At least, when a file has just ff bytes this always works.
- As there's not much point in looking at compressed parts later in the file anyway
- unless either the previous parts have been decompressed or I can directly compare against
- an alpha file (even then, the keyword section comes pretty early and is the most useful),
- just assume it is null. When the first half of the final byte is f, then make sure
- to look one byte ahead, if that byte is extremely low, these 4 bytes form
- one compressed unit. If the first half is not f it's a bit harder.
- Just make sure to look back to see if the position to copy from actually makes sense.
- Take another look at levellistreport.ebx with the goal of fully decompressing it:
- decompressed size: 1d0
- Starts with F102, then some bytes.
- The first guid D6076D4B4DF8DD11BE32C64EACA26B06 is the same in the alpha.
- In particular, the file guid is very similar to the instance guid, so
- parts are compressed the second time.
- Likewise, the first guid of the second guid pair, A4E429350D405687DE5E6EFF3347F7ED
- is similar to the second part of the pair.
- I can't really figure that part out yet; moving on to the keyword section which
- shouldn't be too hard to decompress.
- 1B00F103, copy 6, proceed 12
- 260013, copy 5, proceed 0
- It's the end of the keyword section, so proceeding 0 might seem fine at first glance.
- However, every string has to end with a nullbyte which has not appeared yet.
- Furthermore, the keyword section is padded with nullbytes to a multiple of 10 (sixteen). So
- while it may be possible that the section needs no padding by chance (or only 2-3 bytes),
- it's very likely that the next opcode fills in the nulls.
- Indeed, as the next section starts with 81b5 (the first two bytes of the hash of DataContainer)
- there is some padding here.
- 8E0040, copy some nulls, proceed 4
- In the alpha (and in bf3) the first fieldDescriptor
- is just the hash and lots of nulls: 81B50200000000000000000000000000
- In the beta, knowing that the next fieldDescriptor starts with the hash 82D8827C,
- the entire entry reads: 81B5 C60030000008050092000000
- This is fundamentally different. Even assuming that C60030 is an opcode,
- there is definitely a non-null byte later on. It would have been too easy anyway.
- So, with every descriptor starting with a hash (which is hard to compress) and every
- descriptor having a fixed 16 bytes size (hopefully), try to count the entries to get
- an idea of the final size to check if the ebx file header remains correct.
- Never mind, that's impossible.
- Fix the string later on in the file, LevelListReport:
- CF0042 should equal Level; copy 5, proceed 4
- F90020 should equal Report; copy 6,
- current minimum distance between CF0042 and Level: 8f, so there are 40 bytes missing
- current distance between F90020 and Report (assuming that Level was replaced by now):
- b9, so there are 40 bytes missing here too. So there truly are 40 bytes missing.
- This means that the difference between the keywords and the strings section
- is a multiple of 10. Hopefully that means that the padding remains there as it used to.
- The keyword section should be 50 bytes long then.
- That makes 8E00400000 become 9 nulls, so basically 8E0040 is 7 nulls.
- So from CF0042, move backwards until Level is reached. That distance is actually cf.
- Now just add the size of the keywords before that: 41 bytes
- => 110 bytes for keywords + descriptors/repeaters.
- Assume that keywords need 50 bytes (they are padded at the end).
- That leaves c0 for the descriptors/repeaters
- So what's the size of all metadata:
- Meta size (as given by the header):
- 190
- Meta size (by summing up the parts):
- 30? for the header itself
- +60 for the guids (2 external guid pairs, one file guid pair, each 20 bytes)
- +50 for the keywords
- +c0 for the descriptors/repeaters
- = 1a0
- I suspect either the header lost 10 bytes or one half of a guid pair has been dropped.
- Moving on to the payload itself. The payload always starts with a 10 byte guid, and more
- guids appear later on in the file. These correspond to instances in the xml file.
- What's interesting to note is that the guid section at the top contains (among others)
- the guid of the primary instance. This guid must appear once in the payload, written out
- exactly like at the top. In the case of levellistreport.ebx, there is only one instance in
- total. Which means that the guid that must appear is known (from the alpha):
- D7076D4B4DFEDD11A232C64E4C926B06
- It's interesting to note that it is still written out for the most part at the bottom,
- while it is somewhere at the top (in compressed form). This indicates that the window
- was too small so the compressor did not see the guid at the top anymore.
- Or that indeed this is the half of a guid pair that was dropped.
- Got to properly decompress before dealing with that.
- Assume that the offset in the file is indeed relevant. Have a script sort files by their very
- first proceed-number, right before the ebx magic. Then manually analyze them. Would be good
- to find files with the number just varying by 1 and with the number to copy varying by 1 too.
- Of course that still requires a lot of understanding of the header which I don't really have.
- Will try anyway.
- import os
- from struct import unpack,pack #convert a sequence of bytes into ints or floats
- from binascii import hexlify #converts several bytes into a string of their hex representation,
-                              #e.g. "\x00\xab"=>"00ab"; and similarly "doc"=>"646f63"
- #utility function, by default Python gives back an error when trying to create a file in a nonexistent folder
- #this creates the folder and then the file; requires another function for long pathnames
- def open2(path,mode="rb"):
-     if mode=="wb":
-         #create folders if necessary and return the file handle
-         #first of all, create one folder level manually because makedirs might fail
-         path=path.replace("/","\\")
-         pathParts=path.split("\\")
-         manualPart="\\".join(pathParts[:2])
-         if not os.path.isdir(manualPart):
-             os.makedirs(manualPart)
-         #now handle the rest, including extra long path names
-         folderPath=lp(os.path.dirname(path))
-         if not os.path.isdir(folderPath): os.makedirs(folderPath)
-     return open(lp(path),mode)
- def lp(path): #long pathnames
-     if len(path)<=247 or path=="" or path[:4]=='\\\\?\\': return path
-     return unicode('\\\\?\\' + os.path.normpath(path))
- #go through all files in the ebx folder
- for dir0, dirs, ff in os.walk("ebx"):
-     for fnames in ff:
-         fname=dir0+"\\"+fnames
-         f=open(fname,"rb")
-         #grab the header
-         decompressedSize=unpack(">I",f.read(4)) #read big endian unsigned int
-         constant=f.read(2) #0970
-         compressedSize=unpack(">H",f.read(2)) #read big endian unsigned half int
-         #check the bytes after the header, using the ebx magic as a reference
-         ebxMagic="\xce\xd1\xb2\x0f"
-         sample = f.read(10)
-         magicPos=sample.find(ebxMagic)
-         f.seek(-10,1) #move back to the start of the number I want, then grab it to order the files
-         proceedNum=hexlify(f.read(magicPos))
-         f2=open2("D:/hexing/bf4 beta dump sorted/bundles/"+proceedNum+"/"+fnames,"wb")
-         f.seek(0)
-         f2.write(f.read()) #copy the entire file to f2
-         f2.close()
-         #this overwrites if files in different beta folders happen
-         #to have the same name, but there should be enough files anyway
- Beautiful: http://i.imgur.com/8ntBZPv.jpg
- Well, that was easy:
- type 80:
- proceed 8 bytes in all 6 files of that type.
- In those files the metadata has the same size as the payload, so the
- payload integer is compressed already.
- Because I know from the decompressed size exactly that this is the case,
- and it's extremely unlikely that after one compressed integer comes another
- number that happens to look the same (so the same number would be copied twice),
- I can assume that the number of bytes to copy is indeed exactly 4.
- There are two different opcodes after proceeding 8 (absolute offset: 11; @11):
- 040020, copy 4, proceed 2
- 040051, copy 4, proceed 5
- type 90:
- There is just one file; as you might imagine this proceeds 9 bytes.
- Which is really odd. The last digit/byte of the integer happens to be
- the same (remember that it's little endian so the last byte is on the left) and the
- compressor saw an option to optimize this.
- @12: 040001, copy 4?, proceed 3
- type a1:
- proceed 0a bytes.
- type a2:
- same wtf
- type b1:
- proceed 0b
- Can't really compare like I'd like to if I don't know the header structure;
- on the other hand, how do I figure out the header structure if it is compressed?
- Well, simple, try to get the header structure by looking at the longer types,
- e.g. f429 probably proceeds past the first 20-30 bytes:
- 12gflechette_bpb.ebx header structure:
- absStringOffset = 4bytes
- lenStringToEOF = 4bytes
- numGUID = 4bytes
- 2bytes, numInstanceRepeater?
- 2bytes, numComplex?
- 2bytes, numField?
- 2bytes, 19
- 2bytes, 58
- 2bytes, size of keyword section?
- 4bytes, 40
- 4bytes, 2a
- 4bytes, 20f0, slightly smaller than lenStringToEOF (2320)
- Try another file to fill in the gaps, bd_buildingskyscrapermatteyellow_top_01.ebx
- has a corresponding alpha file too in contrast to 12gflechette_bpb:
- alpha (all fields are all 4 bytes long):
- self.absStringOffset = 1f40
- self.lenStringToEOF = 510
- self.numGUID = 1
- self.null = null
- self.numInstanceRepeater = 8
- self.numComplex = 36
- self.numField = c8
- self.lenName = e30
- self.lenString = 70
- self.numArrayRepeater = 8
- self.lenPayload = 430
- beta (2 bytes or 4 bytes length):
- self.absStringOffset = 1e40
- self.lenStringToEOF = 430
- self.numGUID = 2
- 2bytes, 08, probably numInstanceRepeater
- 2bytes, 03, no idea
- 2bytes, 08, probably numArrayRepeater
- 2bytes, 34, probably numComplex
- 2bytes, c2, probably numField
- 2bytes, de0, probably size of keyword section
- 4bytes, 40, probably size of string section
- 4bytes, 07, no idea
- 4bytes, 380, probably size of payload (without arrays)
- and the guids are directly after this header
- in particular, the first half of the first guid pair (the file guid) is still there, unchanged
- the second pair is replaced with nulls or something, which is then of course compressed
- This is too confusing as it is, have the script cut off the compression header so it is easier
- to measure the right distances. While I'm at it, replace the ebx magic with the decompressed size.
- That way I can calculate the payload size even when compressed.
- import os
- from struct import unpack,pack #convert a sequence of bytes into ints or floats
- from binascii import hexlify #converts several bytes into a string of their hex representation,
-                              #e.g. "\x00\xab"=>"00ab"; and similarly "doc"=>"646f63"
- #utility function, by default Python gives back an error when trying to create a file in a nonexistent folder
- #this creates the folder and then the file; requires another function for long pathnames
- def open2(path,mode="rb"):
-     if mode=="wb":
-         #create folders if necessary and return the file handle
-         #first of all, create one folder level manually because makedirs might fail
-         path=path.replace("/","\\")
-         pathParts=path.split("\\")
-         manualPart="\\".join(pathParts[:2])
-         if not os.path.isdir(manualPart):
-             os.makedirs(manualPart)
-         #now handle the rest, including extra long path names
-         folderPath=lp(os.path.dirname(path))
-         if not os.path.isdir(folderPath): os.makedirs(folderPath)
-     return open(lp(path),mode)
- def lp(path): #long pathnames
-     if len(path)<=247 or path=="" or path[:4]=='\\\\?\\': return path
-     return unicode('\\\\?\\' + os.path.normpath(path))
- #go through all files in the ebx folder
- for dir0, dirs, ff in os.walk("ebx"):
-     for fnames in ff:
-         fname=dir0+"\\"+fnames
-         f=open(fname,"rb")
-         #grab the header
-         #totally forgot about indexing, need to take element 0; peculiarity of the unpack library.
-         decompressedSize=unpack(">I",f.read(4))[0] #read big endian unsigned int
-         constant=f.read(2) #0970
-         compressedSize=unpack(">H",f.read(2))[0] #read big endian unsigned half int
-         #check the bytes after the header, using the ebx magic as a reference
-         ebxMagic="\xce\xd1\xb2\x0f"
-         sample = f.read(10)
-         magicPos=sample.find(ebxMagic)
-         f.seek(-10,1) #move back to the start of the number I want, then grab it to order the files
-         proceedNum=hexlify(f.read(magicPos))
-         f2=open2("D:/hexing/bf4 beta dump sorted/bundles/"+proceedNum+"/"+fnames,"wb")
-         f.seek(4,1) #don't go back to the start, instead move 4 bytes too (past the ebx magic)
-         f2.write(pack("I",decompressedSize)) #write as little endian so it is easier to read when next to the other LE stuff
-         f2.write(f.read()) #copy the entire file to f2
-         f2.close()
-         #this overwrites if files in different beta folders happen
-         #to have the same name, but there should be enough files anyway
- As a reference, a valid ebx header of a f429 file (without the compression header):
- CED1B20F 000F0000 20230000 01000000
- 0400 0200 0400 1900 5800 8005 40000000
- 2A000000 F0200000
- Give them some letters:
- a        b        c        d
- e    f    g    h    i    j    k
- l        m
- a) ebx magic
- b) meta size
- c) payload size (meta + payload = file size)
- d) number of external guid pairs (each pair is 20 bytes in total;
- and there is one internal guid for the file itself)
- e) numInstanceRepeater
- f) ?
- g) numArrayRepeater
- h) numComplex
- i) numField
- j) keyword section size
- k) string section size
- l) ?
- m) payload size without arrays
- => 8 bytes less compared to the bf3 ebx header
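- A minimal sketch of reading this new header from already decompressed data, using the letters above (f and l remain unknown, the other field names are my guesses):
- from struct import unpack
- def readBetaEbxHeader(f): #f must provide decompressed data positioned at the ebx magic
-     assert f.read(4)=="\xce\xd1\xb2\x0f" #a) ebx magic
-     vals=unpack("<3I6H3I",f.read(36)) #everything after the magic is little endian
-     metaSize, payloadSize, numGUID = vals[0:3] #b) c) d)
-     numInstanceRepeater, unknown1, numArrayRepeater, numComplex, numField, lenName = vals[3:9] #e) f) g) h) i) j)
-     lenString, unknown2, lenPayload = vals[9:12] #k) l) m)
-     return metaSize, payloadSize, numGUID, lenName, lenString, lenPayload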
- (a1 type) fontcollection_zh_fontmapcollectionwin32:
- proceed 0a
- 70030000 70020000 0001 030021 00010200F314
- 030021 must be a substitute for something that's 4 bytes or longer.
- It refers to three bytes in the past. So at least one byte must be copied twice.
- I.e. 030021 refers to 000001 + at least 00 (and would then continue 0001, 000001, 000001)
- 70030000 70020000 00010000 0100 (,0001,000001,...) 0001020200F314
- Ignore the first three parts, so the next entry is the number of ext guids
- 0100 (,0001,000001,...) 0001020200F314
- the number of ext guids is obviously very small
- so 01000001 does not work, because it is a gigantic number (16777217 in decimal)
- it must be 010000, then another null is added from the byte afterward
- 01000000 01 0200F314
- Therefore:
- a1: proceed 0a
- 030021: move 3, copy 5, proceed 1
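- As a toy illustration of such an overlapping copy (offset 3, copy 5), re-reading bytes that were written moments ago:
- out=bytearray(b"\x00\x00\x01") #the already decompressed data ends in 000001
- start=len(out)-3 #move back 3 bytes
- for i in xrange(5): #copy 5 bytes, re-reading bytes appended within this very loop
-     out.append(out[start+i])
- #out is now 00 00 01 00 00 01 00 00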
- (a2 type) commanderkit:
- proceed 0a too
- fuck, fuck, fuck
- There are 11576 files with a2, and just 125 with a1.
- So it's at least not seeded by the filename or crap like that.
- A01C0000 E01B0000 C000 010011 02 0200F213
- 010011, move 1, copy 4 or more, proceed 1
- This file has no external guids so decompressed it is
- A01C0000 E01B0000 C0000000 00000000 02 0200F213
- i.e. 010011, move 1, copy 6, proceed 1
- So basically these 11576 files are all files with no external guid.
- c0 type:
- proceed 0c
- lav25_mesh:
- 070011, copy 4, proceed 1
- layer9_homebase_ch:
- 0700F31A, copy 4?, proceed?
- no idea
- proceeds:
- f011: proceed 20
- f012: proceed 21
- f015: proceed 24
- f016: proceed 25
- f004: proceed 13
- by that logic, f000: proceed 0f
- or more generally:
- f0xy: proceed f + xy
- and any values below that can be written directly as 80, e0, etc. to proceed 8, e
- yup, indeed varint.
- f429: 38
- f42a: 39
- So the second half of the first byte is not relevant for the proceed, it must have
- another purpose!
- Oh man. This looks promising.
- When an entry says 0100F23C, it means:
- move 1, copy f(2) with some function f yet to be determined, proceed f+3c=4b
- yup, works correctly along several entries.
- Still got to figure out the function. And why the very first opcode doesn't always have 0 there.
- The most reasonable guess is f(x)=x+4, because an opcode takes at least 3 bytes, so copying fewer than 4 bytes wouldn't be worth it.
- The first opcode probably contains the bytes to copy for the next opcode.
- Err... I had
- a1, 030021: move 3, copy 5, proceed 1
- a2, 010011, move 1, copy 6, proceed 1
- So the 1 in a1 means copy 5, the 2 in a2 means copy 6.
- So it was the typical "these elements are placed together so they belong together" syndrome; fool me twice :(
- Yup, have tested a small sample and it works. That fully explains it, well almost. The question
- is what happens when the first byte is ff. I assume it then extends over three bytes or something like that.
- It's not important yet to figure out the details, just make sure the script will
- give an error when that happens so I can look into it.
- Write a script to decompress all the blocks of each file. Need to see what happens at ff.
- Also tidy up the script a lot. Decompress the files until an error occurs. Don't create
- the decompressed files yet; only when an error occurs, create a debug file to get the
- decompressed data until the error.
- Script with some improvements made due to errors (that will be stated shortly):
- import os
- from struct import unpack,pack #convert a sequence of bytes into ints or floats
- from binascii import hexlify #converts several bytes into a string of their hex representation,
- from cStringIO import StringIO #create something that has the same functions as a file but is in memory only
- def readNum(f): #when byte is ff, read one more byte until not ff, add all
-     total=0
-     while 1:
-         byte=ord(f.read(1))
-         total+=byte
-         if byte!=0xff: return total
- #go through all files in the ebx folder
- for dir0, dirs, ff in os.walk("ebx"):
-     for fname in ff:
-         f=open(dir0+"\\"+fname,"rb")
-         #cheap way to get the end-of-file, i.e. the size of the file
-         f.seek(0,2)
-         EOF=f.tell()
-         f.seek(0)
-         decompressedStream=StringIO() #write the decompressed data into memory only
-         while f.tell()<EOF:
-             #grab the header of a compressed block
-             decompressedSize, constant, compressedSize = unpack(">IHH",f.read(8)) #const 0970
-             blockOffset=f.tell()
-             if constant!=0x970: print "Block header constant is not 970, the script may or may not work correctly."
-             #go from one opcode to the next and write the decompressed data into the stream until the block is done
-             while f.tell()-blockOffset<compressedSize:
-                 #read the length byte as a number, then split it in two numbers (from 0 to f each) with bitmasking and shifting
-                 lengthByte=ord(f.read(1)) #e.g. 9e
-                 proceedSize=lengthByte>>4 #=> 09
-                 copySize =lengthByte&0xf #=> 0e
-                 #the revised version to deal with larger numbers; the original one is in the comment below
-                 if proceedSize==0xf:
-                     proceedSize+=readNum(f)
-                 ## #add the next byte to the proceedSize if the half-byte is f.
-                 ## #Raise some errors if something new pops up so I can investigate.
-                 ## if proceedSize==0xf:
-                 ##     nextByte=ord(f.read(1))
-                 ##     if nextByte==0xff: #the byte behaved normally when reaching a number larger than 128, so now check ff
-                 ##         print dir0, fname, f.tell()
-                 ##         f2=open("debug","wb")
-                 ##         f2.write(decompressedStream.getvalue())
-                 ##         f2.close()
-                 ##         proceedSizeIsFF
-                 ##     else: proceedSize+=nextByte
-                 #### if copySize==0xf:
-                 ####     print dir0, fname, f.tell()
-                 ####     f2=open("debug","wb")
-                 ####     f2.write(decompressedStream.getvalue())
-                 ####     f2.close()
-                 ####     copySizeIsF
-                 decompressedStream.write(f.read(proceedSize))
-                 pos0=decompressedStream.tell()
-                 #this check was added later on, the very last bytes in the block are not compressed so there is no offset to read
-                 if f.tell()-blockOffset==compressedSize:
-                     break
-                 offset=unpack("H",f.read(2))[0]
-                 #the revised version to deal with larger numbers; the original one is in the comment below
-                 if copySize==0xf:
-                     copySize+=readNum(f)
-                 ## #might be a varint. Not sure if this case happens even once
-                 ## if copySize==0xf:
-                 ##     print "#########"
-                 ##     copySummand=ord(f.read(1))
-                 ####     print copySummand, dir0, fname, f.tell()
-                 ##     if copySummand>>7: #what happens if the first bit is set
-                 ##         print dir0, fname, f.tell()
-                 ##         f2=open("debug","wb")
-                 ##         f2.write(decompressedStream.getvalue())
-                 ##         f2.close()
-                 ##         asdfasdf
-                 ##     copySize+=copySummand
-                 copySize+=4
-                 decompressedStream.seek(-offset,1) #go back to copy the data
-                 #make several copies if necessary
-                 if offset<copySize:
-                     times=copySize/offset
-                     rest=copySize%offset
-                     copy=decompressedStream.read(copySize)
-                     decompressedStream.seek(pos0)
-                     for i in xrange(times): decompressedStream.write(copy)
-                     decompressedStream.write(copy[:rest])
-                 else:
-                     copy=decompressedStream.read(copySize)
-                     decompressedStream.seek(pos0)
-                     decompressedStream.write(copy)
-         f.close()
- Errors encountered:
- ebx crowsweaponhudlogic.ebx 373
- NameError: name 'copySizeIsF' is not defined
- i.e. at offset 373 (decimal) in that file in the main ebx folder, an entry has copysize f.
- So what happens in that case?
- the most recent bytes:
- FieldAccessType.FieldAccessType_Source.FieldAccessType_
- the problematic expression:
- 6f Target 2E0004 And
- which the alpha files say becomes \x00FieldAccessType_SourceAnd (note that the Target here belongs to
- the FieldAccessType_Target once the 6f is removed; opcodes are not decompressed after all).
- Therefore, 17 bytes are needed to produce \x00FieldAccessType_Source
- And the info is to proceed 6 bytes (past Target), and copy f (whatever that means) with an
- offset of 2e.
- Alright, assume the same model as before, with copy = f + 4 = 13.
- So it seems that it has to a single bye is placed directly behind the offset which is
- then added to the number of bytes to copy.
- Therefore it copies f+4+4 = 17 bytes.
- So the proceed length is directly extended after the length byte, whereas the
- copy length is extended at the end, even after the offset.
- The next question is how the lengths are extended in detail. Either the extended bytes
- are some sort of varint (so the first bit specifies whether to read one more byte or not)
- or they just go from 0 to ff, with ff meaning that the next byte will be read too.
- Another error occurred towards the end of the file.
- The very last few bytes in that file were not compressed and there was no offset given afterwards.
- Which means half a byte was wasted because it specified the number of bytes to copy,
- while nothing was copied at all.
- Anyway, don't try to read an offset then.
- Though I wonder what happens if the last few bytes in a file are indeed compressed.
- Next error (in another file, so apparently one file was fully decompressed already):
- The second byte of the proceed size had its first bit set (could have been a varint).
- However, when investigating, the file behaved normally. So it should
- behave normally at least until it reaches ff (thus not a varint).
- Next error:
- uiawardsoverlaylogic.ebx has its extra copy byte with its first bit set @8796 (decimal).
- Can't make any sense of it directly though because it's deep in the data.
- However, the copy byte is set to exactly ff, followed by fb. If it was a varint,
- the number would actually be way longer than the entire file in its
- alpha version. Therefore read the number fffb as ff + fb (with ff also indicating that
- the next byte, fb, is to be read). And apply that system to the proceed size too.
- As a function:
- def readNum(f): #when byte is ff, read one more byte until not ff, add all
-     total=0
-     while 1:
-         byte=ord(f.read(1))
-         total+=byte
-         if byte!=0xff: return total
- And with that, the script can handle thousands of files.
- One file remains:
- ebx\ui\static\sharedicons.ebx
- Part of the file gives me back a warning about the block header not being 970.
- This file is also the largest beta ebx file there is, being 347 kb.
- The header of the second block: 00010000 00710000 3878
- It's in some odd section containing random ascii letters. Skipping 7100 does not
- get me to the start of the next section, however skipping 10000 does. So 00710000
- says that this block is not compressed (the 3878 is part of the payload already).
- Summarized, then:
- As of the beta of Battlefield 4, the ebx files (containing binary XML) are compressed with an LZ77 algorithm.
- A compressed file consists of several blocks, with no global metadata.
- The blocks are set to have a size of 0x010000 when decompressed, except for the last one which is usually smaller.
- Structure of a compressed block (big endian):
- 4 bytes: decompressed size (0x10000 or less)
- 2 bytes: compression type (0970 for LZ77, 0071 for uncompressed data)
- 2 bytes: compressed size (0000 for uncompressed data) of the payload (i.e. without the header)
- compressed payload
- Decompress each block and glue the decompressed parts together to obtain the file.
- The compression is an LZ77 variant. It requires 3 parameters:
- Copy offset: Move backwards by this amount of bytes and start copying a certain number of bytes following that position.
- Copy length: How many bytes to copy. If the length is larger than the offset, start at the offset again and copy the same values again.
- Proceed length: The number of bytes that were not compressed and can be read directly.
- Note that the offset is defined in regards to the already decompressed data which e.g. does not contain any compression metadata.
- The three values are split up, however: the copy length and proceed length are
- stated together in a single byte before an uncompressed section, while the relevant offset
- is given after that uncompressed section:
- Use the proceed length to read the uncompressed data, at which point you arrive at the start of the offset value.
- Read this value, then move to the offset and copy a number of bytes (given by copy length)
- to the decompressed data. Afterwards, the next copy and proceed length are given and the process starts anew.
- The offset has a constant size of 2 bytes, in little endian.
- The two lengths share the same byte. The first half of the byte belongs to the proceed length,
- whereas the second half belongs to the copy length.
- When the half-byte of the proceed length is f, then the length is extended by another byte,
- which is placed directly after the byte that contains both lengths. The value of that byte
- is added to the value of the proceed length (i.e. f). However, if the extra byte is ff, one more
- byte is read (and so on) and all values are added together.
- The copy length can be extended in the same manner. However, the possible extra bytes are
- located at the end, right after the offset.
- Additionally, a constant value of 4 is added to obtain the actual copy length.
- Finally, it is possible that a file ends without specifying an offset (as the last few bytes
- in the file were not compressed). The proceed length is not affected by that (and the copy
- length is of no relevance).
- As an example, consider the length byte B2:
- Proceed length: B
- Copy length: 2 + 4 = 6
- Another example, F23C:
- Proceed length: F + 3C = 4B
- Copy length: 2 + 4 = 6
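- The same split as a tiny sketch (extension handling left out):
- def splitLengthByte(b):
-     proceed=b>>4 #upper half: proceed length; f means extension bytes follow
-     copy=(b&0xf)+4 #lower half plus the constant 4: copy length
-     return proceed,copy
- print splitLengthByte(0xB2) #(11, 6), i.e. proceed B, copy 6
- print splitLengthByte(0xF2) #proceed part is F, so the real proceed is F plus the following byte(s)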
- A full example (the whitespace is there to separate hex from ascii; it doesn't count):
- 0000001a 0970 0018 80 minimap. 0800 51 ature 0a00 40 mize
- Header:
- Decompressed size 1a
- LZ77 compression (due to 0970)
- Compressed size 18
- Payload:
- Compressed stream: 80 minimap. 0800 51 ature 0a00 40 mize
- Decompressed stream: *empty*
- The decompression is sequential, start with the left part:
- 80 minimap. 0800
- Read 8 uncompressed bytes into the decompressed stream.
- Decompressed stream: minimap.
- Move back by 8 bytes in the decompressed stream (to the start)
- and copy 4 bytes (mini) to the decompressed stream.
- Compressed stream: 51 ature 0a00 40 mize
- Decompressed stream: minimap.mini
- Perform the same step again:
- 51 ature 0a00
- Read 5 uncompressed bytes into the decompressed stream.
- Decompressed stream: minimap.miniature
- Move back by 0a bytes in the decompressed stream
- and copy 5 bytes (.mini) to the decompressed stream.
- Compressed stream: 40 mize
- Decompressed stream: minimap.miniature.mini
- Read 4 uncompressed bytes into the decompressed stream (with no offset specified).
- Decompressed stream: minimap.miniature.minimize
- Clean up the script. Have it create a new folder with all decompressed ebx
- to investigate the changes to the ebx format:
- import os
- from struct import unpack,pack
- from cStringIO import StringIO
- def open2(path,mode="rb"): #when used to write, create folders too
-     if mode=="wb":
-         #create folders if necessary and return the file handle
-         #first of all, create one folder level manually because makedirs might fail
-         path=path.replace("/","\\")
-         pathParts=path.split("\\")
-         manualPart="\\".join(pathParts[:2])
-         if not os.path.isdir(manualPart):
-             os.makedirs(manualPart)
-         #now handle the rest, including extra long path names
-         folderPath=lp(os.path.dirname(path))
-         if not os.path.isdir(folderPath): os.makedirs(folderPath)
-     return open(lp(path),mode)
- def lp(path): #long pathnames
-     if len(path)<=247 or path=="" or path[:4]=='\\\\?\\': return path
-     return unicode('\\\\?\\' + os.path.normpath(path))
- def readNum(f): #when byte is ff, read one more byte until not ff, add all
-     total=0
-     while 1:
-         byte=ord(f.read(1))
-         total+=byte
-         if byte!=0xff: return total
- def decompressLZ77(f,fileSize=None):
-     #takes a file handle, gives back a decompressed string
-     #allow file size to be specified to work from within archives
-     #if file size not specified, get it now (will only work correctly on single files)
-     if fileSize==None:
-         f.seek(0,2)
-         fileSize=f.tell()
-         f.seek(0)
-     fileOffset=f.tell() #0 for single files, much greater for archives
-     #write the decompressed data into memory only, eventually return it
-     decompressedStream=StringIO()
-     #go through each block, filling the decompressed stream with data
-     while f.tell()-fileOffset<fileSize:
-         #grab the header of a compressed block
-         decompressedSize, compressionType, compressedSize = unpack(">IHH",f.read(8))
-         if compressionType==0x71:
-             decompressedStream.write(f.read(decompressedSize))
-             continue
-         elif compressionType!=0x970: print "Unknown compression type: "+str(compressionType)
-         #from here on, LZ77
-         #go from one opcode to the next and write the decompressed data into the stream until the block is done
-         blockOffset=f.tell()
-         while f.tell()-blockOffset<compressedSize:
-             #retrieve the two sizes from a single byte
-             lengthByte=ord(f.read(1)) #e.g. 9e
-             proceedSize=lengthByte>>4 #=> 09
-             copySize =lengthByte&0xf #=> 0e
-             if proceedSize==0xf: proceedSize+=readNum(f)
-             #add the uncompressed data to the stream
-             decompressedStream.write(f.read(proceedSize))
-             #it's possible that the very last bytes in the block are not compressed
-             #so there is no offset to read; handle this case
-             if f.tell()-blockOffset==compressedSize: break
-             pos0=decompressedStream.tell() #data will be written to the end of the stream, so take note of it
-             offset=unpack("H",f.read(2))[0]
-             if copySize==0xf: copySize+=readNum(f)
-             copySize+=4
-             decompressedStream.seek(-offset,1) #go back to copy the data
-             #make several copies if necessary
-             if offset<copySize:
-                 times=copySize/offset
-                 rest=copySize%offset
-                 copy=decompressedStream.read(offset) #either read offset or copySize; read() will yield the same
-                 decompressedStream.seek(pos0)
-                 for i in xrange(times): decompressedStream.write(copy)
-                 decompressedStream.write(copy[:rest])
-             else:
-                 copy=decompressedStream.read(copySize)
-                 decompressedStream.seek(pos0)
-                 decompressedStream.write(copy)
-     return decompressedStream.getvalue()
- #go through all files in the ebx folder
- for dir0, dirs, ff in os.walk(r"D:\hexing\bf4 beta dump\bundles\ebx"):
-     for fname in ff:
-         fullPath=dir0+"\\"+fname
-         writePath=fullPath.replace(r"bf4 beta dump\bundles\ebx","bf4 decompressed ebx")
-         f=open(dir0+"\\"+fname,"rb")
-         data=decompressLZ77(f)
-         f.close()
-         f2=open2(writePath,"wb")
-         f2.write(data)
-         f2.close()
- The files certainly look more tolerable now:
- Before: http://i.imgur.com/RwjMdgi.png
- After: http://i.imgur.com/xHkGjOD.jpg
- Supplement:
- While dealing with the new patched cas-enabled sbtoc I've stumbled upon two more compression types.
- Type 0070 is almost the same as type 0071, but for 0070 the compressed size equals the decompressed size
- whereas the compressed size is zero for type 0071.
- Another type is 0000, which only occurs when decompressed and compressed size are null too.
- Basically there are 8 nullbytes. In this case, return an empty string.
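- A rough sketch of how the block type dispatch could cover all four types seen so far (decompressLZ77Block is a made-up name standing in for the per-block LZ77 loop from the script above):
- from struct import unpack
- def readBlock(f,decompressedStream):
-     decompressedSize,compressionType,compressedSize=unpack(">IHH",f.read(8))
-     if compressionType==0x0000:
-         pass #eight nullbytes in total, the block contributes an empty string
-     elif compressionType in (0x0070,0x0071):
-         #raw payload; for 0070 compressedSize equals decompressedSize, for 0071 it is zero
-         decompressedStream.write(f.read(decompressedSize))
-     elif compressionType==0x0970:
-         decompressLZ77Block(f,decompressedStream,compressedSize) #hypothetical helper: the LZ77 loop from above
-     else:
-         print "Unknown compression type: "+hex(compressionType)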