\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{natbib}
\usepackage{graphicx}
\usepackage{authblk}
\usepackage{pgfplots}
\setcitestyle{square}
\begin{document}
\author{Youssef Walid Hassan}
\affil{[email protected] \\ ID: 9181668 \\ Section: 36 - BN: 2}
\title{Compressing enwik8 using known compression algorithms}
\maketitle
\section{Introduction}
Compressing data has been a challenge for many years: data sizes keep growing, and so does our reliance on transferring data over networks. Many compression algorithms have been developed, each with its own strengths and weaknesses. In this article we discuss compressing enwik8, a file containing the first 100 MB of an English Wikipedia dump from March 3, 2006 \cite{mattmahoney}.
\section{Case Analysis}
Studying the enwik8 file, we immediately notice that it is full of XML markup, consisting of opening and closing tags built from the same tag word, e.g. \texttt{<example></example>}. These tags recur throughout the file, which directs our attention to dictionary coding techniques: they should perform much better here than other approaches. The file also consists mostly of English text (with other languages expressed in UTF-8), which makes dictionary coding even more effective. We could also use Prediction by Partial Matching (PPM), but the existing implementations have long running times, which makes them impractical here.
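To illustrate the point, consider a fragment built from a repeated tag (a hypothetical example; the actual dump uses MediaWiki tags such as \texttt{<page>}, \texttt{<title>}, and \texttt{<text>}):
\begin{verbatim}
<title>Alpha</title> <title>Beta</title> <title>Gamma</title>
\end{verbatim}
A dictionary coder pays the full cost of \texttt{<title>} only once: while encoding the first occurrence it learns substrings such as \texttt{<t}, \texttt{ti}, \texttt{it}, \ldots, so each later occurrence is covered by a few dictionary references, and eventually by a single one, rather than by seven literal characters.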
\section{The Algorithm}
There are many dictionary coding techniques, most of them variations of LZ77 and LZ78. We chose to work with LZW, an improved version of LZ78, because LZW is adaptive by nature, achieves a relatively high compression ratio within its family, and is well optimized compared to its relatives. We can improve LZW further with variable-length codes: the bit width of a code word grows as the dictionary fills, instead of fixing a width up front and wasting space. We implemented the algorithm in C++ for speed and kept the code modular to ease future improvements.
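The following is a minimal sketch of the encoder core with variable-width codes, written to illustrate the technique rather than reproduce our benchmarked implementation; the function name, the 9-bit starting width, and the absence of a dictionary-size cap are illustrative choices.
\begin{verbatim}
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal LZW encoder with variable-width codes: the width starts at
// 9 bits (256 literals plus the first dictionary entries) and grows as
// the dictionary fills, instead of being fixed up front.
std::vector<uint8_t> lzw_compress(const std::string& input) {
    std::unordered_map<std::string, uint32_t> dict;
    for (uint32_t c = 0; c < 256; ++c)
        dict[std::string(1, static_cast<char>(c))] = c;

    uint32_t next_code = 256;
    uint32_t width = 9;            // current code width in bits
    std::vector<uint8_t> out;
    uint64_t bitbuf = 0;
    int bitcount = 0;

    auto emit = [&](uint32_t code) {
        bitbuf |= static_cast<uint64_t>(code) << bitcount;
        bitcount += width;
        while (bitcount >= 8) {    // flush whole bytes
            out.push_back(static_cast<uint8_t>(bitbuf & 0xFF));
            bitbuf >>= 8;
            bitcount -= 8;
        }
    };

    std::string w;                 // longest match seen so far
    for (char ch : input) {
        std::string wc = w + ch;
        if (dict.count(wc)) {
            w = wc;                // extend the current match
        } else {
            emit(dict[w]);         // output code for the longest match
            dict[wc] = next_code++;                   // learn the new string
            if (next_code == (1u << width)) ++width;  // widen as dictionary grows
            w = std::string(1, ch);
        }
    }
    if (!w.empty()) emit(dict[w]);
    if (bitcount > 0) out.push_back(static_cast<uint8_t>(bitbuf & 0xFF));
    return out;
}
\end{verbatim}
A decoder must mirror the same width-growth rule so that both sides agree on exactly when code words widen.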
\section{Results}
We benchmarked our algorithm against 7z, rar, and zip on a machine with the following specifications:
\begin{center}
\begin{tabular}{ |p{1.85cm}|p{7cm}| }
\hline
\multicolumn{2}{|c|}{Computer Specifications} \\
\hline
CPU & Intel(R) Core(TM) i7-4771 CPU @ 3.70GHz \\
\hline
Chipset & Gigabyte Z87X-D3H-CF \\
\hline
Memory & 4x4 GB DDR3 Kingston HyperX @ 2400 MHz \\
\hline
GPU & NVIDIA GeForce GTX 760 \\
\hline
Drive & WDC WD1003FZEX-00MK2A0 7200 RPM \\
\hline
\end{tabular}
\end{center}
Compiled with the \texttt{-O3} optimization level and running on a single thread, our algorithm achieved the following results compared to 7z, rar, and zip. Note that our algorithm is single-threaded, while rar, 7z, and zip are all multi-threaded, which gives them a large advantage in running time.
\begin{center}
\begin{tabular}{ |p{1.8cm}|p{3cm}|p{2.5cm}|p{2.5cm}| }
\hline
\multicolumn{4}{|c|}{Benchmark Comparison Between Algorithms} \\
\hline
Algorithm & Compression Ratio & Encoding Time & Decoding Time \\
\hline
LZW & 2.905 & 51.7424 s & 9.83825 s \\
\hline
RAR & 3.448 & 4.71 s & $<$ 1 s \\
\hline
ZIP & 2.755 & 1.4 s & $<$ 1 s \\
\hline
7z & 3.834 & 29.4 s & 1.2 s \\
\hline
\end{tabular}
\end{center}
As the table shows, our algorithm achieved a better compression ratio than ZIP, but fell short of RAR and 7z on both ratio and speed.
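Concretely, taking the compression ratio in the usual sense of original size over compressed size, our ratio of 2.905 means the 100 MB input shrinks to roughly
$$\frac{100\ \mathrm{MB}}{2.905} \approx 34.4\ \mathrm{MB},$$
while 7z at 3.834 reaches roughly $100 / 3.834 \approx 26.1$ MB.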
\begin{tikzpicture}
\begin{axis}[
    ybar,
    xtick=data,
    ymin=2,
    ymax=4,
    xmin=0.5,
    xmax=4.5,
    ymajorgrids=true,
    bar width=0.5cm,
    ylabel={Compression Ratio},
    ylabel style={yshift=-3mm},
    xlabel={Compression Ratio Chart},
    xtick align=inside,
    xticklabels={LZW, RAR, ZIP, 7z},
    nodes near coords,
    nodes near coords align={vertical},
    x tick label style={font=\normalsize, rotate=0, anchor=north},
    clip=false
]
\addplot coordinates {(1,2.905) (2,3.448) (3,2.755) (4,3.834)};
\end{axis}
\end{tikzpicture}

\begin{tikzpicture}
\begin{axis}[
    ybar,
    xtick=data,
    ymin=0,
    ymax=60,
    xmin=0.5,
    xmax=4.5,
    ymajorgrids=true,
    bar width=0.5cm,
    ylabel={Encoding Time (s)},
    ylabel style={yshift=-3mm},
    xlabel={Encoding Time Chart},
    xtick align=inside,
    xticklabels={LZW, RAR, ZIP, 7z},
    nodes near coords,
    nodes near coords align={vertical},
    x tick label style={font=\normalsize, rotate=0, anchor=north},
    clip=false
]
\addplot coordinates {(1,51.7424) (2,4.71) (3,1.4) (4,29.4)};
\end{axis}
\end{tikzpicture}

\begin{tikzpicture}
\begin{axis}[
    ybar,
    xtick=data,
    ymin=0,
    ymax=11,
    xmin=0.5,
    xmax=4.5,
    ymajorgrids=true,
    bar width=0.5cm,
    ylabel={Decoding Time (s)},
    ylabel style={yshift=-3mm},
    xlabel={Decoding Time Chart},
    xtick align=inside,
    xticklabels={LZW, RAR, ZIP, 7z},
    nodes near coords,
    nodes near coords align={vertical},
    x tick label style={font=\normalsize, rotate=0, anchor=north},
    clip=false
]
\addplot coordinates {(1,9.83825) (2,0.8) (3,0.7) (4,1.2)};
\end{axis}
\end{tikzpicture}
\subsection{Future Work}
The running time can be improved by compressing independent blocks on multiple threads, instead of doing all the work on a single thread, which is the main reason we fall behind in speed. The compression ratio can also be improved by using multiple contexts instead of a single one; a sketch of the multi-threaded direction follows.
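The sketch below (not implemented or benchmarked here; the chunk size and framing are illustrative assumptions) splits the input into independent blocks and compresses each on its own thread:
\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Block-parallel compression sketch: each chunk is compressed
// independently, so the blocks can be processed concurrently.
// A real container format would also record per-chunk lengths.
std::vector<std::vector<uint8_t>> compress_parallel(
        const std::string& input,
        std::vector<uint8_t> (*compress)(const std::string&),
        std::size_t chunk_size = 1 << 22) {   // 4 MB chunks (illustrative)
    std::size_t n_chunks = (input.size() + chunk_size - 1) / chunk_size;
    std::vector<std::vector<uint8_t>> blocks(n_chunks);
    std::vector<std::thread> workers;

    for (std::size_t i = 0; i < n_chunks; ++i) {
        workers.emplace_back([&, i] {
            // Each thread writes only its own slot, so no locking is needed.
            std::string chunk = input.substr(i * chunk_size, chunk_size);
            blocks[i] = compress(chunk);  // e.g. the lzw_compress sketch above
        });
    }
    for (auto& t : workers) t.join();
    return blocks;
}
\end{verbatim}
The trade-off is that every block starts with an empty dictionary, sacrificing some compression ratio for a near-linear speedup in encoding time.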
\begin{thebibliography}{00}
\bibitem{mattmahoney}
Matt Mahoney (2011, September 1). About the Test Data. Retrieved from http://mattmahoney.net/dc/textdata.html
\end{thebibliography}
\end{document}