SHARE
TWEET

Untitled

Shad3yz Jun 5th, 2020 1,817 Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
  1. \documentclass{article}
  2. \usepackage[utf8]{inputenc}
  3. \usepackage{natbib}
  4. \usepackage{graphicx}
  5. \usepackage{authblk}
  6. \usepackage{pgfplots}
  7. \setcitestyle{square}
  8.  
  9.  
  10. \begin{document}
  11. \author{Youssef Walid Hassan}
  12. \affil{yousefwalid98@hotmail.com \\ ID: 9181668 \\ Section: 36 - BN: 2}
  13. \title{Compressing enwik8 using known compression algorithms}
  14. \maketitle
  15.  
  16. \section{Introduction}
  17. Compressing data has been a challenge for many years now as data sizes are getting bigger and our reliance on transferring data through networks is getting more and more bigger. We have come up with many compression algorithms each with strong and weak points. In this article we discuss compressing enwik8, a file containing the first 100 MBs of the English Wikipedia dump on Mar. 3, 2006 \cite{mattmahoney}.
  18.  
  19. \section{Case Analysis}
  20.  
  21. Studying the enwik8 file, we immediately notice that it is filled with XML tags which consists of opening and closing tags with the same tag word. e.g. $<example></example>$ which will provide us with a lot of reoccurrences of the tags. Following up on that observation, we shift our attention to dictionary coding techniques which will perform much better than other techniques, lets also bear in mind that the file mostly contains English language text (and other languages expressed in UTF-8 format) which will make dictionary coding more stronger. We can also use a Prediction by Partial Matching technique (PPM) but the existing algorithms have long running times which makes it impractical to use.
  22.  
  23. \section{The Algorithm}
  24.  
  25. There are many dictionary coding techniques, which are mostly variations of LZ77 and LZ78. We chose to work with LZW, an improved version of LZ78. The reason behind our choice was that LZW is adaptive by nature, it achieves a relatively high compression ratio compared to its family, and it is very optimized compared to others. We can improve LZW even further by using a variable length coding technique, which means increasing the size of the bit word as the dictionary size increases instead of choosing a fixed size from the beginning and wasting space. We preferred to code the algorithm in C++ to gain speed and make the code more modular to ease improving on it.
  26.  
  27. \section{Results}
  28.  
  29. We benchmarked the algorithm alongside other algorithms like 7z, rar, and zip, on a machine with the following specifications:\
  30.  
  31. $$\begin{tabular}{ |p{1.85cm}|p{7cm}| }
  32. \hline
  33. \multicolumn{2}{|c|}{Computer Specifications} \\
  34. \hline
  35.    CPU & Intel(R) Core(TM) i7-4771 CPU @ 3.70GHz\\
  36. \hline
  37.    Chipset & Gigabyte Z87X-D3H-CF\\
  38. \hline
  39.    Memory & 4x4 DDR3 Kingston HyperX @ 2400 MHZ\\
  40. \hline
  41.    GPU & NVIDIA GeForce GTX 760\\
  42. \hline
  43.    Drive & WDC WD1003FZEX-00MK2A0 7200RPM\\
  44. \hline
  45. \end{tabular}$$
  46. Using O3 C++ compiler optimizations, running on a single thread, our algorithm achieved the following results compared to 7z, rar, and zip. It is to be noted that our algorithm is single-threaded while rar, 7z and zip all run on multi threads which give them a huge advantage in time.
  47. $$\begin{tabular}{ |p{1.8cm}|p{3cm}|p{2.5cm}|p{2.5cm}| }
  48. \hline
  49. \multicolumn{4}{|c|}{Time Benchmark Between Algorithms} \\
  50. \hline
  51.    Algorithm & Compression Ratio & Encoding Time & Decoding Time \\
  52. \hline
  53.    LZW & 2.905 & 51.7424 s & 9.83825 s \\
  54. \hline
  55.    RAR & 3.448 & 4.71 s & $<$ 1 s \\
  56. \hline
  57.    ZIP & 2.755 & 1.4 s & $<$ 1 s \\
  58. \hline
  59.    7z & 3.834 & 29.4 s & 1.2 s \\
  60. \hline
  61. \end{tabular}$$
  62.  
  63. As seen in the table, we can notice that our algorithm achieved better results than ZIP, but it was worse than RAR and 7z.
  64.  
  65. \begin{tikzpicture}[
  66.    Levels/.style = {above right, font=\footnotesize, align=left}% <-- added
  67.                     ]
  68.     \begin{axis}[
  69.    ybar,
  70.    xtick=data,
  71.    ymin = 2,
  72.    ymax = 4,
  73.    xmin=0.5,
  74.    xmax=4.5,
  75.    ymajorgrids=true,
  76.    bar width=0.5cm,
  77.    ylabel={Compression Ratio},
  78.    ylabel style={yshift=-3mm},% <-- changed
  79.     xlabel = {Compression Ratio Chart},
  80.    xtick align=inside,
  81.    xticklabels={LZW, RAR, ZIP, 7z},
  82.    nodes near coords,
  83.    nodes near coords align={vertical},
  84.    x tick label style={font=\normalsize, rotate=0, anchor=north},
  85.     clip=false% <-- added
  86.     ]
  87.     \addplot coordinates {(1,2.905) (2,3.448) (3,2.755) (4, 3.834)};
  88. \end{axis}
  89. \end{tikzpicture}
  90. \
  91.  
  92. \begin{tikzpicture}[
  93.    Levels/.style = {above right, font=\footnotesize, align=left}% <-- added
  94.                     ]
  95.     \begin{axis}[
  96.    ybar,
  97.    xtick=data,
  98.    ymin = 0,
  99.    ymax = 60,
  100.    xmin=0.5,
  101.    xmax=4.5,
  102.    ymajorgrids=true,
  103.    bar width=0.5cm,
  104.    ylabel={Compression Ratio},
  105.    ylabel style={yshift=-3mm},% <-- changed
  106.     xlabel = {Encoding Time Chart},
  107.    xtick align=inside,
  108.    xticklabels={LZW, RAR, ZIP, 7z},
  109.    nodes near coords,
  110.    nodes near coords align={vertical},
  111.    x tick label style={font=\normalsize, rotate=0, anchor=north},
  112.     clip=false% <-- added
  113.     ]
  114.     \addplot coordinates {(1,51.7424) (2,4.71) (3,1.4) (4, 29.4)};
  115. \end{axis}
  116. \end{tikzpicture}
  117.  
  118. \
  119.  
  120. \begin{tikzpicture}[
  121.    Levels/.style = {above right, font=\footnotesize, align=left}% <-- added
  122.                     ]
  123.     \begin{axis}[
  124.    ybar,
  125.    xtick=data,
  126.    ymin = 0,
  127.    ymax = 11,
  128.    xmin=0.5,
  129.    xmax=4.5,
  130.    ymajorgrids=true,
  131.    bar width=0.5cm,
  132.    ylabel={Compression Ratio},
  133.    ylabel style={yshift=-3mm},% <-- changed
  134.     xlabel = {Decoding Time Chart},
  135.    xtick align=inside,
  136.    xticklabels={LZW, RAR, ZIP, 7z},
  137.    nodes near coords,
  138.    nodes near coords align={vertical},
  139.    x tick label style={font=\normalsize, rotate=0, anchor=north},
  140.     clip=false% <-- added
  141.     ]
  142.     \addplot coordinates {(1,9.83725) (2,0.8) (3,0.7) (4, 1.2)};
  143. \end{axis}
  144. \end{tikzpicture}
  145.  
  146. \subsection{Future Work}
  147.  
  148. We can improve the running time of our algorithm by using multi-threaded compression instead of doing all the work on a single thread which puts us behind by a lot, we can also improve the compression ratio by using multiple contexts instead of using only one context.
  149.  
  150. \begin{thebibliography}{00}
  151.  
  152. \bibitem{mattmahoney}
  153. Matt Mahooney (2011, September 1). About the Test Data. Retrieved from http://mattmahoney.net/dc/textdata.html
  154.  
  155. \end{thebibliography}
  156. \end{document}
RAW Paste Data
We use cookies for various purposes including analytics. By continuing to use Pastebin, you agree to our use of cookies as described in the Cookies Policy. OK, I Understand
Top