Untitled

As it stands the code is pretty tame. Our issues have been trying to incorporate pyparsing, so the raw python code we do have is pretty straight forward.

3.1
This Python script takes in a file as an argument and spits out a file with the correct formatting. It starts by creating a hash table of all the words, creating a dictionary of the data with the keys being the words from the file and the values being the number of times the word is found.

After being hashed, the data is converted from a hash table (or dictionary in python) to a list with the format: [[key1,value1],[key2,value2]….[keyn,valuen]], because dictionaries are unsortable. Now that it’s in a list, it can be sorted. We use the bubblesort algorithm because we are lazy. If we were to try to optimize this for performance, we would have used something like a merge sort, but because the input is generally very small we felt that using a more substantial sorting algorithm wasn’t necessary.
Once it is sorted, it is formatted into the string called ‘data’ and written to the appropriate file.

3.2
3.2 takes in the frequencies.txt file from 3.1, and converts it into a data structure that we developed for containing the Huffman data. Here is test1.txt in this structure as an example:
[['!@12', ['!@6', ['bye3'], ['hello3']], ['hi6']]]
It works like this –
So every layer of these lested lists is a different set of a parent and its children, or a single child by itself. Each parent node (nodes with no words attached) are indicated by the ‘!@’ preceding it’s value. The next two elements in that layer are children to the first element in the layer. If only one element is present, it is a leaf node and there are no more nested lists to traverse.
In the example, the first layer is ['!@12', ['!@6', ['bye3'], ['hello3']], ['hi6']]. The parent is !@12, with its children: ['!@6', ['bye3'], ['hello3']] and ['hi6']. The first child has more than 1 element, so we repeat the same structure as the previous later, with !@6 being the parent, ['bye3'] and [‘hi6’] being the children.
As the algorithm traverses through the layers, it looks at the children and identifies them as parents or not based on the number of elements in them. It adds any children with children of their own in the “todo” queue. Once it has finished its current layer, it selects the next node from the oldest element in the queue. If there are no more elements in the queue, then the graph is done as there are no parent nodes without children on the graph.
Our implementation of 3.2 actively formats the data into the ‘data’ variable, so when the algorithm has completed, the data is in the correct format for a .gv file.