- g belongs to O(f) if any of the following equivalent conditions holds:
- - g is bounded above by a constant multiple of f
- - g grows no faster than (a constant multiple of) f
- - the ratio g/f is bounded above by a constant
- Formally: there exist C > 0 and n0 > 0 such that g(n) <= C * f(n) for all n > n0
- - Limit test: if lim n->infinity g(n)/f(n) = 0, then for any C > 0 the ratio is below C once n is sufficiently large, so g belongs to O(f). If the limit is infinity, there is no C > 0 with g(n) <= C * f(n) for all sufficiently large n, so g does not belong to O(f)
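- - Example: for g(n) = 100n and f(n) = n^2, lim n->infinity 100n/n^2 = lim 100/n = 0, so 100n belongs to O(n^2)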
- L'Hôpital's Rule:
- If lim n->infinity g(n)/f(n) is an indeterminate form (0/0 or infinity/infinity), take the derivative of the top and the bottom and evaluate lim n->infinity g'(n)/f'(n) instead
- Geometric sum of n terms: 1 + r + r^2 + ... + r^(n-1) = (1 - r^n)/(1 - r)
- If |r| < 1, the geometric sum to infinity = 1/(1 - r)
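- - e.g., with r = 1/2: 1 + 1/2 + 1/4 + ... = 1/(1 - 1/2) = 2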
- Sum of first n integers: n(n+1)/2
- nth Harmonic number: H_n = sum (i = 1 to n) 1/i
- H_n = Theta(log n)
- - In probability theory, an event is a subset of the sample space. (So an event is a set of outcomes)
- Expectation:
- - The expected value, or expectation, of a random variable X represents its "average value".
- - Formally: if X is a random variable with a finite number of possible values, E(X) = sum over values x of x * P(X = x)
- E(X + Y) = E(X) + E(Y) (linearity of expectation; holds even when X and Y are dependent)
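- A minimal sketch of the definition in Python; the fair six-sided die is an illustrative example, not from the notes:
- def expectation(dist):
-     # dist maps each value x to P(X = x)
-     return sum(x * p for x, p in dist.items())
- die = {x: 1/6 for x in range(1, 7)}
- print(expectation(die))  # 3.5; by linearity, two dice give E = 3.5 + 3.5 = 7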
- -----------------------------
- Dynamic Arrays:
- - Similar to arrays, but size can be increased or decreased. ArrayList in java, list in Python
- Linked Lists:
- - Inserting or deleting an item in the middle of a linked list is fast (given a reference to the spot). Accessing a cell given its index (i.e., finding the kth item in the list) is slow
- - Hashing is fast: O(1) expected time for access, insertion
- Hashing Disadvantages:
- - Access time (except for cuckoo hashing) and insertion time (for all strategies) is expected time, not worst-case time, so there is no absolute guarantee on performance
- - Performance depends on the load factor: (number of items stored) / (table size)
- - Hashing cannot tell us about nearby keys: it supports exact-match queries but not successor (or range) queries
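- A minimal sketch of a hash table with separate chaining, showing O(1) expected access and the load factor; the class name and fixed table size are illustrative assumptions:
- class ChainedHashTable:
-     def __init__(self, size=8):
-         self.table = [[] for _ in range(size)]   # one chain (list) per slot
-         self.count = 0
-     def load_factor(self):
-         return self.count / len(self.table)      # items stored / table size
-     def insert(self, key, value):
-         chain = self.table[hash(key) % len(self.table)]
-         for i, (k, _) in enumerate(chain):
-             if k == key:
-                 chain[i] = (key, value)          # overwrite existing key
-                 return
-         chain.append((key, value))
-         self.count += 1
-     def find(self, key):                         # O(1) expected time
-         for k, v in self.table[hash(key) % len(self.table)]:
-             if k == key:
-                 return v
-         return None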
- Facts about binary trees:
- - there are at most 2^k nodes at level k
- - a binary tree with depth d has:
- - at most 2^d leaves and at most 2^(d+1) - 1 nodes
- - a binary tree with n leaves has depth >= ceiling(log n)
- Binary search trees:
- - Find, insert, and remove can all be done in O(h) time, where h is the tree height; for AVL trees h = O(log n), so these operations take O(log n) time
- The lower bound on the worst-case number of 3-way comparisons (<, >, ==) required to find an item in an array of size n is floor(log n) + 1
- Hence any algorithm for locating an item in an array of size n using only comparisons must perform at least floor(log n) + 1 comparisons in the worst case
- Hence binary search is optimal with respect to worst-case performance
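- For reference, a standard binary search on a sorted array; each loop iteration performs one 3-way comparison, so at most floor(log n) + 1 are made:
- def binary_search(A, x):
-     lo, hi = 0, len(A) - 1
-     while lo <= hi:
-         mid = (lo + hi) // 2
-         if A[mid] == x:        # one 3-way comparison per iteration
-             return mid
-         elif A[mid] < x:
-             lo = mid + 1       # discard the left half
-         else:
-             hi = mid - 1       # discard the right half
-     return -1                  # not found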
- --------------------------------------------------
- SORTING: measure of time = number of comparisons
- - an inversion in a sequence or list is a pair of items such that the larger one precedes the smaller one
- Insertion Sort:
- - Work from left to right across array
- - insert each item in correct position with respect to (sorted) elements to its left.
- (Like arranging a hand of cards)
- def insertionSort(A):
-     n = len(A)
-     for k in range(1, n):          # A[0..k-1] is already sorted
-         x = A[k]
-         j = k - 1
-         while j >= 0 and A[j] > x: # shift larger items one cell right
-             A[j + 1] = A[j]
-             j = j - 1
-         A[j + 1] = x               # drop x into its correct position
- Analysis of Insertion Sort:
- - Storage: in place: O(1) extra storage
- - Worst-case running time: Theta(n^2)
- - Average-case time: Theta(n^2) (roughly half the worst-case comparisons)
- Insertion sort is efficient if the input is almost sorted ( Time <= n - 1 + (# inversions) )
- Selection Sort:
- 2 variations:
- - Repeatedly (for i from 0 to n-1) find the minimum value, output it, delete it
- - Repeatedly (for i from n-1 down to 1) find the maximum of A[0] ... A[i] and swap this value with A[i]
- - Both variations run in O(n^2) time if we use the straightforward approach to finding the max/min. They can be improved by treating the items as items in an appropriately designed priority queue
- Quick Sort:
- def quicksort(A, first, last):
-     if first < last:                          # ranges of size 0 or 1 are sorted
-         splitpoint = split(A, first, last)
-         quicksort(A, first, splitpoint - 1)   # pivot is already in final position
-         quicksort(A, splitpoint + 1, last)
- def split(A, first, last):
-     splitpoint = first
-     x = A[first]                              # pivot = first element
-     for k in range(first + 1, last + 1):
-         if A[k] < x:
-             splitpoint = splitpoint + 1
-             A[splitpoint], A[k] = A[k], A[splitpoint]
-     A[first], A[splitpoint] = A[splitpoint], A[first]
-     return splitpoint
- Worst-case running time: Theta(n^2) (worst-case number of comparisons = O(n^2))
- Quicksort Key Fact: During the run of Quicksort, two values x and y get compared if and only if the first key in the range [x..y] to be chosen as pivot is either x or y
- Average time for quicksort is O(nlogn)
- Implementation tricks for improving Quicksort:
- 1. Better choice of pivot element:
- - Instead of a single element, choose the median of 3 (or 5, or 7) elements
- - Choose a random element
- - Combine the two, e.g., the median of 3 randomly chosen elements
- 2. Reduce procedure call overhead
- - Explicitly manipulate the stack in the program (rather than making recursive calls)
- - for small lists, use a nonrecursive sort like insertion sort or selection sort
- 3. Reduce stack space
- - Push the larger sublist onto the stack and immediately work on the smaller sublist (see the sketch after this list)
- - Reduces worst-case stack usage from O(n) to O(log n)
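- A sketch of tricks 2 and 3 combined, assuming the split function above: recursion is replaced by an explicit stack, and the larger sublist is pushed while the smaller one is processed immediately:
- def quicksortIterative(A):
-     stack = [(0, len(A) - 1)]          # explicit stack of (first, last) ranges
-     while stack:
-         first, last = stack.pop()
-         while first < last:
-             splitpoint = split(A, first, last)
-             # push the larger sublist; keep working on the smaller one,
-             # so at most O(log n) ranges are ever on the stack
-             if splitpoint - first < last - splitpoint:
-                 stack.append((splitpoint + 1, last))
-                 last = splitpoint - 1
-             else:
-                 stack.append((first, splitpoint - 1))
-                 first = splitpoint + 1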
- Mergesort
- def mergeSort(A, first, last):
-     if first < last:
-         mid = (first + last) // 2
-         mergeSort(A, first, mid)
-         mergeSort(A, mid + 1, last)
-         merge(A, first, mid, mid + 1, last)
- def merge(A, first1, last1, first2, last2):
-     temp = []
-     index1, index2 = first1, first2
-     while index1 <= last1 and index2 <= last2:
-         if A[index1] <= A[index2]:     # <= keeps the sort stable on ties
-             temp.append(A[index1])
-             index1 += 1
-         else:
-             temp.append(A[index2])
-             index2 += 1
-     temp.extend(A[index1:last1 + 1])   # copy whichever run has items left
-     temp.extend(A[index2:last2 + 1])
-     A[first1:last2 + 1] = temp         # copy merged result back into A
- Mergesort recurrence: T(n) = 2T(n/2) + Theta(n); asymptotic solution: T(n) = Theta(n log n)
- To count inversions using mergesort, add an invCount variable to the merge step; each time an element is taken from the second run while the first run is nonempty, do:
- invCount = invCount + last1 - index1 + 1
- (every element still left in the first run precedes it and is larger, so each forms an inversion), and return the count at the end
- - If we additionally want to report the inversions, the extra running time is proportional to the number reported: report k inversions in O(n log n + k) time
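- A sketch of the counting version; for brevity it returns a new sorted list instead of merging in place:
- def sortAndCount(A):
-     if len(A) <= 1:
-         return A, 0
-     mid = len(A) // 2
-     left, invL = sortAndCount(A[:mid])
-     right, invR = sortAndCount(A[mid:])
-     merged, i, j, inv = [], 0, 0, invL + invR
-     while i < len(left) and j < len(right):
-         if left[i] <= right[j]:
-             merged.append(left[i])
-             i += 1
-         else:
-             merged.append(right[j])
-             j += 1
-             inv += len(left) - i   # everything left in `left` is an inversion
-     merged += left[i:] + right[j:]
-     return merged, inv
- e.g., sortAndCount([2, 4, 1, 3]) returns ([1, 2, 3, 4], 3)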
- ---------------------------------------------------------------------------
- - Priority queues are commonly implemented as a binary heap
- Observation: floor(log n) (where n is the number of nodes in a complete binary tree) is the depth of the tree, with the root at depth 0
- Viewing the array as a binary tree:
- - left child of H[i] = H[2i+1]
- - right child of H[i] = H[2i+2]
- - parent of H[i] = H[floor((i-1)/2)]
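- The same index arithmetic in Python (0-indexed array):
- def left(i): return 2 * i + 1
- def right(i): return 2 * i + 2
- def parent(i): return (i - 1) // 2   # floor division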
- Binary Heap:
- - for a SiftUp, at most 1 comparison at each level, so total time is O(log n)
- - for a SiftDown, at most 2 comparisons at each level, so total time is O(log n)
- - Insert operation: O(log n) (because SiftUp time dominates)
- - Deletion: SiftUp/SiftDown time dominates, so total time is O(log n)
- - ExtractMax: Delete time dominates, so total time is O(log n)
- - To construct a heap in O(n) time, put the data in H in arbitrary order and then run the heapify function (a runnable version appears after the heapsort notes below):
- function heapify(H, n):
-     for i = floor(n/2) - 1 down to 0:   // start with the last parent node
-         SiftDown(H, i)
- def heapsort(A, n):
-     heapify(A, n)
-     for k = n-1 down to 1 do:
-         A[k] = ExtractMax(A)   // the max lands at the end; heap shrinks by one
- Total time is O(n log n)
- - If we use a min-heap instead, we can output the elements directly in sorted order, whereas with a max-heap we have to store the extracted elements back into the array
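- A self-contained runnable version of heapify and heapsort (max-heap, in place); SiftDown is written out since the notes take it as given:
- def siftDown(A, i, n):
-     while 2 * i + 1 < n:                  # while node i has a left child
-         child = 2 * i + 1
-         if child + 1 < n and A[child + 1] > A[child]:
-             child += 1                    # pick the larger of the two children
-         if A[i] >= A[child]:
-             break                         # heap property restored
-         A[i], A[child] = A[child], A[i]
-         i = child
- def heapsort(A):
-     n = len(A)
-     for i in range(n // 2 - 1, -1, -1):   # heapify: O(n) total
-         siftDown(A, i, n)
-     for k in range(n - 1, 0, -1):         # move the max to the end, shrink heap
-         A[0], A[k] = A[k], A[0]
-         siftDown(A, 0, k)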
- Comparison-based sorts: Summary
- Insertion sort
- Worst-case time: Theta(n^2)
- Storage requirement: in-place
- Remarks: good if input is almost sorted
- Stable
- Quicksort
- Worst-case time: Theta(n^2)
- Storage requirement: O(log n) extra for stack
- Remarks: O(n log n) expected time
- Not stable
- Mergesort
- Worst-case time: Theta(n log n)
- Storage requirement: O(n) extra for merge
- Stable
- Heapsort
- Worst-case time: Theta(n log n)
- Storage requirement: in-place
- Remarks: can output the k smallest in sorted order in O(n + k log n) time
- Not stable
- Any algorithm for sorting a list of size n using only comparisons must perform at least ceiling(lg(n!)) comparisons in the worst case (decision-tree argument: a binary tree with n! leaves, one per permutation, has depth >= ceiling(lg(n!)))
- ceiling(lg(n!)) = Theta(n log n), since lg(n!) = lg(1) + ... + lg(n) >= (n/2) lg(n/2); hence any comparison-based sort must perform Omega(n log n) comparisons in the worst case
- Hence heapsort and mergesort are asymptotically optimal, and the lower bound is asymptotically tight (cannot be improved asymptotically)
- Bucket Sort: storage is O(n + b), where b is the number of buckets
- Runtime: O(n + b) for distributing and collecting, plus the intra-bucket sorts; a bucket holding s items costs O(s^2) (e.g., with insertion sort)
- Special case: the number of buckets equals the number of keys and the keys are distributed independently and uniformly over the buckets
- Then the expected total cost of the intra-bucket sorts is O(n)
- - Suppose we have n integers in the range 1..b. Then we can use b buckets and get a runtime of O(n + b) worst case
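- A sketch of that integer special case (keys in 1..b, one bucket per key; counts suffice since equal keys are identical):
- def bucketSortInts(A, b):
-     buckets = [0] * (b + 1)        # buckets[k] = occurrences of key k
-     for x in A:
-         buckets[x] += 1            # distribute: O(n)
-     out = []
-     for k in range(1, b + 1):      # collect in key order: O(b)
-         out.extend([k] * buckets[k])
-     return out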
- Radix sort:
- Useful for sorting multi-field keys (e.g., dates) and multi-digit numbers
- algorithm radix_sort(A, n):
-     for field ranging from rightmost (least significant) to leftmost (most significant) do:
-         sort A on field using a stable sort
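- A minimal runnable sketch for base-10 integers; Python's sorted() is stable, which is exactly what the pass-by-pass approach needs:
- def radixSort(A, numDigits):
-     for d in range(numDigits):   # least significant digit first
-         A = sorted(A, key=lambda x: (x // 10 ** d) % 10)   # stable sort on digit d
-     return A
- e.g., radixSort([170, 45, 75, 90, 802, 24, 2, 66], 3) gives [2, 24, 45, 66, 75, 90, 170, 802]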
- External Sorting:
- When sorting a file larger than available memory, with n records in the file and room for m records in memory:
- - Read in groups of m records
- - Sort each group
- - Write each run (sorted group) to a separate output file
- - Choose f run files (or all that are left if there are fewer than f)
- - Merge the contents of the f input files into a new output file; repeat until a single sorted file remains
- Replacement Selection:
- - Makes the initial runs longer (see the sketch after this list)
- - When a key is written and the next key is read in: if the new key is >= the last key written, make it part of the current run; if not, save it for the next run
- On average, replacement selection doubles the size of the runs, assuming a uniform distribution of the sort keys
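- A minimal in-memory sketch of replacement selection with a min-heap of size m; the function name and the list-of-runs output are illustrative (a real external sort would write each run to a file):
- import heapq
- from itertools import islice
- def replacementSelection(records, m):
-     it = iter(records)
-     heap = list(islice(it, m))        # memory holds at most m records
-     heapq.heapify(heap)
-     pending = []                      # keys saved for the next run
-     runs, run = [], []
-     while heap:
-         smallest = heapq.heappop(heap)
-         run.append(smallest)          # "write" the key to the current run
-         nxt = next(it, None)
-         if nxt is not None:
-             if nxt >= smallest:       # still fits in the current run
-                 heapq.heappush(heap, nxt)
-             else:
-                 pending.append(nxt)   # must wait for the next run
-         if not heap:                  # current run finished; start the next
-             runs.append(run)
-             run, heap, pending = [], pending, []
-             heapq.heapify(heap)
-     return runs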
- - A sorting algorithm is stable if, whenever two keys are equal, the algorithm preserves their order (i.e., does not reverse them)