\documentclass[a4paper,11pt]{article}
\title{HW-1 REPORT}
\author{Eral, Mustafa}
\date{March 2015}
\usepackage{amsthm, mathtools}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{placeins}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{corollary}[theorem]{Corollary}
\setlength{\parindent}{24pt}
\begin{document}
\maketitle
\section{Theoretical Estimates}
\subsection{All-to-All Broadcast Algorithm on a Ring}
Each node first sends the data it needs to broadcast to one of its neighbors.
In subsequent steps, it forwards the data received from one neighbor to its
other neighbor. Each node receives all $(p - 1)$ pieces of information from
the other nodes in $(p - 1)$ steps. The pseudocode is as follows:
\textit{procedure ALL-TO-ALL-BC-RING(my-id, my-msg, p, result)\\
\indent begin\\
\indent \indent left := (my-id - 1) mod p;\\
\indent \indent right := (my-id + 1) mod p;\\
\indent \indent result := my-msg;\\
\indent \indent msg := result;\\
\indent \indent for i := 1 to p - 1 do\\
\indent \indent \indent send msg to right;\\
\indent \indent \indent receive msg from left;\\
\indent \indent \indent result := result $\cup$ msg;\\
\indent \indent end for;\\
\indent end ALL-TO-ALL-BC-RING}
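
For concreteness, a minimal C/MPI sketch of this ring exchange pattern is
given below. It is only an illustration, not the submitted code: each process
contributes a single \texttt{int}, the function and buffer names are
assumptions made for the example, and \texttt{MPI\_Sendrecv} is used so that
the send to the right and the receive from the left cannot deadlock.

\begin{verbatim}
/* Illustrative sketch only: ring all-to-all broadcast of one int per
 * process. Names (ring_allgather, send_buf, ...) are not from the report. */
#include <mpi.h>

void ring_allgather(int my_msg, int *result, int p, int my_id, MPI_Comm comm)
{
    int left  = (my_id - 1 + p) % p;   /* left neighbour on the ring  */
    int right = (my_id + 1) % p;       /* right neighbour on the ring */
    int send_buf = my_msg, recv_buf;

    result[my_id] = my_msg;            /* own piece of the result */

    for (int i = 1; i <= p - 1; i++) {
        /* send the previously received piece to the right while
           receiving a new piece from the left (no deadlock) */
        MPI_Sendrecv(&send_buf, 1, MPI_INT, right, 0,
                     &recv_buf, 1, MPI_INT, left,  0,
                     comm, MPI_STATUS_IGNORE);
        /* the piece received at step i originated i hops to the left */
        result[(my_id - i + p) % p] = recv_buf;
        send_buf = recv_buf;
    }
}
\end{verbatim}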

\subsubsection{Wall Time Estimation}
If we assume that each iteration of the loop in the algorithm above takes a
time of $t_w$ (the time to receive a message from the left neighbor and to
send the previously received message to the right neighbor), then with $p$
processors on the ring the algorithm takes $(p-1)t_w$ time.
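
For reference, under the same cost model used for the hypercube estimate
below (a start-up time $t_s$ per message, a per-byte transfer time $t_w$, and
message size $m$), each of the $(p-1)$ steps would cost $t_s + t_w m$, giving
$$T_{\text{ring}} = (p-1)\left(t_s + t_w m\right).$$
The simpler $(p-1)t_w$ estimate above corresponds to absorbing the start-up
time and the message size into a single per-step cost.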

\subsection{All-to-All Broadcast Algorithm on a Hypercube}
The algorithm takes $\log(p)$ steps. In each iteration, nodes communicate in
pairs such that the labels of the nodes communicating with each other in the
$i$th iteration differ in the $i$th least significant bit of their binary
representations. At each step, the size of the message to be transmitted
doubles. The pseudocode is as follows:

\textit{procedure ALL-TO-ALL-BC-HCUBE(my-id, my-msg, d, result)\\
\indent begin\\
\indent \indent result := my-msg;\\
\indent \indent for i := 0 to d - 1 do\\
\indent \indent \indent partner := my-id XOR $2^i$;\\
\indent \indent \indent send result to partner;\\
\indent \indent \indent receive msg from partner;\\
\indent \indent \indent result := result $\cup$ msg;\\
\indent \indent end for;\\
\indent end ALL-TO-ALL-BC-HCUBE}
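
Again for illustration only, a minimal C/MPI sketch of this exchange pattern
is shown below, assuming $p = 2^d$ processes and one \texttt{int} per
process. The function and buffer names are made up for the example, and the
received block is simply appended after the locally accumulated block, so the
result ends up in exchange order rather than rank order.

\begin{verbatim}
/* Illustrative sketch only: hypercube all-to-all broadcast, p = 2^d
 * processes, one int per process. Names are not from the report.      */
#include <mpi.h>

void hypercube_allgather(int my_msg, int *result, int d, int my_id,
                         MPI_Comm comm)
{
    int count = 1;               /* ints accumulated so far           */
    result[0] = my_msg;          /* start with the local piece        */

    for (int i = 0; i < d; i++) {
        int partner = my_id ^ (1 << i);   /* flip the i-th bit of my rank */
        /* exchange the whole accumulated block with the partner;
           the message size doubles in every iteration                 */
        MPI_Sendrecv(result,         count, MPI_INT, partner, 0,
                     result + count, count, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        count *= 2;
    }
}
\end{verbatim}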

\subsubsection{Wall Time Estimation}
The algorithm needs $\log(p)$ steps to complete. Assuming the size of each
individual message is $m$ bytes (it is \texttt{sizeof(int)} in the code), the
time to send a message at the $i$th step is $t_s + 2^{i-1} t_w m$, where
$t_s$ is the start-up time spent on each send and $t_w$ is the per-byte
transfer time. Since $\sum_{i=1}^{\log(p)} 2^{i-1} = p - 1$, the total time
for the all-to-all broadcast is
$$\sum_{i=1}^{\log(p)} \left(t_s + 2^{i-1} t_w m\right) = t_s \log(p) + t_w m (p-1).$$
Since the message size $m$ is very small in this experiment, we can ignore
the term $t_w m (p-1)$. Thus we roughly expect the execution time to be
$t_s \log(p)$, which is proportional to $\log(p)$.


\section{Execution Results}

Note that when obtaining wall times, each combination was run 100 times and
the average of these wall times was taken. Execution times are given in the
tables, and the resulting plots are given in the figures. These results show
that when the number of nodes is doubled, the execution time increases much
more rapidly. For that reason, the execution time for the two-node case is
not shown in the figures below, so that the other cases are displayed more
clearly.
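
As an illustration of how such a measurement can be set up (the actual
benchmark code is not reproduced here), the sketch below times the built-in
call with \texttt{MPI\_Wtime}, repeating it 100 times and reporting the
average. The repetition count matches the description above; the variable
names and the one-\texttt{int}-per-process payload are assumptions.

\begin{verbatim}
/* Illustrative timing sketch: average wall time of MPI_Allgather over
 * 100 repetitions; each process contributes one int.                 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int p, my_id;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    int my_msg = my_id;
    int *result = malloc(p * sizeof(int));

    double start = MPI_Wtime();
    for (int run = 0; run < 100; run++)
        MPI_Allgather(&my_msg, 1, MPI_INT, result, 1, MPI_INT,
                      MPI_COMM_WORLD);
    double avg_ms = (MPI_Wtime() - start) / 100.0 * 1000.0;

    if (my_id == 0)
        printf("average wall time: %f ms\n", avg_ms);

    free(result);
    MPI_Finalize();
    return 0;
}
\end{verbatim}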

\subsection{Using MPI-AllGather}

\begin{table}[h]
\begin{center}
\begin{tabular}{|c|c|}
\hline
Number of Processors & Execution Time (ms) \\
\hline
1 node, 2 proc & 0.002 \\
1 node, 4 proc & 0.004 \\
1 node, 8 proc & 0.008 \\
2 nodes, 16 proc & 0.012 \\
\hline
\end{tabular}
\end{center}
\caption{Execution times when the built-in MPI-AllGather function is used}
\end{table}

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{buildin.eps}
\caption{Resulting wall-time vs.\ $p$ plot when the built-in MPI-AllGather function is used}
\end{figure}

\FloatBarrier

\subsection{Using Ring Topology}

\begin{table}[h]
\begin{center}
\begin{tabular}{|c|c|}
\hline
Number of Processors & Execution Time (ms) \\
\hline
1 node, 2 proc & 0.001 \\
1 node, 4 proc & 0.005 \\
1 node, 8 proc & 0.007 \\
2 nodes, 16 proc & 270.570 \\
\hline
\end{tabular}
\end{center}
\caption{Execution times when the Ring architecture is used}
\end{table}

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{ring.eps}
\caption{Resulting wall-time vs.\ $p$ plot when the Ring architecture is used}
\end{figure}

\FloatBarrier

\subsection{Using Hypercube Topology}

\begin{table}[h]
\begin{center}
\begin{tabular}{|c|c|}
\hline
Number of Processors & Execution Time (ms) \\
\hline
1 node, 2 proc & 0.001 \\
1 node, 4 proc & 0.003 \\
1 node, 8 proc & 0.005 \\
2 nodes, 16 proc & 155.763 \\
\hline
\end{tabular}
\end{center}
\caption{Execution times when the Hypercube architecture is used}
\end{table}

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{cube.eps}
\caption{Resulting wall-time vs.\ $p$ plot when the Hypercube architecture is used}
\end{figure}

\FloatBarrier

\section{Conclusion}

Although the ring and hypercube implementations are successful in the sense
that they give correct results, the built-in MPI function performs better,
especially when the processes are on different nodes. This is because
communication between different nodes is more costly, since it requires an
external link between the nodes, whereas within the same node processes can
use shared memory to transfer data, which is much faster. The built-in
function remains fast even when the number of nodes is doubled, probably
because its implementation does not use MPI-Send and MPI-Receive but instead
uses lower-level functions that make better use of the underlying hardware.

Note also that in the ring implementation the wall time increases more
rapidly than in the hypercube implementation as $p$ increases, because the
ring time is linearly dependent on $p$ (except at the point where the number
of nodes increases), whereas the hypercube time depends on $\log(p)$.


\end{document}