\documentclass[a4paper,11pt]{article}
\title{HW-1 REPORT}
\author{Eral, Mustafa}
\date{March 2015}
\usepackage{amsthm, mathtools}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{placeins}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{corollary}[theorem]{Corollary}
\setlength{\parindent}{24pt}
\begin{document}
\maketitle
\section{Theoretical Estimates}
\subsection{All-to-All Broadcast Algorithm on a Ring}
Each node first sends the data it needs to broadcast to one of its neighbors.
In subsequent steps, it forwards the data received from one neighbor to its
other neighbor. Each node receives all $(p - 1)$ pieces of information from
the other nodes in $(p - 1)$ steps. The pseudocode is as follows:
\textit{procedure ALL-TO-ALL-BC-RING(my-id, my-msg, p, result)\\
\indent begin\\
\indent \indent left := (my-id - 1) mod p;\\
\indent \indent right := (my-id + 1) mod p;\\
\indent \indent result := my-msg;\\
\indent \indent msg := result;\\
\indent \indent for i := 1 to p - 1 do\\
\indent \indent \indent send msg to right;\\
\indent \indent \indent receive msg from left;\\
\indent \indent \indent result := result $\cup$ msg;\\
\indent \indent end for;\\
\indent end ALL-TO-ALL-BC-RING}
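
For concreteness, a minimal C/MPI sketch of this ring exchange pattern is
given below. It is only an illustration, not the submitted code: each process
contributes a single \texttt{int}, the function and buffer names are
assumptions made for the example, and \texttt{MPI\_Sendrecv} is used so that
the send to the right and the receive from the left cannot deadlock.

\begin{verbatim}
/* Illustrative sketch only: ring all-to-all broadcast of one int per
 * process. Names (ring_allgather, send_buf, ...) are not from the report. */
#include <mpi.h>

void ring_allgather(int my_msg, int *result, int p, int my_id, MPI_Comm comm)
{
    int left  = (my_id - 1 + p) % p;   /* left neighbour on the ring  */
    int right = (my_id + 1) % p;       /* right neighbour on the ring */
    int send_buf = my_msg, recv_buf;

    result[my_id] = my_msg;            /* own piece of the result */

    for (int i = 1; i <= p - 1; i++) {
        /* send the previously received piece to the right while
           receiving a new piece from the left (no deadlock) */
        MPI_Sendrecv(&send_buf, 1, MPI_INT, right, 0,
                     &recv_buf, 1, MPI_INT, left,  0,
                     comm, MPI_STATUS_IGNORE);
        /* the piece received at step i originated i hops to the left */
        result[(my_id - i + p) % p] = recv_buf;
        send_buf = recv_buf;
    }
}
\end{verbatim}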

\subsubsection{Wall Time Estimation}
If we assume that each iteration of the loop in the algorithm above takes a
time of $t_w$ (the time to receive a message from the left neighbor and to
send the previously received message to the right neighbor), then with $p$
processors on the ring the algorithm takes $(p-1)t_w$ time.
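
For reference, under the same cost model used for the hypercube estimate
below (a start-up time $t_s$ per message, a per-byte transfer time $t_w$, and
message size $m$), each of the $(p-1)$ steps would cost $t_s + t_w m$, giving
$$T_{\text{ring}} = (p-1)\left(t_s + t_w m\right).$$
The simpler $(p-1)t_w$ estimate above corresponds to absorbing the start-up
time and the message size into a single per-step cost.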

\subsection{All-to-All Broadcast Algorithm on a Hypercube}
The algorithm takes $\log(p)$ steps. In each iteration, nodes communicate in
pairs such that the labels of the nodes communicating with each other in the
$i$th iteration differ in the $i$th least significant bit of their binary
representations. At each step, the size of the message to be transmitted
doubles. The pseudocode is as follows:

\textit{procedure ALL-TO-ALL-BC-HCUBE(my-id, my-msg, d, result)\\
\indent begin\\
\indent \indent result := my-msg;\\
\indent \indent for i := 0 to d - 1 do\\
\indent \indent \indent partner := my-id XOR $2^i$;\\
\indent \indent \indent send result to partner;\\
\indent \indent \indent receive msg from partner;\\
\indent \indent \indent result := result $\cup$ msg;\\
\indent \indent end for;\\
\indent end ALL-TO-ALL-BC-HCUBE}
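
Again for illustration only, a minimal C/MPI sketch of this exchange pattern
is shown below, assuming $p = 2^d$ processes and one \texttt{int} per
process. The function and buffer names are made up for the example, and the
received block is simply appended after the locally accumulated block, so the
result ends up in exchange order rather than rank order.

\begin{verbatim}
/* Illustrative sketch only: hypercube all-to-all broadcast, p = 2^d
 * processes, one int per process. Names are not from the report.      */
#include <mpi.h>

void hypercube_allgather(int my_msg, int *result, int d, int my_id,
                         MPI_Comm comm)
{
    int count = 1;               /* ints accumulated so far           */
    result[0] = my_msg;          /* start with the local piece        */

    for (int i = 0; i < d; i++) {
        int partner = my_id ^ (1 << i);   /* flip the i-th bit of my rank */
        /* exchange the whole accumulated block with the partner;
           the message size doubles in every iteration                 */
        MPI_Sendrecv(result,         count, MPI_INT, partner, 0,
                     result + count, count, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        count *= 2;
    }
}
\end{verbatim}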

\subsubsection{Wall Time Estimation}
The algorithm needs $\log(p)$ steps to complete. Assuming the size of each
individual message is $m$ bytes (it is \texttt{sizeof(int)} in the code), the
time to send a message at the $i$th step is $t_s + 2^{i-1} t_w m$, where
$t_s$ is the start-up time spent on each send and $t_w$ is the per-byte
transfer time. Since $\sum_{i=1}^{\log(p)} 2^{i-1} = p - 1$, the total time
for the all-to-all broadcast is
$$\sum_{i=1}^{\log(p)} \left(t_s + 2^{i-1} t_w m\right) = t_s \log(p) + t_w m (p-1).$$
Since the message size $m$ is very small in this experiment, we can ignore
the term $t_w m (p-1)$. Thus we roughly expect the execution time to be
$t_s \log(p)$, which is proportional to $\log(p)$.


\section{Execution Results}

Note that when obtaining wall times, each combination was run 100 times and
the average of these wall times was taken. Execution times are given in the
tables, and the resulting plots are given in the figures. These results show
that when the number of nodes is doubled, the execution time increases much
more rapidly. For that reason, the execution time for the two-node case is
not shown in the figures below, so that the other cases are displayed more
clearly.
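
As an illustration of how such a measurement can be set up (the actual
benchmark code is not reproduced here), the sketch below times the built-in
call with \texttt{MPI\_Wtime}, repeating it 100 times and reporting the
average. The repetition count matches the description above; the variable
names and the one-\texttt{int}-per-process payload are assumptions.

\begin{verbatim}
/* Illustrative timing sketch: average wall time of MPI_Allgather over
 * 100 repetitions; each process contributes one int.                 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int p, my_id;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    int my_msg = my_id;
    int *result = malloc(p * sizeof(int));

    double start = MPI_Wtime();
    for (int run = 0; run < 100; run++)
        MPI_Allgather(&my_msg, 1, MPI_INT, result, 1, MPI_INT,
                      MPI_COMM_WORLD);
    double avg_ms = (MPI_Wtime() - start) / 100.0 * 1000.0;

    if (my_id == 0)
        printf("average wall time: %f ms\n", avg_ms);

    free(result);
    MPI_Finalize();
    return 0;
}
\end{verbatim}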

\subsection{Using MPI-AllGather}

\begin{table}[h]
\begin{center}
\begin{tabular}{|c|c|}
\hline
Number of Processors & Execution Time (ms) \\
\hline
1 node, 2 proc & 0.002 \\
1 node, 4 proc & 0.004 \\
1 node, 8 proc & 0.008 \\
2 nodes, 16 proc & 0.012 \\
\hline
\end{tabular}
\end{center}
\caption{Execution times when the built-in MPI-AllGather function is used}
\end{table}

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{buildin.eps}
\caption{Resulting wall-time vs.\ $p$ plot when the built-in MPI-AllGather function is used}
\end{figure}

\FloatBarrier

\subsection{Using Ring Topology}

\begin{table}[h]
\begin{center}
\begin{tabular}{|c|c|}
\hline
Number of Processors & Execution Time (ms) \\
\hline
1 node, 2 proc & 0.001 \\
1 node, 4 proc & 0.005 \\
1 node, 8 proc & 0.007 \\
2 nodes, 16 proc & 270.570 \\
\hline
\end{tabular}
\end{center}
\caption{Execution times when the Ring architecture is used}
\end{table}

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{ring.eps}
\caption{Resulting wall-time vs.\ $p$ plot when the Ring architecture is used}
\end{figure}

\FloatBarrier

\subsection{Using Hypercube Topology}

\begin{table}[h]
\begin{center}
\begin{tabular}{|c|c|}
\hline
Number of Processors & Execution Time (ms) \\
\hline
1 node, 2 proc & 0.001 \\
1 node, 4 proc & 0.003 \\
1 node, 8 proc & 0.005 \\
2 nodes, 16 proc & 155.763 \\
\hline
\end{tabular}
\end{center}
\caption{Execution times when the Hypercube architecture is used}
\end{table}

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{cube.eps}
\caption{Resulting wall-time vs.\ $p$ plot when the Hypercube architecture is used}
\end{figure}

\FloatBarrier

\section{Conclusion}

Although the ring and hypercube implementations are successful in the sense
that they give correct results, the built-in MPI function performs better,
especially when the processes are on different nodes. This is because
communication between different nodes is more costly, since it requires an
external link between the nodes, whereas within the same node processes can
use shared memory to transfer data, which is much faster. The built-in
function remains fast even when the number of nodes is doubled, probably
because its implementation does not use MPI-Send and MPI-Receive but instead
uses lower-level functions that make better use of the underlying hardware.

Note also that in the ring implementation the wall time increases more
rapidly than in the hypercube implementation as $p$ increases, because the
ring time is linearly dependent on $p$ (except at the point where the number
of nodes increases), whereas the hypercube time depends on $\log(p)$.


\end{document}