From van@helios.ee.lbl.gov Mon Apr 30 01:44:05 1990
To: end2end-interest@ISI.EDU
Subject: modified TCP congestion avoidance algorithm
Date: Mon, 30 Apr 90 01:40:59 PDT
From: Van Jacobson <van@helios.ee.lbl.gov>
Status: RO

This is a description of the modified TCP congestion avoidance
algorithm that I promised at the teleconference.

BTW, on re-reading, I noticed there were several errors in
Lixia's note besides the problem I noted at the teleconference.
I don't know whether that's because I mis-communicated the
algorithm at dinner (as I recall, I'd had some wine) or because
she's convinced that TCP is ultimately irrelevant :). Either
way, you will probably be disappointed if you experiment with
what's in that note.

First, I should point out once again that there are two
completely independent window adjustment algorithms running in
the sender: Slow-start is run when the pipe is empty (i.e.,
when first starting or re-starting after a timeout). Its goal
is to get the "ack clock" started so packets will be metered
into the network at a reasonable rate. The other algorithm,
congestion avoidance, is run at any time except when
(re-)starting and is responsible for estimating the
(dynamically varying) pipesize. You will cause yourself, or me,
no end of confusion if you lump these separate algorithms
together (as Lixia's message did).

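In rough C (a sketch only, not the 4.3bsd code; cwnd, ssthresh
and mss stand in for the kernel's snd_cwnd, snd_ssthresh and
t_maxseg), the separation is just a test against ssthresh in the
per-ack window update:

    /*
     * Sketch: one ack arrives and the window opens.  Below
     * ssthresh we are slow-starting (cwnd doubles each round
     * trip); at or above it we are doing congestion avoidance
     * (cwnd grows by about one mss per round trip).
     */
    void
    window_update(unsigned *cwnd, unsigned ssthresh, unsigned mss)
    {
        if (*cwnd < ssthresh)
            *cwnd += mss;                  /* slow-start */
        else
            *cwnd += mss * mss / *cwnd;    /* congestion avoidance */
    }
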
The modifications described here are only to the congestion
avoidance algorithm, not to slow-start, and they are intended to
apply to large bandwidth-delay product paths (though they don't
do any harm on other paths). Remember that with regular TCP (or
with slow-start/c-a TCP), throughput really starts to go to hell
when the probability of packet loss is on the order of the
inverse of the bandwidth-delay product. E.g., you might expect
a 1% packet loss rate to translate into a 1% lower throughput
but for, say, a TCP connection with a 100 packet b-d p.
(= window), it results in a 50-75% throughput loss. To make TCP
effective on fat pipes, it would be nice if throughput degraded
only as a function of loss probability rather than as the
product of the loss probability and the b-d p. (Assuming, of
course, that we can do this without sacrificing congestion
avoidance.)

These mods do two things: (1) prevent the pipe from going empty
after a loss (if the pipe doesn't go empty, you won't have to
waste round-trip times re-filling it) and (2) correctly account
for the amount of data actually in the pipe (since that's what
congestion avoidance is supposed to be estimating and adapting to).

For (1), remember that we use a packet loss as a signal that the
pipe is overfull (congested) and that packet loss can be
detected one of two different ways: (a) via a retransmit
timeout or (b) when some small number (3-4) of consecutive
duplicate acks has been received (the "fast retransmit"
algorithm). In case (a), the pipe is guaranteed to be empty so
we must slow-start. In case (b), if the duplicate ack
threshold is small compared to the bandwidth-delay product, we
will detect the loss with the pipe almost full. I.e., given a
threshold of 3 packets and an LBL-MIT bandwidth-delay of around
24KB or 16 packets (assuming 1500 byte MTUs), the pipe is 75%
full when fast-retransmit detects a loss (actually, until
gateways start doing some sort of congestion control, the pipe
is overfull when the loss is detected so at least 75% of the
packets needed for ack clocking are in transit when
fast-retransmit happens). Since the pipe is full, there's no
need to slow-start after a fast-retransmit.

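As a sketch of the two cases (illustrative names only; `struct
conn' stands in for the tcpcb, and the caller does the actual
retransmit):

    /* Illustrative connection state, standing in for the tcpcb. */
    struct conn { unsigned cwnd, ssthresh, mss, dupacks; };

    /* Case (a): retransmit timeout.  The ack clock is dead and
     * the pipe is empty, so we must slow-start from one packet. */
    void
    timeout_loss(struct conn *c)
    {
        unsigned half = c->cwnd / 2;

        c->ssthresh = half > 2 * c->mss ? half : 2 * c->mss;
        c->cwnd = c->mss;
        c->dupacks = 0;
    }

    /* Case (b): dup-ack counting.  Returns 1 exactly once, when
     * the threshold is crossed and the missing packet should be
     * retransmitted -- the pipe is still nearly full, so no
     * slow-start is needed. */
    int
    dupack_loss(struct conn *c)
    {
        return ++c->dupacks == 3;
    }
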
For (2), consider what a duplicate ack means: either the
network duplicated a packet (i.e., the NSFNet braindead IBM
token ring adapters) or the receiver got an out-of-order packet.
The usual cause of out-of-order packets at the receiver is a
missing packet. I.e., if there are W packets in transit and one
is dropped, the receiver will get W-1 out-of-order and
(4.3-tahoe TCP will) generate W-1 duplicate acks. If the
`consecutive duplicates' threshold is set high enough, we can
reasonably assume that duplicate acks mean dropped packets.

But there's more information in the ack: The receiver can only
generate one in response to a packet arrival. I.e., a duplicate
ack means that a packet has left the network (it is now cached
at the receiver). If the sender is limited by the congestion
window, a packet can now be sent. (The congestion window is a
count of how many packets will fit in the pipe. The ack says a
packet has left the pipe so a new one can be added to take its
place.) To put this another way, say the current congestion
window is C (i.e., C packets will fit in the pipe) and D
duplicate acks have been received. Then only C-D packets are
actually in the pipe and the sender wants to use a window of C+D
packets to fill the pipe to its estimated capacity (C+D sent -
D received = C in pipe).

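With numbers plugged in (C = 16 and D = 4, chosen only for
illustration):

    #include <stdio.h>

    /* The C/D bookkeeping above with example numbers. */
    int
    main(void)
    {
        int C = 16;    /* congestion window: packets that fit in the pipe */
        int D = 4;     /* duplicate acks received so far */

        printf("actually in the pipe: %d\n", C - D);        /* 12 */
        printf("window worth using:   %d\n", C + D);        /* 20 */
        printf("in pipe if we do:     %d\n", (C + D) - D);  /* C = 16 */
        return 0;
    }
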
So, conceptually, the slow-start/cong.avoid/fast-rexmit changes
are:

- The sender's input routine is changed to set cwnd to ssthresh
when the dup ack threshold is reached. [It used to set cwnd to
mss to force a slow-start.] Everything else stays the same.

- The sender's output routine is changed to use an effective window
of min(snd_wnd, cwnd + dupacks*mss) [the change is the addition
of the `dupacks*mss' term]. `dupacks' is zero until the rexmit
threshold is reached and zero except when receiving a sequence
of duplicate acks.

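In sketch form (illustrative names; the attached diff is the
real change, which also avoids the multiply as noted below):

    /* Sketch of the two conceptual changes (not the 4.3-tahoe code). */
    struct sender { unsigned snd_wnd, cwnd, ssthresh, mss, dupacks; };

    /* Input routine: on the dup-ack threshold, pull cwnd back to
     * ssthresh instead of to one mss (i.e., no slow-start). */
    void
    hit_dupack_threshold(struct sender *s)
    {
        s->cwnd = s->ssthresh;
    }

    /* Output routine: the effective window, inflated by one mss
     * per dup ack -- min(snd_wnd, cwnd + dupacks*mss). */
    unsigned
    effective_window(const struct sender *s)
    {
        unsigned w = s->cwnd + s->dupacks * s->mss;

        return w < s->snd_wnd ? w : s->snd_wnd;
    }
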
The actual implementation is slightly different from the above
because I wanted to avoid the multiply in the output routine
(multiplies are expensive on some RISC machines). A diff of the
old and new fastrexmit code is attached (your line numbers will
vary).

Note that we still do congestion avoidance (i.e., the window is
reduced by 50% when we detect the packet loss). But, as long as
the receiver's offered window is large enough (it needs to be at
most twice the bandwidth-delay product), we continue sending
packets (at exactly half the rate we were sending before the
loss) even after the loss is detected so the pipe stays full at
exactly the level we want and a slow-start isn't necessary.

Some algebra might make this last point clear: Say U is the
sequence number of the first un-acked packet and we are using a
window size of W when packet U is dropped. Packets [U..U+W) are
in transit. When the loss is detected, we send packet U and
pull the window back to W/2. But in the round-trip time it
takes the U retransmit to fill the receiver's hole and an ack to
get back, W-1 dup acks will arrive (one for each packet in
transit). The window is effectively inflated by one packet for
each of these acks so packets [U..U+W/2+W-1) are sent. But we
don't re-send packets unless we know they've been lost so the
amount actually sent between the loss detection and the recovery
ack is (U+W/2+W-1) - (U+W) = W/2-1, which is exactly the amount
congestion avoidance allows us to send (if we add in the rexmit
of U). The recovery ack is for packet U+W so when the effective
window is pulled back from W/2+W-1 to W/2 (which happens because
the recovery ack is 'new' and sets dupacks to zero), we are
allowed to send up to packet U+W+W/2 which is exactly the first
packet we haven't yet sent. (I.e., there is no sudden burst of
packets as the `hole' is filled.) Also, when sending packets
between the loss detection and the recovery ack, we do nothing
for the first W/2 dup acks (because they only allow us to send
packets we've already sent) and the bottleneck gateway is given
W/2 packet times to clean out its backlog. Thus when we start
sending our W/2-1 new packets, the bottleneck queue is as empty
as it can be.

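The same walk-through with numbers plugged in (W = 16 and U = 0,
counts in packets, purely illustrative):

    #include <stdio.h>

    /* The recovery arithmetic above with W = 16, U = 0. */
    int
    main(void)
    {
        int W = 16, U = 0;
        int dupacks = W - 1;               /* one per packet in transit */
        int highest = U + W/2 + dupacks;   /* top of the inflated window */

        printf("dup acks during recovery: %d\n", dupacks);          /* 15 */
        printf("packets sent: [%d..%d)\n", U, highest);             /* [0..23) */
        printf("new packets among those: %d\n", highest - (U + W)); /* 7 = W/2-1 */
        return 0;
    }
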
[I don't know if you can get the flavor of what happens from
this description -- it's hard to see without a picture. But I
was delighted by how beautifully it worked -- it was like
watching the innards of an engine when all the separate motions
of crank, pistons and valves suddenly fit together and
everything appears in exactly the right place at just the right
time.]

Also note that this algorithm interoperates with old TCPs: Most
pre-tahoe TCPs don't generate the dup acks on out-of-order packets.
If we don't get the dup acks, fast retransmit never fires and the
window is never inflated so everything happens in the old way (via
timeouts). Everything works just as it did without the new algorithm
(and just as slowly).

If you want to simulate this, the intended environment is:

- large bandwidth-delay product (say 20 or more packets)

- receiver advertised window of twice the b-d p (or, equivalently,
  an advertised window of the unloaded b-d p but two or more
  connections simultaneously sharing the path)

- average loss rate (from congestion or other sources) less than
  one lost packet per round-trip time per active connection.
  (The algorithm works at higher loss rates but the TCP selective
  ack option has to be implemented; otherwise the pipe will go
  empty waiting to fill the second hole and throughput will once
  again degrade as the product of the loss rate and b-d p. With
  selective ack, throughput is insensitive to b-d p at any loss
  rate.)

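Or, as one concrete set of numbers for such a simulation (these
particular values are only an example):

    /* One possible instantiation of the environment above. */
    struct sim_params {
        int    bdp;           /* bandwidth-delay product, in packets */
        int    recv_win;      /* receiver's advertised window, in packets */
        double loss_per_rtt;  /* lost packets per RTT per connection */
    };

    static const struct sim_params example = { 20, 40, 0.5 };
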
And, of course, we should always remember that good engineering
practice suggests a b-d p worth of buffer at each bottleneck --
with less buffer, your simulation will exhibit the interesting
pathologies of a poorly engineered network but will probably
tell you little about the workings of the algorithm (unless the
algorithm misbehaves badly under these conditions, but my
simulations and measurements say that it doesn't). In these
days of $100/megabyte memory, I dearly hope that this particular
example of bad engineering is of historical interest only.

- Van

---

*** /tmp/,RCSt1a26717 Mon Apr 30 01:35:17 1990
--- tcp_input.c Mon Apr 30 01:33:30 1990
***************
*** 834,850 ****
* Kludge snd_nxt & the congestion
* window so we send only this one
! * packet. If this packet fills the
! * only hole in the receiver's seq.
! * space, the next real ack will fully
! * open our window. This means we
! * have to do the usual slow-start to
! * not overwhelm an intermediate gateway
! * with a burst of packets. Leave
! * here with the congestion window set
! * to allow 2 packets on the next real
! * ack and the exp-to-linear thresh
! * set for half the current window
! * size (since we know we're losing at
! * the current window size).
*/
if (tp->t_timer[TCPT_REXMT] == 0 ||
--- 834,850 ----
* Kludge snd_nxt & the congestion
* window so we send only this one
! * packet.
! *
! * We know we're losing at the current
! * window size so do congestion avoidance
! * (set ssthresh to half the current window
! * and pull our congestion window back to
! * the new ssthresh).
! *
! * Dup acks mean that packets have left the
! * network (they're now cached at the receiver)
! * so bump cwnd by the amount in the receiver
! * to keep a constant cwnd packets in the
! * network.
*/
if (tp->t_timer[TCPT_REXMT] == 0 ||
***************
*** 853,864 ****
else if (++tp->t_dupacks == tcprexmtthresh) {
tcp_seq onxt = tp->snd_nxt;
! u_int win =
! MIN(tp->snd_wnd, tp->snd_cwnd) / 2 /
! tp->t_maxseg;

if (win < 2)
win = 2;
tp->snd_ssthresh = win * tp->t_maxseg;
-
tp->t_timer[TCPT_REXMT] = 0;
tp->t_rtt = 0;
--- 853,864 ----
else if (++tp->t_dupacks == tcprexmtthresh) {
tcp_seq onxt = tp->snd_nxt;
! u_int win = MIN(tp->snd_wnd,
! tp->snd_cwnd);

+ win /= tp->t_maxseg;
+ win >>= 1;
if (win < 2)
win = 2;
tp->snd_ssthresh = win * tp->t_maxseg;
tp->t_timer[TCPT_REXMT] = 0;
tp->t_rtt = 0;
***************
*** 866,873 ****
tp->snd_cwnd = tp->t_maxseg;
(void) tcp_output(tp);
!
if (SEQ_GT(onxt, tp->snd_nxt))
tp->snd_nxt = onxt;
goto drop;
}
} else
--- 866,879 ----
tp->snd_cwnd = tp->t_maxseg;
(void) tcp_output(tp);
! tp->snd_cwnd = tp->snd_ssthresh +
! tp->t_maxseg *
! tp->t_dupacks;
if (SEQ_GT(onxt, tp->snd_nxt))
tp->snd_nxt = onxt;
goto drop;
+ } else if (tp->t_dupacks > tcprexmtthresh) {
+ tp->snd_cwnd += tp->t_maxseg;
+ (void) tcp_output(tp);
+ goto drop;
}
} else
***************
*** 874,877 ****
--- 880,890 ----
tp->t_dupacks = 0;
break;
+ }
+ if (tp->t_dupacks) {
+ /*
+ * the congestion window was inflated to account for
+ * the other side's cached packets - retract it.
+ */
+ tp->snd_cwnd = tp->snd_ssthresh;
}
tp->t_dupacks = 0;
*** /tmp/,RCSt1a26725 Mon Apr 30 01:35:23 1990
--- tcp_timer.c Mon Apr 30 00:36:29 1990
***************
*** 223,226 ****
--- 223,227 ----
tp->snd_cwnd = tp->t_maxseg;
tp->snd_ssthresh = win * tp->t_maxseg;
+ tp->t_dupacks = 0;
}
(void) tcp_output(tp);


From van@helios.ee.lbl.gov Mon Apr 30 10:37:36 1990
To: end2end-interest@ISI.EDU
Subject: modified TCP congestion avoidance algorithm (correction)
Date: Mon, 30 Apr 90 10:36:12 PDT
From: Van Jacobson <van@helios.ee.lbl.gov>
Status: RO

I shouldn't make last-minute 'fixes'. The code I sent out last
night had a small error:

*** t.c Mon Apr 30 10:28:52 1990
--- tcp_input.c Mon Apr 30 10:30:41 1990
***************
*** 885,893 ****
* the congestion window was inflated to account for
* the other side's cached packets - retract it.
*/
! tp->snd_cwnd = tp->snd_ssthresh;
}
- tp->t_dupacks = 0;
if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
tcpstat.tcps_rcvacktoomuch++;
goto dropafterack;
--- 885,894 ----
* the congestion window was inflated to account for
* the other side's cached packets - retract it.
*/
! if (tp->snd_cwnd > tp->snd_ssthresh)
! tp->snd_cwnd = tp->snd_ssthresh;
! tp->t_dupacks = 0;
}
if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
tcpstat.tcps_rcvacktoomuch++;
goto dropafterack;