TCP connections hang in SYN_SENT

Asked by: pfenerty

Hello,

I have an intermittent TCP connection problem that affects multiple PCs on a home LAN:

Background:
- all TCP connections from the LAN are masqueraded through an IPCop Linux firewall/gateway
- IPCop is a stock 1.4.6 install ... no addons or modifications
- packets leaving the gateway are routed onto the WAN by way of a Motorola SB3100 Cable Modem
- normal firewall/gateway/LAN behavior was observed for several months before the problem was first noticed

Problem:
- all outbound TCP connections intermittently do not complete for a given PC
- all such connection attempts during the problem period remain in state SYN_SENT in the gateway connection table
- ping and traceroute to WAN destinations work OK as always during the problem period
- problem occurs typically in firefox 1.5.0.1 on fully patched WindowsXP
- problem also reproduced in netscape 4.7 on Linux 2.2.16 (!)
- typically only one PC is affected at a time
- i.e., other PCs on the LAN routinely establish TCP connections OK while the affected one cannot
- frequency of occurrence is typically several times per week, but typically not more than once per day

Problem Resolution:
- TCP connections, for the affected PC, begin routinely completing again after any of:
  (a) Rebooting the affected PC ... never the gateway
  (b) waiting sufficiently ... tens of minutes to hours
  (c) connection flooding ... multiple rapid repeat browser page reload requests from the affected PC ... 20 to 30 typically

Troubleshooting, so far:
- winXP PCs run Norton AV, Spybot S&D, AD-Aware SE
- IPCop firewall runs rkhunter
- (unreplied) outbound SYN packets from the affected PCs appear on the firewall WAN interface (!)
- these SYN packets appear to be well-formed, at least as far as I can tell, and seem to match subsequent, successfully SYN_ACKed packets

Discussion:

Initially I thought this would be a Microsoft problem. But then I saw it occur on an old Linux box.
So, after packet-sniffing the gateway LAN interface during the problem, I saw the affected PC first complete a (UDP) DNS transaction, followed by groups of three unreplied TCP SYN request packets, one group for each time the connection was tried. I thought that for sure I'd find dropped packets at the firewall. But after inserting log messages up and down the gateway's netfilter chains, none of which caught anything, I eventually moved the sniffer to the gateway's WAN interface, and found there the same three lonely unreplied TCP SYN request packets that had been visible on the LAN side:

WAN interface packet capture:
- packets are sniffed against filter 'host 66.249.81.99' ... google_news server
- google_news server IPaddress determined just prior to packet capture using a non-affected PC
- STEP 1: unreplied SYN packets captured by google_news browser page request from affected PC
- STEP 2: affected PC is rebooted
- STEP 3: completed TCP connection packets captured as per STEP 1
- no changes made to the gateway, or to the laptop running ethereal, other than to start and stop packet capture, during above STEPs

So, where do the SYN packets go? Why are they ignored, intermittently? Is another subscriber on my cable feeder line hijacking them?

Perhaps more to the point, at least initially: if the packets are properly SNATed at the firewall, as they appear to be, how can the problem appear, at the WAN interface, to be localized to a single host on the LAN? And for different PCs, at different times??

Thanks so much!
Paul

****************************************************************************************************

From the above-mentioned packet capture session, here's the first outbound SYN packet that remains UNREPLIED, such that the connection hangs in state SYN_SENT:

** Note that it is the 7th captured packet for the session.
The captured browser page request was preceded by a single google_news 'ping' from the gateway, to verify last-minute reachability (3 ICMP ping requests, 3 replies).

No.     Time        Source           Destination    Protocol  Info
7       32.246799   72.134.170.173   66.249.81.99   TCP       3186 > http [SYN] Seq=0 Ack=0 Win=65535 Len=0 MSS=1460

Frame 7 (62 bytes on wire, 62 bytes captured)
    Arrival Time: Feb 3, 2006 12:57:16.931563000
    Time delta from previous packet: 30.040320000 seconds
    Time since reference or first frame: 32.246799000 seconds
    Frame Number: 7
    Packet Length: 62 bytes
    Capture Length: 62 bytes
    Protocols in frame: eth:ip:tcp
Ethernet II, Src: Shuttle_3a:ca:6f (00:30:1b:3a:ca:6f), Dst: USRoboti_40:54:70 (00:c0:49:40:54:70)
    Destination: USRoboti_40:54:70 (00:c0:49:40:54:70)
    Source: Shuttle_3a:ca:6f (00:30:1b:3a:ca:6f)
    Type: IP (0x0800)
Internet Protocol, Src: 72.134.170.173 (72.134.170.173), Dst: 66.249.81.99 (66.249.81.99)
    Version: 4
    Header length: 20 bytes
    Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00)
        0000 00.. = Differentiated Services Codepoint: Default (0x00)
        .... ..0. = ECN-Capable Transport (ECT): 0
        .... ...0 = ECN-CE: 0
    Total Length: 48
    Identification: 0xa07e (41086)
    Flags: 0x04 (Don't Fragment)
        0... = Reserved bit: Not set
        .1.. = Don't fragment: Set
        ..0. = More fragments: Not set
    Fragment offset: 0
    Time to live: 127
    Protocol: TCP (0x06)
    Header checksum: 0xd39b [correct]
        Good: True
        Bad : False
    Source: 72.134.170.173 (72.134.170.173)
    Destination: 66.249.81.99 (66.249.81.99)
Transmission Control Protocol, Src Port: 3186 (3186), Dst Port: http (80), Seq: 0, Ack: 0, Len: 0
    Source port: 3186 (3186)
    Destination port: http (80)
    Sequence number: 0    (relative sequence number)
    Header length: 28 bytes
    Flags: 0x0002 (SYN)
        0... .... = Congestion Window Reduced (CWR): Not set
        .0.. .... = ECN-Echo: Not set
        ..0. .... = Urgent: Not set
        ...0 .... = Acknowledgment: Not set
        .... 0... = Push: Not set
        .... .0.. = Reset: Not set
        .... ..1. = Syn: Set
        .... ...0 = Fin: Not set
    Window size: 65535
    Checksum: 0x6f71 [correct]
    Options: (8 bytes)
        Maximum segment size: 1460 bytes
        NOP
        NOP
        SACK permitted

****************************************************************************************************

From the above-mentioned packet capture session, here's the first outbound SYN packet that is successfully SYN_ACKed, after reboot of the affected PC, such that a connection attempt completes:

No.     Time        Source           Destination    Protocol  Info
1       0.000000    72.134.170.173   66.249.81.99   TCP       3230 > http [SYN] Seq=0 Ack=0 Win=65535 Len=0 MSS=1460

Frame 1 (62 bytes on wire, 62 bytes captured)
    Arrival Time: Feb 3, 2006 13:14:38.485541000
    Time delta from previous packet: 0.000000000 seconds
    Time since reference or first frame: 0.000000000 seconds
    Frame Number: 1
    Packet Length: 62 bytes
    Capture Length: 62 bytes
    Protocols in frame: eth:ip:tcp
Ethernet II, Src: Shuttle_3a:ca:6f (00:30:1b:3a:ca:6f), Dst: USRoboti_40:54:70 (00:c0:49:40:54:70)
    Destination: USRoboti_40:54:70 (00:c0:49:40:54:70)
    Source: Shuttle_3a:ca:6f (00:30:1b:3a:ca:6f)
    Type: IP (0x0800)
Internet Protocol, Src: 72.134.170.173 (72.134.170.173), Dst: 66.249.81.99 (66.249.81.99)
    Version: 4
    Header length: 20 bytes
    Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00)
        0000 00.. = Differentiated Services Codepoint: Default (0x00)
        .... ..0. = ECN-Capable Transport (ECT): 0
        .... ...0 = ECN-CE: 0
    Total Length: 48
    Identification: 0xa8cd (43213)
    Flags: 0x04 (Don't Fragment)
        0... = Reserved bit: Not set
        .1.. = Don't fragment: Set
        ..0. = More fragments: Not set
    Fragment offset: 0
    Time to live: 127
    Protocol: TCP (0x06)
    Header checksum: 0xcb4c [correct]
        Good: True
        Bad : False
    Source: 72.134.170.173 (72.134.170.173)
    Destination: 66.249.81.99 (66.249.81.99)
Transmission Control Protocol, Src Port: 3230 (3230), Dst Port: http (80), Seq: 0, Ack: 0, Len: 0
    Source port: 3230 (3230)
    Destination port: http (80)
    Sequence number: 0    (relative sequence number)
    Header length: 28 bytes
    Flags: 0x0002 (SYN)
        0... .... = Congestion Window Reduced (CWR): Not set
        .0.. .... = ECN-Echo: Not set
        ..0. .... = Urgent: Not set
        ...0 .... = Acknowledgment: Not set
        .... 0... = Push: Not set
        .... .0.. = Reset: Not set
        .... ..1. = Syn: Set
        .... ...0 = Fin: Not set
    Window size: 65535
    Checksum: 0x7873 [correct]
    Options: (8 bytes)
        Maximum segment size: 1460 bytes
        NOP
        NOP
        SACK permitted

*****************************************

I found no Topic Area labelled 'TCP/IP', which would have been my first choice. So I have chosen "Linux Networking" (for the firewall gateway referenced below). Perhaps 'Broadband' would be a better choice? I suppose that depends on where the problem turns out to be.

I rate this 500 points not so much for urgency, as I have lived with this for a month or so by now. But I rate it pretty-damn-difficult, because I thought for sure that I'd have it figured out long ago.

Answers

by: dbardbar Posted on 2006-02-13 at 03:36:22 ID: 15940339

I have a suggestion, which might explain the situation. Could it be that you have a machine in your network with the same IP as your GW? It might also be a network printer, a switch, or anything else with an IP.

This could explain why sometimes a machine can send SYN packets, but is not getting anything back. The PC might be sending the traffic to a wrong MAC address, instead of to the MAC address of the GW. Waiting sufficiently or rebooting the machine will clear its ARP cache, and the next time it sends an ARP request it will get the reply from the correct machine (your GW).

Hmmm...
Actually, reading again through what you wrote (quite a lot... :-), you are saying that you saw the SYN packets on the external interface of your GW. That would seem to rule that out.

Still, what you are describing does sound similar to an identical IP problem. Try to have a look at the MAC addresses of your PCs and GWs, and look at the arp caches of all the relevant machines before and after the problem occurs.

by: pfenerty Posted on 2006-02-13 at 12:15:12 ID: 15944605

Hi, and thanks for your suggestion.

I've checked around, and all the IPaddress assignments look right, with no duplicates. Arp caches all seem OK, with MAC addresses matching assigned IPaddresses, but then there's no sign of the problem right now. I will check this on an affected machine and report back at the next episode.

For what it's worth, your suspicion reminded me of the somewhat misguided exercise I tried upon first bringing the gateway online. It was to be an upgrade for my original gateway, a 486 linux 2.2 ipchains firewall. When I prepared to switch over to the new gateway, my cable modem Configuration Manager claimed only a "Max 1" for "Known CPE MAC Address", and had long ago learned the old gateway MAC address. Several years earlier, when I first signed up for the cable service, Max allowable MACs had been 3.

So to make sure that the new gateway wouldn't get locked out by the cable modem, I decided to clone the old gateway MAC address, as per

    ifconfig eth1 down hw ether old_gateway_MAC_address
    ifconfig eth1 up

But then to assure a clean DHCP assignment for the new gateway, I brought both the gateway & modem down, and then back up again. Of course at that point the gateway came back up with its actual MAC address, not the cloned one, and the modem simply learned it and never complained.

Afterwards, I assumed that the cloning was 100% non-persistent, and never thought about it again, until now. The old gateway MAC address isn't in there somewhere, is it?
Otherwise, the only not-quite-right configuration I can see is that a few of the machines that are not used to go out to the WAN still have the old gateway IPaddress as the default route. One of those machines is the winNT 4.0 PDC for the windows boxes, which also runs an SNMP agent that I once played around with, that still attempts to discover and then poll the network, but doesn't find many of the machines, including the gateway.

Thanks again.

by: pfenerty Posted on 2006-02-18 at 12:00:26 ID: 15990182

Had an episode yesterday. The arp cache on the affected machine looks OK ... only entry is the gateway.

by: RapidDelp Posted on 2006-02-26 at 17:33:22 ID: 16051889

I have been having funny connection problems of a similar kind, where adjusting the MTU helps. It does not seem to be the right kind of problem for all of your symptoms, but it might be worth running up the packet size with ping from the affected machine to find the max size that gets through; that might give you some insight.

    ping -s 500
    ping -s 1700

by: pfenerty Posted on 2006-02-26 at 17:55:00 ID: 16051960

Thanks, I will give it a try, next opportunity. Wish I could trigger this problem, at my convenience, but so far, I have to wait for it to find me. Once or twice per week is all I get.

The next time around I'd like to compare outgoing SYN request packets from the affected machine with those from one not affected. Since all the LAN machines get SNATed, seems pretty funny that all but one get through ok ...

by: pfenerty Posted on 2006-03-15 at 10:54:05 ID: 16197238

Seems like the MTU issue would not apply here.

In general, once a TCP connection is established, and the client asks for something big, and the server sends it back with the "don't fragment" bit set, then maybe some router along the path, unable to handle the large packet, and forbidden to fragment it, might return an ICMP notification instead, which maybe gets blocked somewhere, and so the requested object disappears without a trace.
Such "black hole router" problems can show up as

"really weird problems which can mainly be described such that everything works perfectly from your firewall/router, but your local hosts behind the firewall can't exchange large packets. This could mean such things as mail servers being able to send small mails, but not large ones, web browsers that connect but then hang with no data received, and ssh connecting properly, but scp hangs after the initial handshake. In other words, everything that uses any large packets will be unable to work."

http://iptables-tutorial.frozentux.net/chunkyhtml/x4700.html

But in my case, after requesting a TCP connection with a SYN packet, I get no SYN/ACK back from any server, no matter what, for the duration of the episode. Since no connection is ever established, there's never any client request for any payload, large or small. The missing SYN/ACK packet, which somehow never arrives at my gateway's WAN interface, for any of the SYN request re-tries, is on the order of 60 bytes ... not a fragmentation target.

After the episode passes, TCP connections establish as before the episode, and as expected.

The most puzzling part of all this is that such an episode only occurs at one host on the LAN, while all other hosts remain unaffected. This is puzzling because the gateway/router SNATs the IPAddresses for all the hosts on the LAN, such that by the time an outbound packet appears at the WAN, all the LAN hosts are indistinguishable from each other, at least by way of source address.

By way of demonstration, here are diff files for two cases of SYN request packets for a host that has become unable to establish TCP connections.

CASE 1 compares the SYN request packet for a single host: (<) connection hangs in SYN_SENT, compared with (>) connection becomes established. The host was rebooted in between connection attempts in order to 'fix' the problem. The packets compared here are the packets included in my original post.
4,8c4,8
< Frame 7 (62 bytes on wire, 62 bytes captured)
<     Arrival Time: Feb 3, 2006 12:57:16.931563000
<     Time delta from previous packet: 30.040320000 seconds
<     Time since reference or first frame: 32.246799000 seconds
<     Frame Number: 7
---
> Frame 1 (62 bytes on wire, 62 bytes captured)
>     Arrival Time: Feb 3, 2006 13:14:38.485541000
>     Time delta from previous packet: 0.000000000 seconds
>     Time since reference or first frame: 0.000000000 seconds
>     Frame Number: 1
24c24
<     Identification: 0xa07e (41086)
---
>     Identification: 0xa8cd (43213)
32c32
<     Header checksum: 0xd39b [correct]
---
>     Header checksum: 0xcb4c [correct]
37,38c37,38
< Transmission Control Protocol, Src Port: 3186 (3186), Dst Port: http (80), Seq: 0, Ack: 0, Len: 0
<     Source port: 3186 (3186)
---
> Transmission Control Protocol, Src Port: 3230 (3230), Dst Port: http (80), Seq: 0, Ack: 0, Len: 0
>     Source port: 3230 (3230)
52c52
<     Checksum: 0x6f71 [correct]
---
>     Checksum: 0x7873 [correct]

CASE 2 compares the (<) SYN request packet for a host unable to establish TCP connections, with a (>) SYN request packet for a host that can connect, and does. Both hosts are on the same LAN, and both packets are captured at the shared gateway's WAN interface. The connection requests were made within minutes of each other. The broken host remained broken before, during, and after the unaffected host successfully connected.
2c2
<     Arrival Time: Mar 12, 2006 07:38:57.224563000
---
>     Arrival Time: Mar 12, 2006 07:36:22.077999000
21c21
<     Identification: 0x68a7 (26791)
---
>     Identification: 0xa865 (43109)
29c29
<     Header checksum: 0xbd82 [correct]
---
>     Header checksum: 0x7dc4 [correct]
34,35c34,35
< Transmission Control Protocol, Src Port: 3146 (3146), Dst Port: http (80), Seq: 0, Ack: 0, Len: 0
<     Source port: 3146 (3146)
---
> Transmission Control Protocol, Src Port: 3423 (3423), Dst Port: http (80), Seq: 0, Ack: 0, Len: 0
>     Source port: 3423 (3423)
48,49c48,49
<     Window size: 16384
<     Checksum: 0x118b [correct]
---
>     Window size: 65535
>     Checksum: 0x0e8a [correct]

... not very different ... ports, IDs, checksums ... !!

by: RapidDelp Posted on 2006-03-16 at 06:06:15 ID: 16204955

Paul,

You clearly have a deep understanding of what is going on in the realm that you can observe. I am just going to try some semi-random observations to see if they trigger other paths to explore. Maybe this will help.

From your descriptions, I think you are saying that the packet (SYN ACK) does not get back to the WAN interface, so it is either not being generated by the far side (google news) or is getting eaten on the way back (the real WAN or your cable plant).

Can you confirm that the sniffer that you have been reporting is on the WAN side of the NAT function? I think that it is, from reading what you have said.

Also, is it far side host independent? (google news and all other TCP hosts)

You also say that rebooting the local machine, or waiting, will make it go away (start working again). Does the rebooting make it start working again every time? Or is it just the time duration (that it takes to reboot)?

You can ICMP ping the remote host from the non-working machine; can you tcp-ping it? (OK, a simple question to ask, but you need a tool that I do not have, and although a google for "tcp-ping download" comes back with some possibilities, I have not tried any of these.)
If someone on the cable plant is taking over, you would still see the SYN/ACK.

OK, just thoughts. Feel free to abandon, or continue. This is interesting.

by: pfenerty Posted on 2006-03-16 at 11:58:07 ID: 16208802

Thanks for continuing to chew on this ... I deeply appreciate any new ways to think about what's going on here. And yes, 'interesting' is one of several descriptors I apply to this.

There is no doubt about where the sniffing occurs. I plug a laptop into a hub along with the router WAN interface and the cable modem, and borrow an unused IPAddress for the session. All sniffed packets have the DHCP assigned IPAddress of the WAN interface.

And yes, it matters not what's on the far end. Any browser request for any URL eventually times out during an episode, as do mail client download requests.

I can't say for sure how recovery occurs, and so the possibility certainly exists that it's ultimately only timeout related, and everything else is coincidence. I do know that in the early days of troubleshooting this, it seemed to persist longer. Often I would go off and explore various network elements for a while, and come back to find the problem remained. These days I only take data at the WAN interface, and exercise the broken machine more as a result. Twice recently the machine recovered before I was done. There have been times when the problem occurs while I am in no mood to be interrupted by a troubleshooting session, and so instead I simply hit the page refresh button some 20 - 30 times, at which point normal operation returns.

Regarding tcp-pinging, my next Edisonian-approach troubleshooting step (have to wait for the problem to come back ... I have never been able to trigger it) is to use linux netcat, from a non-broken host on the LAN, during an episode, and attempt to establish connections on both (a) the same port number that the broken machine is attempting to connect from, and (b) a numerically very different port.
The idea being that it would at least make some sense if this turns out to be socket related, such that it's the half-association, as seen from the router, that breaks. More to the point, what I'd like to find, at least as far as understanding what's going on, is that half-associations for a -range- of ports break. Looking at the router's connection table during one episode, I saw that all the SYN_SENT hung connections started around port 3150, and continued on sequentially up through around 3186. Such behavior might explain why exercising the broken machine 'fixes' the problem ... i.e., by simply running out the broken range.

Not sure how happy I'll be, from a security standpoint, to discover such a recurring mode of operation. My man-in-the-middle paranoia expressed earlier about sharing the cable feed with my neighbors came from working backwards from the 'solution' ... i.e., either my SYN packets don't reach their destination, or the returning SYN/ACKs don't reach me, for a targeted set of connections.

"Paranoia is just reality on a finer scale." - Philo Gant

by: pfenerty Posted on 2006-03-16 at 13:04:13 ID: 16209569

Just occurred to me, as I posted the above, that maybe the broken port range is fixed, and not variable.

So using netcat, I just sent a SYN request from port 4000 to google news and got a SYN/ACK right back. By the way, I first noticed the problem at google news, which loads a fairly lightweight webpage, reliably and rapidly, usually, which is why I continue to test there. Otherwise it's arbitrary, except that whenever I have no idea what's going on, I try to keep some set of parameters as unchanging as possible. Plus it's always nice to see what's going on in the real world.

Then I repeated the test from port 3160, one that had shown up in the broken range earlier and, yep, ethereal saw only three lonely SYNs, and zero SYN/ACKs.

Now it looks like I know what's going on ... but ... how, and why? Is this a broken cable modem?
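The netcat test described above (same destination, one attempt from source port 4000 and one from source port 3160) can also be sketched in Python. This is only a minimal sketch under stated assumptions, not the command that was actually run: the function name `probe_from_port` and its return strings are made up for illustration, and it assumes the chosen source port is free on the local machine and is not a privileged port.

```python
import socket


def probe_from_port(host, dest_port, source_port, timeout=5.0):
    """Attempt one TCP connect from a fixed local source port.

    Returns "ok" if the handshake completes, "no-reply" if the attempt
    times out (no SYN/ACK seen in time), "bind-failed" if the source port
    cannot be claimed locally, and "error" for anything else (for example
    a refused connection).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow the same source port to be reused quickly between probes.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.settimeout(timeout)
    try:
        try:
            sock.bind(("", source_port))  # pin the local (source) port
        except OSError:
            return "bind-failed"
        try:
            sock.connect((host, dest_port))
        except socket.timeout:
            return "no-reply"
        except OSError:
            return "error"
        return "ok"
    finally:
        sock.close()


# Hypothetical usage, mirroring the two netcat attempts described above:
#   probe_from_port("news.google.com", 80, 4000)
#   probe_from_port("news.google.com", 80, 3160)
```

If only one of the two calls comes back "no-reply" while the destination stays the same, that points at something keyed on the source port rather than on the host or the server.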
by: pfenerty Posted on 2006-03-18 at 11:07:21 ID: 16225945

Not that I can imagine how a modem could break only certain bits within a stream, but I'd sure like to discover that this problem originates from within my facility, and it sure seems to be on the WAN side of my router. Wishful thinking.

For what it's worth, the problem does not seem to spill over into UDP: when I replace the IPAddress for google news with 'news.google.com', I get back the UDP DNS lookup for both cases: netcat source port 4000, and netcat source port 3160. But I get no TCP SYN/ACK for that latter case. Works great from port 4000.

Last time this occurred on its own, from a browser, I watched the router's connection table to try to better define the range of broken ports. The first hung SYN_SENT connection originated from port 3127, but 3126 was used locally. Refreshing the browser page in steps, I saw a hung connection at 3194, and then the first established connection at 3199.

Not a very exciting range ... not even a power of 2. The binary gets mildly interesting, rolling from 3194 to 3199:

3194  110001111010
3195  110001111011
3196  110001111100
3197  110001111101
3198  110001111110
3199  110001111111

... but that's only mirrored in four bits on the other end:

3127  110000110111
3126  110000110110
3125  110000110101
3124  110000110100
3123  110000110011
3122  110000110010
3121  110000110001
3120  110000110000

... three months into the problem, grasping at bits ...

by: pfenerty Posted on 2006-03-18 at 11:35:39 ID: 16226090

Mapped out the broken port range using netcat:

3126  ok
3127  ng
...
3198  ng
3199  ok

... knowledge is power

by: dbardbar Posted on 2006-03-18 at 11:54:56 ID: 16226180

And from 3199 upward, it works OK? If so, perhaps you should change the range of source ports on the machines:

http://www.microsoft.com/technet/community/columns/cableguy/cg1205.mspx

There's also a simple way to do so on Linux.
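The port-by-port mapping reported above (3126 ok, 3127 ng ... 3198 ng, 3199 ok) amounts to collapsing per-port results into runs of consecutive failures. A minimal sketch of that bookkeeping follows; the function name `failing_ranges` and the dictionary input format are assumptions chosen for illustration, and the probing itself (netcat or otherwise) is left out.

```python
def failing_ranges(results):
    """Collapse {source_port: connected_ok} into (first, last) runs of failures.

    Only numerically consecutive failing ports are merged, so a port that
    was never probed, or that connected fine, ends the current run.
    """
    ranges = []
    start = prev = None
    for port in sorted(results):
        if results[port]:
            continue  # this port connected; it cannot extend a failing run
        if start is not None and port == prev + 1:
            prev = port  # extend the current run by one port
        else:
            if start is not None:
                ranges.append((start, prev))  # close the previous run
            start = prev = port  # open a new run
    if start is not None:
        ranges.append((start, prev))
    return ranges


# Hypothetical input shaped like the netcat results quoted above.
observed = {3126: True, 3199: True}
observed.update({port: False for port in range(3127, 3199)})
print(failing_ranges(observed))
```

Having the broken range as a single pair of numbers makes it easier to search for, which is how such a range can be matched against published port lists.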
by: pfenerty Posted on 2006-03-18 at 12:26:59 ID: 16226326

Amazing what you can find, once you know what to look for:

Learn How Your ISA Server Helps Block MyDoom Traffic

Affected Ports

Table 1 lists affected ports known to be used by MyDoom. You should block those ports. This data is current as of 01:24:53, Monday, February 09, 2004.

#   Port Number   IP Protocol   Known to Be Used by MyDoom?
1   3127-3198     TCP           Yes

http://www.microsoft.com/isaserver/support/prevent/mydoom.mspx

by: pfenerty Posted on 2006-03-18 at 12:30:14 ID: 16226343

Thanks to all for listening. dbardbar came up with a workaround, so it's ok with me if you get the points.

-paul

by: dbardbar Posted on 2006-03-18 at 12:33:12 ID: 16226361

http://www.ncftp.com/ncftpd/doc/misc/ephemeral_ports.html
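The workaround of moving the source port range only helps if the new range stays clear of 3127-3198, and that can be checked with a few lines. The sketch below is illustrative only: the helper names are hypothetical, the Windows XP default of 1025-5000 and the Linux default of 32768-61000 are assumptions about stock settings that may not match a given machine, and the `/proc` path applies to Linux only.

```python
BLOCKED = (3127, 3198)  # TCP range from the MyDoom table quoted above


def overlap(range_a, range_b):
    """Return the (low, high) intersection of two inclusive ranges, or None."""
    low = max(range_a[0], range_b[0])
    high = min(range_a[1], range_b[1])
    return (low, high) if low <= high else None


def read_linux_ephemeral_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the local port range a Linux host uses for outgoing connections."""
    with open(path) as handle:
        low, high = handle.read().split()
    return int(low), int(high)


# Assumed defaults, for illustration only; check the actual machines.
ASSUMED_DEFAULTS = {
    "Windows XP (assumed default)": (1025, 5000),
    "Linux (assumed default)": (32768, 61000),
}

for name, port_range in ASSUMED_DEFAULTS.items():
    hit = overlap(port_range, BLOCKED)
    print(name, port_range, "overlaps blocked range" if hit else "clear")
```

On a Linux host, passing the result of `read_linux_ephemeral_range()` to `overlap` checks the live setting instead of an assumed default.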