
Lately, we've become aware of a TCP connection issue that mostly affects Mac and Linux users who browse our websites.

From the user's perspective, it shows up as a very long connection time to our websites (>11 seconds).

We've managed to track down the technical signature of this problem, but can't figure out why it is happening or how to fix it.

Basically, what is happening is that the client's machine is sending the SYN packet to establish the TCP connection and the web server receives it, but does not respond with the SYN/ACK packet. After the client has sent many SYN packets, the server finally responds with a SYN/ACK packet and everything is fine for the remainder of the connection.

And, of course, the kicker to the problem: it is intermittent and does not happen all the time (though it does happen 10-30% of the time).
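For anyone trying to reproduce this from a client machine, here is a minimal sketch of how the slow handshakes could be counted. It is not what we used for the analysis (that was Wireshark); the function names, the 3-second threshold, and the attempt count are all assumptions chosen for illustration.

```python
import socket
import time


def measure_connect_times(host, port, attempts=20, timeout=30.0):
    """Return the TCP connect duration in seconds for each attempt.

    A failed or timed-out attempt is recorded as None.
    """
    durations = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection returns once the TCP handshake has completed,
            # so the elapsed time includes any SYN retransmissions.
            with socket.create_connection((host, port), timeout=timeout):
                durations.append(time.monotonic() - start)
        except OSError:
            durations.append(None)
    return durations


def slow_fraction(durations, threshold=3.0):
    """Fraction of attempts that failed or took longer than threshold seconds.

    The threshold is an arbitrary illustrative cutoff, not a measured value.
    """
    if not durations:
        return 0.0
    slow = sum(1 for d in durations if d is None or d > threshold)
    return slow / len(durations)
```

Calling `slow_fraction(measure_connect_times(host, 80))` against an affected server from outside the LAN should give a rough idea of how often the delay occurs, which can then be compared with the 10-30% we see.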

We are using Fedora 12 Linux as the OS and Nginx as the web server.

Screenshot of wireshark analysis


Update:

Turning off window scaling on the client stopped the issue from happening. Now I just need a server side resolution (we can't make all the clients do this) :)

Final Update:

The solution was to turn off both TCP window scaling and TCP timestamps on our servers that are accessible to the public.
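The two settings involved are the sysctls `net.ipv4.tcp_window_scaling` and `net.ipv4.tcp_timestamps`, and setting both to 0 disables them. As a rough sketch, a server's current values could be checked like this; the helper names and the `base` parameter are hypothetical, and the script only reads the values, it does not change them.

```python
from pathlib import Path

# The two settings named in the final update, as exposed under /proc on Linux.
SETTINGS = ("tcp_window_scaling", "tcp_timestamps")


def read_tcp_settings(base="/proc/sys/net/ipv4"):
    """Return {name: int value} for both settings, or None where unreadable."""
    values = {}
    for name in SETTINGS:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


def not_disabled(values):
    """Names of settings that are not confirmed to be 0 (disabled)."""
    return [name for name, value in values.items() if value != 0]
```

On a server where the fix has been applied, `not_disabled(read_tcp_settings())` should return an empty list. To make the change survive a reboot, the same two keys would normally go into `/etc/sysctl.conf`.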

  • I think we will need to see some tcpdump of it happening. – coredump Feb 15 '11 at 23:24
  • Do you have any acls or rules based on reverse DNS? You may need to look at more than just the connection between the client and the server. Perhaps a DNS lookup is timing out? – Zoredache Feb 15 '11 at 23:28
  • @coredump: here's a screen shot of the wireshark analysis that shows the issue http://i.imgur.com/Bnzrm.png (couldn't figure out how to export just the stream....) – codemonkey Feb 15 '11 at 23:49
  • @Zoredache: no, we don't have any acls or rules based on reverse DNS. This is a public facing webserver and we allow everyone to access it – codemonkey Feb 15 '11 at 23:51
  • Just a hunch, but are you doing any kind of incoming connection rate-limiting on the server? Say, with iptables? – Steven Monday Feb 15 '11 at 23:58
  • @Steven: no, we're not doing any connection rate-limiting. At least, not on purpose ;) – codemonkey Feb 16 '11 at 00:08
  • What is the value of your net.ipv4.tcp_window_scaling sysctl? – coredump Feb 16 '11 at 02:03
  • @coredump: the value for net.ipv4.tcp_window_scaling is 1 – codemonkey Feb 16 '11 at 15:11
  • Are there any additional open ports except #80 on the server? What happens when connecting to them?… 2. Does dmesg @ server-side reveal anything interesting? – poige Feb 16 '11 at 03:55
  • port 80 is the only port open to the public. this problem only happens when connecting from outside the LAN network – codemonkey Feb 16 '11 at 15:20
  • Sometimes there's buggy software in middle: http://kerneltrap.org/node/6723 What is – poige Feb 16 '11 at 17:53
  • If you actually solved this, you should come back and post your answer. (And I hope you got rid of Fedora 12 before now!) – Michael Hampton Apr 05 '13 at 16:32