6

We have several Windows Server 2008 R2 servers which run a variety of commercial (e.g. SQL Server) and in-house applications. It's a mix of virtual and physical machines, all of which have been running for some years without any problems.

However, in the last few weeks a few servers have suddenly stopped being able to establish new network connections. An example of this is one of our applications that connects to SQL over a normal socket connection - it simply hangs. Trying to browse to a network share from the affected server tells us:

The name limit for the local computer network adapter card was exceeded

This looked like good old-fashioned ephemeral port exhaustion to me, and bumping up the number of ephemeral ports temporarily resolves the problem:

[screenshot: expanded ephemeral port range (netsh)]
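
For reference, the commands involved look like this (an illustrative sketch, run from an elevated prompt; the range shown is an example rather than the exact values used here):

    # Inspect the current dynamic (ephemeral) TCP port range
    netsh int ipv4 show dynamicport tcp

    # Widen it, e.g. to ports 10000-65535
    netsh int ipv4 set dynamicport tcp start=10000 num=55536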

However, even with this in place the server only lasts a few days before the problem recurs. Moreover, I can't see what's consuming any great number of ports - again, nothing has changed on the servers, and the problem has happened on 4 different boxes running different types of application.

If I look at the busiest server, running SQL Server 2014, TCPView shows around 1000 connections:

[screenshot: TCPView connection list]

I've run this script which logs ephemeral port usage, and it never goes above a few dozen ports.
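
(The linked script isn't reproduced here; the sketch below shows the general idea it implements, assuming the default dynamic range start of 49152 and a hypothetical log path of C:\Temp\PortUsage.csv.)

    # Every five minutes, count local TCP ports at or above the dynamic-range
    # start and append a timestamped total to a CSV (assumes C:\Temp exists).
    $start   = 49152                     # confirm with: netsh int ipv4 show dynamicport tcp
    $logFile = 'C:\Temp\PortUsage.csv'   # hypothetical path

    while ($true) {
        $ports = netstat -ano -p tcp |
            Where-Object { $_ -match '^\s*TCP' } |
            ForEach-Object {
                $local = ($_ -split '\s+')[2]     # local address, e.g. 10.0.0.5:49731
                [int]($local -split ':')[-1]      # local port number
            } |
            Where-Object { $_ -ge $start }

        "{0},{1}" -f (Get-Date -Format s), @($ports).Count | Add-Content $logFile
        Start-Sleep -Seconds 300
    }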

Process Explorer doesn't show anything exciting either:

[screenshot: Process Explorer]

My gut feeling is that something in July's "Patch Tuesday" updates has caused this, but I could be completely wrong. All we know is that previously working servers now stop working after a few days; everything is patched up to date (both Microsoft updates and vendor drivers); it's affecting multiple servers, both physical and virtual; and there's no sign of anything burning through ephemeral ports. Can anyone suggest how to isolate what's causing the issue?

KenD
  • 1,147
  • Could you check with tcpview/netstat for FIN_WAIT_2 and LAST_ACK connections? – Simone Zabberoni Aug 19 '17 at 21:53
  • Afraid there's no sign of any connections in those states - they're either ESTABLISHED (the majority) or TIME_WAIT (only around 20) – KenD Aug 21 '17 at 07:19
  • Are you using the -a option with netstat to show "all" connections? Also recommend using -o for the owning process ID, but sometimes this is 0 for orphaned connections. I typically use -ano – Clayton Aug 22 '17 at 18:51
  • Yes, I'm using -ano. I have a server in its "problematic" state at the moment - it's an hour before "end of business" here and the time when I can bounce it. netstat -ano reports there are 874 ESTABLISHED connections, 17 TIME_WAIT connections and 39 LISTENING. If I use netsh to bump up the ephemeral port range, the problem disappears - but I'm running out of ports I can allocate to it. – KenD Aug 22 '17 at 19:02

3 Answers

4

Sounds like your suspicion about the July update being the source of the problem has merit. Try uninstalling the July 11, 2017 rollup from one of your affected servers. If the problem goes away, consider contacting MS support? Then again, as it's a "known issue", that might not be very productive...

Known issues in this update (KB4025341):

Symptom: Due to a defect in WLDAP32.DLL, applications that perform LDAP referral chasing can consume too many dynamic TCP ports (potentially exhausting them)

Workaround: To work around the problem, restart services or applications that perform LDAP referral chasing to free TCP dynamic ports.
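
One quick way to see which processes currently have the affected DLL loaded, and so would be candidates for a restart, is for example:

    # List processes that have wldap32.dll loaded (run from an elevated prompt)
    tasklist /m wldap32.dll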

Clayton
  • 4,573
  • Interesting. The server was rebooted just an hour ago, and using Process Explorer I can see the only things with handles to WLDAP32.DLL are Explorer and (bizarrely) the Java update app. I wonder - if it is this issue, would I expect to see those executables as "owners" of half-open connections? I'll wait for the issue to reoccur (which usually takes a couple of days), and then try blocking and removing that update from this one server and see what happens. – KenD Aug 22 '17 at 21:43
2

The script mentioned here might help diagnose which process is using up the ports. Basically it combines the output of netsh int ipv4 show dynamicport tcp with netstat -ano -p tcp to help with the diagnosis. The text also mentions that this only covers issues in user-mode processes, and explains how WinDBG must be used for diagnosing issues in kernel-mode processes.
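
(Not the original script, but a rough sketch of the same approach - read the dynamic range from netsh, then group in-use local ports within that range by owning PID from netstat; the regexes assume English-language command output.)

    # Determine the dynamic port range
    $range = netsh int ipv4 show dynamicport tcp
    $start = [int]($range | Select-String 'Start Port\s*:\s*(\d+)').Matches[0].Groups[1].Value
    $num   = [int]($range | Select-String 'Number of Ports\s*:\s*(\d+)').Matches[0].Groups[1].Value
    $end   = $start + $num - 1

    # Count in-use local ports within that range, grouped by owning PID
    netstat -ano -p tcp |
        Where-Object { $_ -match '^\s*TCP' } |
        ForEach-Object {
            $fields = $_ -split '\s+'
            $port   = [int](($fields[2] -split ':')[-1])              # local port
            if ($port -ge $start -and $port -le $end) { $fields[-1] } # owning PID
        } |
        Group-Object |
        Sort-Object Count -Descending |
        Select-Object -First 10 Count, Name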

This thread also mentions Event 4231 as another indicator of the issue.
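
For example, something like this should surface recent occurrences of that event (assuming it is logged to the System log):

    # Look for recent Event ID 4231 entries in the System log
    Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 4231 } -MaxEvents 20 |
        Format-Table TimeCreated, Id, Message -Wrap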

alokoko
  • 121
0

This is an interesting one. It caught my attention because I previously ran into issues opening multiple TCP connections with Windows servers. Are you logging CPU and network throughput, and noticing any anomalies?

Assuming you are doing a post-mortem, you've probably checked over the logs and haven't noticed anything unusual. Is traffic on the rise? Is there anything that could explain the uptick in errors? Are the 4 boxes on the same subnet? Accessed by different applications?

Other things to look at: shared connections. Was the extended range of ephemeral ports in place when any shares were created?

TCP TCB (transmission control block) limits. I think it's 2000 on 2k8 R2, but double-check that. Good luck with this one.
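
If you want to check whether any explicit TCP limits have been configured, the legacy Tcpip registry values are worth a quick look (a sketch; reg query simply errors if a value has never been set, and on 2008 R2 the netsh dynamic port range generally takes precedence):

    # Check for explicitly-set legacy TCP/IP limits (an error just means the value isn't set)
    reg query HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v MaxUserPort
    reg query HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpTimedWaitDelay
    reg query HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v MaxFreeTcbs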

  • There's no sign of anything wrong on the server - CPU, memory and network are all fairly standard. Once the problem has happened, it'll remain like that until I reboot the server - even once there's no user activity and the server is very "quiet". There are only 2 shares, both of which were created years ago, way before this problem occurred. – KenD Aug 22 '17 at 20:31