VirtualBacon

Weird VM network issues with HP NC375T Quad Port Network Adapter

Posted on October 30, 2012

NC375T Network Adapter

A few weeks ago a weird network issue began popping up on random VMs on our network. Specifically, VMs would not communicate with other servers on the same subnet, other servers on the same subnet could not communicate with them, and in some cases servers on different subnets could not communicate across routed connections. In looking at the problem, the pieces of information that stood out (at first; more on that later) were:

  • The servers affected were all running Windows 2008, either 32-bit Standard SP2 or 64-bit R2.
  • The hosts were plugged into the same pair of switches.
  • The subnets, though different, were behind the same firewall.
  • The problem was related to missing ARP entries, either on the servers or on the firewall, depending on the symptom at the time.

Needless to say, having this happen seemingly at random on a production network is bad news, and of course it usually happens in the middle of the night, causing people to be paged/called and woken up. The fix, though not an acceptable permanent solution, was to add static ARP entries on the affected servers and on the firewall. We also proactively did this for a few key servers.
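For anyone hitting the same symptom, the server-side workaround can be sketched roughly as below. The interface name, IP, and MAC are placeholders, not our actual values; note that on Windows 2008 the netsh neighbor syntax is the supported way to add a static entry, as the old `arp -s` form from Windows 2003 may not persist there.

```shell
:: Show the current ARP (neighbor) cache to confirm the peer's entry is missing
arp -a

:: Windows 2008 / Vista and later: add a static neighbor entry
:: (interface name, IP, and MAC below are placeholders)
netsh interface ipv4 add neighbors "Local Area Connection" 10.0.0.20 00-11-22-33-44-55

:: Windows 2003 and earlier: the classic static ARP syntax
arp -s 10.0.0.20 00-11-22-33-44-55
```

The equivalent entry also has to be added on the firewall, since the missing ARP entries showed up on both sides depending on the symptom.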

Well, this issue was confusing for a couple of weeks, but I think I finally found the solution.

First off, searching online found no shortage of people having similar problems going back to about 2008-2009, right around the time Vista and Windows 2008 were released, and more specifically tied to a change in the default OS behavior regarding gratuitous ARP requests, intended to protect the OS from ARP cache poisoning. Unfortunately, because people expected the behavior to be the same as in prior versions of Windows, this caused problems for many people. It could also cause cluster failover issues, and Microsoft eventually released a hotfix in 2011. It kind of made sense to me, especially since Wireshark traces showed ARP being filtered somewhere between the server and the network switch port, but what did not make sense was why this would happen seemingly out of nowhere. In looking at the few changes that had taken place, nothing should have caused this. Also, because we could "fix" the problem within the OS, and because the problem only seemed to apply to Windows 2008, it seemed like the problem was in the OS. I also discussed it on the VMware Communities forum with someone else who seemed to be experiencing similar symptoms.

Other fixes mentioned by others for similar symptoms were:

  • Disable proxy ARP on the Cisco PIX or ASA firewall (as mentioned by Jason Boche, for example).
  • A code problem on a Cisco switch or firewall where a code upgrade fixed the problem.

While this all sounds plausible, what did not make sense was that we had not changed much of anything and the problem suddenly showed up. Nearly two weeks later, a new development: the problem happened on two other Windows 2008 servers on another vDS, on another physical switch, on a subnet behind a different firewall; both the switch and the firewall were different models.

Hmm. There goes the ASA and IOS software bug theory, which I did not like anyway. Then another week later, another development: the first occurrence of the symptom on a Windows 2003 server. There goes the Windows 2008 GARP issue theory (I had already tried the hotfix by this point and it didn't work).

After having a case open with VMware support for a few days without finding anything, the introduction of a Windows 2003 server having the issue pushed me into looking at some other possibilities again. I had seen a few mentions of the problem possibly being due to a bad Broadcom NIC driver, but I had checked my NICs, and while I do have some of those, I also have Intel and HP network adapters, and the affected servers did not have a Broadcom adapter for an uplink anyway. I did finally come across an article that mentioned a firmware update being available for the HP NC375T Quad Port network adapter, of which I have many, in most hosts, and it appeared to address the type of issue I was seeing. It fixed a sporadic loss of network connectivity, usually under load, and applied to Windows, Linux, and ESX (and XenServer, based on another forum thread I came across). Bingo.

It was difficult to track down the problem as my hosts are not all identically configured (we purchased them over time and re-purposed other hosts too), and the uplinks are not all connected to the same NIC models for each switch. If they were the same, it might have been easier to spot. Either way, I plan to install the firmware on a few of the hosts in the cluster to see if that fixes the problem. It's the best explanation so far and makes the most sense. It explains why the problem usually happened at night (batch jobs, large file transfers, backups), and it also explains one host in a test environment which began showing a NIC down that could not be resolved without rebooting the host (one of the symptoms described in the HP advisory).

Here is the HP Advisory link.

And here is a recently updated VMware KB article describing the issue as well, with notes of what to look for in the logs to spot the problem.
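For reference, the NC375T is a NetXen (QLogic) based adapter, loaded by the classic ESX service console as the nx_nic driver. A rough sketch of how one might scan the logs for related messages follows; the exact log strings vary by driver and firmware version, so treat the patterns here as assumptions and check the KB article for the precise ones.

```shell
# From the ESX service console: look for NetXen/nx_nic driver messages
# around the time of an outage (patterns are illustrative, not exact)
grep -i "nx_nic" /var/log/vmkernel

# Recent link state changes on the host's uplinks
grep -iE "link (down|up)" /var/log/vmkernel | tail -20
```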

Update: While the latest firmware update (588) seems to have reduced the frequency of the problem, it is still occurring under load. I have isolated the problem to this specific NIC model: hosts without this network adapter do not experience the issue. As such, I will likely be replacing the HP NC375T network adapters in our hosts with another model based on a different chipset. One option is the HP NC365T, which is based on an Intel chipset. Initial research does not turn up any issues with this adapter other than some people having problems before it was on the HCL. These run between $265 and $500 depending on where you find them.


Update 2: After continuing to work with HP Support on this issue, they finally said they had to send me new adapters with a later hardware revision. We replaced some of the adapters, and the new ones appear to work properly. I am still working on getting the rest of the HP NC375T adapters replaced, however, since I had 12 of them in 6 hosts but for some reason HP only sent 7 replacements. It's been months now and we're still not quite done. We solved the problem ourselves by purchasing different cards, but we should not have had to.

Posted by Peter

Comments
  1. Hi Sunny,

    The replacement cards with a new hardware revision that HP sent us appeared to fix the problem. That said, for peace of mind we did purchase NC365T adapters, which use an Intel chipset. In the end we had one of each in each server, instead of two of the same, for driver-level redundancy.

    Peter

  2. Hi, we are facing the same issue with our NC375T cards, and after numerous firmware and driver updates we continue to see the issue. HP just sent us a new card with a different hardware revision. I was wondering if you experienced any more outages after your update 2, where HP sent you new cards. We are at a point where we just want to buy brand new Intel cards so we can sleep better at night.


