If you are like me and have seen the infamous Windows failover cluster error that contains the phrase “A transient communication failure” followed by your email database being down than this is the patch you have been waiting for. Many of my clients have multiple sites that participate in a DAG and over the past few months unexplained failover/cluster shutdowns have occurred randomly. After extensive debugging most of the error logs were pointing to some sort of communication issue between nodes. Now, raising the (cross)subnet latency thresholds are one of the first things that get changed in my build outs since most of my client’s sites are connected through an unreliable internet IPSEC VPN tunnel. Even though this helped a lot for the times when the VPN tunnel would bounce up and down this didn’t address the cluster failures. Fast forward to a few weeks ago, I saw a tweet from Scott Schnoll (@schnoll) who is a technical writer for Microsoft mention KB2550886, which has been released by the Windows clustering team.
Basically, Scott explains how this issue affect a DAG in a situation where two nodes cannot communicate with each other but can still talk to the others. This type of scenario is most likely to happen in a multisite DAG with a WAN latency his high. With this topology cross site cluster communication delays recieving regroup messages from the Global Update Manager (GUM) can occur and end up being received in an unexpected order. In a situation like this the offending node should automatically be removed from the cluster and the quorum should remain intact. Instead this was causing a domino effect which ultimately caused quorum to be lost which resulted to databases being dismounted.
Happy to see Microsoft finally address this and to Scott for explaining what exactly was going on under the hood. A lot of the documentation you see in Technet for Exchange has been written by him and he is a super smart guy. The patch has to be installed on all mailbox servers and is recommended to be installed even if you don’t have a multi-site cluster. On a side note, tweets like this have caused me to realize what an invaluable tool Twitter has become to me and my everyday personal/worklife. @3cVguy
You can download the hotfix here.
Other important hotfixes:
http://support.microsoft.com/kb/2549472 – Cluster node cannot rejoin the cluster after the node is restarted or removed from the cluster in Windows Server 2008 R2
http://support.microsoft.com/kb/2549448 – Cluster service still uses the default time-out value after you configure the regroup time-out setting in Windows Server 2008 R2
http://support.microsoft.com/kb/2552040 – A Windows Server 2008 R2 failover cluster loses quorum when an asymmetric communication fail