Tuesday, December 20, 2011

Lync Server 2010: Building redundancy into your dial plans

We've had quite a bit of discussion lately about the behaviour associated with Lync Server 2010 call routing when a gateway is "down". The discussion was prompted by a customer outage scenario in which a voice gateway was used to connect a Nortel CS1000 PBX to Lync Server 2010. A single T1 interface was used to provide PBX to Lync calling and Lync to PSTN for certain dialing scenarios. The Nortel CS1000 implementation was vast and contained several H.323 trunks connecting various remote sites together for toll bypass and short digit dialing patterns.


The customer had an issue with one of the H.323 connections in the Nortel world, which resulted in an outage to Lync Server. Looking deeper into the issue, we discovered the dial plan in Lync Server had multiple phone usages with routes matching the same dialing pattern. For example, a user has a voice policy with two phone usages (see below):

Phone Usage     Route               Matching Pattern    Gateway
NA-AB-Usage1    NA-AB-PBX-Route1    \+1                 vgw1.contoso.local
NA-AB-Usage2    NA-AB-SIP-Route1    \+1                 siptrunk1.provider.com

Given the above, I assumed Lync Server would match on the first usage, fail, and then not try the second usage even though its route also matched the called number. In some situations this is correct; in others it isn't. If the gateway returns a 5xx-level response to the Mediation Server, or if it's marked as "down", we will use the next phone usage matching the called number. If the gateway returns a 4xx-level response to the Mediation Server, we will NOT try the next phone usage, and the call fails.

So what happens if we have a single phone usage with multiple routes?
In this case, the same behaviour applies. A 5xx-level SIP response to the Mediation Server lets the call try the second route in the same usage; a 4xx-level response results in a call failure. Adding multiple gateways to a single route only causes them to be used in a round-robin fashion and doesn't protect us from a 4xx-level response. Below is an example of a phone usage with multiple routes:


Phone Usage     Route               Matching Pattern    Gateway
NA-AB-Usage1    NA-AB-PBX-Route1    \+1                 vgw1.contoso.local
NA-AB-Usage1    NA-AB-SIP-Route1    \+1                 siptrunk1.provider.com
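
The failover behaviour described so far can be modelled in a few lines. This is not Lync's OBR implementation, just a simplified Python sketch with hypothetical names (`Route`, `place_call`, the `responses` map). It assumes matching routes are tried in the order listed and that only a 5xx response, or a gateway already marked down, moves the call on to the next matching route.

```python
import re
from dataclasses import dataclass


@dataclass
class Route:
    usage: str
    name: str
    pattern: str   # regex matched against the start of the called number
    gateway: str


# The two routes from the tables above, in the order they would be tried.
ROUTES = [
    Route("NA-AB-Usage1", "NA-AB-PBX-Route1", r"\+1", "vgw1.contoso.local"),
    Route("NA-AB-Usage2", "NA-AB-SIP-Route1", r"\+1", "siptrunk1.provider.com"),
]


def place_call(number, routes, responses, down=frozenset()):
    """Return (gateway_that_accepted_or_None, list_of_attempts).

    responses: gateway -> final SIP response code (hypothetical test input).
    down:      gateways currently marked "down" by the Mediation Server.
    """
    tried = []
    for route in routes:
        if not re.match(route.pattern, number):
            continue                      # pattern does not match this call
        if route.gateway in down:
            tried.append((route.gateway, "down"))
            continue                      # skip a down gateway, try the next route
        code = responses[route.gateway]
        tried.append((route.gateway, code))
        if 200 <= code < 300:
            return route.gateway, tried   # call accepted
        if 500 <= code < 600:
            continue                      # 5xx: fail over to the next match
        return None, tried                # anything else (the post's 4xx case): call fails
    return None, tried
```

For example, `place_call("+14035550100", ROUTES, {"vgw1.contoso.local": 503, "siptrunk1.provider.com": 200})` ends up on the SIP trunk, while changing the first response to 488 stops the call at vgw1. The phone number is made up for illustration.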

Let's take a slight detour for a moment and talk about what happens when a gateway is down, how long it stays down for, and when or how long it takes for service to be restored.

How are gateways marked as "down"?
The Mediation Server sends a SIP OPTIONS request to the next-hop gateway, which can be viewed by running NetMon or Wireshark on the interface bound to the IP used by that service. If no response, or an invalid response, is returned, we raise event ID 25051 and increment a counter. Once the counter reaches five, we raise event IDs 25061 and 25052, marking the gateway "down". As event ID 25052 describes, subsequent failures after this point are not logged.

Event ID 25051: First failure up to five attempts...

Event ID 25052: Tried five times, won't log it anymore...

Event ID 25061: Taking the gateway out of service (down)...
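
The counting logic above can be sketched as a small state machine. This is a simplified Python model, not the Mediation Server's code: the class and method names are hypothetical, and it assumes a successful OPTIONS poll resets the failure counter, which the post does not state explicitly. The event IDs and the threshold of five come from the description above.

```python
from dataclasses import dataclass, field

DOWN_THRESHOLD = 5  # from the post: five failed OPTIONS polls mark the gateway down


@dataclass
class GatewayHealth:
    name: str
    failures: int = 0
    down: bool = False
    events: list = field(default_factory=list)  # event IDs raised, in order

    def record_options_result(self, ok):
        """Feed in the outcome of one OPTIONS poll (True = valid response)."""
        if ok:
            if self.down:
                self.events.append(25062)   # gateway back in service
            self.failures = 0               # assumption: success resets the counter
            self.down = False
            return
        if self.down:
            return                          # already down: further failures are not logged
        self.failures += 1
        self.events.append(25051)           # one failure event per failed poll
        if self.failures >= DOWN_THRESHOLD:
            self.down = True
            self.events.extend([25061, 25052])  # out of service, logging stops
```

Feeding five failed polls into a `GatewayHealth` instance leaves it marked down with five 25051 events followed by 25061 and 25052; a later successful poll appends 25062.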

How long do they stay down for?
A gateway taken out of service by Lync Server is retried every minute, which means we put a gateway back in service very quickly once an OPTIONS request succeeds. We follow this up by raising event ID 25062.

Event ID 25062: Back in business...

Even though the gateway is back in service from the Mediation server's perspective, Lync's OutBound Routing (OBR) logic may take up to 20 minutes to add it back as a viable call path. This is because the Lync OBR doesn't have access to the SIP OPTIONS status and will run an exponential back-off algorithm which is capped at 20 minutes.
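
To get a feel for what a capped exponential back-off looks like, here is a minimal Python sketch. Only the 20-minute cap comes from the behaviour described above; the one-minute starting delay and the doubling factor are assumptions chosen for illustration, and the function name is hypothetical. The actual values OBR uses are not given here.

```python
CAP_MINUTES = 20  # from the post: the back-off is capped at 20 minutes


def backoff_minutes(attempt, base=1.0, factor=2.0, cap=CAP_MINUTES):
    """Delay before retry number `attempt` (0-based).

    base and factor are illustrative assumptions, not documented Lync values.
    """
    return min(base * factor ** attempt, cap)


# Delays for the first seven retries under these assumptions.
schedule = [backoff_minutes(n) for n in range(7)]
```

Under these assumed values the delays grow 1, 2, 4, 8, 16 minutes and then stay at 20, which is why a gateway that has been failing for a while can take up to 20 minutes to be tried again even after the Mediation Server sees it as healthy.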

What happens if the gateway is "unhealthy"?
When I use the word unhealthy, I'm referring to a SIP response code in the 4xx range, which would cause an OBR failure within Lync and ultimately a failure for the end user. Let's say, given the original problem stated above, we receive back a "SIP/2.0 488 Not Acceptable Here" from the gateway. Using Lync Server PowerShell commands, we can create a new response code translation for the 488 message on that gateway as follows:

New-CsSipResponseCodeTranslationRule -Identity "PstnGateway:10.0.0.6/Rule488" -ReceivedResponseCode 488 -TranslatedResponseCode 503

In the above example, Lync's OBR logic will retry the next route or phone usage if the pattern matches the called number.
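
The effect of such a rule on the retry decision can be illustrated with a short Python sketch. It is a model of the idea only, with hypothetical names (`TRANSLATION_RULES`, `translate`, `obr_will_retry`), and assumes the rule applies just to the gateway it was created on, as the PstnGateway identity above suggests.

```python
# (gateway, received code) -> translated code; mirrors the 488 rule above.
TRANSLATION_RULES = {
    ("10.0.0.6", 488): 503,
}


def translate(gateway, code, rules=TRANSLATION_RULES):
    """Return the code OBR sees after any matching translation rule is applied."""
    return rules.get((gateway, code), code)


def obr_will_retry(gateway, code, rules=TRANSLATION_RULES):
    """True if the (possibly translated) response is 5xx, so the next route is tried."""
    return 500 <= translate(gateway, code, rules) < 600
```

With this table, a 488 from 10.0.0.6 is seen as a 503 and triggers a retry, while a 488 from any other gateway, or a different 4xx code from 10.0.0.6, still ends the call.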

What other useful examples are there?
Let's say you have a single T1/PRI and a SIP trunk at a location. The SIP trunk is used as an overflow for outbound calls if all channels on the T1/PRI are in use. Again, on its own it wouldn't matter whether you had multiple phone usages with independent routes to the PRI and to the SIP trunk: the response code from the gateway will be a "SIP/2.0 486 Busy Here" when no channels are available, and a 4xx response ends the call. If we map the 486 response code to a 503, OBR will retry the next route or phone usage.

New-CsSipResponseCodeTranslationRule -Identity "PstnGateway:10.0.0.6/Rule486" -ReceivedResponseCode 486 -TranslatedResponseCode 503

The exception to the above scenario would be if you were using a certified Lync Server gateway. A certified gateway will return a "SIP/2.0 503" instead of the "SIP/2.0 486".
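
The overflow scenario can be walked through with a small Python sketch. It is a hedged illustration only: the function names are hypothetical, it assumes 23 bearer channels on the T1 PRI, and it assumes the SIP trunk always accepts the overflow call.

```python
PRI_CHANNELS = 23  # assumption: bearer channels available on a T1 PRI


def pri_response(calls_in_use, certified):
    """SIP response from the PRI gateway for one new outbound call."""
    if calls_in_use < PRI_CHANNELS:
        return 200
    # All channels busy: a certified gateway answers 503, others 486.
    return 503 if certified else 486


def route_outbound(calls_in_use, certified=False, map_486_to_503=False):
    """Return where the call lands: "PRI", "SIP trunk", or "failed"."""
    code = pri_response(calls_in_use, certified)
    if code == 486 and map_486_to_503:
        code = 503                 # the translation rule shown above
    if code == 200:
        return "PRI"
    if 500 <= code < 600:
        return "SIP trunk"         # 5xx: OBR retries the next route or usage
    return "failed"                # untranslated 4xx: call fails
```

With every channel busy, `route_outbound(23)` fails, while `route_outbound(23, map_486_to_503=True)` and `route_outbound(23, certified=True)` both overflow to the SIP trunk.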

So there you have it, you can use Lync Server to build out recovery scenarios based on certain responses from the gateway.
TechNet reference: http://technet.microsoft.com/en-us/library/gg413041.aspx