Thursday, May 5, 2011

SIP Trunking with Lync Server 2010 and reliability issues (calls only last 51 mins)

I recently worked on a project for a customer who was interested in working with a local SIP trunk provider (ITSP) who is listed on the Microsoft OIP page for Lync Server 2010. We offered four connectivity methods to the client which consisted of the following:

  1. Full IP NAT connection to ITSP
  2. Public IP connection to ITSP
  3. Site-to-site IPSEC VPN connection to ITSP
  4. Layer 2 PVC using ISP link

The customer decided to go with option 1, which meant we provisioned a new public IP and created a NAT to a private IP bound to the Lync server. Appropriate firewall rules were set up to permit SIP and RTP/RTCP packets between the two, and for the most part everything worked quite well.

Incidentally, calls to and from the PSTN (via the ITSP) use G.711 as the codec, so if you have a reasonably reliable connection with low delay and jitter, you can expect good results.

This particular customer had an interesting issue: their calls (typically conference calls) were dropped after being active for 51 minutes and 30 seconds. After getting a series of funny looks when I asked whether anyone else had seen this issue, I decided to dig into the Lync Server 2010 trunk configuration settings to see if we could fine-tune something. My theory was that RTCP packets from the ITSP were being blocked. Unfortunately, I don't have any evidence of this to share here, but it makes sense. Looking at the default options for a Lync trunk connection, I focused on the following settings:

EnableSessionTimer ($true | $false)
RTCPActiveCalls ($true | $false)
RTCPCallsOnHold ($true | $false)

The default values for a trunk in Lync Server 2010 are:

EnableSessionTimer = False
RTCPActiveCalls = True
RTCPCallsOnHold = True

Session timers will apply to a connection even if the trunk setting is "False" (this can occur when the remote side uses them). RTCPActiveCalls controls whether RTCP packets are used to determine if a call is still 'alive'. If these packets cease, the call is terminated after 30 seconds. A call's liveness is checked this way because the SIP signaling and the media can traverse different paths, as with Media Bypass, and either can be interrupted by a brief network or device outage. RTCPCallsOnHold does the same for calls on hold, in a slightly different manner. Historically, a call placed on hold without music on hold (MOH) stops sending RTP packets, and the peer is then dropped (some of you may recall this being an issue on SNOM or Cisco sets).

If my theory were correct, with RTCP packets being blocked inbound or not sent at all, I would expect the call to last no more than about 30 seconds, not 51 minutes. I set "EnableSessionTimer" to True, but this alone didn't make a difference. I also had to set RTCPActiveCalls and RTCPCallsOnHold to False for the issue to go away. In the end, the configuration I went with looks like this:

EnableSessionTimer = True
RTCPActiveCalls = False
RTCPCallsOnHold = False
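
To make the 30-second rule concrete, here is a minimal Python sketch of the kind of liveness check the RTCP settings describe. It is only a model for illustration, not Lync's implementation: the class name, method names, and the use of plain timestamps in seconds are all assumptions.

```python
from dataclasses import dataclass

RTCP_TIMEOUT_SECONDS = 30.0  # timeout described in the post


@dataclass
class RtcpWatchdog:
    """Hypothetical model of an RTCP liveness check for a single call."""

    enabled: bool              # stands in for RTCPActiveCalls / RTCPCallsOnHold
    last_rtcp_at: float = 0.0  # timestamp (seconds) of the last RTCP packet seen

    def on_rtcp_packet(self, now: float) -> None:
        """Record that an RTCP packet arrived at time `now`."""
        self.last_rtcp_at = now

    def should_terminate(self, now: float) -> bool:
        """Return True if the call has gone too long without RTCP."""
        if not self.enabled:
            return False  # check disabled: never drop the call for missing RTCP
        return (now - self.last_rtcp_at) > RTCP_TIMEOUT_SECONDS


call = RtcpWatchdog(enabled=True)
call.on_rtcp_packet(now=100.0)
print(call.should_terminate(now=120.0))  # False: only 20 seconds of silence
print(call.should_terminate(now=131.0))  # True: more than 30 seconds of silence
```

With `enabled=False`, which corresponds to the final configuration above, the check never ends the call no matter how long RTCP has been silent.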

Wednesday, May 4, 2011

Calculating number of Mediation Servers and voice channels required for Lync Server 2010

I generally hate doing something I don't fully understand or haven't been taught, so I've taken some time to try and grasp the mind-bending, eye-crossing fascination that is capacity planning for voice systems.

It all goes back to our Microsoft Certified Masters training for Lync Server 2010 in March/April of this year. Some of the pre-study content prescribed to us touches on what an "Erlang" is and why it's important to understanding voice systems design. In addition to this, the MCM program has us learn about applying factors such as Busy Hour Traffic (BHT), Blocking Percentage, Busy Hour Factor, and Erlangs against real world capabilities of Lync Server 2010.

So let's start with the basics. What is an Erlang? Well, if you look up the Wikipedia definition it states:

"The erlang (symbol E[1]) is a dimensionless unit that is used in telephony as a statistical measure of offered load or carried load on service-providing elements such as telephone circuits or telephone switching equipment."

Basically, an Erlang represents one voice path, or one channel, or one line in constant use (sorry Adam). An Erlang is important because we eventually need to determine the number of concurrent channels required for sizing T1/E1 capacity, or even the number of Mediation servers we need.

The other important concept we need to understand is the Busy Hour Traffic (measured in Erlangs). BHT is the number of hours of call traffic during the busiest hour of the day. Said another way, BHT represents the average number of channels in use at the same time during the busiest hour.

In addition to understanding line usage, we need to grasp the idea of a blocking percentage. This means the likelihood of a call being denied (blocked) due to insufficient channels or lines (capacity). When planning for capacity you need to determine the acceptable blocking percentage for an organization. Some will permit only 1% which means 1 out of every 100 calls will be blocked due to insufficient line capacity. Other organizations are willing to accept 2.5% or more. 

The last concept we need to cover is the Busy Hour Factor, expressed as a percentage. The Busy Hour Factor is the share of the day's total call minutes that falls within the busiest hour of the day. The default is typically 17% for most businesses open during an 8 hour work window. We use the Busy Hour Factor to calculate the Erlangs from a given volume of minutes in a day.

Clear as mud? Let's look at the following scenario:

You are introduced to a customer who is looking to move to Lync Server 2010 and migrate from an existing PBX with 2 T1's. Rumors of an acquisition come true and the company plans to integrate more telephony capacity. You're given the phone statistics for both companies, which work out to a combined 37,000 minutes per day.

What is the Busy Hour Traffic (BHT, measured in Erlangs)?
What is the Busy Hour Factor (default is 17%)?
What is the Blocking Percentage?
How many T1's do you need?
How many Mediation servers do you need?

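We actually can't answer all of these questions unless we have an "Erlang B" calculator, which can be found here: http://www.erlang.com/calculator/erlb/. But first let's solve what we can. For those of you who wish to solve without assistance, the Erlang B formula is:

B = (A^N / N!) / (A^0/0! + A^1/1! + ... + A^N/N!)

where A is the offered traffic in Erlangs, N is the number of lines, and B is the blocking probability.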

To calculate Busy Hour Traffic, we multiply the total number of minutes (37,000) by the Busy Hour Factor of 17% to get the minutes offered in the busiest hour, then divide that by 60 to convert minutes to hours (Erlangs). The calculation looks like this:

37,000 * 0.17 / 60 = 104.8 BHT (Erlangs)
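
The same arithmetic as a small Python sketch, in case you want to try other call volumes or busy hour factors. The function and parameter names are my own, not from any Lync tool.

```python
def busy_hour_traffic(minutes_per_day: float, busy_hour_factor: float = 0.17) -> float:
    """Return Busy Hour Traffic in Erlangs for a daily volume of call minutes."""
    # Minutes of calling that fall within the busiest hour of the day.
    busy_hour_minutes = minutes_per_day * busy_hour_factor
    # One Erlang is one hour of traffic in one hour, so convert minutes to hours.
    return busy_hour_minutes / 60.0


bht = busy_hour_traffic(37_000)
print(f"{bht:.1f} Erlangs")  # 104.8 Erlangs
```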

Since the scenario didn't specify a blocking percentage, let's assume 1%. With this assumption and the calculated BHT value, we now have enough information to put into our "Erlang B" calculator to determine the number of lines or channels we need.

This produces 122 lines.
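
If you'd rather not rely on the online calculator, the lookup can be sketched in Python. This uses the standard recursive form of the Erlang B formula, which avoids the huge factorials of the textbook version. The function names are my own, and rounding may differ slightly from the calculator's result.

```python
def erlang_b(traffic_erlangs: float, lines: int) -> float:
    """Blocking probability for `traffic_erlangs` offered to `lines` channels."""
    blocking = 1.0  # with zero lines, every call is blocked
    for n in range(1, lines + 1):
        # Standard Erlang B recursion: B(n) = A*B(n-1) / (n + A*B(n-1))
        blocking = (traffic_erlangs * blocking) / (n + traffic_erlangs * blocking)
    return blocking


def lines_needed(traffic_erlangs: float, max_blocking: float) -> int:
    """Smallest number of lines whose blocking is at or below `max_blocking`."""
    if not 0.0 < max_blocking < 1.0:
        raise ValueError("max_blocking must be between 0 and 1 (exclusive)")
    lines = 0
    blocking = 1.0
    while blocking > max_blocking:
        lines += 1
        blocking = (traffic_erlangs * blocking) / (lines + traffic_erlangs * blocking)
    return lines


# Lines required for 104.8 Erlangs at a 1% blocking target.
print(lines_needed(104.8, 0.01))
```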

Knowing a T1 can handle 23 lines of voice traffic, we need 122 / 23 ≈ 5.3 T1's. Now you can't have .3 of a T1, so maybe the client is willing to accept a higher blocking percentage to squeeze the traffic into 5 T1's. You can use the "Erlang B" calculator to determine what the blocking percentage would be in this case.

5 T1's provide 115 channels, and with a BHT of 104.8 Erlangs this produces a blocking percentage of 2.7%.

Acceptable? Maybe...maybe not. It really depends on the customer.

Now there are other clever ways of squeezing out a few more channels. NFAS (Non-Facility Associated Signaling) is one way in which you can forgo the D-channel on each T1 if you've trunked several of them together. For example, 3 T1's would typically have 3 D-channels, whereas with NFAS you can get away with 1 D-channel shared between the group of three. This gives you two more B-channels for voice. Apply that to 5 T1's and you need only 1 D-channel instead of 5, which gives you four more B-channels and increases your capacity from 115 to 119. Using the "Erlang B" calculator again....

This produces a blocking percentage of 1.6%
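
The channel counts and the two blocking figures can be reproduced with a short Python sketch. It assumes a single D-channel for the whole NFAS group with no backup D-channel, as in the example above. The function names are my own, and the printed percentages may differ slightly from the online calculator because of rounding.

```python
CHANNELS_PER_T1 = 24  # a T1 carries 24 channels in total


def b_channels(t1_count: int, nfas: bool = False) -> int:
    """Voice (B) channels available on a group of T1 PRIs."""
    if t1_count < 1:
        raise ValueError("t1_count must be at least 1")
    # Without NFAS each T1 gives up one channel for signaling (D-channel);
    # with NFAS a single D-channel is assumed to serve the whole group.
    d_channels = 1 if nfas else t1_count
    return t1_count * CHANNELS_PER_T1 - d_channels


def erlang_b(traffic_erlangs: float, lines: int) -> float:
    """Blocking probability via the standard Erlang B recursion."""
    blocking = 1.0
    for n in range(1, lines + 1):
        blocking = (traffic_erlangs * blocking) / (n + traffic_erlangs * blocking)
    return blocking


for nfas in (False, True):
    channels = b_channels(5, nfas=nfas)
    blocking = erlang_b(104.8, channels)
    print(f"NFAS={nfas}: {channels} channels, blocking {blocking:.1%}")
```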

Not bad at all!

Okay, so bringing things back to reality, we have a recommendation for the customer: plan for 5 T1's using NFAS. The next question we need to answer is how many Mediation servers we need, so let's look at some capacity numbers:

A stand-alone Mediation server with quad 1Gb NIC with dual quad-core CPU's can support 800 - 1200 concurrent calls (not including media bypass).

A collocated Mediation server with Front-End server can support 226 concurrent calls.

Our sizing calls for at most 122 concurrent channels (119 with 5 T1's and NFAS) to carry 104.8 Erlangs of Busy Hour Traffic, so even a single Mediation server collocated on a Front-End server (226 concurrent calls) can handle all the traffic.

Anyway, I hope this helps some of you understand the importance and complexity of sizing voice channels and servers. Microsoft has done an amazing job of increasing concurrent call capacity with Lync Server 2010. Comments welcome.

Cheers.