Friday, November 23, 2012

Lync Server 2013 HA Design Changes and Considerations

Lync Server 2013 introduces new capabilities for recovering from a single server or pool failure and for failing over between pools of servers, whether Enterprise or Standard Edition.

This post discusses these capabilities, demonstrates their use, and offers suggestions for organizations wondering which path to choose.

Lync Server 2013...what's changed?
  1. Enterprise Edition pools are now recommended to have a minimum of three, yes THREE, front-end servers. This is due to the "Windows Fabric" replication architecture based on Azure. The back-end SQL database is no longer the store for real-time data.
  2. (subject to change) Enterprise Edition pools use a quorum model similar to Exchange Server 2010/2013, in that a Majority Node Set (MNS) quorum leverages a tie-breaker for pools with an even number of front-end servers. In the case of Lync Server 2013, this is the pool's back-end SQL server.
  3. Enterprise Edition pools no longer support SQL Server clustering for HA.
  4. SQL Server mirroring is now the supported method of providing back-end database resiliency.
  5. For automatic failover of a SQL mirror, a SQL witness is required; this can be SQL Express. Collocation of other services, software, etc. is subject to further testing.
  6. Lync Server 2013 uses a Web Application Companion (WAC) server (aka Office Web Apps) to stream PowerPoint meeting content including full transition support and embedded videos.
  7. Lync servers can be "paired" with like infrastructure (Enterprise to Enterprise and Standard to Standard) to ensure resiliency in the event of a site outage (DR). Pairing ensures replication of critical pool/server data; the failover itself must be invoked by an administrator via manual PowerShell commands.
  8. Multiple Federation routes can be applied to the topology. For example, a Boston Standard Edition server can use a Boston Lync Edge server as its Federation route whereas a Seattle Enterprise pool can use a regional Seattle Lync Edge server/pool for Federation.
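
The quorum behavior described in item 2 can be pictured with a minimal sketch. It assumes a plain majority vote in which the back-end SQL server casts a tie-breaking vote only when the pool has an even number of front-end servers, which is how this post describes it. The function name and parameters are hypothetical, and the real Windows Fabric rules may well differ (the item is marked subject to change).

```python
def pool_has_quorum(total_front_ends: int, front_ends_up: int,
                    sql_backend_up: bool) -> bool:
    """Illustrative majority vote with an optional tie-breaker.

    Assumption (from the post, not from product documentation): pools with
    an even number of front ends add the back-end SQL server as one extra voter.
    """
    if total_front_ends < 1:
        raise ValueError("a pool needs at least one front-end server")
    if not 0 <= front_ends_up <= total_front_ends:
        raise ValueError("front_ends_up must be between 0 and total_front_ends")

    voters = total_front_ends
    votes = front_ends_up
    if total_front_ends % 2 == 0:
        # Even-sized pool: the SQL back end acts as the tie-breaker.
        voters += 1
        votes += 1 if sql_backend_up else 0

    # Quorum requires a strict majority of all voters.
    return votes > voters // 2
```

Under this model a three-server pool keeps quorum with one front end down, while a four-server pool with two front ends down depends on whether the SQL back end is reachable.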
Now that Enterprise Edition pools can be paired with other EE pools, and Standard Edition servers with other Standard Edition servers, the way we design Lync solutions changes in certain cases. I talk to customers who often say they need "High Availability" (HA) in their Lync infrastructure, and this often comes from those who are implementing IM&P only. Instead of trying to meet an unrealistic expectation or designing to a requirement that centers on a term like HA, drive the conversation toward Recovery Time Objective (RTO) and Recovery Point Objective (RPO). These two factors, along with a Service Level Agreement (SLA) percentage (e.g. 99.9%), should drive the outcome. Anyway, here is the rule of thumb I personally use today:

If the organization says it needs HA, is it willing to accept a Recovery Time Objective of >1 hour? If so, and the per-server user count does not exceed ~5000, use Lync Standard Edition. Two Lync Standard Edition servers could even be used to split the load of 5000 users in a location, with 2500 homed on each server and the two servers paired (each acting as backup for the other). The build list would look something like this:

2 x Lync Server 2013 Standard Edition servers (paired with each other in the same site or stretched between a primary and DR site)
2 x Lync Server 2013 Edge servers
2 x Office Web App servers (WAC)

If the organization insists they cannot incur downtime for Lync components contained within a single site, and they insist "high availability" is a requirement, the infrastructure looks something like this:

3 x Lync Server 2013 Enterprise Edition servers
2 x Lync Server 2013 Edge servers
2 x Office Web App servers (WAC)
2 x SQL Standard or Enterprise Servers
1 x SQL Express, Standard, or Enterprise (witness)
2 x File servers using DFS
2 x Hardware Load Balancers (for the EE pool and WAC servers)
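
The rule of thumb above can be written down as a small sketch. The one-hour RTO threshold and the roughly 5000 users per server come from this post; the function names, parameters, and the returned labels are hypothetical, and the result is a conversation starter rather than a sizing tool. The SLA helper simply converts a percentage such as 99.9% into allowed downtime per year.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(sla_percent: float) -> float:
    """Convert an SLA percentage (e.g. 99.9) into allowed downtime per year."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100)


def suggest_edition(acceptable_rto_hours: float, users: int,
                    se_servers: int = 2, max_users_per_se: int = 5000) -> str:
    """Apply the post's rule of thumb (a deliberate simplification)."""
    if se_servers < 1:
        raise ValueError("se_servers must be at least 1")

    # An RTO of one hour or less rules out manually failed-over SE servers.
    if acceptable_rto_hours <= 1:
        return "Enterprise Edition"

    users_per_server = math.ceil(users / se_servers)
    if users_per_server <= max_users_per_se:
        return "Standard Edition (paired)"
    return "Standard Edition (add more paired servers)"


print(round(allowed_downtime_hours(99.9), 2))
print(suggest_edition(acceptable_rto_hours=4, users=5000))
print(suggest_edition(acceptable_rto_hours=0.25, users=5000))
```

A 99.9% SLA works out to roughly 8.76 hours of downtime per year, which helps frame whether an RTO of more than one hour is really a problem.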

But wait....I need DR!
If the organization also insists they have a plan for Disaster Recovery, a second "warm" site would house the following minimum infrastructure:

3 x Lync Server 2013 Enterprise Edition servers
2 x Lync Server 2013 Edge servers
2 x Office Web App servers (WAC)
2 x SQL Standard or Enterprise Servers
1 x SQL Express, Standard, or Enterprise (witness)
2 x File servers using DFS
2 x Hardware Load Balancers (for the EE pool and WAC servers)

That's 24 servers (not counting the hardware load balancers) to build a site-redundant Lync Server 2013 Enterprise environment. This may seem a bit ridiculous; however, the point I'm illustrating is the value Standard Edition now brings in Lync Server 2013. Additionally, I haven't yet found an organization that would dedicate server hardware or VMs in this manner. You can collocate many of the roles and scale back on things like WAC and SQL mirroring. Lastly, organizations might suggest their DR infrastructure only needs to accommodate lower user counts, which may drive a design lacking redundancy at the "warm" site.

Hold on....what about Persistent Chat?
Okay, so you want a Persistent Chat pool as well....we need to add two redundant servers at each site, raising the total to 28 servers.
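
For anyone checking the arithmetic, here is a short tally of the build lists above. The quantities come straight from the lists in this post; the dictionary keys and function name are hypothetical, and the sketch assumes the hardware load balancers are appliances that are not counted as servers, since that is the only way the totals of 24 and 28 add up.

```python
# Per-site Enterprise Edition build list, as given in the post.
EE_SITE_BUILD = {
    "Enterprise Edition front-end servers": 3,
    "Edge servers": 2,
    "Office Web Apps (WAC) servers": 2,
    "SQL back-end servers (mirrored)": 2,
    "SQL witness": 1,
    "File servers (DFS)": 2,
}

# Paired Standard Edition build list, as given in the post.
SE_BUILD = {
    "Standard Edition servers": 2,
    "Edge servers": 2,
    "Office Web Apps (WAC) servers": 2,
}

PERSISTENT_CHAT_PER_SITE = 2
SITES = 2  # primary site plus the "warm" DR site


def ee_total_servers(include_persistent_chat: bool = False) -> int:
    """Total servers across both sites, excluding hardware load balancers."""
    per_site = sum(EE_SITE_BUILD.values())
    if include_persistent_chat:
        per_site += PERSISTENT_CHAT_PER_SITE
    return per_site * SITES


print("EE, two sites:", ee_total_servers())
print("EE, two sites, with Persistent Chat:", ee_total_servers(True))
print("Paired SE build:", sum(SE_BUILD.values()))
```

The paired Standard Edition build list comes to six servers against 24 for the site-redundant Enterprise design, before Persistent Chat is added to either.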

As you can see, the case for paired Standard Edition servers quickly becomes favorable from a cost and complexity perspective, albeit at the expense of availability in the event of a single server outage. The fact that hardware load balancers can be completely eliminated also tells a great story around simplicity. To date I have yet to see a successful implementation of OCS or Lync where hardware load balancers are in the mix at all. This is mostly due to lack of knowledge, lack of understanding of how the solution works, or in some cases simple reluctance to work together.

What if I have more than 5000 users at a single site and need DR?
Consider placing multiple Standard Edition servers at the site, each paired with a similar server at your backup site. You can split homed users between servers (e.g. 3000 on ServerA in SiteA and 3000 on ServerB in SiteA) to meet your capacity requirements.
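
Splitting users across servers like this is easy to script. The following is a minimal sketch that spreads users round-robin across a list of Standard Edition servers and refuses to exceed a per-server ceiling; the ~5000 figure is from this post, while the function name, server names, and sample user addresses are hypothetical. It only computes the assignment and does not move anyone.

```python
from typing import Dict, List


def place_users(users: List[str], servers: List[str],
                max_per_server: int = 5000) -> Dict[str, List[str]]:
    """Assign users to servers round-robin without exceeding capacity."""
    if not servers:
        raise ValueError("at least one server is required")
    if len(users) > len(servers) * max_per_server:
        raise ValueError("not enough server capacity for this many users")

    placement: Dict[str, List[str]] = {server: [] for server in servers}
    for index, user in enumerate(users):
        # Round-robin keeps the servers within one user of each other.
        placement[servers[index % len(servers)]].append(user)
    return placement


# Hypothetical example: 6000 users split across two servers in SiteA.
sample_users = [f"user{n:04d}@example.com" for n in range(6000)]
result = place_users(sample_users, ["ServerA", "ServerB"])
for server, homed in result.items():
    print(server, len(homed))
```

With 6000 users and two servers, each server ends up with 3000 users, matching the example above.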

What are the drawbacks to Lync Standard Edition anyway?
Well, the first point people typically jump on is no "high availability". This is obviously due to the lack of a shared common data store to which multiple front ends connect. Here are some of the more important drawbacks to consider with this approach:
  1. Restoration of service is a manual effort resulting in users being left with "Resiliency Mode" until this action is taken.
  2. Your Edge proxy to 'next hop' internal server can be only one SE server even if you have several of them. An outage to this next hop server results in an outage for all remote users' traffic. It is important to note as well that if Edge cannot contact the next hop, clients will not attempt to sign into another Edge proxy even if another exists (without manual intervention at each client system).
  3. Response Groups and Call Park are a manual effort to switch over.
  4. Assignment of users to a collection of SE servers takes thought and planning so as not to overload a single server. In the case where you have two servers, decide if you're going to run them active/active or active/passive, as this will change your user placement behavior. User placement can also be scripted to automate it.
  5. You could argue this is more complex to manage; however, the same argument can be made for the HLB/SQL infrastructure required by Enterprise Edition.
  6. Your PSTN conference DID is homed to a single server. If this server is down, the DID is as well. I have not yet tested whether a pool failover restores this DID on the backup registrar (TBD).
  7. Exchange OWA/UCS integration has a single point of failure due to the lack of multiple server definitions in the Exchange 2010/2013 CAS setup.
Certainly you will have to weigh your own requirements against what is both supported and recommended. This article is intended to keep us thinking on our toes when designing Lync solutions for our customers. Enjoy!

11 comments:

  1. Great post Jason! I believe less is better in terms of simple designs with fewer servers. I really only see large enterprises doing full blown Lync Enterprise Edition deployments with 24+ servers!

  2. That's an interesting write up Jason. We have been having numerous discussions on approach to Lync 2013 designs and have come to similar conclusions to yourself.

    It's interesting that customers focus on HA with little real thought as to the cost and real business benefit of near-line HA, isn't it? Usually failover with near-line voice recovery suffices, and is a good position to design from commercially.

    Replies
    1. I agree completely. We often suffer from overly complex designs which are difficult to implement to begin with and nearly impossible to support for someone who doesn't have extensive multi-platform experience. I find it helps to keep my grey hair in check by keeping it simple.

  3. Great post Jason. I have a quick question regarding our implementation design. Currently we have 1 main site, 2 branch offices and our datacenter, all connected via an MPLS cloud. We have 700 users and are planning on moving all our voice calls to Lync. We want to deploy 2 Lync 2013 Standard Edition servers at our datacenter and an SBA at each site for voice resiliency. If it is not too much to ask, can you give me some feedback on this design? Thanks

    Replies
    1. You may consider placing a Standard Edition server at your main site and one at the datacenter. If you have redundancy in your connectivity back to either site you could rely on that as your sole method of providing voice to the branch. Also, something many people miss these days is the cost difference of a simple 1-port T1 gateway with an SBS vs. an SBA. I would try to steer clear (if it makes sense) of having all Standard Edition servers in your topology in one site.

  4. Thanks Jason... I was thinking about the SBS option, an R320 with a T1 card, since we already have IPFlex voice service at all our sites.

    Again, thanks for the feedback..

  5. I have been a fan of the SE server and have been planning our 2013 infrastructure at two sites around the SE server for exactly the reasons you outline. Thanks for the great write up!

  6. Hi, I have a few questions about this infrastructure setup:
    1) How about DNS requirements? Do I have to set up SRV records and all the other A records for the second Standard Edition server, or does the failover mean re-pointing the DNS records?

    2) Would the failover in this setup be manual, triggered by an admin? And I've read that the downtime for the end users is up to 30 minutes. Why 30 minutes and not instantly?

  7. Concise and well written. Thanks for sharing. Just wanted to add to it: you also have to add 4 Reverse Proxy servers for Enterprise and 2 for Standard implementations to the total.

  8. Jason - great write up. However, I think that you would serve your readers well by including some of the sacrifices associated with utilizing Standard over Enterprise. I would recommend to anyone considering a Lync 2013 deployment to evaluate what Enterprise offers over Standard and ensure that your organization doesn't require what might be lacking from Standard.

    That doesn't change the central point that Jason made though. Very well written and received.
