Friday, November 23, 2012

Lync Server 2013 HA Design Changes and Considerations

Lync Server 2013 introduces new capabilities for recovering from a single server or pool failure and for failing over between pools of servers, whether Enterprise or Standard Edition.

This post discusses these capabilities, demonstrates their use, and offers suggestions for organizations wondering which path to choose.

Lync Server 2013...what's changed?
  1. Enterprise Edition pools are now recommended to have a minimum of three, yes THREE, front-end servers. This is due to the new "Windows Fabric" replication architecture (based on the same fabric technology used by Windows Azure); the back-end SQL database is no longer the store for real-time data.
  2. (subject to change) Enterprise Edition pools use a quorum model similar to Exchange Server 2010/2013, in that a Majority Node Set (MNS) quorum leverages a tie-breaker for pools with an even number of front-end servers. In the case of Lync Server 2013 this tie-breaker is the pool's back-end SQL server.
  3. Enterprise Edition pools no longer support SQL Server clustering for HA.
  4. SQL Server mirroring is now the supported method of providing back-end database resiliency.
  5. For automatic failover of a SQL mirror, a SQL witness is required; this can be SQL Express. Collocation of other services, software, etc. is subject to further testing.
  6. Lync Server 2013 uses a Web Application Companion (WAC) server (aka Office Web Apps) to stream PowerPoint meeting content including full transition support and embedded videos.
  7. Lync servers can be "paired" with like infrastructure (Enterprise to Enterprise and Standard to Standard) to ensure resiliency in the event of a site outage (DR). This pairing ensures replication of critical pool/server data, and failover must be invoked by an administrator via manual PowerShell commands (see the sketch after this list).
  8. Multiple Federation routes can be applied to the topology. For example, a Boston Standard Edition server can use a Boston Lync Edge server as its Federation route whereas a Seattle Enterprise pool can use a regional Seattle Lync Edge server/pool for Federation.
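Item 7 above mentions that the pairing relationship and any failover are driven by manual PowerShell commands. As a rough sketch (the pool FQDN below is a placeholder), checking the pairing and the state of Backup Service replication from the Lync Server Management Shell looks something like this:

    # Placeholder FQDN - substitute one of your own paired pools/servers.
    $primaryPool = "pool01.contoso.com"

    # Show which pool/server is defined as the backup for this pool.
    Get-CsPoolBackupRelationship -PoolFqdn $primaryPool

    # Check whether the Backup Service has finished replicating user and
    # conference data to the paired pool.
    Get-CsBackupServiceStatus -PoolFqdn $primaryPool

    # Kick off an immediate synchronization if replication is lagging.
    Invoke-CsBackupServiceSync -PoolFqdn $primaryPool
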
Now that Enterprise Edition pools can be paired with other EE pools, and Standard Edition servers can be paired with other Standard Edition servers, the way we design Lync solutions changes in certain cases. Customers often tell me they need "High Availability" (HA) in their Lync infrastructure, frequently even when they are implementing IM&P only. Instead of trying to meet an unrealistic expectation or design to a requirement that centers on a term like HA, drive the conversation toward Recovery Time Objective (RTO) and Recovery Point Objective (RPO). These two factors, along with a Service Level Agreement (SLA) percentage (e.g. 99.9%), should drive the outcome. Anyway, here is the rule of thumb I use personally today:

If the organization suggests they need HA, are they willing to accept a Recovery Time Objective of >1 hour? If so, and the per-server user count does not exceed ~5,000, use Lync Standard Edition. Two Lync Standard Edition servers could even be used to split a 5,000-user load in one location, with 2,500 users homed on each server and the servers paired as backups for each other. The build list would look something like this:

2 x Lync Server 2013 Standard Edition servers (paired with each other in the same site or stretched between a primary and DR site)
2 x Lync Server 2013 Edge servers
2 x Office Web App servers (WAC)
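
To put the SLA percentage and RTO discussion above into perspective, here is a quick calculation (the 99.9% figure is just the example used earlier) that converts an availability target into an annual downtime budget:

    # Convert an SLA percentage into allowable downtime per year.
    $slaPercent = 99.9                                  # example figure only
    $hoursPerYear = 365.25 * 24
    $downtimeBudget = $hoursPerYear * (1 - ($slaPercent / 100))
    "{0:N2} hours of downtime allowed per year at {1}%" -f $downtimeBudget, $slaPercent

At 99.9% that works out to roughly 8.8 hours per year, which helps frame whether a greater-than-one-hour RTO is actually a problem for the business.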

If the organization cannot incur downtime for Lync components contained within a single site and insists "high availability" is a requirement, the infrastructure looks something like this:

3 x Lync Server 2013 Enterprise Edition servers
2 x Lync Server 2013 Edge servers
2 x Office Web App servers (WAC)
2 x SQL Standard or Enterprise Servers
1 x SQL Express, Standard, or Enterprise (witness)
2 x File servers using DFS
2 x Hardware Load Balancers (for the EE pool and WAC servers)
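
As a sketch of what the SQL mirroring piece looks like operationally (the pool FQDN is a placeholder, and the mirror itself is defined in Topology Builder before any of this applies), you can check mirror ownership and manually fail the databases over from the Lync Server Management Shell:

    # Placeholder FQDN of the Enterprise Edition pool.
    $pool = "eepool.contoso.com"

    # Report whether the principal or the mirror currently owns the user databases.
    Get-CsDatabaseMirrorState -PoolFqdn $pool -DatabaseType User

    # Manually fail the user databases over to the mirror (for example, for
    # maintenance on the principal). With a witness in place, failover after a
    # principal outage normally happens automatically.
    Invoke-CsDatabaseFailover -PoolFqdn $pool -DatabaseType User -NewPrincipal Mirror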

But wait....I need DR!
If the organization also requires a plan for Disaster Recovery, a second "warm" site would house, at minimum, the following infrastructure:

3 x Lync Server 2013 Enterprise Edition servers
2 x Lync Server 2013 Edge servers
2 x Office Web App servers (WAC)
2 x SQL Standard or Enterprise Servers
1 x SQL Express, Standard, or Enterprise (witness)
2 x File servers using DFS
2 x Hardware Load Balancers (for the EE pool and WAC servers)
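
Invoking the site failover itself is a manual administrative step. A minimal sketch, assuming a hypothetical pool name, of failing an Enterprise Edition pool over to its paired pool and later failing back:

    # Placeholder FQDN of the pool that has been lost.
    $failedPool = "eepool-primary.contoso.com"

    # Fail the users of the lost pool over to its paired (backup) pool.
    # -DisasterMode is used when the primary pool is actually down.
    Invoke-CsPoolFailOver -PoolFqdn $failedPool -DisasterMode -Verbose

    # Once the primary pool has been recovered, fail the users back.
    Invoke-CsPoolFailBack -PoolFqdn $failedPool -Verbose

Keep in mind that if the Central Management Store was homed on the failed pool, it has to be moved separately (Invoke-CsManagementServerFailover), which is worth testing as part of any DR runbook.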

That's 24 servers to build a site-redundant Lync Server 2013 Enterprise environment. This may seem a bit ridiculous; however, the point I'm illustrating is the value Standard Edition now brings in Lync Server 2013. Additionally, I haven't yet found an organization that would dedicate server hardware or VMs in this manner. You can collocate many of the roles and scale back on things like WAC and SQL mirroring. Lastly, organizations might decide their DR infrastructure only needs to accommodate lower user counts, which may drive a design lacking redundancy at the "warm" site.

Hold on....what about Persistent Chat?
Okay, so you want a Persistent Chat pool as well....we need to add two redundant servers at each site, raising the total to 28 servers.

As you can see, the case for paired Standard Edition servers quickly becomes favorable from a cost and complexity perspective, albeit sacrificing availability in the event of a single server outage. The fact that hardware load balancers can be completely eliminated also tells a great story around simplicity. To date I have yet to see a trouble-free implementation of OCS or Lync where hardware load balancers are in the mix. This is mostly due to a lack of knowledge, a lack of understanding of how the solution works, or in some cases a simple reluctance of the teams involved to work together.

What if I have more than 5000 users at a single site and need DR?
Consider placing multiple Standard Edition servers paired with similar servers at your backup sites. You can split the homed users between servers (e.g. 3,000 on ServerA in SiteA and 3,000 on ServerB in SiteA) to meet your capacity requirements.
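Distributing the users can be scripted. A minimal sketch, assuming hypothetical server FQDNs and OU names, that round-robins a site's users across two Standard Edition servers:

    # Hypothetical FQDNs of the two Standard Edition servers in SiteA.
    $servers = "se-a.contoso.com", "se-b.contoso.com"

    # Grab the site's users (the OU is a placeholder) and alternate them
    # between the two servers.
    $users = @(Get-CsUser -OU "OU=SiteA,DC=contoso,DC=com")
    for ($i = 0; $i -lt $users.Count; $i++) {
        Move-CsUser -Identity $users[$i].Identity -Target $servers[$i % 2] -Confirm:$false
    }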

What are the drawbacks to Lync Standard Edition anyway?
Well, the first point people typically jump on is the lack of "high availability". This is due to the absence of a shared common data store to which multiple front-end servers connect. Here are some of the more important drawbacks to consider with this approach:
  1. Restoration of service is a manual effort; users are left in "Resiliency Mode" until that action is taken.
  2. An Edge server's internal "next hop" can be only one SE server, even if you have several of them. An outage of this next-hop server results in an outage for all remote users' traffic. It is also important to note that if the Edge server cannot contact its next hop, clients will not attempt to sign in through another Edge proxy even if one exists (without manual intervention at each client system).
  3. Switching over Response Groups and Call Park is a manual effort (see the Response Group sketch after this list).
  4. Assigning users across a collection of SE servers takes thought and proper placement so as not to overload a single server. If you have two servers, decide whether you will run them active/active or active/passive, as this will change your user placement. Placement can also be scripted, as in the user-distribution sketch earlier.
  5. You could argue this is more complex to manage; however, the same argument can be made about the HLB/SQL infrastructure that Enterprise Edition requires.
  6. Your PSTN conferencing DID is homed on a single server; if that server is down, the DID is as well. I have not yet tested whether a pool failover restores this DID on the backup registrar (TBD).
  7. Exchange OWA/UCS integration has a single point of failure due to the lack of multiple server definitions in the Exchange 2010/2013 CAS setup.
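For the Response Group portion of item 3, the switchover is manual but scriptable: export the configuration from the primary server (ideally ahead of time) and import it on the backup. A rough sketch with placeholder server names:

    # Export the Response Group configuration from the primary server...
    Export-CsRgsConfiguration -Source "ApplicationServer:se-a.contoso.com" -FileName "C:\Backup\RgsConfig.zip"

    # ...and import it on the backup server so workflows, queues, and agent
    # groups are available there during the outage.
    Import-CsRgsConfiguration -Destination "ApplicationServer:se-b.contoso.com" -FileName "C:\Backup\RgsConfig.zip"
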
Certainly you will have to weigh your own requirements against what is both supported and recommended. This article is intended to keep us on our toes when designing Lync solutions for our customers. Enjoy!