Thursday, September 18, 2014

Understanding client sign in behavior before and after a failure

In this post I'll cover the changes in Lync 2013's sign in process and discuss the various scenarios which could result in an outage or client-facing impact to the user experience.

If you've studied the differences in the sign-in behavior of Lync 2010 versus Lync 2013 you'll know the new "lyncdiscover" process improves the discovery of your pool's Edge server. For those of you who have not studied this change, I'll explain it briefly here.

How lyncdiscover works:
Expanding on Lync Server 2010's Mobility feature which came as a Cumulative Update (CU), the lyncdiscover process involves the publishing of two DNS names pointing to either your reverse proxy (when Internet facing) or to your front-end server/HLB VIP (when LAN facing).

The Lync 2013 client queries the LAN-facing name first; if it resolves and the client can connect, it assumes your endpoint is on the inside of the network (LAN). If that host name is unresolvable, the client queries the Internet-facing name and connects. In either case, if we authenticate, an XML response is returned to the endpoint containing the user's home pool SIP proxy and internal pool name, among other things.

We then attempt a connection based on the response given in the XML document and cache the successful connection in a file called "EndpointConfiguration.cache" located in the user's profile.

The key takeaway is that the response is customized to your authenticated query and cached by the client. This is important when you have multiple Edge servers in your topology capable of proxying remote users to internal pools since you now have the ability to direct users to their regional Edge resources.

NOTE: This behavior is observed in the Lync 2010/2013 mobility clients, the Lync 2013 desktop client, and the Modern UI client for Windows 8.x. However only the desktop client will fall back to SRV lookup.
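
As a rough illustration of the discovery order described above, here is a minimal Python sketch. The two host names, the `fetch` and `srv_lookup` callables, and the `location` key are all hypothetical stand-ins of mine, not anything the Lync client actually exposes; the only thing taken from the behavior above is the order of the attempts and the fact that only the desktop client falls back to SRV.

```python
from typing import Callable, Optional

# Hypothetical placeholder names; substitute the two discovery names
# published for your own SIP domain.
LAN_FACING_NAME = "discover-lan.example.com"
INTERNET_FACING_NAME = "discover-internet.example.com"


def discover(fetch: Callable[[str], Optional[dict]],
             srv_lookup: Callable[[], Optional[dict]],
             is_desktop_client: bool = True) -> Optional[dict]:
    """Walk the discovery order and return the first usable answer.

    `fetch(name)` stands in for "resolve, connect, authenticate, read the
    XML response" and returns None when any of those steps fails.
    """
    # 1. LAN-facing name first; success implies the endpoint is internal.
    answer = fetch(LAN_FACING_NAME)
    if answer is not None:
        return {**answer, "location": "internal"}

    # 2. Otherwise try the Internet-facing name.
    answer = fetch(INTERNET_FACING_NAME)
    if answer is not None:
        return {**answer, "location": "external"}

    # 3. Only the desktop client falls back to a DNS SRV lookup.
    if is_desktop_client:
        answer = srv_lookup()
        if answer is not None:
            return {**answer, "location": "unknown"}
    return None
```

Passing the network steps in as callables keeps the sketch testable without a real Lync deployment; a mobility client would call it with `is_desktop_client=False` and simply get `None` when both discovery names fail.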

How does this compare with Lync 2010?
When we compare this capability to Lync 2010, the client would always query a DNS SRV record first, which returns a static response. You could publish multiple SRV records with varying weights; however, this would always return a static ordered list. The result is that all users in the organization sign into the same Edge proxy unless an outage occurs.
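
To make the "static ordered list" point concrete, here is a small Python sketch. The record values and the `example.com` targets are made up for illustration, and the ordering is a deliberate simplification: a full SRV implementation may pick among equal-priority records by weighted random choice, whereas this sketch just sorts them.

```python
from typing import Callable, List, NamedTuple, Optional


class SrvRecord(NamedTuple):
    priority: int  # lower value is tried first
    weight: int    # tie-breaker among equal priorities
    target: str    # hypothetical Edge FQDN


def static_order(records: List[SrvRecord]) -> List[SrvRecord]:
    # Simplification: equal-priority records are sorted by descending
    # weight rather than chosen by weighted random selection.
    return sorted(records, key=lambda r: (r.priority, -r.weight))


def pick_edge(records: List[SrvRecord],
              is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable target in the static order."""
    for record in static_order(records):
        if is_up(record.target):
            return record.target
    return None


# Made-up records: New York preferred, London as the backup.
records = [
    SrvRecord(20, 0, "edge-london.example.com"),
    SrvRecord(10, 0, "edge-newyork.example.com"),
]
```

Every client that runs `pick_edge(records, ...)` against the same records lands on the same Edge while it is up, which is exactly why the SRV approach cannot steer users to a regional Edge.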

The following TechNet article explains the behavior well:

Failure scenarios and recovery options:
So you've done your reading about the sign in process and differences between Lync 2010 and 2013 and by now you're wondering what happens if the various components fail.

Failure | Behavior | Recovery Options
Query to lyncdiscover fails | Fallback to SRV lookup | Use GeoDNS solution to provide a response to the query
SRV lookup fails | Query another SRV record in DNS | Plan for adding a DNS SRV record for each Edge proxy, adjusting weights to give priority
Next hop from Edge proxy fails | Client sign in will fail | No automatic recovery is possible

Example scenario #1:
Let's say you have a front-end pool and an Edge pool in each of New York and London. All systems are up except for the Edge in New York. Alice is a user homed on the New York pool and is connected externally with the Lync 2010 client. She connects for the first time, queries the DNS SRV record, and receives a prioritized response listing New York's Edge proxy first, followed by London's Edge proxy. Alice resolves the FQDN provided in the first SRV record and attempts a connection to New York's Edge; however, this proxy is down. The Lync 2010 client then uses the second DNS SRV record and attempts a connection to London's Edge proxy. Alice's connection is proxied to the next hop server, London's front-end pool, which in turn proxies her connection to New York's front-end pool. Lastly, the Lync client caches the London Edge proxy and the New York internal registrar in her EndpointConfiguration.cache file.

Now in this scenario Alice has signed in and she will appear fully functional; however, she won't be able to establish media (voice/video/desktop or app sharing) with a user who doesn't have a public IP or an IP on the same LAN. There are a couple of reasons for this, one of which has to do with the New York pool's association with an Edge pool for media. The media relationship is established in the Lync topology file when the administrator created the pool and made a static association. Lync users authenticating to the pool receive a Media Relay Access Server (MRAS) token, which is used in the STUN/ICE/TURN candidate exchange/negotiation and makes it possible to establish media sessions across NAT devices. If the Lync Edge pool in New York is down, the front-end pool cannot obtain an MRAS token for the user, so there will be no relay candidate for media establishment. When the Lync client negotiates media with another endpoint/user, a check is first performed using the host IP (LAN/WiFi) to see if the two users are on the same network or are at least both directly reachable (both using public IPs). In the case where both IPs are routable, media negotiation should succeed.
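
The reasoning above boils down to a simple decision, sketched here in Python. This is a simplified model of the outcome, not of the actual ICE candidate exchange: the function names are mine, "same LAN" is approximated by an assumed /24 prefix, and "public" is approximated with the standard library's `is_global` check.

```python
import ipaddress


def same_lan(ip_a: str, ip_b: str, prefix: int = 24) -> bool:
    # Assumption: two hosts share a LAN when they fall in the same
    # /24; real networks may use a different prefix length.
    network = ipaddress.ip_network(f"{ip_a}/{prefix}", strict=False)
    return ipaddress.ip_address(ip_b) in network


def media_possible(local_ip: str, remote_ip: str,
                   relay_available: bool) -> bool:
    """Rough model of whether a media session can be established."""
    # With an MRAS token there is a relay candidate to fall back on.
    if relay_available:
        return True
    # Without a relay, the hosts must reach each other directly:
    # either on the same network or both on public addresses.
    if same_lan(local_ip, remote_ip):
        return True
    local = ipaddress.ip_address(local_ip)
    remote = ipaddress.ip_address(remote_ip)
    return local.is_global and remote.is_global
```

In Alice's case `relay_available` is `False` because the New York Edge is down, so a call to a user behind a different NAT (say `192.168.1.10` to `10.0.0.5`) comes back `False` even though she appears signed in and healthy.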

Example scenario #2:
Given the above example topology of New York and London sites, assume all infrastructure is operational with the exception of the London pool, which incurred an outage overnight. Bob is a user homed in the London pool and is connecting externally using the Lync 2010 client. This isn't Bob's first time connecting, so we look at the EndpointConfiguration.cache file for the internal pool name/IP and attempt a connection, since we don't know if Bob is internal or external. This will fail since Bob is Internet-facing in this scenario. The Lync client will use the same cache file and attempt a connection to the London Edge proxy/IP, which will succeed; however, since the Edge server cannot communicate with its next hop (London pool), his sign in fails.

It's important to understand there is no automatic recovery option available here. Since the connection to the Lync Edge in London was successful, but the next hop pool is down, Bob's Lync client will not fall back to a DNS SRV lookup. The Lync 2013 client behavior is the same given this scenario regardless of an administrator initiating a pool failover.
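
Bob's scenario can be summarized as a small decision flow. The Python below is only a sketch of that flow: the cache keys, the callables, and the outcome strings are hypothetical, and the final "fresh lookup" branch is my reading of what the scenario implies when the cached Edge cannot be reached at all.

```python
from typing import Callable, Dict


def cached_sign_in(cache: Dict[str, str],
                   can_connect: Callable[[str], bool],
                   next_hop_up: Callable[[str], bool]) -> str:
    """Model a sign-in attempt driven by cached endpoint data."""
    # 1. Try the cached internal pool; the client does not yet know
    #    whether it is inside or outside the network.
    internal = cache.get("internal_pool")
    if internal and can_connect(internal):
        return "signed_in_internal"

    # 2. Try the cached Edge proxy.
    edge = cache.get("edge_proxy")
    if edge and can_connect(edge):
        if next_hop_up(edge):
            return "signed_in_via_edge"
        # The Edge answered, so the client does not fall back to DNS,
        # even though the pool behind the Edge is down.
        return "failed_no_fallback"

    # 3. Assumption: nothing cached was reachable, so the client
    #    starts a fresh discovery/DNS lookup.
    return "fresh_lookup"
```

The middle branch is the trap: a healthy Edge in front of a dead pool returns `failed_no_fallback`, which is why taking the Edge VIP offline can be a valid way to push clients toward another Edge.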

So you're probably wondering, if Bob was the CTO and you absolutely needed him logged in, how could you make this happen? Well, you could obviously walk him through manually configuring Lync to connect to the New York Edge server by specifying the Edge FQDN; however, the next hop front-end pool wouldn't be able to register the user until a pool failover was invoked. This puts pressure on the Lync administrator to invoke the pool failover command. Ultimately some intervention is required, and this may include shutting down the VIP for the Edge pool or the servers themselves. Alternatively, you could change the next hop to an available server and publish the topology. Just remember your firewall rules need to permit inbound TCP/5061 to the next hop.

Sticky Edge?
As I dug into this further for a customer of mine recently I thought about the scenario where a Lync remote user has signed into an Edge server which wasn't their "home Edge". In Alice's case I thought the London Edge proxy would be cached in the EndpointConfiguration.cache file making it nearly impossible for her to ever connect back to her home Edge proxy even if the service was restored.

Consider Alice's scenario again, but this time at scale: a large population of 150,000 users signing into an Edge pool in London because a failure in New York caused their DNS SRV lookup to sign them into the London Edge pool. If the Lync client stored/cached their Edge proxy sign in, when would they ever sign back into the New York proxy? In a worst case scenario you could have the majority of 150,000 users signing into London's Edge pool along with the London population, without any realistic way of getting them back to New York!

The Good News!
Well the good news is that we "expire" Edge proxy information in the EndpointConfiguration.cache file if it has been more than 24 hours since the last access time of the file. The same is true for internal registrar pools if the time has been greater than 14 days. This means if Alice, being a New York user, finds the London Edge through DNS SRV query and signs in successfully, she will stay signed into the London Edge for 24 hours no matter how many times she signs in or out regardless of the New York Edge server's state. Once this time has expired, we will query Lyncdiscover again and find the New York Edge and sign Alice into that proxy if available.
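
The two expiry windows can be expressed in a few lines of Python. The 24-hour and 14-day values come from the behavior described above; the function name, the dictionary keys, and the idea of passing the file's last access time in explicitly are just choices for this sketch.

```python
from datetime import datetime, timedelta
from typing import Dict

EDGE_PROXY_TTL = timedelta(hours=24)
INTERNAL_REGISTRAR_TTL = timedelta(days=14)


def usable_cache_entries(last_access: datetime,
                         now: datetime) -> Dict[str, bool]:
    """Report which cached entries are still honored.

    `last_access` stands in for the last access time of the
    EndpointConfiguration.cache file.
    """
    age = now - last_access
    return {
        # Expired once MORE than the window has elapsed.
        "edge_proxy": age <= EDGE_PROXY_TTL,
        "internal_registrar": age <= INTERNAL_REGISTRAR_TTL,
    }
```

For example, with a last access at 09:00 on Thursday, a sign in at 10:00 on Friday (25 hours later) would no longer trust the cached Edge proxy but would still trust the cached internal registrar.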

Some Guidance...
If you require full redundancy of remote user sign in between two geographic locations with automatic failover and sign in, you won't ever achieve this. The biggest hurdle is the next hop from Edge to next proxy or registrar and even from TMG (or similar reverse proxy) to its next hop. The best you can do is make the DNS names available through GeoDNS/GSLB and ensure the Lync infrastructure is redundant (Enterprise Pools with redundant servers/NICs/etc.).
  • Use sub-domains for your web farm FQDNs. Using something like "lyncweb" as a subdomain of your SIP domain allows you to delegate the subdomain to GeoDNS.
  • Publish DNS SRV records for all possible Edge proxies for external users.
  • Publish DNS SRV records for all possible internal registrars for internal users.
  • Test everything!
Hopefully this sheds some light on the external access behavior of the various failure scenarios. Cheers!


  1. This article is about mobility clients, or the lyncdiscover*.* thing applies to the fat client as well? My mind may be rusty, but I believe this is mobileclient-only thing. Would be great to add this into a 1-line at the beginning of the article, not to confuse less experienced readers into thinking this story applies to the normal client as well.