Thursday, February 5, 2015

Automating Azure Cloud Service VM Start and Stop

Through my travels I've come to learn of some great automation features in Microsoft Azure lately which have helped me trim some of the compute cost associated with Virtual Machines running during off-business hours.

As an avid consumer of my MSDN monthly Azure credits, I found I was paying for compute during hours when I wasn't working or using the environment. For a long time I didn't bother learning how to use the "Automation" feature in Azure because of the learning curve. I've since taken some time to figure it out and offer my findings in this post.

First, let's talk about the various components which make up the Automation feature in Azure:

Azure Automation Components

Automation Account: The Automation Account is a placeholder for your other objects which include Runbooks, Assets, Scale configuration, etc.

Assets: An Asset can be a "Connection", a "Credential", a "Variable", or a "Schedule" object which can be linked or referenced in your Runbooks.

Runbooks: The Runbook is essentially the script containing the commands you want to issue to your environment. A Runbook can contain references to Assets such as a reusable "Credential".

Creating your Automation Credential

The newer way of connecting to your Azure tenant from the Automation environment is to use an account in your Azure AD environment, adding it as an Asset (Credential), then referencing it in your connectivity to your environment.
  1. Click on the Active Directory navigation link on the left side of your Azure portal
  2. Click on the Azure AD environment associated with your Azure tenant and click the Users tab
  3. Click the Add User button on the taskbar and add an account called "Azure Automation" with a UPN of "" 
  4. NOTE: Once you receive the password for the account you must log in at least once to change the password otherwise your Runbook will fail. You can do this by opening a private IE window and logging in to change the password. Remember this password as we will use it to create an Asset shortly.
  5. Click on the Settings link in the left navigation window then click on Administrators
  6. Click the Add button on the toolbar and add your Azure Automation AD account as a Co-Administrator to your subscription
  7. Click on the Automation navigation section and click on your Automation Account
  8. Click the Assets tab and click Add Setting on the toolbar
  9. Choose Add Credential, choose Windows PowerShell Credential, type a name (e.g. AzureAutomationAccount), and click the Next button
  10. Type the User Name and Password, making sure to specify the full UPN, and click the Checkmark to complete

Creating your first Runbook...

  1. Click on the Automation link in your Azure tenant, then click the Create button on the toolbar at the bottom of the screen
  2. Give the account a name and set the Region
  3. Click on the account to open it, select the Runbooks tab, then click the New button on the toolbar
  4. Click Quick Create, give the Runbook a name (i.e. StartTestCloudService), give it a description, make sure your automation account is selected, and click on the create link
  5. Click the Runbook which will take you to the Author tab right away

Anatomy of your Runbook script

We need to make a connection to the Azure environment which is done by issuing the following command:

$cred = Get-AutomationPSCredential -Name AzureAutomationAccount

We store the credentials in a variable called $cred which we pass to the cmdlet below. This makes it possible for an Administrator to establish a co-administrator ID without giving the password out to another person who might be using the account for scripting purposes.

Add-AzureAccount -Credential $cred

This command will connect to your Azure tenant using the credentials stored in the Asset you created earlier.

Select-AzureSubscription -SubscriptionName "Visual Studio Ultimate with MSDN"

NOTE: Depending on your environment, you may need to change the subscription name to your actual subscription name. So how do you find this information? Well, I suggest downloading the Azure PowerShell tools. Start the Azure PowerShell tool and log into your account using the Add-AzureAccount cmdlet above. After authenticating, type Get-AzureSubscription and note the value of the SubscriptionName property.

You're now ready to start issuing commands to your VMs. To start VMs in a particular cloud service use:

Start-AzureVM -Name <VMName> -ServiceName <CloudServiceName>

To stop all VMs in a cloud service, use:

Stop-AzureVM -Name "*" -ServiceName <CloudServiceName> -Force

The "-Force" parameter suppresses the confirmation prompt Azure raises when stopping the last running VM would de-allocate your Cloud Service resources, including the shared Virtual IP. A runbook can't answer that prompt, so without this switch it will fail to shut down all the VMs.

You can save and test your Runbook using the buttons on the bottom toolbar when in "editing" mode to make sure everything is working. When complete, click the Publish button which will allow you now to schedule your Runbook.
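
Putting the commands above together, a complete "stop" runbook might look something like the sketch below. Runbooks authored this way are PowerShell workflows, and the workflow name needs to match the Runbook name. The Runbook name "StopTestCloudService" and the cloud service name "MyTestCloudService" are placeholders I made up for illustration; swap in your own, along with your subscription name.

```powershell
# Sketch only: 'StopTestCloudService' and 'MyTestCloudService' are hypothetical names.
workflow StopTestCloudService
{
    # Pull the Credential asset created earlier in this post
    $cred = Get-AutomationPSCredential -Name 'AzureAutomationAccount'

    # Connect to Azure and pick the subscription to work against
    Add-AzureAccount -Credential $cred
    Select-AzureSubscription -SubscriptionName 'Visual Studio Ultimate with MSDN'

    # Stop every VM in the cloud service; -Force avoids the confirmation prompt
    Stop-AzureVM -Name '*' -ServiceName 'MyTestCloudService' -Force
}
```

A matching "start" runbook would follow the same shape with Start-AzureVM in place of Stop-AzureVM.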

Scheduling my Runbook

Now that we've created a Runbook, we want to schedule it to run every day at 18:00:
  1. Click on the Automation section of your Azure portal and open your Automation Account
  2. Click on your Runbook and click the Schedule tab
  3. Click on the Link button and choose Link to a New Schedule
  4. Specify a name (i.e. 18:00 every day) and click Next
  5. Select Daily and type 18:00 for the time leaving 1 as the recur every value and click the Checkmark
You can go back to your Automation Account then into Assets to view your "Schedule" asset which was created when you clicked the Link button in the above steps.

NOTE: If you need to change the start date, time, or frequency of your Schedule asset, it doesn't appear to be an editable object through the UI or Azure PowerShell (e.g. Set-AzureAutomationSchedule), so you likely have to create a new one, link it, and un-link the old one.

That's all for today folks. Enjoy!

Thursday, January 8, 2015

RESOLVED Password write back issue (Event ID 6329 and 32009)

I recently set up a password write-back configuration for a customer giving them the ability to enable self-service resets of their Office 365 users. In testing the password change I could get all the way through the steps up to the point where the password needed to be modified then I would get an error in the web page and Event ID 6329 and 32009 would show up on the DirSync server.

In doing some research I found the MSOL_* account in AD used by DirSync w/password sync needs to have more rights than the default "Domain Users" group gives it. You can either delegate the password change capability or add the account to the "Domain Admins" group in AD. Immediately after doing this I was able to change the password.

Some additional information for reference:
  1. Excellent article on how to enable:
  2. Password write-back configuration steps:
  3. If you have enabled the Azure AD Premium option requiring users to register before being able to change their password, they must do so before they'll be recognized at the sign-in page. For example, if I click on the "Can't access your account" link I'll be taken to a page asking for my username and character verification. You will not get past this step if you or your users haven't registered for the service.
  4. Additionally, you must grant your Office 365 admin account an Azure AD Premium license in order to enable the password reset feature(s).
  5. Lastly, it is advised to enter a telephone number on the "General" tab of the on-prem AD user so that the password reset contact method has at least one verification option.

Thursday, October 23, 2014

RESOLVED: Missing NETLOGON and SYSVOL shares on Azure domain controller

So I recently built a VM on Azure in Southeast Asia to do some testing for a customer. I had a few issues connecting the server to AD but eventually figured out Windows Firewall and a bad DNS configuration were the culprits. Soon after joining and promoting the box to a DC, I started my Lync Server 2013 installation.

This was when things went sideways...

I noticed the Lync Server tossing errors about AD being "partially" prepared which was odd since I had a well built out environment in North America. I thought it might have something to do with latency between Singapore and North America but soon realized that wasn't the issue. In an attempt to install the local configuration store it wouldn't get past the first page and produced an error saying something about not being able to find the configuration store. However, running a Get-CsManagementConnection did return the SCP for SQL.

I moved on to basic troubleshooting of the DC by running a "dcdiag" which revealed errors about the server being unsuitable and another error relating to a missing NETLOGON share. Ah hah!! I also noticed SYSVOL was missing. Coupled with those two were countless errors about DFS replication in Event Viewer. So after some digging I found the following worked for me:

Follow the instructions in this KB article:

In doing so, the SYSVOL share magically appeared!

Once the share was created I restarted the NETLOGON service only to notice the following in Event Viewer:

I manually created the "scripts" and "policies" folders in File Explorer and restarted the NETLOGON service again.

This worked and my Lync installation continued without issue!

FIXED: Lync Server 2013 pre-requisites fail to install on Windows Server 2012 R2 in Azure

So I've noticed recently that builds of Windows Server 2012 R2 which are pre-patched in Azure IaaS fail on the pre-requisite installation for Lync Server 2013. The same thing happened on a SQL Server 2012 installation, and both failures come down to the same issue.

During the feature installation process we attempt to install .NET Framework 3.5, however the installation fails with an error saying the 'source' location wasn't specified or the files can't be found.

Warning seen from the wizard when installing the feature:

How to resolve: 

Install the following patch: 

...and retry your feature installation again.

Install Lync Server 2013 pre-reqs:

Install-WindowsFeature RSAT-ADDS, Web-Server, Web-Static-Content, Web-Default-Doc, Web-Http-Errors, Web-Asp-Net, Web-Net-Ext, Web-ISAPI-Ext, Web-ISAPI-Filter, Web-Http-Logging, Web-Log-Libraries, Web-Request-Monitor, Web-Http-Tracing, Web-Basic-Auth, Web-Windows-Auth, Web-Client-Auth, Web-Filtering, Web-Stat-Compression, Web-Dyn-Compression, NET-WCF-HTTP-Activation45, Web-Asp-Net45, Web-Mgmt-Tools, Web-Scripting-Tools, Web-Mgmt-Compat, Windows-Identity-Foundation, Desktop-Experience, Telnet-Client, BITS

Tuesday, October 7, 2014

Unwinding a cocked-up Azure deployment

First off, for those of you who believe my title is vulgar, read this. For everyone else, read on...

I started deploying Virtual Machines (VMs) to Azure only recently and found myself giddy as a school boy anticipating what I could build, however this quickly descended into a befuddling exercise leaving one heck of a mess in my Azure portal. As with any software offering with many knobs and buttons, and very little in the way of a traditional 'wizard' to walk new admins through the right way forward, you can get yourself in trouble quickly.

Working with customers in this area I've seen many different ways to deploy Cloud Services, VNets, VMs, and Storage Accounts. Because little guidance exists on how these various components impact each other, people generally forge ahead with little planning and eventually make mistakes.

This post will help demystify the core components of Azure IaaS and give you guidance on how to fix a cocked-up deployment. After all, you need to know what you're dealing with before you attempt to fix it. We'll begin by identifying and describing the building blocks of an Azure IaaS deployment:

1. Naming Conventions

This topic seems somewhat self explanatory, however it's worth mentioning as it will make working in PowerShell easier and make your topology easier for other Azure Admins to adopt. Consider using a format such as this:

[object name]-[location]-[data]

For example, if I'm creating a Virtual Network in South Central US with a network starting at 10.2.x.x then I might use:


Key Takeaway: Certainly don't take my suggestion as the 'right' way to do this either. I'm still working out and perfecting my own ideas on how descriptive these objects should be. You'll probably start with something similar, change it, then change it many more times before you get it right.
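
If you want to keep names consistent across scripts, a tiny helper can build them for you. The sketch below is just one way to apply the [object name]-[location]-[data] format; the function name New-AzureObjectName and the sample values are my own inventions for illustration, not anything Azure provides.

```powershell
# Hypothetical helper: builds a name in the [object name]-[location]-[data] format.
function New-AzureObjectName {
    param(
        [Parameter(Mandatory)][string]$ObjectName,
        [Parameter(Mandatory)][string]$Location,
        [Parameter(Mandatory)][string]$Data
    )

    # Lower-case each part and collapse anything that isn't a letter or digit into a hyphen
    $parts = $ObjectName, $Location, $Data | ForEach-Object {
        ($_.ToLowerInvariant() -replace '[^a-z0-9]+', '-').Trim('-')
    }

    # Join the three cleaned parts with hyphens
    $parts -join '-'
}

# Example values are made up for illustration
New-AzureObjectName -ObjectName 'vnet' -Location 'South Central US' -Data '10.2'
```

Keep in mind that storage account names only allow lower-case letters and numbers, so you would need to strip the hyphens for those.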

2. Virtual Networks

From my perspective vNETs are one of the most important components to get right so I'll start with them first. Like most people I just wanted to get started building VMs in Azure so I didn't pay much attention to this aspect. I can't say it enough now though; planning your vNET topology is foundational to a successful implementation!

Virtual Networks in Azure define the access boundary and the geographic location of your virtual machines. Start with planning your network address space you'll use in Azure including any segmentation via subnets.


Key takeaway: Create your vNETs first, define your address range, then carve out your subnets. Any VM on any subnet in the same vNET will be able to talk (route) without any intervention from you.

If you need VMs in different geographic regions, create those vNETs in those regions first. You can perform vNET to vNET routing through the use of Dynamic Routing Gateways.

3. Storage Accounts

Since your VMs will need a place to reside, you should create the location in advance even though it can be created at the time you build your VM. Refer to my first point above around naming conventions for a reason why. You likely want to have control over what these objects are called vs. the Azure service giving a cryptic name like "p12asgd798asfg98weiop" as a storage account. You also get the opportunity to be in control of where that account is created geographically.

Once your Storage Account is created, add a container for your VHD files (i.e. vhds) to hold your VM disks.

Key takeaway: If you're building VMs in East and West US datacenters, create the associated storage accounts in those datacenters as well.

4. Disks

In Azure IaaS Disks are somewhat self explanatory as they are built on the basis of a VHD file and can be attached to a VM during creation time. If you delete a VM and want to recreate it, you'll need to define a disk based on the VM before doing so....more on this later.

5. Cloud Services

This mysterious object has been at the root of many bunged up deployments and remains for many people a source of frustration in terms of understanding its importance and relevance to the overall architecture. The Cloud Service is basically a container for similar VMs which a business would define as a "service". Let's take Lync Server 2013 for example...

In a Lync environment you have Front-End, Back-End (SQL), File shares, Edge servers, Office Web App servers, and Mediation servers. An Azure admin running a Lync pool would conceivably put all these VMs into a single Cloud Service. You might decide to have a Cloud Service for "core" bits such as Active Directory servers, and yet another for client VMs or development boxes. The point is, each Cloud Service should encase all that makes up that defined service. Why? Well you can start and stop a cloud service to bring an environment online/offline depending on need easier than any other method.

Be aware though that you can only perform one VM operation at a time in each Cloud Service, such as a start/stop/update. For this reason I've seen some people create a Cloud Service for each VM, which can have long term consequences in terms of the number of CS objects and overall management overhead.

6. Virtual Machines

Now we arrive at the last item; the Virtual Machine itself. Up to this point we've defined everything we need to establish the basis for a solid Azure environment in which the VM will reside. If you've planned the environment to your liking, read no further.

The Virtual Machine will consume disk and network (vNET) resources in the geographic region you've created them in. Remember, the vNET in "West US" should have a Storage Account in "West US", making the VM reside in "West US" after all.

Now on to what we came here for.... to fix your Azure environment

I'll give you an example of what I did to resolve my many issues. Some of it you may want to implement and some you may simply toss away. My problem stemmed from a lack of conceptual understanding of the geographic placement of vNETs and Cloud Services. I wanted some VMs to run in the West coast datacenter and some in the East. My storage accounts were all in the West and some of my networking was East and some was West. A mess really....

To fix it I followed these "simple" steps:
  1. Define my naming convention for networks, VMs, storage accounts, etc.
  2. Build vNET and define address range including all subnets making sure no overlap existed with my existing environment.
  3. Build my storage accounts and set up my "vhds" container.
  4. Create my Cloud Service objects.
  5. Stop my VMs.
  6. Copy the VHD files to the new storage accounts (renaming them along the way). For more information on how to do this check out:
  7. Delete the existing VMs making sure to KEEP the attached disk objects.
  8. Create new disk objects based on the new home location of my VHD files from step 6 above.
  9. Recreate my deleted VM by choosing the "New VM", "From Gallery", then selecting "My Disks" to locate the objects from step 8 above.
  10. Be sure to deploy the VM into an existing Cloud Service, choose the correct vNET, and Storage Account.
  11. Repeat this for all your VMs!
While this seems like a daunting task, consider this a lesson learned. I know I did!!
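
For those who would rather script steps 5 and 7 through 10 than click through the portal, here is a rough sketch using the Azure PowerShell module. Treat it as a starting point under assumptions rather than a tested procedure: every name, the instance size, and the VHD URL below are placeholders I invented, and it assumes the VHD has already been copied to its new home (step 6) and that the target Cloud Service and vNET already exist.

```powershell
# All values below are hypothetical placeholders; replace them with your own.
$oldServiceName = 'old-cloud-service'
$newServiceName = 'new-cloud-service'
$vmName         = 'myvm01'
$diskName       = 'myvm01-os'
$vhdUrl         = 'https://mynewstorage.blob.core.windows.net/vhds/myvm01-os.vhd'

# Step 5: stop the VM before touching its disks
Stop-AzureVM -ServiceName $oldServiceName -Name $vmName -Force

# Step 7: delete the VM; leaving off -DeleteVHD keeps the attached disk objects
Remove-AzureVM -ServiceName $oldServiceName -Name $vmName

# Step 8: register a new disk object pointing at the copied VHD
Add-AzureDisk -DiskName $diskName -MediaLocation $vhdUrl -OS Windows -Label $diskName

# Steps 9 and 10: recreate the VM from that disk in the new Cloud Service and vNET
New-AzureVMConfig -Name $vmName -InstanceSize 'Small' -DiskName $diskName |
    Set-AzureSubnet -SubnetNames 'Subnet-1' |
    New-AzureVM -ServiceName $newServiceName -VNetName 'vnet-west-us'
```

A VM rebuilt this way starts without any endpoints defined, so you would likely need to add those back (for example RDP) before connecting to it.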

Friday, October 3, 2014

Here is something strange....HP gets 10-year ban in Canada

HP gets a 10-year ban in Canada for corruption charge and subsequent conviction. Try going to and search for the story or anything close to it.

You won't find anything. Hmmmm. Strange?

This isn't an old story. It happens to be written on a Canadian web site last Friday!

Thursday, September 18, 2014

HOW TO: Copy VHD file between storage accounts in Azure

I recently attempted to rebuild and reconstruct my Azure IaaS environment and had to figure out an easy way to copy and keep track of my progress. The following script was created to copy data between two storage accounts in my Azure environment. It will display progress throughout the copy process.

Requirements: Azure PowerShell module

Note: The copy process may appear to complete sooner than expected. For example, a 125GB VHD file holding only 60GB of data will appear to complete at around the 50% mark.

Start by pinning the Azure PowerShell module to your taskbar once you've installed it. You can right-click on the link and choose "Run ISE as Administrator" and this will open an editor window where you can paste in the script below:

##Set variables for the copy
$storageAccount1 = "source storage account here"
$storageAccount2 = "destination storage account here"
$srcBlob = "source file (blob)"
$srcContainer = "source container"
$destBlob = "destination file (blob)"
$destContainer = "destination container"

##Get the key to be used to set the storage context

$srcStorageAccountKey1 = Get-AzureStorageKey -StorageAccountName $storageAccount1
$srcStorageAccountKey2 = Get-AzureStorageKey -StorageAccountName $storageAccount2

##Set the storage context
$varContext1 = New-AzureStorageContext -StorageAccountName $storageAccount1 -StorageAccountKey ($srcStorageAccountKey1.Primary)

$varContext2 = New-AzureStorageContext -StorageAccountName $storageAccount2 -StorageAccountKey ($srcStorageAccountKey2.Primary)

##Start copy operation
$varBlobCopy = Start-AzureStorageBlobCopy -SrcContainer $srcContainer -Context $varContext1 -SrcBlob $srcBlob -DestBlob $destBlob -DestContainer $destContainer -DestContext $varContext2

##Get the status so we can loop
$status = Get-AzureStorageBlobCopyState -Context $varContext2 -Blob $destBlob -Container $destContainer

While($status.Status -eq "Pending"){

    ##Refresh the copy state on each pass
    $status = Get-AzureStorageBlobCopyState -Context $varContext2 -Blob $destBlob -Container $destContainer

    ##Work out the percent complete and the MB copied for the progress bar
    [int]$varTotal = ($status.BytesCopied / $status.TotalBytes * 100)
    [int]$varCopiedBytes = ($status.BytesCopied /1mb)
    [int]$varTotalBytes = ($status.TotalBytes /1mb)

    $msg = "Copied $varCopiedBytes MB out of $varTotalBytes MB"
    $activity = "Copying blob: $destBlob"

    Write-Progress -Activity $activity -Status $msg -PercentComplete $varTotal -CurrentOperation "$varTotal% complete"

    Start-Sleep -Seconds 2
}

Some things to remember...
  • An aborted or failed file copy may leave a zero byte file on your destination container. Use the GUI or both the "Get-AzureStorageBlob" and "Remove-AzureStorageBlob" commands to view and remove the failed data. Failure to do this will result in a hung PowerShell session.
  • Making a copy of a file in the same container is basically instantaneous.
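
To clean up after a failed copy from PowerShell, something along these lines should do it. This is a minimal sketch that reuses the $varContext2, $destContainer, and $destBlob variables from the script above and assumes the failed copy really did leave a zero byte blob behind; check what Get-AzureStorageBlob returns before removing anything.

```powershell
# Look up the destination blob left behind by the failed copy
$blob = Get-AzureStorageBlob -Context $varContext2 -Container $destContainer -Blob $destBlob -ErrorAction SilentlyContinue

# Only remove it when it exists and is empty, so a good copy is never deleted by mistake
if ($blob -and $blob.Length -eq 0) {
    Remove-AzureStorageBlob -Context $varContext2 -Container $destContainer -Blob $destBlob
}
```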

Understanding client sign in behavior before and after a failure

In this post I'll cover the changes in Lync 2013's sign in process and discuss the various scenarios which could result in an outage or client-facing impact to the user experience.

If you've studied the differences in the sign-in behavior of Lync 2010 versus Lync 2013 you'll know the new "lyncdiscover" process improves the discovery of your pool's Edge server. For those of you who have not studied this change, I'll explain it briefly here.

How lyncdiscover works:
Expanding on Lync Server 2010's Mobility feature which came as a Cumulative Update (CU), the lyncdiscover process involves the publishing of two DNS names pointing to either your reverse proxy (when Internet facing) or to your front-end server/HLB VIP (when LAN facing).

The Lync 2013 client will query the internal name first and if it resolves and can connect, it assumes your endpoint is on the inside of the network (LAN). If the host name is unresolvable, we query the external name and connect. In either case, if we authenticate, an XML response is given to the endpoint containing the user's home pool SIP proxy and internal pool name among other things.

We then attempt a connection based on the response given in the XML document and cache the successful connection in a file called "EndpointConfiguration.cache" located in the user's profile.

The key takeaway is that the response is customized to your authenticated query and cached by the client. This is important when you have multiple Edge servers in your topology capable of proxying remote users to internal pools since you now have the ability to direct users to their regional Edge resources.

NOTE: This behavior is observed in the Lync 2010/2013 mobility clients, the Lync 2013 desktop client, and the Modern UI client for Windows 8.x. However only the desktop client will fall back to SRV lookup.

How does this compare with Lync 2010?
When we compare this capability to Lync 2010, the client would always query a DNS SRV record first, returning a static response. You could publish multiple SRV records with varying weights, however this would always return a static ordered list. The result would be all users in the organization signing into the same Edge proxy unless an outage occurred.

The following TechNet article explains the behavior well:

Failure scenarios and recovery options:
So you've done your reading about the sign in process and differences between Lync 2010 and 2013 and by now you're wondering what happens if the various components fail.

Failure | Behavior | Recovery Options
Query to lyncdiscover fails | Fallback to SRV lookup | Use GeoDNS solution to provide a response to the query
SRV lookup fails | Query another SRV record in DNS | Plan for adding a DNS SRV record for each Edge proxy, adjusting weights to give priority
Next hop from Edge proxy fails | Client sign in will fail | No automatic recovery is possible
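
To make the discovery order easier to picture, here is a small model of it in PowerShell. It is only a sketch of the behavior described in this post, not how the Lync client is implemented: the function name is made up, the host name prefixes "lyncdiscoverinternal" and "lyncdiscover" are my assumption for the two published DNS names, and the reachability check is passed in as a script block so nothing touches real DNS.

```powershell
# Hypothetical model of the client's discovery order; not actual Lync client code.
function Get-LyncDiscoveryMethod {
    param(
        [Parameter(Mandatory)][string]$SipDomain,
        # Script block that takes a host name and returns $true when it resolves and answers
        [Parameter(Mandatory)][scriptblock]$TestHost,
        # Per the note above, only the desktop client falls back to SRV lookup
        [switch]$DesktopClient
    )

    # Try the internal name first, then the external name (prefixes are assumptions)
    foreach ($prefix in 'lyncdiscoverinternal', 'lyncdiscover') {
        if (& $TestHost "$prefix.$SipDomain") { return $prefix }
    }

    # Both lyncdiscover queries failed
    if ($DesktopClient) { return 'srv' }
    return 'none'
}

# Example: pretend only the external name answers (contoso.com is a sample domain)
Get-LyncDiscoveryMethod -SipDomain 'contoso.com' -TestHost { param($h) $h -eq 'lyncdiscover.contoso.com' }
```

Swapping the script block for one that always returns $false shows the difference between the desktop client, which moves on to SRV, and the other clients, which have nowhere left to go.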

Example scenario #1:
Let's say you have a front-end pool and Edge pool in New York and London. All systems are up except for the Edge in New York. Alice is a user homed on the New York pool and is connected externally with the Lync 2010 client. She connects for the first time and queries the DNS SRV record "" and receives a weighted response with New York's Edge proxy in priority sequence followed by London's Edge proxy. Alice queries the FQDN provided in the SRV record and attempts a connection to New York's Edge however this proxy is down. The Lync 2010 client will use the second, higher weighted DNS SRV record and attempt a connection to London's Edge proxy. Alice's connection is proxied to the next hop server which is London's front-end pool which will also proxy her connection to New York's front-end pool. Lastly, the Lync client caches the London Edge proxy and internal registrar from New York in her cache (EndpointConfiguration.cache file).

Now in this scenario Alice has signed in and she will appear fully functional, however she won't be able to establish media (voice/video/desktop or app sharing) with a user who doesn't have a public IP or an IP on the same LAN. There are a couple of reasons for this in play; one of which has to do with the New York pool's association with an Edge pool for media. The media relationship is established in the Lync topology file when the administrator created the pool and made a static association. Lync users authenticating to the pool will receive a Media Relay Access Server (MRAS) token which will be used in the STUN/ICE/TURN candidate exchange/negotiation making it possible to establish media sessions across NAT devices. If the Lync Edge pool in New York is down, the front-end pool cannot receive an MRAS token for the user, so there will be no relay candidate for media establishment. When the Lync client negotiates media with another endpoint/user a check is first performed using the host IP (LAN/WiFi) to see if the two users are on the same network or are at least both directly reachable (both using public IPs). In the case where both IPs are routable, media negotiation should succeed.

Example scenario #2:
Given the above example topology of New York and London sites, assume all infrastructure is operational with the exception of the London pool which incurred an outage overnight. Bob is a user homed in the London pool and is connecting externally using the Lync 2010 client. This isn't Bob's first time connecting so we look at the EndpointConfiguration.cache file for the internal pool name/IP and attempt a connection since we don't know if Bob is internal or external. This will fail since Bob is Internet-facing in this scenario. The Lync client will use the same cache file and attempt a connection to the London Edge proxy/IP which will succeed, however since the Edge server cannot communicate with its next hop (London pool), his sign in fails.

It's important to understand there is no automatic recovery option available here. Since the connection to the Lync Edge in London was successful, but the next hop pool is down, Bob's Lync client will not fall back to a DNS SRV lookup. The Lync 2013 client behavior is the same given this scenario regardless of an administrator initiating a pool failover.

So you're probably wondering, if Bob was the CTO and you absolutely needed him logged in, how could you make this happen? Well you could obviously walk him through the manual configuration of connecting Lync to the New York Edge server by specifying the Edge FQDN, however the next hop front-end pool wouldn't be able to register the user until a pool failover was invoked. This puts pressure on the Lync administrator to invoke the pool failover command. Ultimately some intervention is required and this may include shutting down the VIP for the Edge pool or the servers themselves. Alternatively you could change the next hop to an available server and publish the topology. Just remember your firewall rules need to permit inbound TCP/5061 to the next hop.

Sticky Edge?
As I dug into this further for a customer of mine recently I thought about the scenario where a Lync remote user has signed into an Edge server which wasn't their "home Edge". In Alice's case I thought the London Edge proxy would be cached in the EndpointConfiguration.cache file making it nearly impossible for her to ever connect back to her home Edge proxy even if the service was restored.

Consider Alice's scenario again, but this time with a large population of 150,000 users signing into an Edge pool in London because a failure in New York caused their DNS SRV lookup to sign them into the London Edge pool. In what case would they ever sign back into the New York proxy if the Lync client stored/cached their Edge proxy sign in? In a worst case scenario you could potentially have the majority of 150,000 users signing into London's Edge pool along with the London population without any realistic way of getting them back to New York!

The Good News!
Well the good news is that we "expire" Edge proxy information in the EndpointConfiguration.cache file if it has been more than 24 hours since the last access time of the file. The same is true for internal registrar pools if the time has been greater than 14 days. This means if Alice, being a New York user, finds the London Edge through DNS SRV query and signs in successfully, she will stay signed into the London Edge for 24 hours no matter how many times she signs in or out regardless of the New York Edge server's state. Once this time has expired, we will query Lyncdiscover again and find the New York Edge and sign Alice into that proxy if available.
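
The expiry rule is simple enough to express as a quick check. The sketch below only models the two lifetimes described above (24 hours for Edge proxy entries, 14 days for internal registrar entries); the function name and parameter names are made up for illustration, and it does not read or parse the real EndpointConfiguration.cache file.

```powershell
# Hypothetical helper modelling the cache lifetimes described in this post.
function Test-LyncCacheEntryExpired {
    param(
        [Parameter(Mandatory)][ValidateSet('EdgeProxy', 'InternalRegistrar')][string]$EntryType,
        # Last access time of the cache file
        [Parameter(Mandatory)][datetime]$LastAccess,
        # Time to compare against; defaults to the current time
        [datetime]$Now = (Get-Date)
    )

    # Edge proxy entries live for 24 hours, internal registrar entries for 14 days
    $lifetime = if ($EntryType -eq 'EdgeProxy') { New-TimeSpan -Hours 24 } else { New-TimeSpan -Days 14 }

    # Expired once more than the lifetime has passed since the last access
    ($Now - $LastAccess) -gt $lifetime
}

# Example: Alice last touched the cache 25 hours ago, so her Edge entry has expired
Test-LyncCacheEntryExpired -EntryType EdgeProxy -LastAccess (Get-Date).AddHours(-25)
```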

Some Guidance...
If you require full redundancy of remote user sign in between two geographic locations with automatic failover and sign in, you won't ever achieve this. The biggest hurdle is the next hop from Edge to next proxy or registrar and even from TMG (or similar reverse proxy) to its next hop. The best you can do is make the DNS names available through GeoDNS/GSLB and ensure the Lync infrastructure is redundant (Enterprise Pools with redundant servers/NICs/etc.).
  • Use sub-domains for your web farm FQDNs. Using something like "lyncweb" as a sub-domain of your SIP domain allows you to delegate the subdomain to GeoDNS.
  • Publish DNS SRV records for all possible Edge proxies for external users.
  • Publish DNS SRV records for all possible internal registrars for internal users.
  • Test everything!
Hopefully this sheds some light on the external access behavior of the various failure scenarios. Cheers!

Friday, March 7, 2014

RESOLVED: An attempt to route to an Exchange UM server failed and OWA/IM integration with Lync Server 2013 and Exchange Server 2013

I recently worked on an engagement where the integration of Exchange Server 2013 and Lync Server 2013 was performed including all the new bits like Unified Contact Store, Archiving, etc. During the course of the engagement we struggled with OWA IM integration configuration and found discrepancies in the guidance from Microsoft and other bloggers out there.

So I'd like to take the time to share my experiences on two fronts; in resolving the error(s) associated with Lync/UM integration, and the OWA IM integration debacle.

So first, UM integration...

Everything was working fine from a UM perspective until we started messing with OWA IM integration troubleshooting. During this effort we decided to remove the trusted application pool representing the FQDN of the mail environment and all the trusted application servers within the pool. Topology was published and no issues were found, however this didn't solve our OWA IM integration issue.

One issue I did see was that the Lync environment wasn't discovering Exchange correctly and had removed the servers from its internal "Trusted Servers" list. As a last troubleshooting step I restarted the front-end service on the primary Lync Standard Edition server, however this again didn't resolve the OWA IM issue. So I added the trusted pool and servers back, published topology, restarted the front-end service on the primary Lync server, and validated they were back in the "Trusted Servers" list again.

I dropped the troubleshooting effort for the night only to realize the next morning that UM was broken across the board except for users homed on the primary Lync Standard Edition server. The only "change" made earlier that night was the topology change but how could this affect UM?

The following error was observed from my local Lync client when testing UM:
ms-diagnostics: 15030;reason="Failed to route to Exchange Server";source="lync01.contoso.local";dialplan="dp01.contoso.local";pstnreroutingenabled="false";appName="ExumRouting"
Server: ExumRouting/
The following error showed up on the Lync server:
An attempt to route to an Exchange UM server failed.
The attempt failed with response code 504: mb01.contoso.local.
Request Target: [dp01@mb01.contoso.local], Call Id: [0d493c0beac55e4d1938b96d9f0488dc].
Failure occurrences: 27, since 2014-03-07 8:15:44 AM.
Cause: An attempt to route to an Exchange UM server failed because the UM server was unable to process the request or did not respond within the allotted time.
Check this server is correctly configured to point to the appropriate Exchange UM server. Also check whether the Exchange UM server is up and whether it in turn is also properly configured.
Additionally, I found the following error through CLS Logging trace:
TL_ERROR(TF_CONNECTION) [se01\se01]08E8.0CFC::03/07/2014-18:05:22.538.000002E9 (SIPStack,SIPAdminLog::WriteConnectionEvent:SIPAdminLog.cpp(389)) [3492250981] $$begin_record
Severity: error
Text: The peer is not a configured server on this network interface
Transport: TLS
Data: fqdn="se01.contoso.local"
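When comparing several of these failures it helps to break the ms-diagnostics header into its parts. Below is a minimal sketch in Python, assuming the header always has the shape seen in the client error above (a numeric ID followed by semicolon-separated key="value" pairs). The function name parse_ms_diagnostics is hypothetical and not part of any Lync tooling.

```python
import re


def parse_ms_diagnostics(header: str) -> dict:
    """Split an ms-diagnostics header into its error ID and key/value parameters.

    Hypothetical helper: assumes '<id>;key="value";key="value"...' as in the
    client error quoted in this post.
    """
    # Accept either the full header line or just the value part.
    _, sep, value = header.partition("ms-diagnostics:")
    if not sep:
        value = header
    error_id, _, rest = value.strip().partition(";")
    # Collect every key="value" pair; values may contain spaces.
    params = dict(re.findall(r'(\w+)="([^"]*)"', rest))
    return {"error_id": int(error_id.strip()), **params}


line = (
    'ms-diagnostics: 15030;reason="Failed to route to Exchange Server";'
    'source="lync01.contoso.local";dialplan="dp01.contoso.local";'
    'pstnreroutingenabled="false";appName="ExumRouting"'
)
info = parse_ms_diagnostics(line)
print(info["error_id"], info["source"], info["reason"])
```

Grouping the parsed results by the source field is a quick way to see which registrar is generating the failures.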
Voicemail was working fine for users on the one Lync server but not for users on the SBS or secondary Lync Standard Edition server. Could the removal of the trusted application servers and pool from topology break UM? As a first step to fix this, I ran the ExchUcUtil.ps1 script to integrate the two environments again and triple-checked all the configuration in UM to see if something was missed or changed by accident. None of this worked.

Even though I had reverted my change related to the trusted application pool and servers being removed from topology builder, I had to restart the front-end services on the secondary Lync Standard Edition server AND the SBS for voicemail to start working again.

So the moral of the story: even if you don't think the change you're making could conceivably impact clients, it could. I should have tested EVERYTHING before calling it a night.

Now onto the OWA IM integration issue...

TechNet claims there is no need to create a trusted application server or pool (if more than one E2013 FE) if the 2013 front-end/back-end roles are collocated and if you have a SIPURI dialplan in UM. In every case where I've configured this type of 2013 to 2013 interoperability, I have always needed to define the trusted application pool. In other words, the autodiscover from Lync to Exchange never works correctly. Once I populate the trusted application pool, OWA IM lights up!

It seemed obvious to me that the "Trusted Servers" list reported by the Lync server in Event ID 33022 had something to do with OWA IM not working. Once the pool was added back into topology, the servers reappeared in Event ID 33022 and the OWA experience worked again.

Anyway, I invite your comments and feedback on these. Cheers!

Friday, November 23, 2012

Lync Server 2013 HA Design Changes and Considerations

Lync Server 2013 introduces new capabilities for recovering from a single server or pool failure and failing over between pools of servers; either Enterprise or Standard Edition.

This post discusses these capabilities, demonstrates their use, and offers suggestions for organizations wondering which path to choose.

Lync Server 2013...what's changed?
  1. Enterprise Edition pools now are recommended to have a minimum of three, yes THREE front-end servers. This is due to the "Windows Fabric" replication architecture based on Azure. The back-end SQL database is no longer the store for real-time data.
  2. (subject to change) Enterprise Edition pools use a quorum model similar to Exchange Server 2010/2013 in that a Majority Node Set (MNS) quorum leverages a tie-breaker for pools with even-numbered front-end servers. In the case of Lync Server 2013 this is the pool back-end SQL server.
  3. Enterprise Edition pools no longer support SQL Server clustering for HA.
  4. SQL Server mirroring is now the supported method of providing back-end database resiliency.
  5. For automatic failover of a SQL mirror, a SQL witness is required; this can be SQL Express. Collocation of other services, software, etc. is subject to further testing.
  6. Lync Server 2013 uses a Web Application Companion (WAC) server (aka Office Web Apps) to stream PowerPoint meeting content including full transition support and embedded videos.
  7. Lync servers can be "paired" with like-infrastructure (Enterprise to Enterprise and Standard to Standard) to ensure resiliency in the event of a site outage (DR). This pairing activity ensures replication of critical pool/server data and must be invoked by an administrator via manual PowerShell commands.
  8. Multiple Federation routes can be applied to the topology. For example, a Boston Standard Edition server can use a Boston Lync Edge server as its Federation route whereas a Seattle Enterprise pool can use a regional Seattle Lync Edge server/pool for Federation.
Now that Enterprise Edition pools can be paired with other EE pools, and Standard Edition servers can be paired with Standard Edition servers, this changes how we design Lync solutions in certain cases. I talk to customers who often suggest they need "High Availability" (HA) in their Lync infrastructure, and this often comes from those who are implementing IM&P only. Instead of trying to meet some kind of unrealistic expectation or design to a requirement which centers around a term like HA, drive the conversation toward Recovery Time Objective (RTO) and Recovery Point Objective (RPO). These two factors, along with a Service Level Agreement (SLA) percentage (e.g. 99.9%), should drive the outcome. Anyway, here is the rule of thumb I use personally today:

If the organization suggests they need HA, are they willing to accept a Recovery Time Objective of >1 hour? If so, and the per-server user count does not exceed ~5000, use Lync Standard Edition. Two Lync Standard Edition servers could even be used to split the load of 5000 users in a location, where 2500 are homed on each server and both servers are paired (backup for each other). The build list would look something like this:

2 x Lync Server 2013 Standard Edition servers (paired with each other in the same site or stretched between a primary and DR site)
2 x Lync Server 2013 Edge servers
2 x Office Web App servers (WAC)
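The rule of thumb above can be sketched as a small Python function, along with the arithmetic behind an SLA percentage. This is only an illustration of the reasoning in this post, not sizing guidance: the function names and return strings are made up for the example, and the 5000-user ceiling is the approximate figure quoted above.

```python
SE_USERS_PER_SERVER = 5000  # approximate per-server ceiling quoted in this post


def allowed_downtime_hours(sla_percent: float, period_hours: float = 365 * 24) -> float:
    """Hours of downtime an SLA percentage permits over a period (default: one year)."""
    return period_hours * (1 - sla_percent / 100)


def suggest_design(accepts_rto_over_1h: bool, users_per_server: int) -> str:
    """Apply the post's rule of thumb (hypothetical helper, illustrative labels)."""
    if not accepts_rto_over_1h:
        # No tolerance for a manual failover window: Enterprise Edition pool.
        return "Enterprise Edition pool"
    if users_per_server <= SE_USERS_PER_SERVER:
        return "paired Standard Edition servers"
    # Over the ceiling: keep Standard Edition but spread users over more servers.
    return "paired Standard Edition servers (add servers to split the load)"


print(round(allowed_downtime_hours(99.9), 2))  # yearly downtime budget at 99.9%
print(suggest_design(True, 2500))
print(suggest_design(False, 2500))
```

A 99.9% SLA leaves a budget of under nine hours a year, which is why asking whether a failover of more than an hour is acceptable tends to settle the design question faster than debating "HA".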

If the organization insists they cannot incur downtime for Lync components contained within a single site, and they insist "high availability" is a requirement, the infrastructure looks something like this:

3 x Lync Server 2013 Enterprise Edition servers
2 x Lync Server 2013 Edge servers
2 x Office Web App servers (WAC)
2 x SQL Standard or Enterprise Servers
1 x SQL Express, Standard, or Enterprise (witness)
2 x File servers using DFS
2 x Hardware Load Balancers (for the EE pool and WAC servers)

But wait....I need DR!
If the organization also insists they have a plan for Disaster Recovery, a second "warm" site would house the following minimum infrastructure:

3 x Lync Server 2013 Enterprise Edition servers
2 x Lync Server 2013 Edge servers
2 x Office Web App servers (WAC)
2 x SQL Standard or Enterprise Servers
1 x SQL Express, Standard, or Enterprise (witness)
2 x File servers using DFS
2 x Hardware Load Balancers (for the EE pool and WAC servers)

That's 24 servers (12 per site, not counting the hardware load balancers) to build a site-redundant Lync Server 2013 Enterprise environment. This may seem a bit ridiculous; however, the point I'm illustrating is the value Standard Edition now brings in Lync Server 2013. Additionally, I haven't yet found an organization that would dedicate server hardware or VMs in this manner. You can collocate many of the roles and scale back on things like WAC and SQL mirroring. Lastly, organizations might suggest their DR infrastructure would accommodate lower user counts, which may drive a design lacking redundancy at the "warm" site.

Hold on....what about Persistent Chat?
Okay, so you want a Persistent Chat pool as well....we need to add dual redundant servers at each site raising the total to 28 servers.
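For anyone checking the math, here is the tally as a short Python sketch. It assumes the hardware load balancers are not counted as servers (that is the reading under which the build lists add up to the totals quoted here); the dictionary labels are just names for the example.

```python
# Per-site Enterprise Edition build list from this post.
# Assumption: hardware load balancers are appliances and are not counted as servers.
SITE_SERVERS = {
    "Enterprise Edition front-end": 3,
    "Edge": 2,
    "Office Web Apps (WAC)": 2,
    "SQL back-end (mirrored)": 2,
    "SQL witness": 1,
    "File server (DFS)": 2,
}
PERSISTENT_CHAT_PER_SITE = 2  # "dual redundant servers at each site"
SITES = 2  # primary site plus the "warm" DR site


def total_servers(include_persistent_chat: bool = False) -> int:
    """Total server count across both sites for the Enterprise Edition design."""
    per_site = sum(SITE_SERVERS.values())
    if include_persistent_chat:
        per_site += PERSISTENT_CHAT_PER_SITE
    return per_site * SITES


print(total_servers())      # 24
print(total_servers(True))  # 28
```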

As you can see the case for paired Standard Edition servers quickly becomes favorable from a cost and complexity perspective albeit sacrificing availability in the event of a single server outage. The fact that hardware load balancers can be completely eliminated also tells a great story around simplicity. To date I have yet to see a successful implementation of OCS or Lync where hardware load balancers are in the mix at all. This is mostly due to lack of knowledge, lack of understanding on how the solution works, or in some cases simple reluctance to work together.

What if I have more than 5000 users at a single site and need DR?
Consider placing multiple Standard Edition servers paired with similar servers at your backup sites. You can split users homed between servers (i.e. 3000 on ServerA in SiteA and 3000 on ServerB in SiteA) to meet your capacity requirements.
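A rough way to work out how many Standard Edition servers a site needs, and how many users to home on each, is sketched below in Python. The 5000-user ceiling is the approximate figure used in this post, and plan_se_servers is a hypothetical helper. It only does the arithmetic; each server it returns still needs its own paired partner at the backup site.

```python
import math


def plan_se_servers(total_users: int, max_per_server: int = 5000) -> list[int]:
    """Return an even user split across the fewest servers that stay under the ceiling."""
    if total_users <= 0:
        return []
    servers = math.ceil(total_users / max_per_server)
    base, extra = divmod(total_users, servers)
    # The first 'extra' servers take one additional user so the split stays even.
    return [base + 1 if i < extra else base for i in range(servers)]


print(plan_se_servers(6000))   # [3000, 3000]
print(plan_se_servers(12001))  # [4001, 4000, 4000]
```

The same split could drive a scripted user move so that placement stays balanced as the user count grows.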

What are the drawbacks to Lync Standard Edition anyway?
Well, the first point people typically jump on is no "high availability". This is obviously due to the lack of a shared common data store that multiple front-ends connect to. Here are some of the more important drawbacks when considering this approach:
  1. Restoration of service is a manual effort resulting in users being left with "Resiliency Mode" until this action is taken.
  2. Your Edge proxy to 'next hop' internal server can be only one SE server even if you have several of them. An outage to this next hop server results in an outage for all remote users' traffic. It is important to note as well that if Edge cannot contact the next hop, clients will not attempt to sign into another Edge proxy even if another exists (without manual intervention at each client system).
  3. Response Groups and Call Park are a manual effort to switch over.
  4. Assignment of users to a collection of SE servers takes thought and proper assignment so as to not overload a single server. In the case where you have two servers, decide if you're going to run them active/active or active/passive as this will change your user placement behavior. This can also be scripted for ease of user placement automatically.
  5. You could argue this is more complex to manage however the same argument is made for the HLB/SQL infrastructure required.
  6. Your PSTN conference DID is homed to a single server. If this server is down, the DID is as well. I have not yet tested whether this DID is restored on the backup registrar after a pool failover (TBD).
  7. Exchange OWA/UCS integration has a single point of failure due to the lack of multiple server definitions in the Exchange 2010/2013 CAS setup.
Certainly you will have to weigh your own requirements against what is both supported and recommended. This article is intended to keep us on our toes when designing Lync solutions for our customers. Enjoy!