Thursday, February 5, 2015

Automating Azure Cloud Service VM Start and Stop

Through my travels I've recently come to learn of some great automation features in Microsoft Azure which have helped me trim the compute cost associated with Virtual Machines running during off-business hours.

As an avid consumer of my monthly MSDN Azure credits, I found I was burning compute cost during hours I wasn't working or using the environment, yet for a long time I didn't bother learning the "Automation" feature in Azure due to the learning curve. I've since taken the time to figure it out and offer my findings in this post.

First let's talk about the various components which make up the Automation feature in Azure:

Azure Automation Components

Automation Account: The Automation Account is a placeholder for your other objects which include Runbooks, Assets, Scale configuration, etc.

Assets: An Asset can be a "Connection", a "Credential", a "Variable", or a "Schedule" object which can be linked or referenced in your Runbooks.

Runbooks: The Runbook is essentially the script containing the commands you want to issue to your environment. A Runbook can contain references to Assets such as a reusable "Credential".

Creating your Automation Credential

The newer way of connecting to your Azure tenant from the Automation environment is to create an account in your Azure AD directory, add it as a Credential Asset, then reference that Asset when connecting to your environment.
  1. Click on the Active Directory navigation link on the left side of your Azure portal
  2. Click on the Azure AD environment associated with your Azure tenant and click the Users tab
  3. Click the Add User button on the taskbar and add an account called "Azure Automation" with a UPN of "azureautomation@tenantdomain.onmicrosoft.com" 
  4. NOTE: Once you receive the password for the account you must log in at least once to change the password otherwise your Runbook will fail. You can do this by opening a private IE window and logging in to change the password. Remember this password as we will use it to create an Asset shortly.
  5. Click on the Settings link in the left navigation window then click on Administrators
  6. Click the Add button on the toolbar and add your Azure Automation AD account as a Co-Administrator to your subscription
  7. Click on the Automation navigation section and click on your Automation Account
  8. Click the Assets tab and click Add Setting on the toolbar
  9. Choose Add Credential, choose Windows PowerShell Credential, and type a name (i.e. AzureAutomationAccount) and click the Next button
  10. Type the User Name and Password making sure to specify the full UPN (azureautomation@tenantdomain.onmicrosoft.com) and click the Checkmark to complete
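
If you prefer scripting over the portal, the same Credential Asset can be created from your workstation with the Azure PowerShell module (assuming your module version includes the Automation cmdlets). This is only a sketch; the automation account name, asset name, and UPN below are placeholders you'd swap for your own:

## Build a PSCredential for the automation AD account and store it as a Credential asset (names are placeholders)
$pass = Read-Host -Prompt "Password for azureautomation@tenantdomain.onmicrosoft.com" -AsSecureString
$psCred = New-Object System.Management.Automation.PSCredential ("azureautomation@tenantdomain.onmicrosoft.com", $pass)
New-AzureAutomationCredential -AutomationAccountName "MyAutomationAccount" -Name "AzureAutomationAccount" -Value $psCred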

Creating your first Runbook...

  1. Click on the Automation link in your Azure tenant, then click the Create button on the toolbar at the bottom of the screen
  2. Give the account a name and set the Region
  3. Click on the account to open it, select the Runbooks tab, then click the New button on the toolbar
  4. Click Quick Create, give the Runbook a name (i.e. StartTestCloudService), give it a description, make sure your automation account is selected, and click on the create link
  5. Click the Runbook which will take you to the Author tab right away

Anatomy of your Runbook script

We need to make a connection to the Azure environment which is done by issuing the following command:

$cred = Get-AutomationPSCredential -Name AzureAutomationAccount

We store the credential in a variable called $cred which we pass to the cmdlet below. This makes it possible for an Administrator to establish a co-administrator ID without giving the password out to another person who might be using the account for scripting purposes.

Add-AzureAccount -Credential $cred

This command will connect to your Azure tenant using the credentials stored in the Asset you created earlier.

Select-AzureSubscription -SubscriptionName "Visual Studio Ultimate with MSDN"

NOTE: Depending on your environment, you may need to change the subscription name to your actual subscription name. So how do you find this information? Well, I suggest downloading the Azure PowerShell tools from: http://azure.microsoft.com/en-us/downloads/#cmd-line-tools. Start the Azure PowerShell tool and log into your account using the Add-AzureAccount function above. After authenticating, type Get-AzureSubscription and note the SubscriptionName parameter's value.
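
For example, from the Azure PowerShell console on your workstation the check might look like this (an interactive sign-in just to read back the subscription name):

## Sign in interactively, then list subscriptions to find the SubscriptionName value to use in the runbook
Add-AzureAccount
Get-AzureSubscription | Select-Object SubscriptionName, IsDefault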

You're now ready to start issuing commands to your VMs. To start VMs in a particular cloud service use:

Start-AzureVM -Name "<VM name or *>" -ServiceName "<cloud service name>"

To stop all VMs in a cloud service, use:

Stop-AzureVM -Name "*" -ServiceName "<cloud service name>" -Force

Using the "-Force" parameter ensures the script properly de-allocates your Cloud Service resources, including the shared Virtual IP; stopping the last VM in a deployment normally prompts for confirmation, which an unattended runbook can't answer. Without this switch the runbook will fail to shut down all the VMs.
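
Putting those pieces together, here is a minimal sketch of what a complete "stop" runbook might look like. Azure Automation runbooks are PowerShell Workflows, and the credential asset name, subscription name, and cloud service name below are placeholders for your own values:

workflow StopTestCloudService
{
    ## Retrieve the Credential asset created earlier (asset name is a placeholder)
    $cred = Get-AutomationPSCredential -Name "AzureAutomationAccount"

    ## Connect to the Azure tenant and select the subscription
    Add-AzureAccount -Credential $cred
    Select-AzureSubscription -SubscriptionName "Visual Studio Ultimate with MSDN"

    ## Stop and de-allocate every VM in the cloud service (service name is a placeholder)
    Stop-AzureVM -Name "*" -ServiceName "TestCloudService" -Force
}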

You can save and test your Runbook using the buttons on the bottom toolbar when in "editing" mode to make sure everything is working. When complete, click the Publish button which will allow you now to schedule your Runbook.

Scheduling my Runbook

Now that we've created a Runbook, we want to schedule it to run daily at 18:00:
  1. Click on the Automation section of your Azure portal and open your Automation Account
  2. Click on your Runbook and click the Schedule tab
  3. Click on the Link button and choose Link to a New Schedule
  4. Specify a name (i.e. 18:00 every day) and click Next
  5. Select Daily, type 18:00 for the time, leave 1 as the "recur every" value, and click the Checkmark
You can go back to your Automation Account then into Assets to view your "Schedule" asset which was created when you clicked the Link button in the above steps.

NOTE: If you need to change the start date, time, or frequency of your Schedule asset, it doesn't appear to be an editable object through the UI or Azure PowerShell (i.e. Set-AzureAutomationSchedule), so you'll likely have to create a new one, link it, and un-link the old one.

That's all for today folks. Enjoy!

Thursday, January 8, 2015

RESOLVED Password write back issue (Event ID 6329 and 32009)

I recently set up password write-back for a customer to give them self-service password resets for their Office 365 users. In testing the password change I could get all the way through the steps, but at the point where the password was actually modified I would get an error in the web page, and Event IDs 6329 and 32009 would show up on the DirSync server.


In doing some research I found the MSOL_* account in AD used by DirSync w/password sync needs to have more rights than the default "Domain Users" group gives it. You can either delegate the password change capability or add the account to the "Domain Admins" group in AD. Immediately after doing this I was able to change the password.
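
If you go the delegation route instead of Domain Admins, the rights can be granted with dsacls. This is a hedged sketch based on the commonly documented write-back permissions; the domain DN and the MSOL_ account name below are placeholders for your environment:

## Grant the DirSync service account Reset/Change Password rights plus write access to lockoutTime and pwdLastSet on user objects
dsacls "DC=contoso,DC=local" /I:S /G "CONTOSO\MSOL_1234abcd:CA;Reset Password;user"
dsacls "DC=contoso,DC=local" /I:S /G "CONTOSO\MSOL_1234abcd:CA;Change Password;user"
dsacls "DC=contoso,DC=local" /I:S /G "CONTOSO\MSOL_1234abcd:WP;lockoutTime;user"
dsacls "DC=contoso,DC=local" /I:S /G "CONTOSO\MSOL_1234abcd:WP;pwdLastSet;user"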

Some additional information for reference:
  1. Excellent article on how to enable: http://msdn.microsoft.com/en-us/library/azure/dn683881.aspx
  2. Password write-back configuration steps: http://msdn.microsoft.com/en-us/library/azure/dn688249.aspx
NOTE:
  1. If you have enabled the Azure AD Premium option requiring users to register before being able to reset their password, they must do so before they'll be recognized at the sign-in page (i.e. http://login.microsoftonline.com). For example, if I click on the "Can't access your account" link I'll be taken to https://passwordreset.microsoftonline.com asking for my username and character verification. You will not get past this step if you or your users haven't registered for the service.
  2. Additionally, you must grant your Office 365 admin account an Azure AD Premium license in order to enable password reset feature(s).
  3. Lastly, it is advised to enter a telephone number on the "General" tab of the on-prem AD user so that the password reset contact method has at least one verification option.

Thursday, October 23, 2014

Ottawa shooting brings about Police State?


RESOLVED: Missing NETLOGON and SYSVOL shares on Azure domain controller

So I recently built a VM on Azure in Southeast Asia to do some testing for a customer. I had a few issues connecting the server to AD but eventually figured out Windows Firewall and a bad DNS configuration were the culprits. Soon after joining and promoting the box to a DC, I started my Lync Server 2013 installation.

This was when things went sideways...

I noticed the Lync Server tossing errors about AD being "partially" prepared, which was odd since I had a well built out environment in North America. I thought it might have something to do with latency between Singapore and North America but soon realized that wasn't the issue. When I attempted to install the local configuration store, the wizard wouldn't get past the first page and produced an error about not being able to find the configuration store. However, running Get-CsManagementConnection did return the SCP for SQL.

I moved on to basic troubleshooting of the DC by running a "dcdiag", which revealed errors about the server being unsuitable and another error relating to a missing NETLOGON share. Ah hah!! I also noticed SYSVOL wasn't there either, and coupled with those two were countless errors about DFS Replication in Event Viewer.
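
If you want to confirm the same symptoms on your own DC, a couple of quick checks with standard Windows tools might look like this:

## Run only the NetLogons test to confirm the missing share, then list the shares the server is actually publishing
dcdiag /test:netlogons
net share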

So after some digging I found the following worked for me: follow the instructions in this KB article: http://support.microsoft.com/kb/947022

In doing so, the SYSVOL share magically appeared!

Once the share was created I restarted the NETLOGON service, only to find another error in Event Viewer complaining that the expected folder structure under SYSVOL was missing.


I manually created the "scripts" and "policies" folders in File Explorer and restarted the NETLOGON service again.
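
The same thing can be done from PowerShell if you prefer. This sketch assumes the default SYSVOL location; adjust the paths to wherever your installation placed SYSVOL:

## Recreate the expected folder structure under SYSVOL (default path assumed), then restart NETLOGON so the shares are re-published
New-Item -ItemType Directory -Path "C:\Windows\SYSVOL\domain\scripts" -Force
New-Item -ItemType Directory -Path "C:\Windows\SYSVOL\domain\Policies" -Force
Restart-Service -Name Netlogon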

This worked and my Lync installation continued without issue!

FIXED: Lync Server 2013 pre-requisites fail to install on Windows Server 2012 R2 in Azure

So I've noticed recently that builds of Windows Server 2012 R2 which are pre-patched in Azure IaaS fail on the pre-requisite installation for Lync Server 2013. The same thing became apparent during a SQL Server 2012 installation, and both are related to the same issue.

During the feature installation process we attempt to install .NET Framework 3.5; however, the error you receive says the 'source' location wasn't specified or the files can't be found.


The feature installation wizard shows a warning to the same effect when installing the feature.


How to resolve: 

Install the following patch: http://support2.microsoft.com/kb/3005628 

...and retry your feature installation again.
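
As an aside, if you have (or can mount) the Windows Server 2012 R2 media, the .NET 3.5 payload can also be installed by pointing the feature installation at the side-by-side store on the media. This isn't the fix described above, just an alternative sketch; the drive letter is a placeholder:

## Install .NET Framework 3.5 by explicitly providing the source files from the install media (D: is a placeholder)
Install-WindowsFeature NET-Framework-Core -Source "D:\sources\sxs"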

Install Lync Server 2013 pre-reqs:

Install-WindowsFeature RSAT-ADDS, Web-Server, Web-Static-Content, Web-Default-Doc, Web-Http-Errors, Web-Asp-Net, Web-Net-Ext, Web-ISAPI-Ext, Web-ISAPI-Filter, Web-Http-Logging, Web-Log-Libraries, Web-Request-Monitor, Web-Http-Tracing, Web-Basic-Auth, Web-Windows-Auth, Web-Client-Auth, Web-Filtering, Web-Stat-Compression, Web-Dyn-Compression, NET-WCF-HTTP-Activation45, Web-Asp-Net45, Web-Mgmt-Tools, Web-Scripting-Tools, Web-Mgmt-Compat, Windows-Identity-Foundation, Desktop-Experience, Telnet-Client, BITS

Tuesday, October 7, 2014

Unwinding a cocked-up Azure deployment

First off, for those of you who believe my title is vulgar, read this. For everyone else, read on...

I started deploying Virtual Machines (VMs) to Azure only recently and found myself giddy as a schoolboy anticipating what I could build; however, this quickly descended into a befuddling exercise that left one heck of a mess in my Azure portal. As with any software offering with many knobs and buttons, and very little in the way of a traditional 'wizard' to walk new admins through the right way forward, you can get yourself in trouble quickly.

Working with customers in this area I've seen many different ways to deploy Cloud Services, VNets, VMs, and Storage Accounts. Because little guidance exists on how these various components impact each other, people generally forge ahead with little planning and eventually make mistakes.

This post will help demystify the core components of Azure IaaS and give you guidance on how to fix a cocked-up deployment. After all, you need to know what you're dealing with before you attempt to fix it. We'll begin by identifying and describing the building blocks of an Azure IaaS deployment:

1. Naming Conventions

This topic seems somewhat self explanatory; however, it's worth mentioning because a consistent convention makes working in PowerShell easier and makes your topology easier for other Azure admins to adopt. Consider using a format such as this:

[object name]-[location]-[data]

For example, if I'm creating a Virtual Network in South Central US with a network starting at 10.2.x.x then I might use:

vnet-southcentralus-10.2

Key Takeaway: Certainly don't take my suggestion as the 'right' way to do this either. I'm still working out and perfecting my own ideas on how descriptive these objects should be. You'll probably start with something similar, change it, then change it many more times before you get it right.

2. Virtual Networks

From my perspective vNETs are one of the most important components to get right so I'll start with them first. Like most people I just wanted to get started building VMs in Azure so I didn't pay much attention to this aspect. I can't say it enough now though; planning your vNET topology is foundational to a successful implementation!

Virtual Networks in Azure define the access boundary and the geographic location of your virtual machines. Start with planning your network address space you'll use in Azure including any segmentation via subnets.

 

Key takeaway: Create your vNETs first, define your address range, then carve out your subnets. Any VM on any subnet in the same vNET will be able to talk (route) without any intervention from you.

If you need VMs in different geographic regions, create those vNETs in those regions first. You can perform vNET to vNET routing through the use of Dynamic Routing Gateways (http://msdn.microsoft.com/en-us/library/azure/dn690122.aspx).

3. Storage Accounts

Since your VMs will need a place to reside, you should create the location in advance even though it can be created at the time you build your VM. Refer to my first point above around naming conventions for a reason why. You likely want to have control over what these objects are called vs. the Azure service giving a cryptic name like "p12asgd798asfg98weiop" as a storage account. You also get the opportunity to be in control of where that account is created geographically.


Once your Storage Account is created, add a container for your VHD files (i.e. vhds) to hold your VM disks.
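
If you'd rather script this step, a minimal sketch with the classic Azure PowerShell module might look like the following; the account name, region, and container name are placeholders:

## Create a storage account in a specific region with a name you control (name/region are placeholders)
New-AzureStorageAccount -StorageAccountName "storwestus01" -Location "West US"

## Build a storage context from the account key and create the "vhds" container for VM disks
$key = (Get-AzureStorageKey -StorageAccountName "storwestus01").Primary
$ctx = New-AzureStorageContext -StorageAccountName "storwestus01" -StorageAccountKey $key
New-AzureStorageContainer -Name "vhds" -Context $ctx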

Key takeaway: If you're building VMs in East and West US datacenters, create the associated storage accounts in those datacenters as well.

4. Disks

In Azure IaaS, Disks are somewhat self explanatory: they are built on top of a VHD file and can be attached to a VM at creation time. If you delete a VM and want to recreate it, you'll need to define a disk based on the underlying VHD before doing so....more on this later.

5. Cloud Services

This mysterious object has been at the root of many bunged up deployments and remains for many people a source of frustration in terms of understanding its importance and relevance to the overall architecture. The Cloud Service is basically a container for similar VMs which a business would define as a "service". Let's take Lync Server 2013 for example...

In a Lync environment you have Front-End, Back-End (SQL), File shares, Edge servers, Office Web App servers, and Mediation servers. An Azure admin running a Lync pool would conceivably put all these VMs into a single Cloud Service. You might decide to have a Cloud Service for "core" bits such as Active Directory servers, and yet another for client VMs or development boxes. The point is, each Cloud Service should encase all that makes up that defined service. Why? Well, starting and stopping a Cloud Service to bring an environment online/offline as needed is easier than any other method.


Be aware though that you can only perform one VM operation at a time within each Cloud Service, such as a start/stop/update. For this reason I've seen some people create a Cloud Service for each VM, which can have long-term consequences in terms of the number of CS objects and overall management overhead.

6. Virtual Machines

Now we arrive at the last item; the Virtual Machine itself. Up to this point we've defined everything we need to establish the basis for a solid Azure environment in which the VM will reside. If you've planned the environment to your liking, read no further.


The Virtual Machine will consume disk and network (vNET) resources in the geographic region you've created them in. Remember, the vNET in "West US" should have a Storage Account in "West US", making the VM reside in "West US" after all.

Now on to what we came here for....

Now...how to fix your Azure environment

I'll give you an example of what I did to resolve my many issues; some of it you may want to implement and some you may simply toss away. My problem stemmed from a lack of conceptual understanding of the geographic placement of vNETs and Cloud Services. I wanted some VMs to run in the West coast datacenter and some in the East. My storage accounts were all in the West, and some of my networking was East and some was West. A mess really....

To fix it I followed these "simple" steps:
  1. Define my naming convention for networks, VMs, storage accounts, etc.
  2. Build vNET and define address range including all subnets making sure no overlap existed with my existing environment.
  3. Build my storage accounts and set up my "vhds" container.
  4. Create my Cloud Service objects.
  5. Stop my VMs.
  6. Copy the VHD files to the new storage accounts (renaming them along the way). For more information on how to do this check out: http://jasonshave.blogspot.ch/2014/09/how-to-copy-vhd-file-between-storage.html
  7. Delete the existing VMs making sure to KEEP the attached disk objects.
  8. Create new disk objects based on the new home location of my VHD files from step 6 above (see the PowerShell sketch below).
  9. Recreate my deleted VM by choosing the "New VM", "From Gallery", then selecting "My Disks" to locate the objects from step 8 above.
  10. Be sure to deploy the VM into an existing Cloud Service, choose the correct vNET, and Storage Account.
  11. Repeat this for all your VMs!
While this seems like a daunting task, consider this a lesson learned. I know I did!!
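
For step 8, registering a copied VHD as a disk object can also be done from PowerShell rather than the portal. A minimal sketch, with the disk name and blob URL as placeholders:

## Register an existing VHD blob as an OS disk so it shows up under "My Disks" when recreating the VM
Add-AzureDisk -DiskName "vm01-osdisk" -MediaLocation "https://storwestus01.blob.core.windows.net/vhds/vm01-osdisk.vhd" -OS Windows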

Friday, October 3, 2014

Here is something strange....HP gets 10-year ban in Canada

HP gets a 10-year ban in Canada for a corruption charge and subsequent conviction. Try going to cnn.com and searching for the story or anything close to it.

You won't find anything. Hmmmm. Strange?

This isn't an old story. It was published on a Canadian web site just last Friday!

Thursday, September 18, 2014

HOW TO: Copy VHD file between storage accounts in Azure

I recently attempted to rebuild and reconstruct my Azure IaaS environment and had to figure out an easy way to copy VHD files between storage accounts while keeping track of progress. The following script was created to copy data between two storage accounts in my Azure environment. It will display progress throughout the copy process.

Requirements: Azure PowerShell module (http://azure.microsoft.com/en-us/documentation/articles/install-configure-powershell/)

Note: The copy process may complete sooner than expected; for example, a 125 GB VHD file containing only 60 GB of data will appear to complete at roughly the 50% mark.



Start by pinning the Azure PowerShell module to your taskbar once you've installed it. You can right-click on the link and choose "Run ISE as Administrator" and this will open an editor window where you can paste in the script below. Make sure you've signed in with Add-AzureAccount and selected your subscription first, since the storage cmdlets below need an authenticated session:

##Set variables for the copy
$storageAccount1 = "source storage account here"
$storageAccount2 = "destination storage account here"
$srcBlob = "source file (blob)"
$srcContainer = "source container"
$destBlob = "destination file (blob)"
$destContainer = "destination container"

##Get the keys used to set the storage contexts
$srcStorageAccountKey1 = Get-AzureStorageKey -StorageAccountName $storageAccount1
$srcStorageAccountKey2 = Get-AzureStorageKey -StorageAccountName $storageAccount2

##Set the storage contexts
$varContext1 = New-AzureStorageContext -StorageAccountName $storageAccount1 -StorageAccountKey ($srcStorageAccountKey1.Primary)
$varContext2 = New-AzureStorageContext -StorageAccountName $storageAccount2 -StorageAccountKey ($srcStorageAccountKey2.Primary)

##Start the copy operation
$varBlobCopy = Start-AzureStorageBlobCopy -SrcContainer $srcContainer -Context $varContext1 -SrcBlob $srcBlob -DestBlob $destBlob -DestContainer $destContainer -DestContext $varContext2

##Get the status so we can loop
$status = Get-AzureStorageBlobCopyState -Context $varContext2 -Blob $destBlob -Container $destContainer

While($status.Status -eq "Pending"){

    $status = Get-AzureStorageBlobCopyState -Context $varContext2 -Blob $destBlob -Container $destContainer

    [int]$varTotal = ($status.BytesCopied / $status.TotalBytes * 100)
    [int]$varCopiedBytes = ($status.BytesCopied / 1mb)
    [int]$varTotalBytes = ($status.TotalBytes / 1mb)

    $msg = "Copied $varCopiedBytes MB out of $varTotalBytes MB"
    $activity = "Copying blob: $destBlob"

    Write-Progress -Activity $activity -Status $msg -PercentComplete $varTotal -CurrentOperation "$varTotal% complete"

    Start-Sleep -Seconds 2
}

Some things to remember...
  • An aborted or failed file copy may leave a zero-byte file in your destination container. Use the GUI or the "Get-AzureStorageBlob" and "Remove-AzureStorageBlob" cmdlets to view and remove the failed data (see the sketch after this list). Failure to do this will result in a hung PowerShell session.
  • Making a copy of a file in the same container is basically instantaneous.
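
Here's a quick sketch of that cleanup, reusing the destination context and variable names from the script above:

## List the blobs in the destination container, then remove the zero-byte leftover from the failed copy
Get-AzureStorageBlob -Container $destContainer -Context $varContext2
Remove-AzureStorageBlob -Container $destContainer -Blob $destBlob -Context $varContext2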

Understanding client sign in behavior before and after a failure

In this post I'll cover the changes in Lync 2013's sign in process and discuss the various scenarios which could result in an outage or client-facing impact to the user experience.

If you've studied the differences in the sign-in behavior of Lync 2010 versus Lync 2013 you'll know the new "lyncdiscover" process improves the discovery of your pool's Edge server. For those of you who have not studied this change, I'll explain it briefly here.

How lyncdiscover works:
Expanding on Lync Server 2010's Mobility feature which came as a Cumulative Update (CU), the lyncdiscover process involves the publishing of two DNS names pointing to either your reverse proxy (when Internet facing) or to your front-end server/HLB VIP (when LAN facing).

The Lync 2013 client will query lyncdiscoverinternal.contoso.com first and, if it resolves and can connect, it assumes your endpoint is on the inside of the network (LAN). If the host name is unresolvable, we query lyncdiscover.contoso.com and connect. In either case, if we authenticate, an XML response is given to the endpoint containing the user's home pool SIP proxy and internal pool name, among other things.

We then attempt a connection based on the response given in the XML document and cache the successful connection in a file called "EndpointConfiguration.cache" located in the user's profile.

The key takeaway is that the response is customized to your authenticated query and cached by the client. This is important when you have multiple Edge servers in your topology capable of proxying remote users to internal pools since you now have the ability to direct users to their regional Edge resources.

NOTE: This behavior is observed in the Lync 2010/2013 mobility clients, the Lync 2013 desktop client, and the Modern UI client for Windows 8.x. However only the desktop client will fall back to SRV lookup.

How does this compare with Lync 2010?
When we compare this capability to Lync 2010, the client would always query a DNS SRV record first (i.e. _sip._tls.contoso.com) returning a static response. You could publish multiple SRV records with varying weights however this would always return a static ordered list. The result would yield all users in the organization signing into the same Edge proxy unless an outage occurred.
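
If you want to see what a client has to work with, both the lyncdiscover names and the legacy SRV record can be checked from PowerShell with Resolve-DnsName (available on Windows 8/Server 2012 and later); contoso.com below is a placeholder for your SIP domain:

## Check the automatic discovery records for the SIP domain
Resolve-DnsName -Name "lyncdiscoverinternal.contoso.com" -ErrorAction SilentlyContinue
Resolve-DnsName -Name "lyncdiscover.contoso.com" -ErrorAction SilentlyContinue

## Check the Lync 2010-style SRV record used as a fallback by the desktop client
Resolve-DnsName -Name "_sip._tls.contoso.com" -Type SRV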

The following TechNet article explains the behavior well: http://technet.microsoft.com/en-us/library/gg398758.aspx

Failure scenarios and recovery options:
So you've done your reading about the sign in process and differences between Lync 2010 and 2013 and by now you're wondering what happens if the various components fail.

Failure: Query to lyncdiscover fails
Behavior: Fall back to SRV lookup
Recovery options: Use a GeoDNS solution to provide a response to the query

Failure: SRV lookup fails
Behavior: Query another SRV record in DNS
Recovery options: Plan for a DNS SRV record for each Edge proxy, adjusting weights to give priority

Failure: Next hop from Edge proxy fails
Behavior: Client sign-in will fail
Recovery options: No automatic recovery is possible

Example scenario #1:
Let's say you have a front-end pool and Edge pool in New York and London. All systems are up except for the Edge in New York. Alice is a user homed on the New York pool and is connected externally with the Lync 2010 client. She connects for the first time and queries the DNS SRV record "_sip._tls.contoso.com" and receives a weighted response with New York's Edge proxy in priority sequence followed by London's Edge proxy. Alice queries the FQDN provided in the SRV record and attempts a connection to New York's Edge however this proxy is down. The Lync 2010 client will use the second, higher weighted DNS SRV record and attempt a connection to London's Edge proxy. Alice's connection is proxied to the next hop server which is London's front-end pool which will also proxy her connection to New York's front-end pool. Lastly, the Lync client caches the London Edge proxy and internal registrar from New York in her cache (EndpointConfiguration.cache file).

Now in this scenario Alice has signed in and will appear fully functional; however, she won't be able to establish media (voice/video/desktop or app sharing) with a user who doesn't have a public IP or an IP on the same LAN. There are a couple of reasons for this in play, one of which has to do with the New York pool's association with an Edge pool for media. The media relationship is established in the Lync topology file when the administrator creates the pool and makes a static association. Lync users authenticating to the pool receive a Media Relay Access Server (MRAS) token which is used in the STUN/ICE/TURN candidate exchange/negotiation, making it possible to establish media sessions across NAT devices. If the Lync Edge pool in New York is down, the front-end pool cannot obtain an MRAS token for the user, and as such there will be no relay candidate for media establishment. When the Lync client negotiates media with another endpoint/user, a check is first performed using the host IP (LAN/WiFi) to see if the two users are on the same network or are at least both directly reachable (both using public IPs). In the case where both IPs are routable, media negotiation should succeed.

Example scenario #2:
Given the above example topology of New York and London sites, assume all infrastructure is operational with the exception of the London pool, which incurred an outage overnight. Bob is a user homed in the London pool and is connecting externally using the Lync 2010 client. This isn't Bob's first time connecting, so we look at the EndpointConfiguration.cache file for the internal pool name/IP and attempt a connection, since we don't know if Bob is internal or external. This will fail since Bob is Internet-facing in this scenario. The Lync client will then use the same cache file and attempt a connection to the London Edge proxy/IP, which will succeed; however, since the Edge server cannot communicate with its next hop (the London pool), his sign in fails.

It's important to understand there is no automatic recovery option available here. Since the connection to the Lync Edge in London was successful, but the next hop pool is down, Bob's Lync client will not fall back to a DNS SRV lookup. The Lync 2013 client behavior is the same given this scenario regardless of an administrator initiating a pool failover.

So you're probably wondering, if Bob was the CTO and you absolutely needed him logged in, how could you make this happen? Well you could obviously walk him through the manual configuration of connecting Lync to the New York Edge server by specifying the Edge FQDN (i.e. lync2013nyc-access-edge.contoso.com:443) however the next hop front-end pool wouldn't be able to register the user until a pool failover was invoked. This puts pressure on the Lync administrator to invoke the pool failover command. Ultimately some intervention is required and this may include shutting down the VIP for the Edge pool or the servers themselves. Alternatively you could change the next hop to an available server and publish the topology. Just remember your firewall rules need to permit inbound TCP/5061 to the next hop.

Sticky Edge?
As I dug into this further for a customer of mine recently I thought about the scenario where a Lync remote user has signed into an Edge server which wasn't their "home Edge". In Alice's case I thought the London Edge proxy would be cached in the EndpointConfiguration.cache file making it nearly impossible for her to ever connect back to her home Edge proxy even if the service was restored.

Consider Alice's scenario again but this time on a large scale.....you have a large population of 150,000 users signing into an Edge pool in London due to a failure in New York which resulted in their DNS SRV lookup signing them into the London Edge pool. In what case would they ever sign back into the New York proxy if the Lync client stored/cached their Edge proxy sign in? In a worst case scenario you could potentially have the majority of those 150,000 users signing into London's Edge pool along with the London population, without any realistic way of getting them back to New York!

The Good News!
Well the good news is that we "expire" Edge proxy information in the EndpointConfiguration.cache file if it has been more than 24 hours since the last access time of the file. The same is true for internal registrar pools if the time has been greater than 14 days. This means if Alice, being a New York user, finds the London Edge through DNS SRV query and signs in successfully, she will stay signed into the London Edge for 24 hours no matter how many times she signs in or out regardless of the New York Edge server's state. Once this time has expired, we will query Lyncdiscover again and find the New York Edge and sign Alice into that proxy if available.
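
If you can't wait out the 24 hours and need a client to re-run discovery immediately, one option is to sign out and delete the cache file so the next sign-in starts fresh. The path below is the typical Lync 2013 per-user location and is an assumption for your build; substitute the user's SIP address:

## Remove the cached endpoint configuration for a given SIP profile so the next sign-in performs a full lyncdiscover query
Remove-Item "$env:LOCALAPPDATA\Microsoft\Office\15.0\Lync\sip_alice@contoso.com\EndpointConfiguration.cache" -ErrorAction SilentlyContinue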

Some Guidance...
If you require full redundancy of remote user sign in between two geographic locations with automatic failover and sign in, you won't ever achieve this. The biggest hurdle is the next hop from Edge to next proxy or registrar and even from TMG (or similar reverse proxy) to its next hop. The best you can do is make the DNS names available through GeoDNS/GSLB and ensure the Lync infrastructure is redundant (Enterprise Pools with redundant servers/NICs/etc.).
  • Use sub-domains for your web farm FQDNs (i.e. externalpoolname.lyncweb.contoso.com). Using something like "lyncweb" as a sub from your SIP domain allows you to delegate the subdomain to GeoDNS.
  • Publish DNS SRV records for all possible Edge proxies for external users.
  • Publish DNS SRV records for all possible internal registrars for internal users.
  • Test everything!
Hopefully this sheds some light on the external access behavior of the various failure scenarios. Cheers!

Friday, March 7, 2014

RESOLVED: An attempt to route to an Exchange UM server failed and OWA/IM integration with Lync Server 2013 and Exchange Server 2013

I recently worked on an engagement where the integration of Exchange Server 2013 and Lync Server 2013 was performed including all the new bits like Unified Contact Store, Archiving, etc. During the course of the engagement we struggled with OWA IM integration configuration and found discrepancies in the guidance from Microsoft and other bloggers out there.

So I'd like to take the time to share my experiences on two fronts; in resolving the error(s) associated with Lync/UM integration, and the OWA IM integration debacle.

So first, UM integration...

Everything was working fine from a UM perspective until we started messing with OWA IM integration troubleshooting. During this effort we decided to remove the trusted application pool representing the FQDN of the mail environment (i.e. mail.contoso.com) and all the trusted application servers within the pool. The topology was published and no issues were found; this didn't solve our OWA IM integration issue either, by the way.

One issue I did see was that the Lync environment wasn't discovering Exchange correctly and had removed the servers from its internal "Trusted Servers" list. As a last troubleshooting step I restarted the front-end service on the primary Lync Standard Edition server; however, this again didn't resolve the OWA IM issue. So I added the trusted pool and servers back, published the topology, restarted the front-end service on the primary Lync server, and validated they were back in the "Trusted Servers" list again.

I dropped the troubleshooting effort for the night only to realize the next morning that UM was broken across the board except for users homed on the primary Lync Standard Edition server. The only "change" made earlier that night was the topology change but how could this affect UM?

The following error was observed from my local Lync client when testing UM:
ms-diagnostics: 15030;reason="Failed to route to Exchange Server";source="lync01.contoso.local";dialplan="dp01.contoso.local";pstnreroutingenabled="false";appName="ExumRouting"
Server: ExumRouting/5.0.0.0
 The following error showed up on the Lync server:
An attempt to route to an Exchange UM server failed.
The attempt failed with response code 504: mb01.contoso.local.
Request Target: [dp01@mb01.contoso.local], Call Id: [0d493c0beac55e4d1938b96d9f0488dc].
Failure occurrences: 27, since 2014-03-07 8:15:44 AM.
Cause: An attempt to route to an Exchange UM server failed because the UM server was unable to process the request or did not respond within the allotted time.
Resolution:
Check this server is correctly configured to point to the appropriate Exchange UM server. Also check whether the Exchange UM server is up and whether it in turn is also properly configured.
Additionally, I found the following error through CLS Logging trace:
TL_ERROR(TF_CONNECTION) [se01\se01]08E8.0CFC::03/07/2014-18:05:22.538.000002E9 (SIPStack,SIPAdminLog::WriteConnectionEvent:SIPAdminLog.cpp(389)) [3492250981] $$begin_record
Severity: error
Text: The peer is not a configured server on this network interface
Peer-IP: 10.222.10.69:55561
Transport: TLS
Result-Code: 0xc3e93d6a SIPPROXY_E_CONNECTION_UNKNOWN_SERVER
Data: fqdn="se01.contoso.local"
Voicemail was working fine for users on the one Lync server but not for users on the SBS or secondary Lync Standard Edition server. Could the removal of the trusted application servers and pool from topology break UM? Well as a step to fix this I ran the ExchUcUtil.ps1 script to integrate the two environments together again and triple checked all the configuration in UM to see if something was missed or changed by accident. None of this worked.

Even though I had reverted my change related to the trusted application pool and servers being removed from topology builder, I had to restart the front-end services on the secondary Lync Standard Edition server AND the SBS for voicemail to start working again.

So the moral of the story, even though you don't think the change you're making is conceivably a client impacting event, it could be. I should have tested EVERYTHING before calling it a night.

Now onto the OWA IM integration issue...

TechNet (http://technet.microsoft.com/en-us/library/jj688055.aspx) claims there is no need to create a trusted application server or pool (if more than one E2013 FE) if the 2013 front-end/back-end roles are collocated and if you have a SIPURI dialplan in UM. In every case where I've configured this type of 2013 to 2013 interoperability, I always need to define the trusted application pool. In other words, the autodiscover from Lync to Exchange never works correctly. Once I populate the trusted application pool, OWA IM lights up!
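
For reference, defining that trusted application pool can also be done from the Lync Server Management Shell instead of Topology Builder. This is only a sketch; the pool FQDN, registrar, site ID, and server name below are placeholders you'd replace with your own values:

## Define the Exchange OWA farm FQDN as a trusted application pool anchored to an existing registrar
New-CsTrustedApplicationPool -Identity "mail.contoso.com" -Registrar "lync01.contoso.local" -Site 1 -RequiresReplication $false

## Add each Exchange server to the pool, then publish the change
New-CsTrustedApplicationComputer -Identity "mb01.contoso.local" -Pool "mail.contoso.com"
Enable-CsTopology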

It seemed obvious to me the linkage between the "Trusted Servers" list identified by the Lync server in Event ID 33022 had something to do with OWA IM not working. Adding it back into topology was reflected by Event ID 33022 and the functioning OWA experience.

Anyway, I invite your comments and feedback on these. Cheers!