Friday, November 27, 2009

Issues using DPM 2010 with an Exchange 2010 DAG

I've been "playing"... well, not really. I've been testing (ya, that's it, testing) Microsoft's beta version of System Center Data Protection Manager 2010 in conjunction with Microsoft Exchange Server 2010. The tests I've been conducting involve backing up and restoring Exchange servers in a Database Availability Group (DAG). I've also been trying to break DAG replication by pulling the network cable from a server and watching the outcome. For the most part everything is working quite well; however, I did come across an interesting issue which deserves discussion.

Exchange 2010 Setup: 4 server DAG with 3 database copies and 1 "lag" copy.
DPM Setup: 1 DPM 2010 server protecting the active copies of my 3 databases in the DAG.
Test procedure: Moved the active copy of "Database01" from one server to another, waited about 1-2 minutes, then unplugged the network cable from the server now hosting the active copy.

The first thing I noticed was that my lag copy of the database was in a "suspended" state. I've seen this once before but wasn't sure why it happened.


(Database "suspended")

I logged into my server hosting the lag copy and looked into Event Viewer. I found an Event ID 117 which stated "the copy of Database01 on this server experienced an error that requires it to be reseeded".


(Database needs to be reseeded)

A few lines down I found Event ID 3145, which stated that a missing log file had caused the incremental reseed to fail.


(reseed is required due to missing log file)

I decided the only way to get back on track was to run the "update database copy" command through the EMC.

I then selected my source server...

I now have a healthy copy of my database... but why did this happen?

My only rational explanation is that DPM ran a synchronization which truncated the logs on the server hosting the active database (jcsexchvan). When I performed a "move active database" as part of my tests, the active copy moved to my other server (jcsexchcal). I then physically failed that server and my third server (jcsexchedm) picked up the active copy. Somewhere in this transition the log file truncation process kicked off on "jcsexchedm". At that point the lag copy on server "jcsexchtor" tried to pull the log files that the last "Active Manager" in the DAG had told it to get, and couldn't find them.

This scenario seemed to be a bit of a "perfect storm" case which I was happy to capture. However, during the writing of this article I've tried it again (move active copy, then fail that server) and I get the same result!!

I'd like to get to the bottom of this, so if anyone has any insight I'd appreciate the feedback. The only thing I can say for sure is that you'd better have some form of monitoring software, like System Center Operations Manager 2007 R2, to catch these events. Otherwise you could be in serious trouble.
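
To make the monitoring point a bit more concrete, here is a minimal Python sketch of the kind of check a monitoring tool would do: flag any event whose ID matches the two I hit above (117 and 3145). The EventRecord shape and the sample data are my own assumptions for illustration. This is not how Operations Manager works internally, and it does not read a real Windows event log.

```python
from dataclasses import dataclass

# Event IDs taken from this post: 117 (copy must be reseeded) and
# 3145 (incremental reseed failed because a log file is missing).
RESEED_EVENT_IDS = {117, 3145}


@dataclass
class EventRecord:
    """Hypothetical, simplified view of one exported event log entry."""
    event_id: int
    server: str
    message: str


def find_reseed_alerts(events, watch_ids=RESEED_EVENT_IDS):
    """Return the events whose ID is in watch_ids, keeping their order."""
    return [event for event in events if event.event_id in watch_ids]


if __name__ == "__main__":
    # Sample records invented for illustration only.
    sample = [
        EventRecord(117, "jcsexchtor", "Database01 copy requires a reseed"),
        EventRecord(1000, "jcsexchvan", "unrelated informational event"),
        EventRecord(3145, "jcsexchtor", "incremental reseed failed, log missing"),
    ]
    for alert in find_reseed_alerts(sample):
        print(f"ALERT {alert.event_id} on {alert.server}: {alert.message}")
```

In a real setup the records would come from whatever exports your event logs, and the alert would go somewhere more useful than the console.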

Cheers.

Sunday, November 8, 2009

Little known fact about load balancing OCS 2007 R2 Edge servers...

I spent some time recently helping a team member with an issue relating to OCS 2007 R2 Edge servers and F5 Networks load balancers. The effort involved troubleshooting Live Meeting connectivity for remote users. The behavior was that a remote user would connect to the Live Meeting briefly, then disconnect.

A call to Microsoft support resulted in the engineer indicating that port 8057 shouldn't be load balanced. Upon posting to an internal Microsoft forum site, I was informed that this is in fact true. The Microsoft employee also indicated that the way to configure the web conferencing Edge servers is to list the actual server names in the internal FQDN entry of the OCS pool properties. To do this, follow these steps:
  1. Right-click your Enterprise Pool and choose Properties, then Web Conferencing Properties.
  2. You need to add an entry for each web conferencing Edge server in your environment. Click the Add button and type the name of the server in the internal dialog box (e.g. serverA.dmz.contoso.com). Then type the external load balanced "shared name" (e.g. webconf.contoso.com).
  3. Repeat the same process for each Edge server you're load balancing, making sure the internal name represents the actual server name.
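
The rule behind those steps can be sketched in a few lines of Python: every Edge server gets its own entry, the internal name is the real server name, and the external name is the shared one. The function name, the data layout, and the second server name (serverB.dmz.contoso.com) are hypothetical, made up for illustration. OCS does not expose its settings this way as far as I know, so treat this as a checklist in code form.

```python
def check_web_conf_entries(entries, edge_servers, shared_external_fqdn):
    """Check (internal_fqdn, external_fqdn) pairs against the rule above.

    Returns a list of problem descriptions; an empty list means the
    entries are consistent with the steps in this post.
    """
    problems = []
    actual = {name.lower() for name in edge_servers}
    listed = {internal.lower() for internal, _ in entries}

    for internal, external in entries:
        # The internal name must be an actual server, not a shared/VIP name.
        if internal.lower() not in actual:
            problems.append(f"{internal}: not an actual Edge server name")
        # The external name is the load balanced "shared name".
        if external.lower() != shared_external_fqdn.lower():
            problems.append(f"{internal}: external name is {external}")

    # Every Edge server behind the load balancer needs its own entry.
    for server in edge_servers:
        if server.lower() not in listed:
            problems.append(f"{server}: no entry for this Edge server")

    return problems


if __name__ == "__main__":
    servers = ["serverA.dmz.contoso.com", "serverB.dmz.contoso.com"]
    entries = [(name, "webconf.contoso.com") for name in servers]
    print(check_web_conf_entries(entries, servers, "webconf.contoso.com"))
```

With one entry per server and the shared external name on each, the list of problems comes back empty.
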
Now, thinking about this setup, you would assume the certificate bound to the internal interface should represent the "shared name" of the internal load balanced virtual IP, right? Well, I'm told the answer is no. If your Edge servers' internal interface FQDN is "ocsedge.contoso.com", you don't need subject alternative names for each server along with it. What strikes me as odd here is that I've just spelled out the need for specifying each Edge server's FQDN within the pool settings, yet you don't have to "line up" the names in the certificate.

If you investigate the documented firewall rules on the Microsoft web site, port 8057 over MTLS is used. I'm still puzzled as to how you can have MTLS working without the certificate names matching the name(s) defined in the pool settings.

Another poorly documented configuration is the firewall rule required to make Live Meeting work with hardware load balancers and multiple Edge servers. Yes, you need to permit port 8057 from "any" to the DMZ but don't send it through your load balancer. Make sure the rule permits 8057 traffic from the LAN to the internal interfaces of each Edge server.
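
A quick way to sanity-check that rule from a machine on the LAN is to try a plain TCP connection to port 8057 on each Edge server's internal interface directly, bypassing the load balancer. Here's a minimal Python sketch of that. The function names are my own and the host names you pass in are up to you. A successful connect only proves the firewall path is open; it says nothing about MTLS or the certificate.

```python
import socket

# Port discussed in this post; it should reach each Edge server directly.
WEB_CONF_PORT = 8057


def can_connect(host, port=WEB_CONF_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_edge_servers(hosts, port=WEB_CONF_PORT, timeout=3.0):
    """Map each internal Edge server name to its reachability on the port."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

For example, check_edge_servers(["serverA.dmz.contoso.com"]) returns a dictionary with True or False for each name. Run it from the same network segment as your pool servers so it exercises the same firewall rule.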

That's all for now.

Cheers.