Friday, November 27, 2009

Issues using DPM 2010 with an Exchange 2010 DAG

I've been "playing", well not really. I've been testing....ya that's it....testing Microsoft's beta version of System Center Data Protection Manager 2010 in conjunction with Microsoft Exchange Server 2010. The tests I've been conducting involve backing up and restoring Exchange servers in a Database Availability Group (DAG). I've also been trying to break the DAG replication by pulling the network cable from a server and watching the outcome. For the most part everything is working quite well however I did come across an interesting issue which deserves discussion.

Exchange 2010 Setup: 4 server DAG with 3 database copies and 1 "lag" copy.
DPM Setup: 1 DPM 2010 server protecting the active copies of my 3 databases in the DAG.
Test procedure: Moved active copy of "Database01" from one server to another (waited about 1-2 minutes) then unplugged the network cable from the server hosting the active copy.

I noticed first off that my lag copy of the database was in a "suspended" state. I've seen this once before but wasn't sure what or why this happened.

(Database "suspended")

I logged into my server hosting the lag copy and looked into Event Viewer. I found an Event ID 117 which stated "the copy of Database01 on this server experienced an error that requires it to be reseeded".

(Database needs to be reseeded)

A few lines down I found Event ID 3145 which stated there was a missing log file which has caused the incremental reseed to fail.

(reseed is required due to missing log file)

I decided the only way to get back on track is to perform an "update database copy" command through the EMC.

I then selected my source server...

I now have a healthy copy of my why did this happen?

My only rational thought on this one is that DPM ran a synchronization which truncated the logs on my server hosting the active database (jcsexchvan). When I perfomed a "move active database" as part of my tests, it moved to my other server (jcsexchcal). I then performed a physical failure of that server and my 3rd server (jcsexchedm) picked up the active copy. Somewhere in this transition the log file truncation process kicked off on "jcsexchedm". It was at that point my lag copy database on server "jcsexchtor" tried to pull the necessary log files and couldn't find them because the last "Active Manager" in the DAG told it to get them.

This scenario seemed to be a bit of a "perfect storm" case which I was happy to capture. However, during the writing of this article I've tried it again (move active copy, then fail that server) and I get the same result!!

I'd like to get to the bottom of this so if anyone has any insight I'd appreciate feedback. The only thing I can say for sure here is that you'd better make sure you have some form of monitoring software like System Center Operations Manager 2007 R2 to catch these events otherwise you could be in serious trouble.


No comments:

Post a Comment