Wednesday, January 19, 2011

Frustrating AD problems

I've found myself thrown into a problem with a couple of AD servers that has me beating my head against the wall. I've read through quite a few entries on the web, but haven't made much progress.

The primary server apparently lost power and the AD database was corrupted. Attempts to recover it with ntdsutil failed, and while I was able to do a system state restore, the backup was about three months old. On top of that, the two servers haven't been replicating properly for a few months now. When I run dcdiag on either server, I get warnings that the replication latency exceeds the tombstone lifetime. The fact that both machines are apparently past the tombstone lifetime worries me, and I'm wondering whether it is still possible to salvage this domain or whether I should just burn it down and start from scratch.
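
The arithmetic behind that worry can be sketched in a few lines of Python. This is only an illustration: the 60-day tombstone lifetime is the figure quoted in the replies below (the real value is configurable per forest), the 90-day backup age is an approximation of "about three months", and `backup_is_usable` is a made-up helper name.

```python
from datetime import date, timedelta


def backup_is_usable(backup_date, today, tombstone_days=60):
    """Return True if the backup is younger than the tombstone lifetime."""
    return (today - backup_date) < timedelta(days=tombstone_days)


# Assumed dates: the post date, and a backup taken roughly three months earlier.
post_date = date(2011, 1, 19)
backup_date = post_date - timedelta(days=90)

print(backup_is_usable(backup_date, post_date))  # False
```

The same comparison applies to the time since the two DCs last replicated successfully.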

Here are some of the problems with the first DC:

  • Nobody can log in with their domain credentials.
  • I can't add a new user in the AD Users and Computers snap-in. I get an error saying 'The directory service was unable to allocate a relative identifier'.
  • There are messages in eventvwr about being unable to execute group policy client-side extension scripts.

To give some background, we have a small AD domain running with two sites over three subnets (Site 1: 192.168.1.0/24; Site 2: 192.168.2.0/24 and 192.168.3.0/24). Each site has a 2003 server running AD, with the first site's server holding the FSMO roles. For what it's worth, the two DCs are Server 2003 virtual machines running on Linux VMware Server 1.x hosts.
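
For readers who want to sanity-check which site a given client address should map to, here is a minimal Python sketch of the layout described above. The subnet values come from the post; the dictionary and the `site_for` helper are hypothetical names for illustration, not anything AD itself exposes.

```python
import ipaddress

# Site-to-subnet layout as described in the post.
SITE_SUBNETS = {
    "Site 1": ["192.168.1.0/24"],
    "Site 2": ["192.168.2.0/24", "192.168.3.0/24"],
}


def site_for(ip):
    """Return the site whose subnets contain ip, or None if no subnet matches."""
    addr = ipaddress.ip_address(ip)
    for site, subnets in SITE_SUBNETS.items():
        if any(addr in ipaddress.ip_network(subnet) for subnet in subnets):
            return site
    return None


print(site_for("192.168.3.17"))  # Site 2
```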

I've checked DNS and it seems to be running fine. netdom query fsmo returns the first server when run on either machine. I can ping both DCs from each other by IP address and by FQDN. The Windows firewalls are turned off and I don't see anything untoward in the router firewalls. Still, there seems to be some sort of connectivity issue that I can't pin down.
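
Ping only proves ICMP reachability, so one way to rule connectivity in or out is to test the TCP ports that DCs commonly use to talk to each other. The sketch below is an assumption-laden illustration, not a replacement for dcdiag: the port list is a common subset (AD replication also uses dynamically assigned RPC ports that are not covered here), and `check_ports` is a hypothetical helper name.

```python
import socket

# Common TCP ports used between domain controllers (a subset, not exhaustive).
AD_TCP_PORTS = {
    53: "DNS",
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
    3268: "Global Catalog",
}


def check_ports(host, ports, timeout=2.0):
    """Return {port: True/False} depending on whether a TCP connect succeeds."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and name resolution failures.
            results[port] = False
    return results
```

Calling `check_ports(other_dc_fqdn, AD_TCP_PORTS)` from each DC, with the other DC's FQDN substituted in, shows at a glance whether any of these ports is blocked in either direction.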

  • The tombstone lifetime is how long deleted objects are kept. Once a server has been offline for longer than that, or has been restored from a backup older than that, the rest of the domain will refuse to replicate with it.

    I am a little confused: do you have three DCs or two? Either way, can any users in any site log on with their domain creds? Is the bad DC the one holding all of the FSMO roles?

    The quick answer would be to try gracefully demoting the bad DC and provisioning a new DC for that site.

    From RichardD
  • This is Not Good. Computer account passwords only have a lifetime of 30 days by default, so your 3 month old restore means that your computer accounts are all out of whack. This explains at least some of the symptoms you have been seeing.

    You've also gone beyond the tombstone lifetime, which is 60 days, as RichardD has covered.

    There is some good general info on the backup and restore procedure here that I recommend you read before proceeding: clicky.

    OK, deep breath.

    The best scenario is if you have a good recent backup of one of your other DCs (and by recent I mean within the last week). Even one from the other site will do. Shut down the just-restored DC, forcibly remove it from the domain using NTDSUTIL (to make sure it doesn't replicate off any badness during a graceful demotion), unplug it from the network, and reinstall Windows as a precaution to ensure that you don't accidentally bring it back online. Then restore the other DC from the good backup and seize the FSMO roles if required. Finally, rejoin the bad DC from scratch (dcpromo, "new domain controller in an existing domain", etc). Let everything replicate and you should be good.

    Worst case scenario is that you don't have a good backup, but let's not go there yet, eh. :)

    From mh
  • I would have seized the FSMO roles from another DC, and turned up a new DC with a different name in the site with the failed DC. The only time I would use system state to recover AD would be if I did not have another DC. If an authoritative restore is performed from a stale system state backup, this introduces even more issues that are really difficult to solve. This will not help you but may be of some help to others that read this. Sorry.

    Using Ntdsutil.exe to transfer or seize FSMO roles to a domain controller

    http://support.microsoft.com/kb/255504

    From Greg Askew
  • First, a point of terminology: there are no "primary servers" in Active Directory. The "PDC Emulator" FSMO role holder relates to time sync, NT 4.0 compatibility, and some specific user password validation behaviours. The AD database on the PDC Emulator FSMO role holder isn't special otherwise, and it's not some magic "primary" copy.

    Your mistake was restoring the backup of the failed DC when you had other DCs out there. The proper strategy, as other posters have pointed out, would've been to seize the FSMO roles held by the failed DC to a working DC, perform a metadata cleanup to remove the failed DC from the directory, then bring up a new machine named the same as the failed DC. (If the failed DC was doing other things, such as file and print sharing, then you'd be stuck rebuilding that, too. That's one more reason why Microsoft doesn't recommend using DCs for other roles...)

    You don't have connectivity problems. Rather, you broke the rules re: tombstones by performing what had to be an authoritative restore of a backup that was much, much too old. Now you're getting RID pool allocation issues.

    It goes w/o saying that you need to get your backups in order before you do anything else. Don't overwrite any old backups, either, until this is squared away. You might need them.

    I'd transfer the FSMO roles away from the malfunctioning DC and then demote it back to being a member server computer. If it won't demote cleanly, seize the FSMO roles to another DC, forcibly demote the malfunctioning DC, perform a metadata cleanup, and reinstall the OS on the malfunctioning DC (you can keep any data files / folders intact on it while you do that, but you'll have to reinstall any apps).

    Before you re-promote the failing DC, solve your replication issues between the other DCs. If necessary, demote one of them so that you end up with only one DC holding all the FSMO roles.

    I've got half a mind to be worried that your RID pool master has gone "back in time" and might start issuing RIDs that were already issued. Having never seen the source code to AD, I don't know if it has any mechanism to "detect" when the RID pool master has been rolled back in time or not. I recall a Customer issue that smacked of such a problem (they restored an old RID pool master backup, and found that when they created new user accounts afterwards, existing user accounts would "disappear" from the AD), so it's possible.
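
A rough way to reason about that last worry is sketched below in Python. It is not a description of how AD itself detects or handles a rollback. It assumes you have already read the RID Manager's rIDAvailablePool value and a list of existing account SIDs by some other means, and it relies on Microsoft's documented layout of that value (low 32 bits: the next RID to be handed out in a pool; high 32 bits: the maximum). All function names and sample values are hypothetical.

```python
def split_rid_available_pool(value):
    """Split the 64-bit rIDAvailablePool value into (next_rid, max_rid)."""
    next_rid = value & 0xFFFFFFFF  # low 32 bits
    max_rid = value >> 32          # high 32 bits
    return next_rid, max_rid


def rid_of(sid):
    """Return the trailing RID of a SID string such as S-1-5-21-...-1105."""
    return int(sid.rsplit("-", 1)[1])


def rid_master_looks_rolled_back(rid_available_pool, existing_sids):
    """True if an existing account already uses a RID the master has yet to hand out."""
    next_rid, _ = split_rid_available_pool(rid_available_pool)
    highest_issued = max(rid_of(sid) for sid in existing_sids)
    return highest_issued >= next_rid


# Made-up values for illustration only.
pool = (1073741823 << 32) | 2100
sids = [
    "S-1-5-21-1111111111-2222222222-3333333333-1105",
    "S-1-5-21-1111111111-2222222222-3333333333-2650",
]
print(rid_master_looks_rolled_back(pool, sids))  # True
```

If a check like this comes back true, new accounts could be given RIDs that are already in use, which would fit the "disappearing accounts" symptom described above.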
