All Exchange 2013 Servers become unusable with permissions errors


Overview

The title might sound a bit scary, but this one was actually a pretty easy fix. It’s a lesson in not digging yourself into a deeper hole than you’re already in while troubleshooting. I wish I would’ve had this lesson 10 years ago 🙂

Scenario

The customer was unable to log in to OWA, EAC, or Exchange Management Shell on any Exchange 2013 SP1 server in their environment. The errors varied quite a bit; when logging into OWA they would get:

“Something went wrong…

A mailbox could not be found for NT AUTHORITY\SYSTEM.”

When trying to open EMS you would receive a wall of red text, essentially complaining about a 500 Internal Server Error from IIS.

In the Application logs I would see an MsExchange BackEndRehydration Event ID 3002 error stating that “NT AUTHORITY\SYSTEM does not have token serialization permission”.

Something definitely seemed to be wrong with Active Directory, as this was occurring on all 3 of the customer’s Exchange 2013 servers, one of which was a DC (more on that later).

Resolution

So one of the first questions I like to ask customers is “when was the last time this was working?” After a bit of investigation I was able to find out that the customer had recently been trying, unsuccessfully, to create a DAG from their 3 Exchange 2013 SP1 servers. They could get two of the nodes to join but the 3rd would not (the one that was also a DC). The customer thought it was a permissions issue, so they had been “making some changes in AD” to try to resolve it. I asked if those changes were documented; the silence was my answer….. 🙂

However, this current issue was affecting all Exchange 2013 servers & not just the one that’s also a DC, so I was a bit perplexed as to what could’ve caused this.

So a bit of time on Bing searching for Token Serialization errors brought me to MS KB2898571. The KB stated that if the Exchange Server computer account was a member of a restricted group then Token Serialization Permissions would be set to Deny for it. These Restricted Groups are:

  • Domain Admins
  • Schema Admins
  • Enterprise Admins
  • Organization Management

The KB mentioned running gpresult /scope computer /r on the Exchange servers to see if they were showing as members of any of the restricted groups (see the article for further detail & screenshots of the commands). I ran this command on all 3 Exchange 2013 servers & it showed their computer accounts were all members of the Domain Admins group. In Active Directory Users & Computers I looked at each Exchange Server computer account (on the Member Of tab) & unfortunately there were no direct memberships in any of the restricted groups, so I had to search the membership chain of each common group that the servers were members of. The common groups that all Exchange Server computer accounts were members of were:

  • Domain Computers
  • Exchange Install Domain Servers
  • Exchange Servers
  • Exchange Trusted Subsystem
  • Managed Availability Servers

Eventually I found that the Exchange Install Domain Servers group had been added as a member of the Domain Admins group during the customer’s troubleshooting efforts to get all their servers added as DAG members. I removed the Exchange Install Domain Servers group from the Domain Admins group & then rebooted all of the Exchange servers. After the reboots the issues went away & the customer was able to access OWA/EMS.
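
If you need to confirm or undo this kind of nested membership from the shell, here’s a minimal sketch assuming the ActiveDirectory module (RSAT) is available; the server name EX01 is just a placeholder:

  # Requires the ActiveDirectory module (RSAT-AD-PowerShell)
  Import-Module ActiveDirectory

  # Recursively expand Domain Admins & look for Exchange-related accounts
  Get-ADGroupMember -Identity "Domain Admins" -Recursive | Select-Object Name, objectClass, distinguishedName

  # List the direct group memberships of an Exchange server's computer account
  Get-ADPrincipalGroupMembership -Identity 'EX01$' | Select-Object Name

  # Once the offending nesting is confirmed, remove the group from Domain Admins
  Remove-ADGroupMember -Identity "Domain Admins" -Members "Exchange Install Domain Servers" -Confirm:$false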

Now this is where I had to explain to the customer that it was not supported to have an Exchange Server that was also a Domain Controller as a member of a Failover Cluster/DAG. This was why they were having such a hard time adding their Exchange server/DC as a member of their DAG.

Conclusion

I have a saying that I came up with called “troubleblasting”; i.e. “John doesn’t troubleshoot, he troubleblasts!” It started out as just a cheesy joke amongst colleagues back in college, but I’ve started to realize just how dangerous it can be. It’s that state you can sometimes get into when you’re desperate, past the point of documenting anything you’re doing out of frustration, & just throwing anything you can up against the wall to see what sticks & resolves your issue. Sometimes it can work out for you, but sometimes it can leave you in a state where you’re worse off than when you started. Let this be a lesson to take a breath, re-state what you’re trying to accomplish, & ask whether what you’re doing is really the right thing given the situation. In this case, an environment was brought to its knees because a bit of pre-reading on supportability was not done beforehand & a permission change adversely affected all Exchange 2013 servers.

If you can make it to Exchange Connections in Las Vegas this September, I’ll be presenting a session on “Advanced troubleshooting procedures & tools for Exchange 2013”. Hopefully I can share some tips/tools from the field that have proven useful & can keep you from resorting to the “Troubleblasting Cannon of Desperation” 🙂

Quick Exchange 2013 DAG Setup Guide


Background:

Had a co-worker ask for some basic DAG setup instructions in Exchange 2013 so I wrote a quick little guide. This covers the high points around creating the DAG as well as configuring the DAG member NICs & networks.

Step 1 – Pre-Stage DAG Computer Account
Reference. When deploying a DAG on Exchange servers running Windows Server 2012 you need to pre-stage the DAG computer account. The above link points to the official TechNet article for doing this, but here are the basics of it:

  • Create a Computer Account in AD with the name of the DAG. For example, DAG-A.
  • Disable the Computer Account.
  • In Active Directory Users & Computers click View>Advanced Features. Go to the Computer Account & select Properties>Security tab.
  • From here you have two options: either grant the Exchange Trusted Subsystem group Full Control permissions over the DAG Computer Account Object, or give the Computer Account of the first node you plan to join to the DAG Full Control permissions over the DAG Computer Account Object.
  • Reference2
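
As a rough shell equivalent of the steps above (a sketch only; the OU path, domain name, and use of dsacls are my own choices here, and the Full Control grant can just as easily be done through the ADUC GUI as described):

  # Requires the ActiveDirectory module (RSAT-AD-PowerShell)
  Import-Module ActiveDirectory

  # Create the DAG computer account (CNO) in a disabled state, e.g. DAG-A
  New-ADComputer -Name "DAG-A" -SamAccountName "DAG-A" -Path "OU=Exchange,DC=contoso,DC=com" -Enabled $false

  # Grant the Exchange Trusted Subsystem group Full Control (GA = Generic All) over the CNO
  dsacls "CN=DAG-A,OU=Exchange,DC=contoso,DC=com" /G "CONTOSO\Exchange Trusted Subsystem:GA"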

Step 2 – Configure DAG NICs
Reference. Exchange 2013 performs automatic DAG network configuration depending on how the NICs are configured. This means that if the NICs are configured correctly, you should not have to manually collapse the DAG Networks after DAG setup. Upon adding the nodes to the DAG, it looks for the following properties on the NICs & makes a decision based on them:

  • NIC Binding Order
  • Default Gateway Present
  • Register DNS Checked

The DAG needs to separate MAPI/Public networks from Replication networks. This enables the DAG to properly utilize a network that the administrator has provisioned for Replication traffic & to only use the MAPI/Public networks for Replication if the Replication networks are down.

You want your MAPI/Public NICs at the top of the binding order in the OS & any Replication, Management, Backup, or iSCSI networks at the bottom of the binding order. This is a core Windows Networking best practice as well as what the DAG looks for when trying to determine which NICs will be associated with the MAPI/Public DAG Networks.

The DAG also looks for the presence of a Default Gateway on the MAPI/Public NIC. Going along with another Windows Networking best practice, you should only have 1 Default Gateway configured in a Windows OS. If you have additional networks with different subnets on the DAG nodes then you need to add static routes on each of the nodes using NETSH. More on this later.

Finally, NIC Properties>IPv4 Properties>Advanced>DNS>Register this connection’s addresses in DNS should be unchecked on all adapters except for the MAPI/Public NICs. This means all Replication, iSCSI, dedicated backup or management NICs should have this option unchecked. Again, this is a Windows Networking best practice but is vital for proper Automatic DAG Network Configuration in Exchange 2013.
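
On Server 2012 you can spot-check these NIC settings from PowerShell. A minimal sketch, assuming the replication NIC is named "REPL" (adjust the interface aliases to match your own; the binding order itself is still changed in the GUI under Network Connections > Advanced Settings):

  # Show which adapters have a default gateway & which register in DNS
  Get-NetIPConfiguration | Select-Object InterfaceAlias, IPv4DefaultGateway
  Get-DnsClient | Select-Object InterfaceAlias, RegisterThisConnectionsAddress

  # Turn off DNS registration on the replication NIC (leave it enabled on the MAPI/Public NIC)
  Set-DnsClient -InterfaceAlias "REPL" -RegisterThisConnectionsAddress $false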

Step 3 – Configure Routing if Needed (optional depending on DAG design)
If your DAG stretches subnets & you’re using dedicated Replication networks then they should be on their own subnet isolated from the MAPI/Public network. A common setup for a network such as this might be:

Site-Austin:
MAPI Network 192.168.1.0/24; Default Gateway 192.168.1.254
Replication Network 10.0.1.0/24; Default Gateway $Null

Site-Houston:
MAPI Network 192.168.2.0/24; Default Gateway 192.168.2.254
Replication Network 10.0.2.0/24; Default Gateway $Null

Now with the above configuration you would have some form of routing taking place between the two MAPI subnets. You would also have routing between the two Replication subnets. However, because you should only have 1 Default Gateway configured per server, DAG nodes in each site would be unable to communicate with each other over the Replication networks. This is where static routes come into play. You would run the following commands on the nodes to allow them to ping across to each other between the 10.0.1.x & 10.0.2.x networks (in the below example, REPL is the name of each node’s Replication NIC):

On Nodes in Site-Austin: netsh interface ip add route 10.0.2.0/24 "REPL" 0.0.0.0

On Nodes in Site-Houston: netsh interface ip add route 10.0.1.0/24 "REPL" 0.0.0.0

This is the preferred format for this command. There are some references to using the local interface IP instead of 0.0.0.0, but the format above is what is recommended by the Windows Networking Team. Reference.

"According to our Networking Development Groups, the recommendation actually is that on-link routes should be added with a 0.0.0.0 entry for the next hop, not with the local address (particularly because the local address might be deleted) and with the interface specified."

This all assumes there is physical routing in place between the two subnets, like a Router, layer 3 Switch, or a shared virtual network in Hyper-V/ESX.

Verify connectivity between nodes over these 10.0.x.x networks using Tracert or Pathping. Note that these steps are only required if your DAG spans subnets & has replication networks in different subnets. While it technically should work, it is not recommended to stretch subnets for DAG Networks across the WAN.
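
A quick way to confirm the static routes took & that the replication subnets can reach each other (a sketch; 10.0.2.10 is just a placeholder for a remote node's replication IP):

  # List the routes on this node & confirm the 10.0.x.0/24 entry is present
  netsh interface ipv4 show route

  # Trace the path to a remote node's replication IP; it should stay on the 10.0.x.x networks
  tracert 10.0.2.10

  # Or test basic reachability from PowerShell
  Test-Connection 10.0.2.10 -Count 4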

It should also be noted that there should be no routing between the MAPI Networks & the Replication Networks; they should be on isolated networks that have no contact with each other. Also, Microsoft requires no greater than 500ms round-trip latency between DAG nodes when you have DAG members across latent network connections. It’s important for customers to realize that you should not set your expectations around this number alone. You could easily have a connection over 500ms & not experience copy queues if you have only 20 mailboxes with low usage profiles. Alternatively, you could have a connection with only 50ms of round-trip latency but see high copy queues if you have thousands of high-usage mailboxes & a small bandwidth pipe. Just know that this number is not an end-all-be-all.

Step 4 – Create DAG & Add Nodes
This part is pretty straightforward & you can use the EAC to do it. Just remember to give the DAG an IP address in every MAPI subnet where you have DAG nodes. So in our scenario above you would give the DAG 2 IP addresses: one in the 192.168.1.0 subnet & another in the 192.168.2.0 subnet.
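
The same can be done from EMS if you prefer. A rough sketch using the example subnets above (the witness server, witness directory, & member server names are placeholders):

  # Create the DAG using the pre-staged CNO name & one IP per MAPI subnet
  New-DatabaseAvailabilityGroup -Name "DAG-A" -WitnessServer "FS01" -WitnessDirectory "C:\DAG-A-FSW" -DatabaseAvailabilityGroupIpAddresses 192.168.1.50,192.168.2.50

  # Add the Mailbox servers as DAG members one at a time
  Add-DatabaseAvailabilityGroupServer -Identity "DAG-A" -MailboxServer "EX01"
  Add-DatabaseAvailabilityGroupServer -Identity "DAG-A" -MailboxServer "EX02"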

Step 5 – Manually configure DAG Networks if needed
Reference. If you have dedicated management networks, dedicated backup networks, or iSCSI NICs then you actually have to perform some manual steps after your DAG is set up. These networks should be ignored by the DAG & not used by the cluster. In order to do this we must first enable manual DAG network configuration, which is disabled by default. We would then need to configure the iSCSI or similar network to be ignored by the cluster. Perform the following steps:

  • Get-DatabaseAvailabilityGroup
  • Set-DatabaseAvailabilityGroup <DAGName> -ManualDagNetworkConfiguration:$True
  • Get-DatabaseAvailabilityGroupNetwork
  • Set-DatabaseAvailabilityGroupNetwork <iSCSI/Backup/Mgmt NetworkName> -IgnoreNetwork:$True

Finally, let’s validate everything. Run the below command:

Get-DatabaseAvailabilityGroupNetwork | Format-List Identity,ReplicationEnabled,IgnoreNetwork

Verify that the iSCSI/Backup/Mgmt networks have IgnoreNetwork set to True (the MAPI & Replication networks should have this set to False). Also verify that the Replication Networks have ReplicationEnabled set to True. Finally, verify that the MAPI network has ReplicationEnabled set to False. This prevents the MAPI network from being used for Replication by default. It can still be used for Replication if all other possible replication paths go down.

References:
http://technet.microsoft.com/en-us/library/ff367878.aspx

http://technet.microsoft.com/en-us/library/dd298065(v=exchg.150).aspx

http://blogs.technet.com/b/scottschnoll/archive/2012/10/01/storage-high-availability-and-site-resilience-in-exchange-server-2013-part-2.aspx

http://blogs.technet.com/b/askcore/archive/2009/05/26/active-route-gets-removed-on-windows-2008-failover-cluster-ip-address-offline.aspx

http://technet.microsoft.com/en-us/library/dd298008(v=exchg.141).aspx

The Problem with Hardware VSS Providers and Cluster Technologies like CSV and DAG


In solutions like DAG and CSV you can have issues with VSS backups completing if you are attached to a SAN and using a hardware provider.
The reason is that the snapshot needs to pause the processes accessing the LUN, but if another server is the one in control of data on that LUN, that can’t be done from a single host.
Here are some details as well as ways to resolve this issue.

Scenarios:

1. CSV Issue

  • Backups of multiple servers with a shared CSV volume and VMs distributed across nodes may fail if you are using hardware VSS providers, because the provider wants to snapshot the entire LUN but the node you are running the snapshot from doesn’t have access to all the VMs in order to pause them before committing the snapshot.
  1. You can resolve this in one of 2 ways.
    1. Move all the VMs to a single node or host until the backup is completed.
    2. Disable or remove your hardware-based VSS provider.

2. DAG Issue

This issue may come up not because you are sharing LUNs and have active data on separate nodes (as above), but because you may use a separate provider for Active and Passive backups. When you try to back up a LUN that has both active and passive databases, a hardware provider may try to use two different writers to snapshot the LUN. You can verify this by moving all active databases to one node before backing up.

  1. You can resolve this in one of 3 ways.
    1. Do not put multiple databases on a single LUN.
    2. Move all databases to one node before running the backup.
    3. Disable your hardware-based VSS provider.

NOTE: Disabling your hardware provider will likely cause your backups to take much longer.
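
To check which VSS providers & writers are actually registered on a node (for example before & after disabling a hardware provider), the built-in vssadmin tool can be used; a small sketch:

  # List every VSS provider on this node; hardware providers show "Provider type: Hardware",
  # while the in-box Microsoft provider shows as a system/software provider
  vssadmin list providers

  # List the VSS writers & their current state (they should all be Stable with no last errors)
  vssadmin list writers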

References

  • Disable the EqualLogic hardware VSS provider – run eqlvss /unregserver from C:\Program Files\EqualLogic\bin
  • Disable Hardware VSS in DPM – Add the following key to the registry [Software\Microsoft\Microsoft Data Protection Manager\Agent\UseSystemSoftwareProvider]
  • How VSS Works
  • If you know how to disable other providers please let me know and I will add it to this document!

DAG Cross Site\Subnet networking – Additional Configuration


When you add servers to a DAG, it will create a network for every subnet\NIC the server is connected to. This is nice because as soon as you add the server it can replicate with the other nodes.
However, there are some post-configuration steps you need to take, otherwise replication will occur over the MAPI\Client network and never use the replication network.

  1. You should NEVER have multiple gateways. If you have a private\heartbeat network that is routed, you need to remove the gateway and add a static route instead.
    Example: configure the gateway on the public NICs only, then configure static routes for the replication subnets in Site A and Site B (screenshots omitted).
  2. Next you will notice your “DAG Networks” may end up as one auto-created network per subnet (screenshot omitted). The issue with this configuration is that there are no clearly defined Replication or MAPI networks, so what we need to do is collapse them into 2 DAG networks.
  3. Modify the networks to include both subnets (I named mine for easy identification), i.e. combine 10.0.1.x with 10.0.2.x and 192.168.1.x with 192.168.2.x. A sketch of the equivalent shell commands follows this list.
  4. I would also recommend disabling replication on the MAPI or client network (it will still be used if the replication network is not available).
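
Here’s a rough sketch of collapsing & tuning the networks from the shell, assuming a DAG named DAG1, the example subnets above, & that the auto-created networks came up as DAGNetwork01/DAGNetwork02 (yours may differ; on Exchange 2013 you would also need to enable manual DAG network configuration first, as covered in the setup guide above):

  # Rename one auto-created network per role & collapse each pair of subnets into it
  Set-DatabaseAvailabilityGroupNetwork -Identity "DAG1\DAGNetwork01" -Name "MAPI" -Subnets 192.168.1.0/24,192.168.2.0/24
  Set-DatabaseAvailabilityGroupNetwork -Identity "DAG1\DAGNetwork02" -Name "Replication" -Subnets 10.0.1.0/24,10.0.2.0/24

  # Any now-empty auto-created networks can be removed with Remove-DatabaseAvailabilityGroupNetwork

  # Disable replication on the MAPI/client network; it will still be used as a fallback if the replication network is down
  Set-DatabaseAvailabilityGroupNetwork -Identity "DAG1\MAPI" -ReplicationEnabled:$false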

You should now be replicating over the replication network. You can verify with the following:

  • Get-MailboxDatabaseCopyStatus <DatabaseName> -ConnectionStatus | fl Name,OutgoingConnections,IncomingLogCopyingNetwork

Exchange 2007\2010 Database Replication issues – Re-seed


Database replication problems can be caused by many things:

  • Hardware Failure
  • Network Outage
  • Administrative Change
  • Log Corruption

If you have replication issues, this WILL cause problems when moving the active node of your cluster. That is one reason it is very important that you do not use Cluster Administrator to perform the move, but rather use either the EMC or EMS.

EMS Process = Move-ClusteredMailboxServer -id ClusterName -TargetMachine Node2 (2007)
                      Move-ActiveMailboxDatabase <DatabaseName> -ActivateOnServer Node2 (2010)
EMC Process =
Server Configuration -> Mailbox -> Right Click ServerName -> Manage Clustered Mailbox Server -> browse for server -> next -> move (2007)
Organization Configuration -> Mailbox -> Database Management -> Right Click Database -> Move Active Mailbox Database -> browse for server -> ok -> move (2010)

If you suspect you have a replication issue, or you had one but have now resolved the problem and are ready to “re-seed” your databases, here is the process.

EMS Process = Open Exchange Management Shell ON the PASSIVE NODE

  1. First verify that the database is not replicating correctly:
    Get-StorageGroupCopyStatus (2007) or Get-MailboxDatabaseCopyStatus (2010)
    If the state is anything other than Healthy, or the numbers for the queues are increasing, then you have a replication issue.
    CopyQueueLength (logs queuing up to be sent to the passive node)
    ReplayQueueLength (logs queuing up to be replayed into the passive node's database)
     
  2. To re-seed a database (remove all the passive contents and re-copy):
    • On the PASSIVE node, first suspend the Storage Group in question:
      Suspend-StorageGroupCopy -id clustername\storagegroup (2007)
      Suspend-MailboxDatabaseCopy -id databasename\servername (2010)
      Then force an update, removing the old data on the passive copy:
      Update-StorageGroupCopy -id clustername\storagegroup -DeleteExistingFiles (2007)
      Update-MailboxDatabaseCopy -id databasename\servername -DeleteExistingFiles (2010)
      Or if you just want to reseed the entire server (2007):
      • Get-StorageGroup -Server clustername | Suspend-StorageGroupCopy
      • Get-StorageGroup -Server clustername | Update-StorageGroupCopy -DeleteExistingFiles
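
For 2010, a comparable whole-server re-seed might look like this (a sketch; run it from the passive node, where Node2 is a placeholder for that node's name & every copy on it is assumed to be passive):

  # Suspend every database copy hosted on the passive node
  Get-MailboxDatabaseCopyStatus -Server Node2 | Suspend-MailboxDatabaseCopy -Confirm:$false

  # Re-seed each copy, deleting the existing files first
  Get-MailboxDatabaseCopyStatus -Server Node2 | Update-MailboxDatabaseCopy -DeleteExistingFiles -Confirm:$false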

EMC Process = Open Exchange Management Console on the PASSIVE NODE

  1. Server Configuration -> Mailbox -> Select Cluster -> Right Click Storage Group -> Suspend Storage Group Copy (2007)
    Organization Configuration -> Mailbox -> Database Management -> Right Click Database Copy -> Suspend Database Copy (2010)
  2. Server Configuration -> Mailbox -> Select Cluster -> Right Click Storage Group -> Update Storage Group Copy -> check "Delete Existing" -> Next -> Update (2007)


Organization Configuration -> Mailbox -> Database Management -> Right Click Database Copy -> Update Database Copy -> Update (2010)


Glossary
EMS = Exchange Management Shell
EMC = Exchange Management Console