Understanding Pool Failover (more than you wanted to know most likely)

In this post we will dissect the actual pool failover and failback process and what is happening under the hood when you run Invoke-CsPoolFailOver.  To be honest, I’m not sure what practical purpose this information serves, other than that I find it super interesting.

Scenario

In this scenario we have two front-end servers, Lyncfe03 and Lyncfe04, both Standard Edition servers.  They are set up in a pool pairing relationship.  To start, we will check the backup status on both servers.

pic1

Here we can see that both servers are in FinalState and NormalState, so we know that pool failover is safe to perform.
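
For readers following along without the screenshot, a status check along these lines should produce comparable output. This is a minimal sketch, assuming it is run from the Lync Server Management Shell with sufficient rights; the pool names are the ones from this lab.

```powershell
# Assumption: Lync Server Management Shell, run by an account with admin rights.
foreach ($pool in 'lyncfe03.thelab.info', 'lyncfe04.thelab.info') {
    # Shows which pool this pool is paired with.
    Get-CsPoolBackupRelationship -PoolFqdn $pool

    # Reports the export/import status of the backup service for this pool.
    Get-CsBackupServiceStatus -PoolFqdn $pool
}
```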

Failing Over

When the pool failover happens, the front-end server enters a unique state that doesn’t allow client registrations.  More importantly, the second pool is aware of this state and allows clients to register and access their information, because the routing groups that were once assigned to the other pool are now loaded on the new pool.

Before we do the failover let’s check out what the registration configuration is in our environment.  You will see the importance of this in a moment.

pic2
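
The registrar configuration can be listed with something like the following. This is a sketch rather than the exact command behind the screenshot; the Select-Object projection is my own addition to keep the output short.

```powershell
# Lists every registrar configuration (global, site, and service scope)
# together with the PoolState that the failover process will change.
Get-CsRegistrarConfiguration | Select-Object Identity, PoolState
```

Before any failover, you would expect to see only the scopes you created yourself, each with a PoolState of Active.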

Now that we have this basic information, we can invoke our failover.  For this we will use this command:

Invoke-CsPoolFailOver -PoolFqdn lyncfe04.thelab.info -Verbose

So what do we see happening in the verbose output of the failover process?  I’ll indent the output and comment along the way.

Get-CsRegistrarConfiguration -Identity ‘service:Registrar:lyncfe04.thelab.info’ -LocalStore:$false
WARNING: Cannot find “RegistrarConfiguration” “Registrar:lyncfe04.thelab.info” because it does not exist.

Here we search specifically for a registrar configuration for the specific pool.  None were found.

Backup-CsPool -PoolFqdn lyncfe04.thelab.info -LocalStore:$false -SteadyState -Category UserData
Attempting to get into steady state.

Since we did not use the -DisasterMode switch when we started the failover, the process does a steady state backup of UserData.

Invoke-CsManagementStoreReplication -ReplicaFqdn lyncfe04.thelab.info
Get-CsManagementStoreReplicationStatus -ReplicaFqdn lyncfe04.thelab.info

We then ensure that the CMS is up to date.

Get-CsRegistrarConfiguration -Identity ‘service:Registrar:lyncfe04.thelab.info’
WARNING: Cannot find “RegistrarConfiguration” “Registrar:lyncfe04.thelab.info” because it does not exist.
New-CsRegistrarConfiguration -Identity ‘service:Registrar:lyncfe04.thelab.info’ -PoolState FailingOver

We check for the registrar configuration again and see it still doesn’t exist, so we create a new registrar configuration and set the PoolState to FailingOver.  This is where it gets interesting.

Backup-CsPool -PoolFqdn lyncfe04.thelab.info -LocalStore:$false -FullBackup -Category UserData
Attempting to get into final state.
Sync-CsUserData -PoolFqdn lyncfe04.thelab.info -Target
WARNING: Hydrating Routing Groups Owned by lyncfe04.thelab.info on Pool lyncfe03.thelab.info.
WARNING: Hydrating Routing Group {10613DBE-C944-5463-BFA3-7C1EA28D0EB9}.
WARNING: Hydrating Routing Group {8BAD09F4-4E95-58C5-BE4F-E8D7953CD0C8}.
WARNING: Hydrating Routing Group {DF7B2BD1-DD18-5B3E-A470-B3BD46B6225B}.

We then do a full backup of the user data and do a sync of user data into the database.  Now since our pool (lyncfe04) is in a FailingOver state, we then see that any routing groups that were assigned to LyncFE04 get populated on LyncFE03.

Set-CsRegistrarConfiguration -Identity ‘service:Registrar:lyncfe04.thelab.info’ -PoolState FailedOver
Reset-CsPoolRegistrarState -PoolFqdn lyncfe04.thelab.info -ResetType ServiceReset -Force -NoReStart
Stop-CsWindowsService -ComputerName lyncfe04.thelab.info -LeaveClsAgentRunning -NoWait -Verbose

Now that our routing groups have moved over, we change our RegistrarConfiguration from FailingOver to FailedOver.  We also reset the pool we failed from and stop services on it (the -LeaveClsAgentRunning switch suggests the Centralized Logging Service agent is left running).

Once this has completed, we can run a Get-CsRegistrarConfiguration again and now we see:

pic3

As you can see LyncFE04 is now in a FailedOver state.  When this happens we can see a similar event on the LyncFE03 front-end server:

pic4

Restart Services on 04

So let’s go back and restart services on LyncFE04 while the registrar service is still in a FailedOver state.  When all of the services start up, we see in the event viewer on LyncFE03 that LyncFE04 goes into an online state (for backup and mediation services), but from a registrar standpoint it remains in a FailedOver state.

pic5

So simply starting services is not enough to move registrar services.  This isn’t a surprise.

Failback

When we do the failback with Invoke-CsPoolFailBack, the verbose output looks like this:

Get-CsWindowsService -Name ‘RTCSRV’ -ComputerName ‘lyncfe04.thelab.info’
Stop-CsWindowsService -LeaveClsAgentRunning -Name RTCSRV -ComputerName lyncfe04.thelab.info -NoWait -Verbose
Start-CsWindowsService -Name LYNCBACKUP -ComputerName lyncfe04.thelab.info -Verbose
Start-CsWindowsService -Name REPLICA -ComputerName lyncfe04.thelab.info -Verbose
Invoke-CsManagementStoreReplication -ReplicaFqdn lyncfe04.thelab.info
Get-CsManagementStoreReplicationStatus -ReplicaFqdn lyncfe04.thelab.info
Get-CsManagementStoreReplicationStatus -ReplicaFqdn lyncfe04.thelab.info
Get-CsManagementStoreReplicationStatus -ReplicaFqdn lyncfe04.thelab.info
Backup-CsPool -PoolFqdn lyncfe03.thelab.info -LocalStore:$false -SteadyState -Category UserData -FailedOverPoolOnly
Attempting to get into steady state.
Get-CsWindowsService -Name ‘RTCSRV’ -ComputerName ‘lyncfe04.thelab.info’
Start-CsWindowsService -Name RTCSRV -ComputerName lyncfe04.thelab.info -NoWait -Verbose
Set-CsRegistrarConfiguration -Identity ‘service:Registrar:lyncfe04.thelab.info’ -PoolState FailingBack
Backup-CsPool -PoolFqdn lyncfe03.thelab.info -LocalStore:$false -FullBackup -Category UserData -FailedOverPoolOnly
Attempting to get into final state.
Sync-CsUserData -PoolFqdn lyncfe04.thelab.info
WARNING: Hydrating Routing Groups Owned by lyncfe04.thelab.info on Pool lyncfe04.thelab.info.
WARNING: Hydrating Routing Group {10613DBE-C944-5463-BFA3-7C1EA28D0EB9}.
WARNING: Hydrating Routing Group {8BAD09F4-4E95-58C5-BE4F-E8D7953CD0C8}.
WARNING: Hydrating Routing Group {DF7B2BD1-DD18-5B3E-A470-B3BD46B6225B}.
Set-CsRegistrarConfiguration -Identity ‘service:Registrar:lyncfe04.thelab.info’ -PoolState Active

A few items above are worth noting.  First, you can see that we restart services on LyncFE04.  Then we set the registrar PoolState to FailingBack, and once that has changed we can see the routing groups rehydrate on LyncFE04.  Finally, the PoolState is set back to Active.
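
The failback is started the same way as the failover. A minimal sketch, assuming the same lab pool name and that services on LyncFE04 are reachable again:

```powershell
# Fail the users back to their original pool, with verbose output so the
# individual steps listed above are visible.
Invoke-CsPoolFailBack -PoolFqdn lyncfe04.thelab.info -Verbose

# Afterwards, confirm that the service-scope PoolState is back to Active.
Get-CsRegistrarConfiguration -Identity 'service:Registrar:lyncfe04.thelab.info' |
    Select-Object Identity, PoolState
```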

Manual Change

Now that we see what is actually happening in the background when the pool failover and failback happen, we can get creative.  What happens if we just change the registrar PoolState manually?

Set-CsRegistrarConfiguration -Identity ‘service:Registrar:lyncfe04.thelab.info’ -PoolState FailedOver

When we run this command nothing happens immediately.  After about 90 seconds these events appear on LyncFE04:

pic6

And approximately 30 seconds after that you see this on the LyncFE03 pool:

pic7

In the background we see the client disconnect from LyncFE04 and automatically connect to LyncFE03.  Although nothing is logged on the front-end servers, it appears as though the routing groups have once again been loaded on LyncFE03.

Conclusion

So what have we learned from this?

The PoolState on the registrar configuration tracks where the pool is in the failover and failback process: Get-CsRegistrarConfiguration shows it and Set-CsRegistrarConfiguration changes it.  In the event of an emergency, you could potentially just “change” the PoolState on the registrar configuration.

The last item worth noting is that the registrar configuration is created “on-the-fly” when pools are failed over and failed back.  That means if you have a special setting within the global configuration, you should create these registrar configurations beforehand; otherwise the new ones created as part of the failover process will have just the default settings.
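
One way to do that is sketched below. It assumes that the failover reuses an existing service-scope configuration instead of creating a new one (which is what the verbose output above suggests, but I have not verified it), and the two settings copied here are only examples; substitute whichever global settings you have customized.

```powershell
# Assumption: the failover process will find and update this configuration
# rather than create a fresh one with default settings.
$identity = 'service:Registrar:lyncfe04.thelab.info'
$global   = Get-CsRegistrarConfiguration -Identity Global

# A missing configuration produces a warning, not an object, so suppress
# the warning and test for an empty result.
$existing = Get-CsRegistrarConfiguration -Identity $identity `
    -WarningAction SilentlyContinue -ErrorAction SilentlyContinue

if (-not $existing) {
    # Copy the customized global values into the new service-scope
    # configuration (example settings only).
    New-CsRegistrarConfiguration -Identity $identity `
        -MaxEndpointsPerUser $global.MaxEndpointsPerUser `
        -EnableDHCPServer $global.EnableDHCPServer
}
```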

7 comments on “Understanding Pool Failover (more than you wanted to know most likely)”

  1. Shane

    Nice article but you are using than incorrectly throughout. You should be using the word then. It is a bit distracting to keep reading the word than when you should be using then.
    Ex:
    “We than do a full backup of the user data and do a sync of user data into the database. Now since our pool (lyncfe04) is in a FailingOver state, we than see that any routing groups that were assigned to LyncFE04 get populated on LyncFE03″.

    This should be…”We then do a full backup of the user data and do a sync of user data into the database. Now since our pool (lyncfe04) is in a FailingOver state, we then see that any routing groups that were assigned to LyncFE04 get populated on LyncFE03”.

  2. Mobin

    I recently had to do an Enterprise DR failover/failback and faced some issues. While searching the Internet I found your explanation, which helped me understand the whole process much better.

    Thanks

  3. Englishtoolhater

    Good job man. Regardless of the tool who thinks this is an English class 🙂

  4. Hans A

    How can we make this truly automatic instead of manually running this command?

    • Richard Brynteson

      You really can’t nor would you want to. If you had a network occurrence where it was flapping up and down, you could accidentally be sending users back and forth all day long.

  5. Joel C.

    What about removing resiliency and backup services from a secondary site? What ramifications will this have on production? i.e. running the Deployment setup on each prod FE server after removing resiliency, etc.

  6. Jerryn

    Really great article – I have been looking for a solid piece of information on the inner workings on the failover process, and this helps greatly.
