Understanding how Windows Fabric Works (with regards to Lync)

I decided to continue my “understanding how” series of posts with one about Windows Fabric.  The great thing about Windows Fabric is that you shouldn’t need to do anything and it should just “work”.  But sometimes it’s fun to look at how things actually work under the hood.

Overview

A picture is worth a thousand words.  This picture has been used numerous times at Lync Conference and in Lync Ignite classes, and I think it does a great job of giving a high-level view of what Fabric does and why Lync 2013 chose to use it.

pic1

As the picture details, one of the biggest changes Windows Fabric allowed is moving user services from the back-end database to the front-end servers.  This gives Lync 2013 several advantages: you can scale further by adding more servers to the pool, the pool relies less on the back-end database, it is self-healing because it automatically monitors the state of the other servers in the fabric, and the decentralized approach limits single points of failure.  One of the biggest gripes about the addition of the fabric is the recommendation of three servers for an Enterprise Edition pool (NEVER do a two server pool!!!) and the minimum number of servers required for pool functionality.  I think the benefits of fabric outweigh these negatives that some people will point out.  (Note: I believe Lync 2013 suffers from a new limitation, SQL Express on the front-end servers, but that is a discussion for a later time.)

The Routing Group

To me the greatest advantage of using Windows Fabric is how Lync Server 2013 utilizes the back-end database.  In our examples below we will pretend that we have four front-end servers: lyncfe00, lyncfe01, lyncfe02 and lyncfe03.  When the fabric service (and hence the front-end services) start up for the first time, they go through a process in which routing groups are assigned to servers.  These routing groups then correspond back to users.  So let’s walk through the database and see where the routing group shows up.

So we connect to the RTCLocal\RTC database on any of the front-end servers.  Remember, Windows Fabric means we don’t need to rely on the back-end database, so all of this data is sitting on the front-ends.

pic2

In this image are two SQL queries.  The first looks at the RoutingGroupAssignment table, which displays all of the routing groups in this pool.  In my example, we can see six different routing groups.  When you add more users to a pool, the total number of routing groups continues to grow.  The second query is against the FrontEnd table, and here we can see that 12040 is the lyncfe00.thelab.info server.  As you can see in my lab, of the six routing groups that exist, two are assigned to each of the three front-end servers that are currently running.
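To make the relationship between the two tables concrete, here is a minimal Python sketch that counts routing groups per front-end.  The rows are made up for illustration: only the 12040 to lyncfe00.thelab.info mapping comes from my lab, and the other IDs, the routing group labels and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the FrontEnd table (ID -> FQDN).
# Only 12040 -> lyncfe00.thelab.info is taken from the article.
front_ends = {
    12040: "lyncfe00.thelab.info",
    12041: "lyncfe01.thelab.info",
    12042: "lyncfe02.thelab.info",
}

# Hypothetical stand-in for RoutingGroupAssignment: (routing group, front-end ID).
# Real routing groups are GUIDs; short labels keep the example readable.
routing_group_assignment = [
    ("rg-1", 12040), ("rg-2", 12040),
    ("rg-3", 12041), ("rg-4", 12041),
    ("rg-5", 12042), ("rg-6", 12042),
]

def groups_per_front_end(assignments, servers):
    """Count how many routing groups each front-end currently owns."""
    counts = Counter(servers[fe_id] for _, fe_id in assignments)
    return dict(counts)

per_server = groups_per_front_end(routing_group_assignment, front_ends)
for fqdn, count in sorted(per_server.items()):
    print(fqdn, count)
```

With six routing groups and three running servers, an even spread gives two routing groups per server, which matches what the queries show in my lab.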

The next question is which user(s) are assigned to each routing group.  We can find this via PowerShell, but where is this information actually stored?  For this, we go to Active Directory.  Looking at testuser01’s Active Directory information and the table of routing groups, we can see this:

pic3

As you can see here, the msRTCSIP-UserRoutingGroupID attribute in Active Directory corresponds to a routing group defined within Lync.  Some of the numbers are reversed between Active Directory and what is seen in the RTCLocal database; this looks like the usual byte-order difference in how the first groups of a GUID are stored versus displayed.  What is important to know is that this routing group ID doesn’t change once it’s set (unless your server gets more routing groups because more users are added to the pool).
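If you want to line the two values up yourself, the reversal can be sketched in Python.  This assumes the Active Directory attribute holds the GUID as raw bytes in the little-endian layout Windows commonly uses, which is my reading of the screenshot rather than something I have confirmed.  The GUID below is made up, and the function name is hypothetical.

```python
import uuid

def ad_bytes_to_guid(raw: bytes) -> uuid.UUID:
    """Interpret 16 raw bytes as a GUID whose first three groups are little-endian."""
    return uuid.UUID(bytes_le=raw)

# Made-up value, NOT a real routing group ID.
as_shown_in_sql = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")

# The same GUID as a raw byte string: the first three groups are byte-reversed,
# the last two groups are unchanged.
raw_from_ad = as_shown_in_sql.bytes_le
print(raw_from_ad.hex())

# Converting back gives the form seen in the database.
print(ad_bytes_to_guid(raw_from_ad))
```

Only the first three groups flip, which would explain why just some of the numbers look reversed.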

EXPERT NOTE: When you invoke a pool failover, the routing groups are reassigned from PoolA to PoolB.  If you run these same commands, you would see the routing groups assigned to the new servers.  So a user’s routing group shouldn’t change all that often within Active Directory.

I mentioned we can see this same information via PowerShell as well.  By running Get-CsUserPoolInfo we can get the full fabric information.

pic4

Loading the Fabric

Now that we understand what the routing group is and where we can see it assigned to a server, why do I see three servers in my user pool information?  Windows Fabric will always assign three servers to every routing group.  In my test user’s case, we can see lyncfe00 (primary), lyncfe01 (active secondary) and lyncfe02 (idle secondary).  If we were to look at another user, we could potentially see a different order.

When a pool of front-end servers is started, a primary server is assigned to each routing group.  Once a server is assigned as primary, it reaches out to the back-end database and hydrates the user information (contacts, conferences, etc.) for all users assigned to that routing group.  Once the primary server has gathered the information it needs, it replicates that information to both the active secondary and idle secondary servers.  Once this process has completed, the front-end service goes from a starting state to a running state.  When a user makes a change (testuser01 creates a new conference, for example), it is the responsibility of the primary server to write that information into the back-end database and also add it to both of the secondary servers.  The writes to the other front-end servers are synchronous for all nodes in the routing group; the write to the back-end database is not.

In the event the primary server goes down, the active secondary is promoted to primary.  Since the secondary servers have been receiving synchronous changes, there is no need for the old secondary/new primary server to hydrate any information from the back-end database.  Assuming the pool has more than three servers, a new idle secondary server will be assigned to the routing group, and that server will hydrate its data from the new primary server.  It should be noted that hydration of a new server can take anywhere from 15 to 30 minutes.
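The promotion logic described above can be modeled with a small Python sketch.  This is a toy model of my description, not how Windows Fabric is implemented: the class and function names are hypothetical, and I am assuming the old idle secondary moves up to active secondary when the primary is lost.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ReplicaSet:
    """The three servers backing one routing group."""
    primary: Optional[str]
    active_secondary: Optional[str]
    idle_secondary: Optional[str]

    def members(self) -> List[str]:
        roles = (self.primary, self.active_secondary, self.idle_secondary)
        return [s for s in roles if s is not None]

def fail_server(rs: ReplicaSet, failed: str, running: List[str]) -> ReplicaSet:
    """Return the replica set after `failed` goes down.

    `running` lists the pool servers that are still up.
    """
    # Survivors keep their relative order, so the active secondary is
    # promoted to primary without touching the back-end database.
    survivors = [s for s in rs.members() if s != failed]
    # Any running server outside the set can be hydrated as a new replica.
    spares = [s for s in running if s != failed and s not in survivors]
    new_order = (survivors + spares)[:3]
    new_order += [None] * (3 - len(new_order))
    return ReplicaSet(*new_order)

before = ReplicaSet("lyncfe00", "lyncfe01", "lyncfe02")

# Four-server pool: lyncfe03 is available to become the new idle secondary.
after_four = fail_server(before, "lyncfe00", ["lyncfe01", "lyncfe02", "lyncfe03"])

# Three-server pool: nobody is left to fill the third slot.
after_three = fail_server(before, "lyncfe00", ["lyncfe01", "lyncfe02"])

print(after_four)
print(after_three)
```

In the three-server case the replica set simply runs with two members until the failed server returns.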

Large Pools Risk

Because of how Windows Fabric works, there is a new risk administrators need to be aware of: simultaneous server failure in a large pool.  Take this example.  Your front-end pool contains 12 servers, deployed as virtual machines across four physical hosts.  Odds are, at least one routing group will be assigned to three front-end servers that all run on the same physical host.  What happens to an end user if the physical host holding all three servers assigned to that routing group crashes?  The end user will immediately disconnect from Lync and will remain disconnected until the routing group is assigned to new servers within the pool and hydrated from the back-end database.  My testing has always been hit or miss in this department.  In one scenario, I took down all three servers assigned to a routing group and it took approximately 30 minutes for the routing group to come back online.  In another test, I left them down for hours and the routing group never came back online.  The moment I brought one of the original servers back online, it immediately took over and assigned two new servers as routing group members.

If all three servers within the routing group fail at the same time, then a quorum loss recovery needs to be run on the server.

It gets worse.  Since each replica set also has quorum (Primary, Secondary and Secondary Idle), you would never want to lose two servers at the same time.  In this scenario, where two of the three servers in a replica set are lost at a given time, your replica set will lose quorum and you will drop into limited functionality mode.  The only recourse is to issue a quorum loss recovery command (Reset-CsPoolRegistrarState with -ResetType QuorumLossRecovery) or get at least one of those two servers back online.
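To put rough numbers on the 12-server example, here is a Python sketch that enumerates every possible three-server replica set and checks how many land on one host.  It assumes three virtual machines per host and that every combination of three servers is equally likely, which is a simplification: I have not seen the actual placement logic documented.  The server and host names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def placement_risk(host_of):
    """Return (P(all three replicas on one host), P(at least two on one host)).

    `host_of` maps each front-end server to its physical host.
    Assumes every 3-server combination is equally likely (a simplification).
    """
    triples = list(combinations(sorted(host_of), 3))
    all_three = sum(1 for t in triples if len({host_of[s] for s in t}) == 1)
    two_or_more = sum(1 for t in triples if len({host_of[s] for s in t}) <= 2)
    return Fraction(all_three, len(triples)), Fraction(two_or_more, len(triples))

def chance_any_group_hit(p_single, routing_groups):
    """Chance that at least one of `routing_groups` independent placements is hit."""
    return 1 - (1 - p_single) ** routing_groups

# 12 hypothetical servers spread evenly over 4 hypothetical hosts.
host_of = {f"lyncfe{i:02d}": f"host{i // 3}" for i in range(12)}

p_three, p_two = placement_risk(host_of)
print("all three replicas on one host:", p_three)
print("at least two replicas on one host:", p_two)
print("any of 50 routing groups fully on one host:",
      round(float(chance_any_group_hit(p_three, 50)), 3))
```

Under those assumptions a single routing group has a 1 in 55 chance of sitting entirely on one host, and roughly a coin flip of having two of its three replicas on the same host, which is enough to lose quorum when that host crashes.  The 50 routing groups in the last line are an arbitrary illustration, not a figure from a real pool.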

So here is the recommendation when it comes to virtualization: never have more than a single front-end server on a physical host.  (Why virtualize then?  Good question.)

Windows Fabric Configuration

There is nothing you need to do as an administrator to configure Windows Fabric.  However, if you want to look at the configuration, you can browse to C:\Program Files\Windows Fabric\bin\ClusterManifest.current.  This file is updated every time the Windows Fabric service starts up, so manually changing it makes no sense.

Even Number of Servers

One of the interesting side effects of the Windows Fabric process is that you need to keep quorum.  This gets more interesting if you have an even number of front-end servers.  Anyone familiar with clustering in Windows 2003 knows that you need an odd number of voting members, and Windows Fabric is no different.  So if you have an even number of front-end servers, the primary SQL back-end will be added as a voting member.

IMPORTANT: It’s ONLY the primary member who is a voting member in the fabric.  So if you are using a SQL mirror and you have failed over from the Primary to the Mirror, your Windows Fabric is one vote down already!

So let’s take a moment to explore a four server front-end pool that uses SQL mirroring.  Since it’s an even number of servers in the pool, we know the SQL server is going to be involved in the voting process.  If you had failed over from the primary to the mirror for whatever reason and then took down two of your four front-end servers, you would only have two of five voting members in the Windows Fabric, and front-end services would shut down after five (5) minutes on the remaining two front-end servers.  The lesson is simple: if you have an even number of servers, you need to pay attention to where your SQL mirror is.
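The vote counting in that example can be written out as a short Python sketch.  It models only the rule as I have described it (a strict majority of voters, with the SQL primary voting when the front-end count is even), so treat it as an illustration and not as the fabric’s real algorithm.  The function and parameter names are hypothetical.

```python
def has_quorum(front_ends_total: int, front_ends_up: int, sql_on_primary: bool) -> bool:
    """Return True if a strict majority of voting members is available.

    The SQL back-end only votes when the front-end count is even, and
    (per the article) only while the database is on the primary.
    """
    sql_is_voter = front_ends_total % 2 == 0
    voters = front_ends_total + (1 if sql_is_voter else 0)
    votes_up = front_ends_up + (1 if sql_is_voter and sql_on_primary else 0)
    return votes_up > voters // 2

# Four front-ends, SQL failed over to the mirror, two front-ends down:
# 2 of 5 votes, so quorum is lost.
print(has_quorum(4, 2, sql_on_primary=False))

# Same outage with SQL still on the primary: 3 of 5 votes.
print(has_quorum(4, 2, sql_on_primary=True))

# Three front-ends, one down: SQL does not vote, 2 of 3 is enough.
print(has_quorum(3, 2, sql_on_primary=False))
```

The only difference between the first two cases is where the SQL mirror is sitting, which is exactly why it deserves attention in an even-numbered pool.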

Do I Need a Back-end Database

Of course you need the back-end database, but because of Windows Fabric you rely on it much less.  If you have ever taken down your back-end DB, you will notice that unlike Lync Server 2010, the Lync client will continue to function with all of its conferencing and contact information.  Lync Server has a set timeframe that it can “survive” without the back-end database.  That default length of time is 30 minutes.  (You can use the Get-CsRegistrarConfiguration cmdlet to view this setting and Set-CsRegistrarConfiguration to change it, but it’s not recommended to go too nuts with the length of time.)

Never Two Servers

First, whenever you have fewer than three servers, you don’t have enough servers to complete the Windows Fabric.  Therefore, you don’t really have a primary, secondary and secondary idle server.  The impact is that you no longer rely on the primary server to hydrate the secondary server.  Instead, the pool acts much like Lync Server 2010 did.  Server A and Server B both get their information directly from the back-end database.  This also means you cannot survive a back-end database failure like you can when you have three servers.  So in the event of a back-end DB failure, the Lync clients will immediately disconnect or drop into limited functionality mode, just as with Lync Server 2010.

Second, if you ever go from two servers to three servers in the pool, you will need to reset the entire pool.  This is important to note because, unlike with Lync Server 2010, you would NEVER want to add that third server in Topology Builder in the middle of the day: the topology change will immediately invoke the pool reset, and all users will be disconnected as the front-end services restart.

Conclusion

The key takeaways from this article should be how the fabric works, how data is loaded from one server to another, and some details about why you never want two (or fewer) servers in a Lync 2013 pool.

32 comments on “Understanding how Windows Fabric Works (with regards to Lync)”

  1. Markus Johansson

    Hey Great article
    Have you played around with Res Kit tool RepTester.exe anything?

    • Richard Brynteson

      I have a little bit. There isn’t much around for documentation and didn’t find any particular process that gave me great insight into it. But it would be good to play around with it again.

    • Richard Brynteson

      I suppose a better headline would be Understanding How Windows Fabric works with regards to Lync Server 2013. So yes, this isn’t an in-depth bit level review of just the Windows Fabric process. Heck of an honor to have you even notice the post.

  2. Justin Morris

    Excellent post Richard, especially regarding the large pool and virtualization consideration. I definitely learned a thing or two about WinFab, cheers.

  3. Ryan

    Richard,
    Nice post. You mention the Front Ends will be OK for 30min if the backends are unavailable. Wouldn’t things such as response groups be unavailable during the time the backends are down? From what I can tell Response Groups are only stored on the Lync Backend databases. If you could provide some detail on this (both when the CMS is and is not stored on your SQL backends that are offline).

    Thanks,
    Ryan

    • Richard Brynteson

      Correct. Any services that are dependent on the backend server (which at this time is the RGS service) will be offline immediately. This would be true whether the backend database hosted the CMS or not. That part doesn’t matter.

  4. John

    The RoutingGroupAssignment table shows the primary server, but where in SQL are the additional servers (secondary, tertiary) for PrimaryPoolMachinesInPreferredOrder indicated?

    • Richard Brynteson

      That info doesn’t appear to exist within SQL but rather directly within the Fabric framework. I’ve looked before where to find that info and never found it – although I didn’t lose sleep over looking for it either.

  5. Muthupandi Mk

    Superb Explanation. It clears me well while designing Lync 2013 along with VM .

    Great article Richard.

  6. Luke Jochem

    Just a query/observation about your ‘Odd number of servers’ section. It seems like you’re getting your odds and evens mixed around. You mention a four server pool as an odd numbered pool so SQL will be involved to make the 5th vote…. Shouldn’t it read “The lesson is simple: If you have an EVEN number of servers you need to pay attention where SQL mirror is”… correct??

  7. Pingback: Windows Network Controller Architecture | LogicPundit Blog

  8. Mike Wilczynski

    There is another very little documented feature of WinFabric but sometimes really critical. I mean Data Collector Set “FabricTraces” being deployed with WinFabric and Lync 2013. Trace log files from this data collector set are growing rapidly (depends on size of Lync 2013 environment, in my case 25GB/day) and because they are being saved in hidden folder can eat all of the free space and crash the FE server. To prevent filling all the disk space by the logs there is parameter “LogDeletionAgeInDays” in file %ProgramFiles%\Windows Fabric\bin\Fabric\Fabric.Config%ver%\settings.xml. It has been set by default to 3, but sometimes it’s not enough to prevent free space running out. Anyone have an idea how to change and apply shorter rotation log’s schema through settings.xml (not using Data Manager, ps or bat scripts, etc)?

    • Rob

      Logman update trace FabricLeaseLayerTraces -f bincirc --cnf
      This makes the fabric trace log circular.

  9. Pingback: Windows Fabric + Skype for Business 2015 | Mastering Lync

  10. Mohamed

    Cheers Richard. This is the most obvious and detailed article I’ve ever read about Windows Fabric.

  11. John

    If all the Front End servers in a pool have their Lync services stopped, does Windows Fabric still function? If so, would it change routing groups?

    • Luke

      AFAIK Windows Fabric is its own service so yes to the first question… and yes to the second question or at least it would try. Depends very much on your environment (ie; # of servers, pools ,etc).

  12. Pingback: Problema con la distribucion de usaurios en los “Routing Groups” de Lync 2013 « Mr.Lync

  13. Pingback: PapyCloud | The Hidden Logs That Could Crash Your Lync Servers

  14. Moshiur Rahman

    recently i have deployed a skype for business server 2015 and it was working fine. But when i added another front end server in the pool and tried to publish the topology, i got error in publishing saying “Fabric version for computer (FQDN of the new front end server) is unknown)”. Any idea regarding this?

    • Richard Brynteson

      Make sure your new front-end is the same version of Windows, patches, etc.

  15. Kind of Lync/Skype Admin

    Richard,
    In case of two FE solution, do you have more details about this:
    “Server A and Server B both gets their information directly from the back-end database. This also means you cannot survive a back-end database failure like you can when you have three servers.”

    I have living in a dream where Lync/Skype depends of the SQL is free of fabric and the Front Ends behaves same way in all cases. This is the first time I see the comment like that. But there is still so much other stuff which I have not read yet.

    • Richard Brynteson

      Even in a two FE pool, the servers still rely on fabric services to place users, load the pools, etc. It simply doesn’t do anything of the primary, secondary, third pool setup. If you don’t want fabric, you are going back to Lync 2010.

  16. Shahid

    Dear Richard i have a question to you and other readers out here, the question is do we still have to have knowledge of windows fabric and other core concepts of Lync/SFB?? everything is moving to cloud and on-premise deployment would go away, am i right?

    • Richard Brynteson

      I’m sure Microsoft would love “everything” to move to the cloud, but I don’t think this is going to happen at the speed they want. So although it may appear as though it’s all cloud all the time, I can tell you that 85% of the deployments we do are still on-prem. There are so many missing features that it’s not an option for many organizations. That will change, but it will be a bit.

  17. Pingback: The Hidden Logs That Could Crash Your Lync Servers! » Thoughts From a Bot Named Flinch
