The Atlas File System – The foundation of the Rubrik Platform
One of the core selling points of the Rubrik platform is the notion of “unlimited scale” – the ability to start small and scale as large as you need, all while maintaining a masterless deployment. Until a few weeks ago I was unaware of how they actually achieved this, but after watching Adam Gee and Roland Miller present at Tech Field Day 12 in San Jose I have no doubt that the Atlas file system is the foundation upon which all of Rubrik is built.
As shown above, we can see how Rubrik lays out the platform – with the Atlas file system sitting at the core of the product and communicating with nearly every other component. Now picture each node containing this same picture, scaled up to however many nodes you have – each node running its own instance of Atlas, with its own local applications accessing it. The storage, however, is distributed and treated as one scalable pool addressable through a single global namespace.
Atlas – a distributed scalable file system.
As shown above, other core modules such as Callisto (Rubrik's distributed metadata store) and the Cluster Management system leverage Atlas under the hood – and in turn Atlas utilizes them for some of its functions. For instance, to scale, Atlas leverages data from the Cluster Management system to grow and shrink – when a new brik is added, the CMS notifies Atlas, which adds the capacity of the new nodes to the global namespace, increasing the total capacity available as well as the flash resources to consume for things such as ingest and cache. Atlas also takes care of data placement, so adding a new node to the cluster triggers a re-balance. However, it has the “smarts” to process this as a background task and to take into account all of the other activity occurring within the cluster, which it learns from the Distributed Task Framework – meaning we won't see a giant performance hit directly after adding new nodes or briks, thanks to the tight integration between all of the core components.
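To make the grow-and-rebalance flow concrete, here is a minimal sketch in Python. The `Namespace` class, the `on_node_added` callback, and the task queue are all hypothetical stand-ins for the CMS notification and the Distributed Task Framework described above – not Rubrik's actual API.

```python
class Namespace:
    """Sketch of a global namespace that grows when a (hypothetical)
    CMS event reports a new node, deferring rebalance to background work."""

    def __init__(self):
        self.capacity_tb = 0
        self.nodes = []
        self.background_tasks = []  # stand-in for the Distributed Task Framework

    def on_node_added(self, node_id: str, capacity_tb: int) -> None:
        # new capacity joins the single global namespace immediately
        self.nodes.append(node_id)
        self.capacity_tb += capacity_tb
        # rebalancing is queued as a low-priority background task so
        # foreground ingest and restores are not impacted
        self.background_tasks.append(("rebalance", node_id))
```

The key design point mirrored here is that capacity becomes usable right away, while data movement is decoupled and scheduled around other cluster activity.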
Adding disk and scaling is great; however, the challenge for any distributed file system is how to react when failures occur, especially when dealing with low-cost commodity hardware. Atlas performs file system replication in a way that provides for failure at both the disk level and the node level, allowing two disks, or one full node, to fail without data loss. How Atlas handles this replication depends on the version of Rubrik in your datacenter today. Pre-3.0 releases used mirroring, which essentially triple-replicated our data across nodes. Although triple replication is a great way to ensure we don't lose data, it does so at the expense of capacity. The Firefly release (3.0 and higher) implements a different replication strategy: erasure coding. Erasure coding takes the same data we once would have replicated three times and splits it into chunks – additional encoded chunks are then computed which can be used to rebuild the data if need be. It's these chunks that are intelligently placed across disks and nodes within the cluster to provide availability. The short of the story is that erasure coding gives us the same benefit as triple replication without the cost of triple the capacity – so more space is available within Rubrik for what matters most: our data.
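To illustrate the capacity argument, here is a toy erasure-coding sketch using a single XOR parity chunk (real systems like Atlas use stronger codes such as Reed–Solomon; this is only the simplest possible instance of the idea). With 4 data chunks plus 1 parity chunk we store 1.25x the data and can survive the loss of any one chunk, versus 3x for triple mirroring.

```python
from functools import reduce

def encode(data: bytes, k: int = 4) -> list[bytes]:
    """Split data into k equal chunks and append one XOR parity chunk."""
    size = -(-len(data) // k)               # ceiling division
    data = data.ljust(size * k, b"\0")      # pad so chunks divide evenly
    chunks = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))
    return chunks + [parity]

def reconstruct(chunks: list) -> list:
    """Rebuild the single missing chunk (marked None) by XOR-ing survivors."""
    missing = chunks.index(None)
    survivors = [c for c in chunks if c is not None]
    rebuilt = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))
    chunks[missing] = rebuilt
    return chunks
```

Because the parity is the XOR of all data chunks, XOR-ing the four survivors (data plus parity) yields exactly the lost chunk – the same recoverability as a mirror copy, at a fraction of the capacity overhead.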
Aside from replicating our data, Atlas employs other techniques to keep data available – self-healing and CRC checks allow Atlas to detect corrupt data, throw it away, and repair it. These are features we expect to see in any file system, but Atlas can handle them a little differently thanks to its distributed architecture. The example given was three briks, each containing four nodes – when a node fails, or data becomes corrupt, Atlas repairs the data on a surviving node within the same brik, ensuring the copies are still spread out across briks. If an entire brik fails, a chunk of data may then need to live on the same brik as another copy, but it will be placed on a different node, still allowing for node failure. It's this topology-aware placement that allows Rubrik to maximize data availability and provide protection not only across nodes within a brik, but across brik failures as well, upholding the failure-tolerance guarantees they provide.
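A rough sketch of that topology-aware repair decision might look like the following. The cluster map, the `replacement_node` helper, and its preference order are my own illustration of the policy described above, not Rubrik's implementation.

```python
def replacement_node(cluster: dict, failed_node: str, holders: set) -> str:
    """Pick a repair target for a chunk copy lost on failed_node.

    cluster: brik name -> list of its node names
    holders: nodes that still hold a copy of this chunk
    Preference: a healthy node in the same brik as the failed copy, so
    the chunk's copies stay spread across briks; otherwise any healthy
    node elsewhere that doesn't already hold a copy.
    """
    brik = next(b for b, nodes in cluster.items() if failed_node in nodes)
    same_brik = [n for n in cluster[brik]
                 if n != failed_node and n not in holders]
    if same_brik:
        return same_brik[0]
    # brik-level failure: fall back to any other node without a copy
    for nodes in cluster.values():
        for n in nodes:
            if n != failed_node and n not in holders:
                return n
    raise RuntimeError("no healthy node available for repair")
```

Keeping the repaired copy inside the original brik preserves the cross-brik spread of the remaining copies, which is exactly what lets the cluster tolerate a subsequent brik failure.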
Perhaps the most interesting aspects of Atlas are how it exposes its underlying functions and integration points to the applications running on top of it – the Rubrik applications. First up, the meat of Rubrik's solution: mounting snapshots for restore/test purposes. While all of our backup data is immutable, meaning it cannot be changed in any way, Atlas leverages a “Redirect on Write” technique to mount these backups for test/dev/restore purposes. When a snapshot is requested for mount, Atlas can immediately assemble the point in time using incremental pointers – no merging of incrementals into full backups, no data creation of any kind – the full VM at that point in time is simply presented. Any writes issued to this VM are redirected – written elsewhere and logged – thus never affecting the original source data, all the while allowing the mounted snapshot to be written to.
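The redirect-on-write idea can be sketched in a few lines. The `RowMount` class below is a hypothetical, block-level toy – reads fall through to the immutable snapshot, while writes land in an overlay, so the source data is never modified.

```python
class RowMount:
    """Minimal redirect-on-write sketch: the base snapshot stays immutable;
    all writes are redirected into an overlay map."""

    def __init__(self, snapshot: dict):
        self._base = snapshot               # immutable point-in-time blocks
        self._overlay: dict = {}            # redirected (logged) writes

    def read(self, block: int) -> bytes:
        # overlay wins if the block was rewritten; otherwise read the base
        return self._overlay.get(block, self._base.get(block, b"\0" * 512))

    def write(self, block: int, data: bytes) -> None:
        self._overlay[block] = data         # never touches the base snapshot
```

Because the mount is just pointers plus an overlay, it can be presented instantly – there is no merge of incremental backups into a full image before the VM becomes usable.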
Atlas also exposes a lot of its underlying functionality to applications in order to improve performance. Take, for instance, the creation of a scratch or temporary partition – if Rubrik needs to instantiate one of these, it can tell Atlas that the data is indeed temporary. Atlas then doesn't need to replicate the file backing the partition at all, as it doesn't require protection and can simply be tossed away when we are done with it. That tossing away – the cleanup – can also be set at the application level. In the same example we could simply set a TTL, or expiry, on the scratch file and let the normal garbage-collection maintenance job clean it up during its regular run, rather than wasting time and resources on the application making second or third calls to do it. Applications can also leverage Atlas's placement policies, specifying whether files or data should be placed on SSD or spinning disk, or even whether that data should be located as close as possible to other data.
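Conceptually, those per-file hints might look something like this sketch. The `FileHints` fields (`replicate`, `tier`, `ttl`) and the `ScratchFS` store are invented names for illustration – the real Atlas interface is not public.

```python
import time
from dataclasses import dataclass, field

@dataclass
class FileHints:
    replicate: bool = True       # scratch data can opt out of replication
    tier: str = "hdd"            # placement preference: "ssd" or "hdd"
    ttl: float = None            # seconds until GC may reap it (None = keep)

@dataclass
class ScratchFS:
    # name -> (data, hints, creation timestamp)
    files: dict = field(default_factory=dict)

    def create(self, name: str, data: bytes, hints: FileHints) -> None:
        self.files[name] = (data, hints, time.time())

    def gc(self, now: float = None) -> list:
        """Reap expired files during the regular maintenance run."""
        now = now if now is not None else time.time()
        expired = [n for n, (_, h, created) in self.files.items()
                   if h.ttl is not None and now - created >= h.ttl]
        for n in expired:
            del self.files[n]
        return expired
```

The point of the design is that the application states its intent once at creation time (no replication, SSD placement, expiry), and the file system's existing maintenance machinery does the rest.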
So as you can see, although Rubrik is a very simple, policy-based, set-and-forget type of product, there is a lot of complexity under the hood – complexity that is abstracted away from the end user, but available to the underlying applications that make up the product. In my mind this paves the way for a quick development cycle: being able to leverage the file system for all it's worth without having to worry about the “crazy” configurations customers may have. We have certainly seen a major influx of custom-built file systems entering our data centers today – and this is not a bad thing. While the “off the shelf” commodity play may fit well for hardware, the software is evolving – and that is evident in the Rubrik Atlas file system. If you want to learn more, definitely check out their Tech Field Day 12 videos here – they had a lot more to talk about than just Atlas!