
Runecast – Proactive performance for your VMware environment! – Part 1 – Configuration

Have you ever opened up the VMware Hardening Guide and checked your environment against every single item listed? How about combed through the VMware Knowledge Base looking for every KB article that applies to the exact software builds and hardware you have? No? How about taken a list of industry best practices and ensured that you are indeed configured in the best possible way? Of course not – that would certainly take a lot of time, and most organizations simply don't have the resources to throw at those types of tasks. All that said, what if I told you that there was a piece of software that could tell you, almost instantly, whether or not you are compliant in those exact three scenarios? Interested yet? I thought you might be…

Enter Runecast

Before writing this review I'd never heard of Runecast, so first, a little bit about the company. Runecast was founded in 2014 in the quaint ol' city of London in the UK. Their goal: to provide proactive monitoring of our vSphere environments in order to save us time, prevent outages before they happen, ensure compliance at all times, and simply make our environments more secure. Now there are only four things listed there – but they are four things that Runecast does really, really well. With that said, I could talk about how much I enjoyed doing this review forever, but it's best to just jump right in and get monitoring…

Configuration

[Image: runecast-addvcenter]

As far as installation goes, Runecast comes bundled as a virtual appliance, so it's just a matter of deploying the analyzer into our environment. To help you get started Runecast offers a 30-day, full-featured free trial. Configuration-wise we really only have a couple of steps to perform: pointing the Runecast Analyzer at our vCenter Server and configuring our ESXi hosts to forward their logs. After deployment you should be brought to a screen similar to the one shown to the left. Simply follow the 'Settings' link and enter your vCenter Server information into Runecast as shown below.

[Image: runecast-vcenteradditiondetails]

Remember how we mentioned that configuration is divided into two steps? The first, connecting to our vCenter environment, is now complete. The second, setting up the forwarding of logs, is completely optional and can be completed at any time. We can still get valuable data from Runecast without log forwarding set up; however, in order to achieve a more holistic view of our environment we will continue and set up log forwarding.

There are many ways to set up our ESXi hosts to send their logs to Runecast. We can set them up manually, use a PowerCLI script, or enter the Runecast Analyzer information into our Host Profile. The Runecast interface has the smarts to configure this for us as well, and this review will follow the steps to set up log forwarding from within the Runecast Analyzer UI.

Selecting the "Status" section from the Log Analysis group and then clicking on the 'wrench' icon will allow us to configure one or many of our hosts to send their log files to Runecast. This process provides the same results as if we were to go and set the syslog advanced setting directly on the host's configuration; that said, utilizing Runecast for this is a much more automated and easier process. As you can see below, we also have the option to send our VM log files as well, which is a good idea if you are looking for complete visibility into your virtualization stack.

[Image: runecast-logging]

As far as configuration goes we are now done. That's it! Two simple steps and we are ready to start detecting problems within our environment. The process of going out and collecting data from our vCenter Server is called 'Analyze' within Runecast. Our analysis can be configured to occur on a schedule by navigating to the settings page (gear icon in the top right) or can be run on demand by clicking the 'Analyze Now' button from any screen within the application.

[Image: runecast-analyze]

How long this process takes depends greatly on the size of your environment. My test environment, small and simple as it is, only took a couple of minutes to gather the data. I'm sure this time would increase in a 32-host cluster with 1,000 or so VMs, though. That said, for the amount of data it gathers and the amount of comparisons going on behind the scenes, Runecast does a very efficient job of processing everything.

Navigating back to the 'Dashboard' as shown below immediately lets us start to explore the results of this analysis process. Almost instantaneously we can see many issues and best practices that can be applied within our environment. As you can see below I had a number of issues discovered – and I'd only had Runecast up and running for less than 5 minutes.

[Image: runecast-dashboard]

Runecast Terminology

Let's take a minute and dig a little into the data that is displayed on the 'Dashboard' screen. Most everything that Runecast monitors and does is rolled up here, giving us an at-a-glance view of everything we need to know. Let's break down the items that we are seeing here…

Issues – The term "issue" within Runecast represents a detected problem in our infrastructure – this can come from any single or combined instance of configuration settings, log file analysis, or software and hardware versions. Although issues may be discovered from configuration settings or log files, every issue belongs to one of three categories within Runecast: Knowledge Base articles, Security Guidelines, or Best Practices, explained below…

KBs – Runecast actively combs through the vast number of VMware Knowledge Base articles and displays any that may apply to our environment based on the hardware, software versions, and configuration we are running.

Best Practices – All of our inventory objects and configuration items are routinely scanned to determine whether or not they meet VMware-related best practices. This allows us to see whether we simply pass or fail in terms of having our environment running in its best possible configuration.

Security Compliance – Security Compliance takes all of the items within the official VMware Security Hardening guides and compares them to the configuration of our infrastructure. At a glance we are able to see how we stack up against the recommended security practices provided by VMware.

It's these four items – Issues, KBs, Best Practices, and Security Compliance – that are at the core of the Runecast analytical engine. Runecast automatically combs through all of these items and determines which ones apply to our environment, then reports back in a slick, clean UI, allowing us to see whether or not we are in compliance. In the next part of our review we will go into each of these items in a lot more detail – explaining how to drill down, resolve, and exclude certain metrics from our dashboards. For now, I certainly recommend checking out Runecast for yourself – as you saw, it's a simple install that can be up and running in your environment very quickly. So, while you wait for part 2 of the review, head on over to the Runecast page and grab yourself a free 30-day trial to start reporting on your environment. I'm sure you will be surprised at all of the abnormalities and non-compliant configurations you find right off the hop – I know I was! Stay tuned for part 2.

Automation using the Nakivo API

The Software Defined Data Center – it's everywhere. You can't go to any big trade show in the IT industry without hearing the phrase "Software Defined X" being tossed around at all of the booths. Over the last decade or so we have seen software take center stage in our data centers – being the glue that holds everything together. With this focus on software it's extremely important that companies develop and support APIs within their products. First, it's our way of taking application X and integrating it with application Y. Second, it's important for the success of the company – without an API, organizations may look elsewhere for a solution that provides one, and without an API, vendors cannot securely control access into their solutions, leaving customers developing unsupported and faulty applications to get around it.

One big example that shows the benefit of API integrations, and one that I always like to use, is the deployment of a VM. Sure, we use our hypervisor of choice to take our templates and clone VMs from them, providing some sort of automation and orchestration around the configuration of said VM – but the job doesn't simply end there. We have monitoring solutions we may need to add our VM into, we have IP management tools to integrate with in order to retrieve IPs and DNS information, and most importantly, we have to ensure that our newly created VM is adequately protected in terms of backup and recovery. With so many hands inside the data center creating VMs, our backup administrators might not always know a certain VM has been created – and when a failure occurs, there's a pretty good chance we won't be able to recover without any backups – so it's this situation we will look at today…

Automatically protecting our VMs

Our software of choice today will be Nakivo Backup & Replication, from a company based out of Silicon Valley providing data protection solutions. Nakivo provides full API integration into their backup suite, allowing administrators and developers to create automation around the creation, modification, and removal of jobs. The scope of our integration will be as follows – let's create a simple vRealize Orchestrator workflow that will allow us to right-click a VM from within the vSphere Web Client and add that VM into an already existing backup job. From here I'll let your imagination run wild – maybe you integrate this code into your VM deployment workflow to automatically protect the VM on creation – the point is that we have a starting point to look at the possibilities of consuming Nakivo's API and creating some automation within your environment for backup and recovery.

[Image: nakivoapi-apidoc]

A little about the Nakivo API

Before we get into the actual creation of the vRO workflow it's best we understand a little bit about the Nakivo API itself. Nakivo provides an API based around JSON, so all of our requests and responses will be formatted as JSON. These requests all go through using POST and are always sent to the /c/router endpoint (i.e. https://ip_of_nakivo:4443/c/router). As far as authentication goes, Nakivo utilizes cookie-based authentication – what this means is that our first request will be sent to the login method, upon which we will receive a JSESSIONID that we will have to pass with every subsequent request in order to secure our connection. As we can see from the example request below, requests need to be formatted in such a way that we first specify an instance (e.g. AuthenticationManagement, BackupManagement, InventoryManagement) and a method (e.g. login, saveJob, getJob). From there we attach the data associated with the method and instance, as well as a transaction id (tid). The transaction id can be an auto-incrementing integer if you like, or can simply be set to any integer – its main purpose is to group multiple method calls into a single POST, which we won't be doing anyway, so you will see I always use 1.

var requestJSON = "{'action': 'AuthenticationManagement','method':'login','data': [admin,VMware1!,true],'type': 'rpc','tid': 1}";

Above we show an example of a login request in JavaScript, because this is the language of choice for vRealize Orchestrator, which we will be using – but do remember that you could use PHP/Java/PowerShell – whatever language you want, so long as you can form an HTTP request and send JSON along with it.
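To illustrate that point, here is a rough sketch of the same login call made outside of vRO with plain Node.js (version 18 or later, which ships a global fetch). The endpoint, action, method, and credentials come straight from the example above; everything else (the function and variable names, the certificate note) is purely illustrative.

// A minimal sketch of the Nakivo login call from Node.js rather than vRO.
// If the appliance still uses a self-signed certificate you may need to trust it
// (for example via NODE_EXTRA_CA_CERTS) before this request will succeed.
async function login() {
    var response = await fetch("https://ip_of_nakivo:4443/c/router", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
            action: "AuthenticationManagement",
            method: "login",
            data: ["admin", "VMware1!", true],   // quoted here as proper JSON strings
            type: "rpc",
            tid: 1
        })
    });
    // The JSESSIONID comes back in the Set-Cookie header; replay it on every
    // subsequent request, exactly as the vRO code further below does.
    return response.headers.get("set-cookie");
}

login().then(function (cookie) { console.log(cookie); });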

On with the workflow

Before diving right into the code it’s best to take a look at the different sections or components that we will need to run through in order to add a given VM to a Nakivo job through vRealize Orchestrator.  With that said we can break the process down into the following sections…

  • Add Nakivo as an HTTPREST object within vRO
  • Create a workflow with the VM as an input parameter and the Nakivo HTTPREST host as an attribute
  • Create some variables in regards to our VM (e.g. name, cluster)
  • Log in to Nakivo to retrieve a session
  • Retrieve our target job
  • Find the VM's cluster ID within Nakivo (the cluster ID is required in order to find the actual VM within Nakivo)
  • Gather VM information from within Nakivo
  • Gather information about our repository from within Nakivo
  • Build the JSON request and add the VM to the job

With our workflow broken down into manageable chunks, let's go ahead and start coding.

Add Nakivo as an HTTPRest object.

If you have ever worked with the HTTP-REST plugin within vRO then this will seem like review – however, for those that haven't, let's take a look at the process of getting this set up. From within the workflow view simply run the 'Add a REST host' workflow located under the HTTP-REST/Configuration folders. As far as parameters go, simply give the host a name, use https://ip_of_nakivo:4443 as the URL, and be sure to select 'Yes' under the certificate acceptance, as shown below.

[Image: nakivoapi-addhost]

The remaining steps don't really matter as far as adding Nakivo as a REST host within vRO goes – for authentication I selected Basic and provided the credentials for Nakivo, however this really doesn't matter as we are going to use cookie/header-based authentication through our code anyway – something just needs to be selected and entered within vRO. After clicking submit, the NakivoAPI REST host should be added to our vRO inventory.

Workflow creation

As far as the workflow goes I’ve tried to keep it as simple as possible, requiring only 1 input attribute and 1 input parameter as follows

  • Input Attribute (Name: NakivoAPI – Type: RESTHost – Value: set to the Nakivo REST host object created earlier)

[Image: Nakivoapi-attribute]

  • Input Parameter (Name: sourceVM – Type: VC:VirtualMachine)

[Image: nakivoapi-parameter]

Code time!

After this, simply drag and drop a scriptable task into the schema and we can get started with the code! I've always found it easier to simply display all the code and then go through the main sections by line afterwards. The JavaScript we need can be found below…

var vmName = sourceVM.name
var cluster = System.getModule("com.vmware.library.vc.cluster").getComputeResourceOfVm(sourceVM);
var clusterName = cluster.name;
 
// login and retrieve sessionID
var requestJSON = "{'action': 'AuthenticationManagement','method':'login','data': [admin,VMware1!,true],'type': 'rpc','tid': 1}";
var request = NakivoAPI.createRequest("POST", "/c/router", requestJSON);
request.setHeader("Content-Type","application/json");
var response = request.execute();
var headers = response.getAllHeaders();
var cookie = headers.get("Set-Cookie");
 
// retrieve target job
requestJSON = "{'action': 'JobManagement','method':'getJob','data': [1],'type': 'rpc','tid': 1}";
request = NakivoAPI.createRequest("POST","/c/router",requestJSON);
request.setHeader("Content-Type","application/json");
request.setHeader("Cookie", cookie);
response = request.execute();
var jsonResponse = JSON.parse(response.contentAsString);
var job = jsonResponse.data;
 
// find clusterID
requestJSON = "{'action': 'InventoryManagement','method':'collect','data': [{'viewType':'VIRTUAL_ENVIRONMENT'}],'type': 'rpc','tid': 1}";
request = NakivoAPI.createRequest("POST","/c/router",requestJSON);
request.setHeader("Content-Type","application/json");
request.setHeader("Cookie", cookie);
response = request.execute();
jsonResponse = JSON.parse(response.contentAsString);
 
// reduce to datacenters
var vcenter = jsonResponse.data.children[0];
var datacenters = vcenter.children;
var datacenter;
var cluster;
for ( var p in datacenters)
{
	for (var c in datacenters[p].children)
	{
		if (datacenters[p].children[c].name == clusterName)
		{
			cluster = datacenters[p].children[c];
		}
	}
}
var clusterid = cluster.identifier;
 
// look in cluster for VM info...
requestJSON = "{'action': 'InventoryManagement','method':'list','data': [{'nodeType':'VMWARE_CLUSTER','nodeId': '" + clusterid + "','includeTypes': ['VM'] }],'type': 'rpc','tid': 1}";
request = NakivoAPI.createRequest("POST","/c/router",requestJSON);
request.setHeader("Content-Type","application/json");
request.setHeader("Cookie", cookie);
response = request.execute();
jsonResponse = JSON.parse(response.contentAsString);
var vms = JSON.parse(response.contentAsString);
vms = vms.data.children;
var vm;
for (var p in vms)
{
	if (vms[p].name == vmName)
	{
		vm = vms[p];
	}
}
 
// get more info on VM
requestJSON = "{'action': 'InventoryManagement','method':'getNodes','data': [true, ['"+ vm.vid + "']],'type': 'rpc','tid': 1}";
request = NakivoAPI.createRequest("POST","/c/router",requestJSON);
request.setHeader("Content-Type","application/json");
request.setHeader("Cookie", cookie);
response = request.execute();
var vminfo = JSON.parse(response.contentAsString);
vminfo = vminfo.data.children[0];
var vmdisk = vminfo.extendedInfo.disks[0].vid;
 
// get target storage
requestJSON = "{'action': 'InventoryManagement','method':'list','data': [{'includeTypes': ['BACKUP_REPOSITORY'] }],'type': 'rpc','tid': 1}";
request = NakivoAPI.createRequest("POST","/c/router",requestJSON);
request.setHeader("Content-Type","application/json");
request.setHeader("Cookie", cookie);
response = request.execute();
jsonResponse = JSON.parse(response.contentAsString);
var targetVid = jsonResponse.data.children[0].vid;
 
//build data portion of JSON to add VM to job
var jsonSTR = '{ "sourceVid": "' + vminfo.vid + '","targetStorageVid": "' + targetVid + '","mappings": [{"type": "NORMAL","sourceVid": "' + vmdisk + '"}], "appAwareEnabled": false}';
var json = JSON.parse(jsonSTR);
 
//push new object to original job
job.objects.push(json);
System.log(JSON.stringify(job));
 
// let's try and push this back in now....
requestJSON = "{'action': 'JobManagement','method': 'saveJob', 'data': [" + JSON.stringify(job) + "],'type': 'rpc','tid': 1}";
request = NakivoAPI.createRequest("POST","/c/router",requestJSON);
request.setHeader("Content-Type","application/json");
request.setHeader("Cookie", cookie);
response = request.execute();
 
// done!!!!

Lines 1-3 – Here we simply set up a few variables we will need later in the script: vmName, which is assigned the name attribute of our input parameter sourceVM, and clusterName, which we get by running a built-in action that returns the cluster the VM belongs to. Both of these variables will be needed when we gather the information required to add the VM to the backup job.

Lines 5-11 – This is our request to log in to Nakivo. As you can see, we simply create our request and send the login method, along with the associated login data, to the AuthenticationManagement interface. This request authenticates us and sends back the JSESSIONID that we need in order to make subsequent requests, which we store in a cookie variable on line 11.

Lines 13-20 – Again, we make a request to Nakivo, this time to get the job that we want to add the VM to. I only have one job within my environment, so I've simply utilized the getJob method and sent a data value of 1 (the job ID), since I know that is my one and only job ID in the system. If you don't know the job ID you may need to write a similar request to look it up first – Nakivo does provide methods within their API to search for a job ID by the job name. Also note that since this is a subsequent request after a login we are sending our cookie authentication data on line 17, and we are taking our response data and storing it in a variable named job on line 20 – we will need this later when we update the job.

Lines 22-45 – This is a request to the InventoryManagement interface that we use to find out the ID, as it exists within Nakivo, of the cluster housing the virtual machine. First, on line 23 we build a request that returns our complete virtual infrastructure inventory, which we then parse through on lines 35-44 looking for a match on our cluster name. I've had to loop through data centers as my test environment contains more than one virtual data center. Finally, on line 45 we simply assign the clusterid variable the Nakivo identifier of the cluster.

Lines 47-73 – Here we use our cluster identifier and list out the VM inventory within it. While looping through, when we find a match on our VM name we simply assign it to a vm variable. We then, on line 66, send a request to the InventoryManagement interface again, this time looking at the virtual machine level and sending the identifier of our newly discovered VM. Once we have the response we assign the identifier of the VM's disk to a variable on line 73. Again, I know this environment and I know the VM only contains one disk, so I've hard-coded my index – if the disk count were unknown, or this were truly automated, you would most likely have to loop through the disks here, as sketched below.
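For reference, here is a minimal sketch of what that loop might look like, building one mapping entry per disk instead of hard-coding index 0. It assumes every entry under extendedInfo.disks exposes a vid the same way the first one does:

// Build one job mapping per disk rather than assuming a single disk.
// Assumes each entry in vminfo.extendedInfo.disks carries a 'vid' like disks[0] above.
var mappings = [];
for (var d in vminfo.extendedInfo.disks) {
	mappings.push({
		type: "NORMAL",
		sourceVid: vminfo.extendedInfo.disks[d].vid
	});
}
// JSON.stringify(mappings) could then replace the single hard-coded mapping
// when the job JSON is built further down.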

Lines 75-82 – This block of code is used to get the identifier of the target storage, or repository, within Nakivo. Again we need this information for our final request that will add the VM to the job – and again, this is a known environment, so I could simply hard-code my array index on line 82 to return the proper repository (as there is only one). A name-based lookup, sketched below, would be safer in an environment with multiple repositories.
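A sketch of that lookup, assuming repository entries expose a name property like the other inventory items we have seen (the repository name used here is made up):

// Pick a repository by name instead of taking the first child returned.
// Assumes repository entries expose 'name' like the cluster and VM inventory items;
// "MyRepository" is a hypothetical name for illustration.
var repoName = "MyRepository";
var targetVid;
for (var r in jsonResponse.data.children) {
	if (jsonResponse.data.children[r].name == repoName) {
		targetVid = jsonResponse.data.children[r].vid;
	}
}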

Lines 84 – 90 – Here we are simply building out the JSON variable that we need in order to push all of the information we have previously gathered above.  We basically form our string on line 85, convert it to JSON directly after, and push it into the original job variable we set on line 20.

Lines 92-99 – Ah, finally – this block takes all of our hard work and pushes the job back into the saveJob method of the Nakivo JobManagement interface. Once executed, you should see your job info within Nakivo update, reflecting the new VM added to the job.
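One last note on the script itself: every call after the login repeats the same create/setHeader/execute boilerplate, so if you plan to build on this it may be worth wrapping that pattern in a small helper. A sketch, using only the objects already used above (note it emits standard double-quoted JSON, which the JSON-based API should accept):

// Helper wrapping the repeated request boilerplate from the script above.
// 'NakivoAPI' is our RESTHost attribute and 'cookie' is the value captured at login.
function callNakivo(action, method, data) {
	var body = JSON.stringify({
		action: action,
		method: method,
		data: data,
		type: "rpc",
		tid: 1
	});
	var req = NakivoAPI.createRequest("POST", "/c/router", body);
	req.setHeader("Content-Type", "application/json");
	req.setHeader("Cookie", cookie);
	return JSON.parse(req.execute().contentAsString);
}

// Example: the getJob call from lines 13-20 becomes a one-liner.
// var job = callNakivo("JobManagement", "getJob", [1]).data;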

So there you have it! A completely automated way of selecting a VM within vRealize Orchestrator and adding it to a Nakivo backup job – all without having to open up the Nakivo UI at all!

But wait, there’s more!

Ahh – we eliminated the need to open up the Nakivo UI, but how about eliminating the Orchestrator client as well and simply executing this workflow directly from within the vSphere Web Client? Sounds like a good idea to me! If you have properly integrated vRO and vSphere – and I say that because it can sometimes be difficult – then doing this is a pretty easy task.

Within the 'Context Actions' tab of our vRO configuration within the web client, simply click '+' to add a new action. As shown below, we can browse our workflow library, select our newly created Nakivo workflow, and associate it with the right-click context menu of a virtual machine.

[Image: nakivoapi-context]

What we have essentially done is allow our administrators to simply right-click on a VM, browse to 'All vRealize Orchestrator Actions', and click on our workflow name. From there the vRO workflow will take the associated VM (the one we right-clicked on) and assign it to our sourceVM parameter – meaning we've taken the complete process of logging into Nakivo, editing our backup job, adding a new VM, and saving it, and converted it into a simple right-click followed by a left-click – without having to leave the vSphere Web Client!

[Image: nakivoapi-rightclick]

So, all in all, this is a pretty basic example of some of the things we can do with the Nakivo API – and it followed a pretty simple and stripped-down workflow – but the point is that Nakivo offers a wide variety of methods and integration points into their product. Pretty much anything you can do within the GUI can be performed by making calls to the API. This is what helps a product integrate into the Software Defined Data Center – and what allows administrators to save time and provide consistency, all the while ensuring our data is protected. Nakivo also has a wide variety of documentation, as well as a Java SDK built around their API, complete with explanations of all of the interfaces provided. If you are interested in learning more about Nakivo's API or Nakivo's products in general, head on over to their site here – you can get started for the low cost of free! Until next time, happy automating!

Lessons learned from #vDM30in30

Phew! I'm not sorry to say that #vDM30in30 is over with! Not that it wasn't a lot of fun, but honestly, it's a lot of work – especially when juggling family, travel, the day job and all! One might think that simply blasting out 30 pieces of content in 30 days would be relatively easy – but it's not! That said, I learned a lot about my writing process and style during this challenge, and as my final – and, unfortunately, only 28th – post of the month I'd like to share those lessons with you…

The challenge of topics

It's not easy coming up with topics to write about, especially when writing so often. I was lucky enough to have had a handful of ideas already sitting in my drafts folder – and #vDM30in30 finally gave me the opportunity to write about them. That said, I know I thought of more throughout the month and simply forgot to write them down. So whatever your means of tracking ideas (drafts, post-its, bullet journals), write them down! I found that if I didn't commit an idea to something I would forget it. Needless to say I have a dozen or so topics just sitting in my drafts now – which leads me to the next challenge…

The challenge of time

Surely this is the biggest hurdle of all – finding the time to articulate yourself and get a blog post written. I find that this varies for me – for some topics I'll simply start writing and have a complete post hashed out in an hour or so. For others I find myself having to go do research, reading other blogs and whitepapers, trying to fully understand what I'm writing about 🙂  Those are the ones that sometimes take days – 10 minutes here and there, revisiting the same ol' things. For me it's best to dedicate all the time I need to write the post in one sitting – otherwise I have a hard time reading my own writing once I revisit the post. That said, time is a tricky thing to find – we have families, commitments, other things we need to take care of – so what I did was constantly evaluate how I was spending my time. If I was watching a Habs game I would try to at least do something "blog productive" while doing so. Those endless hours on an airplane – perfect for editing and getting things ready! My advice here: just use your time wisely and don't sacrifice the things you love the most just to write a blog post – the kids will eventually go to sleep – do it then 🙂

The challenge of writing

Perhaps this is the oddest hurdle to overcome. Sometimes the words just come; other times I struggle trying to explain myself. There were times where, even though I knew I would have a hard time coming back to complete a post, I simply had to walk away. If you are burnt out, nothing will make sense. Take breaks, either small or large – we are all different, just find what works for you. For me, that was walking…

So I’m happy to say that even though I was two shy of the infamous thirty – I did learn some things about my writing process and styles.  With that said, here’s a look at what I accomplished throughout the month of November on mwpreston.net.

Tech Field Day 12 Stuff

My favorite Veeamy things…

Other vendor stuff

My Friday Shorts

Randoms

So there you have it!  Thanks all for following along and reading and I hope to participate next year as well.  All that said, don’t expect a post per day to continue here – I need some sleep!

The Atlas File System – The foundation of the Rubrik Platform

One of the core selling points of the Rubrik platform is the notion of something called "unlimited scale" – the ability to start small and scale as large as you need, all the while maintaining its masterless deployment! Up until a few weeks ago I was unaware of how they actually achieved this, but after watching Adam Gee and Roland Miller present at Tech Field Day 12 in San Jose I have no doubt that the Atlas file system is the foundation upon which all of Rubrik is built.

[Image: atlascore]

As shown above we can see how the platform is laid out by Rubrik – with the Atlas file system sitting at the core of the product and communicating with nearly every other component in the Rubrik platform. Now picture each node containing exactly this same picture, scaling up to whatever number of nodes you might have – each node containing its own Atlas file system, with its own local applications accessing it – while the storage is distributed and treated as one scalable blob addressable by a single global namespace.

Disclaimer: As a Tech Field Day 12 delegate all of my flight, travel, accommodations, eats, and drinks are paid for. However I did not receive any compensation nor am I required to write anything in regards to the event or the presenting companies. All that said, this is done at my own discretion.

Atlas – a distributed scalable file system.

As shown above, other core modules such as Callisto, Rubrik's distributed metadata store, and the Cluster Management System all leverage Atlas under the hood – and in turn Atlas utilizes these for some of its functions. For instance, to make Atlas scalable it leverages data from the Cluster Management System to grow and shrink – when a new brik is added, Atlas is notified via the CMS, at which point the capacity from the new nodes is added to the global namespace, increasing the total capacity available as well as the flash resources to consume for things such as ingest and cache. It should also be noted that Atlas takes care of data placement, so adding a new node to the cluster will trigger it to re-balance; however, it's got the "smarts" to process this as a background task and take into account all of the other activities occurring within the cluster, which it learns about from the Distributed Task Framework – meaning we won't see a giant performance hit directly after adding new nodes or briks, due to the tight integration between all of the core components.

Adding disk and scaling is great; however, the challenge for any distributed file system is how to react when failures occur, especially when dealing with low-cost commodity hardware. Atlas performs file system replication in a way that provides for failure at both the disk level and the node level, allowing two disks or one full node to fail without experiencing data loss. How Atlas handles this replication depends solely on the version of Rubrik in your datacenter today. Pre-3.0 releases used a technology called mirroring, which essentially triple-replicated our data across nodes. Although triple replication is a great way to ensure we don't experience any loss of data, it does so at the expense of capacity. The Firefly release, 3.0 or higher, implements a different replication strategy via erasure coding. By its nature, erasure coding takes the same data that we once would have replicated three times and splits it into chunks – the chunks are then processed, and additional encoded chunks are created which can be used to rebuild the data if need be. To put rough numbers on it, a hypothetical 4+2 erasure-coding layout stores six chunks for every four chunks of data, a 1.5x footprint versus the 3x of triple replication. It's these chunks that are intelligently placed across disks and nodes within our cluster to provide availability. The short of the story here is that erasure coding gives us the same benefit as triple replication without the cost of having triple the capacity – therefore more space will be available within Rubrik for what matters most, our data.

Aside from replication of our data, Atlas employs other techniques to keep our data available as well – items such as self-healing and CRC detection allow Atlas to throw away and repair data as it becomes corrupt. Now these are features we expect to see within file systems, but Atlas can handle this a little differently due to its distributed architecture. The example given was with three briks, each containing four nodes – when a node fails, or data becomes corrupt, Atlas actually repairs the data on a surviving node within the same brik, ensuring the data is still spread out across briks. If a brik happens to fail, a chunk of data might then have to live on the same brik as another, but it would be placed on another node, still allowing for node failure. It's this topology-aware placement that really allows Rubrik to maximize its data availability and provide protection not only across nodes within a brik but across brik failures as well, maximizing the failure tolerance guarantees they are providing.

Perhaps the most interesting aspects of Atlas, though, are how it exposes its underlying functions and integration points to the applications running on top of it, the Rubrik applications. First up, the meat of Rubrik's solution: mounting snapshots for restore/test purposes. While all of our backup data is immutable, meaning it cannot be changed in any way, Atlas does leverage a "Redirect on Write" technology in order to mount these backups for test/dev/restore purposes. What this means is that when a snapshot is requested for mount, Atlas can immediately assemble the point in time using incremental pointers – no merging of incrementals into full backups or data creation of any kind – the full VM at that point in time is simply presented. Any writes issued to this VM are redirected, or written elsewhere and logged – thus not affecting the original source data whatsoever, all the while allowing the snapshot to be written to.

Atlas also exposes a lot of its underlying functionality to applications in order to improve performance. Take, for instance, the creation of a scratch or temporary partition – if Rubrik needs to instantiate one of these it can tell Atlas that it is indeed temporary – thus Atlas doesn't need to replicate the file making up the partition at all, as it doesn't necessarily require protection and can simply be tossed away when we are done with it. And that tossing away, the cleaning up after itself, can also be set at an application level. In that same example we could simply set a TTL or expiry on our scratch file and let the normal garbage collection maintenance job clean it up during its normal run, rather than wasting time and resources having the application make second or third calls to do it. Applications can also leverage Atlas's placement policies, specifying whether files or data should be placed on SSD or spinning disk, or even whether said data should be located as close as possible to other data.

So as you can see, although Rubrik is a very simple, policy-based, set-and-forget type of product, there is a lot of complexity under the hood. Complexity that is essentially abstracted away from the end user, but available to the underlying applications making up the product. In my mind this paves the way for a quick development cycle – being able to leverage the file system for all it's worth while not having to worry about "crazy" configurations customers may have. We have certainly seen a major influx of custom-built file systems entering our data centers today – and this is not a bad thing. While the "off the shelf", commodity-type play may fit well for hardware, the software is evolving – and this is evident in the Rubrik Atlas file system. If you want to learn more, definitely check out their Tech Field Day 12 videos here – they had a lot more to talk about than just Atlas!

VembuHIVE – A custom built file system for data protection

Virtualization has opened many doors in terms of how we treat our production environments. We are now vMotioning or Live Migrating our workloads across a cluster of hosts – we are cloning workloads with ease and deploying new servers into our environments at a very rapid rate. We have seen many advantages and benefits from the portability and encapsulation that virtualization provides. For a while, though, our backups were treated the same – simply copies of our data sitting somewhere else, only utilized during those situations when a restore was required. That said, over the past 5 years or so we have seen a shift in what we do with our backup data as well. Sure, it's still primarily used for restores, both at a file and an image level – but backup companies have begun to leverage that otherwise stale data in ways we could only imagine. We see backups being used for analytics, compliance, and audit scans. We see backups now being used in a DevOps fashion – allowing us to spin up isolated, duplicate copies of our data for testing and development purposes. We have also seen the 'restore' process dwindling away, with the 'instant' recovery feature taking its place, powering up VMs immediately from within the deduplicated and compressed backup files and drastically decreasing our organization's RTO.

So with all of this action being performed on our backup files, a question of performance comes into play. No longer is it OK to simply store our backups on a USB drive formatted with a traditional file system such as FAT or NTFS. The type of data we are backing up – modern virtualization disk images such as VHDX and VMDK – demands something more from the file system it's living on, which is why Vembu, a data protection company out of India, has developed its own file system for storing backups: VembuHIVE.

Backups in the HIVE

When we hear the word VembuHIVE we can't help but turn our attention towards bees – and honestly, they make the perfect comparison for how the proprietary file system from Vembu performs. A bee hive at its most basic is the control center for bees – a place where they all work collectively to support themselves and each other – the hive is where the bees harvest their magic, organizing food, eggs, and honey. The VembuHIVE is the central point of storage for Vembu's magic, storing the bits and controlling how files are written, read, and pieced together. While VembuHIVE can't produce honey (yet), it does produce data. And it's because of the way that VembuHIVE writes and reads our source data that we are able to mount and extract our backups in multiple file formats such as ISO, IMG, VMDK, and VHDX – in near-instant fashion.

In essence, VembuHIVE is like a virtualized file system overlaid on top of your existing file system, with utilities that mimic those of other OS file systems – I know that's a mouthful, but let's explore that some more.

Version Control is key

In my opinion the key characteristic that makes VembuHIVE work is version control – each and every file produced is accompanied by metadata describing what version, or point in time, the file is from. Probably the easiest comparison is to Git.

We all know of Git – the version control system that keeps track of changes to our code. Git solved a number of issues within the software development ecosystem. For instance, instead of copying complete projects before making changes we could simply branch – Git tracks changes to source code and stores only those lines which have changed, allowing us to easily roll back or forward to any point in time within our code, reverting and redoing any changes that were made. This is all done by storing only the changes plus metadata describing those changes – which in the end gives us a very fast way to revert to different points and fork off new ones, all the while utilizing our storage capacity in the most efficient way possible.

VembuHIVE works in much the same way as Git; however, instead of tracking source code we are tracking changed blocks within our backup files – allowing us to roll back and forward within our backup file chain. Like most backup products, Vembu will create a full backup during the first run and subsequently utilize CBT within VMware to copy only changed blocks during incremental backups. That said, the way it handles and intelligently stores the metadata of those incremental backups allows Vembu to present any incremental backup as what they call a virtual full backup. Basically, this is what allows Vembu BDR to expose our backups, be they full or incremental, in various file formats such as VMDK and VHDX. This is done without performing any conversion on the underlying backup content, and in the case of incremental backups there is no merging of changes into the previous full backup beforehand. It's simply an instant export of our backups in whatever file format we choose. I mention that we can instantly export these files, but it should be noted that these point-in-time backups can be instantly booted and mounted as well – again, no merge, no wait time.
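To make the 'virtual full' idea a bit more concrete, here is a toy sketch in JavaScript – purely illustrative, and in no way Vembu's actual implementation or on-disk format – of how a chain of block-level increments can be read as a full image of any restore point without merging anything: reading a block simply walks back through the chain until the newest version of that block is found.

// Toy illustration: a backup chain stored as one full plus block-level increments.
// Presenting restore point N as a "virtual full" never merges anything; each block
// read just walks backwards until a version of that block is found.
var chain = [
	{ name: "full",  blocks: { 0: "A0", 1: "B0", 2: "C0" } },
	{ name: "incr1", blocks: { 1: "B1" } },            // only block 1 changed
	{ name: "incr2", blocks: { 0: "A2", 2: "C2" } }    // blocks 0 and 2 changed
];

function readBlock(restorePoint, blockId) {
	for (var i = restorePoint; i >= 0; i--) {
		if (chain[i].blocks.hasOwnProperty(blockId)) {
			return chain[i].blocks[blockId];
		}
	}
	return null;
}

// The "virtual full" view of incr1 assembles to A0, B1, C0 on read.
console.log([readBlock(1, 0), readBlock(1, 1), readBlock(1, 2)].join(","));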

VembuHIVE also contains most of the features you expect to see in a modern file system. Features such as deduplication, compression, and encryption are all available within VembuHIVE. On top of this, VembuHIVE contains built-in error correction. Every data chunk within the VembuHIVE file system has its own parity file – meaning when data corruption occurs, VembuHIVE can reference the parity file in order to rebuild or repair the data in question. Error correction within VembuHIVE can be performed at many levels as well, protecting data at a disk-image, file, chunk, or backup-file level – I think we are covered pretty well here.

Finally, we've mentioned a lot that we can instantly mount and export our VMs at a VM level; however, the intelligence and metadata within the VembuHIVE file system go way beyond that. Aside from exporting as VMDKs or VHDXs, VembuHIVE understands how content is organized within the backup file itself – paving the way for instant restores at an application level – think Exchange and Active Directory objects here. Again, this can be done instantly, from any restore point at any point in time, without performing any kind of merge process.

In the end, VembuHIVE is really the foundation of almost all the functionality that Vembu BDR provides. In my opinion Vembu has made the correct decision by architecting everything around VembuHIVE and by first developing a purpose-built, modern file system geared solely at data protection. A strong foundation always makes for a strong product, and Vembu has certainly embraced that with their implementation of VembuHIVE.

Friday Shorts – VeeamON, Storage Protocols, REST, and Murica!

“If that puck would’ve crossed the line Gord, that would’ve been a goal!” – Pierre McGuire – A Mr Obvious, annoying hockey commentator that drives me absolutely insane! (Sorry, watching the Habs game as I put all this together :))

Jambalaya and Backups – Get there!

Veeam had some big announcements this year, along with a slew of new product releases, betas, and big updates to existing products. All that said, we can only assume that VeeamON, the availability conference focused on the green, is going to be a big one! This year it takes place May 16-18 in New Orleans – a nice break from the standard Vegas conferences! I've been to both VeeamON conferences thus far and I can tell you that they are certainly worth it – all of Veeam's engineers and support folks are there, so if you have a question, yeah, it'll get answered and then some! So, if you can go, go! If you can't, if it's a money thing – guess what??? Veeam is raffling off 10, yes 10, fully paid (airfare, hotel, conference) trips over the holidays – so yeah, go sign up!

But we have a REST API?

Although this post by John Hidlebrand may be a month old, I just read it this week and it sparked some of my own inner frustrations that simmer deep inside me 🙂  John talks about how having a REST API is just not enough at times – and I completely agree! I'm seeing more and more companies simply state "oh yeah, we have a REST API, we are our first customer!" That's all well and great – but guess what, you wrote it and you know how to use it! All too often companies are simply developing the API and releasing it, but without any documentation or code examples on how to consume it! John brings up a good point: how about having some PowerShell cmdlets built around it? How about having an SDK we can consume? Building your application off of a REST API is a great start, don't get me wrong, but if you want people to automate around your product – help us out a little please 🙂

In through iSCSI, out through SMB, in through SWIFT, out through REST

Fellow Veeam Vanguard and TFD12 delegate Tim Smith has a post over on his blog describing a lot of the different storage protocols on the market today and how EMC, sorry, Dell-EMC Isilon is working to support them all without locking specific data to a specific protocol. If you have some time I'd certainly check it out!

Happy Thanksgiving Murica!

I've always found it odd that Canadians and Americans not only celebrate Thanksgiving on different days, but in different months as well! Come to find out there are quite a few other differences as well. You can see the two holidays compared on the diffen.com site. It makes sense that we here in Canada celebrate a bit earlier – especially if our thanks revolves around the harvest. I mean, no one wants to wait until November in Canada to harvest their gardens and crops – you'd be shoveling snow off of everything! Either way – Happy Thanksgiving to all my American friends – may your turkey comas be long-lasting!