Picture yourself in this scenario: you walk into work on a Monday morning and are promptly greeted by just about every IT staff member in the building. They quickly tell you that certain mission-critical services are down and the company is losing money as you speak. But that’s OK, you are here, the infamous backup guy, the guy no one cares about unless things go down. Nonetheless, you sit at your desk and begin the restore process. “No problem,” you say, “services should be back up in 10 minutes or so…” The VM is fully restored and powered on. You sit watching the console and the Windows spinning wheel, and all of a sudden, instead of a login screen, you see a boot failure!
Yikes! Your backups, just like their production counterparts, are corrupt. You try a different restore point, but no go: they are all corrupt. Believe it or not, this is a common scenario that plays out inside organizations everywhere. Backups, just like any other production workload or service, need to be tested to ensure that they are indeed restorable. The best way to test them is to perform full restores of the data; however, doing so after each and every backup job can be quite time-consuming and inefficient in terms of resource usage.
Enter Nakivo
Nakivo does a great job of backing up and protecting your VMware and Amazon environments, and has a lot of features included in its product, Nakivo Backup & Replication. I’ve previously reviewed the Nakivo product as a whole here if you’d like to check it out. This post will focus on one feature, a feature that helps prevent situations like the one above: Nakivo Screenshot Verification. There is no worse feeling than having terabytes of backups that prove to be unreliable and, in turn, useless. That is exactly why Nakivo has built Screenshot Verification into its Backup & Replication software: to give you peace of mind that when push comes to shove, your backups will indeed be restorable!
What is Screenshot Verification?
Screenshot verification is a simple concept with a lot of underlying technology at play. In its basic form, Nakivo verifies each VM backup after a new restore point is completed. This is done by booting the VM directly from its corresponding deduplicated and compressed backup files located on a Nakivo backup repository, a process that Nakivo calls Flash VM Boot. During a Flash VM Boot, Nakivo creates a new VM on a specified ESXi server. It then takes the disks from within the backup files and exposes them as iSCSI targets; once that completes, the disks are mounted to the new VM as virtual RDMs (vRDMs). A snapshot is created so that any changes can be discarded, and the newly created VM is powered on, isolated from the production network. Once the VM has booted, Nakivo uses VMware Tools to take a screenshot of the booted OS. After the screenshot is taken, the newly created VM is discarded, the backup files are brought back to a consistent state, and the screenshot is included in any job reports, whether generated through the UI or emailed.
It’s this screenshot that gives you the “peace of mind” that when the time comes to restore your VMs, they will indeed be restorable! A simple picture of the Windows login screen or Linux bash shell, or lack thereof, certainly would’ve helped in the above scenario. It would have alerted us that problems may occur the next time we try to reboot our production VM or restore from a backup, giving us the leeway and time to fix the situation or restore to our last known good restore point on our own terms rather than during an emergency.
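To make the order of operations easier to follow, here is a minimal sketch that simply records the Flash VM Boot steps in the sequence described above. It does not talk to vSphere or Nakivo at all; `VerificationRun`, `simulate_verification`, and the `guest_boots` flag are hypothetical names made up for illustration, not part of any Nakivo API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VerificationRun:
    """Result of one simulated verification pass (hypothetical type)."""
    vm_name: str
    steps: List[str] = field(default_factory=list)
    passed: bool = False


def simulate_verification(vm_name: str, disks: List[str], guest_boots: bool) -> VerificationRun:
    """Record the steps in the order the post describes them.

    Nothing here touches real infrastructure; each step is only logged.
    """
    run = VerificationRun(vm_name)
    run.steps.append("create temporary VM on the target ESXi host")
    # All disks are exposed first, then attached once that completes.
    for disk in disks:
        run.steps.append(f"expose {disk} from the backup as an iSCSI target")
    for disk in disks:
        run.steps.append(f"attach {disk} to the temporary VM as a vRDM")
    run.steps.append("snapshot the temporary VM so changes can be discarded")
    run.steps.append("power on the temporary VM, isolated from production")
    if guest_boots:
        run.steps.append("take a screenshot via VMware Tools")
        run.passed = True
    else:
        run.steps.append("mark verification as failed")
    # Cleanup happens whether or not the guest booted.
    run.steps.append("discard the temporary VM and revert the backup files")
    return run


if __name__ == "__main__":
    for step in simulate_verification("web01", ["disk0"], guest_boots=True).steps:
        print(step)
```

The point worth noticing is that cleanup is the last step on both the success and failure paths, which is what leaves the backup files in a consistent state afterwards.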
How do we set this up?
As far as how to set up and configure Nakivo Backup & Replication as a whole, I would recommend checking out my previous review here. Focusing solely on Screenshot Verification, let’s go through the steps below… **Note: we are setting this up for one of our backup jobs, but we can also enable screenshot verification for our replication jobs.**
Screenshot verification, albeit with a lot of moving parts underneath, is actually a simple Enable/Disable feature within the backup job. Nakivo has done a great job of abstracting away all of the complicated technology underneath and presenting us with some simple, easy-to-use configurable options. On the last step of a job wizard, we see the Screenshot verification setting at the bottom of the first column (as shown below)…
Upon selecting ‘settings’ we are presented with some more options to configure. The target container is the location in which we will register the newly created VM that will be mounted to the backup files. This can be any vSphere object that VMs belong to, such as a host or cluster. Target datastore is where we would like to place the configuration files (vmx) of the VM that is created. Verification Options allows us to do things such as limit the number of VMs that will be verified simultaneously. Running too many Screenshot Verification tests at once can put a heavy load on your backup repositories, causing major delays in boot time depending on your hardware configuration, so it’s best to tune this to your environment. Also configurable here is the RTO, which in this case defines the number of minutes that the VM has to fully boot and initialize VMware Tools. If this time is exceeded, the VM is labeled as failed and the placeholder VM is discarded. We can also set the delay between when the guest OS has booted and when the screenshot is actually taken.
Honestly, this is all we need to do! Simply save your job, and on the next job run Screenshot verification should take place. As shown below, we can see the events that take place within vCenter during a Screenshot verification test, including the creation and deletion of the placeholder VM used to perform the test, along with any required iSCSI setup. This is all automated by Nakivo and requires no manual setup on your part.
So we have now seen that Screenshot verification has been executed, but what does it look like in one of the reports or emails? Right-clicking any job within Nakivo gives us the ability to run a few reports; the one we are most interested in now is the ‘Last run report’. After generating and opening the ‘Last run report’ for our job with screenshot verification enabled, we should see new information included in the report. As shown below, there is now a ‘Last verification’ row indicating whether the screenshot verification was successful. In addition, we can see the actual screenshot that was taken by Nakivo. Below we see the actual login screen, giving us a pretty good indication that if we were to restore from this backup we would be successful.
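To show how these three knobs interact, here is a rough sketch that models them as plain values. The class and field names (`VerificationSettings`, `max_simultaneous_vms`, `rto_minutes`, `screenshot_delay_seconds`) are my own labels for the options described above, not Nakivo’s actual setting names, and the fixed-size batching is a simplification of whatever scheduling the product really does.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class VerificationSettings:
    """Hypothetical container for the options discussed above."""
    max_simultaneous_vms: int      # how many VMs are verified at once
    rto_minutes: int               # time allowed to boot and start VMware Tools
    screenshot_delay_seconds: int  # wait after boot before the screenshot


def batches(vms: List[str], settings: VerificationSettings) -> Iterator[List[str]]:
    """Split the VM list so no batch exceeds the simultaneous limit."""
    size = settings.max_simultaneous_vms
    if size < 1:
        raise ValueError("max_simultaneous_vms must be at least 1")
    for start in range(0, len(vms), size):
        yield vms[start:start + size]


def within_rto(boot_minutes: float, settings: VerificationSettings) -> bool:
    """A VM fails verification only if it exceeds the RTO."""
    return boot_minutes <= settings.rto_minutes


if __name__ == "__main__":
    settings = VerificationSettings(
        max_simultaneous_vms=2, rto_minutes=5, screenshot_delay_seconds=30
    )
    for group in batches(["vm1", "vm2", "vm3"], settings):
        print(group)
```

Lowering the simultaneous limit means more batches but less load on the repository per batch, which is the trade-off to tune against your own hardware.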
Hey, Let’s have some fun!
As you can see, Screenshot verification is a very valuable tool, giving us that peace of mind that our backups are actually restorable. But where’s the fun in that, right? Let’s break some stuff and see how Screenshot verification reacts…
So, on my production VM, let’s mimic some corruption and see if we can’t get Nakivo to detect it before we do! (***NOTE*** Don’t do this in production. Please, please don’t do this in production.) In order to do this, I’ve run the following commands on my production VM within an administrative console:
takeown /F C:\Windows\System32\WinLoad.exe
cacls C:\Windows\System32\WinLoad.exe /G administrator:F
del C:\Windows\System32\WinLoad.exe
bcdedit /set recoveryenabled No
The first three lines are pretty self-explanatory: taking ownership, assigning rights, and deleting WinLoad.exe, the file that actually loads Windows upon boot. The last line simply disables automatic repair, Microsoft’s line of defense for preventing people from doing stupid things like this. Anyway, we’ve essentially botched our server here; however, we won’t notice until we reboot, something that probably doesn’t happen that frequently in a production environment. Thus, it’s probably going to go unnoticed for quite some time. That is, unless we are utilizing Nakivo’s screenshot verification on our backup jobs.
Let’s go ahead and run our backup job again on this same VM. This time, we will see Nakivo report a failure on the backup job, specifying that screenshot verification has failed. Upon further investigation, we can see below what appears on the console of the VM that was used for the verification, which is exactly what would happen to our production VM if we were to reboot it! Even though our newly created backup is not restorable, at least we now know, and it won’t be a surprise to us in an emergency situation like the previous scenario. This gives us time: time to come up with a plan, whether that be restoring from a known good backup, coming up with some sort of failover plan, or even building a new server.
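Once verification results are recorded per restore point, picking the “known good backup” becomes a simple lookup. The sketch below assumes you keep your own list of restore points with a pass/fail flag; `RestorePoint` and `last_known_good` are hypothetical helpers for illustration, and the dates are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class RestorePoint:
    """Hypothetical record: when the backup ran and whether it verified."""
    taken_on: date
    verified: bool


def last_known_good(points: List[RestorePoint]) -> Optional[RestorePoint]:
    """Return the newest restore point that passed verification, if any."""
    good = [p for p in points if p.verified]
    return max(good, key=lambda p: p.taken_on, default=None)


if __name__ == "__main__":
    history = [
        RestorePoint(date(2015, 3, 1), True),
        RestorePoint(date(2015, 3, 2), True),
        RestorePoint(date(2015, 3, 3), False),  # the run after WinLoad.exe was deleted
    ]
    print(last_known_good(history))
```

If every restore point has failed verification, the function returns `None`, which is the cue to fall back to a failover plan or a rebuild rather than a restore.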
So in the end, screenshot verification proves to be a very valuable tool in any backup administrator’s belt, whether that means knowing that your backups can be restored successfully or, sometimes even more important, knowing that they can’t. In some cases, Screenshot verification can even be leveraged to prevent production outages by giving you a preview of things to come upon the next reboot! The Flash VM Boot technology makes Screenshot verification a no-brainer in my opinion. If you are using Nakivo, you should be enabling this on all of your mission-critical VMs. To learn more about Screenshot verification and other Nakivo features, check out their help center here. Fancy trying it for yourself? You can get a full-featured trial here, or if you are a VMUG member, VCP, or vExpert, why not grab a free NFR license to tinker with! If those aren’t enough options for you, Nakivo also offers a fully featured free edition. Yes, all of the same features as their premium paid versions, just limited to a couple of VMs. Thanks for reading!