It’s no surprise to anyone that storage is growing at an incredible rate. Rich media, sensor devices, IoT – these are all driving up the amount of storage capacity that organizations need today, and it’s only going to get worse in the future! Organizations need somewhere to put this data: somewhere safe and protected, somewhere availability is key. For most, that somewhere ends up being the cloud! Public cloud services such as Amazon S3 give us access to oodles of storage on a pay-as-you-go basis, and they remove the burden of having to manage it. SLAs are agreed upon and our data is simply available when we need it! That said, public cloud may not be an option for a lot of companies – the businesses that simply can’t, or sometimes won’t, move to the cloud, yet still want the agility and availability that cloud provides. These organizations tend to move to an on-premises solution – SANs and storage crammed into their own data centers – but with that comes a whole new bucket of challenges around scaling and availability…
How do we scale a SAN?
Most storage out there today is designed in much the same way: a controller of sorts provides the network and compute resources to move our data in and out of a number of drives sitting behind it. But what if that controller goes down? Well, there goes access to everything behind it! To alleviate this we add more controllers and more disk. This is a pretty common storage solution today – two controllers, each hosting a number of shelves full of drives, with dual-path interconnects out to the rest of our data center. In this situation, if we lose a controller we don’t necessarily lose access to our data, but we most certainly lose half of the bandwidth into it. So we yet again add more controllers and more disk, and now, sitting at four controllers, the back of our racks and our interconnect infrastructure has become so complicated that we will most certainly struggle when the time comes to scale out even more.
So what is the perfect ratio of controller to disk, or CPU to disk? How do we minimize complexity while maximizing performance? And how do we accomplish all of this within our own data center? Lower ratios such as 1 CPU for every 8 disks introduce complexity in connectivity, while higher ratios such as 1 CPU for 60 disks create a huge fault domain. Is the answer somewhere in the middle? Igneous Systems has an answer that may surprise you!
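To put that trade-off in rough numbers, here’s a quick back-of-the-envelope sketch. The shelf size and drive capacity below are illustrative assumptions, not vendor figures:

```python
import math

# Back-of-the-envelope look at the controller-to-disk trade-off described above.
# Shelf size and drive capacity are illustrative assumptions, not vendor figures.
DRIVES_PER_SHELF = 60
DRIVE_CAPACITY_TB = 10

for drives_per_controller in (8, 20, 60):
    controllers = math.ceil(DRIVES_PER_SHELF / drives_per_controller)
    fault_domain_tb = drives_per_controller * DRIVE_CAPACITY_TB
    # More controllers means more cabling and interconnect complexity;
    # fewer controllers means more capacity lost when one of them dies.
    print(f"1 CPU per {drives_per_controller:>2} drives: "
          f"{controllers:>2} controllers to cable up, "
          f"{fault_domain_tb:>3} TB behind a single controller failure")
```

Either way you slide that ratio, you’re trading cabling complexity against the size of the fault domain.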
Disclaimer: As a Tech Field Day 12 delegate, all of my flight, travel, accommodations, eats, and drinks are paid for. However, I did not receive any compensation, nor am I required to write anything in regards to the event or the presenting companies. All that said, this is done at my own discretion.
RatioPerfect – 1:1 – Compute : Disk
Igneous presented at Tech Field Day 12 in November, showcasing their managed on-premises cloud solution. It looks much like a traditional JBOD – a 4U box containing 60 drives – but under the hood things are certainly different. Igneous calls it their RatioPerfect architecture, and it takes a 1:1 approach to CPU to disk. Throwing out expensive Xeon CPUs and the controller methodology, RatioPerfect is essentially an army of nano servers, each equipped with its own ARM CPU, memory, and networking, attached directly to each and every disk – essentially giving each disk its own controller!
These “server drives” are then crammed inside a JBOD – however, instead of dual SAS controllers within the enclosure, they are fronted by dual Ethernet switches. Each nano server has two addressable MACs and two paths out to your infrastructure over 10GbE uplinks. You can almost picture this as a rack of infrastructure condensed down into a 4U unit: 60 network-addressable server/storage devices sitting inside of it, with 60 individual fault domains. Don’t worry – it’s IPv6, so there’s no need to free up 120 addresses.
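Conceptually, you can picture the enclosure as a list of 60 independently addressable endpoints, each with two paths out. Here’s a toy model of that idea in Python – the addresses, MACs, and data shapes are hypothetical placeholders, not Igneous’s actual scheme:

```python
from dataclasses import dataclass
from ipaddress import IPv6Address

@dataclass
class NanoServer:
    """One drive plus its dedicated ARM controller: its own fault domain."""
    slot: int
    mac_a: str          # path out through switch A
    mac_b: str          # path out through switch B
    addr: IPv6Address   # each nano server is individually reachable over IPv6

# A 4U enclosure modelled as 60 independent server/storage endpoints.
# The address prefix and MAC values are placeholders for illustration only.
enclosure = [
    NanoServer(
        slot=i,
        mac_a=f"02:00:00:00:a0:{i:02x}",
        mac_b=f"02:00:00:00:b0:{i:02x}",
        addr=IPv6Address(f"fd00::{i:x}"),
    )
    for i in range(60)
]

# Losing one nano server takes out exactly one drive's worth of capacity.
print(len(enclosure), "independent fault domains in a single 4U chassis")
```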
Why the need?
To your everyday storage administrator working in a data center, 60 fault domains might seem a little excessive, right? The thing is, Igneous is not something that’s managed by your everyday storage administrator – in fact, the “human” element is something Igneous would love to eliminate entirely. Igneous set out to provide the benefits of public cloud, on premises, complete with flexible pricing and S3-compatible APIs. The sheer nature of public cloud is that we don’t have to manage it – it’s simply a service, right? The same goes for Igneous: all management, including installation, configuration, troubleshooting, and upgrades, is handled centrally by Igneous. You simply consume the storage – when you need more, you call, and another shelf shows up!
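Because the service speaks the S3 API, existing S3 tooling should only need an endpoint change to target it. A hedged sketch using boto3 – the endpoint URL, bucket name, and credentials below are placeholders, not real Igneous values:

```python
import boto3

# Point a standard S3 client at an on-premises, S3-compatible endpoint.
# The endpoint URL, bucket, and credentials are placeholders for illustration.
s3 = boto3.client(
    "s3",
    endpoint_url="https://igneous.example.internal",
    aws_access_key_id="PLACEHOLDER_KEY",
    aws_secret_access_key="PLACEHOLDER_SECRET",
)

# The same calls you'd make against public cloud object storage.
s3.put_object(Bucket="backups", Key="demo/hello.txt",
              Body=b"hello from on-prem object storage")
for obj in s3.list_objects_v2(Bucket="backups").get("Contents", []):
    print(obj["Key"], obj["Size"])
```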
The design of Igneous’s management plane is key to their success. With the “fleet” model in mind, Igneous built a management plane that proactively monitors every system they have deployed, comparing and contrasting events and metrics to detect possible failure scenarios and relying heavily on automation to fix these issues before they are, indeed, issues. That said, no matter the amount of predictive analysis and automation, the time will come when drives physically fail – and the nano server design, coupled with Igneous’s custom-built data path, allows a single Igneous box to sustain up to 8 concurrent drive failures without affecting performance, certainly buying them enough time to react to the situation. The on-premises management plane is simply a group of microservices running on commodity x86 servers, meaning software refreshes and upgrades are a breeze, and non-disruptive at that. It’s this design and architecture that allows Igneous to move fast and implement rapid code changes, just as we would see within a cloud environment.
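Purely as a sketch of that fleet-monitoring idea, a proactive check might conceptually look something like this. None of these names, thresholds, or data shapes come from Igneous – only the 8-drive figure quoted above:

```python
# Conceptual sketch of a fleet-style health check: compare a box's failed-drive
# count against the stated tolerance and flag it for proactive service.
# The margin, function names, and fleet data are illustrative assumptions.

MAX_CONCURRENT_DRIVE_FAILURES = 8  # figure quoted in the presentation

def needs_attention(failed_drives: int, margin: int = 2) -> bool:
    """Flag a chassis well before it exhausts its failure tolerance."""
    return failed_drives >= MAX_CONCURRENT_DRIVE_FAILURES - margin

fleet = {"chassis-01": 0, "chassis-02": 3, "chassis-03": 7}
for chassis, failed in fleet.items():
    status = "schedule service" if needs_attention(failed) else "healthy"
    print(f"{chassis}: {failed} failed drives -> {status}")
```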
In the end, Igneous certainly does contain an army of ARM processors working to bring the benefits and agility of public cloud to those who simply can’t move their data to the cloud due to volume, or won’t due to security reasons. Yeah, it’s a hardware appliance, but you don’t manage it – in fact, you don’t even buy it. Just as we “rent” cloud, the Igneous service is a true operating expense – no capital costs whatsoever. It’s funny: they sell a service, essentially software and storage that you consume, but it’s the hardware that left the lasting impression on me – it’s not too often hardware steals the show at a Tech Field Day event. If you are interested in learning more, certainly take a look at their Tech Field Day videos – they cover all of this and A LOT more! Thanks for reading!