ORIGINALLY POSTED 31st October 2008
11,476 views on developerworks
I’ve posted a long reply to clarify a few things over on the Anarchist’s post, which is a masterpiece of missing the point, misdirection and misinterpretation, purely aimed at trying to inject further FUD into accepted benchmarks performed by the Storage Performance Council.
Let’s start with Quicksilver, and our recent 1 Million IOPs statements. Quicksilver was a technology statement – not a science experiment as BarryB would lead you to think. The entire point of this was to show how a scale-out clustered architecture – the foundation of the SVC methodology – is much better suited to achieving the ultimate benefits that SSD storage technologies can provide. IBM has multiple storage architectures, to suit different users’ needs. The DS3000, DS4000, DS5000, DS6000 and DS8000 all provide a scale-up architecture where a single pair of controller units attaches to one or more expansion units and provides a single instance from which to manage and provision. Once you reach the performance and scaling limits of such a scale-up architecture, you need to buy another control unit pair, and you now have two boxes to manage, have to decide where to provision from, and so on. In a large enterprise this means many tens of control units to understand, maintain and manage.
IBM also offers scale-out architectures that can help to alleviate these problems: SVC and XIV. Here you are not limited to a pair of control units; you can scale out as your needs dictate, and still maintain a single-instance management and maintenance interface. In such an architecture, the performance and scalability of the system grow in a linear fashion as you add more “nodes”.
Quicksilver was a “MADE IN IBM LABS” joint effort with Research and Development, building on the commodity storage model introduced by SVC, and now XIV. This was all made quite clear, as was how and what we benchmarked. I think Mr Burke needs to re-read my very detailed report back in August. There was not enough space in the press release, nor is it the correct place to discuss technical details, hence why my blog was used to fill in the details behind what and how we benchmarked the project, and how it was built.
It was not necessary to go into the kind of details that BarryB has been misinformed about, as this is not a shipping product and does not necessarily reflect any end product we offer to the field. It was, as I say, a demonstration that scale-out fits better than scale-up when moving to very large IOPs single-cluster systems in the future. Maybe that’s why he’s attacking so much. Does EMC have such an architecture or solution? No.
At StorageExpo I spoke with a few of our mutual blog followers and several of them thought that our banter was great, and that more often than not it actually showed EMC in a bad light. “A company that has to resort to attacking the competition all the time is running scared” – this was a quote from one reader – not from me.
Anyway, on to BarryB’s 5 misdirection points, none of which really matter since this isn’t something we are selling to customers today.
Yes we used more than 8 nodes. This was something we did not want to dwell on, as it’s not something that has been tested to the levels that we would be happy with to enable field support. As our node hardware doubles in performance every 18 months or so, the size of the cluster is less relevant – remember we aren’t stuck with 3-year ASIC and custom hardware development cycles. We did build a cluster of 14 nodes, 2 of which were being used by our Database team for some database benchmarking, 10 of which were needed to reach the 1M IOPs, and as we wanted to do some advanced function testing we actually had 12 running at the time of the 1.1M number. So with 10 nodes active, that is more like 110K IOPs per node – not bad for a 1U control unit. Probably better than an entire DMX system in fact (speculation on my part – but without any official numbers from EMC, this is based on testing and customer provided information).
This was a single cluster however, so managed through one interface and system image. I think Mr Burke (or his informant) has confused our “I/O Group” – node pair – with cluster. A node pair is a caching domain within the cluster, but as we were not using caching for this benchmark this again is not relevant.
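For anyone who wants to check the per-node figure, here is the division as a minimal Python sketch. It uses only the numbers quoted in this post; the variable and function names are mine, purely for illustration, and it assumes the load was spread evenly across the active nodes.

```python
# Figures quoted in the post; all names here are illustrative only.
TOTAL_IOPS = 1_100_000       # the 1.1M result
NODES_BUILT = 14             # nodes in the cluster as built
NODES_ON_DATABASE_WORK = 2   # in use by the database team
NODES_NEEDED = 10            # nodes needed to reach the 1M IOPs


def iops_per_node(total_iops: int, nodes: int) -> float:
    """Average IOPs contributed per node, assuming an even spread."""
    return total_iops / nodes


nodes_running = NODES_BUILT - NODES_ON_DATABASE_WORK
per_node = iops_per_node(TOTAL_IOPS, NODES_NEEDED)

print(f"{nodes_running} nodes running, "
      f"{per_node:,.0f} IOPs per node across {NODES_NEEDED} nodes")
```

That gives 12 nodes running and 110,000 IOPs per node across the 10 that were needed.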
The number of backend storage units is not relevant either. Again this is not GA hardware; this was using 1U and 2U System X hardware, which has a limited number of PCIe slots. If we’d used a 4U server this number would be a lot less; however, our code is already configured to understand the 1U and 2U platforms, so from a minimal code development point of view it was easier to use what already worked. I don’t see this as relevant to the discussion.
Performance of SSDs varies based on the workload. An HDD can do roughly the same number of read and write IOPs and is less dependent on block size; this is not true for SSDs. Most vendors quote high figures which are read only. Writes are the Achilles heel of flash, and despite clever re-allocation, LSA, write avoidance, etc. there is a fundamental impact to writes when you have to erase blocks. In a sustained workload with any mixture of writes, there is an impact on the overall sustainable write rate. Some vendors’ products actually suffer even more when you mix reads and writes. EMC for example quote 30x the performance of enterprise HDDs for their EFDs. That would be 30 x 300 IOPs then, ~9K IOPs. So here the FusionIO ioDrive is doing more than 2.5x that. Nothing too shabby then. Look at Intel’s recent SSD: 35K read IOPs, 3K write IOPs…
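To put rough numbers on this, here is a small Python sketch of the arithmetic. The 30x, 300, 2.5x, 35K and 3K figures are the ones quoted above. The blending formula is a simplifying assumption of mine, not a measured result: it treats each I/O as being serviced at the device’s pure-read or pure-write rate with no interaction between the two, which is, if anything, kind to devices that degrade when reads and writes are mixed.

```python
def blended_iops(read_iops: float, write_iops: float, read_fraction: float) -> float:
    """Time-weighted (harmonic) blend of pure-read and pure-write rates.

    Assumption: reads and writes are serviced at their pure rates and do
    not interfere with each other. Real devices can do worse than this.
    """
    write_fraction = 1.0 - read_fraction
    seconds_per_io = read_fraction / read_iops + write_fraction / write_iops
    return 1.0 / seconds_per_io


HDD_IOPS = 300        # enterprise HDD figure used in the post
EFD_MULTIPLIER = 30   # the quoted "30x" claim

efd_iops = EFD_MULTIPLIER * HDD_IOPS   # 9000
iodrive_floor = 2.5 * efd_iops         # "more than 2.5x that"

# Intel figures quoted in the post, blended at a 70/30 read/write mix.
intel_mixed = blended_iops(35_000, 3_000, read_fraction=0.70)

print(efd_iops, iodrive_floor, round(intel_mixed))
```

Under that assumption, a device rated at 35K read and 3K write IOPs lands at roughly 8.3K IOPs on a 70/30 mix, which is why read-only headline figures say so little about a mixed workload.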
Workload. I thought I’d covered this pretty darn clearly, so this is just FUD. To reiterate: this was a generated 70/30 mixed workload, so 70% read, 30% write, with a block size of 4K. This is industry accepted as a pseudo-typical database workload – and has been for many years. This is often quoted as 70/30/50, the 50 meaning 50% cache hit. In our case, this was all miss, no cache hits, totally random workload. So 70/30/0.
Just to be clear, since BarryB is asking, the tool used is an internal exerciser that has existed in IBM for longer than I have. It’s a simple C based program that allows you to control all the necessary parameters and is multi-threaded per device being exercised, so multiple streams of I/O per device. If I recall this was 16 per device. The host systems were IBM Power5 p590 servers. We have two of them in the lab, each split into 4 identical LPARs, with 8 x 4Gbit fibre ports per LPAR. For the benchmark we ran 5 LPARs in total, so 40 x 4Gbit host ports. Again, I’m not sure what relevance this has.
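To make the 70/30/0 description concrete, here is a minimal Python sketch of what such an I/O pattern looks like. This is not the internal exerciser (which is a multi-threaded C program issuing real I/O); it only generates the sequence of operations. The function name, the seed, the device size and the reading of “4K” as 4096 bytes are all assumptions of mine for illustration.

```python
import random

BLOCK_SIZE = 4096  # "4K" taken to mean 4 KiB; an assumption


def mixed_workload(num_ios, device_blocks, read_fraction=0.70, seed=0):
    """Yield (op, byte_offset) pairs for a random read/write mix.

    Offsets are uniformly random and block aligned, so a cache in front
    of a large device would see effectively no hits (the "0" in 70/30/0).
    """
    rng = random.Random(seed)
    for _ in range(num_ios):
        op = "read" if rng.random() < read_fraction else "write"
        offset = rng.randrange(device_blocks) * BLOCK_SIZE
        yield op, offset


ios = list(mixed_workload(num_ios=100_000, device_blocks=1_000_000))
reads = sum(1 for op, _ in ios if op == "read")
print(f"read share: {reads / len(ios):.3f}")

# Host side as described in the post: 5 LPARs with 8 ports each.
host_ports = 5 * 8
print(f"host ports: {host_ports}")
```

A real exerciser would run several such streams per device (16, if I recall correctly) and issue the I/O directly against the devices rather than just listing offsets.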
One of the reasons we’d not ship this exact config is that it has no RAID type protection – as benchmarked. I’d not recommend such a RAID-0 setup to anyone, unless they didn’t care about their data. Now that we have Vdisk Mirroring, we could provide RAID-10, however at the time of this benchmark, the VDM function was still in test and so was an unnecessary risk to add to the project. Since this is not shipping, again I’d not see this as a relevant point, remember what we were demonstrating….
Scale-up vs Scale-out. From that point of view, I’d say we did a realistic benchmark – full disclosure of the nature of the 1.1M (70/30/0 4K) and showed that the system was capable of maintaining this degree of IOPs while also keeping response time under 700 microseconds.
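As a sanity check on how those two figures relate, Little’s Law (average requests in flight = arrival rate x time in the system) ties the IOPs and the response time together. The sketch below is my own back-of-envelope arithmetic using only the 1.1M and 700 microsecond numbers from this post; the function name is illustrative.

```python
def outstanding_ios(iops: int, response_time_us: int) -> float:
    """Little's Law: average I/Os in flight = arrival rate x time in system."""
    return iops * response_time_us / 1_000_000


in_flight = outstanding_ios(1_100_000, 700)
print(in_flight)
```

Since the response time stayed under 700 microseconds, this works out to at most about 770 I/Os in flight on average across the whole cluster at the 1.1M rate.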
How many Scale-up control unit pairs would you need in a typical system to maintain the same IOPs and response time? I don’t know for sure, but it would be a large number based on the vague performance data in BarryB’s own EFD presentation. Say 10 for the sake of argument (being kind) – that would be 10x full frame controller units, vs 10x 1U controller units. Not only is this much more scalable, but look at the power usage, floor space, and cooling…
The final point I’d make reiterates fellow IBM blogger Tony Pearson: why reinvent a board of already participating vendors (the SPC) and start again? There is no limit to the number of SPC-X benchmarks that could be generated, so come on, join the rest of the industry, and if you feel a new benchmark is needed I’m sure this can be done. As for dotconnector, well, at the time I did wonder, but now you have confirmed that this was nothing more than an anti-marketing ploy against the SPC. So it went dark about a year ago; that’s not any of the participants’ fault. If the blog owner doesn’t post then we can’t reply or join in. A blog is not the correct forum for such an initiative, maybe a wiki at a push, so join back in with the already active SPC. Starting another initiative would be like Ferrari following through on their threats to leave Formula 1: they would likely start their own “Ferrari Won” championship where they’d be guaranteed to win…