Colleague and friend Sebastian Hassinger sent me Jeff Dean's presentation Designs, Lessons and Advice from Building Large Distributed Systems. The presentation is fascinating in quite a few ways, not least of which are the (implied) statements it makes about requirements for business service management at large scale. For example, here is an excerpt from the slide entitled The Joys of Real Hardware:
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packetloss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vips for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for dns
~1000 individual machine failures
~thousands of hard drive failures, slow disks, bad memory, misconfigured machines, flaky machines, etc.
The bullets listed above resonate with my Agile Business Service Management thinking. They can simply be thought of as the reality underlying BSM at scale. The scale and scope of operating on top of such environments necessitate new techniques in BSM. For example, Jeff discusses Protocol Buffers as one such technique used by Google to attain the requisite efficiencies. Likewise, treating infrastructure as code is - as we say in chess - a practically forced variant. In both cases, the traditional wall between development and operations is moot.
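To put the first-year figures from the slide in perspective, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: it treats each "~N" count from the excerpt as exactly N and assumes events are spread evenly over a 365-day year, neither of which the presentation claims. The names `EVENTS_PER_YEAR`, `mean_hours_between` and `summarize` are made up for this example.

```python
# Approximate first-year event counts, copied from the slide excerpt above.
# Treating each "~N" as exactly N and spreading events evenly over
# 365 days are simplifying assumptions for illustration only.
EVENTS_PER_YEAR = {
    "rack failures": 20,
    "router reloads": 12,
    "network maintenances": 8,
    "router failures": 3,
    "individual machine failures": 1000,
}

HOURS_PER_YEAR = 365 * 24


def mean_hours_between(events_per_year: int) -> float:
    """Average gap in hours between events, assuming an even spread."""
    return HOURS_PER_YEAR / events_per_year


def summarize(events: dict[str, int]) -> dict[str, float]:
    """Map each event type to its mean gap in hours, rounded to 0.1."""
    return {name: round(mean_hours_between(n), 1) for name, n in events.items()}


if __name__ == "__main__":
    for name, hours in summarize(EVENTS_PER_YEAR).items():
        print(f"{name}: one roughly every {hours} hours")
```

Under these assumptions, an individual machine fails about every nine hours and a whole rack about every 18 days, which is why failure handling at this scale has to be routine rather than exceptional.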
And people wonder why I still like proper reliable machines and operating systems - in other words mainframes. Sorry, I am never going to be convinced that WINDOZE is a production operating system.
Wow - what an incredible list of hardware support or management "issues" that any provider of infrastructure services (...cloud anyone?) must expect to address during the first year! (I wonder what the list looks like for cluster environments that have become "legacy"?) This reinforces to me that service management is definitely a factor that, if not adequately addressed, could push cloud computing from hype into the "trough of disillusionment."
From the business perspective, cloud computing speeds the provisioning of infrastructure services into commodity status (...with fantastic results), thereby moving the management issues from the IT datacenter to the cloud service provider. Such service management issues don't disappear; they are just outsourced by the business unit, usually bypassing IT. The hope and expectation is that the cloud service provider will get the management of their infrastructure components right.
While I fully understand where Peter is coming from, I would like to highlight the risk often associated with mainframes. Even for code that has been developed since 2003, the level of unit test coverage is usually very low. Moreover, it is extremely rare to find a CIO who has carried out an honest-to-goodness risk assessment on legacy code.
This state of affairs creates a fascinating phenomenon: the running of the code is frequently outsourced to a vendor who is exceptionally good operationally. The vendor's operational excellence compensates for the deficits and defects in the code.
Israel
With respect to Bill's point "The hope and expectation is that the cloud service provider will get the management of their infrastructure components right" - economies of scale are clearly on the side of the service provider. See the post Internet-Scale BSM http://www.bsmreview.com/blog/2009/11/internet-scale-bsm.htm or contact BSM Review's Annie Shum http://www.bsmreview.com/experts.shtml#shum who is an expert on the subject.
Israel