The Joys of Real Hardware


Colleague and friend Sebastian Hassinger sent me Jeff Dean's presentation "Designs, Lessons and Advice from Building Large Distributed Systems". The presentation is fascinating in quite a few ways, not least of which is the (implied) statement it makes about the requirements for business service management at large scale. For example, here is an excerpt from the slide entitled "The Joys of Real Hardware":

Typical first year for a new cluster:

 

~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)

~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)

~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)

~1 network rewiring (rolling ~5% of machines down over 2-day span)

~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)

~5 racks go wonky (40-80 machines see 50% packetloss)

~8 network maintenances (4 might cause ~30-minute random connectivity losses)

~12 router reloads (takes out DNS and external vips for a couple minutes)

~3 router failures (have to immediately pull traffic for an hour)

~dozens of minor 30-second blips for dns

~1000 individual machine failures

~thousands of hard drive failures, slow disks, bad memory, misconfigured machines, flaky machines, etc.
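To get a feel for what these numbers add up to, here is a rough back-of-the-envelope sketch. It uses only the three events for which the slide states both a machine count and a recovery time, and it takes the midpoint of each range. The midpoints, the `Event` class and the `total_machine_hours` helper are my assumptions for illustration; they do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str
    per_year: float  # occurrences in the first year (from the slide)
    machines: float  # machines affected per occurrence (ASSUMED midpoint of the slide's range)
    hours: float     # hours to recover per occurrence (ASSUMED midpoint of the slide's range)

    def machine_hours(self) -> float:
        # Machine-hours lost per year to this kind of event.
        return self.per_year * self.machines * self.hours


# Only events where the slide gives both a machine count and a duration.
EVENTS = [
    Event("PDU failure", 1, 750, 6),       # ~500-1000 machines, ~6 hours
    Event("rack move", 1, 750, 6),         # ~500-1000 machines, ~6 hours
    Event("rack failure", 20, 60, 3.5),    # 40-80 machines, 1-6 hours
]


def total_machine_hours(events) -> float:
    # Sum of machine-hours lost across all listed event types.
    return sum(e.machine_hours() for e in events)


if __name__ == "__main__":
    for e in EVENTS:
        print(f"{e.name}: {e.machine_hours():.0f} machine-hours")
    print(f"total: {total_machine_hours(EVENTS):.0f} machine-hours")
```

Under these assumptions the three event types alone cost roughly 13,000 machine-hours in the first year. The figure leaves out the overheating event, the network problems and the individual machine failures, for which the slide does not give enough detail to estimate.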

The bullets listed above resonate with my Agile Business Service Management thinking. They can simply be thought of as the reality underlying BSM at scale. The scale and scope of operating on top of such environments necessitate new techniques in BSM. For example, Jeff discusses Protocol Buffers as one such technique used by Google to attain the requisite efficiencies. Likewise, treating infrastructure as code is - as we say in chess - a practically forced variation. In both cases, the traditional wall between development and operations is moot.


4 Comments

And people wonder why I still like proper reliable machines and operating systems - in other words mainframes. Sorry, I am never going to be convinced that WINDOZE is a production operating system.

Wow - what an incredible list of hardware support or management "issues" that any provider of infrastructure services (...cloud anyone?) must expect to address during the first year! (I wonder what the list looks like for cluster environments that have become "legacy"?) This reinforces to me that service management is definitely a factor that could tip the hype around cloud computing into the "trough of disillusionment" if not adequately addressed.

From the business perspective, cloud computing speeds the provisioning of infrastructure services into commodity status (...with fantastic results), thereby moving the management issues from the IT datacenter to the cloud service provider. Such service management issues don't disappear, they are just outsourced by the business unit, usually bypassing IT. The hope and expectation is that the cloud service provider will get the management of their infrastructure components right.


About this Entry

This page contains a single entry by Israel Gat published on November 18, 2009 6:15 AM.
