Colleague and friend Sebastian Hassinger sent me Jeff Dean's presentation Designs, Lessons and Advice from Building Large Distributed Systems. The presentation is fascinating in quite a few ways, not least of which are the (implied) statements it makes about the requirements for business service management at large scale. For example, here is an excerpt from the slide entitled The Joys of Real Hardware:
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packetloss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vips for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for dns
~1000 individual machine failures
~thousands of hard drive failures, slow disks, bad memory, misconfigured machines, flaky machines, etc.
The bullets listed above resonate with my Agile Business Service Management thinking. They can simply be thought of as the reality underlying BSM at scale. The scale and scope of operating on top of such environments necessitate new techniques in BSM. For example, Jeff discusses Protocol Buffers as one such technique used by Google to attain the requisite efficiencies. Likewise, treating infrastructure as code is, as we say in chess, a practically forced variation. In both cases, the traditional wall between development and operations is moot. Two quick sketches below illustrate what I mean by each.
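To make the Protocol Buffers point a little more concrete, here is a minimal sketch of a message definition in the proto2 syntax of that era. The message and field names are my own illustration rather than anything from Jeff's slides; the point is that every field carries a compact numeric tag, so data serializes to a small binary wire format instead of verbose text, and older binaries can safely skip tags they do not recognize as the schema evolves.

// Hypothetical status report a machine might send to a monitoring
// service; illustration only, in proto2 syntax.
message MachineStatus {
  required string hostname = 1;       // which machine is reporting
  optional bool reachable = 2;        // did the last health probe succeed?
  optional float packet_loss = 3;     // fraction of packets dropped, 0.0-1.0
  repeated string recent_errors = 4;  // recent error strings, if any
}

As for infrastructure as code, the Python below is a deliberately naive sketch of the core idea, desired state declared as version-controlled data plus a convergence loop. It is not how Google or any particular tool (cfengine, Puppet, Chef) actually does it, and every name in it is hypothetical.

"""Naive infrastructure-as-code sketch: declare desired state as data,
then converge reality toward it. All names here are hypothetical."""

import subprocess

# The desired state lives in version control and is reviewed,
# tested, and released exactly like application code.
DESIRED_SERVICES = {
    "ntpd": "running",
    "monitoring-agent": "running",
}

def is_running(service: str) -> bool:
    # Placeholder probe; a real tool would query the init system.
    return subprocess.call(["pgrep", "-x", service],
                           stdout=subprocess.DEVNULL) == 0

def start(service: str) -> None:
    # Placeholder action; a real tool would be idempotent and audited.
    subprocess.call(["service", service, "start"])

def converge() -> None:
    for service, desired in DESIRED_SERVICES.items():
        if desired == "running" and not is_running(service):
            start(service)

if __name__ == "__main__":
    converge()

At the scale of the slide quoted above, the payoff is that "fix the machine" turns into "fix the declaration and let every machine converge", which is exactly where the wall between development and operations stops making sense.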