Thursday, November 08, 2018

[Links of the Day] 08/11/2018 : Large scale study of datacenter network reliability, What to measure in production, Failure Mode Effects Analysis

  • A Large Scale Study of Data Center Network Reliability : the authors study reliability within and between Facebook datacenters. One of the key findings is that growth in the complexity, heterogeneity and interconnectedness of datacenters increases the rate of unwanted behaviours. Moreover, this appears to be a key potential limiting factor for world-spanning infrastructure undergoing rapid organic growth.
  • Understanding Production: What can you measure? : what you need to monitor and measure in production. A very good summary of the many blog posts out there on the topic.
  • Failure Mode Effects Analysis (FMEA) : once you reach a certain production scale, more stringent requirements kick in (unless you were unlucky enough to have them from the get-go). At that point you might want to run a failure modes and effects analysis (FMEA), a step-by-step approach for identifying all possible failures in a design, a manufacturing or assembly process, or a product or service. While it was mainly designed to address shortcomings in the manufacturing industry, it is still extremely useful for IT system analysis, especially when you want to prepare before rolling out a Chaos Monkey-like system.
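
To make the FMEA step-by-step idea a bit more concrete, here is a minimal sketch of how failure modes are often ranked. It assumes the common convention of scoring each failure mode from 1 to 10 on severity, occurrence and detection, then multiplying them into a Risk Priority Number (RPN). The linked article may use a different scheme, and the `FailureMode` class, the `rank` helper, the components and all the scores below are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One row of a hypothetical FMEA worksheet (illustrative only)."""
    component: str
    failure: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (always detected) to 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        # Risk Priority Number: a higher value means "look at this first".
        return self.severity * self.occurrence * self.detection


def rank(modes):
    """Return the failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Made-up scores for an imaginary service, not taken from any real system.
modes = [
    FailureMode("load balancer", "health check flaps", 6, 4, 3),
    FailureMode("database", "replica lag exceeds SLA", 8, 3, 5),
    FailureMode("message queue", "consumer stalls silently", 7, 2, 9),
]

for m in rank(modes):
    print(f"{m.rpn:4d}  {m.component}: {m.failure}")
```

In this made-up example the silent consumer stall ends up at the top of the list even though it is the least frequent failure, simply because it is so hard to detect. That kind of ranking can help decide which failures to inject first when rolling out a Chaos Monkey-like system.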