Wednesday, April 13, 2016

[Links of the day] 13/04/2016: distributed system debugging tool, SRE book notes, resilient ad serving at scale

  • Shiviz: a tool for debugging distributed systems at scale by visualizing the logs generated throughout the cluster [repo]
  • Notes on Google's Site Reliability Engineering book: these provide a good overview of the book's content, with enough detail to research each aspect or chapter independently.
  • Resilient ad serving at Twitter-scale: it's interesting to see that there is a correlation between query latency and revenue for ad serving. This stems from the fact that the latency of answering an ad query depends on the number of participants in the auction, and the more participants there are, the higher the revenue. However, the higher the latency, the greater the risk of timing out and hence losing revenue. Twitter uses an adaptive system to maximise revenue while maintaining resiliency (availability), scalability, and resource utilization.
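
To make the latency/revenue trade-off concrete, here is a minimal sketch of one way an adaptive limit on auction participants could work. It is not Twitter's actual mechanism: the `CandidateLimiter` class, its parameters, and the additive-increase/multiplicative-decrease policy are all assumptions chosen for illustration. The idea is to admit more participants while queries finish within the latency budget, and to cut back sharply when a query runs over it.

```python
from dataclasses import dataclass


@dataclass
class CandidateLimiter:
    """Hypothetical controller for the number of auction participants.

    Assumption: more participants means more revenue but higher latency,
    so the limit grows slowly while queries are fast and shrinks quickly
    once a query exceeds the latency budget.
    """

    deadline_ms: float      # assumed per-query latency budget
    limit: int = 100        # current number of participants allowed
    min_limit: int = 10     # never shrink the auction below this
    max_limit: int = 1000   # never grow the auction above this
    step: int = 10          # additive increase after a fast query
    backoff: float = 0.5    # multiplicative decrease after a slow query

    def observe(self, latency_ms: float) -> int:
        """Update the limit from one observed query latency and return it."""
        if latency_ms > self.deadline_ms:
            # Over budget: back off to reduce the risk of timing out.
            self.limit = max(self.min_limit, int(self.limit * self.backoff))
        else:
            # Within budget: admit a few more participants.
            self.limit = min(self.max_limit, self.limit + self.step)
        return self.limit


limiter = CandidateLimiter(deadline_ms=200.0)
for latency in (120.0, 250.0, 150.0):
    print(latency, limiter.observe(latency))
```

With the default values above, the limit goes from 100 to 110 after the fast query, drops to 55 after the slow one, and recovers to 65 on the next fast one. A real system would presumably react to latency percentiles and load rather than single samples.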