Tuesday, September 30, 2014

Links of the day 30 - 09 - 2014

Today's links 30/09/2014: bitcoin, strip, DB performance and distributed systems

Monday, September 29, 2014

Links of the day 29 - 09 - 2014

Today's links 29/09/2014 : #mesos , cloud, performance , optimization , virtual cpu, root cause analysis

Friday, September 26, 2014

Links of the day 26 - 09 - 2014

Today's links 26/09/2014 : video stream recognition HW,startup lecture

Thursday, September 25, 2014

There is no unicorn in your BigData

Recently, companies have started to invest heavily in data science solutions. These analytics solutions go by many names (BigData, Machine Learning, Deep Learning, Business Intelligence, etc.), and there are many misconceptions about them. This series of blog posts will explore the issues and pitfalls surrounding these technologies and offer cautionary advice on how best to avoid them. In this first post, I look at why hoping to find the “next big idea” in your data is a pointless exercise, and why you should instead use the information you extract to pursue operational excellence and market dominance.

Why is there no Unicorn in your BigData ?

You won't be able to find the next big thing when you mine your own data because what you are basically doing is building a highly efficient expanding system to extract all value from data of the infinite continuum within a finite domain [1].

Ok, so what does this really mean? Let’s break down the key elements of this statement: (i) Finite Domain, (ii) Infinite Continuum and (iii) Expanding System.
  • (i) "Finite Domain": as a company, you are exploring data generated within a finite market bounded by physical and economic limitations. Simply put, there is a maximum value that can be extracted from the industry or marketplace ecosystem in which you operate, and the information you extract comes from this same domain.
  • (ii) "Infinite Continuum": within this finite ecosystem exists an infinite continuum, which translates into an infinite number of product variations needed to completely cover the potential consumer domains. The analogy here is with Cantor's diagonal argument.
  • (iii) "Expanding System": to deliver the optimal product and capture the whole market, a company would need to adjust an infinity of small details. Perfection is not achievable, but you can keep moving towards it: finite software code cannot capture infinite complexity, since doing so would require running it indefinitely. The analogy here is with Turing's halting problem.
Startups in their initial phases tend to excel at product discovery and refinement by iterating extremely fast. They are nimble and less rigid than their older, more structured siblings. Established companies can emulate this approach by leveraging "BigData" and other machine learning techniques to move towards the same ideal, albeit at a different pace but with less risk, effectively compensating for their lack of agility with larger data sets to tap into.

However, leveraging this data won’t magically enable them to break away from pre-existing market limitations, as they are trapped by the very premise they started from. Either they already have the "unicorn" idea, in which case analytics, operational excellence, and sheer luck will help them iterate as fast as possible towards explosive growth; or they are already in an established ecosystem, and to develop novel business models or innovative products they will need to venture far into the unknown. Data analytics will only help after they make this jump: unfortunately, it will not help them recognize, and even less validate, true but unprovable novel ideas when they encounter them.

Enterprises looking to leverage data science solutions should take care to understand the true benefit they can extract from them, and avoid the false hope that these solutions will miraculously lead them to the unicorn pasture.

In the next post, we will look at why such technologies must still be embraced sooner rather than later.

[1] Inspired by the excellent post: Gödel Incompleteness For Startups , Max Skibinsky - January 2013

Links of the day 25 - 09 - 2014

Today's links 25/09/2014: #NoSQL , #BigData , #performance , #startup , system analysis
  • Memory System Characterization of Big Data Workloads : characterization of the memory access patterns of various Hadoop and noSQL #bigdata workloads - 2013 IEEE big data conference.
  • NoSQL benchmarks : virtualization can add 20% to 200% overhead for #nosql workloads. There is a good amount of room for hypervisor improvement there.
  • StartUp lecture : annotated note, video and audio of "How to Start a Startup Lecture 1 " by Sam Altman - the President of Y Combinator.
  • SysDig : a very nice tool for doing system-level exploration: capture state / activity from a running Linux instance then save, filter, analyze. While it is only for post-mortem analysis this would make a great addition to the continuous integration toolkit.

Wednesday, September 24, 2014

Links of the day 24 - 09 - 2014

Today's links 24/09/2014 : #Storage, #Cloud , #Datacenter, #Watson , #IBM

Tuesday, September 23, 2014

Links of the day 23 - 09 - 2014

Today's links 23/09/2014: #AI , #Compression , #realtime data

  • blosc : an extremely fast, multi-threaded meta-compressor library optimized to leverage the CPU cache line layout to maximize throughput. It is designed to transmit data to the processor cache faster than a memcpy() call, and it leverages the SIMD (SSE2) and multi-threading capabilities present in modern multi-core processors. There are APIs for C and Python. Moreover, it can use different, very fast compressors.
  • The Log: What every software engineer should know about real-time data's unifying abstraction, by the LinkedIn engineering team.
  • Artificial Intelligence: Foundations of Computational Agents (2010) : book by David Poole and Alan Mackworth

Monday, September 22, 2014

Links of the day 22 - 09 -2014

Today's links 22/09/2014:  resource dis-aggregation, #cloud , #datacenter , #orchestration , #Memory
  • Bash Booster : a single file library, which provides various features useful during setup environment and preparing servers. It has been written using Bash only and requires nothing else.
  • The Rack Endgame: Converged Infrastructure and Disaggregation : a post similar to mine about how the dis-aggregation of server resources is the natural evolution of data center technology (especially at web scale).
  • Algorithmic Memory : adds a physical (HW) memory macro and algorithms to speed up RAM performance (about 15% more silicon area delivers a 2x to 4x performance improvement) [slides]
Intel's p-channel silicon-gate technology with buried-contact design techniques implemented the 1103 1024-bit memory

Friday, September 19, 2014

Links of the day 19 - 09 - 2014

Today's links 19/09/2014: #deeplearning ,#SSL , #ZFS

  • DeepLearning.University : An annotated bibliography of recent publications (2014-) related to Deep Learning.
  • Keyless SSL : CloudFlare explains the technology behind the remote SSL keys concept. It allows the keys to be stored outside the actual server and enables remote termination of the connection if the server storing the key goes offline.
  • File systems, Data Loss and ZFS : a deep dive into how data loss can occur, and how ZFS prevents it.

Thursday, September 18, 2014

Links of the day 18 - 09 - 2014

Today's links 18/09/2014: #KVM HA , #SDN , #HPC SSH

Wednesday, September 17, 2014

Links of the day 17 - 09 - 2014

Today's links 17/09/2014: Trees Algorithm and Credit Cards Skimmers
  • Blog post series on trees and associated algorithms: 
  • All about skimmers : a nice gallery of credit card skimmers and their different variants over the past 5 years.

Tuesday, September 16, 2014

From Converged Infrastructure To Disaggregated Datacenter

Recently the (hyper)converged infrastructure model has been picking up steam. Converged infrastructure refers to the tying together of server, storage, networking, virtualization and sometimes other resources into an integrated solution that is managed as a whole rather than through separate management systems.

Small to medium-sized companies, or those seeking a simpler IT environment, are interested in it for its scalability and easier management. Leading systems vendors are making their move, either offering their own solution or acquiring a company that delivers one, which keeps adding to the momentum. VMware also released its own solution, EVO:RAIL, at the latest VMworld (2014).

Moreover, the next evolution of the data center is already appearing and will, to a certain extent, directly threaten the converged infrastructure solution: dis-aggregated infrastructure. This approach tries to go a step further than simple aggregation. It aims to provide a finer granularity of resource management by grouping the various elements into pools (memory, compute, IO, storage) that can be freely composed without being constrained by the limitations of traditional server architecture. In a way, you will be able to compose the right server for the right service (VM, Docker, or whatever the buzz of the moment is).
Solutions are already appearing on the horizon. On the Intel side we have the Rack Scale Architecture (RSA) solution. RSA is a “rack fabric” using optical interconnects that allows for a much greater level of dis-aggregation and much greater modularity. The ultimate goal of RSA is completely modularized servers with pooled CPUs, pooled RAM, pooled I/O, and pooled storage. We also have FusionCloud-Sphere-Cube from Huawei. This solution is slightly less advanced on the hardware front; however, it is a generation or two ahead of the Intel one in that it offers a fully integrated software and hardware product, while Intel is still at the proof-of-concept stage. But let's see what the future holds: Intel is in it for the long haul.
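To make the pooling idea a bit more concrete, here is a minimal sketch of composing a logical server out of rack-level pools. Everything in it is an assumption made for illustration: the `ResourcePool` and `Rack` classes, the resource names, and the all-or-nothing allocation policy are hypothetical and do not describe the API of Intel RSA or of the Huawei product.

```python
from dataclasses import dataclass


@dataclass
class ResourcePool:
    """One rack-level pool of a single resource type (hypothetical model)."""
    name: str
    capacity: int
    allocated: int = 0

    @property
    def free(self) -> int:
        return self.capacity - self.allocated

    def reserve(self, amount: int) -> None:
        if amount < 0:
            raise ValueError("amount must be non-negative")
        if amount > self.free:
            raise RuntimeError(
                f"pool '{self.name}' exhausted: want {amount}, free {self.free}"
            )
        self.allocated += amount

    def release(self, amount: int) -> None:
        if amount < 0 or amount > self.allocated:
            raise ValueError("cannot release more than is allocated")
        self.allocated -= amount


class Rack:
    """A rack seen as a set of independent pools instead of fixed servers."""

    def __init__(self, **capacities: int) -> None:
        self.pools = {n: ResourcePool(n, c) for n, c in capacities.items()}

    def compose(self, **request: int) -> dict:
        """Carve a logical server out of the pools, all-or-nothing."""
        # Check every pool first so a failed request leaves the rack untouched.
        for name, amount in request.items():
            if name not in self.pools:
                raise KeyError(f"no such pool: {name}")
            if amount > self.pools[name].free:
                raise RuntimeError(f"pool '{name}' cannot satisfy {amount}")
        for name, amount in request.items():
            self.pools[name].reserve(amount)
        return dict(request)

    def decompose(self, server: dict) -> None:
        """Return a logical server's resources to their pools."""
        for name, amount in server.items():
            self.pools[name].release(amount)


if __name__ == "__main__":
    rack = Rack(cores=64, ram_gb=512, storage_gb=10_000, nics=8)
    database = rack.compose(cores=8, ram_gb=256, storage_gb=2_000, nics=1)
    web = rack.compose(cores=2, ram_gb=4, nics=1)
    print({name: pool.free for name, pool in rack.pools.items()})
```

The point of the sketch is that the memory-heavy database and the tiny web front end are both cut from the same pools, instead of each being forced onto a fixed server shape.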

These solutions tend to focus on the hyperscale part of the datacenter spectrum, as they require a significant upfront investment to profit from what the technology has to offer. One consequence is that converged and hyper-converged solutions will take the low to mid sections of the market, while the dis-aggregated datacenter approach will initially tap into the mid to high layer. This will further fragment the market and intensify both competition and consolidation within the vendor ecosystem.

Software will be key to the success of such architectures. Currently, most cloud system architectures are not well sized or designed to handle fine-grained resource management. Today we orchestrate operations at a timescale of minutes and at server granularity. Tomorrow, however, we will be managing individual cores, GBs of RAM, IO devices, and networks at a timescale of seconds or even less, and current architectures are not tailored to handle such rapid update rates.
Last but not least, it will become increasingly difficult for single-solution vendors (storage, network, etc.) to stay relevant with the rise of such a model, and they will be forced to go the OEM route or be absorbed if they want to survive.
As resources become commoditized, the hardware becomes a utility and margins diminish, making it difficult for specialized vendors to survive (look at Cisco and their UCS push).

Links of the day 16 - 09 - 2014

Today's links  16/09/2014 : memcached, facebook, consensus

Monday, September 15, 2014

Links of the day 15 - 09 -2014

Today's links 15/09/2014: Programming Ebooks, Intel Haswell CPU, R

Friday, September 12, 2014

Links of the day : 12 - 09 - 2014

Today's links 12/09/2014 : Lambda Calculus ,  Monitoring, #DeepLearning,  #BigData

Thursday, September 11, 2014

Links of the day 11- 09 - 2014

Today's links : Rack Scale Architecture, #IDF14 , Concurrency
  • Rack Scale Architecture: Platform and Management : Intel at #IDF14 is slowly delivering its RSA platform. Nice to see that soon we won't compose servers but directly draw compute, memory and IO resources out of pools, which will result in increased efficiency thanks to the greater flexibility of the overall setup.
  • Understanding Hybrid Concurrency Models : a classification of the different concurrency paradigms (threads and events) and combinations thereof.

Wednesday, September 10, 2014

Links of the day 10 - 09 - 2014

Today's links: #docker , #unikernel , #jit , #virtualization , #IDF14 , Business Model

Tuesday, September 09, 2014

Links of the day : 09 - 09 - 2014

Today's links : #Docker, #bittorrent, #xeon, #intel, #CI

Monday, September 08, 2014

Links of the day 08 - 09 - 2014

Today's link : text indexing, CI, code analysis and electronics.

  • bleve : modern text indexing for Go, with a structure loosely equivalent to Lucene's
  • Continuous Integration illustrated  : if you want to explain to your kids what you do all day :) 
  • srclib : a polyglot code analysis library, built with hackability in mind.
  • All about circuits : everything you ever wanted to learn about electricity and electronics 

Friday, September 05, 2014

Links of the day 05 - 09 - 2014

Today's links : Spam, Consensus, Distributed system, Raft

  • Raft explained: Visual explanation of the Raft consensus algorithm
  • Modern anti-spam and E2E crypto : an interesting and really enlightening email from an ex-Googler describing the past, present and future of the spam war and the potential effect of cryptography

Thursday, September 04, 2014

Links of the day 04 - 09 -2014

Today's links: Machine learning, Visualization, Checkpoint, Algorithms.

  • Machine learning for Beginners : loads of machine learning resources, for beginners and non-beginners alike.
  • Visualizing Garbage Collection Algorithms : Nice visual of different garbage collection algorithms. 
  • DMTCP : really cool: system-level checkpoint-restart in user space, with no kernel modification! Maybe, combined with #Docker, it could enable the migration of service state across Docker images, allowing live migration without the whole kernel modification fuss.

Wednesday, September 03, 2014

Links of the day : 03 - 09 - 2014

Today's Links: Anti-agile, Fault tolerance , Memory Management

Tuesday, September 02, 2014

Links of the day 02 - 09 - 2014

Today's links : SSH, Linux Kernel I/O stack, Linux  Kernel Locking,Systemd, Data Visualization

Monday, September 01, 2014

Links of the Day : 01 - 09 - 2014

Today's link : Machine Learning, BigData, Performance, Docker and Neural Networks.