Reflections Of The Void: 2016

Friday, December 23, 2016

[Links of the Day] 23/12/2016 : Microsoft Configurable cloud (with fpga), Open Pilot OSS driving software, Deep learning is all about rigor

Microsoft's Production Configurable Cloud : custom NIC + FPGA built in to provide a highly configurable and dynamic network stack in Microsoft's datacenters. The work is really impressive. It demonstrates how pervasive FPGAs and custom hardware will be in future datacenters.
Open Pilot : open source driving agent providing Adaptive Cruise Control (ACC) and Lane Keeping Assist System (LKAS) for Hondas and Acuras. This is a really interesting solution and I wonder how fast other companies will start to leverage it or open source their own solutions in order to accelerate adoption. However, without an extremely strong verification and proof system (formal methods) it will be extremely hard (and probably illegal) to deploy such software at this stage.
Nuts and Bolts of Building Deep Learning : Andrew Ng reiterated at NIPS2016 that there is no secret AI equation that will let you escape your machine learning woes. All you need is some rigor. [video]

Wednesday, December 21, 2016

[Links of the day] 21/12/2016 : Bigdata Systems comparison, State of EU tech & NuttX RTOS

Comparative Evaluation of Big-Data Systems on Scientific Image Analytics Workloads : If you ever wonder which big data system (SciDB, Myria, Spark, Dask, or TensorFlow) you need to use, this comparison can give you a starting point. Even if the use case is rather narrow, it provides a good point of comparison.
The State of European Tech 2016 : fantastic read, a data-driven look inside European tech in 2016.
NuttX : real-time operating system focused on standards compliance (POSIX and ANSI) and a small footprint.

Monday, December 19, 2016

[Links of the Day] 19/12/2016 : Cloud storage consistency models, heterogeneous memory management and atomic consistency for storage class memory

Consistency Models for Cloud Storage Services : A must read for anybody relying on any form of cloud storage. It is imperative to understand the consistency model of these services in order to avoid bad surprises. Sadly, a lot of cloud storage services out there lack official documentation on the subject, or are really fuzzy and lack proof.
Soft2LM : heterogeneous memory management; basically, it optimises memory allocation and migration between tiers in order to minimise power consumption while maximising performance.
Free atomic consistency in storage class memory with software based write-aside persistence : interesting article on a software stack that aims to deliver atomic consistency for SCM in write-aside scenarios. I am not sure how often the write-aside pattern is used, though.

Tuesday, December 06, 2016

Another tale of Execs bottom-up Blindness : SAP, Oracle, [Insert Software Giant here] vs AWS

After watching this year's AWS re:Invent show, I can't help but have this strange feeling of “déjà vu”. AWS manages to deliver exciting new products and solutions that take the industry by storm. GreenGrass literally takes my prediction from last year and makes it reality. What’s even scarier is that, with GreenGrass, AWS achieves the feat of unifying #IoT and #DevOps under a common platform.

But my feeling of déjà vu didn’t come from the GreenGrass announcement. It came from the Step Functions announcement. And it felt like a textbook repeat of what happened to Detroit's big three when Toyota took over the US car market: Another case of bottom up blindness.

Step is the natural next step in the evolution of the AWS product portfolio. It nicely complements the serverless Lambda product and lets you organise your serverless logic flow in a transparent manner.

But what is more important is the implication of such a release. Step allows you to create and coordinate complex workflows, which is just a step away from having a full blown ERP. It is actually even better than an ERP, as Step allows you to coordinate any kind of distributed application, letting you define business process workflows that seamlessly blend business logic and application logic.
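To make the workflow idea concrete, here is a minimal sketch of a state machine runner in plain Python. It is not the Step Functions API or its state language; the `run_workflow` function, the `ORDER_FLOW` table and the purchase-order states are all hypothetical names made up for illustration. Each state runs a task and names its successor, and a callable successor plays the role of a choice between branches.

```python
def run_workflow(states, start, payload):
    """Run tasks in sequence; each state names its successor, or None to stop."""
    current, trace = start, []
    while current is not None:
        state = states[current]
        payload = state["task"](payload)
        trace.append(current)
        nxt = state.get("next")
        # A callable "next" acts as a choice: it picks the successor from the payload.
        current = nxt(payload) if callable(nxt) else nxt
    return payload, trace


# Hypothetical purchase-order process mixing business and application logic.
ORDER_FLOW = {
    "validate": {
        "task": lambda order: {**order, "valid": order["amount"] > 0},
        "next": lambda order: "approve" if order["amount"] > 1000 else "ship",
    },
    "approve": {
        "task": lambda order: {**order, "approved_by": "manager"},
        "next": "ship",
    },
    "ship": {
        "task": lambda order: {**order, "status": "shipped"},
        "next": None,
    },
}
```

Calling `run_workflow(ORDER_FLOW, "validate", {"amount": 5000})` walks through the approval branch, while a small order goes straight to shipping. In a hosted service each task would be a separate function rather than a lambda in the same process.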

24 hours before the release of Step, I was mentioning on Twitter that AWS just needed a good process workflow service to open up the ERP business.

@avideitcher Yep, but they are working their way up , DB / BI -> need job / task / process workflow , then you add CRM / ERP
— Benoit Hudzia (@blopeur) December 1, 2016

One day later, Step Functions emerged, and it won’t be long before we see ERP-like functionality emerging on top of Step / Lambda. This release sounds like a swan song for the likes of SAP or Oracle. They had plenty of warnings, but like many in other industries they buried their heads in the sand.

With Step, AWS now offers all the building blocks needed to develop a full blown ERP without the hassle of taking care of all the nitty gritty details (scaling, resilience, deployment, etc.). To name some of the main ones:

Database : Redshift / Aurora / DynamoDB
Data Ingestion : Firehose
BI : Quicksight
Business process : Step + lambda
Mobile : AWS Mobile

SAP, especially, should have taken the hint when they announced the availability of their in-memory database on AWS in 2012 and AWS announced Redshift a couple of days later.

SAP, Oracle, and many others are repeating the same errors that other industry giants fell for:

Failure to master small products :

They are addicted to the revenue they extract from the fat margins of the top 20% of customers (think Nestle, CocaCola, Caterpillar, etc.). These customers deploy massive ERP systems, while smaller customers tend to be frowned upon because the margins extracted from them are not high enough. Sadly, like with the automotive industry, history repeats itself. Detroit's big three lost to Toyota because they didn’t care about losing the small car market. Margins were small enough that they deemed Toyota could have the small car market as long as they retained the higher end one. However, by doing so they let Toyota gain a foothold and work its way up the food chain while they lost market share. AWS is doing the same thing. It started with the infrastructure and is now on their doorstep, looking at their crown jewels. By losing mastery of the smaller products, they lose the knowledge necessary to deliver products for the companies that will become the giants of tomorrow.

Failure to embrace cloud :

SAP sold their hosting operation to T-Systems in 2009. They literally sold off all the expertise that would have helped them transition to offering a solid cloud solution. They have been left in the dust by the competition and are fooling themselves if they think they can catch up. Ellison, on the other side, is trying to fool his shareholders by promising to catch up with AWS. However, at the current rate of their infrastructure investment they will reach the current AWS infrastructure size in FIVE YEARS!!

Failure to simplify their stack :

Anybody who has used an SAP or Oracle system knows how painful it is to deploy the simplest web service, let alone an ERP system. Moreover, it is almost impossible to learn about these systems if you are not working for either a company that uses them or one of the companies that produce them.

Failure to learn :

These software giants traditionally didn’t run their own systems. And to be honest, hardware and operational costs were often a fraction of the overall license cost. Because of that, it was easy to tell the customer to throw more hardware at the software problem. However, everything changes when you start to offer your solution as a service. And SAP experienced this the hard way with their SAP ByDesign solution (discontinued in 2013 and revived a year later). Rumor was that the company was spending 7 euros to run the system for every euro it was getting from its customers.

However, they didn’t learn from their mistakes and change their approach to building, delivering, and running their systems. Look at S/4 HANA: even today you cannot run it on anything other than the humongous X1 instance. And this lack of learning seems to be widespread among the software giants. Surprisingly, almost 5 years after the announcement of HANA availability on AWS, I have yet to see a customer running a production HANA on AWS.

actually they don't us it for prod @jplsegers if you read the case study its only test/dev ... pic.twitter.com/m3hT7bU7au
— Benoit Hudzia (@blopeur) November 16, 2016

Because of these failures stemming from bottom-up blindness, these companies easily fall prey to the Tower of Hanoï fallacy, and as a result they cannot :

transition to a new value chain
acquire new technical skills/knowledge
expand to new markets and business models
compete with ecosystem (cloud) natives
manage revenue stream self cannibalization

Traditional software giants have feet of clay and AWS already chipped its way up to the knee without them noticing.

Friday, December 02, 2016

[Links of the Day] 02/12/2016 : AWS best practice, 1k+ RISC-V with Shared memory, Verification of distributed systems

AWS Well-Architected Framework : AWS document outlining high level cloud best practices. Not really an in-depth technical solution, but it provides good guidelines for organisations.
Towards Thousand-Core RISC-V Shared Memory Systems : MIT is advocating for leveraging its TARDIS cache coherence protocol to scale the RISC-V architecture to 1k+ cores. The interesting part is that they are advocating for a shared memory system using a 3D mesh, and that RISC-V and TARDIS seem oddly compatible architecture-wise. Now we need to see if the cache technology can deliver on its promise. 1K cores is a hell of a lot of coherence to maintain.
The Verification of a Distributed System : talk by Caitie McCaffrey where she presents strategies to prove system correctness. This is rather important, as too often companies build distributed systems and swear that they satisfy some part of the CAP theorem. But too often they crumble, especially if @aphyr decides to take an interest in them (or even better, gets paid to do so).

Monday, November 28, 2016

[Links of the Day] 28/11/2016 : Earth Computing network fabric, CS video courses, Go app tracing

Earth Computing Network Fabric : event-based protocol that specifically targets the datacenter, as it eliminates the need for heartbeats and timeouts. The protocol relies on recoverable atomic tokens to deliver deterministic in-order communication. To some extent they are proposing to move back to lattice systems where each server is a node within the network and acts as a router for messages. This eliminates the need for switches and looks really neat. However, adoption might be difficult due to ubiquitous Ethernet hardware and the need to change the underlying communication protocol. Last but not least, I do not really know how to efficiently secure and trust messages on such a network. [slides]
Computer Science video courses : Extensive collection of links to CS courses ranging from introductory to expert, covering pretty much the full scope of CS subjects (DB, distributed systems, etc.)
Appdash : Application tracing system for Go, based on Google's Dapper.

Friday, November 25, 2016

[Links of the Day] 25/11/2016 : CD/CI maturity model , Deep Learning lip reading, Microservices make

Continuous Delivery Maturity Model : a look at the different levels of maturity for continuous integration-delivery-build-.... in software development
LipNet : deep learning for full sentence lip reading. One step closer to a fully fledged HAL.
Dmake : tool to manage micro-service based applications. It lets you easily build, run, test and deploy an entire application or one of its micro-services.

Wednesday, November 23, 2016

[Links of the Day] 23/11/2016 : AMD Exascale vision, Hardware Resiliency myths and truths, MIT EmTech

Resiliency for Reliability - Myths and Truths : this slide deck provides an overview of resiliency issues and how Intel tackles them for hardware faults. From fans down to soft errors ( ex: neutron beam ... yes this can £%£ your system). The authors present the two types of approach: reactive and proactive handling of errors.
AMD's Exascale computing vision : It's all about 3D stacked chips with future interconnects. The interesting bit is the ROCm platform, the P2P multi-GPU and P2P with RDMA. Slowly we are removing the need to have a full server to deploy GPUs, one step closer to a fully modular system with each resource pooled and optimized in its own enclosure. It's a lot easier to design power supplies, cooling systems, etc. when you do not have to deal with heterogeneous hardware with different power and cooling profiles (CPU, memory, disk, etc. in the same enclosure).
MIT EmTech 16 : This year MIT EmTech is all about AI & machine learning ... reaching maximum hype in the domain

Monday, November 21, 2016

[Links of the Day] 21/11/2016 : Erasure Code for Big data and cluster cache, Emerging Interconnect Tech

Erasure Coding for Big-data Systems : Technical report presenting the state of the art of erasure codes and how they are used in practice.
EC-Cache : the authors present an interesting solution where they use erasure codes to compensate for the limitations of selective replication. The solution provides load-balanced, low-latency cluster caching and improves resilience against failure thanks to the inherent benefits of the code.
Emerging Interconnect Technologies : really cool overview of current and future interconnects, especially on-chip and chip-to-chip interconnects. This looks at the future of communication when chips will be stacked to keep Moore's law going.
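For readers new to erasure codes, here is a minimal sketch of the simplest possible one: k data chunks plus a single XOR parity chunk, which survives the loss of any one chunk. This is only an illustration of the principle; the systems discussed in the links above use more capable codes, and the function names here are my own.

```python
from functools import reduce


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, k):
    """Split data into k equal-size chunks (zero-padded) and append one XOR parity chunk."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    return chunks + [reduce(xor_bytes, chunks)]


def recover(chunks, missing):
    """Rebuild the chunk at index `missing` by XOR-ing all surviving chunks."""
    survivors = [c for i, c in enumerate(chunks) if i != missing]
    return reduce(xor_bytes, survivors)
```

With k = 3 the storage overhead is 4/3 of the original data, against 2x for keeping a second full copy, which is the kind of trade-off between durability and resource usage these papers explore.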

Friday, November 18, 2016

[Links of the Day] 18/11/2016 : Extreme Scale OS, GPU Stream Benchmark, Neural Net that produce neural net

Neural Architecture Search with Reinforcement Learning : Neural net that produces neural nets. The cool thing is that the authors are able to beat human-designed models for text processing and deliver equivalent performance for image processing models. Who needs humans anymore....
Extreme-Scale Operating Systems : multi-OS research project at Intel aiming to be the node OS for HPC machines. Intel is trying to deliver a polymorphic OS that can quickly adapt to new software and hardware without the need for the specialized solutions commonly found on high-end HPC systems. To some extent it looks like the Jailhouse system, where the HW is physically partitioned: a few cores are dedicated to management, while the rest are partitioned and run a lightweight kernel (LWK) + application. Note that I really resent Intel for always trying to rename things that are commonly used. LWKs are unikernels, dammit. Anyway, it's Jailhouse + unikernel for HPC.
GPU-STREAM : Stream benchmark for GPUs, a much needed benchmark to understand and quantify the memory transfer rate to and from global device memory on GPUs.

Wednesday, November 16, 2016

[Links of the Day] 16/11/2016 : Open Whisper, Security via BPF + XDP, Why cloud fails

X3DH : stands for Extended Triple Diffie-Hellman key agreement protocol. It allows asynchronous secure communication.
Cilium : BPF is starting to emerge as the dominant tool for any network functionality out there. Cilium leverages BPF and XDP to enforce dynamic security rules via eBPF. [github]
Why Cloud Fails : Excellent review by Murat of the paper analyzing various cloud failures. It turns out that the vast majority are config related and upgrade related. Human failure only represents a very small portion of the overall numbers. [paper]

Monday, November 14, 2016

[Links of the day] 14/11/2016 : Time series features extraction, GCHQ data analysis platform and Alibaba's Kafka contender : RocketMQ

TsFresh : Automatic extraction of relevant features from time series. This is really useful for anybody dealing with metrics. It handles data cleaning and feature extraction in one sweep.
Stroom : GCHQ's data processing, storage and analysis platform.
RocketMQ : Alibaba's MQ, proposed as an Apache project. It tries to solve some of the limitations of Kafka while providing better performance.
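To give an idea of what "feature extraction from time series" means in practice, here is a small standard-library sketch that cleans a metric series and summarizes it as a handful of numbers. It does not use the TsFresh API; `extract_features` and the choice of features are my own assumptions for illustration.

```python
import math
import statistics


def extract_features(series):
    """Drop missing samples (None or NaN), then summarize the series as a feature dict."""
    clean = [x for x in series if x is not None and not math.isnan(x)]
    changes = [abs(b - a) for a, b in zip(clean, clean[1:])]
    return {
        "length": len(clean),
        "mean": statistics.fmean(clean),
        "stdev": statistics.pstdev(clean),
        "minimum": min(clean),
        "maximum": max(clean),
        # Average absolute step between consecutive samples, a rough volatility measure.
        "mean_abs_change": statistics.fmean(changes) if changes else 0.0,
    }
```

A real library computes many more features and then filters them for relevance; the point here is only that each series collapses into a fixed-size row that a classifier or an alerting rule can consume.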

Friday, November 11, 2016

[Links of the Day] 11/11/2016 : Anonymous Trustless Bitcoin, Zap golang log lib, Intel RSA controller

ZeroCash : Trustless Bitcoin Tumbling. The authors propose a pooled approach to anonymise transactions in Bitcoin. However, they go a step further than just pooling: they propose a system where participants can anonymously check resources in and out of a global pool, effectively creating an anonymous cooperative resource sharing infrastructure. [github] [Paper]
Zap : Fast, structured, leveled logging in Go. When you start to reach Uber-scale or other hyperscale microservice architectures, every aspect counts, and logs are everywhere. This library provides high performance structured logging for Go.
Scalable software controller : This controller basically allows on-the-fly allocation of hardware resources (compute, memory, storage, network) based on the demands of the deployment tool (openstack, k8, mesos, etc.).

Wednesday, November 09, 2016

[Links of the Day] 09/11/2016 : Deep Neural Net Threats, Scaling Uber, Tcp over Sound

Assessing Threat of Adversarial Examples on Deep Neural Networks : machine learning is the next frontier for hackers. And because of its inherent opacity, it requires special capabilities to secure systems that rely on this underlying technology. This paper shows that for text driven classification, adversarial examples are more an academic curiosity than a real threat. However, we need to see if this applies to other types of classification.
Lessons learned about scaling Uber : Many talks are about scaling, however most companies and startups would love to have those problems. Often it's not about scaling, it's about having the right product market fit. Then you can enjoy the roller coaster of scaling problems.
Quiet : TCP over sound. This is really cool: it lets you pass data through the speakers of android devices.

Monday, November 07, 2016

[Links of the Day] 07/11/2016 : Baidu Open Source Repo(s), Wan Replicated DB

Baidu : Baidu's open source code on Github. It looks like it replicates a lot of the services / features that other hyperscale systems use. Raft seems to be the default underlying consensus protocol for all applications. A lot of nice goodies in there, especially:

BFS : Baidu file system that provides the underlying persistence for Baidu's real time applications. It's a distributed, multi-datacenter file system that uses raft for metadata coherence and a shared nothing approach for linear scalability.
Tera : Distributed database
Galaxy : mesos / kubernetes equivalent.
Paddle : Distributed machine learning
iNexus : Distributed K/V store. Looks similar to consul, and it also uses raft as the underlying consensus protocol

Bedrock : WAN-replicated distributed data (base). Designed to use SSD, with other nice features.

Friday, November 04, 2016

[Links of the Day] 04/11/2016 : high performance reliable message passing framework, speculative paxos, Awesome Falsehood

Aeron : Aeron is an efficient reliable UDP unicast, UDP multicast, and IPC message transport. The key word here is reliable. Under the hood is a brokered architecture built from the top down with performance in mind. They offer Java and C++ clients.
specpaxos : Interesting concept but relies on multicast. SDN might help solve the inherent multicast drawbacks by creating topologies, distribution trees, etc.. ahead of time. But practically, how often do you see multicast deployed and enabled in modern datacenters ?
Awesome Falsehood : curated list of awesome falsehoods programmers believe in.

Wednesday, November 02, 2016

[Links of the Day] 02/11/2016 : Unik fast easy unikernel builder, Noms decentralized DB and dev books

Noms : decentralized database using Git principles. There are some nice features in there, such as content addressing (no duplicates), append only and, last but not least, decentralization. Which means you can fork / merge, disconnect, etc. for seconds, hours or years, like Git. However, I am not sure yet how they handle merge and conflict resolution....
Programming books : Good list of books for developers.
Unik : Tool by EMC to compile unikernels directly rather than going the binary route. It's nice to see an increased effort to facilitate unikernel adoption. Previously I talked about the effort. This is a slightly different approach, as it allows building unikernels in almost any language using a tool chain as intuitive as docker. If this trend continues we might see a decline in container adoption with a move to unikernels. But it's not for the short term, as the optimisation cycle of container technology hasn't fully kicked in yet. [video] [slides]
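The "content addressing (no duplicate), append only" properties mentioned for Noms are easy to picture with a tiny sketch. This is not Noms code: the `ContentStore` class is hypothetical, and using SHA-256 as the address is my assumption, chosen only because it is in the standard library.

```python
import hashlib


class ContentStore:
    """Append-only store where each value is addressed by the hash of its content."""

    def __init__(self):
        self._chunks = {}

    def put(self, data):
        """Store data and return its address; identical content is stored only once."""
        key = hashlib.sha256(data).hexdigest()
        # setdefault never overwrites: existing entries are immutable (append only).
        self._chunks.setdefault(key, data)
        return key

    def get(self, key):
        return self._chunks[key]

    def __len__(self):
        return len(self._chunks)
```

Because the address is derived from the content, two disconnected replicas that store the same value end up with the same key, which is what makes fork and merge cheap. Deciding what to do when the values differ is the merge and conflict question raised above, and this sketch does not address it.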

Monday, October 31, 2016

[Links of the day] 31/10/2016 : AWS open guide, uncertainty in deep learning, Hacking Google interview

Amazon Web Services Open Guide : This is THE practical guide for anybody who uses AWS. Really well constructed and easy to use guide for the majority of AWS services out there.
Uncertainty in Deep Learning : Thesis looking into the probabilistic aspect of Bayesian networks and how not everything is black and white in deep learning land. It seems that we need to see the emergence of a probabilistic programming + deep learning hybrid in order to handle the new world of uncertainty that is opening up with the progress of AI research. [PhD Thesis]
Google Interview University : program to help you beat the Google technical interview process. While this is excellent as a refresher course in basic computer science, I would also recommend the MIT course Hacking a Google Interview. When I interviewed at Google, I got almost a carbon copy of the questions in the MIT course. Sadly, the Google interview process is, let's say, "abysmal" for senior people. They want you to go through this marathon of interviews without giving you an idea of what you are applying for or any other information.

Friday, October 28, 2016

[Links of the day] 28/10/2016 : 3D chip stacking Wireless interconnect, Golem reverse MVC framework, Data Cleaning with R

ThruChip : The trend is to stack chips to save space and increase bandwidth. Most solutions use Through-Silicon Vias (TSV). Instead of using TSV, the ThruChip company proposes to use a wireless interconnect to link the different chips and claims that they can achieve terabytes/s of bandwidth. [video] [slides]
Golem : Golem turns the MVC app inside out by making the client the intermediary between the application servers and the database.
Data cleaning with R : data cleaning often takes more time than the analysis itself. This paper describes how to do it fast and efficiently with R.

Wednesday, October 26, 2016

[Links of the day] 26/10/2016 : PCOMMIT drop , Awesome Go , MIT #AI classes

Intel Drops PCOMMIT : Intel decided to simplify its persistent memory specific instruction set. The logic behind this change is that Asynchronous DRAM Refresh (ADR) is now a requirement for persistent memory support. As a result, there is no need for PCOMMIT anymore, because the Write Pending Queues are guaranteed to be flushed on power loss.
Awesome Go : all in the title
MIT Artificial Intelligence : video of MIT class on Artificial intelligence

Monday, October 24, 2016

[Links of the day] 24/10/2016 : HPC scratchpad memory architecture, Flexible Package manager

Spack : flexible package manager designed to support multiple versions, configurations, platforms, and compilers. Cool feature : it's non-destructive, which means you can have multiple versions co-existing in the same system. [Github]
Runnemede : energy-optimized research architecture using scratchpad ( software managed ) rather than HW cache. They achieve almost a 4x energy improvement vs standard design. [slides]
Efficient HPC Data Motion via Scratchpad Memory : Earlier paper demonstrating the efficiency of scratchpad rather than cache for Data motion in HPC systems.

Friday, October 21, 2016

[Links of the day] 21/10/2016 : Manager & Career Mgmt podcasts, Memory Enhancement, Computer Scientist Handbook

Manager Tools & Career Tools : podcast about management and career .
How to Think Like a Computer Scientist : interactive handbook. It's a little bit more about learning Python than actually being a computer scientist, but a good intro to programming.
Multiplicity of memory enhancement : human memory is complex, and with the accelerating advance of technology to support (and supplant) our retention capability, new ethical and societal consequences emerge. The authors look at the different memory systems and how technology evolution affects them.

Wednesday, October 19, 2016

[Links of the day] 19/10/2016 : #AI hard problems, Dark Silicon & Reliability , Transport Layer Dev Kit

Applied AI hard problems : current and future AI hard problems. The interesting bit is the "emergent" behavior aspect that computer scientists are trying to achieve, where AI is not tailored for a specific problem but adapts to the environment it encounters.
Dark silicon & Hardware Reliability : the authors look at the impact of the dark silicon approach (when not all components are turned on while the system is up) and how to leverage the "dark" ratio to maximise the lifespan of hardware. [slides]
TLDK : project led by Intel within the fd.io framework. It is trying to address the lack of high level (as in layer 4) packet processing capabilities. The project aims at delivering UDP/TCP etc. packet processing on top of the vector packet processing of FD.io (which can work on top of DPDK). By doing so, Intel will finally have a comprehensive framework that will enable DPDK based solutions to flourish beyond pure networking stack (NFV) solutions.

Monday, October 17, 2016

[Links of the day] 17/10/2016 : MIT Tardis 2.0 Cache, TCP/IP FPGA stack , Knowledge Defined Networking

Tardis 2.0 : MIT people are back with an optimized and extended version of their novel cache system.
FPGA TCP/IP stack : TCP/IP stack that can be embedded on an FPGA alongside applications. This allows a seamless flow of data without CPU interaction or reliance on other devices. You could do some neat inline processing of data flows using this. It supports 10 Gbps and thousands of concurrent connections. [github]
Knowledge-Defined Networking : merging network analytics and software defined networking by using machine learning. The objective is to enable automated network control. To some extent we should replace the "software is eating the world" mantra with a "machine learning is eating software" one. And it is closer than you think, at least for SDN, as there is an effort in OpenDaylight by Cisco et al. to push machine learning into the SDN framework.

Friday, October 14, 2016

[Links of the day] 14/10/2016 : Docker Infrakit , erasure Code for big-data and ARM research summit

InfraKit : Docker's answer to public cloud lock-in. It allows devs to easily deploy their systems on various cloud infrastructures without code changes.
Erasure Coding for Big-data Systems : PhD thesis of Rashmi Korlakai Vinayak on erasure codes for very large data systems. The author analyses the requirements and provides potential solutions allowing resource-efficient distributed storage codes. She looks at the various trade-offs that can be used to guarantee durability while limiting resource usage.
ARM Research Summit 2016 : live blog of the keynotes. A lot of the research issues are similar to the x86 ones, which can be worrisome, as ARM needs to be able to differentiate itself from Intel, especially in the server market.

Wednesday, October 12, 2016

[Links of the day] 12/10/2016 : Full stack fest 16, Product Dev handbook and Stealing Machine Learning Models

Full Stack Fest 2016 : Playlist of the videos of the Full Stack Fest 2016 conference
Product Aikido : handbook for an organisation’s Product Development Group.
Stealing Machine Learning Models : the authors of the paper propose a technique that analyses the responses a machine learning based system gives through its API in order to extract the model used. This in turn allows the attacker to determine the best inputs for manipulating the system.
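As a toy illustration of how API responses can leak a model, consider the simplest case: a linear model that returns its raw score. The sketch below is not taken from the paper; `make_api` stands in for a hypothetical remote prediction endpoint, and the attack only works because the assumed model is exactly linear and the output is unrounded.

```python
def make_api(weights, bias):
    """Simulate a black-box prediction API backed by a secret linear model."""
    def predict(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return predict


def extract_linear_model(predict, n_features):
    """Recover weights and bias with n_features + 1 queries to the black box."""
    zero = [0.0] * n_features
    bias = predict(zero)  # all-zero input exposes the bias
    weights = []
    for i in range(n_features):
        probe = zero.copy()
        probe[i] = 1.0  # unit vector isolates a single weight
        weights.append(predict(probe) - bias)
    return weights, bias
```

Three features cost only four queries here. Real services add noise, rounding or non-linear models, which makes extraction harder but, as the link above suggests, not necessarily impossible.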

Tuesday, October 11, 2016

Notes on SNIA Storage Developer Conference 2016

This year's SNIA Storage Developer Conference, chosen bits :

MarFS : scalable near-POSIX file system using object storage. What is really impressive is that MarFS is part of the 5-tier storage system of the Trinity project. Yes, FIVE tiers: RAM -> BurstBuffer -> Lustre -> MarFS -> Tape. MarFS sits above Tape for long term archival and aims at providing storage persistence that spans year(s) of usage. In comparison, Lustre, just above, aims at keeping the data for weeks only. What bothers me is the logic behind this approach, as most supercomputer systems have a 5-6 year lifespan. This implies that the project usage will span multiple generations of systems. [Github]
Hyperconverged Cache : It seems that Intel is starting to realize what we discovered years ago in the Hecatonchire project. Once you start to have near-RAM performance, disaggregating and pooling your resources becomes the natural next step for efficiency. And this is what they aim to achieve with a distributed storage cache system that would aggregate their 3D XPoint devices across a cluster in order to deliver a fast and coherent cache layer. However, without RDMA this approach seems a little bit pointless. The only thing that seems to save them is that the cloud storage backend (Ceph) has a big enough latency gap for them to exploit.
Erasure Code : Very good overview of modern erasure codes and their trade-offs. As always, not all codes are equal, but not all use cases are the same either.

Persistent Memory : As storage shifts away from HDD to Pmem, the number of talks around persistent memory exploded this year. The main focus seems to be shifting from pure NVM consumption to a remote access model.

NVMe over fabric : two talks on the recent progress of NVMe over fabric. Nothing really new there, just that it seems it will be the standard for remote storage access in the near future. [Mellanox] [Linux NVMf]
RDMA : It seems that Intel and others are aiming for direct persistent memory access using RDMA, bypassing the NVMe stack. The idea is to eliminate the latency of the NVMe stack. However, this requires some changes in the RDMA stack in order to guarantee persistence of data.

IOPMEM : interesting work where the author proposes to bypass CPU interaction between PCIe devices, basically enabling DMA between NVM and other devices. It then allows an RDMA NIC to talk directly to the NVM device on the same PCI switch. However, it doesn't really explain what persistence guarantees are associated with the different operations.
RDMA verbs extension : basically Mellanox proposes to add an RDMA flush verb that would mimic the CPU flush command. This operation would guarantee consistency and persistence of remote data.
PMoF : addresses the really difficult aspect of guaranteeing persistence and consistency when accessing persistent memory over fabric. Basically this talk describes all the nitty gritty details needed to avoid losing/corrupting data during access over fabric. This is what the RDMA flush verb will need to address, but for the moment it requires a lot of manual operations.

Last but not least, we can see references here and there to 3D XPoint from Intel, however it seems that the company has toned down its marketing machine, probably fearing some backlash because of the continuous claw-back on the claimed performance front.

Monday, October 10, 2016

[Links of the day] 10/10/2016 : k8s Stateful container App platform, Modern Bank Backend, Fast websocket & tcp server

Deepstream.io : fast, secure and scalable websocket & tcp server for mobile, web & iot
SuperGiant : Application platform specializing in stateful container orchestration, based on Kubernetes
Building a Modern Bank Backend : Not much on the real details, but it can be summarized as: we took "classic" banking features and used a new stack to deliver them. The HN discussion is also worth checking out. What is interesting is the discussion on industrial vs open source software when it comes to audit and security as well as reactivity. What the authors of the discussion point out is that it often boils down to the blame game: emergency patches from industrial vendors are NOT always thoroughly tested / audited / secured, but you benefit from the insurance / SLA in case something goes wrong. [HN discussion]

Friday, October 07, 2016

[Links of the day] 07/10/2016 : Product Launch Checklist, Hashiconf 16, Startup Text Rpg

Product Launch Checklist : What is important in this checklist, beyond the excellent breakdown of tasks, is the RACI model, which guarantees smooth and efficient cooperation between the various participants.

Responsible: The person who has to do it. (The doer).
Accountable: The person who makes the final decision and who has ultimate ownership of the task.
Consulted: The person who is consulted BEFORE a decision or action is taken.
Informed: The person who is informed that a decision or action is taken.

Hashiconf 16 : 2016 HashiCorp conference videos. For all the Consul addicts out there.
Startup text rpg : Hilarious exchange in the style of a text RPG where you are a startup founder. Read the thread.

You are in a startup. All around is a burning runway. There are exits to the North and East. You have a bootstrap. There is a VC here.

> |
— Stef Lewandowski (@stef) September 22, 2016

Wednesday, October 05, 2016

[Links of the day] 05/10/2016 : SQL Scan using NVMe to GPU using P2P DMA , Strange Loop 2016, Secure Time Service

SSD-to-GPU Direct DMA : Interesting work where the author uses P2P DMA to load data from NVMe to the GPU. This bypasses the RAM altogether. The objective is to accelerate PostgreSQL scan operations. This is really neat, but I am not sure that SQL DBs are the best choice of use case. I would have thought that columnar or K/V systems would hold better speedup potential because of the way the data is organised and processed.
Strange Loop : all the videos of the Strange Loop 2016 conference. Way too many good talks to mention them all. Just check it out.
RoughTime : secure time synchronization project. Everybody uses time, but nobody mentions clock attacks. This project aims at alleviating this potential threat, so you know all your data is accurately fresh.

Monday, October 03, 2016

[Links of the day] 03/10/2016 : Designing Internet Scale Services and Machine learning podcast

Internet Scale Service checklist : Succinct checklist that covers the basics. Nothing earth-shattering, but often one or two key elements are forgotten and the design might crumble down the line.
On Designing and Deploying Internet-Scale Services : classic paper by James Hamilton of the Windows Live team on how to build and operate very large scale internet services.
Talking Machines : excellent podcast providing a clear window into the world of machine learning.

Friday, September 23, 2016

[Links of the day] 23/09/2016 : Intel's 3dxpoint vanishing performance, VLDB16, Core to Core HW queue engine

3dxpoint performance evaporates : it seems that Intel is heavily scaling back its XPoint NVM performance claims, from 1000x to 10x (still good, but a far cry from what was promised). It seems that Intel had to push the technology early in order to counter a potential acquisition of its partner, Micron, by a competitor. Announcing the technology surely propped up the share price, making an acquisition difficult.
VLDB : the Very Large Data Bases 2016 proceedings are out. Sadly it's one big zip file and I didn't have time to go through it.
CAF : the authors propose a hardware core-to-core communication offloading engine, providing an efficient queuing mechanism for transferring data between cores. I am not 100% sure of the value, but the concept is interesting. Let's see if it catches on and if it can play well in the heterogeneous environment of today's datacenter, as core to core is slowly being replaced with core to GPU, core to FPGA or core to NVM.

Wednesday, September 21, 2016

[Links of the day] 21/09/2016 : NAS 16 , cloud optical interconnect, Netmiko

NAS : 2016 networking, architecture and storage conference, selected papers:

CircularCache : storage-wide cache system using a virtual queue mechanism for balancing usage and performance across the cluster.
Active Burst-Buffer : when storage is not fast enough, you start to move your processing into the buffers to save time.

Emerging Optical interconnect Technology for the Cloud : presentation by Finisar, the largest fiber optic transceiver provider in the world, on the trends in cloud interconnect technology. Like HPC and other systems, it's all about power / bits / seconds: power needs to decrease while bandwidth needs to increase. What is interesting is the impact of the topology used on the fabric requirements (HPC vs hyperscale datacenter). What is impressive is that aggregate bandwidth doubles every 3 years and that, at the same point in time, the cost per Gbps is lower for higher channel counts.
Netmiko : paramiko wrapper simplifying SSH connections to network devices

Monday, September 19, 2016

[Links of the day] 19/09/2016 : #AI bias, Incremental consistency , Customizable datacenter

Stuck in a Pattern : predictive policing tools are being widely adopted in corporations and public organisations, yet there is little transparency as to how these systems have been configured. It seems that the current set of software designed and deployed may reinforce discrimination and inequality under a veil of marketing publicizing intelligent solutions.
Incremental consistency guarantees : Instead of providing a single "hard" consistent answer to a query, the authors propose a system that provides multiple replies with incrementally stronger consistency guarantees, albeit at an incremental latency cost. This allows systems to make decisions based on their consistency requirements as well as their performance needs. This is interesting as it would allow some applications to take decisions based on consistent-enough information while being able to revise them if needed once a response with a higher level of consistency arrives.
Customizable Computing at Datacenter Scale : NAS 16 keynote. It seems that HPC and exascale systems are slowly converging toward a hybrid model with heterogeneous resources: FPGA, GPGPU, CPU, etc.
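
The incremental consistency idea above can be sketched in a few lines. This is only a toy illustration under my own assumptions, not the authors' system: the function name `incremental_read`, the "weak" / "strong" labels and the in-memory replicas (plain dicts mapping a key to a `(version, value)` pair) are all hypothetical.

```python
def incremental_read(key, replicas):
    """Yield (level, value) pairs, cheapest and weakest answer first.

    `replicas` is a hypothetical list of dicts mapping key -> (version, value).
    """
    # Weak answer: whatever the first (closest) replica holds, possibly stale.
    first = replicas[0].get(key)
    yield "weak", first[1] if first else None

    # Stronger answer: ask every replica and keep the highest version seen.
    known = [replica[key] for replica in replicas if key in replica]
    newest = max(known, key=lambda pair: pair[0]) if known else None
    yield "strong", newest[1] if newest else None


# Hypothetical replicas: the first one lags behind the other two.
replicas = [{"x": (1, "old")}, {"x": (2, "new")}, {"x": (2, "new")}]
answers = list(incremental_read("x", replicas))
```

A caller can act on the first answer right away and revise its decision if the later, stronger answer turns out to be different.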

Friday, September 16, 2016

[Links of the day] 16/09/2016 : Sigcomm 16 conference papers , Golang Numa scheduler, ICML videos

Sigcomm 2016 conference : schedule, papers and slides. I already mentioned Microsoft's paper yesterday on RDMA at scale over Ethernet fabric. Here are a few more selected papers:

Real-time Distributed MIMO Systems : MIT achieved the feat of creating a distributed MIMO system operating across devices with independent clocks. This allows the system to maximize bandwidth usage while maintaining fairness.
Globally Synchronized Time via Datacenter Networks : the authors propose a hardware solution leveraging the low-level network fabric to maintain synchronized clocks across the data-center. Basically they embed extra information within the fabric to maintain coherent time. It allows a 200ns clock skew within a 6-hop data-center, which is quite impressive even if it comes at the cost of requiring HW modifications.
Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure : Google's approach to network infrastructure management, and more specifically how they look at incident resolution and mitigate risk. The main goal of the effort described in this paper is to treat a change to the network as business as usual and not as an exceptional event.

Numa-aware scheduler for go : as machines get bigger, NUMA optimization starts to creep up as a requirement, especially with languages using garbage collection and automatic scheduling like Go. This proposal shows a possible approach for Golang to handle NUMA setups.
ICML 2016 conference : international machine learning conference videos

Wednesday, September 14, 2016

[Links of the day] 14/09/2016 : Ethics in AI , Survey of fully homomorphic encryption, RDMA over Ethernet at scale at Microsoft

Ethical Preference-Based Decision Support Systems : AI and other autonomous agents are becoming more ubiquitous in the human environment. As the decisions of these systems start to have a greater impact on our daily life, trust will need to be built, and to achieve that these systems will need to be perceived as acting in a moral and ethical way.
A brief survey of Fully Homomorphic Encryption, computing on encrypted data : fully homomorphic encryption allows you to manipulate encrypted data without decrypting it. This is great for databases and other systems as it allows a service to modify and update information without needing to know its content, effectively separating operation from knowledge. However this comes at a cost (but it's going down). We might finally end up with the security pipe dream where the data is immediately encrypted and is only manipulated in this form until it is finally consumed.
RDMA over Commodity Ethernet at Scale : It is interesting to see that RDMA is slowly starting to permeate hyper-scale data-centers. However, it is even more interesting to see that Microsoft decided to go for the RoCE version of it instead of InfiniBand. It makes sense, as there was a lot of investment in scaling Ethernet for their cloud infrastructure, and it allows a lot of reuse as well as collocating normal and RDMA traffic on a single underlying fabric.

Monday, September 12, 2016

[Links of the day] 12/09/2016 : Multi Hash libs, Papers we love and Goto Conference

OmniHash and Rhash : OmniHash is a Python multi-hash library; RHash provides similar functionality but in C. They are both small and versatile.
Goto Chicago , Amsterdam , and Stockholm conference 2016 videos : loads of good content and some rehashed material.
Papers we love conference : I am so jealous of the people who can attend. This looks like an amazing conference; I will link to the slide decks / videos as soon as they are up.
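
To give an idea of what a multi-hash library does, here is a minimal sketch that computes several digests of the same data. It uses only Python's standard `hashlib` and is not the API of OmniHash or RHash; the function name `multi_hash` and the default algorithm list are my own assumptions.

```python
import hashlib


def multi_hash(data, algorithms=("sha1", "sha256")):
    """Return {algorithm name: hex digest} for the given bytes."""
    # One hasher object per requested algorithm (names as accepted by hashlib.new).
    hashers = {name: hashlib.new(name) for name in algorithms}
    # Feed the same data to every hasher.
    for hasher in hashers.values():
        hasher.update(data)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


digests = multi_hash(b"")
```

For a large file the same loop can be fed chunk by chunk, so the data is read only once no matter how many algorithms are requested.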

Tuesday, September 06, 2016

[Links of the day] 06/09/2016 : KVM forum 2016, Beginner's guide to neural net, Seeing with Wi-Fi

KVM forum 2016 : KVM forum videos are up, with a notable talk on vGPU by NVIDIA. This is really interesting as it paves the way to multi-tenant GPUs in virtual environments, which allows resource sharing and hence improves the RoI of such devices.
A Beginner's Guide To Understanding Convolutional Neural Networks : 3-part blog post explaining neural networks. This is a really good introduction. [part-1 , part-2, part-3]
We Can "See" You via Wi-Fi : researchers devised a way to identify actions using Wi-Fi. Their method allows, to some extent, the recognition of human interactions. This is really interesting as Wi-Fi is pretty much ubiquitous. It would give anybody the capability to peek into an environment without the need for a costly setup.

Monday, September 05, 2016

[Links of the day] 05/09/2016 : Neural Net Architectures, Stanford 100 years of AI report, Queuing theory textbook

Neural Network Architectures : this article provides an overview of the evolution of neural network architectures and the different breakthroughs that led us to the current deep learning approach. The author provides a detailed overview of deep learning architectures and the logic behind their evolution and implementation.
One Hundred Year Study on Artificial Intelligence (AI100) : Stanford report on AI, where it fits today and where it is heading. This is an excellent high-level overview of the implications of the current boom in machine learning, deep learning, agents and AI. The report presents insights into the impact on daily life and business.
Introduction to Queueing Theory and Stochastic Teletraffic Models : textbook providing everything you want to know about queuing theory.

Friday, September 02, 2016

[Links of the Day] 02/09/2016 : Row Hammer in the cloud , Usenix Security conf and Agile IT Book

USENIX Security '16 : the USENIX Security conference proceedings are out. There are two notable papers in there:

Off-Path TCP Exploits: Global Rate Limit Considered Dangerous : it seems that there is a flaw in the current TCP specification and implementation that allows a blind off-path attacker to infer if any two arbitrary hosts on the Internet are communicating using a TCP connection. This could allow large scale denial of service or worse.
One Bit Flips, One Cloud Flops: Cross-VM Row Hammer Attacks and Privilege Escalation : this one sounds scarier than the previous paper. However, the inherent limitations of the approach reduce the potential scope of the attack. It basically leverages the Row Hammer procedure to attack a neighbor VM within a cloud system. The caveat is that it requires the cloud provider to allow paravirtualised guests to run, and it can only target PV guests. It is also easily defeated if the RAM used is ECC, which is the default RAM used in any decent data-center.

Agile IT Management: From Startup to Enterprise : really good book providing a well documented set of observations on IT’s current challenges that can orient you toward more effective decisions and actions in your journey toward IT excellence. But beware of buying too much into the hype of one solution fits all. Agile is just one part of the solution; old waterfall still has its place, as do intermediary approaches. That makes the difference between wisdom and knowledge.

Thursday, September 01, 2016

[Links of the Day] 01/09/2016 : Cloud reference model, Scaling with Threads and economics of response time

Economic Value of Rapid Response Time : classic 1989 paper demonstrating that lower software response time yields significant economic benefits with
ClouNS : A Cloud-native Application Reference Model for Enterprise Architects. The authors propose a reference model for cloud-native applications that relies only on a small subset of well standardized IaaS services. The reference model can be used for codifying cloud technologies. It can guide technology identification, classification, adoption, research and development processes for cloud-native application and for vendor lock-in aware enterprise architecture engineering methodologies.
Scaling to Thousands of Threads : excellent blog post looking at the misconception that thread-based systems are inherently flawed when it comes to availability.

Wednesday, August 31, 2016

[Links of the Day] 31/08/2016 : Open Lambda and consensus algorithms

Open Lambda : allows anybody to run a local lambda platform similar to AWS Lambda or Azure Functions.
AllConcur : in this paper the authors propose a distributed system that provides agreement through a leaderless concurrent atomic broadcast algorithm. What is interesting is that the authors claim a 17x performance increase vs the leader-based solution. However, there is a catch in their assumption : "We assume a model of reliable communication—messages cannot be lost (only delayed). This is a reasonable assumption if we consider a reliable protocol, such as TCP." I think that here we might hit a major issue, as even with TCP we can have duplicated / lost messages. Paxos makes no such assumption. Let's see if the idea can be adapted to an unreliable medium.
Flexible Paxos : in this paper the authors provide a proof that majority agreement isn’t required by Paxos: the sets of nodes required to participate in agreement (known as quorums) do not even need to intersect with each other within a phase, as intersection is only required between the quorums of the two phases.
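
The quorum condition behind Flexible Paxos can be checked mechanically. The sketch below is a minimal illustration and not the paper's code: the helper names and the 2x2 grid of nodes are assumptions made up for this note. It checks that every phase-1 quorum overlaps every phase-2 quorum, without asking quorums of the same phase to overlap.

```python
from itertools import product


def phases_intersect(phase1_quorums, phase2_quorums):
    """True if every phase-1 quorum shares a node with every phase-2 quorum."""
    return all(set(q1) & set(q2)
               for q1, q2 in product(phase1_quorums, phase2_quorums))


def same_phase_intersect(quorums):
    """True if every pair of quorums inside one phase shares a node."""
    return all(set(a) & set(b) for a, b in product(quorums, repeat=2))


# Hypothetical 2x2 grid of nodes:  a b
#                                  c d
# Columns are used as phase-1 quorums, rows as phase-2 quorums.
columns = [{"a", "c"}, {"b", "d"}]
rows = [{"a", "b"}, {"c", "d"}]

cross_phase_ok = phases_intersect(columns, rows)
rows_overlap = same_phase_intersect(rows)
```

Here `cross_phase_ok` is true while `rows_overlap` is false: the two rows never share a node, yet each of them still meets every column.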

Tuesday, August 30, 2016

[Links of the Day] 30/08/2016: trends in infrastructure , HPC / supercomputer and memory

Algorithms for future emerging technologies : Jack Dongarra's CRIM talk presenting trends in supercomputing [video]
Trends in - and the Future of - Infrastructure : One of the fathers of software-defined networking (SDN), Martin Casado, gives his view on the future of infrastructure.
Hot Chips - memory - 2016 : more persistent memory, more storage, but slower overall performance. What is interesting is that the trend is more about filling niche gaps than about raw performance or storage size improvements. It seems that memory, like CPU, is hitting a performance wall. The two big trends are NVDIMM as a near-term solution and PIM : processing in memory. PIM literally packages memory on top of the logic processor. It reduces data transfer cost, but the heat generated by the overall package quickly rises, and it needs to support >95 C. Power and heat are now the number one limiting factor in HW.

Monday, August 29, 2016

[Links of the Day] 29/08/2016 : Stacks project, Top deep learning and the most profitable or disrupt-able industry 2016

Stacks project : open source textbook and reference work on algebraic stacks and the algebraic geometry needed to define them.
Top Deep Learning Projects : the name says it all
The most profitable industry : 2016 list, which pretty much tells you which industries will be disrupted first.. : )

Thursday, August 25, 2016

[Links of the Day] 25/08/2016 : Micro-services architecture best practices, Semantic hashing and principles of programming language book

Best Practices for Building a Microservice Architecture : Article providing an overview of the best practices when shifting complexity from a monolithic system to a distributed one. Complexity doesn't disappear; it becomes bound to the interactions between simple services. While this sounds great, the complexity increases with the degree of interaction, which can follow a power law.
Semantic Hashing : paper describing a method using deep learning algorithms for generating fast hashes of documents. This method generates locality-sensitive codes that make it possible to execute a similarity search over a vast library of documents in a time independent of the size of the collection.
Principles of Programming Languages : book providing an introduction to the study of programming languages derived from Johns Hopkins University programming language courses.
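
The lookup side of semantic hashing can be sketched as follows. The binary codes here are made up by hand (in the paper they come from a deep model, which this sketch does not attempt to reproduce), and the names `build_index` and `neighbours` are my own. The point is only that the number of probes depends on the code length and the search radius, not on how many documents are stored.

```python
from collections import defaultdict
from itertools import combinations


def build_index(codes):
    """Map each binary code (stored as an int) to the documents sharing it."""
    index = defaultdict(list)
    for doc_id, code in codes.items():
        index[code].append(doc_id)
    return index


def neighbours(index, query_code, n_bits, radius=1):
    """Return ids of documents whose code is within `radius` bit flips of the query."""
    found = []
    for flips in range(radius + 1):
        # Enumerate every way of flipping `flips` bits of the query code.
        for bits in combinations(range(n_bits), flips):
            probe = query_code
            for bit in bits:
                probe ^= 1 << bit
            found.extend(index.get(probe, []))
    return sorted(found)


# Hypothetical 4-bit codes for three documents.
codes = {"doc1": 0b1010, "doc2": 0b1011, "doc3": 0b0101}
index = build_index(codes)
similar = neighbours(index, 0b1010, n_bits=4, radius=1)
```

With these codes, `similar` contains doc1 (same code) and doc2 (one bit away), while doc3 is too far from the query to be probed.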

Wednesday, August 24, 2016

[Links of the Day] 24/08/2016 : Berkeley Data science textbook, IaaS pricing trends, Wargaming conference

IaaS Pricing Patterns and Trends : Interesting to see that Google is really aggressive on its pricing.
Computational and Inferential Thinking : Textbook for UC Berkeley's Foundations of Data Science class.
Connections 2016 : Report on the Connections 2016 conference on wargaming. There is definitely more than a thing or two to be learned by corporations on how to leverage war-games for improving and testing various strategies and understanding competitor behavior. Sadly the slide decks are not available yet. I wish that there were also video recordings of this event.

Tuesday, August 23, 2016

[Links of the day] 23/08/2016 : Adapting In memory database architecture for Storage class memory and Datacenter network congestion management

The implication of Storage Class Memory for In memory database architecture :

SOFORT : The authors propose to modify the traditional in-memory database architecture in order to optimise its operation for upcoming storage class memory hardware. The idea is quite simple: get rid of the log mechanism and persist all data to NVM, except for the index, which needs to be maintained in RAM for performance reasons. SCM makes it possible to eliminate a lot of boilerplate architecture functionality by delivering fast byte-addressable persistent storage. However, developers now need to be aware of the transactional model imposed by this new class of persistent memories. [Slides]
Instant Recovery for Main-Memory Databases : This paper builds on top of SOFORT and looks at leveraging NVDIMM or SCM for speeding up crash recovery. The idea is not only to speed up normal operation but also to eliminate the recovery cost in case of an application crash. [Slides]
Note that both of these papers have an author working for SAP, so my guess is that we will start to see new dedicated features in SAP HANA for supporting SCM.

Flowtune : It seems that we are slowly going to see a return of the ATM model for the networking fabric in data-centers. In this paper the authors propose to combine a form of MPLS system with a centralized allocator for resource management and congestion avoidance. Basically the system identifies the establishment and end of each connection (called a flowlet). Using existing and past information, it derives an optimal path and resource allocation, minimizing interference and congestion over the lifetime of the flowlet. It looks like SDN is finally enabling a simplified and more robust ATM model within, and probably across, data-centers.
