Wednesday, July 08, 2020

Software RDMA revisited : setting up SoftiWARP on Ubuntu 20.04

Almost ten years ago I wrote about installing SoftiWARP on Ubuntu 10.04. Today I will be revisiting the process. First, what is SoftiWARP? Soft-iWARP is a software-based iWARP stack that runs at reasonable performance levels and seamlessly fits into the OFA RDMA environment. SoftiWARP is a software RDMA device that attaches to an active network card to enable RDMA programming. For anyone starting with RDMA programming, RDMA-enabled hardware might not be at hand; SoftiWARP is a very useful tool for setting up an RDMA environment to code and experiment with.

To install SoftiWARP you have to go through four stages: setting up the environment, building SoftiWARP, configuring SoftiWARP, and testing.

Setting up RDMA environment

Before you start, you should prepare the environment for building a kernel module and the userspace library.
Basic building environment

sudo apt-get install build-essential libelf-dev cmake

Installing userspace libraries and tools

sudo apt-get install libibverbs1 libibverbs-dev librdmacm1 \
librdmacm-dev rdmacm-utils ibverbs-utils

Insert common RDMA kernel modules

sudo modprobe ib_core
sudo modprobe rdma_ucm


Check if everything is correctly installed : 

sudo lsmod | grep rdma 

You should see something like this : 

rdma_ucm               28672  0
ib_uverbs             126976  1 rdma_ucm
rdma_cm                61440  1 rdma_ucm
iw_cm                  49152  1 rdma_cm
ib_cm                  57344  1 rdma_cm
ib_core               311296  5 rdma_cm,iw_cm,rdma_ucm,ib_uverbs,ib_cm

Now install some libraries needed to build the userspace libs : 

sudo apt-get install build-essential cmake gcc libudev-dev libnl-3-dev \
libnl-route-3-dev ninja-build pkg-config valgrind


Installing SoftiWARP

Ten years ago you had to clone the SoftiWARP source code and build it (https://github.com/zrlio/softiwarp.git). Now you are lucky: it is included by default in Linux kernel 5.3 and above!
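You can check that your kernel is recent enough with : 

uname -r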

You just have to type : 

sudo modprobe siw

Verify it works : 

sudo lsmod | grep siw

You should see : 
siw                   188416  0
ib_core               311296  6 rdma_cm,iw_cm,rdma_ucm,ib_uverbs,siw,ib_cm
libcrc32c              16384  3 nf_conntrack,nf_nat,siw

Moreover, you should check that an InfiniBand device is present : 

ls /dev/infiniband 

Result : 

rdma_cm


You also need to add a file /etc/udev/rules.d/90-ib.rules containing the entries below : 

 ####  /etc/udev/rules.d/90-ib.rules  ####
 KERNEL=="umad*", NAME="infiniband/%k"
 KERNEL=="issm*", NAME="infiniband/%k"
 KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
 KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
 KERNEL=="uat", NAME="infiniband/%k", MODE="0666"
 KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
 KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"
 ########


If the file doesn't exist, you need to create it.
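To load the new rules without rebooting, you can ask udev to reload them (standard udevadm commands, nothing SoftiWARP-specific) : 

sudo udevadm control --reload-rules
sudo udevadm trigger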


I would also suggest adding the modules to the list of modules loaded at boot by adding them to the /etc/modules file, as shown below.
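For example, appending the modules used in this post (adapt the list to your needs) : 

 ####  /etc/modules  ####
 ib_core
 rdma_ucm
 siw
 ########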

You now need to reboot your system.


Userspace library

Normally, recent libraries support SoftiWARP out of the box. But if you want to compile your own version, follow the steps below. However, do this at your own risk... I recommend sticking with the standard libs.

Optional: building the SIW userland libraries : 

All the userspace libraries are in a single convenient repository. You just have to clone the repo and build all the shared libraries. If you want, you can also build just libsiw, but it's easier to build everything at once. 

git clone https://github.com/zrlio/softiwarp-user-for-linux-rdma.git
cd ./softiwarp-user-for-linux-rdma/
./build.sh

Now we have to set up $LD_LIBRARY_PATH so that the built libraries can be found. 
cd ./softiwarp-user-for-linux-rdma/build/lib/
export LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH


or you can add the following line to your .bashrc profile:
export LD_LIBRARY_PATH=<<PATHTOTHELIBRARIES>>:$LD_LIBRARY_PATH

End of optional section



Set up the SIW interface : 


Now we will set up the loopback and a standard Ethernet interface as RDMA devices:

sudo rdma link add <NAME OF SIW DEVICE> type siw netdev <NAME OF THE INTERFACE>


In this case for me : 

sudo rdma link add siw0 type siw netdev enp0s31f6
sudo rdma link add siw_loop type siw netdev lo
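You can also list the new links with the rdma tool itself; the output should look roughly like the following (the exact format varies between iproute2 versions) : 

rdma link

link siw0/1 state ACTIVE physical_state LINK_UP netdev enp0s31f6
link siw_loop/1 state ACTIVE physical_state LINK_UP netdev lo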

You can also check that the two devices have been correctly set up using the ibv_devices and ibv_devinfo commands.
Result of ibv_devices :
    device              node GUID
    ------           ----------------
    siw0             507b9ddd7a170000
    siw_loop         0000000000000000

Result of ibv_devinfo :

hca_id: siw0
 transport:   iWARP (1)
 fw_ver:    0.0.0
 node_guid:   507b:9ddd:7a17:0000
 sys_image_guid:   507b:9ddd:7a17:0000
 vendor_id:   0x626d74
 vendor_part_id:   0
 hw_ver:    0x0
 phys_port_cnt:   1
  port: 1
   state:   PORT_ACTIVE (4)
   max_mtu:  1024 (3)
   active_mtu:  invalid MTU (0)
   sm_lid:   0
   port_lid:  0
   port_lmc:  0x00
   link_layer:  Ethernet
hca_id: siw_loop
 transport:   iWARP (1)
 fw_ver:    0.0.0
 node_guid:   0000:0000:0000:0000
 sys_image_guid:   0000:0000:0000:0000
 vendor_id:   0x626d74
 vendor_part_id:   0
 hw_ver:    0x0
 phys_port_cnt:   1
  port: 1
   state:   PORT_ACTIVE (4)
   max_mtu:  4096 (5)
   active_mtu:  invalid MTU (0)
   sm_lid:   0
   port_lid:  0
   port_lmc:  0x00
   link_layer:  Ethernet

Testing with RPING: 

Now we simply test the setup with rping : 

In one shell : 

rping -s -a <serverIP> 

In the other : 

rping -c -a <serverIP> -v 
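With -v the client prints the ping payload for each iteration, so you should see lines like the following (illustrative only; the pattern comes from rping's built-in test data) : 

ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnop
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopq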


And you should see the rping working successfully! 

You are now all set to use RDMA without the need for expensive hardware. 
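If you want to verify the setup from code as well, here is a minimal sketch in C using libibverbs (the file name and build line are mine, not part of any standard tooling) : 

/* list_rdma_devs.c : sanity-check that the siw devices are visible
 * through the verbs API.
 * Build with : gcc list_rdma_devs.c -o list_rdma_devs -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;

    /* Ask the verbs layer for the list of registered RDMA devices. */
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num_devices; i++)
        printf("found RDMA device: %s\n", ibv_get_device_name(devs[i]));

    ibv_free_device_list(devs);
    return 0;
}

On the machine configured above, it should print siw0 and siw_loop, the same devices reported by ibv_devices.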


Thursday, July 02, 2020

[Links of the Day] 02/07/2020 : Database query optimization, Deep Learning Anomaly detection survey, Large scale packet capture system

  • event-reduce : accelerate query results after a write. Basically, it caches the writes and recalculates the new query result from the past query result and the recent write events. The authors observe up to 12x faster display of new query results after a write occurred.
  • Deep Learning for Anomaly Detection: A Survey : comprehensive survey of anomaly detection techniques out there. 
  • Moloch : Large scale, open-source, indexed packet capture and search.



Tuesday, June 30, 2020

Data is the new oil fueling Machine learning adoption but Businesses are discovering #AI is no silver bullet

Data is the new oil. However, unlike oil, data scarcity is becoming less of a problem, while processing costs are skyrocketing. The business world is waking up to the fact that, while the cost of computing keeps getting cheaper all the time, the cost of training machine learning models is outpacing the drop in compute cost.

Moreover, businesses are finding it challenging to adopt #ai, and The Economist's reported numbers show how often #machinelearning projects fail in the real business world :
  • Seven out of ten said their #ai projects had generated little impact so far.
  • Two-fifths of those with “significant investments” in ai had yet to report any benefits at all.
Companies are finding that #machinelearning is not the promised silver bullet. Non-tech companies are discovering what tech companies had to learn the hard way: that they are no Google, Facebook, ...

To successfully deploy an AI/ML/DL project you need: a vast amount of data, skilled employees, solid engineering practices, access to infrastructure and, last but not least, a clear understanding of the business problem.

I have a false hope that corporations will abandon the silver-bullet thinking, but I would settle for avoiding another #ai winter cycle.






[Links of the Day] 30/06/2020 : Homomorphic encryption for Machine Learning, Neural Network on Silicon, Python Graph visualization library

  • PySyft : Python framework for homomorphic encryption for machine learning. It allows you to train models on encrypted data without the need to decrypt them. It's 40x slower than normal methods, but it means you don't have to deal with the new EU regulation on AI. 
  • Neural Networks on Silicon : a collection of papers and works on the Neural Networks on Silicon topic.
  • Pygraphistry : Python visual graph analytics library to extract, transform, and load big graphs into Graphistry. 




Thursday, June 25, 2020

[Links of the Day] 25/06/2020 : Architecture decision record, Database stress test, Rust Network Function Framework

  • Architecture decision record : Methods and tools for capturing software design choices that address a functional or non-functional requirement that is architecturally significant. [template]
  • pstress : Percona database concurrency and crash recovery testing tool.
  • capsule : a framework for network function development. If you want to do fast packet processing in a memory-safe programming language (Rust), this is for you.

Tuesday, June 23, 2020

The rise of Domain Specific Accelerators

Two recent articles indicate a certain pick-up in the adoption of domain-specific accelerators. With the end of Moore's Law, domain-specific hardware remains one of the few paths to continuing to increase the performance and efficiency of computing hardware.


For a long time, domain-specific accelerator adoption was limited by economic factors. Historically, small feature sizes, small batch sizes, and the high cost of fab time (for ASICs) translated into a prohibitive per-unit cost.
However, the economic factors have shifted :

  • a move toward standardised open-source tooling,
  • more flexible licensing models,
  • the RISC-V architecture coming of age and maturing rapidly,
  • dropping fab costs,
  • wide availability of FPGAs (e.g. AWS F1),
  • the rise of co-designed high-level programming languages reducing the learning curve and design cycle,
  • the power/performance wall of general-purpose compute units.

We are about to see a dramatic shift toward heterogeneous compute infrastructure over the next couple of years.

[Links of the Day] 23/06/2020 : Thinking while moving, AI snake oil, Graph Database

  • Thinking While Moving : too often, current machine learning systems are used in a rigid control loop, leading to saccades. The authors of this paper propose concurrent execution of the inference system and the controlled system, allowing more fluid operations and shorter execution time of the task.
  • AI snake oil : a lot of AI projects fail to return their initial investment. Too many buzzwords and not enough understanding of the limits of the current technology. At least NVIDIA is selling GPUs by the millions. When there is a gold rush, the ones making a fortune are the ones selling shovels.
  • TerminusDB : in-memory graph database management system. It's interesting to see that 99% of the source code is Prolog, and that they use JSON-LD as the definition, storage and interchange format for the query language. The original use case for this solution targeted financial data stored as time series but lacking graph correlation.


Monday, June 22, 2020

Stop throwing GPU at HPC, Not all scientific problems are compute dense

The current race to exascale has put a heavy emphasis on GPU-based acceleration, to the detriment of other HPC architectures. However, the Crossroads and Fugaku supercomputers are demonstrating that it is not all about GPUs.



The vast majority of the (pre-)exascale machines are relying heavily on GPU acceleration targeting scientific problems that can be cast as dense matrix-matrix multiplication problems.

However, there is a large number of scientific problems that are not compute dense, and GPU architectures are ill-equipped to accelerate them. Sadly, the current trends seem to have relegated these types of scientific challenges to second-class citizens in the HPC world. If you look at extreme-scale graph problems, for example, the Graph500 benchmark clearly shows that this type of problem has been orphaned: four of the top ten systems are more than seven years old and nearing their end of life. Moreover, the newer systems show marginal progress toward accelerating extreme-scale graph traversal. 

I understand that the current machine learning hype heavily influences the HPC ecosystem. However, we have to remind ourselves that there is life beyond FLOPS. The Fugaku and Crossroads systems demonstrate that it is possible to achieve strategic compute leadership without sacrificing the architecture on the altar of the exaflop compute-dense gods. 



Japan's latest ARM-based Fugaku supercomputer demonstrates that it can address both the compute-dense, GPU-optimised problems and those that are not reducible to dense linear algebra and are therefore incompatible with GPU technologies. The Japanese supercomputer, built around the ARM v8.2-A A64FX CPU, just picked up the number one spot in both the HPC Top500 Green benchmark and the Graph500 BFS benchmark.

Hopefully, this will be a wake-up call for the HPC community to properly fund R&D efforts orthogonal to the compute-dense, exaflops-benchmark-friendly architectures.


Update 22/06/2020 : after this article was published, Fugaku was ranked Nb 1 on the Top500 Linpack benchmark with close to half an exaflop (415 petaflops)! Fugaku is now pretty much topping every single HPC ranking :

  • Top500 : Nb 1.
  • Top500-Green : Nb 1.
  • HPCG : Nb 1.
  • HPL-AI: Nb 1.
  • Graph500 : Nb 1.







Friday, June 19, 2020

With enough data and/or fine tuning, simpler models are as good as more complex models

This is an age-old issue that seems to repeat itself in every field. A couple of recent papers have been published criticising the race to beat SOTA.

This recent paper demonstrates that older and simpler models perform as well as newer models, as long as they get enough data to train on.

This has some interesting implications for production systems: if you already have a good enough model, throwing more data at it can help achieve close-to-SOTA results.
This means you won't have to build a new model from scratch to keep up with SOTA in your production system. You just need to collect more data as the system runs and retrain your model once in a while.
Also, less complex models tend to have shorter inference times in production, which is a crucial factor that gets impacted by model complexity.







In another recent paper, the authors look at metric learning papers from the past four years and demonstrate that their performance claims over older methods (often more than double) are mainly due to a lack of tuning.
Most of the time, the authors of the SOTA-beating algorithm fine-tune their own algorithm on the test set while comparing against the older SOTA algorithms with off-the-shelf tuning.






"Our results show that when hyperparameters are properly tuned via cross-validation, most methods perform similarly to one another"

"...this brings into question the results of other cutting edge papers not covered in our experiments. It also raises doubts about the value of the hand-wavy theoretical explanations in metric learning papers."
This happens time and time again across industry and academia: performance benchmarks of Intel vs AMD CPUs, Nvidia vs ATI GPUs, network, storage, etc.
This can be due to a lack of knowledge, time, integrity, etc.

To conclude: be careful, the latest shiny model might not be the best one for your production system. If you spend enough time and data on older models, you might achieve the same performance at a lower inference cost.
Obviously, this assumes that you already follow best practices when it comes to model monitoring in production :)







Thursday, June 11, 2020

[Links of the Day] 11/06/2020 : Metric Time-Series Database, Machine Learning for metrics, Causal Time series Analysis

  • Victoria Metrics : fast, cost-effective and scalable time-series database. If you need a backend for Prometheus, for example, this is the DB for you.
  • Sieve : a platform to derive actionable insights from monitored metrics in distributed systems. The platform is composed of two separate systems: one geared toward trace reduction and selection, with intelligent sampling using a form of zero-positive learning, and a second that extracts correlations between the services generating the traces.
  • Tigramite : causal time series analysis Python package. It allows you to efficiently reconstruct causal graphs from high-dimensional time-series datasets and model the obtained causal dependencies for causal mediation and prediction analyses. [github]



Tuesday, June 09, 2020

[Links of the Day] 09/06/2020 : #WASM on #K8S , Fast Anomaly Detection on Graphs, Linux one ring

  • Krustlet : it seems that WebAssembly is getting more pervasive: we have kernel WASM and WASM for deep learning, and now Krustlet offers WASM on Kubernetes via a kubelet.
  • Fast Anomaly Detection in Graphs : really cool real-time anomaly detection on dynamic graphs. The authors claim to be 644 times faster than SOTA with 42-48% higher accuracy. What is even more attractive is the constant memory usage, which is fantastic for production deployment. [github]
  • io_uring : this will dominate the future of the Linux interface. It is currently eating up every single IO interface and probably won't stop there. 


Saturday, June 06, 2020

Yet another Red Queen Project : Franco-German Gaia-X

For some reason, the EU and especially the French government love moonshot projects. The only problem is that they tend to be launched after the moon has already been colonized.

Gaia-X is not a moonshot, but a Red Queen project. I use this term in reference to the Red Queen hypothesis or Red Queen effect, which is derived from Lewis Carroll's Through the Looking-Glass :
Now, here, you see, it takes all the running you can do, to keep in the same place.

Gaia-X is a Red Queen project because the French and German governments (and the EU to some extent) are trying to forcefully evolve the digital ecosystem just to stay in the same place. Also because they always launch these initiatives way too late and without any long-term strategic planning, both in terms of funding and impact. 







Let's look at Gaia-X and why there is an air of "déjà vu". First, it's not a cloud service; it's a "platform" aggregating cloud hosting services from dozens of companies. Does that remind you of anything? Bingo: the European Cloud Initiative, which aimed to : 
"Strengthen Europe's position in data-driven innovation, improve its competitiveness and cohesion, and help create a Digital Single Market in Europe."

This initiative started back in 2012; at the time, I didn't get the strategy and structure of the effort. And unfortunately, I still don't. The EU wanted to regulate and impose EU standards on the industry, hoping to spur the EU cloud ecosystem via standards and funding sprinkling. I use the term sprinkling because the EU thought that by seeding a constellation of research projects and local initiatives it would magically help sprout an EU cloud giant.

The "standard effort" side of the program seems to have fizzled out. Judge by yourself: The official final report is here.

Gaia-X seems to be an offshoot of the European Cloud Partnership side of the cloud initiative, aiming at increasing trust when using cloud services : 
"it's (Gaia-X) conceived as a platform joining up cloud-hosting services from dozens of companies, allowing businesses to move their data freely with all information protected under Europe's tough data processing rules."

Compounding the regulatory compliance spin, the project promoters cannot refrain from using the vendor lock-in FUD : 
"One important concept underpinning Gaia-X is "reversibility", a principle that would allow users to switch providers quickly. " 

They conveniently forgot to mention that by using Gaia-X, you will be trading provider lock-in for platform lock-in. 

If you dig a little into the technical side, you find out that this reads more like a program to keep academic research institutes busy: it rehashes fantasies of dynamically matching service providers to consumers and policies. Dynamic matching was a hot topic in academia during the SOAP era, but it isn't used at all in practice. Moreover, it doesn't use any established logic programming paradigm and re-invents an ad-hoc service ontology/taxonomy and query language.




Last but not least, one of the glaring omissions from the platform is the complete lack of specification regarding common accounting, payment and monetization of services. Where is the processing and payment service? It is conveniently absent.




Providing an accounting and payment platform for dynamically orchestrated services from a multitude of providers is not only hard, it's near impossible. Without this crucial element, the platform is stillborn.




If France and Germany want to avoid turning Gaia-X into another Qwant, maybe they should pivot the platform to a more niche domain, such as a cloud services procurement platform for governments and large companies. This would fit right in with the compliance, sovereignty and interoperability narrative, as well as the business profile of most of the consortium participants.

Thursday, June 04, 2020

[Links of the Day] 04/06/2020 : XoR filters, SIMD + Json , Online tracking and publisher's revenues

  • Xor Filters : Xor filters are great, as they provide a faster and smaller alternative to Bloom or cuckoo filters. However, there are some key differences: Xor filters require all the members of the set to be provided upfront, Bloom filters allow adding members but not removing them, and cuckoo filters allow removing members. So just pick what's best for you.
  • SimdJson : nice performance leveraging CPU SIMD features. However, the lack of support for null entries feels like cheating (and probably crashes with the most common real-life payloads).
  • Online Tracking and Publishers' Revenues : the authors demonstrate that the use of cookies only represents a 4% increase in revenue, versus non-cookie ads, for an advertiser. This calls into question the differential benefit between ad publishers like Google and Facebook and the advertisers, and raises the question of why advertisers should pay for a loss of privacy that only benefits their platform provider.



Tuesday, June 02, 2020

[Links of the Day] 02/06/2020 : Real time network topology, Detecting node failure using graph entropy, Monitoring machine learning in production

  • Skydive : open source real-time network topology and protocols analyzer providing a comprehensive way of understanding what is happening in your network infrastructure.
  • Vertex : the authors propose to use vertex entropy for detecting and understanding node failures in a network. By understanding the entropy in a graph they are able to circumvent the lack of locality in the information available and pinpoint critical nodes. 
  • Monitoring Machine Learning Models in Production :  Once you have deployed your machine learning model to production it rapidly becomes apparent that the work is not over.


Thursday, May 28, 2020

[Links of the Day] 28/05/2020 : Reverse OAuth proxy, Prometheus time-series backend, Google uses machine learning to improve audio chat

  • Oauth Proxy : A reverse proxy that provides authentication with Google, Github or other providers.
  • Zebrium : Prometheus backend project. I am not sure why you wouldn't just export the data from Prometheus into a distributed column-store data warehouse like ClickHouse, MemSQL or Vertica. That gives you fast SQL analysis across massive datasets, real-time updates regardless of order, and unlimited metadata, cardinality and overall flexibility. Maybe because they want to focus on the monitoring/reactive aspect and less on the analytics.
  • Improving Audio Quality with WaveNetEQ : Google uses machine learning to deal with packet loss, jitter, and delays. An interesting bit of info: " 99% of Google Duo calls need to deal with packet losses, excessive jitter or network delays. Of those calls, 20% lose more than 3% of the total audio duration due to network issues, and 10% of calls lose more than 8%."


Wednesday, May 27, 2020

[Links of the Day] 27/05/2020 : Combining Knowledge graphs, Real World storage resource management, Jepsen black-box transactional safety checker

  • Combining knowledge graphs : The authors describe a new entity alignment technique that factors in information about the graph in the vicinity of the entity name. It provides a 10% higher accuracy while reducing computational cost for model generation.
  • Wizard : project looking into real-world storage reliability for cost-effective data and storage resource management system for reliability enhancement.
  • Elle : Jepsen black-box transactional safety checker based on cycle detection. You can find more in the arXiv paper by Kyle Kingsbury and Peter Alvaro: "Elle: Inferring Isolation Anomalies from Experimental Observations".



Tuesday, May 19, 2020

[Links of the Day] 19/05/2020 : Fat Tails, AutoML Zero , Version Control for GIS

  • Statistical Consequences of Fat Tails : Nassim Nicholas Taleb's book investigates the misapplication of conventional statistical techniques to fat-tailed distributions and looks for remedies.
  • AutoML Zero : aims to automatically discover computer programs that can solve machine learning tasks, starting from empty or random programs and using only basic math operations. The goal is to simultaneously search for all aspects of an ML algorithm—including the model structure and the learning strategy—while employing minimal human bias.
  • Sno : Distributed version-control for geospatial and tabular data


Thursday, May 14, 2020

[Links of the Day] 14/05/2020 : Wasm In Linux Kernel, Knowledge Graphs, Contrastive Machine learning Model for Software Performance Regressions

  • Kernel Wasm : looks like people want to run WASM everywhere. This time the authors propose to run WASM programs in the kernel. I just wonder if it would not be more judicious to try running WASM in eBPF. From the GitHub repo, it seems that they might actually try to do the opposite. [github]
  • Knowledge Graphs : A comprehensive introduction to knowledge graphs. If you want to learn more about the knowledge graph I would recommend reading the following paper before reading the arxiv one.
  • A Zero-Positive Learning Approach for Diagnosing Software Performance Regressions : really innovative approach by the Intel folks. I have started to see an interesting trend in machine learning where, instead of trying to train the ML model on a dataset that contains the whole spectrum of possibilities, the authors use contrastive methods. In this case, the ML model is trained on a non-abnormal dataset in order to identify abnormal behaviour. In performance evaluation, it is much easier to obtain ideal or standard metrics than abnormal scenarios, so the authors use ideal hardware performance counters to train their model to identify abnormal behaviour. [poster]


Tuesday, May 12, 2020

[Links of the Day] 12/05/2020 : Learning From Unlabeled Data, Fast Dataset Classifier, Azure Bad Rollout guardian

  • Learning From Unlabeled Data : slide deck of a talk by Thang Luong of Google Research. Thang presents novel methods for learning from unlabeled data, more specifically semi-supervised learning methods. These methods were used to generate Google's Meena chatbot model.
  • Flying Squid : looks like a super-fast Snorkel with even better performance. Like Snorkel, this is used to quickly build classifiers for datasets that would otherwise be extremely time-consuming (and expensive) to label by hand for training purposes.
  • Gandalf : Azure machine learning system trained to catch bad rollouts. The aim of this system is to catch bad deployments before they can have ripple effects across the whole system.


Thursday, May 07, 2020

[Links of the Day] 07/05/2020 : Startup tactical manuals, AutoML pipeline, Thread Caching Malloc

  • Tactical manuals and guides for startups : an awesome collection of strategic posts, essays and documents for startups. While these are great resources, they don't replace experience.
  • AutoML Pipeline : the power of Julia meets machine learning. However, beware: just feeding data into a system and hoping the best result comes out without any effort is doomed to deliver sub-optimal results. Often you end up with an OK-ish solution that blows up in production down the line.
  • Tcmalloc : Google Thread Caching Malloc


Tuesday, May 05, 2020

[Links of the Day] 05/05/2020: cached Compilation, DeepLearning optimization library, Nuclear Matters Handbook

  • umake : no more waiting on compilation; this tool offers fast builds with cached compilation.
  • Deepspeed : a deep learning optimization library. The authors claim some amazing gains over the standard library. The nice thing is that it reuses the PyTorch API, which makes it easy to use. [github]
  • Nuclear Matters Handbook : ever wanted to know how the US handles its nuclear deterrent and nuclear matters? Look no further and read this book. It provides an overview of the U.S. nuclear enterprise and how the United States maintains a safe, secure, and effective nuclear deterrent.


Tuesday, April 28, 2020

[Links of the Day] 28/04/2020 : Distributed Time Series Database, Data lakes, Translate data between format

  • Modern data lakes : if you think you need a data lake, you probably don't need one and are better off using S3/Athena or GCP/BigQuery. If you know you want a data lake, you might be mature enough to need one and should read this article.
  • M3DB : Distributed Time Series database from Uber, it tries to address horizontal scaling of storage and queries or long term storage limitation of existing solutions.
  • ConfBase : a practical tool for inferring and instantiating schemas and translating between data formats. The tool supports JSON, GraphQL, YAML, TOML, and XML. [github]


Thursday, April 23, 2020

[Links of the Day] 23/04/2020 : Machine Learning technical debt, Python and Bayesian Deep Learning perspective of generalization



Tuesday, April 21, 2020

[Links of the Day] 21/04/2020 : Machine Learning for Relational Query processing, Augmenting Language model with latent knowledge retriever, Computer Vision Recipes

  • Extending Relational Query Processing with ML Inference : the authors present advanced cross-optimizations between ML and DB operators in Raven. They demonstrate significant performance improvements: up to 5.5x from the native integration of ML in SQL Server, and up to 24x from cross-optimizations.
  • REALM: Retrieval-Augmented Language Model Pre-Training : the authors propose to leverage retrieval-augmented language model pre-training for the challenging task of open-domain question answering. By augmenting their model with a latent knowledge retriever, they are able to beat current SOTA models while limiting the growth in model size.
  • Computer Vision Recipes : Microsoft is releasing a lot of really good content; this time it's for computer vision. In this repository, you will find best practices, code samples, and documentation for computer vision.


Thursday, April 16, 2020

[Links of the Day] The future of Machine Learning is DBMS, Fun Exploring Explanations, High performance Regex

  • Hyperscan : high-performance multiple regex matching library
  • Cloudy with a chance of DBMS : A. Colier reviews the 10-year ML prediction paper. The TL;DR: models, models everywhere in enterprise databases. You have already seen a glimpse of what that means with BigQuery ML.
  • Explorables : awesome website explaining a lot of concepts through play. Lots of computer science stuff in there. 

Tuesday, April 14, 2020

[Links of the Day] 14/04/2020 : Time series dynamical attractors Autoencoder , Binarized Neural Network framework, Machine learning and Databases

  • Deep learning of dynamical attractors from time series measurements : the authors propose a general embedding technique for time series, consisting of an autoencoder trained with a novel latent-space loss function. Worth a look if you deal with time series.
  • larq : open-source Python library for training neural networks with extremely low-precision weights and activations, such as binarized neural networks. Basically, this framework targets embedded/FPGA/ASIC machine learning model deployment. A fantastic resource, with a great model zoo on top of that.
  • Cloudy with a chance of DBMS : databases are going to embed more and more machine learning capabilities. BigQuery from Google already does that, and it's just a question of time before most mainstream DBs offer ML services.

Thursday, April 09, 2020

[Links of the Day] 09/04/2020 : TRAX deep Learning library, The next decade in AI, 1:1 questions

  • The Next Decade in AI : Paper by Gary Marcus where he explores the possible future of AI over the next decade
  • 1 on 1 meeting questions : a collection of 1:1 questions; a great list that can help any manager pick the right question for the right context, as long as you are able to read the room/team/person.
  • Trax : advanced Google deep learning library built on top of JAX. It is actively used by the DeepMind team and aims for clear code while providing advanced models like the Reformer.


Tuesday, April 07, 2020

[Links of the Day] 07/04/2020 : Incentivizing Innovation, Network Performance analysis, Neural Networks for embedded systems

  • The Effects of Prize Structures on Innovative Performance : how do you incentivize innovation? The authors found that a winner-takes-all compensation scheme generates significantly more novel innovation than a scheme that offers the same total compensation shared across the ten best innovations. However, like every psychology paper, you have to take it with a grain of salt... reproducibility is always difficult.
  • nfstream : Python package providing fast, flexible, and expressive data structures designed to make working with online or offline network data both easy and intuitive. [github]
  • Neural Networks on embedded systems : a good overview of the challenges and available neural network architectures for running on embedded systems.


Thursday, April 02, 2020

[Links of the Day] 02/04/2020 : Grep all, HealthCare mobile data collection for machine learning, FastAI framework

  • ripgrep : grep search in PDFs, E-Books, Office documents, zip, tar.gz, etc.
  • pymedserver : a server framework for mobile data collection and machine learning in healthcare
  • fastai : fantastic machine learning library trying to abstract away a lot of PyTorch into simple APIs and building blocks. Sometimes it abstracts a bit too much, especially when you want to get into the murky details. But all in all, fastai is really a framework you want to look at if you are doing machine learning.


Tuesday, March 31, 2020

[Links of the Day] 31/03/2020 : Quantum Computing course, Risk quantifying library, Google time windowed availability metric


  • Meaningful Availability : Google folks propose a different interpretation of availability in this paper. The authors propose a new metric called "windowed user-uptime". The objective of this metric is to measure user-perceived uptime, combined with calculating availability over many windows, in order to distinguish transient from long periods of unavailability. 
  • riskquant :  Netflix Python library for quantifying risk 
  • Quantum Computation Course : approachable quantum computation course. 


Thursday, March 26, 2020

[Links of the Day] 26/03/2020 : Golang distributed in memory key value store, Datasciences github repo trove, Developer Road-maps

  • olric : distributed cache and in-memory key/value data store. It can be embedded as a Go library.
  • Pilsung Kang : a lot of really cool git repositories for machine learning and data science lectures, notes, code, etc. by Pilsung Kang of the School of Industrial Management Engineering, Korea University.
  • Developer Roadmaps : step-by-step guides and paths to learn different tools and technologies. Check out the DevOps one... you probably need two lifetimes to cover everything.


Tuesday, March 24, 2020

[Links of the Day] 24/03/2020 : Machine learning visualisation, debugging and project template

  • hiplot : Facebook lightweight interactive visualization toolkit, quite useful for discovering correlations and patterns in high dimensional data.
  • manifold : machine learning visual debugging tool by Uber [github]
  • Cookiecutter Data Science : love cookiecutter, and this is a great one for machine learning projects

Thursday, March 19, 2020

[Links of the Day] 19/03/2020 : Directed Acyclic Graph structure estimation, Groovy Linter, AI hierarchy of needs

  • DAGs with NO TEARS : NIPS 2018 paper that demonstrates a novel way to estimate the structure of directed acyclic graphs. Bonus points for code on GitHub! [arxiv]
  • groovyfmt : I wish I had known about this one a long time ago. All those Jenkinsfile errors and debugging sessions I could have avoided. Well, let's add it to my default list of linters to run with every job.
  • AI hierarchy of needs : neat representation of what is needed to deliver an AI project, and how much effort and information is required as you progress up the hierarchy.


Tuesday, March 17, 2020

[Links of the Day] 17/03/2020 : Machine Learning research Guide, Engineering Strategy, Contrastive Self Supervised Learning Techniques


Thursday, March 12, 2020

[Links of the Day] 12/03/2020 : Neuromodulation in Deep Neural Networks, AWS landingzones as code, 100 days of Machine learning

  • Introducing Neuromodulation in Deep Neural Networks to Learn Adaptive Behaviours : the authors of this paper propose to leverage cellular neuromodulation, the biological mechanism that dynamically controls the intrinsic properties of neurons and their response to external stimuli in a context-dependent manner, in order to construct a new deep neural network architecture specifically designed to learn adaptive behaviours. They demonstrate that their solution is able to adapt to changes in the environment, as well as providing more flexibility during the lifespan of their model.
  • AwsOrganizationFormation : alternative to AWS Landing Zones. The advantage of this solution is that you can manage your organisation as code, which in the long run will simplify your life when it comes to updating and maintaining those resources.
  • 100 Days of ML : this repository captures Hithesh's journey to learn machine learning in 100 days.


Tuesday, March 10, 2020

[Links of the Day] 10/03/2020 : NLP models platform for elasticsearch, Encrypted Tensor flow framework, Reformer transformer machine learning model

  • nboost : scalable, search-api-boosting platform for deploying transformer models to improve the relevance of search results on different platforms (i.e. Elasticsearch)
  • tf-encrypted : encrypted TensorFlow. This allows you to work on encrypted datasets when generating models. It's a privacy(??)-preserving machine learning framework. [github]
  • reformer : while most transformers are limited to a small number of tokens (512, maybe more), Google folks came up with a new architecture called the Reformer that leverages locality-sensitive hashing to blast past this limitation and allows handling context windows of up to 1 million words, all on a single accelerator and using only 16GB of memory. [arxiv]


Friday, March 06, 2020

Google Cloud GKE control plane price introduction: the tragedy of the commons or bait and switch?

GCP recently announced that on June 6, 2020, Google Kubernetes Engine (GKE) clusters will start accruing a management fee. The fee is $0.10 per cluster per hour, which amounts to roughly $73/month. Everybody using GKE hit the roof, as it was seen as yet another flakiness episode from the chocolate factory. A lot of existing customers see themselves trapped by a bait-and-switch tactic from Google; however, the story is a little more complex than it appears.
It seems that Google made a series of mistakes in its rush to attract enterprise customers with its Kubernetes offering.



The first mistake was to use the free control plane as a "loss leader". GKE provides the control plane and cluster management so that its customers don't have to, and in exchange Google sells more compute, storage, network, and app services. 
Charging versus not charging for the control plane leads to two very different Kubernetes architectures. Small, single-application clusters are simpler to set up and operate. With the free control plane, customers embraced this approach, as they didn't need to architect their cloud infrastructure in a multi-tenant fashion.
Moreover, as per Google's docs, decisions made at the start are very much set in stone: customers cannot change a regional cluster into a single-zone cluster, for example. So Google has customers who built their stacks around Google's free control plane, and GCP is now turning the screws by adding a cost for it, while those customers cannot change the type of their cluster to optimise their spend. Hence the feeling of entrapment that a lot of existing clients have at the moment.

The second key metric they missed when announcing the free control plane is that the majority of Kubernetes deployments tend to have a single application-to-cluster mapping. It would have been normal to assume that most potential customers would start with small, single-app cluster deployments, since they didn't face the natural inertia brought by the cost of running the control plane. 

Third, as per their own metrics, they discovered that customers will use and abuse free resources. It's the tragedy of the commons, where all those empty clusters cost Google money.
Obviously, Google hoped that their customers would apply its best practices and deploy multi-tenant clusters. But multi-tenant clusters are harder to manage, deploy and maintain, and no amount of "best practice" documentation will solve this. It is not that simple, and not every company is a hyper-scale corporation like Netflix et al. Engineering is about balancing cost and benefit. Often the best practice is to have many clusters, for a variety of reasons such as: "Create one cluster per project to reduce the risk of project-level configurations". And companies are OK with the waste as long as their software and deployment practices can treat any hosted Kubernetes service as essentially the same. Corporations often accept waste as part of the inherent cost of not rearchitecting their processes and culture: it is often more efficient to simplify the infrastructure complexity, albeit at extra cost, than to re-architect the company's IT structure to embrace the latest best practice. It's a simple cost/risk/ROI analysis. They are even more OK with waste when Google foots part of the bill with its free control plane. 




In the end, the folks at Google Cloud fell between a rock and a hard place. They fell into the trap of the tragedy of the commons, hoping that all their customers would run their operations like Google, and are now trying to recoup those extra $$ by introducing a charge for the control plane. By doing so, Google effectively added a tax on running Kubernetes clusters on GCP. This is perceived as an ex-Oracle way of thinking: "what can we do to meet growth objectives", "how can we tax the people who we own".
To some extent, this is Google's equivalent of a carbon or petrol tax. Customers now need to rethink their strategy, i.e. adopt public transport (multi-tenant clusters) or move to electric (Cloud Run). Some might move away from GCP completely because of the perceived lack of stability of the offering, both in terms of services and pricing.