Sunday, November 22, 2020

HPC ecosystem - SC20

 This article is a quick summary of SC20 trends and the current state of the HPC ecosystem from a tech and market perspective.

Technology-wise there are three main competing HPC architectures: 

  • Commodity (e.g. Intel)
  • Commodity + accelerator (e.g. GPUs)
  • Lightweight cores (e.g. IBM BG, Xeon Phi, TaihuLight, ARM)

Commodity systems represent the bulk of the systems out there. However, commodity + accelerator systems are ramping up their presence aggressively. Nvidia dominates this market segment with 142 systems out of 149, with Intel scooping 4 with its Phi solution. Lightweight-core systems are a minority with only four systems. But with the new A64FX and a renewed appetite for custom chips, this might change rapidly.

Intel is still dominating the ecosystem with a 92% share, followed by AMD with 4%. However, this might change rapidly with AMD EHP technology ramping up. Another aspect is that AMD technology tends to be more open-source friendly, which can make it more attractive in the long term. Not to mention that their GPUs are also starting to become highly competitive in the AI space.

In terms of market size, the HPC market was $39.0 billion in 2019, up 8.2% from $36.1 billion in 2018. Predictions show growth to $55.0 billion in 2024. Most of the growth was led by government spending after six years of growth led by industry. The number of systems in industry vs public is now roughly equally divided, with ~50% each.
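As a rough sanity check on these figures, the yearly growth rate implied by the forecast can be worked out with a few lines of shell. This is only a sketch: the helper name `cagr` is made up for illustration, and it assumes the forecast compounds evenly over the five years from 2019 to 2024.

```shell
# Hypothetical helper: compound annual growth rate, in percent.
# Usage: cagr START_VALUE END_VALUE YEARS
cagr() {
    awk -v s="$1" -v e="$2" -v y="$3" \
        'BEGIN { printf "%.1f\n", (exp(log(e / s) / y) - 1) * 100 }'
}

# Figures from the text: $39.0 billion in 2019, $55.0 billion forecast for 2024.
cagr 39.0 55.0 5
```

With the numbers above this works out to roughly 7% per year.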

One notable change is the double-digit growth of the cloud HPC market. Cloud grew 17.8% to $1.4 billion; however, this might only be the tip of the iceberg, as many companies might be using HPC-like systems in the cloud without labelling them as HPC. Cloud solutions are heavily displacing the low-end HPC segment. Entry-level and mid-range server classes have the slowest growth in years, as consumers prefer to buy HPC-as-a-service solutions and reduce their CAPEX.

AI is still heavily influencing the HPC infrastructure market as it represents a considerable opportunity for HPC solution vendors. Hyperscale AI infrastructure by itself is about $8 billion. It seems that, for the moment, the futures of AI and HPC are closely intertwined.

Sources: Intersect360 research - Pre-SC20 Market Update & Jack Dongarra - An overview of HPC

Thursday, November 05, 2020

The real motivation behind the Matrix engine (GPU/TPU/...) adoption in HPC

 There is currently a backlash in the HPC community against the adoption of GPU/TPU/... accelerators, aka matrix engines. Most of the arguments are driven by performance, efficiency and real HPC workloads.

In a recent paper, Jens Domke et al., colleagues of Dr Matsuoka at RIKEN and AIST in Japan, explore whether the inclusion of specialized matrix engines in general-purpose processors is genuinely motivated and merited, or whether the silicon is better invested in other parts.

I wrote before that a lot of new HPC systems overuse matrix engine hardware in their architecture. In this paper, the authors looked at the broad usefulness of matrix engines. They found that only a small fraction of real-world applications use and benefit from accelerated dense matrix multiplication operations. Moreover, when combining HPC and ML, or when you try to accelerate traditional HPC applications, the inference computation is very lightweight compared to the heavyweight HPC compute.

While I agree with the arguments put forward, some other aspects that go beyond HPC need to be taken into consideration to explain why there is such a push for matrix engine adoption. And these aspects are mainly market-driven. If you compare markets, there is significantly more money in the "hyped", only years-old AI market (training + inference) than in the "mature", 30-year-old HPC market.

In raw numbers, the HPC market is worth $39 billion. In comparison, the AI market is worth $256 billion in hardware alone. If you focus on AI semiconductors only, it is still $32 billion! And the growth projections are not in favour of HPC.

If you then look at the N^4 computing complexity for AI vs at best N^3 for HPC, or at where those AI systems are going compared to HPC systems (institutions, companies, systems/individuals such as in cars, wearables, medical appliances, etc.), you quickly understand the significant difference in potential between the two markets.

If you take the ROI of AI-related business into consideration, it makes more sense why HPC institutes are investing in this type of hardware. Such investment will allow them to tap into a promising and fast-expanding market. The matrix engine movement is simply a market-driven investment to ensure the best ROI for HPC centres.

Sunday, November 01, 2020

ARM ecosystem disintegration and the rise of RISC-V

The #ARM acquisition by #Nvidia is making people uneasy.

And early signs of the unravelling of the #ARM ecosystem are starting to appear: the ThunderX3 general-purpose ARM CPU has been cancelled.

One would ask why a vendor would spend $$ to build a better product and grow its customer base if, to do so, it has to use Nvidia IP and compete directly against the IP owner.

Combine this with the difficulty of building a viable general-purpose #ARM alternative to #Intel / #AMD, as #ARM vendors are effectively competing on cost with much lower volumes, and we start to understand why Marvell decided to shift toward the much trendier IPU/DPU/SmartNIC market.

On the other hand, I think we will see an acceleration of RISC-V adoption, eating away at the traditional #ARM market share. This will be driven by the large-scale edge deployment of #riscv chips combining a RISC-V core and an #NPU (neural processing unit). These chips can be churned out at an incredibly low cost, less than $10, and they will become ubiquitous very rapidly.

It might take 10-15 years, but ultimately this will seal the fate of the ARM franchise.

Saturday, October 17, 2020

CryptMPI: A Fast Encrypted MPI Library

 As more #HPC applications move to cloud infrastructure, securing and protecting sensitive HPC data in such an environment becomes critical.

But HPC solutions tend to fall short when it comes to security. Security features tend to be perceived as detrimental to the performance of the applications.

For example, encrypted communication has always been seen as incurring very significant overheads when you are aiming for microsecond latency.

The authors of the CryptMPI paper demonstrate that you can ensure the privacy and integrity of sensitive data with minimal performance degradation using an enhanced MPI library.

I hope to see this kind of feature integrated as standard in a future version of MPI.



Wednesday, July 08, 2020

Software RDMA revisited : setting up SoftiWARP on Ubuntu 20.04

Almost ten years ago I wrote about installing SoftiWARP on Ubuntu 10.04. Today I will be revisiting the process. First, what is SoftiWARP? SoftiWARP is a software-based iWARP stack that runs at reasonable performance levels and fits seamlessly into the OFA RDMA environment. It is a software RDMA device that attaches to the active network cards to enable RDMA programming. For anyone starting with RDMA programming, RDMA-enabled hardware might not be at hand. SoftiWARP is a very useful tool for setting up an RDMA environment to code and experiment with.

To install SoftiWARP you have to go through four stages: setting up the environment, building SoftiWARP, configuring SoftiWARP, and testing.

Setting up RDMA environment

Before you start you should prepare the environment for building a kernel module and userspace library.
Basic building environment

sudo apt-get install build-essential libelf-dev cmake

Installing userspace libraries and tools

sudo apt-get install libibverbs1 libibverbs-dev librdmacm1 \
librdmacm-dev rdmacm-utils ibverbs-utils

Insert common RDMA kernel modules

sudo modprobe ib_core
sudo modprobe rdma_ucm

Check if everything is correctly installed : 

sudo lsmod | grep rdma 

You should see something like this : 

rdma_ucm               28672  0
ib_uverbs             126976  1 rdma_ucm
rdma_cm                61440  1 rdma_ucm
iw_cm                  49152  1 rdma_cm
ib_cm                  57344  1 rdma_cm
ib_core               311296  5 rdma_cm,iw_cm,rdma_ucm,ib_uverbs,ib_cm
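The same check can be scripted so that it only reports what is missing. The following is a minimal sketch; `missing_modules` is a hypothetical helper name, and it assumes that the first column of the lsmod output is the module name, as in the listing above.

```shell
# Hypothetical helper: reads `lsmod` output on stdin and prints every
# module named on the command line that is NOT currently loaded.
missing_modules() {
    loaded=$(awk '{ print $1 }')   # keep only the module-name column
    for mod in "$@"; do
        printf '%s\n' "$loaded" | grep -qx "$mod" || echo "$mod"
    done
}

# Usage: prints nothing when every listed module is present.
# lsmod | missing_modules ib_core rdma_ucm rdma_cm iw_cm
```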

Now install the dependencies needed to build the userspace libs : 

sudo apt-get install build-essential cmake gcc libudev-dev libnl-3-dev \
libnl-route-3-dev ninja-build pkg-config valgrind

Installing SoftiWARP

Ten years ago you had to clone the SoftiWARP source code and build it. Now you are lucky: it is included by default in Linux kernel 5.3 and above!

You just have to type : 

sudo modprobe siw

Verify it works : 

sudo lsmod | grep siw

You should see : 
siw                   188416  0
ib_core               311296  6 rdma_cm,iw_cm,rdma_ucm,ib_uverbs,siw,ib_cm
libcrc32c              16384  3 nf_conntrack,nf_nat,siw

Moreover, you should check that an InfiniBand device is present : 

ls /dev/infiniband 

Result : 


You also need the file /etc/udev/rules.d/90-ib.rules, containing the entries below : 

 ####  /etc/udev/rules.d/90-ib.rules  ####
 KERNEL=="umad*", NAME="infiniband/%k"
 KERNEL=="issm*", NAME="infiniband/%k"
 KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
 KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
 KERNEL=="uat", NAME="infiniband/%k", MODE="0666"
 KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
 KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"

If it doesn't exist you need to create it.

I would also suggest adding the modules to the list of modules loaded at boot, by adding them to the /etc/modules file.

You now need to reboot your system.
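One way to do this without duplicating entries is sketched below. `add_boot_modules` is a hypothetical helper name, and the module list is an assumption based on the modules loaded by hand earlier in this post (ib_core, rdma_ucm, siw).

```shell
# Hypothetical helper: append each module name to a modules file,
# skipping names that are already listed (safe to run repeatedly).
add_boot_modules() {
    file="$1"; shift
    for mod in "$@"; do
        grep -qx "$mod" "$file" 2>/dev/null || echo "$mod" >> "$file"
    done
}

# Usage (as root, since /etc/modules is not writable by normal users):
# add_boot_modules /etc/modules ib_core rdma_ucm siw
```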

Userspace library

Normally, recent libraries support SoftiWARP out of the box. But if you want to compile your own version, follow the steps below. However, do this at your own risk... I recommend sticking with the standard libs.

Optional: build the SIW userland libraries : 

All the userspace libraries are in a nice single repository. You just have to clone the repo and build all the shared libraries. If you want, you can also just build libsiw, but it's easier to build everything at once. 

git clone
cd ./softiwarp-user-for-linux-rdma/

Now we have to set up the $LD_LIBRARY_PATH so that the built libraries can be found. 
cd ./softiwarp-user-for-linux-rdma/build/lib/
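
A common way to do this, shown here as a sketch and not necessarily the exact line used for this setup, is to prepend the build output directory to the variable. It assumes the shell is currently inside the build/lib directory from the previous step; `SIW_LIB_DIR` is just an illustrative variable name.

```shell
# Assumption: run from inside the build/lib directory of the cloned repo.
SIW_LIB_DIR="$PWD"

# Prepend the directory, keeping any existing value and avoiding a trailing ':'.
export LD_LIBRARY_PATH="$SIW_LIB_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

For a permanent setup, the export line needs an absolute path in place of $PWD.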

or you can add the line in your .bashrc profile:

End of optional section

Setup the SIW interface : 

Now we will be setting up the loopback and a standard eth interface as RDMA devices:

sudo rdma link add <NAME OF SIW DEVICE > type siw netdev <NAME OF THE INTERFACE>

In this case for me : 

sudo rdma link add siw0 type siw netdev enp0s31f6
sudo rdma link add siw_loop type siw netdev lo

You can check that the two devices have been correctly set up using the ibv_devices and ibv_devinfo commands.
Result of ibv_devices :
    device              node GUID
    ------           ----------------
    siw0             507b9ddd7a170000
    siw_loop         0000000000000000

Result of ibv_devinfo :

hca_id: siw0
 transport:   iWARP (1)
 fw_ver:    0.0.0
 node_guid:   507b:9ddd:7a17:0000
 sys_image_guid:   507b:9ddd:7a17:0000
 vendor_id:   0x626d74
 vendor_part_id:   0
 hw_ver:    0x0
 phys_port_cnt:   1
  port: 1
   state:   PORT_ACTIVE (4)
   max_mtu:  1024 (3)
   active_mtu:  invalid MTU (0)
   sm_lid:   0
   port_lid:  0
   port_lmc:  0x00
   link_layer:  Ethernet
hca_id: siw_loop
 transport:   iWARP (1)
 fw_ver:    0.0.0
 node_guid:   0000:0000:0000:0000
 sys_image_guid:   0000:0000:0000:0000
 vendor_id:   0x626d74
 vendor_part_id:   0
 hw_ver:    0x0
 phys_port_cnt:   1
  port: 1
   state:   PORT_ACTIVE (4)
   max_mtu:  4096 (5)
   active_mtu:  invalid MTU (0)
   sm_lid:   0
   port_lid:  0
   port_lmc:  0x00
   link_layer:  Ethernet
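
To avoid reading through the full listing, the port state of each device can be pulled out with a short filter. This is a minimal sketch that assumes the ibv_devinfo output layout shown above; `port_states` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `ibv_devinfo` output on stdin and prints
# "<device> <port state>" for every port it finds.
port_states() {
    awk '$1 == "hca_id:" { dev = $2 }
         $1 == "state:"  { print dev, $2 }'
}

# Usage:
# ibv_devinfo | port_states
```

For the listing above, this would print siw0 PORT_ACTIVE and siw_loop PORT_ACTIVE.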

Testing with RPING: 

Now we simply test the setup with rping : 

In one shell : 
rping -s -a <serverIP> 

in the other : 

rping -c -a <serverIP> -v 

And you should see the rping working successfully! 

You are now all set to use RDMA without the need for expensive hardware. 

Thursday, July 02, 2020

[Links of the Day] 02/07/2020 : Database query optimization, Deep Learning Anomaly detection survey, Large scale packet capture system

  • event-reduce : accelerates query results after a write. Basically, it caches part of the write and recalculates the new query result using past query results and the recent write event. The authors observe up to 12 times faster display of new query results after a write has occurred.
  • Deep Learning for Anomaly Detection: A Survey : comprehensive survey of anomaly detection techniques out there. 
  • Moloch : Large scale, open-source, indexed packet capture and search.

Tuesday, June 30, 2020

Data is the new oil fueling Machine learning adoption but Businesses are discovering #AI is no silver bullet

Data is the new oil. However, unlike oil, as data scarcity is becoming less of a problem, processing costs are skyrocketing. The business world is waking up to the fact that, while the cost of computing keeps getting cheaper all the time, the cost of training machine learning models is outpacing the drop in compute cost.

Moreover, businesses are finding it challenging to adopt #ai, and the numbers reported by The Economist show how often #machinelearning projects fail in the real business world :
  • Seven out of ten said their #ai projects had generated little impact so far.
  • Two-fifths of those with “significant investments” in ai had yet to report any benefits at all.
Companies are finding that #machinelearning is not the promised silver bullet. Non-tech companies are discovering what tech companies had to learn the hard way: that they are no Google, Facebook, ...

To successfully deploy an AI/ML/DL project you need: a vast amount of data, skilled employees, solid engineering practices, access to infrastructure and, last but not least, a clear understanding of the business problem.

I have a false hope that corporations will abandon the silver bullet thinking, but I would settle for avoiding another #ai winter cycle.

[Links of the Day] 30/06/2020 : Homomorphic encryption for Machine Learning, Neural Network on Silicon, Python Graph visualization library

  • PySyft : Python framework for homomorphic encryption for machine learning. It allows you to train models on encrypted data without the need to decrypt it. It's 40x slower than the normal method, but this means you don't have to deal with the new EU regulation on AI. 
  • Neural Networks on Silicon : a collection of papers and works on the topic of neural networks on silicon.
  • Pygraphistry : Python visual graph analytics library to extract, transform, and load big graphs in Graphistry.

Thursday, June 25, 2020

[Links of the Day] 25/06/2020 : Architecture decision record, Database stress test, Rust Network Function Framework

  • Architecture decision record : Methods and tools for capturing software design choices that address a functional or non-functional requirement that is architecturally significant. [template]
  • pstress : Percona database concurrency and crash recovery testing tool.
  • capsule : A framework for network function development. If you want to do fast packet processing in a memory-safe programming language (Rust), this is for you.

Tuesday, June 23, 2020

The rise of Domain Specific Accelerators

Two recent articles indicate a certain pick-up in the adoption of domain-specific accelerators. With the end of Moore's Law, domain-specific hardware solutions remain one of the few paths to continuing to increase the performance and efficiency of computing hardware.

For a long time, the adoption of domain-specific accelerators was limited by economic factors. Historically, the small feature sizes, small batch sizes, and high cost of fab time (for ASICs) translated into a prohibitive per-unit cost.
However, economic factors have shifted :

  • move toward standardised opensource tooling,
  • more flexible licensing model,
  • RISC-V architecture coming of age and maturing rapidly
  • Fab cost dropping
  • Wide availability of FPGA (AWS F1)
  • Rise of co-designed high-level programming language reducing the learning curve and design cycle.
  • power/performance wall of general-purpose compute unit

We are about to see a dramatic shift toward heterogeneous compute infrastructure over the next couple of years.

[Links of the Day] 23/06/2020 : Thinking while moving, AI snake oil, Graph Database

  • Thinking While Moving : too often, current machine learning systems are used in a rigid control loop, leading to saccades. The authors of this paper propose executing the inference system concurrently with the controlled system, allowing more fluid operations and a shorter execution time for the task.
  • AI snake oil : a lot of AI solution projects fail to return their initial investment. Too many buzzwords and not enough understanding of the limits of the current technology. At least NVIDIA is selling GPUs by the millions. When there is a gold rush, the one making a fortune is the one selling shovels.
  • TerminusDB : in-memory graph database management system. It's interesting to see that 99% of the source code is Prolog and that they use JSON-LD as the definition, storage and interchange format for the query language. The original use case for this solution targeted financial data stored as time series but lacking graph correlation.

Monday, June 22, 2020

Stop throwing GPU at HPC, Not all scientific problems are compute dense

The current race to exascale has put a heavy emphasis on GPU-based acceleration to the detriment of other HPC architectures. However, the Crossroads and Fugaku supercomputers are demonstrating that it is not all about GPUs.

The vast majority of the (pre-)exascale machines are relying heavily on GPU acceleration targeting scientific problems that can be cast as dense matrix-matrix multiplication problems.

However, there are large numbers of scientific problems that are not compute dense, and GPU architectures are ill-equipped to accelerate them. Sadly, the current trends seem to have relegated those types of scientific challenges to second-class citizens in the HPC world. If you look at extreme-scale graph problems, for example, the Graph500 benchmark clearly shows that these types of problems have been orphaned. Four out of ten systems are more than seven years old and nearing their end of life. Moreover, the newer systems show marginal progress toward accelerating extreme-scale graph traversal. 

I understand that the current machine learning hype heavily influences the HPC ecosystem. However, we have to remind ourselves that there is life beyond FLOPS. And the Fugaku and Crossroads systems demonstrate that it is possible to achieve strategic compute leadership without sacrificing the architecture on the altar of the compute-dense exaflop gods. 

Japan's latest ARM-based supercomputer, Fugaku, is demonstrating that it can address both compute-dense, GPU-optimised problems and those that are not reducible to dense linear algebra and are therefore incompatible with GPU technologies. The Japanese supercomputer, built around the ARM v8.2A A64FX CPU, just picked up the number one spot in the HPC Top 500 Green benchmark and the Graph500 BFS benchmark.

Hopefully, this will be a wake-up call within the HPC community to properly fund R&D efforts orthogonal to compute-dense, exaflops-benchmark-friendly architectures.

Update 22/06/2020 : after publishing this article, Fugaku just got ranked Nb 1 in the Top500 Linpack benchmark with close to half an exaflops (415 petaflops)! And Fugaku is pretty much topping every single HPC ranking :

  • Top500 : Nb 1.
  • Top500-Green : Nb 1.
  • HPCG : Nb 1.
  • HPL-AI: Nb 1.
  • Graph500 : Nb 1.

Friday, June 19, 2020

With enough data and/or fine tuning, simpler models are as good as more complex models

This is an age-old issue that seems to repeat itself in every field. A couple of recently published papers criticise the race to beat SOTA.

This recent paper demonstrates that older and simpler models perform as well as newer models as long as they get enough data to train on.

This has an interesting impact on production systems: if you already have a good enough model, throwing more data at it can help achieve close to SOTA results.
This means that you won't have to build a new model from scratch to keep up with SOTA in your production system. You just need to collect more data as the system runs and retrain your model once in a while.
Also, less complex models tend to have shorter inference times in production, which is another crucial aspect affected by model complexity.

In another recent paper, the authors look at metric learning papers from the past four years and demonstrate that the performance gains claimed over older methods (often more than double) are mainly due to a lack of tuning.
Most of the time, the authors of the SOTA-beating algorithm fine-tune their own algorithm on the test set, then compare it against the previous SOTA algorithm with off-the-shelf tuning.

"Our results show that when hyperparameters are properly tuned via cross-validation, most methods perform similarly to one another"

"...this brings into question the results of other cutting edge papers not covered in our experiments. It also raises doubts about the value of the hand-wavy theoretical explanations in metric learning papers."
This happens time and time again across industry and academia: perf benchmarks of CPUs (Intel vs AMD), GPUs (Nvidia vs ATI), network, storage, etc.
This can be due to a lack of knowledge, time, integrity, etc.

To conclude, be careful: the latest shiny model might not be the best one for your production system. If you spend enough time and data on older models, you might achieve the same performance at a lower inference cost.
Obviously, this assumes that you already have best practices in place when it comes to model monitoring in production :)

Thursday, June 11, 2020

[Links of the Day] 11/06/2020 : Metric Time-Series Database, Machine Learning for metrics, Causal Time series Analysis

  • Victoria Metrics : fast, cost-effective and scalable time-series database. If you need a backend for Prometheus, for example, this is the DB for you.
  • Sieve : a platform to derive actionable insights from monitored metrics in distributed systems. The platform is composed of two separate systems: one geared toward trace reduction and selection with intelligent sampling using a form of zero-positive learning, and a second that extracts correlations between the services generating the traces.
  • Tigramite : causal time series analysis Python package. It allows you to efficiently reconstruct causal graphs from high-dimensional time-series datasets and model the obtained causal dependencies for causal mediation and prediction analyses [github]

Tuesday, June 09, 2020

[Links of the Day] 09/06/2020 : #WASM on #K8S , Fast Anomaly Detection on Graphs, Linux one ring

  • Krustlet : it seems that WebAssembly is getting more pervasive: we have kernel WASM, WASM for deep learning, and now Krustlet offers WASM for Kubernetes via Kubelet.
  • Fast Anomaly Detection in Graphs :  Really cool real-time anomaly detection on dynamic graphs. The authors claim to be 644 times faster than SOTA with 42-48% higher accuracy. What is even more attractive is the constant memory usage, which is fantastic for production deployment. [github]
  • io_uring : this will dominate the future of the Linux interface. It is currently eating up every single IO interface and probably won't stop just there. 

Saturday, June 06, 2020

Yet another Red Queen Project : Franco-German Gaia-X

For some reason, the EU and especially the French government love moonshot projects. The only problem is that they tend to be launched after the moon has already been colonized.

Gaia-X is not a moonshot, but a Red Queen project. I use this term in reference to the Red Queen hypothesis or Red Queen effect, which is derived from Lewis Carroll's Through the Looking-Glass :
"Now, here, you see, it takes all the running you can do, to keep in the same place."

Gaia-X is a Red Queen project because the French and German governments (and the EU to some extent) are trying to forcefully evolve the digital ecosystem just to stay in the same place. Also, because they always launch these initiatives way too late or without any long-term strategic planning, both in terms of funding and impact. 

Let's look at Gaia-X and why there is an air of "deja vu". First, it's not a cloud service; it's a "platform" aggregating cloud hosting services from dozens of companies. Does that remind you of anything? Bingo, the European cloud initiative, which aims to : 
"Strengthen Europe's position in data-driven innovation, improve its competitiveness and cohesion, and help create a Digital Single Market in Europe."

This initiative started back in 2012; at the time, I didn't get the strategy and structure of the effort. And unfortunately, I still don't. The EU wanted to regulate and impose EU standards on the industry, hoping to spur the EU cloud ecosystem via standards and funding sprinkling. I use the term sprinkling because the EU thought that by seeding a constellation of research projects and local initiatives it would magically help sprout an EU cloud giant.

The "standard effort" side of the program seems to have fizzled out. Judge for yourself: The official final report is here.

Gaia-X seems to be an offshoot of the European Cloud Partnership side of the cloud initiative, aiming at increasing trust when using cloud services: 
"it's (Gaia-X) conceived as a platform joining up cloud-hosting services from dozens of companies, allowing businesses to move their data freely with all information protected under Europe's tough data processing rules."

On top of the regulatory compliance spin, the project promoters cannot refrain from using the vendor lock-in FUD: 
"One important concept underpinning Gaia-X is "reversibility", a principle that would allow users to switch providers quickly. " 

They conveniently forgot to mention that by using Gaia-X, you will be trading provider lock-in for platform lock-in. 

If you dig a little into the technical side, you find out that this reads more like a program to keep academic research institutes busy, and that it rehashes fantasies of dynamically matching service providers to consumers and policies. Dynamic matching was a hot topic in academia during the SOAP times but isn't used at all in practice. Moreover, it doesn't use any established logic programming paradigm and re-invents an ad-hoc service ontology/taxonomy and query language.

Last but not least, one of the glaring omissions from the platform is the complete lack of specification regarding common accounting, payment and monetization of services. Where is the processing and payment service? It is conveniently absent.

Providing an accounting and payment platform for dynamically orchestrated services from a multitude of providers is not only hard. It's near impossible. Without this crucial element, the platform is stillborn.

If France and Germany want to avoid turning Gaia-X into another Qwant, they might consider pivoting the platform to a more niche domain, such as a cloud services procurement platform for governments and large companies. This would fit right in with the compliance, sovereignty and interoperability narrative, as well as the business profile of most of the consortium participants.

Thursday, June 04, 2020

[Links of the Day] 04/06/2020 : XoR filters, SIMD + Json , Online tracking and publisher's revenues

  • Xor Filters :  Xor filters are great as they provide a fast and small alternative to Bloom or cuckoo filters. However, there are some key differences. Xor filters require all the members of the set to be provided upfront, whereas Bloom filters allow adding members but not removing them, and cuckoo filters allow removing members as well. So just pick what's best for you.
  • SimdJson : nice performance leveraging CPU features. However, the lack of support for null entries feels like cheating (and it will probably crash with the most common real-life payloads).
  • Online Tracking and Publishers' Revenues : The authors demonstrate that the use of cookies only represents a 4% increase in revenue vs non-cookie for an advertiser. This calls into question the differential benefit between ad publishers like Google and Facebook vs the advertisers, and why advertisers should pay for a loss of privacy that only benefits their platform provider.

Tuesday, June 02, 2020

[Links of the Day] 02/06/2020 : Real time network topology, Detecting node failure using graph entropy, Monitoring machine learning in production

  • Skydive : open source real-time network topology and protocols analyzer providing a comprehensive way of understanding what is happening in your network infrastructure.
  • Vertex : the authors propose to use vertex entropy for detecting and understanding node failures in a network. By understanding the entropy in a graph they are able to circumvent the lack of locality in the information available and pinpoint critical nodes. 
  • Monitoring Machine Learning Models in Production :  Once you have deployed your machine learning model to production it rapidly becomes apparent that the work is not over.

Thursday, May 28, 2020

[Links of the Day] 28/05/2020 : Reverse Oauth proxy , Prometheus timeseries backend, Google uses machine learning to improve audio chat

  • Oauth Proxy : A reverse proxy that provides authentication with Google, Github or other providers.
  • Zebrium : Prometheus backend project. Not sure why you wouldn't just export the data from Prometheus into a distributed column-store data warehouse like Clickhouse, MemSQL or Vertica. This gives you fast SQL analysis across massive datasets, real-time updates regardless of order, and unlimited metadata, cardinality and overall flexibility. Maybe it is because they want to focus on the monitoring / reactive aspect and less on the analytics.
  • Improving Audio Quality with WaveNetEQ : Google uses machine learning to deal with packet loss, jitter, and delays. An interesting bit of info: " 99% of Google Duo calls need to deal with packet losses, excessive jitter or network delays. Of those calls, 20% lose more than 3% of the total audio duration due to network issues, and 10% of calls lose more than 8%."