Reflections Of The Void: 05/01/2020

Thursday, May 28, 2020

OAuth Proxy : A reverse proxy that provides authentication with Google, GitHub, or other providers.
Zebrium : A Prometheus backend project. I am not sure why you would not simply export the data from Prometheus into a distributed column-store data warehouse such as ClickHouse, MemSQL, or Vertica. That gives you fast SQL analysis across massive datasets, real-time updates regardless of order, and unlimited metadata, cardinality and overall flexibility. Maybe it is because they want to focus on the monitoring and reactive side rather than on analytics.
Improving Audio Quality with WaveNetEQ : Google uses machine learning to deal with packet loss, jitter, and delays. An interesting bit of info: "99% of Google Duo calls need to deal with packet losses, excessive jitter or network delays. Of those calls, 20% lose more than 3% of the total audio duration due to network issues, and 10% of calls lose more than 8%."

Combining knowledge graphs : The authors describe a new entity alignment technique that factors in information about the graph in the vicinity of the entity name. It provides 10% higher accuracy while reducing the computational cost of model generation.
Wizard : a project studying real-world storage reliability, with the aim of building a cost-effective data and storage resource management system that enhances reliability.
Elle : a Jepsen black-box transactional safety checker based on cycle detection. You can find more in the arXiv paper by Kyle Kingsbury and Peter Alvaro: "Elle: Inferring Isolation Anomalies from Experimental Observations"
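
To give a feel for what "based on cycle detection" means, here is a minimal Python sketch. It assumes the dependencies between transactions (for example, "T2 read what T1 wrote") have already been inferred and are handed over as a directed graph. The graph encoding, the `find_cycle` name, and the example transactions are hypothetical, and the real checker is far more sophisticated than this. The intuition is simply that a cycle means no serial order of the transactions can satisfy every dependency.

```python
from typing import Dict, List, Optional

def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    # Collect every node, including ones that only appear as a successor.
    nodes = set(graph) | {m for succs in graph.values() for m in succs}
    state = {n: UNSEEN for n in nodes}
    path: List[str] = []  # current depth-first search path

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for succ in graph.get(node, []):
            if state[succ] == IN_PROGRESS:
                # Back edge: succ is already on the current path, so we closed a cycle.
                return path[path.index(succ):] + [succ]
            if state[succ] == UNSEEN:
                found = visit(succ)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for n in sorted(nodes):  # sorted only to make the result deterministic
        if state[n] == UNSEEN:
            found = visit(n)
            if found:
                return found
    return None

# Hypothetical histories: an edge A -> B means "B depends on A".
cyclic_history = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}
serial_history = {"T1": ["T2"], "T2": ["T3"], "T3": []}

print(find_cycle(cyclic_history))
print(find_cycle(serial_history))
```

A plain depth-first search is used here because it is short and returns the offending transactions directly, which is the kind of evidence a checker needs to report.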

Age-Partitioned Bloom Filters : an Age-Partitioned Blocked Bloom Filter variant.
Open source libraries to deploy, monitor, version and scale your machine learning : a curated list.
Data Sentinel : LinkedIn's platform for automatically validating the quality of large-scale data in production environments

Statistical Consequences of Fat Tails : Nassim Nicholas Taleb's book investigates the misapplication of conventional statistical techniques to fat-tailed distributions and looks for remedies
AutoML Zero : aims to automatically discover computer programs that can solve machine learning tasks, starting from empty or random programs and using only basic math operations. The goal is to simultaneously search for all aspects of an ML algorithm—including the model structure and the learning strategy—while employing minimal human bias.
Sno : Distributed version-control for geospatial and tabular data

Kernel Wasm : Looks like people want to run WASM everywhere. This time the authors propose running WASM programs in the kernel. In this case, I just wonder if it would not be more judicious to try to run WASM in eBPF. From the GitHub repo it seems that they might actually try to do the opposite. [github]
Knowledge Graphs : A comprehensive introduction to knowledge graphs. If you want to learn more about knowledge graphs, I would recommend reading the following paper before reading the arXiv one.
A Zero-Positive Learning Approach for Diagnosing Software Performance Regressions : a really innovative approach by the Intel folks. I have started to see an interesting trend in machine learning: instead of trying to train a model on a dataset that covers the whole spectrum of possibilities, authors are starting to use contrastive methods. In this case, the model is trained only on normal data so that it can identify abnormal behaviour. In performance evaluation it is much easier to obtain ideal or standard metrics than examples of abnormal scenarios. Here, the authors train their model on hardware performance counters from ideal runs in order to identify abnormal behaviour. [poster]
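
The zero-positive idea can be sketched in a few lines of Python. This is an illustration under strong assumptions, not the authors' method: it swaps whatever learned model the paper uses for a simple per-counter mean and standard deviation, and the counter names, the sample values, and the threshold of 3 standard deviations are all made up. The only point is that the detector is fitted on normal runs alone and flags anything that strays too far from them.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

Run = Dict[str, float]                    # counter name -> observed value
Profile = Dict[str, Tuple[float, float]]  # counter name -> (mean, std dev)

def fit_normal_profile(normal_runs: List[Run]) -> Profile:
    """Summarise each counter using normal runs only (no anomalous examples)."""
    profile: Profile = {}
    for counter in normal_runs[0]:
        values = [run[counter] for run in normal_runs]
        profile[counter] = (mean(values), stdev(values))
    return profile

def is_anomalous(profile: Profile, run: Run, threshold: float = 3.0) -> bool:
    """Flag a run if any counter is more than `threshold` std devs from normal."""
    for counter, (mu, sigma) in profile.items():
        if sigma == 0:
            # Counter never varied in training: any change counts as a deviation.
            deviates = run[counter] != mu
        else:
            deviates = abs(run[counter] - mu) / sigma > threshold
        if deviates:
            return True
    return False

# Hypothetical hardware counter readings from well-behaved runs.
normal_runs = [
    {"cache_misses": 100.0, "branch_misses": 50.0},
    {"cache_misses": 104.0, "branch_misses": 52.0},
    {"cache_misses": 98.0, "branch_misses": 49.0},
    {"cache_misses": 102.0, "branch_misses": 51.0},
    {"cache_misses": 96.0, "branch_misses": 48.0},
]
profile = fit_normal_profile(normal_runs)

typical_run = {"cache_misses": 101.0, "branch_misses": 50.0}
regressed_run = {"cache_misses": 180.0, "branch_misses": 51.0}

print(is_anomalous(profile, typical_run))
print(is_anomalous(profile, regressed_run))
```

Note that no example of a regression is needed to build `profile`, which is what makes this style of training attractive when abnormal runs are rare or hard to reproduce.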

Learning From Unlabeled Data : Slide deck of a talk by Thang Luong of Google Research. Thang presents novel methods for learning from unlabeled data, and more specifically semi-supervised learning methods. These methods were used to build Google's Meena chatbot model.
Flying Squid : Looks like a super-fast Snorkel with even better performance. Like Snorkel, it is used to quickly build classifiers for datasets that would otherwise be extremely time-consuming (and expensive) to label by hand for training purposes.
Gandalf : Azure machine learning system trained to catch bad rollout deployments. The aim of this system is to catch bad deployments before they can have ripple effects across the whole system.

Tactical manuals and guides for startups : an awesome collection of strategic posts, essays and documents for startups. While these are great resources, they don't replace experience.
AutoML Pipeline : The power of Julia meets machine learning. However, beware: just feeding data into a system and hoping for the best result without any effort is doomed to deliver sub-optimal results. Often you end up with an ok-ish solution that blows up in production down the line.
TCMalloc : Google's thread-caching malloc

umake : no more waiting for compilation; this tool offers fast builds with cached compilation.
DeepSpeed : a deep learning optimization library. The authors claim some amazing gains over the standard library. The nice thing is that it reuses the PyTorch API, which makes it easy to use. [github]
Nuclear Matters Handbook : ever wanted to know how the US handles nuclear deterrence and nuclear matters? Look no further and read this book. It provides an overview of the U.S. nuclear enterprise and how the United States maintains a safe, secure, and effective nuclear deterrent.