Showing posts with label database. Show all posts

Thursday, July 02, 2020

[Links of the Day] 02/07/2020 : Database query optimization, Deep Learning Anomaly detection survey, Large scale packet capture system

  • event-reduce : accelerates query results after a write. Basically, it caches the previous query result and computes the new result from that past result and the recent write event, instead of re-running the query. The authors observe up to 12 times faster display of new query results after a write occurs.
  • Deep Learning for Anomaly Detection: A Survey : comprehensive survey of anomaly detection techniques out there. 
  • Moloch : Large scale, open-source, indexed packet capture and search.
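
The event-reduce idea can be illustrated with a small sketch. This is not the library's actual API (event-reduce is a JavaScript project); the function name `apply_insert` and its parameters are hypothetical, and only the insert case is handled. It assumes a query with a filter, a sort order and a limit, and shows that the new result can be derived from the previous result plus the write event alone.

```python
def apply_insert(results, doc, matches, sort_key, limit):
    """Derive the new query result after `doc` was inserted.

    Hypothetical helper: uses only the previous `results` and the write
    event, never the full collection.
    """
    if not matches(doc):
        # The inserted document does not satisfy the filter,
        # so the cached result is still valid.
        return list(results)
    # The new top-`limit` can only contain old results or the new document.
    updated = sorted(list(results) + [doc], key=sort_key)
    return updated[:limit]


def full_query(collection, matches, sort_key, limit):
    """Reference implementation: re-run the whole query."""
    return sorted((d for d in collection if matches(d)), key=sort_key)[:limit]


# Example: adults sorted by age, first two only.
def is_adult(d):
    return d["age"] >= 18


def by_age(d):
    return d["age"]


collection = [{"id": 1, "age": 30}, {"id": 2, "age": 25}, {"id": 3, "age": 40}]
cached = full_query(collection, is_adult, by_age, 2)

new_doc = {"id": 4, "age": 27}
collection.append(new_doc)
cached = apply_insert(cached, new_doc, is_adult, by_age, 2)
```

After the insert, `cached` holds the same documents a full re-query would return, without scanning the collection again. Updates and deletes need more cases, which is where the real library does the heavy lifting.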



Thursday, June 11, 2020

[Links of the Day] 11/06/2020 : Metric Time-Series Database, Machine Learning for metrics, Causal Time series Analysis

  • Victoria Metrics : fast, cost-effective and scalable time-series database. If you need a backend for Prometheus, for example, this is the DB for you.
  • Sieve : a platform to derive actionable insights from monitored metrics in distributed systems. The platform is composed of two separate systems: one geared toward trace reduction and selection with intelligent sampling using a form of zero-positive learning, and a second that extracts correlations between the services generating the traces.
  • Tigramite : causal time series analysis Python package. It lets you efficiently reconstruct causal graphs from high-dimensional time-series datasets and model the obtained causal dependencies for causal mediation and prediction analyses [github]



Tuesday, April 28, 2020

[Links of the Day] 28/04/2020 : Distributed Time Series Database, Data lakes, Translate data between format

  • Modern data lakes : if you think you need a data lake, you probably don't need one and are better off using S3/Athena or GCP/BigQuery. If you know you want a data lake, you might be mature enough to need one and should read this article.
  • M3DB : distributed time series database from Uber. It tries to address the horizontal scaling of storage and queries and the long-term storage limitations of existing solutions.
  • ConfBase : a practical tool for inferring and instantiating schemas and translating between data formats. The tool supports JSON, GraphQL, YAML, TOML, and XML. [github]


Thursday, April 16, 2020

[Links of the Day] The future of Machine Learning is DBMS, Fun Exploring Explanations, High performance Regex

  • Hyperscan : high-performance multiple regex matching library
  • Cloudy with a chance of DBMS : A. Colyer reviews the 10 year ML prediction paper. The TL;DR: models, models everywhere in enterprise databases. You have already seen a glimpse of what that means with BigQuery ML.
  • Explorables : awesome website explaining a lot of concepts through play. Lots of computer science stuff in there.

Tuesday, April 14, 2020

[Links of the Day] 14/04/2020 : Time series dynamical attractors Autoencoder, Binarized Neural Network framework, Machine learning and Databases

  • Deep learning of dynamical attractors from time series measurements : the authors propose a general embedding technique for time series, consisting of an autoencoder trained with a novel latent-space loss function. Worth giving it a look if you deal with time series.
  • larq : open-source Python library for training neural networks with extremely low-precision weights and activations, such as Binarized Neural Networks. Basically, this framework is aimed at deploying machine learning models on embedded / FPGA / ASIC targets. A fantastic resource and a great model zoo on top of that.
  • Cloudy with a chance of DBMS : databases are going to embed more and more machine learning solutions. BigQuery from Google already does that, and it's just a question of time before most mainstream DBs offer an ML service.

Thursday, September 19, 2019

[Links of the Day] 19/09/2019 : Golang DNS lib, tool for SQL query across databases, Linux Kernel Devops

  • NewDNS : Want to build a DNS server in Go? This is for you.
  • OctoSQL : query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
  • kdevops : devops framework for Linux kernel development that relies on Ansible, Vagrant and Terraform, with Ansible roles distributed through Ansible Galaxy, and Terraform modules. It aims to make setting up and testing the Linux kernel for any project as easy as possible. I wish I had that a couple of years ago.


Thursday, February 21, 2019

[Links of the Day] 21/02/2019: Database Internals, Fosdem 2019 Videos, Cloud Programming Simplified

  • Database Internals : excellent series delving into the internal mechanisms and algorithms of modern and not so modern database systems.
  • Fosdem 2019 Videos: Fosdem 2019 conference videos are starting to filter through on the interweb.
  • Cloud Programming Simplified: A Berkeley View on Serverless Computing paper which gives a quick history of cloud computing, including an accounting of the predictions of the 2009 Berkeley View of Cloud Computing paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles and research opportunities required for serverless computing to fulfil its full potential.

Tuesday, January 15, 2019

[Links of the Day] 15/01/2019 : Incident Response best practice, Database Schema Crawler, Fingerprinting TLS

  • ja3 : something I discovered recently. Apparently, you can fingerprint SSL and TLS sessions in order to identify the service being run behind the encrypted socket. Really awesome if you want to spot malware or bitcoin miners on your network, or pretty much any other service as long as you have a fingerprint to compare with.
  • SchemaCrawler : a cool tool for database schema discovery. This is a must when you have to take on board a legacy DB system that lacks clear documentation.
  • Incident response : PagerDuty open sourced their incident response process. This is a really great set of tools, processes and best practices for incident response. What is even more eye-opening is the part that describes the incident resolution scenarios that didn't work and points out some great anti-patterns. A must read for any SRE team out there, anybody else who has on-call duty, and their managers.
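
For the curious, a JA3 fingerprint is an MD5 hash over a few fields of the TLS ClientHello. The sketch below assumes those fields have already been parsed into decimal integers (the parsing itself, and the filtering of GREASE values, is left out); `ja3_string` and `ja3_fingerprint` are hypothetical helper names, not part of the ja3 tooling.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string: five comma-separated fields,
    with the values inside each field joined by dashes."""
    fields = [
        str(version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    return ",".join(fields)


def ja3_fingerprint(version, ciphers, extensions, curves, point_formats):
    """Return the 32-character hex MD5 digest of the JA3 string."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Made-up ClientHello values, for illustration only.
example = ja3_string(769, [47, 53], [0, 10, 11], [23, 24, 25], [0])
fingerprint = ja3_fingerprint(769, [47, 53], [0, 10, 11], [23, 24, 25], [0])
```

Two clients that offer the same TLS version, ciphers, extensions and curves in the same order produce the same fingerprint, which is what makes a lookup against a list of known fingerprints possible.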


Tuesday, August 14, 2018

[Links of the Day] 14/08/2018: high-perf analytics database, Cloud events specs, Large scale system design

  • LocustDB : Massively parallel, high-performance analytics database.
  • CloudEvents Specifications : CNCF effort to create a specification for describing event data in a common way.
  • System Design Primer : really cool set of documents helping any developer learn to design large-scale systems.


Thursday, August 09, 2018

[Links of the Day] 09/08/2018 : Consciousness and integrated information, Optical FPGA, Events DB

  • Making Sense of Consciousness as Integrated Information : in this paper, the authors argue that we currently have a dissociation between cognition and experience and that it might have an impact in the future in a hyper-connected world.
  • Towards an optical FPGA : it looks like programmable silicon photonic circuits are the next frontier in hardware accelerators. Converting light into an electrical signal has rapidly become too expensive, and modern CPUs have a hard time coping with the pace of evolution of networking capabilities.
  • TrailDB : tool for storing and querying series of events. Fast, small, efficient.


Monday, May 21, 2018

[Links of the Day] 21/05/2018 : Automation and Make, FoundationDB, Usenix NSDI18


  • Automation and Make : this is a really good description of best practices for Makefiles and automation.
  • FoundationDB : Apple open sourced its distributed DB system; another contender enters the fray. With Spanner on Google Cloud, CockroachDB and now FoundationDB, highly resilient distributed transactional systems are starting to reach widespread usage. [Github]
  • Usenix NSDI 2018 Notes: a very good overview of NSDI conference, and naturally the morning paper is currently doing a more in-depth analysis of the main papers. [day 2&3]




Tuesday, January 16, 2018

[Links of the Day] 16/01/2018 : planetary scale DB - AntidoteDB, Benchmarks for Machine Learning and the hardware running the algorithms

  • AntidoteDB : large scale (planet-scale) distributed DB system, competing with the likes of CockroachDB or Spanner. The core differentiator is that the architecture relies heavily on CRDTs for its core functionality. It is a spin-off from the SyncFree EU research project. Sadly, like a lot of EU or research-driven startup spin-offs, the documentation and website are slightly lacking polish. The architecture reference link is broken and a lot of stuff seems to be work in progress. Come on guys! If you want to build a community and a product you really need to pick up the pace. This project has great potential, don't let it go to waste.
  • Machine Learning Benchmarks - Hardware Provider : a very good survey of machine learning benchmarks across the current cloud providers. What is even more useful in that benchmark is that you get a cost overview of running ML applications, which is often a big unknown at the moment.
  • DeepMind Control Suite : benchmark suite for machine learning algorithms using a set of continuous control tasks with a standardised structure and interpretable rewards


Thursday, November 16, 2017

[Links of the Day] 16/11/2017 : Sparse and dense array database, Rhythm of memory, Routing over blockchain

  • TileDB : manages massive dense and sparse multi-dimensional array data simply. This is a really good project, as there is often no real support for this in existing databases.
  • Rhythm of memory : the brain is a complex organ, and we have barely scratched the surface. Scientists discovered that part of the memory processing in the brain is segregated into different subcomponents that process information in parallel and at different speeds. This gives a glimpse of how the brain works and how it is able to store and quickly access so much data at various levels of granularity.
  • IPvPub : I really think there is something behind this concept. While I tend to be wary of the current trend of sprinkling blockchain everywhere, using this technology for large-scale address resolution and routing can solve so many problems... such as reducing reliance on the DNS system in the age of lambda. What I really want to see is this integrated with a Lambda framework for simple exposure of service endpoints.


Thursday, October 12, 2017

[Links of the Day] 12/10/2017 : bitcoin resource list, time series DB seminar, Microservices debugger

  • Bitcoin resource list : extensive list of bitcoin resources ranging from basic introductions, history and tutorials to in-depth tech materials.
  • Time Series Database Lecture : 2017 Carnegie Mellon University lectures. This is quite good, as this series of lectures not only offers high-quality theoretical knowledge in the field but also includes invited talks from key commercial and open source players in this field (InfluxDB, Timescale, etc.).
  • Squash : microservices debugger, because now you can't rely on your monolith debugging skills and tool set anymore ( ^_^).


Thursday, June 08, 2017

[Links of the Day] 08/06/2017 : Machine Learning Tuning DBMS, Direct SSD to GPU SQL and Large Graph DB processing

  • Tuning DBMS with Machine Learning : from the people behind Peloton, who demonstrate a way to automatically tune a DB using machine learning. This is rather interesting; however, there is a key element missing in the approach: cost. Your DB system can become highly optimised, but your AWS cost can skyrocket too. What you need is a system that automatically tunes the perf & cost tradeoff to maximise ROI. Sometimes being a little bit slower can save $$
  • MOSAIC : a more heterogeneous approach: a graph processing engine that exploits all the hardware resources available in a standard Xeon host processor, Xeon Phi coprocessors, NVMe, and a fast interconnect. Because fast processing of your Facebook social network for fast advertisement targeting is worth it :) [slides]
  • PG-Strom : bypassing the CPU for SQL operations by allowing direct SSD to GPU communication for PostgreSQL processing. We are slowly entering the age of heterogeneous computing systems where CPU cores get relegated to highly generic tasks. [slides]


Wednesday, April 26, 2017

[Links of the Day] 26/04/2017 : Aphyr Scala Day, Sia blockchain file storage , How brains are built

  • Aphyr Scala Day 17 : Aphyr breaks databases for a living and then talks about it :)
  • Sia : a blockchain-based marketplace for file storage. The really attractive thing is the cost comparison of Sia vs public cloud systems: it is between ten and a hundred times cheaper than S3 or other similar solutions. I would be curious to see the performance though.
  • How brains are built: High-level overview of principles of computational neuroscience.




Monday, April 03, 2017

[Links of the Day] 03/04/2017 : Conway's Game of life Clock, Human-Bot social interaction, SQL time series DB


  • Digital clock in Conway's Game of Life : I can't even start to comprehend how you can design this. But this is beyond cool.
  • Online Human-Bot Interactions: Detection, Estimation, and Characterization : an analysis of bots on social networks (Twitter). I think we need a reverse Turing test, where a robot can detect when it is talking to a human.... A reverse captcha to keep those pesky meat-bags from meddling in our robotic overlords' affairs.
  • Timescale : SQL compatible time series database. Another competitor for InfluxDB. Let's just say that the clustering feature will make or break it, as InfluxDB has some serious issues there [github]



Monday, November 07, 2016

[Links of the Day] 07/11/2016 : Baidu Open Source Repo(s), Wan Replicated DB

  • Baidu : Baidu open source code on Github. It looks like it replicates a lot of the services / features that other hyperscale systems use. Raft seems to be the default underlying consensus protocol for all applications. A lot of nice goodies in there, especially:
    • BFS : Baidu file system that provides the underlying persistence for Baidu real time applications. It's a distributed, multi-datacenter file system that uses Raft for metadata coherence and a shared-nothing approach for linear scalability.
    • Tera : Distributed database 
    • Galaxy : Mesos / Kubernetes equivalent.
    • Paddle : Distributed machine learning 
    • iNexus : Distributed K/V store. Looks similar to Consul, and it also uses Raft as the underlying consensus protocol.
  • Bedrock : WAN-replicated distributed data(base). Designed to use SSDs, along with other nice features.



Wednesday, November 02, 2016

[Links of the Day] 02/11/2016 : Unik fast easy unikernel builder, Noms decentralized DB and dev books

  • Noms : decentralized database using Git principles. There are some nice features in there, such as content addressing (no duplicates), append-only storage and, last but not least, decentralization. This means you can fork / merge, disconnect, etc. for seconds, hours or years, like Git. However, I am not sure yet how they handle merge and conflict resolution....
  • Programming books : Good list of books for developers.
  • Unik : tool by EMC to compile unikernels directly rather than going the binary route. It's nice to see an increased effort to facilitate unikernel adoption. Previously I talked about this effort. This is a slightly different approach, as it allows building unikernels in almost any language using a tool chain as intuitive as Docker. If this trend continues we might see a decline in container adoption with a move to unikernels. But it's not for the short term, as the optimisation cycle of container technology hasn't fully kicked in yet. [video] [slides]
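
To make the content addressing part of Noms concrete, here is a minimal sketch of an append-only, content-addressed store. It is not Noms' actual API or on-disk format; `ContentStore` is a hypothetical class, and SHA-256 is chosen here as an assumption for the example.

```python
import hashlib


class ContentStore:
    """Hypothetical append-only store where the key is the hash of the value."""

    def __init__(self):
        self._chunks = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself, so identical
        # data always maps to the same address and is stored only once.
        address = hashlib.sha256(data).hexdigest()
        # Append-only: an existing chunk is never overwritten or removed.
        self._chunks.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        return self._chunks[address]

    def __len__(self) -> int:
        return len(self._chunks)


store = ContentStore()
first = store.put(b"hello noms")
second = store.put(b"hello noms")   # duplicate content, same address
other = store.put(b"something else")
```

Because addresses depend only on content, two disconnected copies of such a store can be merged by taking the union of their chunks, with no risk of one key pointing at two different values.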

Tuesday, May 17, 2016

[Links of the day] 17/05/2016: CMU DB lectures, Seminal AI papers, Storage noisy neighbors

  • Database Systems Lectures: Carnegie Mellon University lectures on database systems. They give a really good overview of the state of the art of database systems.
  • Intelligence without representation & Intelligence Without Reason : seminal 1991 papers by Rodney A. Brooks from the MIT artificial intelligence lab. In these, the author argues that intelligent behavior can be generated without explicit, manipulable internal representations and without explicit reasoning systems.
  • Noisy Neighbor analysis : a look at the effect of deploying heavy workloads onto modern storage systems and the collateral effect on overall performance for all the participants in the cluster.