Showing posts with label distributed file system. Show all posts

Tuesday, October 30, 2018

[Links of the Day] 30/10/2018 : Python Object CLI generator, Alibaba Distributed File System, Microsoft API Design guideline

  • Python-fire : library for automatically generating command line interfaces (CLIs) from absolutely any Python object. See the sketch after this list.
  • PolarFS : Alibaba's ultra-low-latency and failure-resilient distributed file system for shared-storage cloud databases. Keep an eye on this one, as the authors are planning to deliver a TLA+ proof soon. Moreover, I hope that they also run a benchmark against GPFS or Lustre rather than Ceph; Ceph is not really competing in the same league.
  • API design : pretty much the gold standard in API design. A must-read for anybody designing or using APIs.
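
For a flavor of how python-fire is used, here is a minimal sketch; the Calculator class and the calc.py file name are made up for the example:

```python
# calc.py -- minimal python-fire sketch (Calculator is a made-up example class)
import fire


class Calculator:
    """Toy object exposed as a CLI."""

    def add(self, x: int, y: int) -> int:
        return x + y

    def multiply(self, x: int, y: int) -> int:
        return x * y


if __name__ == "__main__":
    # fire.Fire() turns the object's public methods into subcommands
    fire.Fire(Calculator)
```

Invoked as, for example, python calc.py add 1 2: each method becomes a subcommand whose arguments are parsed from the command line.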

Thursday, June 29, 2017

[Links of the Day] 29/06/2017 : BeeGFS distributed FS, Virtual memory in Big memory systems, PdfX

  • An Introduction to BeeGFS : Fraunhofer's distributed parallel file system for HPC systems. Mainly a competitor to Lustre, I would say. This has potential, and they are making a foray into the business side of storage. Let's see how well they fare. [website]
  • Preserving the Virtual Memory Abstraction : the author's work aims at maintaining the virtual memory abstraction across a variety of hardware implementations. [thesis]
  • PDFx : really cool tool that allows you to extract all the references and metadata from a paper and download them!! See the sketch of the core idea just below.
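
This is not PDFx's actual code, just a toy sketch of the underlying idea: scan a paper's extracted text for URL- and DOI-shaped references that a tool could then download.

```python
import re

# Patterns for URL- and DOI-shaped references (illustrative, not exhaustive).
URL_RE = re.compile(r'https?://[^\s<>")]+')
DOI_RE = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')


def find_references(text: str) -> list:
    """Return the unique URL/DOI references found in extracted PDF text."""
    return sorted(set(URL_RE.findall(text)) | set(DOI_RE.findall(text)))


sample = "See https://example.org/paper.pdf and DOI 10.1145/3132747.3132756 for details."
print(find_references(sample))
```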


Wednesday, April 26, 2017

[Links of the Day] 26/04/2017 : Aphyr Scala Day, Sia blockchain file storage , How brains are built

  • Aphyr Scala Day 17 : Aphyr breaks databases for a living and then talks about it :)
  • Sia : a blockchain-based marketplace for file storage. The really attractive thing is the cost comparison of Sia vs public cloud systems, which comes out between ten and a hundred times cheaper than S3 or other similar solutions (see the back-of-the-envelope sketch after this list). I would be curious to see the performance, though.
  • How brains are built: High-level overview of principles of computational neuroscience.
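
A back-of-the-envelope sketch of that cost comparison; the per-GB-month prices below are rough illustrative assumptions, not actual quotes:

```python
# Illustrative, assumed prices in USD per GB-month -- not actual quotes.
S3_STANDARD = 0.023  # rough S3 list price around that time
SIA_MARKET = 0.002   # rough Sia marketplace price


def monthly_cost(gb: float, price_per_gb: float) -> float:
    return gb * price_per_gb


TB = 1024  # GB
print(f"S3:  ${monthly_cost(TB, S3_STANDARD):.2f}/month per TB")
print(f"Sia: ${monthly_cost(TB, SIA_MARKET):.2f}/month per TB")
print(f"~{S3_STANDARD / SIA_MARKET:.0f}x cheaper under these assumptions")
```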


Monday, November 07, 2016

[Links of the Day] 07/11/2016 : Baidu Open Source Repo(s), Wan Replicated DB

  • Baidu : Baidu open sourced code on Github. It looks like it replicates a lot of the services / features that other hyperscale systems use. Raft seems to be the default underlying consensus protocol for all applications. A lot of nice goodies in there, especially:
    • BFS : Baidu file system that provides the underlying persistence for Baidu's real-time applications. It's a distributed, multi-datacenter file system that uses Raft for metadata coherence and a shared-nothing approach for linear scalability (see the sketch after this list).
    • Tera : Distributed database 
    • Galaxy : Mesos / Kubernetes equivalent.
    • Paddle : Distributed machine learning 
    • iNexus : Distributed K/V store. Looks similar to Consul, and it also uses Raft as the underlying consensus protocol.
  • Bedrock : WAN-replicated distributed database. Designed to exploit SSDs, among other nice features.
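
A minimal sketch of the shared-nothing routing idea mentioned for BFS above (an illustration built on assumptions, not BFS's actual design): hash each file path to one of N independent metadata shards; each shard would internally run its own Raft group, so shards never coordinate with each other and capacity scales with the shard count.

```python
import hashlib

NUM_SHARDS = 16  # illustrative; each shard would be its own Raft group


def metadata_shard(path: str) -> int:
    """Route a file path to one independent metadata shard."""
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


# Deterministic routing: the same path always lands on the same shard.
print(metadata_shard("/logs/2016-11-07/app.log"))
```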


Tuesday, October 11, 2016

Notes on SNIA Storage Developer Conference 2016

This year's SNIA Storage Developer Conference, chosen bits:
  • MarFS : scalable near-POSIX file system using object storage. What is really impressive is that MarFS is part of a 5-tier storage system in the Trinity project. Yes, FIVE tiers: RAM -> Burst Buffer -> Lustre -> MarFS -> Tape. MarFS sits above Tape for long-term archival and aims at providing storage persistence that spans years of usage. In comparison, Lustre, the tier just above it, aims at keeping data for weeks only. What bothers me is the logic behind this approach, as most supercomputer systems have a 5-6 year lifespan. This implies that the project's usage will span multiple generations of systems. [Github]
  • Hyperconverged Cache : It seems that Intel is starting to realize what we discovered years ago in the Hecatonchire project: once you reach near-RAM performance, disaggregating and pooling your resources becomes the natural next step for efficiency. And this is what they aim to achieve with a distributed storage cache that aggregates their 3D XPoint devices across a cluster in order to deliver a fast and coherent cache layer. However, without RDMA this approach seems a little bit pointless. The only thing that seems to save them is that the cloud storage backend (Ceph) has a big enough latency gap for them to exploit.
  • Erasure Code : Very good overview of modern erasure codes and their trade-offs. As always, no two codes are equal, and not all use cases are the same (see the worked example just below).
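
For a worked illustration of the trade-off space (the simplest possible code, not one from the talk): a single XOR parity block over k data blocks tolerates one loss at 1/k storage overhead, while richer codes such as Reed-Solomon buy multi-failure tolerance at higher encode and repair cost.

```python
def xor_blocks(blocks):
    """XOR equal-length blocks together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


data = [b"AAAA", b"BBBB", b"CCCC"]  # k = 3 data blocks
parity = xor_blocks(data)           # one parity block: 33% overhead

# Lose any single block: the XOR of all survivors reconstructs it.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
```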

Persistent Memory : As storage shifts away from HDD to persistent memory, the number of talks around persistent memory exploded this year. The main focus seems to have shifted from pure NVM consumption to remote access models.
  • NVMe over fabric : two talks on the recent progress of NVMe over fabrics. Nothing really new there, just that it seems it will be the standard for remote storage access in the near future. [Mellanox] [Linux NVMf]
  • RDMA : It seems that Intel and others are aiming for direct persistent memory access using RDMA, bypassing the NVMe stack. The idea is to eliminate the latency of the NVMe stack. However, this requires some changes in the RDMA stack in order to guarantee persistence of data.
    • IOPMEM : interesting work where the authors propose to bypass CPU interaction between PCIe devices, basically enabling DMA between NVM and other devices. It then allows an RDMA NIC to talk directly to the NVM device on the same PCIe switch. However, it doesn't really explain what persistence guarantees are associated with the different operations.
    • RDMA verbs extension : basically, Mellanox proposes to add an RDMA flush verb that would mimic the CPU flush instruction. This operation would guarantee consistency and persistence of remote data (see the sketch after this list).
    • PMoF : addresses the really difficult aspect of guaranteeing persistence and consistency when accessing persistent memory over fabric. Basically, this talk describes all the nitty-gritty details of avoiding data loss or corruption during access over fabric. This is what the RDMA flush verb will need to address, but for the moment it requires a lot of manual operations.
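
To make that write-then-flush pattern concrete, here is a sketch with invented verb names (no real RDMA library exposes this API): an RDMA write that has reached the remote NIC or PCIe domain is not yet durable, so a flush-like verb must push the data into the persistence domain before a completion can mean "committed".

```python
# Hypothetical API sketch: qp stands for an RDMA queue pair; none of
# these method names come from a real verbs library.
def replicate_durably(qp, local_buf, remote_addr, rkey):
    # 1. One-sided write: the data leaves the initiator but may still
    #    sit in the remote NIC or a volatile PCIe/CPU buffer.
    qp.post_rdma_write(local_buf, remote_addr, rkey)
    # 2. Proposed flush verb: force the written range into the remote
    #    persistence domain (the NVDIMM/persistent memory itself).
    qp.post_rdma_flush(remote_addr, len(local_buf), rkey)
    # 3. Only this completion implies the data would survive power loss.
    qp.wait_completion()
```
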
Last but not least, we can see references here and there to Intel's 3D XPoint; however, it seems that the company has toned down its marketing machine, probably fearing some backlash over the continuous claw-backs on its claimed performance.


Thursday, May 12, 2016

[Links of the day] 12/05/2016: Lustre + Omnipath in Bridges Supercomputer & Storage Media Evolution

  • Lustre + Omnipath : the HPC filesystem of choice meets Intel's Omni-Path fabric. Intel was poised to release such a crossover as it continues to push for HPC and rack infrastructure domination. Remember that Intel acquired Whamcloud (Lustre) a while back.
  • Storage Media Overview : historic perspective on storage solutions. Interesting snippet of information: all storage media revenues decreased from 2014 to 2015 except NAND. However, NAND revenue increased by 30% in 2014 but only 3% in 2015, hinting that the technology is plateauing and entering a commoditization phase with lower margins. [Video]
  • Bridges : supercomputer being built at the Pittsburgh Supercomputing Center (PSC); they have a really cool Virtual Tour.


Monday, March 21, 2016

[Links of the day] 21/03/2016: Twitter distributed file system, mechanical computer, broadband access


Friday, March 18, 2016

[Links of the day] 18/03/2016: RethinkDB FS and NSDI16 - Load balancer, reconfigurable fabric

  • Google Load Balancer : Google uses an N+1 model for its load balancers vs the classical active/passive model. Really nice low-level network load-balancing solution.
  • XFabric : Reconfigurable In-Rack Network for Rack-Scale Computers
  • Regrid : a method of storing large files inside a RethinkDB database. Each file is stored as a series of binary chunks inside RethinkDB (see the sketch after this list).
  • NSDI16 : this year's harvest of Usenix networking papers.
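
A toy sketch of the chunking idea behind Regrid (not its actual schema; the chunk size and document shapes are assumptions): split a file into fixed-size binary chunks plus a small metadata record, which is roughly what ends up stored as RethinkDB documents.

```python
CHUNK_SIZE = 256 * 1024  # illustrative chunk size (256 KiB)


def chunk_file(file_id: str, data: bytes):
    """Yield dicts shaped like the documents a chunk store might insert."""
    yield {"table": "files", "id": file_id, "length": len(data)}
    for offset in range(0, len(data), CHUNK_SIZE):
        yield {
            "table": "chunks",
            "file_id": file_id,
            "num": offset // CHUNK_SIZE,
            "data": data[offset:offset + CHUNK_SIZE],
        }


rows = list(chunk_file("report.pdf", b"x" * (2 * CHUNK_SIZE + 5)))
print(len(rows))  # 1 file record + 3 chunks = 4 documents
```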

Tuesday, March 08, 2016

[Links of the day] 08/03/2016 : Deeplearning google tech talk, NVMW 2016 workshop


Wednesday, November 25, 2015

Links of the day 25/11/2015: Micron NVDIMM, Ceph load balancer, Programming with Promise

  • Micron Persistent Memory & NVDIMM : Micron announced the production of 8GB DDR4 NVDIMMs, similar to the Diablo Technologies and Viking NVDIMMs. However, for some reason Micron decided to externalize its super-capacitor to an external module, whereas the other vendors integrated it on the stick itself. The trade-off is that you can fit more silicon on a stick, but it obviously restricts the hardware these can be deployed onto. [slides]
  • Mantle : a programmable metadata load balancer for the Ceph file system. It allows the user to dynamically inject information into the Ceph load balancer in order to optimize data placement and hence the performance of the overall system.
  • How do Promises Work? : excellent coverage of promise technology: how it works, when to use it (or not), and how to avoid some of its pitfalls. See the toy sketch below.
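
To ground the article's subject, a toy sketch of a promise's core state machine in Python (real implementations, e.g. JavaScript's, add scheduling and error-propagation rules that this deliberately skips):

```python
class Promise:
    """Toy promise: pending -> fulfilled/rejected, with chaining via then()."""

    def __init__(self):
        self.state, self.value = "pending", None
        self._callbacks = []  # (on_ok, on_err) pairs registered while pending

    def resolve(self, value):
        if self.state == "pending":
            self.state, self.value = "fulfilled", value
            for on_ok, _ in self._callbacks:
                on_ok(value)

    def reject(self, error):
        if self.state == "pending":
            self.state, self.value = "rejected", error
            for _, on_err in self._callbacks:
                on_err(error)

    def then(self, on_ok, on_err=None):
        nxt = Promise()  # chaining: then() returns a new promise

        def ok(value):
            nxt.resolve(on_ok(value))

        def err(error):
            if on_err:
                on_err(error)
            nxt.reject(error)

        if self.state == "pending":
            self._callbacks.append((ok, err))
        elif self.state == "fulfilled":
            ok(self.value)
        else:
            err(self.value)
        return nxt


p = Promise()
p.then(lambda v: v + 1).then(print)  # prints 43 once p is resolved
p.resolve(42)
```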