
Wednesday, July 08, 2020

Software RDMA revisited : setting up SoftiWARP on Ubuntu 20.04

Almost ten years ago I wrote about installing SoftiWARP on Ubuntu 10.04. Today I will be revisiting the process. First, what is SoftiWARP? Soft-iWARP is a software-based iWARP stack that runs at reasonable performance levels and fits seamlessly into the OFA RDMA environment. It is a software RDMA device that attaches to an active network card to enable RDMA programming. For anyone starting with RDMA programming, RDMA-enabled hardware might not be at hand, and SoftiWARP is a very useful tool to set up an RDMA environment to code and experiment with.

To install SoftiWARP you have to go through four stages: setting up the environment, building SoftiWARP, configuring SoftiWARP, and testing.

Setting up RDMA environment

Before you start, you should prepare the environment for building a kernel module and the userspace libraries.
Basic building environment

sudo apt-get install build-essential libelf-dev cmake

Installing userspace libraries and tools

sudo apt-get install libibverbs1 libibverbs-dev librdmacm1 \
librdmacm-dev rdmacm-utils ibverbs-utils

Insert common RDMA kernel modules

sudo modprobe ib_core
sudo modprobe rdma_ucm


Check if everything is correctly installed : 

sudo lsmod | grep rdma 

You should see something like this : 

rdma_ucm               28672  0
ib_uverbs             126976  1 rdma_ucm
rdma_cm                61440  1 rdma_ucm
iw_cm                  49152  1 rdma_cm
ib_cm                  57344  1 rdma_cm
ib_core               311296  5 rdma_cm,iw_cm,rdma_ucm,ib_uverbs,ib_cm

Now install the build dependencies for the userspace libs : 

sudo apt-get install build-essential cmake gcc libudev-dev libnl-3-dev \
libnl-route-3-dev ninja-build pkg-config valgrind


Installing SoftiWARP

Ten years ago you had to clone the SoftiWARP source code and build it (https://github.com/zrlio/softiwarp.git). Now you are lucky: it is included by default in Linux kernel 5.3 and above!

You just have to type : 

sudo modprobe siw

Verify that it works : 

sudo lsmod | grep siw
You should see : 
siw                   188416  0
ib_core               311296  6 rdma_cm,iw_cm,rdma_ucm,ib_uverbs,siw,ib_cm
libcrc32c              16384  3 nf_conntrack,nf_nat,siw

Moreover, you should check that an InfiniBand device is present : 

ls /dev/infiniband 

Result : 

rdma_cm


You also need to create the file /etc/udev/rules.d/90-ib.rules containing the entries below : 

 ####  /etc/udev/rules.d/90-ib.rules  ####
 KERNEL=="umad*", NAME="infiniband/%k"
 KERNEL=="issm*", NAME="infiniband/%k"
 KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
 KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
 KERNEL=="uat", NAME="infiniband/%k", MODE="0666"
 KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
 KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"
 ########


If it doesn't exist you need to create it. The rules take effect after a reboot, or you can reload them right away as shown below.
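
Assuming udevadm is available (it is on a stock Ubuntu install), reloading the rules should look like this:

sudo udevadm control --reload-rules
sudo udevadm trigger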


I would also suggest adding the modules to the list of modules to load at boot by adding them to the /etc/modules file, as shown below.
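
For example, appending the module names used above (one per line) should do the trick:

 ####  /etc/modules  ####
 ib_core
 rdma_ucm
 siw
 ########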

You now need to reboot your system.


Userspace library

Normally, recent libraries support SoftiWARP out of the box. But if you want to compile your own version, follow the steps below. However, do this at your own risk... I recommend sticking with the standard libs.

Optional: build the SIW userland libraries: 

All the userspace libraries are in a nice single repository. You just have to clone the repo and build all the shared libraries. If you want, you can also build just libsiw, but it's easier to build everything at once. 

git clone https://github.com/zrlio/softiwarp-user-for-linux-rdma.git
cd ./softiwarp-user-for-linux-rdma/
./build.sh

Now we have to set up LD_LIBRARY_PATH so that the built libraries can be found: 
cd ./softiwarp-user-for-linux-rdma/build/lib/
export LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH


Or you can add the line to your .bashrc profile:
export LD_LIBRARY_PATH=<<PATHTOTHELIBRARIES>>:$LD_LIBRARY_PATH

End of optional section



Set up the SIW interface : 


Now we will set up the loopback and a standard Ethernet interface as RDMA devices:

sudo rdma link add <NAME OF SIW DEVICE> type siw netdev <NAME OF THE INTERFACE>


In this case for me : 

sudo rdma link add siw0 type siw netdev enp0s31f6
sudo rdma link add siw_loop type siw netdev lo
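
A quick way to list the configured links is the same rdma tool used above; both devices should show up:

sudo rdma link show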

You can check that the two devices have been correctly set up using the ibv_devices and ibv_devinfo commands.
Result of ibv_devices : 
    device              node GUID
    ------           ----------------
    siw0             507b9ddd7a170000
    siw_loop         0000000000000000

Result of ibv_devinfo : 

hca_id: siw0
 transport:   iWARP (1)
 fw_ver:    0.0.0
 node_guid:   507b:9ddd:7a17:0000
 sys_image_guid:   507b:9ddd:7a17:0000
 vendor_id:   0x626d74
 vendor_part_id:   0
 hw_ver:    0x0
 phys_port_cnt:   1
  port: 1
   state:   PORT_ACTIVE (4)
   max_mtu:  1024 (3)
   active_mtu:  invalid MTU (0)
   sm_lid:   0
   port_lid:  0
   port_lmc:  0x00
   link_layer:  Ethernet
hca_id: siw_loop
 transport:   iWARP (1)
 fw_ver:    0.0.0
 node_guid:   0000:0000:0000:0000
 sys_image_guid:   0000:0000:0000:0000
 vendor_id:   0x626d74
 vendor_part_id:   0
 hw_ver:    0x0
 phys_port_cnt:   1
  port: 1
   state:   PORT_ACTIVE (4)
   max_mtu:  4096 (5)
   active_mtu:  invalid MTU (0)
   sm_lid:   0
   port_lid:  0
   port_lmc:  0x00
   link_layer:  Ethernet

Testing with RPING: 

Now we simply test the setup with rping : 

In one shell : 
rping -s -a <serverIP> 

In the other : 

rping -c -a <serverIP> -v 
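
For example, a quick check against the loopback device configured earlier (the -C flag simply limits the client to a fixed number of pings):

rping -s -a 127.0.0.1 -v

and in the other shell : 

rping -c -a 127.0.0.1 -v -C 5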


And you should see the rping working successfully! 

You are now all set to use RDMA without the need for expensive hardware. 
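
If you want to check the setup from code as well, below is a minimal sketch (my own example, not part of any package above) that lists the available RDMA devices through libibverbs. Compile it with gcc list_devices.c -o list_devices -libverbs :

/* list_devices.c : enumerate RDMA devices via libibverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    /* Ask the verbs library for every RDMA device it knows about */
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }
    printf("Found %d RDMA device(s)\n", num);
    for (int i = 0; i < num; i++)
        printf("  %s\n", ibv_get_device_name(list[i]));
    ibv_free_device_list(list);
    return 0;
}

You should see siw0 and siw_loop listed, matching the output of ibv_devices.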


Tuesday, March 05, 2019

[Links of the Day] 05/03/2019 : RDMA for containers, K8s for hobbyist, OpenSource mobile core network

  • FreeFlow : virtual RDMA networking for containers, implemented purely in software on top of commodity RDMA NICs. 
  • K8s clusters for hobbyists : how to set up and operate a fully functional, secure Kubernetes cluster on a cloud provider such as Hetzner Cloud, DigitalOcean or Scaleway.
  • Magma : open-source software platform that gives network operators an open, flexible and extendable mobile core network solution.

Friday, April 14, 2017

[Links of the Day] 14/04/2017 : OpenFabrics Workshop , Docker's Containerd , Category Theory

  • OpenFabrics Workshop 2017 : Some interesting talks this year at the OpenFabrics conference:
    • uRDMA : Userspace RDMA using DPDK. This opens up a number of possibilities, especially for object storage solutions. [Video , Slides, github]
    • Crail : Uses uRDMA (above) to deliver an accelerated storage solution for Apache big data projects. [Slides, github]
    • Remote Persistent Memory : I think this is the next killer app for RDMA, if Intel doesn't jump onto it and deliver a DPDK-like solution first. [Video, Slides]
    • On-Demand Paging : the tech is slowly crawling its way up to upstream acceptance. While on-demand paging introduces a certain performance cost, it also allows greater flexibility in consuming RDMA. One interesting aspect that nobody has mentioned yet is how this feature could be used with persistent memory; I think there is good potential for p2p NVM storage solutions. [Video, Slides]
  • Containerd : Containerd moved to GitHub; the Docker "industry standard" container runtime is also reaching its v0.2.x release. [github]
  • Category Theory : If you are into functional programming and Haskell, this is a must-read book for you.

Tuesday, October 11, 2016

Notes on SNIA Storage Developer Conference 2016

This year's SNIA Storage Developer Conference, chosen bits : 
  • MarFS : scalable near-POSIX file system using object storage. What is really impressive is that MarFS is part of a five-tier storage system for the Trinity project. Yes, FIVE tiers: RAM -> BurstBuffer -> Lustre -> MarFS -> Tape. MarFS sits just above Tape for long-term archival and aims at providing storage persistence that spans years of usage. In comparison, the Lustre tier just above it aims at keeping data for weeks only. What bothers me is the logic behind this approach, as most supercomputer systems have a 5-6 year lifespan. This implies that the project's usage will span multiple generations of systems. [Github]
  • Hyperconverged Cache : It seems that Intel is starting to realize what we discovered years ago in the Hecatonchire project: once you have near-RAM performance, disaggregating and pooling your resources becomes the natural next step for efficiency. This is what they aim to achieve with a distributed storage cache that would aggregate their 3DXPoint devices across a cluster in order to deliver a fast and coherent cache layer. However, without RDMA this approach seems a little pointless. The only thing that seems to save them is that the cloud storage backend (Ceph) has a big enough latency gap for them to exploit. 
  • Erasure Code : Very good overview of modern erasure codes and their trade-offs. As always, no two codes are equal, and not all use cases are the same. 

Persistent Memory : As storage shifts away from HDD to pmem, the number of talks around persistent memory exploded this year. The main focus seems to shift from pure NVM consumption to remote access models. 
  • NVMe over fabric : two talks on the recent progress of NVMe over fabrics. Nothing really new there; it just seems that it will become the standard for remote storage access in the near future. [Mellanox] [Linux NVMf]
  • RDMA : It seems that Intel and others are aiming for direct persistent memory access using RDMA, bypassing the NVMe stack. The idea is to eliminate the latency of the NVMe stack. However, this requires some changes to the RDMA stack in order to guarantee persistence of data. 
    • IOPMEM : interesting work where the authors propose to bypass CPU interaction between PCIe devices, basically enabling DMA between NVM and other devices. This allows an RDMA NIC to talk directly to the NVM device on the same PCIe switch. However, it doesn't really explain what persistence guarantees are associated with the different operations.
    • RDMA verbs extension : basically, Mellanox proposes to add an RDMA flush verb that would mimic the CPU flush command. This operation would guarantee consistency and persistence of remote data. 
    • PMoF : addresses the really difficult aspect of guaranteeing persistence and consistency when accessing persistent memory over fabric. Basically, this talk describes all the nitty-gritty details needed to avoid losing or corrupting data during access over fabric. This is what the RDMA flush verb will need to address; for the moment it requires a lot of manual operations. 
Last but not least, we can see references here and there to 3DXPoint from Intel; however, it seems the company has toned down its marketing machine, probably fearing some backlash over the continuous claw-back on claimed performance. 



Wednesday, September 14, 2016

[Links of the day] 14/09/2016 : Ethics in AI , Survey of fully homomorphic encryption, RDMA over Ethernet at scale at Microsoft

  • Ethical Preference-Based Decision Support Systems : AI and other autonomous agents are starting to become ubiquitous in the human environment. As the decisions of these systems start to have a greater impact on our daily lives, trust will need to be built, and to achieve that these systems will need to be perceived as acting in a moral and ethical way. 
  • A brief survey of Fully Homomorphic Encryption, computing on encrypted data : fully homomorphic encryption allows you to manipulate encrypted data without decrypting it. This is great for databases and other systems, as it allows a service to modify and update information without needing to know its content, effectively partitioning operation from knowledge. However, this comes at a cost (but it's going down). We might finally end up with the security pipe dream where data is encrypted immediately and is only manipulated in that form until it is finally consumed.
  • RDMA over Commodity Ethernet at Scale : It is interesting to see RDMA slowly start to permeate hyperscale datacenters. It is even more interesting to see that Microsoft decided to go for the RoCE version of it instead of InfiniBand. It makes sense, as there was a lot of investment in scaling Ethernet for their cloud infrastructure, and it allows a lot of reuse and collocates normal and RDMA traffic on a single underlying fabric.


Friday, May 13, 2016

[Links of the day] 13/05/2016 : NVMesh , NVM file system

  • nvmesh : pure software product using a shared-nothing architecture that leverages NVMe SSDs, SR-IOV and RDMA. Performance is interesting: 4M read and 2.8M write 4K IOPS, 16GB/s throughput, and super low latency with 90µs/25µs for read and write from client to server. What is really interesting is the dual mode of operation: shared-nothing with direct storage access for really fast access, or a centralized one which offers more redundancy and serviceability features at the cost of lower (but still fast) performance. [video]
  • Fine-grained Metadata Journaling on NVM : the authors propose to move away from the limitations of block-based journaling to a fine-grained approach more suitable for NVM storage. They propose to move to an inode-based transaction and journaling approach, with each inode representing 256 bytes. The solution seems cache friendly; however, it begs the question: why do we need to go through the CPU at all? With DAX and other systems it should be more efficient to bypass it completely. [slides]
  • Fast and Failure-Consistent Updates of Application Data in Non-Volatile Main Memory File System : being crash consistent is the number 1 requirement for any storage solution, and current file systems optimized for NVM don't seem to be good enough. The authors propose an alternative file system specifically tailored for consistency and high performance by moving away from FS-level consistency and targeting application-level consistency instead. Naturally, this puts a greater burden on the application layer. Then again, researchers really need to move away from classical FS solutions and deliver a new paradigm. [slides]


Wednesday, May 04, 2016

[Links of the day] 04/05/2016 : Openserver Summit & Fortran OpenCoArray


  • OpenCoArray : Fortran is not dead, and the work on coarrays with accelerators demonstrates it.
  • Openserver Summit
    • PCIe 4.0 : Some really nice improvements in the upcoming standard in terms of performance and especially RAS. However, no MR-IOV capability yet... This is sorely missing to make PCIe a true contender at the rack-scale fabric level. 
    • Azure SmartNIC : Microsoft uses FPGA-based SmartNICs to shorten the update cycle of their Azure cloud fabric. It's a really impressive solution. 
    • Persistent Memory over Fabrics : Mellanox pushing for an RDMA-based persistent memory solution, probably trying to corner the market quickly, as the 3DXPoint and Omni-Path solutions from Intel are just around the corner. However, what caught my attention is slide 14: HGST PCM Remote Access Demo. What is really interesting is that HGST is probably one step away from merging NVM and an RDMA fabric into a single package. With that, they would be able to offer direct competition to DSSD at lower cost (following the Ethernet drive model). 

Tuesday, January 26, 2016

[Links of the day] 26/01/2016 : All about SNIA NVM summit 2016

NVM Summit : January 20th 2016 SNIA summit on non-volatile memory; here are some of the interesting slide decks:
  • Solid State Storage Market : nothing really new; we have to wait until 3DXPoint reaches the market to shake things up. Hopefully it will help accelerate the drop in $ per GB, even if the trend shown below seems to stall over the next few years. 
  • Going Remote at Low Latency : a look into what type of changes would be necessary to improve the RDMA API in order to facilitate direct NVM access, bypassing NVMe over fabrics altogether.
  • Persistent Memory over Fabric : Well, Mellanox is obviously hedging its bets here, but let's see how the NVMf stack evolves.

Wednesday, December 16, 2015

Links of the day 16/12/2015 : curl | sh, Huawei NUWA and EU Mikelangelo project

  • curl | sh : People telling people to execute arbitrary code over the network. What harm can it do ?
  • Nuwa : Huawei micro-server with a disaggregated architecture based on commercial chips, where the computing and storage nodes can be switched on demand.
  • Mikelangelo : EU project with some interesting bits of technology and consortium members: Huawei, Cloudius (for OSv). Note the vRDMA concept, where they aim to use a shared-memory approach over RDMA to speed up communication across VMs.

Thursday, November 05, 2015

Links of the day 05/11/2015: SNIA conf goodness , Parallella , Worst case Distributed system design

  • Worst-Case Distributed Systems Design : nice blog post on the advantages of designing distributed systems to handle worst-case scenarios gracefully in order to improve average-case behavior.
  • Parallella : 18-core credit card sized computer, nice playground for parallel programming ( and hence publications)
  • 2015 Storage Developer Conference Presentations | SNIA : slide deck of the conference, some gems in there: 
    • RDMA + PMEM : coupling pmem with RDMA in order to deliver remote persistent memory.
    • NVDIMM cookbook : very good overview of what you can do with NVDIMMs and their use cases.
    • Hashing algos for storage : very good overview of the main hashing techniques for K/V and their trade-offs (not just for storage).
    • Pros & Cons of Erasure Code vs Replication vs RAID : as always, it depends, but here is the exec summary: RAID is reaching its limits; erasure code is the preferred option at large scale; however, replication is required if you want certain types of performance. Finally, everything will really depend on the software-defined storage system running it.

Wednesday, October 28, 2015

Links of the day 28/10/2015: K/V DB for side data, RDMA + HTM , metrics 2.0

  • PalDB : embeddable write-once key-value store written in Java by the LinkedIn crowd. This solution kind of sits between json/file and level/rocksdb. [github]
  • RDMA + HTM : very nice concept; I would love to see the extension of HTM to NVM tech in order to close the loop. This would deliver tremendous speedup capability, but also another type of concurrency model to understand. 
  • Metrics 2.0 : emerging set of conventions, standards and concepts around time series metrics metadata

Wednesday, September 16, 2015

The upcoming Storage API battle

There is an interesting trend within the storage ecosystem: we are witnessing a polarization of the offering. On one side, we are seeing the rise of high-performance rack-scale solutions (DSSD, NVMe over fabrics solutions, etc.). On the other side we have the object storage solutions, which are more datacenter scale. While both leverage non-volatile memory heavily, they play different roles within the ecosystem.

Rack-scale storage targets "very" high performance, delivering very low latency and high bandwidth access times, often in the 100s of microseconds or less. However, these solutions often come at a higher financial cost due to more expensive hardware (custom NVM), network fabric (IB+NVMe, PCIe+NVMe, Omni-Path+NVMe, pure PCIe, etc.) and significant power consumption (>2000W/5U for DSSD). Finally, they offer access via specialized APIs that need to be either accessed natively or adapted to other, more standard ones.
On the other side we have the object storage solutions. Users access object storage through applications that typically use a REST API. This makes object storage ideal for online, cloud environments. Moreover, they tend to be a lot more cost efficient, especially with the rise of Ethernet-connected drives (up to 50% less TCO).
Stuck in the middle is the classic filer / POSIX-compliant solution, which seems to be slowly dwindling away. To a certain extent, the rack-scale solutions should have a bright future in the niche (but still significant) market of enterprises that consider that their application requires a custom solution for what they think is a custom problem. On the other side, object storage is gaining momentum by riding the unstoppable cloud tide.



While both technologies can and should co-evolve, they both suffer from software limitations, and to a certain extent hardware ones, as bandwidth and latency get dangerously close to what CPUs are capable of handling. This requires a drastic shift in how applications are developed if users want to actually get any benefit from these solutions. However, few companies are willing to risk specializing their code against an API that can become obsolete when the next generation of storage solutions pops up.
Storage startups and companies out there are starting to discover that the API plays a significant role in the success of their product, and that performance, while still important, will lose some of its importance. The API will either make rewriting applications to access your storage an infinitely easier task, or transform it into a painful experience by forcing developers to go through hoops and/or adaptation layers with the associated performance cost. 
The fight for the next-generation storage API has only started. There is, and will be, more push toward standardization, fueled by customers' tiredness with ever-revolving siloed point solutions. People who use object storage want it to behave more like POSIX storage, but they also want to keep storage costs at an object level and improve the performance. On the other hand, people using rack-scale storage want to retain the performance but increase its simplicity, and also want the price to come down. It is going to be extremely hard to deliver both, but hopefully we might finally see a rationalization of the storage market, as having an object storage system that allows byte-range access is very appealing. 

Friday, August 28, 2015

Links of the day 28/08/2015 : libfabric, IO visor and demoscene

  • IO visor : it seems that the efforts from PLUMgrid around eBPF are picking up speed, and an official Linux Foundation project is being set up. However, one must wonder how such a solution will compete against the pure userspace solutions relying on DPDK and consorts. You can find a more in-depth slide deck of the concept [here].
  • Libfabric : Intel is announcing with great pomp the "open source" library supporting its Omni-Path fabric. However, it is none other than the fantastic libfabric. This library offers a set of next-generation, community-driven, ultra-low latency networking APIs. The APIs are not tied to any particular networking hardware model: they support InfiniBand/iWARP, usNIC from Cisco, and OPA from Intel. What is interesting is that it goes one step further than the RDMA library while maintaining a good balance between low-level tuning and high-level programmability. While the learning curve might be a little steeper compared to Accelio from Mellanox, it delivers (I think) greater advantages and flexibility.
  • Winning 1kb intro : released at Assembly 2015, prepare to be amazed


Tuesday, August 18, 2015

Links of the day 18/08/2015 : #RDMA fabric, #ReRAM NVDIMM, #Micron & 64bit dev problem

  • Congestion Control for Large-Scale RDMA Deployments : Microsoft Research paper demonstrating how they handle flow control in mixed-use environments with RoCE. What's interesting is that it shows Microsoft is starting to heavily leverage RDMA in its datacenters in order to deliver fast, guaranteed performance, especially in the storage domain.
  • ReRAM Storage Class Memory NVDIMM : Viking Technology already had a Flash NVDIMM offering, but it seems they are expanding with the faster ReRAM type of tech, probably in direct competition with 3DXPoint from Intel. I need to dig into the specs to see what the trade-offs are between the different techs in terms of cost/GB/perf.
  • 64 bit program development forgotten problems : me bug you long long time.
  • Micron analyst presentation : interesting to see that the number of datacenter customers is roughly equal to the number of enterprise ones. Obviously we are at the tipping point where datacenter will clearly overtake enterprise in IT HW consumption, following the natural shift from product to utility. Does that mean that traditional storage vendors will be relegated to a niche market in the future? 






Monday, August 17, 2015

Links of the day 17/08/2015 : #NVMe & #RDMA , #Strategy , Cryptography in hostile environment


  • NVMe over RDMA fabric : interesting bit: PMC-Sierra and Mellanox unveiled NVMe over RDMA fabric as well as peer-direct technology for NVM storage. This opens up a world of possibilities where you could combine GPU - NVM(e) - RDMA without CPU involvement, literally offloading all the storage operations.
  • Strategy Scenario and the use of mapping : excellent series of posts by Simon Wardley showing how leveraging his mapping technique allows CEOs and CIOs to navigate tortuous strategic decisions. The analysis of the scenario can be found here. 
  • The network is hostile : TL;DR: we don't encrypt enough and early enough




Friday, June 19, 2015

Links of the day 19 - 06 - 2015

Today's links 19/06/2015: Google network, Replication with RDMA, Triton #container

  • Google Network : TL;DR version without the "OMG we are so great" tone: SDN relying on an underlying Clos switch topology. [Video]
  • DRBD9 : now with RDMA added... the next DPDK? We are slowly moving away from all-kernel solutions for low latency toward a userspace-oriented ecosystem.
  • Triton : Joyent's Triton container tech. Interestingly enough, they offer CPU bursting and RAM pooling for extra oomph when needed (and if available).



Tuesday, May 05, 2015

The rise of micro storage services


Current and emergent storage solutions are composed of sophisticated building blocks: dedicated fabrics, RAID controllers, layered caches, object storage, etc. There is a feeling that storage is going against the current evolution of the overall industry, where complex services are composed of small, independent processes and services, each organized around individual capabilities.

What surprises me is that most storage innovation trends focus on very high-level solutions that try to encompass as many features as possible in a single package. Presently, insufficient effort is being made to build storage systems based on small, independent, low-care and indeed low-cost components. In short, what is lacking are nimble, independent modules that can be rearranged to deliver the optimal solution based on the needs of the customer, without the requirement to roll out a new storage architecture every time: a "jack of all trades" without, or with limited, "master of none" drawbacks. Put another way, modules that extend or mimic what is happening in the container/microservices space.

Ethernet Connected Drives

Despite this, all could change rapidly, as an enabler (or precursor, depending on how you look at it) of this alternative approach is currently emerging, surprisingly, from the hard drive vendors: Ethernet Connected Drives [slides][Q&A]. This type of storage technology is going to enable the next generation of hyperscale cloud storage solutions, and therefore massive scale-out potential with better simplicity and maintainability, not to mention lower TCO.

Ethernet Connected Drives are a step in the right direction, as they allow a reduction in capital and operating costs by reducing:
  • the software stack (file system, volume manager, RAID system);
  • the corresponding server infrastructure, connectivity costs and complexity; 
  • granularity, which enables greater variable costs by application (e.g. cold storage, archiving, etc.).
Currently, there are two vendors offering this solution: Seagate with Kinetic and HGST with the Open Ethernet Drive. In fact, we are already seeing some rather interesting applications of the technology. Seagate released a port of the Sheepdog project onto its Kinetic product [Kinetic-sheepdog], thereby enabling the delivery of a distributed object storage system for volume and container services that doesn't require dedicated servers. Indeed, there is a proof of concept presented at HEPiX of an HGST drive running Ceph or dCache. While these solutions don't fit all scenarios, both of them demonstrate the versatility of the technology and its scalability potential (not to mention the cost savings).

What these technologies enable is basically the transformation of the appliances that house masses of HDDs into switches, thereby eliminating the need for a block or file head, as there is now straight IP connectivity to the drive, making these ideal for object-based backends.

Emergence of fabric connected Hardware storage:

What we should see over the next couple of years is the emergence of a new form of storage appliance acting as a fabric facilitator for a large number of compute- and network-enabled storage devices. To a certain extent it would be similar to HP's Moonshot, except with a far greater density.

Rather than just focusing on Ethernet, it would be easy to see PCIe, Intel photonics, InfiniBand or more exotic fabrics being used. Obviously Ethernet still remains the preferred solution due to its ubiquity in the datacenter. However, we should not underestimate the need for a rack-scale approach, which would deliver greater benefits if designed correctly.
While the HGST Open Ethernet solution is one good step toward the nimble storage device, the drive enclosure form factor is still quite big, and I wouldn't be surprised if we see a couple of start-ups coming out of stealth mode in the next couple of months with fabric-connected (PCIe most likely) Flash. This would be an equivalent of the Ethernet connected drive, interconnected using a switch + backplane fabric as shown in the crudely designed diagram below.







Is it all about hardware?

No, indeed quite the opposite. That said, there is a greater chance of penetration of new hardware in the storage ecosystem as compared to the server market. This is probably where ARM has a better chance of establishing a beachhead within the hyperscale datacenter, as the microserver path seems to have failed.
What this implies is that it is often easier to deliver and sell a new hardware or appliance solution in the storage ecosystem than a pure software one. Software solutions tend to take a lot longer to get accepted, but when they pierce through, they quickly take over and replace the hardware solutions. Look at object storage solutions such as Ceph or other hyper-converged solutions: they are a major threat to the likes of NetApp and EMC.
To get back to the software side, I would predict that history repeats itself to varying degrees of success or failure. Indeed, as with the microserver story, while hardware micro-storage solutions are rising, at the same time we see the emergence of software solutions that will deliver more nimble storage features than before.

In conclusion, I feel that we are going to see the emergence of many options for massive scale-out, using different variants of the same concept: take the complex storage system and break it down to its bare essential components; expose each single element as its own storage service; and then build the overall offering dynamically from the ground up. Rather than leveraging complex pooled storage services, we would have dynamically deployed storage applications for specific demands, composed of a suite of small services, each running in its own process and communicating with lightweight mechanisms. These services are built around business capabilities and are independently deployable by fully automated deployment machinery, with a minimum of centralized management; they may be written in different programming languages and use different data storage technologies. This is just the opposite of current offerings, where a lot of monolithic storage applications (or appliances) are scaled by replicating across servers.

This type of architecture would enable a true, on-demand, dynamic tiered storage solution. To reuse a current buzzword, this would be a "lambda storage architecture".
But this is better left for another day's post, which would look into such an architecture and the lifecycle management entities associated with it.





Monday, April 20, 2015

Links of the day 20 - 04 - 2015

Today's links 20/04/2015: NVMe over RDMA, Trademark , Pub-Sub for geo-replicated services , multicore middleware
  • Mangstor : demonstration of "NVMe over RDMA" shows over 2 million NVMe 4KB random read and ~1.7M random write low-latency IOPS, fully saturating dual 40Gb Ethernet ports.
  • Trademark : somebody trademarked "THE DATA CENTER IS THE COMPUTER "
  • Wormhole : Reliable Pub-Sub to Support Geo-replicated Internet Services at Facebook
  • MDTM : the Multicore-Aware Data Transfer Middleware project aims to accelerate data movement on multicore systems. It addresses inefficiencies in existing data movement tools when running on multicore systems by harnessing multicore parallelism to scale data movement on end systems.

Monday, April 06, 2015

Links of the day 06 - 04 - 2015

Today's links 06/04/2015: HPC cache, architecture pattern, virtual RDMA, Queue
  • cachelot : High-performance cache library and distributed caching server. Memcached compatible.
  • Software Architecture Patterns : Understanding Common Architecture Patterns and When to Use Them, by Mark Richards
  • Virtual RDMA : presentation on a virtual RDMA device using SR-IOV for virtual environments.
  • libtorrent alert queue : new architecture of the libtorrent alert queue using a heterogeneous queue

Wednesday, March 25, 2015

Links of the day 25 - 03 - 2015

Today's links 25/03/2015: #RDMA, #NVM, Google Build tools, OFS dev workshop