Monday, August 22, 2011

Soft RoCE, an alternative to Soft-iWARP

Introduction

The Soft RoCE distribution is now available as a specially patched OFED-1.5.2 distribution, known as OFED-1.5.2-rxe. Users familiar with the installation and configuration of OFED software will find it easy to use. It is supported by System Fabric Works. Please refer to the official Soft RoCE website for further details.

Features: 

Provide Infiniband-like performance and efficiency to ubiquitous Ethernet infrastructure.
  • Utilize the same transport and network layers from the IB stack and swap the link layer for Ethernet.
    • Implement IB verbs over Ethernet.
  • Not quite IB strength, but it’s getting close.
  • As of OFED 1.5.1, code written for OFED RDMA automatically works with RoCE.
Performance:
(from IMPLEMENTATION & COMPARISON OF RDMA OVER ETHERNET)


  • RoCE is capable of providing near-Infiniband QDR performance for:
    • Latency-critical applications at message sizes from 128B to 8KB
    • Bandwidth-intensive applications for messages <1KB
  • Soft RoCE is comparable to hardware RoCE at message sizes above 65KB.
  • Soft RoCE can improve performance where RoCE-enabled hardware is unavailable.

Installation 

The Soft RoCE distribution contains the entire OFED-1.5.2 distribution, with the addition of the Soft RoCE code.

Download link : http://www.systemfabricworks.com/downloads/roce

Installation of the OFED-1.5.2-rxe distribution works exactly the same as a “standard” OFED distribution. Installation can be accomplished interactively, via the “install.pl” program, or automatically via “install.pl -c ofed.conf”. The required new components are “librxe” and “ofa-kernel”; the latter is not new, but in our "rxe" version of the OFED distribution it includes the rxe/Soft RoCE kernel module.
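For example, a typical sequence might look like the following sketch (the tarball name is an assumption; use the archive you actually downloaded):

 tar xzf OFED-1.5.2-rxe.tgz        # unpack the distribution (name assumed)
 cd OFED-1.5.2-rxe
 sudo ./install.pl                 # interactive installation
 # or, non-interactively, reusing a saved configuration:
 sudo ./install.pl -c ofed.conf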
 

Usage

After installing OFED-1.5.2-rxe, you can use the rxe_cfg command to configure Soft RoCE. Below I list a few of the most useful commands.
# rxe_cfg -h
Usage:
rxe_cfg [options] start|stop|status|persistent|devinfo
rxe_cfg debug on|off| (Must be compiled in for this to work)
rxe_cfg crc enable|disable
rxe_cfg mtu [rxe0]  (set ethernet mtu for one or all rxe transports)
rxe_cfg [-n] add eth0
rxe_cfg [-n] remove rxe1|eth2
Options:
 -n: do not make the configuration action persistent
 -v: print additional debug output
 -l: in status display, show only interfaces with link up
 -h: print this usage information
 -p 0x8916: (start command only) - use specified (non-default) eth_proto_id
1. Enable the Soft-RoCE module
rxe_cfg start
# rxe_cfg start

Name  Link  Driver   Speed   MTU   IPv4_addr        S-RoCE  RMTU
eth0  yes   bnx2             1500  198.124.220.136
eth1  yes   bnx2             1500
eth2  yes   iw_nes           9000  198.124.220.196
eth3  yes   mlx4_en  10GigE  1500  192.168.2.3
rxe eth_proto_id: 0x8915

2. Disable the Soft-RoCE module
rxe_cfg stop

3. Add an Ethernet interface to the Soft-RoCE module
rxe_cfg add [ethx]

4. Remove an Ethernet interface from the Soft-RoCE module
rxe_cfg remove [ethx|rxex] 
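Putting these together, a minimal workflow on a host with a plain Ethernet NIC might look like this sketch (eth0 and rxe0 are example names; substitute your own interfaces):

 rxe_cfg start          # load the rxe modules
 rxe_cfg add eth0       # create an rxe device (e.g. rxe0) on top of eth0
 rxe_cfg status         # verify that the new rxe device shows up
 ibv_devices            # the rxe device should also be visible through libibverbs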
 
Tuning for performance:

1) MTU size
The Soft-RoCE interface only supports four MTU sizes: 512, 1024, 2048 and 4096. To maximize performance, choose 4096.
Commands: ifconfig [ethx] mtu 9000 // set a jumbo-frame MTU on the underlying Ethernet interface
rxe_cfg mtu [rxex] 4096 // set the maximum MTU on the corresponding rxe interface


Note: you also need to enable jumbo frame support on your switch.

2) CRC checking
To maximize performance, we need to disable CRC checking.
Commands: rxe_cfg crc disable

3) Ethernet tx queue length
Also, we need to give a large value to the txqueuelen parameter of the underlying Ethernet interface.
Commands: ifconfig [ethx] txqueuelen 10000
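Taken together, a tuning pass for one interface could look like the sketch below (eth2 and rxe0 are placeholder names; pick the pair reported by rxe_cfg status):

 ifconfig eth2 mtu 9000          # jumbo frames on the underlying NIC (switch must allow them)
 rxe_cfg mtu rxe0 4096           # largest MTU supported by the rxe transport
 rxe_cfg crc disable             # skip software CRC generation and checking
 ifconfig eth2 txqueuelen 10000  # deeper transmit queue on the NIC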

Wednesday, May 04, 2011

Architecture Overview of an Open Source Low TCO cloud storage system

I present here a possible solution for a low-TCO open source cloud storage system, for those out there creating their own cloud (or hosting service).

I am not claiming that it will suit everyone's needs, but I hope it will at least give you some valuable pointers and alternatives.
You might also want to adapt it to your specific needs, because you might not require every single feature of the system.


Summary:

This setup allows you to build your own redundant storage network with common PC hardware; an easier but far more expensive way to achieve this would be to get a SAN and some Fibre Channel attached hosts. The setup provides features similar to Amazon EBS, as well as an HR (highly reliable) cluster file system for your cloud storage.

Features: 
  • High availability (DRBD, cluster file system)
  • High reliability (DRBD)
  • Flexible, dynamic storage resource management
    • File system export or block device, "Amazon EBS style"
  • Dynamic failover configuration (Pacemaker, Corosync)
    • Active/Passive; N+1; N to N; split site

Overview:
  • A set of paired storage back ends, each composed of hosts that use DRBD to keep the data redundant between the two hosts of a pair.
  • On top of DRBD we have LVM (or CLVM). Using LVM we can do on-the-fly logical volume resizing and snapshots (including hosting snapshot+diffs); you can even resize a logical volume across multiple underlying DRBD partitions.
    • Note: LVM can be used as both a front end and a back end of DRBD.
  • LVM block devices are exported to the cluster nodes using GNBD. Another node makes a GNBD import and the block device appears as a local block device there, ready to mount into the file hierarchy (see the command sketch after this list).
  • OCFS2 as a cluster file system allows all cluster nodes to access this file system concurrently.
    • Another possibility is to export a GNBD device for each virtual machine (but you still need a distributed/network file system for configuration, etc.).
  • Use of Pacemaker and Corosync to manage resources for HA/HR.
  • For the management/control and monitoring part, a custom-made solution might be needed.
    • GRAM could be used to expose the resource management.
    • Any monitoring framework should be able to do the trick.
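As a rough illustration, the layering could be assembled with commands along these lines. All names (r0, vg_storage, lv_images, storage1) are placeholders, and the exact DRBD and GNBD invocations depend on the versions you run, so treat this as a sketch rather than a recipe:

 # on each storage host of a DRBD pair
 drbdadm create-md r0 && drbdadm up r0
 drbdadm primary r0                                    # on the active host of the pair
 pvcreate /dev/drbd0                                   # put LVM on top of the replicated device
 vgcreate vg_storage /dev/drbd0
 lvcreate -L 100G -n lv_images vg_storage
 gnbd_export -d /dev/vg_storage/lv_images -e images    # export the logical volume

 # on a cluster node
 gnbd_import -i storage1                               # import the exports from the storage host
 mkfs.ocfs2 /dev/gnbd/images                           # format once, from a single node
 mount -t ocfs2 /dev/gnbd/images /mnt/images           # every node can then mount it concurrently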

"Simple" Schema: 








Pros:
  • Most of the independent parts are proven solutions used in large-scale production environments
  • Open source / readily available tools
  • OCFS2 provides back-end storage for images, while GNBD can provide on-demand block storage for cloud instances
  • Easy accounting: same as for any other file system (though it might need custom-built tools depending on the needs/requirements)
  • COTS components


Cons:
  • DRBD provides HA/HR through replication (think RAID 1), which means you get HA/HR and speed at the expense of half of your storage (slightly more if you are also using RAID for the physical disks)
  • Complex, with a risk of cascading failures due to a domino effect (similar to what happened recently in the Amazon cloud with EBS)
  • Performance will be extremely dependent on the set of physical resources available as well as on the topology and usage:
    • It will require a lot of tweaking/customization to extract the best performance (e.g. dual heads for DRBD, load balancing, etc.) and every setup will be different
    • Dedicated monitoring tools will have to be created in order to manage and automate the performance tweaking
  • Requires creating custom tools for management, scheduling/job management, etc. (the hard bit)



Tools, Links / Pointers:

Voila!


Thursday, March 31, 2011

How to install Soft-iWARP on Ubuntu 10.10, AKA how to have an RDMA-enabled system without the expensive hardware.


What is Soft-iWARP and why install it:

Soft-iWARP [1] is a software-based iWARP stack that runs at reasonable performance levels and fits seamlessly into the OFA RDMA environment. It provides several benefits:
  • As a generic (RNIC-independent) iWARP device driver, it immediately enables RDMA services on all systems with conventional Ethernet adapters, which do not provide RDMA hardware support.
  • Soft-iWARP can be an intermediate step when migrating applications and systems to RDMA APIs and OpenFabrics.
  • Soft-iWARP can be a reasonable solution for client systems, allowing RNIC-equipped peers/servers to enjoy the full benefits of RDMA communication.
  • Soft-iWARP seamlessly supports direct as well as asynchronous transmission with multiple outstanding work requests and RDMA operations.
  • A software-based iWARP stack may flexibly employ any available hardware assists for performance-critical operations such as MPA CRC checksum calculation and direct data placement. The resulting performance levels may approach those of a fully offloaded iWARP stack.

How to install Soft-iWARP  :

When I tried to install Soft-iWARP I ran into several issues, so I am writing this installation manual so that I don't forget how to do it and other people can enjoy a semi-seamless installation.

My setup: 

  • Core 2 Duo
  • e1000 NIC
  • 4 GB of RAM
  • Ubuntu 10.10 server
  • Linux kernel: 2.6.35-28-generic
Pre-Install

Getting the Linux kernel dev environment:
  • sudo apt-get install fakeroot build-essential crash kexec-tools makedumpfile kernel-wedge
  • sudo apt-get build-dep linux
  • sudo apt-get install git-core libncurses5 libncurses5-dev libelf-dev binutils-dev


Setting up  the Infiniband / libibverbs  environment
  • sudo apt-get install libibverbs1 libibcm1 libibcm-dev  libibverbs-dev  libibcommon1 ibverbs-utils
  • check if "/etc/udev/rules.d/40-ib.rules" exists
1.       If the file doesn't exist, create it and populate it with the following (you need to reboot afterwards):

 ####  /etc/udev/rules.d/40-ib.rules  ####
 KERNEL=="umad*", NAME="infiniband/%k"
 KERNEL=="issm*", NAME="infiniband/%k"
 KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
 KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
 KERNEL=="uat", NAME="infiniband/%k", MODE="0666"
 KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
 KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"
 ########


Verifying the Infiniband  / libibverbs  environment

  • Run these commands in the following order:
1.  sudo modprobe  rdma_cm
2.  sudo modprobe ib_uverbs
3.  sudo modprobe rdma_ucm
4.  sudo lsmod  




5.  ls /dev/infiniband/
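To double-check the result, a quick filter on the module list (a convenience only, not part of the original steps) is:

 sudo lsmod | grep -E 'rdma_cm|ib_uverbs|rdma_ucm'   # all three modules should be listed
 ls /dev/infiniband/                                 # device nodes created via the udev rules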





    
Getting the compile / dev environment for Soft-iWARP   : 
  • sudo apt-get install libtool autoconf         


Installing Soft-iWARP   :

Getting Soft-iWARP   
Getting the source from the git repository :

Compiling and installing the Soft-iWARP    kernel module : 

Inside the soft-iwarp kernel module folder:
  1.          make
  2.          sudo make install
Compiling and installing the Soft-iWARP   userlib: 

Inside the soft-iwarp userlib folder:
  1.          ./autogen.sh
  2.          Run AGAIN  ./autogen.sh
  3.          ./configure
  4.          make
  5.          sudo make install
  • At this stage Soft-iWARP installs its libibverbs driver file in "/usr/local/etc/libibverbs.d"; we need to make a symbolic link to it from /etc/libibverbs.d:
    • sudo ln -s /usr/local/etc/libibverbs.d /etc/libibverbs.d


Inserting the  Soft-iWARP    kernel module:
1.  sudo modprobe  rdma_cm
2.  sudo modprobe ib_uverbs
3.  sudo modprobe rdma_ucm
4.  sudo insmod /lib/modules/"your_kernel"/extra/siw.ko
5.  sudo lsmod
    • check if all the modules are correctly loaded
6.  ls /dev/infiniband/



Soft-iWARP    kernel  module parameters:
        loopback_enabled: if set, attaches siw also to the loopback device.
                To be set only during module insertion.

        mpa_crc_enabled:  if set, the MPA CRC gets generated and checked
                both in tx and rx path (kills all throughput).
                To be set only during module insertion.

        zcopy_tx:         if set, payload of non signalled work requests
                (such as non signalled WRITE or SEND as well as all READ
                 responses) are transferred using the TCP sockets
                sendpage interface. This parameter can be switched on and
                off dynamically (echo 1 >> /sys/module/siw/parameters/zcopy_tx
                for enablement, 0 for disabling). System load benefits from
                using 0copy data transmission: a server's load of
                one 2.5 GHz CPU may drop from 55 percent down to 23 percent
                if serving 100KByte READ responses at 10GbE's line speed.
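For example, the module could be inserted with CRC checking off and loopback attached, and zcopy_tx can then be toggled at runtime; the parameter values below are just one possible choice:

 sudo insmod /lib/modules/"your_kernel"/extra/siw.ko mpa_crc_enabled=0 loopback_enabled=1
 echo 1 | sudo tee /sys/module/siw/parameters/zcopy_tx   # enable zero-copy transmission
 echo 0 | sudo tee /sys/module/siw/parameters/zcopy_tx   # disable it again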

Verifying that Soft-iWARP is correctly installed:
Running:
  • ibv_devinfo
  • ibv_devices
  • You should see the siw device listed in the output of both commands.

  • if you get:
libibverbs: Warning: couldn't load driver 'siw': libsiw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
    • you need to run "sudo ldconfig" to update the library path


Running the simple test: 



Install rdmacm-utils:
  • sudo apt-get install  rdmacm-utils


Then you can test Soft-iWARP with:
  • rping : RDMA ping
  • udaddy : Establishes a set of unreliable RDMA datagram communication paths between two nodes using the librdmacm, optionally transfers datagrams between the nodes, then tears down the communication.
  • mckey : Establishes a set of RDMA multicast communication paths between nodes using the librdmacm, optionally transfers datagrams to receiving nodes, then tears down the communication.
  • ucmatose  : Establishes a set of reliable RDMA connections between two nodes using the librdmacm, optionally transfers data between the nodes, then disconnects.
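For instance, a quick rping check between two hosts (or against the loopback device) could look like this; 192.168.1.10 is a placeholder for the server's IP address:

 # on the server
 rping -s -a 192.168.1.10 -v -C 10
 # on the client
 rping -c -a 192.168.1.10 -v -C 10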

Note: the ibv_* utilities don't work with iWARP devices. You will get errors if you try to use them, such as:


[root@iwarp1 ~]# ibv_rc_pingpong
Couldn't get local LID
[root@iwarp1 ~]# ibv_srq_pingpong
Couldn't create SRQ
[root@iwarp1 ~]# ibv_srq_pingpong
Couldn't create SRQ
[root@iwarp1 ~]# ibv_uc_pingpong
Couldn't create QP
[root@iwarp1 ~]# ibv_ud_pingpong
Couldn't create QP
Note 2: If you only have one machine and you want to use the loopback device (optional; you can still use your normal NIC), to test if you have it enabled:
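A minimal check, assuming the loopback_enabled parameter described above is exposed under /sys (1 means siw was attached to the loopback device at module insertion):

 cat /sys/module/siw/parameters/loopback_enabled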