
Saturday, August 03, 2013

Hecatonchire Version 0.2 Released!

Version 0.2 of Hecatonchire has been released.

What's New:
  1. Write-invalidate coherency model added for those who want to use Heca natively in their applications as distributed shared memory (more on that in a subsequent post)
  2. Significant improvements in page-transfer performance, as well as a number of bugs squashed
  3. Specific optimisations for KVM
  4. Scale-out memory mirroring
  5. Hybrid post-copy live migration
  6. Moved to Linux kernel 3.9 stable
  7. Moved to qemu-kvm 1.4 stable
  8. Added test / proof-of-concept tools (specifically for the new coherency model)
  9. Improved documentation
Voila!

We are now focusing on stabilising the code and improving its robustness (we aim to make the code production-ready by 0.4). We are also starting significant work to integrate Hecatonchire so it can be transparently leveraged via a cloud stack, and more specifically OpenStack.

You can download it here: http://hecatonchire.com/#download.html
You can see the install doc here: https://github.com/hecatonchire/heca-misc/tree/heca-0.2/docs
And finally the changelog here: http://hecatonchire.com/#changelog-0.2.html
Or you can just pull the master branch on GitHub: https://github.com/hecatonchire


Stay tuned for more in-depth blog posts on Hecatonchire.

Wednesday, February 06, 2013

Hecatonchire: Memory Scale-out / Scale-up for the Linux Kernel

This post presents a more detailed description of the memory scale-out solution offered by the Hecatonchire project.

I. Motivation:


Currently the amount of memory available on a physical server constrains the performance and scalability of applications by limiting the amount of RAM available for consumption. The combination of explosive data-set growth with the need to deliver near real-time analytics for competitive advantage requires companies to use large-memory systems. However, such systems tend to be expensive, difficult to scale, and to have a shorter life span due to exponential application-workload growth. Not to mention that they often create islands within virtualised datacentre and cloud environments, which increases their operational cost.
While hardware capabilities improve, the amount of data to be analysed continues to grow. We can imagine that technological innovation in the area of hardware resources may not be able to keep up with this growth and indeed could reach a halt. Rather than frequently buying the newest (and often most expensive) cutting-edge machines the market has to offer, a better answer to the problem of data-volume growth can be to enable memory to:
  •      Scale out: data can be distributed and accessed in parallel within a set of relatively inexpensive commodity-hardware machines (within a cloud). Additional machines can then be connected to handle even larger amounts of data.
  •      Scale up with non-volatile memory (NVM) as extended memory (non-volatility is not important here): this would enable the extension of current systems at lower cost while maintaining decent performance.
  •      Hybrid: allowing remote NVM to be accessed over the network.

However, in order to guarantee performance we need to leverage either an RDMA fabric or direct device access to allow zero-copy transfer of pages with low CPU / memory overhead.
The result is a system able to distribute memory across a cluster of machines and/or tap into NVM, providing fast response times for many types of applications that require amounts of memory exceeding the typical core-to-memory ratio of 1:4 GB. It also allows the introduction of highly specialised hardware within datacentres, and treats memory the same way we currently treat storage.
We will first discuss the state of the art and its limitations, then a possible approach, and finally the performance of a proof-of-concept memory scale-out module. Note that most of the effort is currently focused on the scale-out approach; the scale-up using NVM devices is still at the concept stage.

II. State of the art:


When users or administrators want to extend the RAM of a system, they typically leverage the swap mechanism, which allows the system to use block-device space as backing memory. They create a swap partition or file, format it with mkswap and activate it with swapon. Swap over the network is considered an option for diskless systems.
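For reference, the classic local setup looks like this (an illustrative sketch; the file path and size are arbitrary):

  dd if=/dev/zero of=/swapfile bs=1M count=4096   # reserve 4 GB of block-device space
  chmod 600 /swapfile
  mkswap /swapfile                                # format it as swap space
  swapon /swapfile                                # activate it; the kernel can now page out to it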
However, the current solution suffers from significant drawbacks:
  •      Speed: the swap process relies on the block I/O layer, which is expensive and does not suit swap-to-RAM scenarios very well (https://lkml.org/lkml/2010/5/20/314). This is further exacerbated when the block layer is accessed over the network, which limits its usage to diskless systems that can tolerate the poor performance. Moreover, the low-latency characteristics of flash devices and RDMA fabrics mean that the standard I/O path adds unnecessary overhead for swapping.
  •      Size: on a 32-bit kernel we are limited to 64 swap areas of at most 64 GB each, while a 64-bit kernel permits 128 swap areas, each of up to 16 TB.
  •      Granularity: swap is system-wide, which prevents the user from restricting its usage to certain processes. The only control relies on the memlock functionality, which requires programmatic modification and awareness of the deployment scenario.
  •      Flexibility: due to the limited number of swap areas we cannot easily stripe memory across remote memory recipients for performance, e.g. 1 TB in 1 GB stripes across 10 different memory nodes / servers. On 64-bit Linux we could only have 128 stripes while 1000 would be required (today, striping relies on equal-priority swap areas, as sketched below).
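For context, the only striping mechanism stock Linux offers is round-robin across swap areas that share the same priority, which is exactly what the 128-area limit caps. An illustrative sketch; the network block devices are assumptions standing in for two remote memory servers:

  swapon -p 10 /dev/nbd0   # two swap areas backed by remote hosts
  swapon -p 10 /dev/nbd1   # equal priority: the kernel round-robins pages between them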

Alternative solutions to swapping were proposed, such as the cleancache / frontswap / RAMster combo (with the transcendent memory patch set). RAMster is a transparent memory-aggregation solution, integrated into mainline Linux. It allows a host to rely on the physical memory of another single remote host to handle large memory workloads. It supports local compression of data using zcache, and off-line transfer of memory contents to mitigate networking costs. Despite its advantages, it does not fully fit the case of memory-demanding applications:
it does not yet include fault-resilience features, it uses a peer-to-peer schema, and it creates some disruption in the native operation of Linux by delaying the transfer of pages that have already been unmapped from a process, in a manner undetectable by Linux. These approaches, while valid, force the creation of dedicated backend solutions to offer de-duplication, compression and tiering of memory, while such features already exist within the standard kernel.

Paging / Swapping to nvram:

As previously mentioned, the low-latency characteristics of flash devices mean that the standard I/O path adds unnecessary overhead for swapping. Several attempts were made to design a specific NVM-aware swapping backend, but so far none of them has made its way into the Linux kernel, as they tend to be hyper-specialised and do not translate very well across different devices.


III. Proposed approach (memory aggregation) :


    1. Overview:


    The high-level design was focused on meeting the requirements for scaling out memory-demanding applications described in the introduction: elastic, resource-selective, transparent, scalable, extensible and high-performance. These goals were to be achieved without disrupting existing functionality on any of the participating hosts, e.g. no disruption to the operating system, hardware devices or applications running in parallel on these hosts.

        a. Scaling out Memory over network:

  
    A cluster is built around a memory demander: a host running a memory-demanding application or VM. Available memory for that process is extended by relying on physical memory designated for that purpose by other hosts participating in the cluster. To extend its available memory, the memory-demanding application or VM registers segments of its virtual address space, which are sponsored by one or more memory sponsors.
Memory sponsors are other hosts participating in the cluster. They malloc() sufficient memory in a local process to serve as swap space for the memory segments they sponsor; beyond that they are free to operate, utilising their resources for unrelated purposes. Therefore only essential resources are selectively used by the cluster. When the memory-demander host runs out of physical memory, it sends the memory contents of sponsored segments to be stored at their respective memory sponsors. Memory segments can be registered as sponsored at any stage in the life-cycle of the memory-demanding process.
When the cluster is initially set up, it is possible to register large sponsored segments, pre-creating a large cluster. Alternatively, additional memory segments can be registered as sponsored over time, gradually extending the available physical memory for the process. Sponsored segments can also be unregistered over time, shrinking the available memory. The cluster is therefore elastic, extendable or shrinkable over time.

        b. Scaling up with NVM:


    We can rely on the same approach for NVM devices; however, we cannot use a user-space process to act as the memory host in this case. As a result we propose an alternative approach: a dedicated module managing the mapping of memory segments to the NVM device. This module can register and expose segments to other processes via simple sysfs calls.


IV. Kernel Implementation:


For transparent integration with existing applications we implemented the proposed approach as a Linux kernel module, with four small changes to the kernel itself: two hooks in the write-out sequence and two in the page-fault sequence. Our goal was to create an RDMA-based alternative to existing disk-based swap storage, while keeping existing swap functionality intact.
We sought to integrate with the memory management unit (MMU) in a way similar to the way disk-based swap integrates with it, and to carry out the RDMA processes in a manner as close as possible to the processes they replace. We rely on existing data structures, such as page table entries and swap entries.
The proposed approach reserves a single swap-area entry for the purpose of memory aggregation. This swap entry is used to identify (flag) pages that are either remote or on NVM devices. When a memory segment is registered as sponsored, we set its page table entries with this specific flag. The swap entries contain information as to which host sponsors that memory segment. Note: on a 64-bit system the swap-entry size allows us to scale up to 2^32 different memory segments. Moreover, each memory sponsor is identified by a single 128-bit ID, and each cluster or group is also identified by a single 128-bit ID. This allows the system to scale within large cloud / datacentre environments.
This approach has the advantage of being non-intrusive and leveraging existing kernel code; however, it requires pre-populating all the PTEs associated with the memory segment. We could instead have created a special VMA, but in order to be as unintrusive as possible we decided against it; we might reconsider this possibility based on feedback.
When a fault occurs on an address, we recognise its flagged swap entry and are able to request the data from the sponsoring host to which the entry points. If a write-out later occurs (for example under memory pressure from a cgroup constraint) and the page contents are transferred, the page table entry is re-set, pointing to the sponsor host. Moreover, we distinguish between write and read faults, as we allow different coherency models to be used for distributed-shared-memory applications of the technology (shared nothing, write invalidate).
The actual page-transfer operation can be done either via DMA by the dedicated NVM module for the device or, in the scale-out scenario, via a specifically designed RDMA engine which ensures zero-copy, low-latency kernel-to-kernel transfers.

http://hecatonchire.com/images/dsm.png
Hecatonchire Simplified Memory Management Architecture


This manner of integration with the MMU and existing data structures has several advantages. First of all, as mentioned, it is transparent to running applications/VMs, and non-disruptive to existing memory management. The only changes needed in applications are the system calls required to register sponsored memory segments. In the common case where applications run on top of
a VM, the VM may transparently handle these system calls, leaving the application completely unchanged.
Furthermore, any optimisations applied to page table entries benefit us too, and future compatibility with such enhancements is guaranteed. In addition, as we branch out only when specifically writing out a page or resolving a page fault, the underlying processes deciding which page to write out, or faulting on an address, are unchanged. These are highly optimised processes and should not be affected by the existence of a remote memory cluster. A single exception to this design choice is prefetching.
Our prefetching is a surrogate for the native kernel one, which lives within the page-fault flows that our execution flows replace. When sending a page request to a remote host, we use the time spent waiting on the network to issue further requests, prefetching related pages. These pages are later faulted in. The current prefetch method is rather naïve but provides good performance: we simply try to pre-fault a limited set of pages surrounding the one being fetched.
This solution also offers transparent support for kernel features such as:
  •      Memory swapping: memory sponsors can swap the pages out, as they are standard anonymous pages
  •      KSM: pages can be de-duplicated; if we encounter a KSM page we simply break it
  •      Transparent huge pages: if a fault hits a THP we break it, in order to have a single granularity for all pages
  •      EPT/NPT: the solution supports shadow page tables in order to be virtualisation-friendly
  •      Asynchronous page faults (KVM): the page-faulting process has been specifically optimised to support the async page faults implemented by KVM
  •      Cgroups: cgroups are transparently supported and every page is accounted, eliminating potential black-hole effects such as the one created by RAMster.

Advantages:

  •      Transparent integration
  •      No overhead when not compiled in
  •      Virtually no overhead during normal operation
  •      Minimal hooks
  •      Can be extended to provide distributed-shared-memory features with different coherency protocols
  •      Designed to scale
  •      Zero copy (RDMA or DMA)
  •      Uses anonymous pages
  •      RDMA: low latency


Issue / Unknown:

  •      We do not know yet what the best way to handle memlock / mmap is
  •      We have not yet planned how to support process fork (this can be supported with special VMA flagging)
  •      PTE entries need to be pre-filled (this can be mitigated with a special VMA)
  •      Requires a specific module to manage NVM devices
  •      Requires hardware RDMA support for scale-out (you can use emulated RDMA such as SoftiWARP or Soft-RoCE, but it is not upstream yet)

V. Proof of concept and performance:


We produced an initial proof of concept, focused on the memory scale-out scenario over RDMA. We developed an in-kernel RDMA communication engine and a module for managing the memory. We tested the PoC against various scenarios and RDMA fabrics (SoftiWARP, iWARP, InfiniBand).

Hardware:
  •      CPU: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
  •      1 socket with 4 cores, PCIe 2.0
  •      Memory: 16 GB
Network Fabric:
  •      InfiniBand: Mellanox ConnectX-2 QDR with a 40 Gbps switch
  •      iWARP: Chelsio T422-CR 10 GbE with a Fujitsu 10 GbE switch
  •      SoftiWARP: Chelsio T422-CR 10 GbE with a Fujitsu 10 GbE switch
  •      Kernel: Linux stable 3.6 + Hecatonchire module

Performance for hard-fault resolution (from page fault to its resolution):

  •      SoftiWARP 10 GbE: ~320 microseconds
  •      iWARP 10 GbE (Chelsio T422-CR): ~45 microseconds
  •      InfiniBand QDR 40 Gbps (Mellanox ConnectX-2): ~25 microseconds

Random walk over 1 GB of remote RAM (4 threads):

  •      Compounded perceived page-fault resolution (prefetch enabled):
    • InfiniBand QDR: ~2 microseconds
    • iWARP 10 GbE: ~4 microseconds
  •      Page-fault resolutions per second:
    • InfiniBand QDR: ~600,000 pages/second
    • iWARP 10 GbE: ~300,000 pages/second
Note: some limitations (IOPS, latency, bandwidth) are due to the workstation itself; we are currently running scaling tests on production-class servers.

Parallel quicksort over a 2 GB dataset with a memory constraint (cgroup), InfiniBand only (see the cgroup sketch after the results):

  •      Memory Ratio: 3:4 , Overhead: 2.08%
  •      Memory Ratio: 1:2 , Overhead: 2.62%
  •      Memory Ratio: 1:3, Overhead: 3.35%
  •      Memory Ratio: 1:4, Overhead: 4.15%
  •      Memory Ratio: 1:5, Overhead: 4.71%
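The memory ratios above were enforced by capping the benchmark's RAM with the cgroup memory controller. A minimal sketch of how such a limit can be applied, assuming the cgroup v1 memory controller mounted at /sys/fs/cgroup/memory; the group name and benchmark binary are placeholders:

  mkdir /sys/fs/cgroup/memory/qsort
  echo $((1536*1024*1024)) > /sys/fs/cgroup/memory/qsort/memory.limit_in_bytes   # 1.5 GB, i.e. a 3:4 ratio for a 2 GB dataset
  echo $$ > /sys/fs/cgroup/memory/qsort/tasks                                    # move the current shell into the group
  ./parallel_qsort 2G                                                            # hypothetical benchmark; excess pages go to the memory sponsors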

We also did various performance tests with KVM and the HANA DB; you can find the results in the following slide deck.




VI. Code:

The proof of concept has been developed within the Hecatonchire project and the source code is available at the following URL: http://www.hecatonchire.com.

VII. Conclusion


In summary, Hecatonchire enables:
  •     Better performance:
    • Each function is isolated; we limit the scope of what each box must do.
    • We can leverage dedicated hardware and software, resulting in increased performance.
  •     Superior scalability:
    • Functions are isolated from each other; as a result we can alter one function without impacting the others.
  •      Improved economics:
    • A cost-effective deployment of resources, with improved provisioning and consolidation of disparate equipment.

Monday, August 22, 2011

Soft RoCE, an alternative to Soft iWarp

Introduction

The Soft RoCE distribution is available now as a specially patched OFED-1.5.2 distribution, known as OFED-1.5.2-rxe. Users familiar with the installation and configuration of OFED software will find it easy to use. It is supported by System Fabric Works. Please refer to the official Soft-RoCE website for further details.

Features: 

Provides InfiniBand-like performance and efficiency on ubiquitous Ethernet infrastructure.
  • Utilises the same transport and network layers as IB
    • Keeps the IB stack and swaps the link layer for Ethernet.
    • Implements IB verbs over Ethernet.
  • Not quite IB strength, but it's getting close.
  • As of OFED 1.5.1, code written for OFED RDMA auto-magically works with RoCE.
Performance:
(from "Implementation and Comparison of RDMA over Ethernet")


  • RoCE is capable of providing near-InfiniBand QDR performance for:
    • latency-critical applications at message sizes from 128 B to 8 KB
    • bandwidth-intensive applications for messages < 1 KB.
  • Soft RoCE is comparable to hardware RoCE at message sizes above 65 KB.
  • Soft RoCE can improve performance where RoCE-enabled hardware is unavailable.

Installation 

The Soft RoCE distribution contains the entire OFED-1.5.2 distribution, with the addition of the Soft RoCE code.

Download link : http://www.systemfabricworks.com/downloads/roce

Installation of the OFED-1.5.2-rxe distribution works exactly the same as a “standard” OFED distribution.  Installation can be accomplished interactively, via the “install.pl” program, or automatically via “install.pl –c ofed.conf”.  The required new components are “librxe” and “ofa-kernel.” The latter is not new, but in our "rxe" version of the OFED distribution it includes the rxe/Soft RoCE kernel module.
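Concretely, from inside the unpacked OFED-1.5.2-rxe directory, the two invocations quoted above look like this (an illustrative sketch):

  sudo ./install.pl                  # interactive, menu-driven installation
  sudo ./install.pl -c ofed.conf     # unattended installation driven by a pre-built ofed.conf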
 

Usage

After installing OFED-1.5.2-rxe, you can use the rxe_cfg command to configure Soft-RoCE. Below I list a few of the most useful commands.
# rxe_cfg -h
Usage:
rxe_cfg [options] start|stop|status|persistent|devinfo
rxe_cfg debug on|off| (Must be compiled in for this to work)
rxe_cfg crc enable|disable
rxe_cfg mtu [rxe0]  (set ethernet mtu for one or all rxe transports)
rxe_cfg [-n] add eth0
rxe_cfg [-n] remove rxe1|eth2
Options:
 -n: do not make the configuration action persistent
 -v: print additional debug output
 -l: in status display, show only interfaces with link up
 -h: print this usage information
 -p 0x8916: (start command only) - use specified (non-default) eth_proto_id
1. Enable the Soft-RoCE module
rxe_cfg start
# rxe_cfg start

Name  Link  Driver   Speed   MTU   IPv4_addr        S-RoCE  RMTU
eth0  yes   bnx2             1500  198.124.220.136
eth1  yes   bnx2             1500
eth2  yes   iw_nes           9000  198.124.220.196
eth3  yes   mlx4_en  10GigE  1500  192.168.2.3
rxe eth_proto_id: 0x8915

2. Disable the Soft-RoCE module
rxe_cfg stop

3. Add an Ethernet interface to the Soft-RoCE module (see the example after this list)
rxe_cfg add [ethx]

4. Remove an Ethernet interface from the Soft-RoCE module
rxe_cfg remove [ethx|rxex] 
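For example, to enable Soft-RoCE on the 10 GbE Mellanox port that appeared in the status output above (eth3); the rxe device name is what we would expect to be created, not a guaranteed value:

  rxe_cfg add eth3
  rxe_cfg status      # an rxe interface (e.g. rxe0) should now be paired with eth3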
 
Tuning for performance:

1) MTU size
The Soft-RoCE interface only supports four MTU sizes: 512, 1024, 2048 and 4096. To maximise performance, we choose 4096.
Commands: ifconfig [ethx] mtu 9000 // set jumbo frames on the underlying Ethernet interface
rxe_cfg mtu [rxex] 4096 // set the maximum MTU on the corresponding rxe interface


Note: you also need to enable jumbo-frame support on your switch.

2) CRC checking
To maximise performance, we need to disable CRC checking.
Commands: rxe_cfg crc disable

3) Ethernet tx queue length
Also, we need to set a large value for the txqueuelen parameter of the underlying Ethernet interface.
Commands: ifconfig [ethx] txqueuelen 10000

Thursday, March 31, 2011

How to install Soft-iWARP on Ubuntu 10.10, AKA how to have an RDMA-enabled system without the expensive hardware.


What is Soft-iWARP and why install it:

Soft-iWARP [1] is a software-based iWARP stack that runs at reasonable performance levels and seamlessly fits into the OFA RDMA environment. It provides several benefits:
  • As a generic (RNIC-independent) iWARP device driver, it immediately enables RDMA services on all systems with conventional Ethernet adapters, which do not provide RDMA hardware support.
  • Soft-iWARP can be an intermediate step when migrating applications and systems to RDMA APIs and OpenFabrics.
  • Soft-iWARP can be a reasonable solution for client systems, allowing RNIC-equipped peers/servers to enjoy the full benefits of RDMA communication.
  • Soft-iWARP seamlessly supports direct as well as asynchronous transmission with multiple outstanding work requests and RDMA operations.
  • A software-based iWARP stack may flexibly employ any available hardware assists for performance-critical operations such as MPA CRC checksum calculation and direct data placement. The resulting performance levels may approach those of a fully offloaded iWARP stack.

How to install Soft-iWARP  :

When I tried to install Soft-iWARP I ran into several issues, so I am writing this installation manual so I don't forget how to do it and other people can enjoy a semi-seamless installation.

My setup: 

  •          Core 2 Duo
  •          e1000 NIC
  •          4 GB of RAM
  •          Ubuntu 10.10 server
  •          Linux kernel: 2.6.35-28-generic
Pre-Install

Getting the Linux kernel dev environment:
  • sudo apt-get install fakeroot build-essential crash kexec-tools makedumpfile kernel-wedge
  • sudo apt-get build-dep linux
  • sudo apt-get install git-core libncurses5 libncurses5-dev libelf-dev binutils-dev


Setting up  the Infiniband / libibverbs  environment
  • sudo apt-get install libibverbs1 libibcm1 libibcm-dev  libibverbs-dev  libibcommon1 ibverbs-utils
  • check whether "/etc/udev/rules.d/40-ib.rules" exists
1.       If the file doesn't exist, create it and populate it as follows (you need to reboot afterwards):

 ####  /etc/udev/rules.d/40-ib.rules  ####
 KERNEL=="umad*", NAME="infiniband/%k"
 KERNEL=="issm*", NAME="infiniband/%k"
 KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
 KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
 KERNEL=="uat", NAME="infiniband/%k", MODE="0666"
 KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
 KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"
 ########


Verifying the Infiniband  / libibverbs  environment

Run these commands in the following order:
1.  sudo modprobe rdma_cm
2.  sudo modprobe ib_uverbs
3.  sudo modprobe rdma_ucm
4.  sudo lsmod
5.  ls /dev/infiniband/
Getting the compile / dev environment for Soft-iWARP   : 
  • sudo apt-get install libtool autoconf         


Installing Soft-iWARP   :

Getting Soft-iWARP   
Getting the source from the git repository (see the sketch below):
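A minimal sketch only; the repository URLs are placeholders for wherever the Soft-iWARP kernel module and userlib sources are published, not verified addresses:

  git clone <softiwarp-kernel-module-repo-url> softiwarp
  git clone <softiwarp-userlib-repo-url> softiwarp-userlib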

Compiling and installing the Soft-iWARP    kernel module : 

Inside the soft-iwarp kernel module folder:
  1.          make
  2.          sudo make install
Compiling and installing the Soft-iWARP   userlib: 

Inside the soft-iwarp userlib folder:
  1.          ./autogen.sh
  2.          Run AGAIN  ./autogen.sh
  3.          ./configure
  4.          make
  5.          sudo make install
At this stage soft-iwarp installs its libibverbs driver configuration in "/usr/local/etc/libibverbs.d"; we need to make a symbolic link to /etc/libibverbs.d:
  sudo ln -s /usr/local/etc/libibverbs.d /etc/libibverbs.d


Inserting the Soft-iWARP kernel module:
1.  sudo modprobe rdma_cm
2.  sudo modprobe ib_uverbs
3.  sudo modprobe rdma_ucm
4.  sudo insmod /lib/modules/"your_kernel"/extra/siw.ko
5.  sudo lsmod
    (check that all the modules are correctly loaded)
6.  ls /dev/infiniband/
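A quick illustrative check (standard tooling, not taken from the original post):

  lsmod | grep -E 'siw|rdma_ucm|ib_uverbs|rdma_cm'   # all four modules should appear
  ls /dev/infiniband/                                # uverbs* and rdma_cm device nodes should now exist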



Soft-iWARP kernel module parameters:
        loopback_enabled: if set, attaches siw also to the loopback device.
                To be set only during module insertion.

        mpa_crc_enabled:  if set, the MPA CRC gets generated and checked
                in both the tx and rx paths (kills all throughput).
                To be set only during module insertion.

        zcopy_tx:         if set, the payload of non-signalled work requests
                (such as non-signalled WRITE or SEND, as well as all READ
                responses) is transferred using the TCP socket
                sendpage interface. This parameter can be switched on and
                off dynamically (echo 1 >> /sys/module/siw/parameters/zcopy_tx
                to enable, 0 to disable). System load benefits from
                zero-copy data transmission: a server's load of
                one 2.5 GHz CPU may drop from 55 percent down to 23 percent
                when serving 100 KByte READ responses at 10GbE line speed.
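The two insertion-time parameters can be passed directly on the insmod command line; a hedged example using the module path from step 4 above:

  sudo insmod /lib/modules/$(uname -r)/extra/siw.ko loopback_enabled=1 mpa_crc_enabled=0
  echo 1 | sudo tee /sys/module/siw/parameters/zcopy_tx   # zcopy_tx can still be toggled at runtime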

Verifying that Soft-iWARP is correctly installed:
Run:
  •          ibv_devinfo
  •          ibv_devices
You should see the siw device(s) listed in the output of both commands.
  • if you get:
libibverbs: Warning: couldn't load driver 'siw': libsiw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
    you need to run "sudo ldconfig" to update the library path


Running the simple test: 



Install rdma-utils :
  • sudo apt-get install  rdmacm-utils


Then you can test Soft-iWARP with the following tools (see the rping example after the list):
  • rping : RDMA ping
  • udaddy : Establishes a set of unreliable RDMA datagram communication paths between two nodes using the librdmacm, optionally transfers datagrams between the nodes, then tears down the communication.
  • mckey : Establishes a set of RDMA multicast communication paths between nodes using the librdmacm, optionally transfers datagrams to receiving nodes, then tears down the communication.
  • ucmatose  : Establishes a set of reliable RDMA connections between two nodes using the librdmacm, optionally transfers data between the nodes, then disconnects.
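A minimal rping round trip between two hosts; the IP address is an assumption, replace it with the server's address (or use the loopback device from Note 2 on a single machine):

  # on the server
  rping -s -a 192.168.1.10 -v -C 10
  # on the client
  rping -c -a 192.168.1.10 -v -C 10   # ten verbose ping-pong iterations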

Note: the ibv_* utils don't work with iWARP devices; you will get errors such as the following if you try to use them:


[root@iwarp1 ~]# ibv_rc_pingpong
Couldn't get local LID
[root@iwarp1 ~]# ibv_srq_pingpong
Couldn't create SRQ
[root@iwarp1 ~]# ibv_srq_pingpong
Couldn't create SRQ
[root@iwarp1 ~]# ibv_uc_pingpong
Couldn't create QP
[root@iwarp1 ~]# ibv_ud_pingpong
Couldn't create QP
Note 2: if you only have one machine and you want to use the loopback device (optional, you can still use your normal NIC), check whether loopback support is enabled:
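A hedged way to check, assuming the parameter is exposed under /sys in the same way as zcopy_tx above:

  cat /sys/module/siw/parameters/loopback_enabled   # 1 means siw is also attached to the loopback device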