Reflections Of The Void: 02/01/2013

This post present a more detail description of the memory scale out solution offered by the Hecatonchire project.

I. Motivation:

Currently the amount of memory available on a physical server constraint the performance and scalability of applications be limiting the amount of RAM available for consumption. The combination of the explosive data set growth with the need to deliver near real time analytics to achieve competitive advantage requires companies to use large memory systems. However this solution tends to be expensive, have difficulty to scale and have a shorter life span due to the exponential application workload increase. Not to mention that they often create an island within virtualised datacentre and cloud environment which increase their operational cost.

While hardware capabilities improve, the amount of data to be analysed continues to grow. We can imagine that technological innovation in the area of hardware resources may not be able to keep up with this growth and indeed could reach a halt. Rather than frequently buying the newest (and often most expensive) cutting-edge machines the market has to offer, a better answer to the problem of data volume growth can be to enable memory

Scale out: This means that data can be distributed and accessed in parallel within a set of relatively inexpensive commodity hardware machines (within a cloud). Additional machines can then be connected to handle even larger amounts of data.
Scale Up with Non Volatile Memory (NVM) as extended memory (non-volatility is not important here): This would enable the extension of current system at lower cost while maintaining decent performance.
Hybrid: allowing remote NVM to be accessed over network.

However in order to guarantee performance we need to leverage either RDMA fabric or direct device access to allow zero copy transfer of page and low cpu / memory overhead .

The result, a system able to distribute memory across a cluster of machines and / or tapping into NVM, providing fast response times for many types of applications that require large amount of memory that exceed the typical Core - Memory ratio of 1:4GB. It allows also the introduction of highly specialised HW within datacentres and treats memory the same way we currently treat storage.

We will first discuss the state of the art and its limitations. Then we will discuss a possible approach and finally the performance of a proof of concept memory scale out module. Note that most of the effort currently focused on the scale out approach and the scale up using NVM device is currently on at the concept stage.

II. State of the art:

When a user or administrator desired to extend its RAM memory it typically leverage the swap mechanism to extend the memory of the system by allowing it to leverage block device space. They create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems.

However the current solution suffers from significant drawback:

Speed : The swap process relies on the block io layer which is expensive and doesn’t suit very well swap to ram scenario (https://lkml.org/lkml/2010/5/20/314 ) . This is further exacerbated when the block layer is access over network which limit its usage for diskless system that can tolerate the poor performance. As a result However, the low latency characteristics of flash devices and RDMA fabrics mean that the standard I/O path adds additional overhead for swapping.
Size: On 32bit kernel we are limited to 64 swap spaces at a maximum of 64 GB. While with the 64 bit kernel 128 swap areas are permitted, each with 16 Tb.
Granularity: Swap is system wide which prevents the user from restricting its usage to certain process. The only control relies on memlock functionality which requires programmatic modification and awareness of the deployment scenario.
Flexibility: Due to the limited number of swap area we cannot easily strip the memory across remote memory recipient for performance. By ex: 1TB with 1 GB strip across 10 different memory node/ server. On 64 bit linux we could only have 128 stripes while a 1000 would be required.

Alternative solutions to swapping were proposed such as cleancache / frontswap / ramster combo (with the transcendent memory patch set) . RAMster is a transparent memory aggregation solution, integrated into mainline Linux. It allows a host to rely on physical memory of another single remote host, to handle large memory workloads. It supports local compression of data using zcache, and off-line transfer of memory contents, to mitigate networking costs. Despite its advantages, it does not fully fit the case of memory-demanding applications:

It does not yet include fault-resilience features, it uses a peer-to-peer schema, and it creates some disruption in native operation of Linux, by delaying the transfer of pages that have already been unmapped from a process in a matter undetectable by Linux. These approaches while valid force the creation of dedicated backend solution to offer de-duplication, compression, tiering of memory while such features already exist within the standard kernel.

Paging / Swapping to nvram:

As previously mentioned the low latency characteristics of flash devices mean that the standard I/O path adds additional overhead for swapping . Several attempt where proposed to design a specific NVM aware swapping backend however so far none of them made their way with the Linux kernel as they tend to be hyper specialised and do not translate very well across the different device.

III. Proposed approach (memory aggregation) :

1. Overview:

High level concept design was focused on meeting the requirements for scaling-out memory-demanding applications, described in the introduction section: elastic, resource-selective, transparent, scalable, extensible and with high-performance. These goals were to be achieved without disrupting existing functionalities in any of the participating hosts, e.g., no disruption to the operating system, hardware devices or applications running in parallel on these hosts.

a. Scaling out Memory over network:

A cluster is built around a memory demander: a host running a memory-demanding application or VM. Available memory for that process is extended, relying on physical memory designated for that purpose by other hosts participating in the cluster. To extend its available memory, the memory-demanding application or VM registers segments of its virtual address space, which are sponsored by one or more memory sponsors.

Memory sponsors are other hosts participating in the cluster. They malloc() sufficient memory in a local process, to serve as swap-space for memory segments they sponsor. Yet beyond that they are free to operate, utilizing their resources for unrelated purposes. Therefore only essential resources

are selectively used by the cluster. When the memory demander host runs out of physical memory, it sends memory contents in sponsored segments to be stored at their respective memory sponsors. Memory segments can be registered as sponsored at any stage in the life-cycle of the memory-demanding process.

When it is initially set-up, it is possible to register large sponsored segments, pre-creating a large cluster. Alternatively, additional memory segments can be registered as sponsored over time, gradually extending available physical memory for the process. Sponsored segments can also be unregistered over time, shrinking the available memory. The cluster is therefore elastic, extendable or shrinkable over time.

b. Scaling up with NVM:

We can rely on the same approach for NVM device however we cannot use user space process to act as memory host in this case. As a result we propose to use an alternative approach: we can rely on a dedicated module managing the mapping of memory segment to the NVM device. This module can register and expose segment to the other process via simple sysfs call.

IV. Kernel Implementation:

For transparent integration with existing applications we implemented a propose Linux Kernel Module, with four small changes to the Kernel itself: two hooks in the write-out sequence and two in the page fault sequence. Our goal was to create an RDMA-based alternative to existing disk-based swap storage, while also keeping existing swap functionalities intact.

We sought to integrate with the Memory Management Unit (MMU) in a way similar to the way disk-based swap integrates with it, and carry out the RDMA processes in a manner as close as possible to the processes which they replace. We relies on existing data structures, such as page table entries and swap entries.

The proposed approach reserves a single swap area entry for the purpose of memory aggregation. This swap entry is used to identify (flag) pages that are either remote or on NVM devices. When a memory segment is registered as sponsored, we set page table entries with this specific flag. The swap entries contain information as to the host which sponsors that memory segment. Note: on a 64bit system the swap entry size allow us to scale to up to 2^32 different memory segments. Moreover each memory sponsor is identified by a single 128 bit ID and each cluster or group is also identified by a single 128 bit ID. This allows the system to scale within large cloud / datacentre environment.

This approach has the advantage to be non-intrusive and leverage existing kernel code however it require the pre population of all the PTE associated with the memory segment . We could use the creation of a special VMA however in order to be the less intrusive possible we decided against but we might reconsider this possibility based on feedback.

When a fault occurs in an address, we recognizes its flagged swap entry, and is able to request the data from the sponsoring hosts to which the entry points. If a write-out later occurs (by example under memory pressure by C-group constraint), and page contents are transferred, the page table entry is re-set, pointing to the sponsor host. Moreover we distinguish between write and read fault as we allow different coherency model to be used for distributed shared memory application of the technology (shared nothing, write invalidate) .

The actual page transfer operation can be either done via DMA by the dedicate NVM module for the device or in the scale out scenario via a specifically designed RDMA engine which ensure zerocopy and low latency transfer kernel to kernel.

Hecatonchire Simplified Memory Management Architecture

This manner of integration with the MMU and existing data structures has several advantages. First of all, as mentioned, it is transparent to running applications/VMs, and non-disruptive for existing memory management. The only needed changes in applications are the system calls needed to register sponsored memory segments. In the common case when applications run on top of

a VM, the VM may transparently handle these system calls, leaving the application completely unchanged.

Furthermore, any optimizations imposed on page table entries are helpful too, and future compatibility with such enhancements is guaranteed. In addition, as we branches out only when specifically writing out a page or resolving a page fault, the underlying processes deciding which page to write-out, or faulting on an address are unchanged. These are highly optimized processes, and should not be affected by the existence of a remote memory cluster. A single exception to this design choice is the process of prefetching.

The prefetching is a surrogate for the native Kernel one, which exists within the page faults flows which are replaced by our execution flows. When sending a page request to a remote host, we uses the spare time spent in networking to issue further requests, prefetching related pages. These pages are later faulted in. The current prefetch method is rather naïve but provide good performance as we simply try to pre-fault a limited set of surrounding pages of the one being fetch.

This solution also offer transparent support of kernel features such as:

Memory swapping: memory sponsor can swap out the page as they are standard normal anonymous page
KSM: Page can be de-duplicated; if we encounter a KSM paged we simply break the KSM page.
Transparent huge page: If a fault is encountered in a THP we break it in order to have a single granularity for all the pages
EPT/NPT: the solution support shadow page table in order to be virtualization friendly
Asynchronous page fault (KVM): The page faulting process has been specifically optimised to support async page fault implemented by KVM
Cgroup: cgroup are transparently supported and every page is accounted eliminating potential black hole effect as the one created by Ramster.

Advantages:

Transparent integration
No overhead when not compiled
Virtually no overhead to normal operation
Minimal hooks
Can be extended to provide distributed shared memory features with different coherency protocol.
Design to scale
Zero Copy (RDMA or DMA)
Use Anonymous page
RDMA: low latency

Issue / Unknown:

We do not know yet what the best way to handle Memlock / mmap is
We did not plan how to support process fork (this can be supported with a special VMA flagging)
PTE entries need to be pre filled (this can be mitigated with a special VMA)
Require a specific module to manage NVM devices
Require HW RDMA support for scale out (you can use Emulated RDMA such as SoftIwarp or softRoCE but it is not upstream yet)

V. Proof of concept and performance:

We produce an initial proof of concept. This PoC focused on the memory scale out scenario over RDMA. We developed an in kernel RDMA communication engine and a module for managing the memory. We tested the PoC against various scenario and RDMA fabric (SoftIWARP, Iwarp, Infiniband).

Hardware:

CPU : Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
1 socket with 4 cores, Pci-e 2
Memory: 16 GB

Network Fabric:

Infiniband : QDR connectX2 mellanox with 40Gbps switch
Iwarp : 10GbE Chelsio T422CR with Fujitsu 10GbE switch
Softiwarp : 10GbE Chelsio T422CR with fujitsu 10GbE switch
Kernel: Linux stable 3.6 + Hecatonchire Module

Performance for hard fault resolution (from page fault to its resolution):

Softiwarp 10GE: ~320 micro seconds
Iwarp 10 GE (Chelsio T422-CR) : ~ 45 micro seconds
Infiniband QDR 40Gbps (Mellanox ConnectX2): 25 micro seconds

Random Walk over 1GB of remote RAM (4 threads):

Compounded Perceived Page Fault Resolution (prefetch enabled):

Infiniband QDR : ~2 micro seconds
Iwarp 10GE : ~4 micro seconds

Page fault resolution per seconds:

Infiniband QDR: ~600 000 pages / second
Iwarp 10GE: ~300 000 pages /seconds

Note: Some limitation (IOPS , latency, bandwith) are due to the workstation limitation and we are currently doing scaling test on Production class servers.

Parallel quick sort over 2 GB dataset with memory constraint (C-group) Infiniband only:

Memory Ratio: 3:4 , Overhead: 2.08%
Memory Ratio: 1:2 , Overhead: 2.62%
Memory Ratio: 1:3, Overhead: 3.35%
Memory Ratio: 1:4, Overhead: 4.15%
Memory Ratio: 1:5, Overhead: 4.71%

We also did various performance tests with KVM and HANA DB, you can fine the result in the following slide deck

VI. Code:

The proof of concept has been developed within the Hecatonchire project and the source code is available at the following URL: http://www.hecatonchire.com .

VII. Conclusion

In summary Hecatonchire enable:

Better Performance:

Each function is isolated. We limit the scope of what each box must do
We can leverage dedicated hardware and software resulting in increased performance.