Reflections Of The Void

Tuesday, April 09, 2013

Hecatonchire High Availaility and redundancy of memory scale out node

Delivering transparent memory scale-out presents certain challenges. Two main aspects can be compromised by the deployment of such a solution:

Performance may be hindered, due to the generic nature of implementation at the infrastructure level.
Fault Resilience might be compromised, as the frequency of failures increases along with the increased number of hosts participating in the system. The latter challenge is significant: the decrease in Mean Time Between Failures (MTBF) exposes a non-resilient system to an increased risk of down-times, which might be unacceptable in many business environments.

In Hecatonchire we developed a solution revolving around a scale-out approach with mirroring of the remote memory in two or more parallel hosts.

Memory Scale out redundancy design in Hecatonchire

Design

To enable fault-resilience, we mirrored the memory stored at memory sponsors. All remote memory, residing in memory sponsors, is mirrored, and therefore failure in one sponsor does not cause any loss of data. Furthermore, such failure does not cause any delay or downtime in the operation of the running application.

When the kernel module faults on an address, it sends the request to both memory sponsors. Upon reception of the ﬁrst response, the memory demander uses it to allow the application to continue operating. When a page is not required anymore on the computing node, the kernel module swaps the page out. To do so, it sends the page to both memory sponsors, and waits for validation that both of them stored the content, before discarding the page.

The biggest advantage of this approach is zero-downtime failover of memory sponsors. If one sponsor fails, the other sponsor continues responding to the memory demander’s requests for memory content.

Fault Tolerance in Hecatonchire

Tradeoff

The advantage of fault-resiliency in our approach is mitigated by several trade-offs. First of all, this approach needs twice the amount of memory, compared to a non-resilient scheme. Furthermore, the approach potentially induces an additional performance overhead: swap-out operations are delayed (waiting for responses from two hosts), and some additional computation and storage is needed.

Moreover, the mirroring approach doubles the needed bandwidth, luckily today’s fabrics have bandwidth capacities reaching 100Gbps and it is currently rare that application consume more than 50Gbps continuously.

Note that our approach does not deal with fault tolerance for the main host running the application (the memory demander).The issue of VM replication has numerous solutions that are orthogonal to our approach, such as Kemari and Remus.

Benchmark with Synthetic Workload

To assess the impact of using the High availability solution of Hecatonchire we used a realistic scenario: Running an application over the system, and measuring completion time overhead, compared to running the application on top of sufficient local DRAM. For this scenario we used an application which performs a quicksort over an arbitrary amount of generated data, in similar fashion to previous evaluations of DSM systems. This simulates a memory-intensive application, as memory is accessed rapidly, with minimal computations per access. Our two main KPIs are the overhead in completion time of using an redundant cluster to extend available memory, compared to using sufficient local, physical memory; and more importantly, the trend of performance degradation as the cluster scales. Scaling was created both by extending the workload size - amount of data to be sorted - and by limiting the amount of physical memory on the computation host. The ratio of workload size to available memory on the computation host reﬂects a scaling factor of the system.

Note: in the figure the redundant memory cluster is called RRAIM for redundant array of inexpensive RAM

The results show completion times for the quicksort benchmark, with a total workload of 500MB, 1GB and 2GB, respectively. The cgroup memory in the computation host was capped, such that workload size was up to ×5 larger than available memory, reﬂecting memory aggregation for the process.

The results reﬂect very low overhead for using HA , compared to local DRAM. Table displays the % of overhead per scaling factor. Quintupling (×5) the available memory for the quicksort application process using RRAIM results in an overhead less than 5%; using a fault-tolerant schema still results in an overhead less than 10%.
More importantly, the trend of performance degradation, as the scaling factor increases, reﬂects good scalability of the solution : the difference between doubling the available memory and quintupling it is less than 2.5 − 4%. The singular steep incline in the 512 GB scenario probably represents thrashing as the size of physical memory was too small for the working set - rather than a simple overhead increase. And this seems to be confirmed as enough memory on the computation host is available to hold the working set, this phenomenon does not occur (1GB + scenario).

Benchmark with Enterprise Application Workload

The quicksort benchmark suffers from two disadvantages: The ﬁrst is using small memory workloads (up to 2GB in our evaluations); and the second being a synthetic benchmark, not necessarily reﬂecting real-world scenarios. We therefore chose to evaluate the performance of RRAIM in transparently scaling-out a commercial in-memory database - SAP HANA - using the standard commercial benchmark. Note that this benchmark was not designed to test distributed systems at all: Only the performance of the database itself. A low overhead in completion time of the benchmark, when extending the available memory using RRAIM, reﬂects a transparent scale-out, with minimal disruption to the running database. The average overhead for running HANA with HECA on a 128 GB VM instance with 600 GB of DATA loaded up and running a SAP-H variant is on average 4% in a 1:2 and 1:3 memory ratio scenario. When we introduce RRAIM the overhead stay almost the same as showed in the following table.

Wednesday, February 06, 2013

Hecatonchire: Memory Scale out -up for Linux Kernel

This post present a more detail description of the memory scale out solution offered by the Hecatonchire project.

I. Motivation:

Currently the amount of memory available on a physical server constraint the performance and scalability of applications be limiting the amount of RAM available for consumption. The combination of the explosive data set growth with the need to deliver near real time analytics to achieve competitive advantage requires companies to use large memory systems. However this solution tends to be expensive, have difficulty to scale and have a shorter life span due to the exponential application workload increase. Not to mention that they often create an island within virtualised datacentre and cloud environment which increase their operational cost.

While hardware capabilities improve, the amount of data to be analysed continues to grow. We can imagine that technological innovation in the area of hardware resources may not be able to keep up with this growth and indeed could reach a halt. Rather than frequently buying the newest (and often most expensive) cutting-edge machines the market has to offer, a better answer to the problem of data volume growth can be to enable memory

Scale out: This means that data can be distributed and accessed in parallel within a set of relatively inexpensive commodity hardware machines (within a cloud). Additional machines can then be connected to handle even larger amounts of data.
Scale Up with Non Volatile Memory (NVM) as extended memory (non-volatility is not important here): This would enable the extension of current system at lower cost while maintaining decent performance.
Hybrid: allowing remote NVM to be accessed over network.

However in order to guarantee performance we need to leverage either RDMA fabric or direct device access to allow zero copy transfer of page and low cpu / memory overhead .

The result, a system able to distribute memory across a cluster of machines and / or tapping into NVM, providing fast response times for many types of applications that require large amount of memory that exceed the typical Core - Memory ratio of 1:4GB. It allows also the introduction of highly specialised HW within datacentres and treats memory the same way we currently treat storage.

We will first discuss the state of the art and its limitations. Then we will discuss a possible approach and finally the performance of a proof of concept memory scale out module. Note that most of the effort currently focused on the scale out approach and the scale up using NVM device is currently on at the concept stage.

II. State of the art:

When a user or administrator desired to extend its RAM memory it typically leverage the swap mechanism to extend the memory of the system by allowing it to leverage block device space. They create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems.

However the current solution suffers from significant drawback:

Speed : The swap process relies on the block io layer which is expensive and doesn’t suit very well swap to ram scenario (https://lkml.org/lkml/2010/5/20/314 ) . This is further exacerbated when the block layer is access over network which limit its usage for diskless system that can tolerate the poor performance. As a result However, the low latency characteristics of flash devices and RDMA fabrics mean that the standard I/O path adds additional overhead for swapping.
Size: On 32bit kernel we are limited to 64 swap spaces at a maximum of 64 GB. While with the 64 bit kernel 128 swap areas are permitted, each with 16 Tb.
Granularity: Swap is system wide which prevents the user from restricting its usage to certain process. The only control relies on memlock functionality which requires programmatic modification and awareness of the deployment scenario.
Flexibility: Due to the limited number of swap area we cannot easily strip the memory across remote memory recipient for performance. By ex: 1TB with 1 GB strip across 10 different memory node/ server. On 64 bit linux we could only have 128 stripes while a 1000 would be required.

Alternative solutions to swapping were proposed such as cleancache / frontswap / ramster combo (with the transcendent memory patch set) . RAMster is a transparent memory aggregation solution, integrated into mainline Linux. It allows a host to rely on physical memory of another single remote host, to handle large memory workloads. It supports local compression of data using zcache, and off-line transfer of memory contents, to mitigate networking costs. Despite its advantages, it does not fully fit the case of memory-demanding applications:

It does not yet include fault-resilience features, it uses a peer-to-peer schema, and it creates some disruption in native operation of Linux, by delaying the transfer of pages that have already been unmapped from a process in a matter undetectable by Linux. These approaches while valid force the creation of dedicated backend solution to offer de-duplication, compression, tiering of memory while such features already exist within the standard kernel.

Paging / Swapping to nvram:

As previously mentioned the low latency characteristics of flash devices mean that the standard I/O path adds additional overhead for swapping . Several attempt where proposed to design a specific NVM aware swapping backend however so far none of them made their way with the Linux kernel as they tend to be hyper specialised and do not translate very well across the different device.

III. Proposed approach (memory aggregation) :

1. Overview:

High level concept design was focused on meeting the requirements for scaling-out memory-demanding applications, described in the introduction section: elastic, resource-selective, transparent, scalable, extensible and with high-performance. These goals were to be achieved without disrupting existing functionalities in any of the participating hosts, e.g., no disruption to the operating system, hardware devices or applications running in parallel on these hosts.

a. Scaling out Memory over network:

A cluster is built around a memory demander: a host running a memory-demanding application or VM. Available memory for that process is extended, relying on physical memory designated for that purpose by other hosts participating in the cluster. To extend its available memory, the memory-demanding application or VM registers segments of its virtual address space, which are sponsored by one or more memory sponsors.

Memory sponsors are other hosts participating in the cluster. They malloc() sufficient memory in a local process, to serve as swap-space for memory segments they sponsor. Yet beyond that they are free to operate, utilizing their resources for unrelated purposes. Therefore only essential resources

are selectively used by the cluster. When the memory demander host runs out of physical memory, it sends memory contents in sponsored segments to be stored at their respective memory sponsors. Memory segments can be registered as sponsored at any stage in the life-cycle of the memory-demanding process.

When it is initially set-up, it is possible to register large sponsored segments, pre-creating a large cluster. Alternatively, additional memory segments can be registered as sponsored over time, gradually extending available physical memory for the process. Sponsored segments can also be unregistered over time, shrinking the available memory. The cluster is therefore elastic, extendable or shrinkable over time.

b. Scaling up with NVM:

We can rely on the same approach for NVM device however we cannot use user space process to act as memory host in this case. As a result we propose to use an alternative approach: we can rely on a dedicated module managing the mapping of memory segment to the NVM device. This module can register and expose segment to the other process via simple sysfs call.

IV. Kernel Implementation:

For transparent integration with existing applications we implemented a propose Linux Kernel Module, with four small changes to the Kernel itself: two hooks in the write-out sequence and two in the page fault sequence. Our goal was to create an RDMA-based alternative to existing disk-based swap storage, while also keeping existing swap functionalities intact.

We sought to integrate with the Memory Management Unit (MMU) in a way similar to the way disk-based swap integrates with it, and carry out the RDMA processes in a manner as close as possible to the processes which they replace. We relies on existing data structures, such as page table entries and swap entries.

The proposed approach reserves a single swap area entry for the purpose of memory aggregation. This swap entry is used to identify (flag) pages that are either remote or on NVM devices. When a memory segment is registered as sponsored, we set page table entries with this specific flag. The swap entries contain information as to the host which sponsors that memory segment. Note: on a 64bit system the swap entry size allow us to scale to up to 2^32 different memory segments. Moreover each memory sponsor is identified by a single 128 bit ID and each cluster or group is also identified by a single 128 bit ID. This allows the system to scale within large cloud / datacentre environment.

This approach has the advantage to be non-intrusive and leverage existing kernel code however it require the pre population of all the PTE associated with the memory segment . We could use the creation of a special VMA however in order to be the less intrusive possible we decided against but we might reconsider this possibility based on feedback.

When a fault occurs in an address, we recognizes its flagged swap entry, and is able to request the data from the sponsoring hosts to which the entry points. If a write-out later occurs (by example under memory pressure by C-group constraint), and page contents are transferred, the page table entry is re-set, pointing to the sponsor host. Moreover we distinguish between write and read fault as we allow different coherency model to be used for distributed shared memory application of the technology (shared nothing, write invalidate) .

The actual page transfer operation can be either done via DMA by the dedicate NVM module for the device or in the scale out scenario via a specifically designed RDMA engine which ensure zerocopy and low latency transfer kernel to kernel.

Hecatonchire Simplified Memory Management Architecture

This manner of integration with the MMU and existing data structures has several advantages. First of all, as mentioned, it is transparent to running applications/VMs, and non-disruptive for existing memory management. The only needed changes in applications are the system calls needed to register sponsored memory segments. In the common case when applications run on top of

a VM, the VM may transparently handle these system calls, leaving the application completely unchanged.

Furthermore, any optimizations imposed on page table entries are helpful too, and future compatibility with such enhancements is guaranteed. In addition, as we branches out only when specifically writing out a page or resolving a page fault, the underlying processes deciding which page to write-out, or faulting on an address are unchanged. These are highly optimized processes, and should not be affected by the existence of a remote memory cluster. A single exception to this design choice is the process of prefetching.

The prefetching is a surrogate for the native Kernel one, which exists within the page faults flows which are replaced by our execution flows. When sending a page request to a remote host, we uses the spare time spent in networking to issue further requests, prefetching related pages. These pages are later faulted in. The current prefetch method is rather naïve but provide good performance as we simply try to pre-fault a limited set of surrounding pages of the one being fetch.

This solution also offer transparent support of kernel features such as:

Memory swapping: memory sponsor can swap out the page as they are standard normal anonymous page
KSM: Page can be de-duplicated; if we encounter a KSM paged we simply break the KSM page.
Transparent huge page: If a fault is encountered in a THP we break it in order to have a single granularity for all the pages
EPT/NPT: the solution support shadow page table in order to be virtualization friendly
Asynchronous page fault (KVM): The page faulting process has been specifically optimised to support async page fault implemented by KVM
Cgroup: cgroup are transparently supported and every page is accounted eliminating potential black hole effect as the one created by Ramster.

Advantages:

Transparent integration
No overhead when not compiled
Virtually no overhead to normal operation
Minimal hooks
Can be extended to provide distributed shared memory features with different coherency protocol.
Design to scale
Zero Copy (RDMA or DMA)
Use Anonymous page
RDMA: low latency

Issue / Unknown:

We do not know yet what the best way to handle Memlock / mmap is
We did not plan how to support process fork (this can be supported with a special VMA flagging)
PTE entries need to be pre filled (this can be mitigated with a special VMA)
Require a specific module to manage NVM devices
Require HW RDMA support for scale out (you can use Emulated RDMA such as SoftIwarp or softRoCE but it is not upstream yet)

V. Proof of concept and performance:

We produce an initial proof of concept. This PoC focused on the memory scale out scenario over RDMA. We developed an in kernel RDMA communication engine and a module for managing the memory. We tested the PoC against various scenario and RDMA fabric (SoftIWARP, Iwarp, Infiniband).

Hardware:

CPU : Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
1 socket with 4 cores, Pci-e 2
Memory: 16 GB

Network Fabric:

Infiniband : QDR connectX2 mellanox with 40Gbps switch
Iwarp : 10GbE Chelsio T422CR with Fujitsu 10GbE switch
Softiwarp : 10GbE Chelsio T422CR with fujitsu 10GbE switch
Kernel: Linux stable 3.6 + Hecatonchire Module

Performance for hard fault resolution (from page fault to its resolution):

Softiwarp 10GE: ~320 micro seconds
Iwarp 10 GE (Chelsio T422-CR) : ~ 45 micro seconds
Infiniband QDR 40Gbps (Mellanox ConnectX2): 25 micro seconds

Random Walk over 1GB of remote RAM (4 threads):

Compounded Perceived Page Fault Resolution (prefetch enabled):

Infiniband QDR : ~2 micro seconds
Iwarp 10GE : ~4 micro seconds

Page fault resolution per seconds:

Infiniband QDR: ~600 000 pages / second
Iwarp 10GE: ~300 000 pages /seconds

Note: Some limitation (IOPS , latency, bandwith) are due to the workstation limitation and we are currently doing scaling test on Production class servers.

Parallel quick sort over 2 GB dataset with memory constraint (C-group) Infiniband only:

Memory Ratio: 3:4 , Overhead: 2.08%
Memory Ratio: 1:2 , Overhead: 2.62%
Memory Ratio: 1:3, Overhead: 3.35%
Memory Ratio: 1:4, Overhead: 4.15%
Memory Ratio: 1:5, Overhead: 4.71%

We also did various performance tests with KVM and HANA DB, you can fine the result in the following slide deck

VI. Code:

The proof of concept has been developed within the Hecatonchire project and the source code is available at the following URL: http://www.hecatonchire.com .

VII. Conclusion

In summary Hecatonchire enable:

Better Performance:

Each function is isolated. We limit the scope of what each box must do
We can leverage dedicated hardware and software resulting in increased performance.