Tuesday, April 09, 2013

Hecatonchire: High Availability and Redundancy for Memory Scale-Out Nodes



Delivering transparent memory scale-out presents certain challenges. Two main aspects can be compromised by the deployment of such a solution:
  1. Performance may be hindered due to the generic nature of an implementation at the infrastructure level.
  2. Fault resilience might be compromised, as the frequency of failures increases with the number of hosts participating in the system. The latter challenge is significant: the decrease in Mean Time Between Failures (MTBF) exposes a non-resilient system to an increased risk of downtime, which may be unacceptable in many business environments.
In Hecatonchire we developed a solution built around a scale-out approach that mirrors the remote memory on two or more parallel hosts.


Memory scale-out redundancy design in Hecatonchire


Design


To enable fault resilience, we mirror the memory stored at the memory sponsors. All remote memory residing on memory sponsors is mirrored, so the failure of one sponsor causes no data loss. Furthermore, such a failure causes no delay or downtime in the operation of the running application.
When the kernel module faults on an address, it sends the request to both memory sponsors. Upon reception of the first response, the memory demander uses it to allow the application to continue operating. When a page is no longer required on the computing node, the kernel module swaps the page out: it sends the page to both memory sponsors and waits for confirmation that both of them stored the content before discarding the page.
The biggest advantage of this approach is zero-downtime failover of memory sponsors. If one sponsor fails, the other sponsor continues responding to the memory demander's requests for memory content (a sketch of this protocol follows below).
Fault Tolerance in Hecatonchire
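
To make the protocol concrete, here is a minimal user-space sketch of the mirrored fault-in / swap-out logic described above. It simulates the two memory sponsors as local buffers: all names (sponsor, swap_out, fault_in) are illustrative, not Hecatonchire's actual kernel API, and the transport is a plain memcpy rather than RDMA.

/* Hypothetical sketch of the mirrored fault-in / swap-out protocol.
 * Sponsors are simulated as in-process buffers; this is not the
 * Hecatonchire kernel module's real interface. */
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define PAGE_SIZE 4096
#define NPAGES    4

/* Each sponsor mirrors every remote page. */
struct sponsor {
    unsigned char pages[NPAGES][PAGE_SIZE];
    bool up;                       /* simulated host-failure flag */
};

static struct sponsor sponsors[2];

/* Swap-out: push the page to BOTH sponsors and report success only
 * once every live sponsor has stored it, so the demander may safely
 * discard its local copy. */
static bool swap_out(unsigned long pfn, const unsigned char *page)
{
    int acks = 0;
    for (int i = 0; i < 2; i++) {
        if (!sponsors[i].up)
            continue;              /* degraded mode: one mirror left */
        memcpy(sponsors[i].pages[pfn], page, PAGE_SIZE);
        acks++;
    }
    return acks > 0;               /* at least one durable copy */
}

/* Fault-in: ask both sponsors; the first response to arrive is used
 * to resume the faulting application. Here "first" is simply the
 * first live sponsor we poll. */
static bool fault_in(unsigned long pfn, unsigned char *page)
{
    for (int i = 0; i < 2; i++) {
        if (sponsors[i].up) {
            memcpy(page, sponsors[i].pages[pfn], PAGE_SIZE);
            return true;
        }
    }
    return false;                  /* both mirrors lost: data gone */
}

int main(void)
{
    unsigned char page[PAGE_SIZE];

    sponsors[0].up = sponsors[1].up = true;

    memset(page, 0xAB, PAGE_SIZE);
    swap_out(0, page);             /* page 0 now mirrored twice */

    sponsors[0].up = false;        /* kill one sponsor */

    memset(page, 0, PAGE_SIZE);
    if (fault_in(0, page))         /* still served: zero downtime */
        printf("page 0 recovered from surviving sponsor: 0x%02x\n",
               page[0]);
    return 0;
}

Note the asymmetry: fault-in can complete with whichever sponsor answers first, but swap-out must wait for all live mirrors before the demander discards its copy; otherwise a single sponsor failure could lose the only copy of a page.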

Tradeoff

The fault-resilience advantage of our approach comes at the cost of several trade-offs. First, it requires twice the amount of memory compared to a non-resilient scheme. Furthermore, it potentially induces additional performance overhead: swap-out operations are delayed (waiting for responses from two hosts), and some additional computation and storage is needed.

Moreover, the mirroring approach doubles the needed bandwidth. Luckily, today's fabrics offer bandwidth capacities reaching 100 Gbps, and it is currently rare for an application to consume more than 50 Gbps continuously, so even the doubled traffic fits within the fabric's capacity.

Note that our approach does not deal with fault tolerance for the main host running the application (the memory demander). The issue of VM replication has numerous solutions that are orthogonal to our approach, such as Kemari and Remus.

Benchmark with Synthetic Workload


To assess the impact of using the high-availability solution of Hecatonchire, we used a realistic scenario: running an application over the system and measuring the completion-time overhead compared to running the application on top of sufficient local DRAM. For this scenario we used an application that performs a quicksort over an arbitrary amount of generated data, in a similar fashion to previous evaluations of DSM systems. This simulates a memory-intensive application, as memory is accessed rapidly, with minimal computation per access. Our two main KPIs are the overhead in completion time of using a redundant cluster to extend available memory, compared to using sufficient local physical memory; and, more importantly, the trend of performance degradation as the cluster scales. Scaling was achieved both by extending the workload size (the amount of data to be sorted) and by limiting the amount of physical memory on the computation host. The ratio of workload size to available memory on the computation host reflects the scaling factor of the system. A minimal sketch of such a benchmark follows below.
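
For reference, this is roughly the kind of quicksort micro-benchmark described above: generate an arbitrary amount of data, sort it, and report wall-clock completion time. This is a reconstruction, not the code actually used; the memory cap is assumed to be applied externally (e.g., via a cgroup limit on the process), and the default size simply mirrors the smallest workload mentioned here.

/* Sketch of a quicksort memory benchmark: sort N MB of generated
 * integers and print the elapsed time. Run it under a cgroup memory
 * limit to create the workload-to-memory scaling factor. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    /* Workload size in MB, e.g. 500, 1024, 2048 as in the runs above. */
    size_t mb = argc > 1 ? strtoul(argv[1], NULL, 10) : 500;
    size_t n  = mb * 1024 * 1024 / sizeof(int);

    int *data = malloc(n * sizeof(int));
    if (!data) { perror("malloc"); return 1; }

    srand(42);
    for (size_t i = 0; i < n; i++)
        data[i] = rand();          /* touch every page once */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    qsort(data, n, sizeof(int), cmp_int);   /* memory-intensive phase */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("sorted %zu MB in %.2f s\n", mb,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    free(data);
    return 0;
}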

Note: in the figure, the redundant memory cluster is called RRAIM (Redundant Array of Inexpensive RAM).






The results show completion times for the quicksort benchmark with total workloads of 500 MB, 1 GB, and 2 GB, respectively. The cgroup memory limit on the computation host was capped such that the workload size was up to 5× larger than the available memory, reflecting memory aggregation for the process.

The results reflect very low overhead for using HA compared to local DRAM. The table displays the overhead percentage per scaling factor. Quintupling (5×) the available memory for the quicksort application process using RRAIM results in an overhead of less than 5%; using a fault-tolerant schema still results in an overhead of less than 10%.
More importantly, the trend of performance degradation as the scaling factor increases reflects the good scalability of the solution: the difference between doubling the available memory and quintupling it is less than 2.5-4%. The single steep incline, in the 500 MB scenario, probably represents thrashing because the physical memory was too small for the working set, rather than a simple overhead increase. This seems to be confirmed by the fact that when enough memory is available on the computation host to hold the working set, the phenomenon does not occur (the 1 GB and larger scenarios).

Benchmark with Enterprise Application Workload




The quicksort benchmark suffers from two disadvantages: first, it uses small memory workloads (up to 2 GB in our evaluations); second, it is a synthetic benchmark that does not necessarily reflect real-world scenarios. We therefore chose to evaluate the performance of RRAIM in transparently scaling out a commercial in-memory database, SAP HANA, using the standard commercial benchmark. Note that this benchmark was not designed to test distributed systems at all, only the performance of the database itself. A low overhead in the benchmark's completion time when extending the available memory using RRAIM therefore reflects a transparent scale-out with minimal disruption to the running database. The average overhead of running HANA with HECA on a 128 GB VM instance, with 600 GB of data loaded and running a SAP-H variant, is about 4% in the 1:2 and 1:3 memory-ratio scenarios. When we introduce RRAIM, the overhead stays almost the same, as shown in the following table.