Showing posts with label hecatonchire. Show all posts

Thursday, November 06, 2014

On the emergence of hardware level API for dis-aggregated datacenter resources


The technologies enabling the modular dis-aggregated data-center concept are reaching a maturation point, as demonstrated by the latest technology showcases: RSA from Intel or, to a lesser extent, FusionCube / FusionSphere from Huawei. The need for such technologies arises from the fact that current cloud and data-center technology does not and cannot fulfill all the demands of cloud users, for multiple reasons. On one hand, as the number of cores and the amount of memory on servers continue to increase (over the next few years, we expect to see servers with hundreds of cores and terabytes of memory in common use), leasing an entire server may be too much for many customers' needs, with resources going to waste. On the other hand, with the emergence of a broad class of high-end applications for analytics, data mining, etc., the amount of memory and compute power available on a single server may be insufficient.

Moreover, leasing cloud infrastructure resources in a fixed combination of CPU, memory, etc. is only efficient when the customer's load requirements are both known in advance and constant over time. As neither of these conditions is met for the majority of customers, the ability to dynamically mix and match different amounts of compute, memory, and I/O resources is the natural evolutionary step after hyper-converged solutions.

The objective here is to address the gaps that prevent us from going beyond the boundaries of the traditional server, effectively breaking the barrier of using a single physical server for resources. In other words, we will be able to provision compute, memory, and I/O resources across multiple hosts within the same rack, consuming them dynamically in varying quantities at run-time instead of in fixed bundles. This will effectively enable a fluid transformation of current cloud infrastructures, built around fixed, commodity-sized physical nodes, into a very large pool of resources that can be tapped into and consumed without the classical server limitations.


Intel has been advertising its RSA stack for a while and it is finally becoming a reality. However, the really interesting part is not the technology. Indeed, to a certain extent, a lot of the technology already exists and enables us to implement resource pooling. We demonstrated that it was already feasible to deliver cloud memory pooling in the Hecatonchire Project, and other vendors such as TidalScale or ScaleMP already offer compute aggregation. However, the last two solutions are monolithic and lack the flexibility needed to fit the cloud consumption model; as a result, they are confined to a niche market.

What can really kick the dis-aggregated model into top gear is that Intel has now teamed up with a couple of vendors and has already created a draft hardware API specification called Redfish. Such an API can be leveraged by higher levels of the stack, allowing more intelligent, flexible, and predictable control over how, where, and when workloads (VMs, containers, standard processes/threads) get scheduled onto the hardware. In a certain way, this enables Mesos / Kubernetes to deliver enhanced scheduling for every hardware aspect.
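
As a rough illustration, here is a minimal sketch of how an orchestration layer might walk a Redfish service to inventory compute and memory resources. The endpoint and credentials are hypothetical, and the property names follow the published DMTF schema, which post-dates the draft discussed here:

```python
import requests

BMC_URL = "https://rack-manager.example.com"  # hypothetical Redfish endpoint
AUTH = ("admin", "password")                  # hypothetical credentials

def get(path):
    """Fetch one Redfish resource and return its JSON body."""
    resp = requests.get(BMC_URL + path, auth=AUTH, verify=False)
    resp.raise_for_status()
    return resp.json()

# Walk the Systems collection and print a compute/memory inventory.
for member in get("/redfish/v1/Systems")["Members"]:
    system = get(member["@odata.id"])
    cpus = system.get("ProcessorSummary", {}).get("Count")
    mem = system.get("MemorySummary", {}).get("TotalSystemMemoryGiB")
    print(f"{system['Id']}: {cpus} CPUs, {mem} GiB RAM")
```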

This brings some interesting capabilities to existing cloud technologies: cores and memory can be dynamically reallocated across workloads, which arguably greatly reduces the need for load balancing via live migration. You would then dynamically re-allocate the resources underneath (cores, memory) rather than moving the whole system, making the process more robust and less error-prone.

On the container side, it would solve a lot of the security headaches the community is now facing. Rather than going down the physical->virtual->container route, you could simply run physical->container with fine-grained per-core allocation using RSA / Redfish. Effectively, you would carve fine-grained slices out of your system in order to get maximal separation and performance guarantees. One could use this to separate critical applications while guaranteeing performance and isolation, something we can already do today with Jailhouse, at the cost of under-subscribing the system.
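
For a flavour of what fine-grained per-core allocation looks like on today's systems, here is a hedged sketch using the Linux cpuset cgroup (cgroup-v1 layout assumed, mounted at /sys/fs/cgroup/cpuset, root required); the group name and PID are purely illustrative:

```python
import os

CPUSET_ROOT = "/sys/fs/cgroup/cpuset"  # assumed cgroup-v1 cpuset mount

def pin_workload(group, cpus, mems, pid):
    """Give `pid` a dedicated set of cores via a new cpuset group."""
    path = os.path.join(CPUSET_ROOT, group)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "cpuset.cpus"), "w") as f:
        f.write(cpus)      # e.g. "4-7": the cores reserved for this workload
    with open(os.path.join(path, "cpuset.mems"), "w") as f:
        f.write(mems)      # the NUMA node(s) backing those cores
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(pid))  # move the workload's main process into the group

# Hypothetical usage: dedicate cores 4-7 on NUMA node 0 to PID 12345.
pin_workload("critical-app", "4-7", "0", 12345)
```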

If Intel is successful in disseminating its hardware API (or in having the other vendors standardize around it), it would allow the technology to leap forward, as its biggest enemy is the difficulty of porting management APIs from one fabric, compute, I/O, and storage model to another.


Saturday, August 03, 2013

Hecatonchire Version 0.2 Released!

Version 0.2 of Hecatonchire has been released.

What's New:
  1. Write-invalidate coherency model added for those who want to use Heca natively in their applications as distributed shared memory (a toy sketch follows below; more on that in a subsequent post)
  2. Significant improvements in page-transfer performance, as well as a number of bugs squashed
  3. Specific optimisations for KVM
  4. Scale-out memory mirroring
  5. Hybrid post-copy live migration
  6. Moved to Linux kernel 3.9 stable
  7. Moved to qemu-kvm 1.4 stable
  8. Added test / proof-of-concept tools (specifically for the new coherency model)
  9. Improved documentation
Voila!
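
As promised above, here is a toy model of the write-invalidate idea: readers may share a page freely, but a write first revokes every other copy. This is purely conceptual; the real implementation lives in the kernel module:

```python
class Page:
    """A guest page plus the set of nodes currently caching it."""
    def __init__(self):
        self.data = b""
        self.holders = set()

class Directory:
    """Toy write-invalidate directory: a write revokes all other copies."""
    def __init__(self):
        self.pages = {}

    def read(self, node, addr):
        page = self.pages.setdefault(addr, Page())
        page.holders.add(node)           # readers share the page freely
        return page.data

    def write(self, node, addr, data):
        page = self.pages.setdefault(addr, Page())
        for other in page.holders - {node}:
            print(f"invalidate {addr:#x} on node {other}")  # revoke copies
        page.holders = {node}            # writer becomes the sole holder
        page.data = data

d = Directory()
d.read(1, 0x1000)
d.read(2, 0x1000)
d.write(1, 0x1000, b"new")  # prints: invalidate 0x1000 on node 2
```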

We are now focusing on stabilizing the code as well as on robustness (we aim to make the code production-ready by 0.4). We are also starting significant work to integrate Hecatonchire so it can be transparently leveraged via a cloud stack, more specifically OpenStack.

You can download it here : http://hecatonchire.com/#download.html
You can see the install doc here: https://github.com/hecatonchire/heca-misc/tree/heca-0.2/docs
And finally the changelog here: http://hecatonchire.com/#changelog-0.2.html
Or you can just pull the master branch on GitHub: https://github.com/hecatonchire


Stay tuned for more in-depth blog posts on Hecatonchire.

Monday, June 24, 2013

Scaling Out SAP HANA In Memory DB with Hecatonchire

Many shared-memory parallel applications do not scale beyond a few tens of cores. However, many could benefit from large amounts of memory:
  • In-memory databases
  • Data mining
  • VMs
  • Scientific applications
  • etc
Moreover, the memory in the nodes of current clusters is often over-provisioned in order to fit the requirements of "any" application, and it remains unused most of the time. One of the objectives we are trying to achieve with project Hecatonchire is to unleash your memory-constrained application by using the memory of the other nodes. In this post we demonstrate how Hecatonchire enables users to have memory that grows with their business or applications, not ahead of them, while using high-volume components to build high-value systems and eliminating the physical limitations of the cloud, the data-center, or individual servers.

The Application: SAP HANA

HANA DB takes advantage of the low cost of main memory (RAM), the data-processing abilities of multi-core processors, and the fast data access of solid-state drives relative to traditional hard drives to deliver better performance for analytical and transactional applications. It offers a multi-engine query-processing environment which allows it to support both relational data (with row- and column-oriented physical representations in a hybrid engine) and graph and text processing for semi-structured and unstructured data management within the same system. HANA DB is 100% ACID compliant.



The Benchmark, Hardware and Methodology


  • Application: SAP HANA (in-memory database)
  • Workload: OLAP (TPC-H variant)
    • Data size:
      • For the Small and Medium instances: ~600 GB uncompressed (~30 GB compressed in RAM)
      • For the Large instance: 300 GB of compressed data (~2 TB of uncompressed data)
    • 18 different queries (TPC-H variant)
    • 15 iterations of each query set
  • Virtual machine:
    • Small size: 64 GB RAM - 32 vCPU
    • Medium size: 128 GB RAM - 40 vCPU
    • Large size: 1 TB RAM - 40 vCPU
    • Hypervisor: KVM

  • Hardware:
    • Servers with Intel Xeon Westmere CPUs
      • 4 sockets
      • 10 cores per socket
      • 1 TB or 512 GB RAM
    • Fabric:
      • InfiniBand QDR 40Gbps switch + Mellanox ConnectX-2



 Results 


The results demonstrate the scalability and performance of the Hecatonchire solution: for the small and large instances we notice only an average 3% overhead compared to the non-scaled-out benchmark.

Hecatonchire Overhead vs. Standard Virtualized HANA (Small Instance)

Moreover, we can see in the query breakdown that for very short (almost point) queries (13-14), the cost of accessing scaled-out memory is immediately perceptible. However, Hecatonchire demonstrated that we are able to smooth out the impact of scaling out for lengthier, memory-intensive queries.
Per-Query Overhead Breakdown for the Small Instance Benchmark



We officially tested Hecatonchire with HANA only up to 1 TB, and obtained results similar to those of the small and medium instances (3% overhead). We are currently running tests of 4 to 8 TB scale-out configurations in order to validate larger-scale scenarios, which require new features currently being added to the code. Stay tuned for a new and improved Heca!
1 TB Virtualized SAP HANA Scaled Out with Hecatonchire



Finally, we demonstrated that Hecatonchire scales very well when we spread the memory across multiple memory-provider nodes. We benefit not only from the increased bandwidth but also from the improved latency, with excellent performance results: a 1.5% overhead when two thirds of the memory is externalised.
Various Hecatonchire Deployment Scenarios for a Virtualized Medium-Instance HANA

Note: typically, running HANA on top of KVM adds by default a 5% to 8% overhead compared to a bare-metal instance; we didn't take this into account in the results, as we are only comparing virtualized against scaled-out virtualized.

Tuesday, April 09, 2013

Hecatonchire: High Availability and Redundancy of Memory Scale-Out Nodes



Delivering transparent memory scale-out presents certain challenges. Two main aspects can be compromised by the deployment of such a solution:
  1. Performance may be hindered, due to the generic nature of an implementation at the infrastructure level.
  2. Fault resilience might be compromised, as the frequency of failures increases along with the number of hosts participating in the system. The latter challenge is significant: the decrease in Mean Time Between Failures (MTBF) exposes a non-resilient system to an increased risk of down-times, which might be unacceptable in many business environments (a back-of-the-envelope illustration follows this list).
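
To illustrate the MTBF point, assume hosts fail independently with identical rates, so that \( \mathrm{MTBF}_{\mathrm{system}} \approx \mathrm{MTBF}_{\mathrm{host}} / N \). Pooling memory across 24 hosts that each average one failure every two years then leaves the pool with an expected failure roughly every month (24 months / 24 hosts). These numbers are illustrative, not measured.
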
In Hecatonchire we developed a solution revolving around a scale-out approach with mirroring of the remote memory on two or more parallel hosts.


Memory Scale-Out Redundancy Design in Hecatonchire


Design


To enable fault-resilience, we mirrored the memory stored at memory sponsors. All remote memory residing in memory sponsors is mirrored, and therefore a failure of one sponsor does not cause any loss of data. Furthermore, such a failure does not cause any delay or downtime in the operation of the running application.
When the kernel module faults on an address, it sends the request to both memory sponsors. Upon reception of the first response, the memory demander uses it to allow the application to continue operating. When a page is no longer required on the computing node, the kernel module swaps the page out. To do so, it sends the page to both memory sponsors and waits for validation that both of them have stored the content before discarding the page.
The biggest advantage of this approach is zero-downtime failover of memory sponsors. If one sponsor fails, the other sponsor continues responding to the memory demander’s requests for memory content. 
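
Here is a hedged, user-space sketch of that protocol (the real logic lives in the Hecatonchire kernel module and rides on RDMA): a fault races both sponsors and takes the first answer, while a swap-out completes only once both sponsors have acknowledged the page.

```python
import asyncio

class Sponsor:
    """A toy memory sponsor: a page store behind some network latency."""
    def __init__(self, latency):
        self.latency = latency
        self.pages = {}

async def fetch(sponsor, addr):
    await asyncio.sleep(sponsor.latency)        # stand-in for an RDMA read
    return sponsor.pages[addr]

async def fault(sponsors, addr):
    """Ask both sponsors; resolve the fault with whichever answers first."""
    tasks = [asyncio.ensure_future(fetch(s, addr)) for s in sponsors]
    done, pending = await asyncio.wait(tasks,
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                           # the slower copy is dropped
    return done.pop().result()

async def store(sponsor, addr, page):
    await asyncio.sleep(sponsor.latency)        # stand-in for an RDMA write
    sponsor.pages[addr] = page

async def swap_out(sponsors, addr, page):
    """Discard the local page only after *both* sponsors acknowledged it."""
    await asyncio.gather(*(store(s, addr, page) for s in sponsors))

async def demo():
    sponsors = [Sponsor(0.001), Sponsor(0.002)]
    await swap_out(sponsors, 0x1000, b"page")   # mirrored on both sponsors
    print(await fault(sponsors, 0x1000))        # served by the faster one

asyncio.run(demo())
```
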
Fault Tolerance in Hecatonchire

Tradeoff

The advantage of fault-resiliency in our approach is mitigated by several trade-offs. First of all, this approach needs twice the amount of memory compared to a non-resilient scheme. Furthermore, it potentially induces additional performance overhead: swap-out operations are delayed (waiting for responses from two hosts), and some additional computation and storage are needed.

Moreover, the mirroring approach doubles the required bandwidth. Luckily, today's fabrics have bandwidth capacities reaching 100Gbps, and it is currently rare for an application to consume more than 50Gbps continuously.

Note that our approach does not deal with fault tolerance for the main host running the application (the memory demander). The issue of VM replication has numerous solutions that are orthogonal to our approach, such as Kemari and Remus.

Benchmark with Synthetic Workload


To assess the impact of using the high-availability solution of Hecatonchire, we used a realistic scenario: running an application over the system and measuring the completion-time overhead, compared to running the application on top of sufficient local DRAM. For this scenario we used an application which performs a quicksort over an arbitrary amount of generated data, in a similar fashion to previous evaluations of DSM systems. This simulates a memory-intensive application, as memory is accessed rapidly, with minimal computation per access. Our two main KPIs are the overhead in completion time of using a redundant cluster to extend available memory, compared to using sufficient local, physical memory; and, more importantly, the trend of performance degradation as the cluster scales. Scaling was created both by extending the workload size - the amount of data to be sorted - and by limiting the amount of physical memory on the computation host. The ratio of workload size to available memory on the computation host reflects the scaling factor of the system.
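
For reference, a minimal re-creation of that synthetic workload might look as follows. numpy is our choice here, not necessarily what the original benchmark used, and the cgroup memory cap that forces paging to remote memory is applied outside the script:

```python
import time
import numpy as np

def quicksort_benchmark(size_mb):
    """Generate ~size_mb of random doubles, sort them, return elapsed time."""
    n = size_mb * 1024 * 1024 // 8        # float64 = 8 bytes per element
    data = np.random.rand(n)              # the working set to be sorted
    start = time.time()
    data.sort(kind="quicksort")           # memory-intensive, little compute
    return time.time() - start

# The 500 MB / 1 GB / 2 GB runs from the evaluation.
for size_mb in (500, 1024, 2048):
    print(f"{size_mb} MB sorted in {quicksort_benchmark(size_mb):.1f} s")
```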

Note: in the figures, the redundant memory cluster is called RRAIM, for Redundant Array of Inexpensive RAM.


The results show completion times for the quicksort benchmark with a total workload of 500 MB, 1 GB, and 2 GB, respectively. The cgroup memory on the computation host was capped such that the workload size was up to ×5 larger than the available memory, reflecting memory aggregation for the process.

The results reflect a very low overhead for using HA, compared to local DRAM. The table displays the overhead percentage per scaling factor. Quintupling (×5) the available memory for the quicksort application process using RRAIM results in an overhead of less than 5%; using a fault-tolerant schema still results in an overhead of less than 10%.
More importantly, the trend of performance degradation as the scaling factor increases reflects the good scalability of the solution: the difference between doubling the available memory and quintupling it is less than 2.5-4%. The singular steep incline in the 500 MB scenario probably represents thrashing, as the size of physical memory was too small for the working set, rather than a simple overhead increase. This seems to be confirmed by the fact that when enough memory is available on the computation host to hold the working set, the phenomenon does not occur (the 1 GB+ scenarios).

Benchmark with Enterprise Application Workload




The quicksort benchmark suffers from two disadvantages: the first is that it uses small memory workloads (up to 2 GB in our evaluations); the second is that it is a synthetic benchmark, not necessarily reflecting real-world scenarios. We therefore chose to evaluate the performance of RRAIM in transparently scaling out a commercial in-memory database - SAP HANA - using the standard commercial benchmark. Note that this benchmark was not designed to test distributed systems at all, only the performance of the database itself. A low overhead in the completion time of the benchmark when extending the available memory using RRAIM reflects a transparent scale-out, with minimal disruption to the running database. The average overhead of running HANA with Heca on a 128 GB VM instance with 600 GB of data loaded and running a SAP-H variant is about 4% in the 1:2 and 1:3 memory-ratio scenarios. When we introduce RRAIM, the overhead stays almost the same, as shown in the following table.