Showing posts with label hecatonchire. Show all posts

Thursday, November 06, 2014

On the emergence of hardware level API for dis-aggregated datacenter resources


The technologies enabling the modular dis-aggregated data-center concept are reaching a maturation point, as demonstrated by the latest technology showcases: RSA from Intel or, to a lesser extent, FusionCube / FusionSphere from Huawei. The need for such technologies arises from the fact that current cloud and data-center technology does not and cannot fulfill all the demands of cloud users, for multiple reasons. On one hand, as the number of cores and the amount of memory on servers continue to increase (over the next few years, we expect to see servers with hundreds of cores and terabytes of memory in common use), leasing an entire server may be too much for many customers' needs, with resources going to waste. On the other hand, with the emergence of a broad class of high-end applications for analytics, data mining, etc., the amount of memory and compute power available on a single server may be insufficient.

Moreover, leasing cloud infrastructure resources in a fixed combination of CPU, memory, etc. is only efficient when the customer's load requirements are both known in advance and constant over time. As neither of these conditions is met for the majority of customers, the ability to dynamically mix and match different amounts of compute, memory, and I/O resources is the natural evolutionary step after hyper-converged solutions.

The objective here is to address the gaps that prevent us from going beyond the boundaries of the traditional server, effectively breaking the barrier of using a single physical server for resources. In other words, we will be able to provision compute, memory, and I/O resources across multiple hosts within the same rack, consuming them dynamically in varying quantities at run-time instead of in fixed bundles. This will effectively enable a fluid transformation of current cloud infrastructures, built around fixed, commodity-sized physical nodes, into a very large pool of resources that can be tapped into and consumed without the classical server limitations.


Intel has been advertising its RSA stack for a while and it is finally becoming a reality. However, the really interesting part is not the technology. Indeed, to a certain extent, a lot of the technology already exists and enables us to implement resource pooling. We demonstrated that it was already feasible to deliver cloud memory pooling in the Hecatonchire Project, and other vendors such as TidalScale or ScaleMP already offer compute aggregation. However, the last two solutions are monolithic and lack the flexibility needed to fit the cloud consumption model; as a result, they are confined to a niche market.

What can really kick the dis-aggregated model into top gear is that Intel has now teamed up with a couple of vendors and has already created a draft hardware API specification called Redfish. Such an API can be leveraged by higher levels of the stack, allowing more intelligent, flexible, and predictable control over how, where, and when workloads (VMs, containers, standard processes/threads) get scheduled onto the hardware. In a certain way, this enables Mesos / Kubernetes to deliver enhanced scheduling for every hardware aspect.
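
As a rough illustration, here is a minimal sketch of how an orchestration layer might walk a Redfish service to inventory compute and memory resources. The endpoint and credentials are hypothetical, and the property names follow the published DMTF schema, which post-dates the draft discussed here:

```python
import requests

BMC_URL = "https://rack-manager.example.com"  # hypothetical Redfish endpoint
AUTH = ("admin", "password")                  # hypothetical credentials

def get(path):
    """Fetch one Redfish resource and return its JSON body."""
    resp = requests.get(BMC_URL + path, auth=AUTH, verify=False)
    resp.raise_for_status()
    return resp.json()

# Walk the Systems collection and print a compute/memory inventory.
for member in get("/redfish/v1/Systems")["Members"]:
    system = get(member["@odata.id"])
    cpus = system.get("ProcessorSummary", {}).get("Count")
    mem = system.get("MemorySummary", {}).get("TotalSystemMemoryGiB")
    print(f"{system['Id']}: {cpus} CPUs, {mem} GiB RAM")
```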

This brings some interesting capabilities to existing cloud technologies: cores and memory can be dynamically reallocated across workloads, which arguably greatly reduces the need for load balancing via live migration. You would then dynamically re-allocate the resources underneath (cores, memory) rather than moving the whole system, making the process more robust and less error-prone.

On the container side, it would solve a lot of the security headaches the community is now facing. Rather than going down the physical->virtual->container route, you could simply run physical->container with fine-grained per-core allocation using RSA / Redfish. Effectively, you would carve fine-grained slices out of your system in order to get maximal separation and performance guarantees. One could use this to separate critical applications while guaranteeing performance and isolation, something we can already do today with Jailhouse, at the cost of under-subscribing the system.
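
For a flavour of what fine-grained per-core allocation looks like on today's systems, here is a hedged sketch using the Linux cpuset cgroup (cgroup-v1 layout assumed, mounted at /sys/fs/cgroup/cpuset, root required); the group name and PID are purely illustrative:

```python
import os

CPUSET_ROOT = "/sys/fs/cgroup/cpuset"  # assumed cgroup-v1 cpuset mount

def pin_workload(group, cpus, mems, pid):
    """Give `pid` a dedicated set of cores via a new cpuset group."""
    path = os.path.join(CPUSET_ROOT, group)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "cpuset.cpus"), "w") as f:
        f.write(cpus)      # e.g. "4-7": the cores reserved for this workload
    with open(os.path.join(path, "cpuset.mems"), "w") as f:
        f.write(mems)      # the NUMA node(s) backing those cores
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(pid))  # move the workload's main process into the group

# Hypothetical usage: dedicate cores 4-7 on NUMA node 0 to PID 12345.
pin_workload("critical-app", "4-7", "0", 12345)
```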

If Intel is successful in disseminating its hardware API (or in having the other vendors standardize around it), it would allow the technology to leap forward, as its biggest enemy is the difficulty of porting management APIs from one fabric, compute, I/O, and storage model to another.


Saturday, August 03, 2013

Hecatonchire Version 0.2 Released!

Version 0.2 of Hecatonchire has been released.

What's New:
  1. Write-invalidate coherency model added for those who want to use Heca natively in their applications as distributed shared memory (a toy sketch follows below; more on that in a subsequent post)
  2. Significant improvements in page-transfer performance, as well as a number of bugs squashed
  3. Specific optimisations for KVM
  4. Scale-out memory mirroring
  5. Hybrid post-copy live migration
  6. Moved to Linux kernel 3.9 stable
  7. Moved to qemu-kvm 1.4 stable
  8. Added test / proof-of-concept tools (specifically for the new coherency model)
  9. Improved documentation
Voila!
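
As promised above, here is a toy model of the write-invalidate idea: readers may share a page freely, but a write first revokes every other copy. This is purely conceptual; the real implementation lives in the kernel module:

```python
class Page:
    """A guest page plus the set of nodes currently caching it."""
    def __init__(self):
        self.data = b""
        self.holders = set()

class Directory:
    """Toy write-invalidate directory: a write revokes all other copies."""
    def __init__(self):
        self.pages = {}

    def read(self, node, addr):
        page = self.pages.setdefault(addr, Page())
        page.holders.add(node)           # readers share the page freely
        return page.data

    def write(self, node, addr, data):
        page = self.pages.setdefault(addr, Page())
        for other in page.holders - {node}:
            print(f"invalidate {addr:#x} on node {other}")  # revoke copies
        page.holders = {node}            # writer becomes the sole holder
        page.data = data

d = Directory()
d.read(1, 0x1000)
d.read(2, 0x1000)
d.write(1, 0x1000, b"new")  # prints: invalidate 0x1000 on node 2
```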

We are now focusing on stabilizing the code as well as on robustness (we aim to make the code production-ready by 0.4). We are also starting significant work to integrate Hecatonchire so it can be transparently leveraged via a cloud stack, more specifically OpenStack.

You can download it here : http://hecatonchire.com/#download.html
You can see the install doc here: https://github.com/hecatonchire/heca-misc/tree/heca-0.2/docs
And finally the changelog here: http://hecatonchire.com/#changelog-0.2.html
Or you can just pull the master branch on GitHub: https://github.com/hecatonchire


Stay tuned for more in-depth blog posts on Hecatonchire.

Monday, June 24, 2013

Scaling Out SAP HANA In Memory DB with Hecatonchire

Many shared-memory parallel applications do not scale beyond a few tens of cores. However, many could benefit from large amounts of memory:
  • In-memory databases
  • Data mining
  • VMs
  • Scientific applications
  • etc
Moreover, the memory in the nodes of current clusters is often over-provisioned in order to fit the requirements of "any" application, and it remains unused most of the time. One of the objectives we are trying to achieve with project Hecatonchire is to unleash your memory-constrained application by using the memory of the other nodes. In this post we demonstrate how Hecatonchire enables users to have memory that grows with their business or applications, not ahead of them, while using high-volume components to build high-value systems and eliminating the physical limitations of the cloud, the data-center, or individual servers.

The Application: SAP HANA

HANA DB takes advantage of the low cost of main memory (RAM), the data-processing abilities of multi-core processors, and the fast data access of solid-state drives relative to traditional hard drives to deliver better performance for analytical and transactional applications. It offers a multi-engine query-processing environment which allows it to support both relational data (with row- and column-oriented physical representations in a hybrid engine) and graph and text processing for semi-structured and unstructured data management within the same system. HANA DB is 100% ACID compliant.



The Benchmark, Hardware and Methodology


  • Application: SAP HANA (in-memory database)
  • Workload: OLAP (TPC-H variant)
    • Data size:
      • For the Small and Medium instances: ~600 GB uncompressed (~30 GB compressed in RAM)
      • For the Large instance: 300 GB of compressed data (~2 TB of uncompressed data)
    • 18 different queries (TPC-H variant)
    • 15 iterations of each query set
  • Virtual machine:
    • Small size: 64 GB RAM - 32 vCPU
    • Medium size: 128 GB RAM - 40 vCPU
    • Large size: 1 TB RAM - 40 vCPU
    • Hypervisor: KVM

  • Hardware:
    • Servers with Intel Xeon Westmere CPUs
      • 4 sockets
      • 10 cores per socket
      • 1 TB or 512 GB RAM
    • Fabric:
      • InfiniBand QDR 40Gbps switch + Mellanox ConnectX-2



 Results 


The results demonstrate the scalability and performance of the Hecatonchire solution: for the small and large instances we notice only an average 3% overhead compared to the non-scaled-out benchmark.

Hecatonchire Overhead vs. Standard Virtualized HANA (Small Instance)

Moreover, we can see in the query breakdown that for very short (almost point) queries (13-14), the cost of accessing scaled-out memory is immediately perceptible. However, Hecatonchire demonstrated that we are able to smooth out the impact of scaling out for lengthier, memory-intensive queries.
Per-Query Overhead Breakdown for the Small Instance Benchmark



We officially tested Hecatonchire with HANA only up to 1 TB, and obtained results similar to those of the small and medium instances (3% overhead). We are currently running tests of 4 to 8 TB scale-out configurations in order to validate larger-scale scenarios, which require new features currently being added to the code. Stay tuned for a new and improved Heca!
1 TB Virtualized SAP HANA Scaled Out with Hecatonchire



Finally, we demonstrated that Hecatonchire scales very well when we spread the memory across multiple memory-provider nodes. We benefit not only from the increased bandwidth but also from the improved latency, with excellent performance results: a 1.5% overhead when two thirds of the memory is externalised.
Various Hecatonchire Deployment Scenarios for a Virtualized Medium-Instance HANA

Note: typically, running HANA on top of KVM adds by default a 5% to 8% overhead compared to a bare-metal instance; we didn't take this into account in the results, as we are only comparing virtualized against scaled-out virtualized.

Tuesday, April 09, 2013

Hecatonchire: High Availability and Redundancy of Memory Scale-Out Nodes



Delivering transparent memory scale-out presents certain challenges. Two main aspects can be compromised by the deployment of such a solution:
  1. Performance may be hindered, due to the generic nature of an implementation at the infrastructure level.
  2. Fault resilience might be compromised, as the frequency of failures increases along with the number of hosts participating in the system. The latter challenge is significant: the decrease in Mean Time Between Failures (MTBF) exposes a non-resilient system to an increased risk of down-times, which might be unacceptable in many business environments (a back-of-the-envelope illustration follows this list).
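
To illustrate the MTBF point, assume hosts fail independently with identical rates, so that \( \mathrm{MTBF}_{\mathrm{system}} \approx \mathrm{MTBF}_{\mathrm{host}} / N \). Pooling memory across 24 hosts that each average one failure every two years then leaves the pool with an expected failure roughly every month (24 months / 24 hosts). These numbers are illustrative, not measured.
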
In Hecatonchire we developed a solution revolving around a scale-out approach with mirroring of the remote memory on two or more parallel hosts.


Memory Scale-Out Redundancy Design in Hecatonchire


Design


To enable fault-resilience, we mirrored the memory stored at memory sponsors. All remote memory residing in memory sponsors is mirrored, and therefore a failure of one sponsor does not cause any loss of data. Furthermore, such a failure does not cause any delay or downtime in the operation of the running application.
When the kernel module faults on an address, it sends the request to both memory sponsors. Upon reception of the first response, the memory demander uses it to allow the application to continue operating. When a page is no longer required on the computing node, the kernel module swaps the page out. To do so, it sends the page to both memory sponsors and waits for validation that both of them have stored the content before discarding the page.
The biggest advantage of this approach is zero-downtime failover of memory sponsors. If one sponsor fails, the other sponsor continues responding to the memory demander’s requests for memory content. 
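
Here is a hedged, user-space sketch of that protocol (the real logic lives in the Hecatonchire kernel module and rides on RDMA): a fault races both sponsors and takes the first answer, while a swap-out completes only once both sponsors have acknowledged the page.

```python
import asyncio

class Sponsor:
    """A toy memory sponsor: a page store behind some network latency."""
    def __init__(self, latency):
        self.latency = latency
        self.pages = {}

async def fetch(sponsor, addr):
    await asyncio.sleep(sponsor.latency)        # stand-in for an RDMA read
    return sponsor.pages[addr]

async def fault(sponsors, addr):
    """Ask both sponsors; resolve the fault with whichever answers first."""
    tasks = [asyncio.ensure_future(fetch(s, addr)) for s in sponsors]
    done, pending = await asyncio.wait(tasks,
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                           # the slower copy is dropped
    return done.pop().result()

async def store(sponsor, addr, page):
    await asyncio.sleep(sponsor.latency)        # stand-in for an RDMA write
    sponsor.pages[addr] = page

async def swap_out(sponsors, addr, page):
    """Discard the local page only after *both* sponsors acknowledged it."""
    await asyncio.gather(*(store(s, addr, page) for s in sponsors))

async def demo():
    sponsors = [Sponsor(0.001), Sponsor(0.002)]
    await swap_out(sponsors, 0x1000, b"page")   # mirrored on both sponsors
    print(await fault(sponsors, 0x1000))        # served by the faster one

asyncio.run(demo())
```
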
Fault Tolerance in Hecatonchire

Tradeoff

The advantage of fault-resiliency in our approach is mitigated by several trade-offs. First of all, this approach needs twice the amount of memory compared to a non-resilient scheme. Furthermore, it potentially induces additional performance overhead: swap-out operations are delayed (waiting for responses from two hosts), and some additional computation and storage are needed.

Moreover, the mirroring approach doubles the required bandwidth. Luckily, today's fabrics have bandwidth capacities reaching 100Gbps, and it is currently rare for an application to consume more than 50Gbps continuously.

Note that our approach does not deal with fault tolerance for the main host running the application (the memory demander). The issue of VM replication has numerous solutions that are orthogonal to our approach, such as Kemari and Remus.

Benchmark with Synthetic Workload


To assess the impact of using the high-availability solution of Hecatonchire, we used a realistic scenario: running an application over the system and measuring the completion-time overhead, compared to running the application on top of sufficient local DRAM. For this scenario we used an application which performs a quicksort over an arbitrary amount of generated data, in a similar fashion to previous evaluations of DSM systems. This simulates a memory-intensive application, as memory is accessed rapidly, with minimal computation per access. Our two main KPIs are the overhead in completion time of using a redundant cluster to extend available memory, compared to using sufficient local, physical memory; and, more importantly, the trend of performance degradation as the cluster scales. Scaling was created both by extending the workload size - the amount of data to be sorted - and by limiting the amount of physical memory on the computation host. The ratio of workload size to available memory on the computation host reflects the scaling factor of the system.
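
For reference, a minimal re-creation of that synthetic workload might look as follows. numpy is our choice here, not necessarily what the original benchmark used, and the cgroup memory cap that forces paging to remote memory is applied outside the script:

```python
import time
import numpy as np

def quicksort_benchmark(size_mb):
    """Generate ~size_mb of random doubles, sort them, return elapsed time."""
    n = size_mb * 1024 * 1024 // 8        # float64 = 8 bytes per element
    data = np.random.rand(n)              # the working set to be sorted
    start = time.time()
    data.sort(kind="quicksort")           # memory-intensive, little compute
    return time.time() - start

# The 500 MB / 1 GB / 2 GB runs from the evaluation.
for size_mb in (500, 1024, 2048):
    print(f"{size_mb} MB sorted in {quicksort_benchmark(size_mb):.1f} s")
```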

Note: in the figures, the redundant memory cluster is called RRAIM, for Redundant Array of Inexpensive RAM.


The results show completion times for the quicksort benchmark with a total workload of 500 MB, 1 GB, and 2 GB, respectively. The cgroup memory on the computation host was capped such that the workload size was up to ×5 larger than the available memory, reflecting memory aggregation for the process.

The results reflect a very low overhead for using HA, compared to local DRAM. The table displays the overhead percentage per scaling factor. Quintupling (×5) the available memory for the quicksort application process using RRAIM results in an overhead of less than 5%; using a fault-tolerant schema still results in an overhead of less than 10%.
More importantly, the trend of performance degradation as the scaling factor increases reflects the good scalability of the solution: the difference between doubling the available memory and quintupling it is less than 2.5-4%. The singular steep incline in the 500 MB scenario probably represents thrashing, as the size of physical memory was too small for the working set, rather than a simple overhead increase. This seems to be confirmed by the fact that when enough memory is available on the computation host to hold the working set, the phenomenon does not occur (the 1 GB+ scenarios).

Benchmark with Enterprise Application Workload




The quicksort benchmark suffers from two disadvantages: the first is that it uses small memory workloads (up to 2 GB in our evaluations); the second is that it is a synthetic benchmark, not necessarily reflecting real-world scenarios. We therefore chose to evaluate the performance of RRAIM in transparently scaling out a commercial in-memory database - SAP HANA - using the standard commercial benchmark. Note that this benchmark was not designed to test distributed systems at all, only the performance of the database itself. A low overhead in the completion time of the benchmark when extending the available memory using RRAIM reflects a transparent scale-out, with minimal disruption to the running database. The average overhead of running HANA with Heca on a 128 GB VM instance with 600 GB of data loaded and running a SAP-H variant is about 4% in the 1:2 and 1:3 memory-ratio scenarios. When we introduce RRAIM, the overhead stays almost the same, as shown in the following table.