Tuesday, November 19, 2013

Openstack consumption model: DiY vs Enterprise

Here are some personal comments on Openstack adoption in the industry and the statistics behind it.

Openstack statistics: 

While the Openstack consortium is putting great effort into publicizing its statistics to market Openstack and broaden its adoption (you can see the latest batch here: http://www.openstack.org/blog/2013/11/openstack-user-survey-statistics-november-2013/ ), you have to take them with a grain of salt.

One thing that is not immediately visible, but that you can guess, is the consumption model of Openstack. Openstack cloud deployments can be broadly classified into two categories: do it yourself (DiY), and (semi) contracted out (aka enterprise cloud).

DiY vs Enterprise cloud consumption model

My own home-brew definitions (partially adapted/stolen from some of Simon Wardley's blog posts; check out his great blog):
  • DiY: Complete redesign of an infrastructure architecture, most of the time as close as possible to a true public cloud architecture (or an exact copy in the case of public cloud providers), and/or often tailored to very specific needs. These solutions aim for the lowest possible cost by reusing as many of the open source tools out there as possible, combined with custom solutions.
  • Enterprise: "It's like cloud but the virtual infrastructure has higher levels of resilience at a high cost when compared to using cheap components in a distributed fashion. The core reason behind Enterprise Cloud is often to avoid any transitional costs of redesigning architecture i.e. it's principally about reducing disruption risks (including previous investment and political capital) by making the switch to a utility provider relatively painless." I.e. they want to be able to run "legacy" applications alongside new ones without having to throw away a lot of their existing investment (HW, software, skill set, HR, etc.). These Enterprise cloud solutions are sometimes delivered/consumed as (heavily) customized packaged Openstack solutions from specific vendors: HP, Mirantis, etc. These solutions are slowly making their way into the heavily regulated company segment (compliance issues require on-premise deployment).
DiY Openstack uses internal resources and knowledge; in order to minimize TCO and maximize ROI, these deployments rely heavily on open source solutions and adapt them to their needs. This is why most of the time DiY clouds are almost 100% home-grown solutions.
On the other hand, the Enterprise ones tend to be hybrids, mixing open source solutions with "enterprise" ones, sometimes sprinkled with a dose of consulting. End consumers of these clouds are sometimes not even aware that the solution they are buying is based on Openstack (e.g. in PaaS / managed services scenarios). For example, Swisscom's cloud managed services (offering SAP, ERP, BW, etc.) rely on Piston Cloud, which relies on Openstack.
Often some key aspects are bought separately for their enterprise features (or performance, or support), e.g. storage: Coraid (pNFS) / Inktank (Ceph); networking: Arista.

Openstack's user base is skewed:

Now, why is this distinction relevant to Openstack statistics? Cost savings, open technology, and avoiding vendor lock-in are the main drivers, which should hint at which of the two models is predominantly represented among the users surveyed: DiY.
As a result, the stats are quite skewed toward home-grown solutions where most of the cloud stack is built almost exclusively from open source components. This is why Ceph, Ubuntu, and other open source solutions heavily dominate the stats while NOT dominating cloud spending (reinforced by the fact that most of the surveyed users run a 1-50 node cloud, ~60%+ of the total, and only 20% of the clouds are used for production purposes). I would love to see a Venn diagram of the different dimensions; if anybody has access to the raw survey data, I would be happy to build the diagram for them.
For example, in the cloud storage of some enterprise Openstack deployments I came across, you would see all the usual suspects (EMC, NetApp, HP, etc.). However, they were sometimes relegated to a secondary role while a number of newcomers took the front stage. These newcomers obviously leverage the emerging market stage / first-mover advantage to grab market share: Inktank (the commercial Ceph version), Coraid (pNFS, popular as it allows a natural transition from classical NFS setups), and finally the wild card, Intel, which is trying to push its Lustre solution (from the Whamcloud acquisition) into cloud storage. Note: I am not too aware of how GlusterFS is faring.

Openstack, a complex and rapidly evolving environment:

As you can see, the Openstack ecosystem is heavily dominated by companies using the DiY model. This is further fuelled by the current state of Openstack, which is more a toolkit that you need to (heavily) customize to build your own cloud than a self-contained solution like CloudStack, Eucalyptus, Flexiant, etc. It makes me feel like Openstack is the Debian equivalent for cloud.
However, its adoption is growing heavily in the corporate world (and as a result the fragmentation risk too): Oracle has its own home-brew version (but contributes nothing back and keeps it completely closed), SAP is starting some efforts, Intel is a big proponent (see IDF 13), and big managed services players are using it as their basic building block (T-Systems, Swisscom, Huawei, IBM, etc.).

A tough ecosystem for startups and established companies alike:

A complex ecosystem, fragmentation, and a heavy reliance on custom solutions make it tricky for companies to position themselves within this environment. Aim too low in the stack and they might be completely ignored due to the heavy DiY proportion, leading to a long struggle (or death) while hoping the crossing of the desert ends soon. Aim too high and they end up fighting a multitude of Openstack clones as well as other cloud solutions.
There is no right decision there; maybe a coherent layered approach across the Openstack ecosystem would enable the creation of a consistent revenue stream while limiting the race-to-the-bottom competition (competing on price). I will probably expand on this concept in a follow-up blog post.

PS: as I write this post I came across the following Gartner blog post, which echoes in some ways my thoughts on the Openstack ecosystem: http://blogs.gartner.com/alessandro-perilli/what-i-saw-at-the-openstack-summit/ . Again, this is to be taken with a grain of salt, as Gartner has a long history of being a big VMware supporter.

Saturday, August 03, 2013

Hecatonchire Version 0.2 Released!

Version 0.2 of Hecatonchire has been released.

What's New:
  1. Write-invalidate coherency model, for those who want to use Heca natively in their applications as distributed shared memory (more on that in a subsequent post)
  2. Significant improvement in page-transfer performance, as well as a number of bugs squashed
  3. Specific optimisations for KVM
  4. Scale-out memory mirroring
  5. Hybrid post-copy live migration
  6. Moved to Linux kernel 3.9 stable
  7. Moved to qemu-kvm 1.4 stable
  8. Added test / proof-of-concept tools (specifically for the new coherency model)
  9. Improved documentation

We are now focusing on stabilizing the code and improving its robustness (we aim to make the code production-ready by 0.4). We are also starting significant work on integrating Hecatonchire so it can be transparently leveraged via a cloud stack, more specifically Openstack.

You can download it here: http://hecatonchire.com/#download.html
You can see the install doc here: https://github.com/hecatonchire/heca-misc/tree/heca-0.2/docs
And finally the changelog here: http://hecatonchire.com/#changelog-0.2.html
Or you can just pull the master branch on GitHub: https://github.com/hecatonchire

Stay tuned for more in-depth blog posts on Hecatonchire.

Monday, June 24, 2013

Scaling Out SAP HANA In Memory DB with Hecatonchire

Many shared-memory parallel applications do not scale beyond a few tens of cores. However, many would benefit from large amounts of memory:
  • In-memory databases
  • Datamining
  • VM
  • Scientific applications
  • etc
Moreover, the memory in the nodes of current clusters is often over-provisioned in order to fit the requirements of "any" application, and remains unused most of the time. One of the objectives we try to achieve with project Hecatonchire is to unleash your memory-constrained application by using the memory of the rest of the nodes. In this post we demonstrate how Hecatonchire enables users to have memory that grows with their business or applications, not before, while using high-volume components to build high-value systems and eliminating physical limitations of the cloud, datacenter, or server.

The application: SAP HANA

HANA DB takes advantage of the low cost of main memory (RAM), data processing abilities of multi-core processors and the fast data access of solid-state drives relative to traditional hard drives to deliver better performance of analytical and transactional applications. It offers a multi-engine query processing environment which allows it to support both relational data (with both row- and column-oriented physical representations in a hybrid engine) as well as graph and text processing for semi- and unstructured data management within the same system. HANA DB is 100% ACID compliant.

The benchmark, hardware and methodology

  • Application: SAP HANA (in-memory database)
  • Workload: OLAP (TPC-H variant)
    • Data size:
      • Small and medium instances: ~600 GB uncompressed (~30 GB compressed in RAM)
      • Large instance: 300 GB compressed data (2 TB of uncompressed data)
    • 18 different queries (TPC-H variant)
    • 15 iterations of each query set
  • Virtual machines:
    • Small: 64 GB RAM, 32 vCPU
    • Medium: 128 GB RAM, 40 vCPU
    • Large: 1 TB RAM, 40 vCPU
    • Hypervisor: KVM

  • Hardware:
    • Servers with Intel Xeon Westmere
      • 4 sockets
      • 10 cores per socket
      • 1 TB or 512 GB RAM
    • Fabric:
      • Infiniband QDR 40 Gbps switch + Mellanox ConnectX-2


The results demonstrate the scalability and performance of the Hecatonchire solution: for the small and large instances we only notice an average 3% overhead compared to the non-scaled-out benchmark.

Hecatonchire overhead vs standard virtualised HANA (small instance)

Moreover, we can notice in the query breakdown that for very short (almost point) queries (13-14) the cost of accessing scaled-out memory is immediately perceptible. However, Hecatonchire demonstrated that it is able to smooth out the impact of scaling out for lengthier and more memory-intensive queries.
Per Query Overhead breakdown for Small instance Benchmark

We officially tested Hecatonchire with HANA only up to 1 TB and obtained results similar to those for the small and medium instances (3% overhead). We are currently running tests of 4 to 8 TB scale-out solutions in order to validate larger-scale scenarios, which require new features that are currently being added to the code. Stay tuned for a new and improved Heca!
1 TB Virtualized SAP HANA scaled out with Hecatonchire

Finally, we demonstrated that Hecatonchire scales very well when we spread the memory across multiple memory-provider nodes. We benefit not only from the increased bandwidth but also from the improved latency, with excellent performance results: a 1.5% overhead when two thirds of the memory is externalised.
Various Hecatonchire deployment scenarios for a virtualized medium-instance HANA

Note: typically, running HANA on top of KVM adds a 5% to 8% overhead by default compared to a bare-metal instance; we didn't take this into account in the results, as we are only comparing virtualized against scaled-out virtualized.

Tuesday, April 09, 2013

Hecatonchire: high availability and redundancy of memory scale-out nodes

Delivering transparent memory scale-out presents certain challenges. Two main aspects can be compromised by the deployment of such a solution:
  1. Performance may be hindered, due to the generic nature of implementation at the infrastructure level.
  2. Fault Resilience might be compromised, as the frequency of failures increases along with the increased number of hosts participating in the system. The latter challenge is significant: the decrease in Mean Time Between Failures (MTBF) exposes a non-resilient system to an increased risk of down-times, which might be unacceptable in many business environments.
In Hecatonchire we developed a solution revolving around a scale-out approach with mirroring of the remote memory on two or more parallel hosts.
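To make the resilience argument concrete, here is a small back-of-the-envelope sketch in Python of how the chance of a failure grows with the number of memory hosts, and how mirroring each page on two sponsors shrinks the probability of actually losing data. The probabilities used are purely illustrative assumptions, not measured Hecatonchire figures:

```python
# Illustrative failure-probability sketch; the numbers below are assumed
# for the example, not measured from any real deployment.

def p_any_host_fails(p_single: float, n_hosts: int) -> float:
    """Probability that at least one of n independent hosts fails."""
    return 1.0 - (1.0 - p_single) ** n_hosts

def p_data_loss_mirrored(p_single: float, n_pairs: int) -> float:
    """With mirroring, data is lost only if BOTH sponsors of a pair fail."""
    return 1.0 - (1.0 - p_single ** 2) ** n_pairs

p = 0.01  # assumed chance that a given sponsor fails during some window
print(p_any_host_fails(p, 8))      # ~0.077: failures get likelier with scale
print(p_data_loss_mirrored(p, 4))  # ~0.0004: but mirrored loss stays rare
```

This is the trade-off the design banks on: the MTBF of "some host" drops as the cluster grows, but the MTBF of "both mirrors of the same page" stays far lower.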

Memory Scale out redundancy design in Hecatonchire


To enable fault-resilience, we mirrored the memory stored at memory sponsors. All remote memory, residing in memory sponsors, is mirrored, and therefore failure in one sponsor does not cause any loss of data. Furthermore, such failure does not cause any delay or downtime in the operation of the running application.
When the kernel module faults on an address, it sends the request to both memory sponsors. Upon reception of the first response, the memory demander uses it to allow the application to continue operating. When a page is not required anymore on the computing node, the kernel module swaps the page out. To do so, it sends the page to both memory sponsors, and waits for validation that both of them stored the content, before discarding the page.
The biggest advantage of this approach is zero-downtime failover of memory sponsors. If one sponsor fails, the other sponsor continues responding to the memory demander’s requests for memory content. 
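The fault-in/swap-out protocol described above can be sketched as a simplified user-space model. The real logic lives in the Hecatonchire kernel module and runs over RDMA; the class and method names here are illustrative, not the project's actual API:

```python
# Simplified model of the mirroring protocol: fault-in uses the first
# sponsor that answers, swap-out requires an ack from ALL sponsors
# before the local page may be discarded. Names are illustrative.

class Sponsor:
    def __init__(self):
        self.pages = {}
        self.alive = True

    def read(self, addr):
        return self.pages.get(addr) if self.alive else None

    def store(self, addr, data):
        if not self.alive:
            return False
        self.pages[addr] = data
        return True  # ack

class Demander:
    def __init__(self, sponsors):
        self.sponsors = sponsors  # two (or more) mirrored sponsors

    def fault_in(self, addr):
        # Request the page from every sponsor; use the first reply that
        # arrives (here: the first live sponsor), so one failure is invisible.
        for s in self.sponsors:
            page = s.read(addr)
            if page is not None:
                return page
        raise RuntimeError("all sponsors failed: data lost")

    def swap_out(self, addr, data):
        # Only discard the local copy once every sponsor has acked the store.
        if not all(s.store(addr, data) for s in self.sponsors):
            raise RuntimeError("swap-out not fully mirrored")

d = Demander([Sponsor(), Sponsor()])
d.swap_out(0x1000, b"page-content")
d.sponsors[0].alive = False                    # one sponsor dies...
assert d.fault_in(0x1000) == b"page-content"   # ...fault-in still succeeds
```

The asymmetry is the point: reads tolerate a failure for free (first answer wins), while writes pay the double-ack cost so that a failed sponsor can never hold the only copy of a page.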
Fault Tolerance in Hecatonchire


The advantage of fault-resiliency in our approach is mitigated by several trade-offs. First of all, this approach needs twice the amount of memory compared to a non-resilient scheme. Furthermore, it potentially induces an additional performance overhead: swap-out operations are delayed (waiting for responses from two hosts), and some additional computation and storage are needed.

Moreover, the mirroring approach doubles the needed bandwidth. Luckily, today's fabrics have bandwidth capacities reaching 100 Gbps, and it is currently rare for an application to consume more than 50 Gbps continuously.

Note that our approach does not deal with fault tolerance for the main host running the application (the memory demander). The issue of VM replication has numerous solutions that are orthogonal to our approach, such as Kemari and Remus.

Benchmark with Synthetic Workload

To assess the impact of using the high-availability solution of Hecatonchire, we used a realistic scenario: running an application over the system and measuring the completion-time overhead compared to running the application on top of sufficient local DRAM. For this scenario we used an application that performs a quicksort over an arbitrary amount of generated data, in a similar fashion to previous evaluations of DSM systems. This simulates a memory-intensive application, as memory is accessed rapidly, with minimal computation per access. Our two main KPIs are the completion-time overhead of using a redundant cluster to extend available memory, compared to using sufficient local physical memory; and, more importantly, the trend of performance degradation as the cluster scales. Scaling was created both by extending the workload size (the amount of data to be sorted) and by limiting the amount of physical memory on the computation host. The ratio of workload size to available memory on the computation host reflects a scaling factor of the system.
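The shape of the measurement boils down to timing a quicksort over generated data and comparing it against a baseline run. The sketch below is only an illustration of that methodology, not the actual benchmark harness: the real runs capped the host's memory via cgroups (omitted here) and used 500 MB to 2 GB workloads, while the size below is kept small:

```python
# Sketch of the synthetic workload: time a quicksort over generated data
# and compute the completion-time overhead KPI. Illustrative only; the
# real evaluation capped memory with cgroups and used RRAIM-backed hosts.
import random
import time

def quicksort(a):
    if len(a) <= 1:
        return a
    pivot = a[len(a) // 2]
    return (quicksort([x for x in a if x < pivot])
            + [x for x in a if x == pivot]
            + quicksort([x for x in a if x > pivot]))

def timed_run(n_items):
    data = [random.random() for _ in range(n_items)]  # generated workload
    start = time.perf_counter()
    quicksort(data)
    return time.perf_counter() - start

baseline = timed_run(100_000)  # the "sufficient local DRAM" run
remote   = timed_run(100_000)  # in the real test: memory-capped + RRAIM
overhead = (remote - baseline) / baseline  # the KPI: completion-time overhead
```

The scaling factor in the text corresponds to workload size divided by the cgroup memory cap, e.g. a 2 GB sort capped at 400 MB of local memory gives the ×5 scenario.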

Note: in the figures the redundant memory cluster is called RRAIM (Redundant Array of Inexpensive RAM).

The results show completion times for the quicksort benchmark with total workloads of 500 MB, 1 GB and 2 GB, respectively. The cgroup memory on the computation host was capped such that the workload size was up to 5× larger than the available memory, reflecting memory aggregation for the process.

The results reflect very low overhead for using HA compared to local DRAM. The table displays the overhead percentage per scaling factor. Quintupling (×5) the available memory for the quicksort application process using RRAIM results in an overhead of less than 5%; using a fault-tolerant schema still results in an overhead of less than 10%.
More importantly, the trend of performance degradation as the scaling factor increases reflects good scalability of the solution: the difference between doubling the available memory and quintupling it is less than 2.5-4%. The singular steep incline in the 512 MB scenario probably represents thrashing, as the physical memory was too small for the working set, rather than a simple overhead increase. This seems confirmed by the fact that when enough memory is available on the computation host to hold the working set, the phenomenon does not occur (1 GB+ scenarios).

Benchmark with Enterprise Application Workload

The quicksort benchmark suffers from two disadvantages: the first is using small memory workloads (up to 2 GB in our evaluations); the second is being a synthetic benchmark, not necessarily reflecting real-world scenarios. We therefore chose to evaluate the performance of RRAIM in transparently scaling out a commercial in-memory database, SAP HANA, using the standard commercial benchmark. Note that this benchmark was not designed to test distributed systems at all, only the performance of the database itself. A low overhead in completion time of the benchmark when extending the available memory using RRAIM reflects a transparent scale-out with minimal disruption to the running database. The average overhead of running HANA with Heca on a 128 GB VM instance, with 600 GB of data loaded and running a SAP-H variant, is 4% in the 1:2 and 1:3 memory-ratio scenarios. When we introduce RRAIM, the overhead stays almost the same, as shown in the following table.