
Tuesday, June 09, 2020

[Links of the Day] 09/06/2020 : #WASM on #K8S , Fast Anomaly Detection on Graphs, Linux one ring

  • Krustlet : it seems that WebAssembly is getting more pervasive. We have kernel WASM and WASM for deep learning, and now Krustlet offers WASM on Kubernetes via a kubelet implementation.
  • Fast Anomaly Detection in Graphs : really cool real-time anomaly detection on dynamic graphs. The authors claim to be 644 times faster than the state of the art with 42-48% higher accuracy. What is even more attractive is the constant memory usage, which is fantastic for production deployment. [github]
  • io_uring : this looks set to dominate the future of the Linux I/O interface. It is steadily absorbing every single I/O interface and probably won't stop there.


Wednesday, May 09, 2018

A look at Google gVisor OCI runtime

Google released a new OCI container runtime: gVisor. This runtime aims at solving part of the security concerns associated with the current container technology stack. One of the big arguments of the virtualisation crowd has always been that the lack of explicit partitioning and protection of resources can facilitate "leakage" from the containers to the host or to adjacent containers.

This stems from the historical evolution of containers in Linux. Linux has no native concept of containers, unlike BSD with jails or Solaris with zones. Containers in Linux are the result of the gradual accumulation of various security and isolation technologies introduced in the kernel. As a result, Linux ended up with a very broad technology stack whose pieces can be turned on/off or tuned separately. However, there is no single purpose-built sandbox: the classic jack-of-all-trades curse, master of none.
The Docker runtime (containerd) packages the whole Linux kernel isolation and security stack (namespaces, cgroups, capabilities, seccomp, AppArmor, SELinux) into a neat solution that is easy to deploy and use.
It allows the user to restrict what an application can do, such as which files it can access and with which permissions, and to limit resource consumption such as network, disk I/O or CPU. Applications can happily share resources without stepping on each other's toes, and the risk of their data being accessed by neighbours on the same machine is reduced.
With a correct configuration (the default one is quite reasonable), anything that is not authorised is blocked, which in principle protects against leaks from malicious or badly written code running in the container.
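To make the "stack of kernel primitives" point concrete, here is a tiny standalone illustration (not Docker code) of one of those primitives: a UTS namespace created with unshare(2), which gives the process its own hostname while everything else stays shared with the host.

```c
/* Minimal namespace demo: give this process its own hostname via a UTS
   namespace. Needs CAP_SYS_ADMIN (run as root) or an unprivileged user
   namespace; error handling kept to a minimum. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/utsname.h>
#include <unistd.h>

int main(void) {
    if (unshare(CLONE_NEWUTS) == -1) {       /* new UTS namespace for us only */
        perror("unshare");
        return 1;
    }
    sethostname("container", 9);             /* visible only inside the namespace */

    struct utsname u;
    uname(&u);
    printf("hostname inside the new UTS namespace: %s\n", u.nodename);
    return 0;                                /* the host's hostname is untouched */
}
```

Docker essentially composes a handful of such primitives (PID, mount, network and user namespaces, cgroups, seccomp filters) around a process.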







However, you have to understand that Docker already has some real limitations. It has only limited support for user namespaces. A user namespace allows an application to have UID 0 (root) permissions within its container while the container and the user running it have a lower privilege level on the host. With it, each container can run under a different host user ID without stepping on the others' toes.

All of these features rely on the reliability and security (as in: no bugs) of the Linux kernel. Most of Docker's advanced features depend on kernel features, and getting new ones is a multi-year effort; it took a while, for example, for good resource isolation mechanisms to percolate from the first RFC to the stable branch. As a result, Docker and the current container ecosystem are directly dependent on the Linux kernel's update inertia as well as its code quality. While excellent, no system is entirely free of bugs, not to mention the eternal race to patch them once they are discovered.

Hence the idea: rather than sharing one kernel between all the containers, give each container its own kernel, explicitly limiting potential leakage and interference and reducing the attack surface. gVisor adopts this approach, which is not new: Kata Containers already implemented something similar. Kata Containers is the result of the fusion of Clear Containers (Intel) and runV (Hyper), and uses KVM to run a minimalistic kernel dedicated to the container runtime. You still need to manage the host machine to ensure fair resource sharing and to secure it. This additional layer of indirection limits the attack surface: even if a kernel bug is discovered, you will be hard pressed to exploit it to escape into an adjacent container or the underlying host, since the kernel is not shared.






gVisor can use KVM as its kernel; however, it was initially, and is still primarily, designed around ptrace. User Mode Linux already used the same technique: start a userspace process that acts as the kernel for the subsystem running on top of it, similar to the hypervisor model used by virtual machines. All system calls are intercepted and executed, with the permissions of that userspace process, on behalf of the subsystem.








Now, how do you intercept system calls that would normally be executed by the kernel? UML and gVisor divert ptrace from its primary purpose (debugging) and use it to trap and stop the tracee at every system call. Once a call is caught, the userspace kernel executes it on behalf of the original process. It works well but, as you guessed, there is no free lunch. This method was heavily used by the first virtualisation solutions, and processor vendors quickly realised that offering hardware-specific acceleration would be highly beneficial (and sell more chips at the same time).
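For flavour, here is a minimal sketch of ptrace-based syscall interception. It only shows the observation half of the trick: gVisor's real ptrace platform is far more involved and uses PTRACE_SYSEMU-style stops so the host kernel never actually runs the sandboxed call.

```c
/* Trace a child and report every system call it makes (x86_64 Linux).
   Usage: ./tracer /bin/ls
   Note: each call produces an entry and an exit stop, so numbers appear twice. */
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 2) { fprintf(stderr, "usage: %s <cmd> [args]\n", argv[0]); return 1; }

    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* let the parent trace us */
        execvp(argv[1], &argv[1]);
        return 127;
    }

    int status;
    waitpid(child, &status, 0);                  /* child stopped at the execvp */
    while (!WIFEXITED(status)) {
        /* Resume until the next syscall stop, then inspect the registers.
           A userspace kernel would emulate the call here instead of
           letting the host kernel execute it. */
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);
        waitpid(child, &status, 0);
        if (WIFSTOPPED(status)) {
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            fprintf(stderr, "syscall %llu\n", (unsigned long long)regs.orig_rax);
        }
    }
    return 0;
}
```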

KVM and other hypervisors leverage these accelerators. AWS and Azure now even deploy dedicated coprocessors to handle virtualisation-related acceleration, allowing VMs to run at almost the same speed as a bare-metal system.

And like QEMU leveraging KVM, gVisor also offers KVM as an underlying runtime environment. However, significant work remains before any container can run on top of it. While ptrace directly leverages the existing Linux stack, with KVM you need to reimplement a good chunk of the system to make it work; have a look at the QEMU code to get a sense of the complexity of the task. This is the reason behind the limited set of supported applications: not all syscalls are implemented yet.

As is, gVisor is probably not yet ready for production. However, it looks like a promising solution, providing a middle ground between the Docker approach and the virtualisation one while borrowing some of the excellent ideas coming from the unikernel world. I hope this technology gets picked up and that the KVM runtime becomes gVisor's default: it would let the system benefit from rock-solid hardware acceleration along with paravirtualisation goodies such as virtio.


Thursday, August 03, 2017

[Links of the Day] 03/08/2017 : NVMe over TCP , Lineage mapping of cryptocurrency , Perceptions of probability


  • NVMe Over TCP : interesting kernel module by Solarflare that allows NVMe to be used over TCP. It will be really interesting to see what kind of performance you can get out of such a setup. Even if performance is significantly lower (while still higher than other storage solutions), the economic gain versus a costly dedicated NVMe fabric would make it worth it. It could also accelerate the arrival of a new class of high-performance, low-cost storage applications by lowering the barrier to entry.
  • Map of Coins : impressive lineage mapping of cryptocurrencies. What is more concerning is the number of dead Bitcoin descendants; there is a lot of pump-and-dump scheming going on.
  • Perceptions : this really cool graphic shows how humans perceive probability and how fuzzy that perception can be. It may explain why some people take more or less risk given the same information, simply because they interpret it differently.




Friday, April 14, 2017

[Links of the Day] 14/04/2017 : OpenFabric Workshop , Docker's Containerd , Category Theory

  • OpenFabrics Workshop 2017 : some interesting talks this year at the OpenFabrics workshop:
    • uRDMA : userspace RDMA using DPDK. This opens up a number of possibilities, especially for object storage solutions. [Video, Slides, github]
    • Crail : uses uRDMA (above) to deliver an accelerated storage solution for Apache big data projects [Slides, github]
    • Remote Persistent Memory : I think this is the next killer app for RDMA, provided Intel doesn't jump on it first and deliver a DPDK-like solution. [Video, Slides]
    • On Demand Paging : the tech is slowly crawling its way toward upstream acceptance. While on-demand paging introduces a certain performance cost, it also allows greater flexibility in consuming RDMA. One interesting aspect that nobody has mentioned yet is how this feature could be combined with persistent memory; I think there is good potential for peer-to-peer NVM storage solutions. [Video, Slides]
  • Containerd : Containerd moves to GitHub; Docker's "industry standard" container runtime is also reaching its v0.2.x release. [github]
  • Category Theory : if you are into functional programming and Haskell, this is a must-read book for you.

Monday, July 11, 2016

[Links of the day] 11/07/2016: SSD failures, BCC , NUMA deep dives


  • SSD Failures in Datacenters : best student paper: SSDs fail, but what, when, and why?
  • BCC : I have been trying to trace a nasty RCU stall bug (which turned out to be just the symptom of another problem) and BCC was really useful in this ordeal. It is quickly turning into the Linux Swiss Army knife of debugging. BPF is an amazing piece of software.
  • NUMA Deep Dive Series : start of a series of posts looking into the history of NUMA and modern NUMA architectures.

Thursday, June 16, 2016

[Links of the day] 16/06/2016 : Betancourt's Bayesian lectures, Physics breakthroughs & Distributed systems, Linux Kernel Radix tree

  • Betancourt Binge : Michael Betancourt's video lectures in Tokyo, all about Bayesian modelling.
  • Standing on Distributed Shoulders of Giants : as usual, an excellent ACM Queue article drawing parallels between physics breakthroughs and the world of distributed systems.
  • Multi-order radix tree : the Linux kernel radix tree is a data structure at the centre of the memory management system. With the advent of new memory models (persistent ones), this data structure needs to evolve. However, as with everything touching memory, it will take some time and many, many re-submissions to the mailing list. This article presents a possible evolution path for this venerable data structure.


Monday, April 18, 2016

[Links of the day] 18/04/2016 : Persistent Memory file system and kernel, CS education


  • pmFS : a persistent memory file system. It is no longer in development, but for anybody curious about persistent and storage-class memory storage it is a good place to start.
  • Kernel persistent memory : instructions for working with persistent memory code in Linux
  • CS education : a compiled resource of computer science classes by Google

Monday, February 01, 2016

[Links of the day] 01/02/2016: Linux internals, best paper and Spotify goes SDN

  • Best Papers : Best Paper Awards in Computer Science (since 1996)
  • Linux Internals : a very good ebook on the internals of the Linux kernel, from boot to memory.
  • SDN Internet Router [part 2] : a very impressive demonstration of what SDN technology implies for companies. It allowed Spotify to replace routers that would have cost $0.5M each with a couple of SDN switches.


Wednesday, November 04, 2015

Links of the day 04/11/2015: Intel ISA-L , Linux Kernel userland page fault handling, Evolution of CI at stratoscale

  • ISA-L : brief introduction to the Intel Intelligent Storage Acceleration Library (ISA-L). Some nice features for erasure coding in there [intel 01 website]
  • Evolution of CI at Stratoscale : How the development team develops, tests, deploys and operates it. How do we get tens of developers to work productively at a high velocity while maintaining system cohesion and quality? How can we tame the inherent complexity of such a distributed system? How do we continuously integrate, test, deploy and operate it? How do we take devops to the limit? And just how tasty is our own dog food?
  • Userfaultfd : nice to see the code for userland page-fault resolution making it into an upstream release. A minimal usage sketch follows below.
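A minimal sketch of the API, assuming a kernel with userfaultfd enabled (recent kernels may additionally require CAP_SYS_PTRACE or vm.unprivileged_userfaultfd=1); compile with -pthread, error handling mostly omitted.

```c
/* Resolve a missing-page fault from userspace with userfaultfd. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static long page_size;

static void *fault_handler(void *arg) {
    int uffd = *(int *)arg;
    /* Backing page whose contents get copied in on demand. */
    char *src = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(src, 'H', page_size);

    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    poll(&pfd, 1, -1);                         /* wait for one fault event */

    struct uffd_msg msg;
    read(uffd, &msg, sizeof(msg));

    struct uffdio_copy copy = {
        .src = (unsigned long)src,
        .dst = msg.arg.pagefault.address & ~(page_size - 1),
        .len = page_size,
    };
    ioctl(uffd, UFFDIO_COPY, &copy);           /* resolve the fault */
    return NULL;
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);

    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    char *region = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)region, .len = page_size },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);        /* faults now go to uffd */

    pthread_t t;
    pthread_create(&t, NULL, fault_handler, &uffd);

    printf("first byte: %c\n", region[0]);     /* triggers the fault */
    pthread_join(t, NULL);
    return 0;
}
```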


Monday, September 14, 2015

Links of the day 14/09/2015 : devops cert, Byte-Addressable NVM, Kernel Bypass

  • WrAP : paper on Managing Byte-Addressable Persistent Memory
  • Kernel bypass : as network and storage get faster, generic solutions start to be seen as a limitation of the current software stack. As a result we are seeing more and more bypass libraries. However, the end result always depends on how efficient the consuming software is.
  • Devops League : There are plenty of DevOps certifications out there of varying quality. This one is the best. It is wonderful and you'll love it, too. You'll love it so much that you'll print out your certification and even put it on your résumé. You'll tell all your friends about it and even ask your loved ones to mention it at your funeral. RIP, by the way.

Friday, August 28, 2015

Links of the day 28/08/2015 : libfabric, IO visor and demoscene

  • IO Visor : it seems that the efforts from PLUMgrid around eBPF are picking up speed and an official Linux Foundation project has been set up. However, one must wonder how such a solution will compete against the pure userspace solutions relying on DPDK and consorts. You can find a more in-depth slide deck on the concept [here].
  • Libfabric : Intel is announcing with great pomp the "open source" library supporting its Omni-Path fabric. It is none other than the fantastic libfabric. This library offers a set of next-generation, community-driven, ultra-low-latency networking APIs. The APIs are not tied to any particular networking hardware model: it supports InfiniBand / iWARP, usNIC from Cisco and OPA from Intel. What is interesting is that it goes one step further than the RDMA verbs library while maintaining a good balance between low-level tuning and high-level programmability. While the learning curve might be a little steeper compared to Accelio from Mellanox, it delivers (I think) greater advantages and flexibility. A tiny provider-discovery sketch follows this list.
  • Winning 1kb intro : released at Assembly 2015, prepare to be amazed
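To illustrate the provider-agnostic angle of libfabric, a minimal sketch (assuming libfabric is installed; link with -lfabric) that simply enumerates which providers and fabrics are usable on the machine:

```c
#include <stdio.h>
#include <rdma/fabric.h>

int main(void) {
    struct fi_info *info = NULL;

    /* NULL hints: ask for every provider/fabric combination available
       (verbs, usnic, psm, sockets, ...). */
    int ret = fi_getinfo(FI_VERSION(1, 0), NULL, NULL, 0, NULL, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        return 1;
    }

    for (struct fi_info *cur = info; cur; cur = cur->next)
        printf("provider: %-10s fabric: %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name);

    fi_freeinfo(info);
    return 0;
}
```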


Monday, February 16, 2015

Links of the day 16 - 02 - 2015

Today's links 16/02/2015: Memory Analysis, CPU instructions for NVM, SR-IOV and Linux kernel live patching
  • ANATOMY : an analytic model of memory system performance able to summarize key workload characteristics, namely row buffer hit rate, bank-level parallelism, and request spread, which are used as inputs to a queuing model to estimate memory performance. [slides]
  • CLWB and PCOMMIT : a look at the new NVM-specific CPU instructions. The real benefit will start to appear when developers start using them in applications such as in-memory DBs or persistence logging; see the sketch after this list.
  • SR-IOV : the Single-Root I/O Virtualization (SR-IOV) standard allows an I/O device to be shared by multiple virtual machines (VMs) without losing runtime performance. A series of videos covering topics for your virtualization environment such as VXLAN Tunnel End Point (VTEP), live VM migration, and HPC clustering.
  • Live patching : kGraft and kpatch merged into a single patchset for kernel live patching.
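As a hedged illustration of the persistence-logging pattern hinted at above (PCOMMIT was later dropped from the ISA, so only CLWB plus a fence is shown): this needs a CLWB-capable CPU and a compiler flag such as -mclwb, and pmem_log is just a DRAM stand-in for a real DAX-mapped NVM region.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

static char pmem_log[4096] __attribute__((aligned(64)));   /* stand-in for NVM */

/* Write back every cache line covering [addr, addr+len) and fence, so the
   log record is durable before we declare the transaction committed. */
static void persist(const void *addr, size_t len) {
    const char *p = (const char *)((uintptr_t)addr & ~(uintptr_t)63);
    for (; p < (const char *)addr + len; p += 64)
        _mm_clwb((void *)p);      /* flush toward memory, keep the line cached */
    _mm_sfence();                 /* order the flushes before later stores */
}

int main(void) {
    const char record[] = "txn=42 commit";
    memcpy(pmem_log, record, sizeof(record));
    persist(pmem_log, sizeof(record));
    return 0;
}
```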




Friday, October 31, 2014

Links of the day 31 - 10 - 2014

Today's links 31/10/2014: TCP, kernel, NVMe, network fabric


Thursday, October 30, 2014

Links of the day 30 - 10 - 2014

Today's links 30/10/2014 : userfault, transaction, cloud frontend , virtkick

  • Phaser : Phase Reconciliation for Contended In-Memory Transactions, by Neha Narula at MIT [slides]
  • Scaling Address-Space Operations on Linux with TSX : thesis by Christopher Ryan Johnson on transactional memory and how address-space operations can be scaled on multicore systems.
  • VirtKick : a simple orchestrator. Manage virtual machines or #docker [github]
  • Userfault : Andrea Arcangeli released the first RFC for page fault resolution in userspace. The interesting bit is the possibility of treating write and read faults differently. I can foresee some promising spin-offs from this project.

Tuesday, October 21, 2014

Links of the day 21 - 10 - 2014

Today's links 21/10/2014: all about #Linux #networking with a little bit of #HPC distributed #storage

  • State of Linux network stack : what's new and interesting in the latest kernel release, especially the low-latency device polling
  • KVM Forum : all the videos of this year's KVM Forum. Some interesting talks, especially on the HPC front, and an interesting quote from Vincent Jardin: "if you want a high-performance networking or NFV solution, don't use virtualization, use containers".
  • RDMA and ARM : Mellanox brings its RoCE adapter to the Moonshot project. It will be interesting to see what type of application would leverage such an architecture combination: a lot of small processors with a fast fabric.
  • IX : a solution that is close to achieving the holy grail of networking: low latency with high throughput (line rate)
  • (Fast Forward) Storage and I/O : Distributed Application Object Storage (DAOS) by Intel for HPC. A lot of flash and burst buffers with Lustre for supercomputers. A very interesting approach to addressing the challenges of future exascale computing platforms.

Friday, October 03, 2014

Links of the day 03 - 10 - 2014


Today's links 03/10/2014: Cloud regulations, OO Linux Kernel , scaling SSL, DBMS Architecture principle
  • Cloud computing/security regulations : by country mashed-up on a Google Map
  • BOOS-MOOL : Minimalistic Object Oriented Linux. It is a redesign of the kernel in which object-oriented abstractions and C++ driver support should increase maintainability while reducing the complexity of the kernel.
  • Scaling Universal : Cloudflare is able to reduce the CPU usage of Universal SSL to almost nothing.
  • Architecture of a Database System : paper presenting an architectural discussion of DBMS design principles, including process models, parallel architecture, storage system design, transaction system implementation, query processor and optimizer architectures, and typical shared components and utilities.
Evolving database landscape 

Tuesday, September 02, 2014

Links of the day 02 - 09 - 2014


Today's links : SSH, Linux Kernel I/O stack, Linux Kernel Locking, Systemd, Data Visualization

Saturday, August 03, 2013

Hecatonchire Version 0.2 Released!

Version 0.2 of Hecatonchire has been released.

What's New:
  1. Write-invalidate coherency model added, for those who want to use Heca natively in their applications as distributed shared memory (more on that in a subsequent post)
  2. Significant improvements in page transfer performance, as well as a number of bugs squashed
  3. Specific optimisations for KVM
  4. Scale-out memory mirroring
  5. Hybrid post-copy live migration
  6. Moved to Linux kernel 3.9 stable
  7. Moved to qemu-kvm 1.4 stable
  8. Added test / proof-of-concept tools (specifically for the new coherency model)
  9. Improved documentation
Voila!

We are now focusing on stabilising the code and improving robustness (we aim to make the code production ready by 0.4). We are also starting significant work to integrate Hecatonchire so it can be transparently leveraged from a cloud stack, more specifically OpenStack.

You can download it here : http://hecatonchire.com/#download.html
You can see the install doc here: https://github.com/hecatonchire/heca-misc/tree/heca-0.2/docs
And finally the changelog there: http://hecatonchire.com/#changelog-0.2.html
Or you can just pull the Master branch on github:  https://github.com/hecatonchire


Stay tuned for more in-depth blog posts on Hecatonchire.

Wednesday, February 06, 2013

Hecatonchire: Memory Scale-out/Scale-up for the Linux Kernel

This post presents a more detailed description of the memory scale-out solution offered by the Hecatonchire project.

I. Motivation:


Currently, the amount of memory available on a physical server constrains the performance and scalability of applications by limiting the amount of RAM available for consumption. The combination of explosive data set growth with the need to deliver near-real-time analytics for competitive advantage requires companies to use large-memory systems. However, these systems tend to be expensive, difficult to scale, and shorter-lived due to exponential application workload growth. Not to mention that they often form islands within virtualised datacentre and cloud environments, which increases their operational cost.
While hardware capabilities improve, the amount of data to be analysed continues to grow, and technological innovation in hardware may not be able to keep up with this growth, or could indeed stall. Rather than frequently buying the newest (and often most expensive) cutting-edge machines the market has to offer, a better answer to the problem of data volume growth can be to enable memory to:
  •      Scale out: data can be distributed and accessed in parallel within a set of relatively inexpensive commodity hardware machines (within a cloud). Additional machines can then be connected to handle even larger amounts of data.
  •      Scale up with Non-Volatile Memory (NVM) as extended memory (non-volatility is not important here): this would enable the extension of current systems at lower cost while maintaining decent performance.
  •      Hybrid: allowing remote NVM to be accessed over the network.

However, in order to guarantee performance we need to leverage either an RDMA fabric or direct device access, allowing zero-copy transfer of pages with low CPU and memory overhead.
The result is a system able to distribute memory across a cluster of machines and/or tap into NVM, providing fast response times for the many types of applications that require amounts of memory exceeding the typical core-to-memory ratio of 1:4 GB. It also allows the introduction of highly specialised hardware within datacentres, and treats memory the same way we currently treat storage.
We will first discuss the state of the art and its limitations, then a possible approach, and finally the performance of a proof-of-concept memory scale-out module. Note that most of the effort is currently focused on the scale-out approach; scale-up using NVM devices is still at the concept stage.

II. State of the art:


When users or administrators want to extend the memory of a system, they typically leverage the swap mechanism, which lets the system use block device space as backing store: they create a swap partition or file, format it with mkswap and activate it with swapon (a minimal programmatic sketch of that last step is shown below). Swap over the network is considered an option for diskless systems.
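For reference, the activation step boils down to a single syscall; this sketch assumes a file named /swapfile that has already been created and formatted with mkswap, and it needs CAP_SYS_ADMIN to succeed.

```c
#include <stdio.h>
#include <sys/swap.h>

int main(void) {
    /* Equivalent of `swapon /swapfile`; the path is only an example. */
    if (swapon("/swapfile", 0) == -1) {
        perror("swapon");
        return 1;
    }
    printf("/swapfile is now active swap space\n");
    return 0;
}
```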
However, the current solution suffers from significant drawbacks:
  •      Speed: the swap process relies on the block I/O layer, which is expensive and does not suit swap-to-RAM scenarios very well (https://lkml.org/lkml/2010/5/20/314). This is further exacerbated when the block layer is accessed over the network, which limits its usage to diskless systems that can tolerate the poor performance. Moreover, the low-latency characteristics of flash devices and RDMA fabrics mean that the standard I/O path adds significant overhead to swapping.
  •      Size: on a 32-bit kernel we are limited to 64 swap spaces with a maximum of 64 GB, while with a 64-bit kernel 128 swap areas are permitted, each of up to 16 TB.
  •      Granularity: swap is system-wide, which prevents the user from restricting its usage to certain processes. The only control relies on the memlock functionality, which requires programmatic modification and awareness of the deployment scenario.
  •      Flexibility: due to the limited number of swap areas we cannot easily stripe memory across remote memory recipients for performance. For example, striping 1 TB in 1 GB stripes across 10 different memory nodes/servers: on 64-bit Linux we could only have 128 stripes, while 1000 would be required.

Alternative solutions to swapping have been proposed, such as the cleancache / frontswap / RAMster combination (with the transcendent memory patch set). RAMster is a transparent memory aggregation solution, integrated into mainline Linux. It allows a host to rely on the physical memory of a single remote host to handle large memory workloads. It supports local compression of data using zcache, and off-line transfer of memory contents, to mitigate networking costs. Despite its advantages, it does not fully fit the case of memory-demanding applications:
it does not yet include fault-resilience features, it uses a peer-to-peer schema, and it disrupts the native operation of Linux by delaying the transfer of pages that have already been unmapped from a process, in a manner undetectable by Linux. These approaches, while valid, force the creation of dedicated backend solutions to offer de-duplication, compression and tiering of memory, while such features already exist within the standard kernel.

Paging / swapping to NVRAM:

As previously mentioned, the low-latency characteristics of flash devices mean that the standard I/O path adds significant overhead to swapping. Several attempts have been made to design a specific NVM-aware swapping backend; however, so far none of them have made their way into the Linux kernel, as they tend to be hyper-specialised and do not translate well across different devices.


III. Proposed approach (memory aggregation) :


    1. Overview:


    The high-level design focused on meeting the requirements for scaling out memory-demanding applications described in the introduction: elastic, resource-selective, transparent, scalable, extensible and high-performance. These goals were to be achieved without disrupting existing functionality on any of the participating hosts, e.g. no disruption to the operating system, hardware devices or applications running in parallel on these hosts.

        a. Scaling out memory over the network:

  
    A cluster is built around a memory demander: a host running a memory-demanding application or VM. Available memory for that process is extended by relying on physical memory designated for that purpose by other hosts participating in the cluster. To extend its available memory, the memory-demanding application or VM registers segments of its virtual address space, which are sponsored by one or more memory sponsors.
Memory sponsors are other hosts participating in the cluster. They malloc() sufficient memory in a local process to serve as swap space for the memory segments they sponsor; beyond that they are free to use their resources for unrelated purposes, so only essential resources are selectively used by the cluster. When the memory demander host runs out of physical memory, it sends the memory contents of sponsored segments to be stored at their respective memory sponsors. Memory segments can be registered as sponsored at any stage in the life cycle of the memory-demanding process.
When it is initially set up, it is possible to register large sponsored segments, pre-creating a large cluster. Alternatively, additional memory segments can be registered as sponsored over time, gradually extending the available physical memory for the process. Sponsored segments can also be unregistered over time, shrinking the available memory. The cluster is therefore elastic, extendable or shrinkable over time. A purely hypothetical registration sketch is given below.
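To make the demander/sponsor split more tangible, here is a purely hypothetical sketch of the demander-side registration. Everything in it (/dev/heca, struct heca_register, HECA_IOC_REGISTER) is invented for illustration and is not the actual Hecatonchire interface, which is exposed through its kernel module and dedicated system calls.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical ABI, for illustration only. */
struct heca_register {
    uint64_t addr;              /* start of the sponsored virtual segment */
    uint64_t len;               /* segment length in bytes */
    uint8_t  sponsor_id[16];    /* 128-bit ID of the sponsoring host */
};
#define HECA_IOC_REGISTER _IOW('h', 1, struct heca_register)

int main(void) {
    size_t len = 1UL << 30;                 /* a 1 GB segment of our address space */
    void *seg = malloc(len);
    if (!seg) return 1;

    int fd = open("/dev/heca", O_RDWR);     /* hypothetical control device */
    if (fd < 0) { perror("open /dev/heca"); return 1; }

    /* Ask the module to serve faults/write-outs in [addr, addr+len) from the
       sponsor host instead of local swap. */
    struct heca_register req = { .addr = (uintptr_t)seg, .len = len };
    if (ioctl(fd, HECA_IOC_REGISTER, &req) < 0)
        perror("ioctl HECA_IOC_REGISTER");

    close(fd);
    return 0;
}
```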

        b. Scaling up with NVM:


    We can rely on the same approach for NVM devices; however, in this case we cannot use a userspace process to act as the memory host. As a result we propose an alternative: a dedicated module managing the mapping of memory segments to the NVM device. This module can register and expose segments to other processes via simple sysfs calls.


IV. Kernel Implementation:


For transparent integration with existing applications we implemented the proposed design as a Linux kernel module, with four small changes to the kernel itself: two hooks in the write-out sequence and two in the page fault sequence. Our goal was to create an RDMA-based alternative to existing disk-based swap storage, while keeping existing swap functionality intact.
We sought to integrate with the Memory Management Unit (MMU) in a way similar to the way disk-based swap integrates with it, and to carry out the RDMA processes in a manner as close as possible to the processes they replace. We rely on existing data structures, such as page table entries and swap entries.
The proposed approach reserves a single swap area for the purpose of memory aggregation. Its swap entries are used to identify (flag) pages that are either remote or on NVM devices. When a memory segment is registered as sponsored, we set its page table entries with this specific flag. The swap entries contain information identifying the host that sponsors the memory segment. Note: on a 64-bit system the swap entry size allows us to scale up to 2^32 different memory segments. Moreover, each memory sponsor is identified by a single 128-bit ID, and each cluster or group is also identified by a 128-bit ID. This allows the system to scale within large cloud/datacentre environments.
This approach has the advantage of being non-intrusive and leveraging existing kernel code; however, it requires pre-populating all the PTEs associated with the memory segment. We could instead create a special VMA, but in order to be as unintrusive as possible we decided against it; we might reconsider this option based on feedback.
When a fault occurs at an address, we recognise its flagged swap entry and request the data from the sponsoring host to which the entry points. If a write-out later occurs (for example under memory pressure from a cgroup constraint) and the page contents are transferred out, the page table entry is reset to point back to the sponsor host. Moreover, we distinguish between write and read faults, as we allow different coherency models to be used for distributed-shared-memory applications of the technology (shared nothing, write invalidate).
The actual page transfer can be done either via DMA by the dedicated NVM module for the device or, in the scale-out scenario, via a specifically designed RDMA engine which ensures zero-copy, low-latency kernel-to-kernel transfers.

http://hecatonchire.com/images/dsm.png
Hecatonchire Simplified Memory Management Architecture


This manner of integration with the MMU and existing data structures has several advantages. First of all, as mentioned, it is transparent to running applications/VMs and non-disruptive for existing memory management. The only changes needed in applications are the system calls required to register sponsored memory segments. In the common case where applications run on top of a VM, the VM may transparently handle these system calls, leaving the application completely unchanged.
Furthermore, any optimizations applied to page table entries remain beneficial, and future compatibility with such enhancements is preserved. In addition, as we branch out only when writing out a page or resolving a page fault, the underlying processes deciding which page to write out, or which address faults, are unchanged. These are highly optimized processes and should not be affected by the existence of a remote memory cluster. The single exception to this design choice is prefetching.
Our prefetching is a surrogate for the kernel's native one, which lives within the page fault flows that our execution flows replace. When sending a page request to a remote host, we use the time spent waiting on the network to issue further requests, prefetching related pages; these pages are later faulted in. The current prefetch method is rather naïve but provides good performance: we simply pre-fault a limited set of pages surrounding the one being fetched.
This solution also offers transparent support for kernel features such as:
  •      Memory swapping: memory sponsors can swap out the pages they hold, as these are standard anonymous pages
  •      KSM: pages can be de-duplicated; if we encounter a KSM page we simply break it
  •      Transparent huge pages: if a fault is encountered within a THP we break it up in order to have a single granularity for all pages
  •      EPT/NPT: the solution supports shadow page tables in order to be virtualization-friendly
  •      Asynchronous page faults (KVM): the page faulting process has been specifically optimised to support the async page faults implemented by KVM
  •      Cgroups: cgroups are transparently supported and every page is accounted for, eliminating the potential black-hole effect created by RAMster

Advantages:

  •      Transparent integration
  •      No overhead when not compiled in
  •      Virtually no overhead to normal operation
  •      Minimal hooks
  •      Can be extended to provide distributed shared memory features with different coherency protocols
  •      Designed to scale
  •      Zero copy (RDMA or DMA)
  •      Uses anonymous pages
  •      RDMA: low latency


Issue / Unknown:

  •      We do not yet know the best way to handle memlock / mmap
  •      We have not yet planned how to support process fork (this could be supported with special VMA flagging)
  •      PTEs need to be pre-filled (this could be mitigated with a special VMA)
  •      Requires a specific module to manage NVM devices
  •      Requires hardware RDMA support for scale-out (emulated RDMA such as SoftiWARP or Soft-RoCE can be used, but it is not upstream yet)

V. Proof of concept and performance:


We produced an initial proof of concept, focused on the memory scale-out scenario over RDMA. We developed an in-kernel RDMA communication engine and a module for managing the memory. We tested the PoC against various scenarios and RDMA fabrics (SoftiWARP, iWARP, InfiniBand).

Hardware:
  •      CPU : Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
  •      1 socket with 4 cores, Pci-e 2
  •      Memory: 16 GB
Network Fabric:
  •      InfiniBand: QDR ConnectX-2 Mellanox with 40 Gbps switch
  •      iWARP: 10GbE Chelsio T422-CR with Fujitsu 10GbE switch
  •      SoftiWARP: 10GbE Chelsio T422-CR with Fujitsu 10GbE switch
  •      Kernel: Linux stable 3.6 + Hecatonchire Module

Performance for hard fault resolution (from page fault to its resolution):

  •      SoftiWARP 10GbE: ~320 microseconds
  •      iWARP 10GbE (Chelsio T422-CR): ~45 microseconds
  •      InfiniBand QDR 40Gbps (Mellanox ConnectX-2): 25 microseconds

Random Walk over 1GB of remote RAM (4 threads):

  •      Compounded perceived page fault resolution (prefetch enabled):
    • InfiniBand QDR: ~2 microseconds
    • iWARP 10GbE: ~4 microseconds
  •      Page fault resolutions per second:
    • InfiniBand QDR: ~600,000 pages/second
    • iWARP 10GbE: ~300,000 pages/second
Note: some limitations (IOPS, latency, bandwidth) are due to the workstation hardware; we are currently running scaling tests on production-class servers.

Parallel quicksort over a 2 GB dataset with memory constraint (cgroup), InfiniBand only:

  •      Memory Ratio: 3:4 , Overhead: 2.08%
  •      Memory Ratio: 1:2 , Overhead: 2.62%
  •      Memory Ratio: 1:3, Overhead: 3.35%
  •      Memory Ratio: 1:4, Overhead: 4.15%
  •      Memory Ratio: 1:5, Overhead: 4.71%

We also ran various performance tests with KVM and the HANA DB; you can find the results in the following slide deck.




VI. Code:

The proof of concept has been developed within the Hecatonchire project and the source code is available at the following URL: http://www.hecatonchire.com.

VII. Conclusion


In summary, Hecatonchire enables:
  •     Better performance:
    • Each function is isolated, limiting the scope of what each box must do
    • We can leverage dedicated hardware and software, resulting in increased performance
  •     Superior scalability:
    • Functions are isolated from each other. As a result we can alter one function without impacting the others
  •      Improved economics:
    • A cost-effective deployment of resources, with improved provisioning and consolidation of disparate equipment