
Wednesday, March 08, 2017

[Links of the Day] 08/03/2017: Intel blockchain, FAST'17 conference and papers, AWS CloudFormation devops tool

After a small hiatus, here is the return of the links of the day.
  • Sawtooth Lake: Intel's distributed ledger system. It uses an interesting security mechanism to deliver secure consensus. Sadly, it relies on Intel's proprietary hardware encryption modules to deliver this feature.
  • FAST'17: the USENIX File and Storage Technologies conference took place last month. There were a couple of interesting papers, but one piqued my interest: Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions. The authors look at the impact of single file-system faults on Redis, ZooKeeper, Cassandra, Kafka, RethinkDB, MongoDB, LogCabin, and CockroachDB. It turns out most systems are not able to handle these types of faults very well: a single node's persistence-layer error can have an adverse ripple effect, as distributed systems seem to put way too much trust in the reliability of that layer. Sadly, they lack tools for recovering from errors or corruption emerging from the file system.
  • Stacker: Remind101's tool for creating and updating AWS CloudFormation stacks. Looks like an interesting alternative to Terraform.

Tuesday, March 15, 2016

[Links of the day] 15/03/2016: AWS cross-region fault tolerance, words of wisdom from AWS's CTO, #NetflixEverywhere Global Architecture

  • Build Fault Tolerant Cross-Region AWS VPC: how Rackspace deploys fault-tolerant solutions on top of AWS across multiple regions using VPC.
  • 10 Lessons from 10 Years of AWS: words of wisdom from Werner Vogels, CTO of AWS.
  • #NetflixEverywhere Global Architecture: QCon presentation by Netflix director of operations Josh Evans. The interesting bit is the focus on data replication across data centers (or availability zones in this case). It seems pretty obvious that Netflix went the right way in scaling the resiliency of their product: start with the primitives, then the data, not the other way around. If the data is not available or consistent, there is always a chance to fall back, at a cost; whereas if the services are down, having the data available won't help.

Tuesday, March 01, 2016

[Links of the day] 01/03/2016: DSSD, Datacenter design [book] and latent faults [paper]

  • DSSD: EMC released its DSSD product (acquired last year) into the wild. Quite a beast: all flash, 10M IOPS, 100μs latency, 100GB/s bandwidth, 144TB in 5U. It uses a PCIe fabric to connect the storage to the compute nodes; however, I expect them to move soon to an InfiniBand/Omni-Path fabric based on the talk they recently gave.
  • Datacenter Design and Management: book that surveys datacenter research from a computer architect's perspective, addressing challenges in applications, design, management, server simulation, and system simulation.
  • Unsupervised Latent Faults Detection in Data Centers: talk and paper that look at automatically enabling early detection and handling of performance problems, or latent faults. These faults "fly under the radar" of existing detection systems because they are not acute enough, or were not anticipated by maintenance engineers.
Rolex Deep Sea Sea Dweller (DSSD)


Thursday, September 10, 2015

Links of the day 10/09/2015: Scale + fault tolerance, and the web is dead, long live the web

  • IPFS: a lot like the WebTorrent project and others; anyway, interesting to check out, even if some inherent characteristics make it impractical for the dynamic web. Think of Reddit: this would generate so many diffs of the same page... Their manifesto can be found here.
  • Scalable High Performance Systems: nice presentation addressing the interesting new challenges that emerge in the operation of the datacentres that form the infrastructure of cloud services, and in supporting the dynamic workloads of demanding users.
  • FTS workshop: funny that it's the first workshop on fault tolerance, while HPC systems have a long history of addressing these issues.


Tuesday, December 09, 2014

Links of the day 9/12/2014

Today's links 9/12/2014: #quantum computing, #LLVM conf, #HPC fault tolerance


Wednesday, February 19, 2014

On allowing shorter timeouts on Mellanox cards, and other tips and tricks


Modifying the minimum timeout detection on Mellanox cards:

If you want to leverage the connection timeout detection of Mellanox cards to set up or design a fault-tolerant system, you very quickly realize that the tools at your disposal for detecting a crash use a resolution an order of magnitude higher than the actual latency you are aiming for. This has a significant effect on the overall cluster management, fault detection, and fault recovery system you can design. Luckily, there is a workaround for the problem.

First, the issue:

Mellanox ConnectX-2 NICs enforce a lower limit on timeouts (specifically, on the IBV_QP_TIMEOUT option). For these cards the minimum timeout value is about 500 ms; combined with the default setting of 7 retries, this means that after a timeout (e.g., a crashed server) the transmit buffer is held by the NIC for about 4 seconds before it is returned with an error.
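
For reference, here is a minimal sketch of where this knob lives in the verbs API (the helper name and surrounding setup are mine; it assumes an already-created RC queue pair and omits error handling and the rest of the connection establishment). The timeout is passed as an exponent during the transition to RTS, the wire-level local ACK timeout being roughly 4.096 µs × 2^timeout, which the ConnectX-2 firmware then clamps to its 500 ms floor:

#include <infiniband/verbs.h>
#include <stdint.h>

/* Sketch: setting the retransmission timeout and retry count when moving
 * an RC queue pair to RTS. `qp` and `my_psn` come from the usual connection
 * setup, which is not shown here. */
static int move_to_rts(struct ibv_qp *qp, uint32_t my_psn)
{
    struct ibv_qp_attr attr = {
        .qp_state      = IBV_QPS_RTS,
        .timeout       = 14, /* local ACK timeout ~ 4.096 us * 2^14 ~ 67 ms,
                                but ConnectX-2 firmware clamps it to >= ~500 ms */
        .retry_cnt     = 7,  /* transport retries before the WR completes in error */
        .rnr_retry     = 7,
        .sq_psn        = my_psn, /* initial send PSN advertised to the peer */
        .max_rd_atomic = 1,
    };

    /* Worst-case failure detection is roughly timeout * (retry_cnt + 1):
     * with the 500 ms floor and 7 retries, that is the ~4 s described above. */
    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}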

The consequence: 

The nodes that maintained a connection with the faulty server can run out of transmit buffers, which either leads to errors or leaves the whole cluster hanging for a couple of seconds... Not really nice.

The solution:

To fix the problem, you need to modify the firmware on the NICs as follows:
  • Get the appropriate version of the firmware from Mellanox to start with.
  • This file needs to be combined with an appropriate .ini file.  First, fetch the existing .ini file from the NIC:
        flint -d /dev/mst/mtXXXXXX_pci_cr0 dc > MT_XXXXXX.ini
    Check /dev/mst to verify the file name there. In this case the .ini file is named after the board_id printed by ibv_devinfo.
  • Edit the .ini file to add a new qp_minimal_timeout_val parameter with a value of zero. It goes in the HCA section, like this:
[HCA]
hca_header_device_id = 0x673c
hca_header_subsystem_id = 0x0018
dpdp_en = true
eth_xfi_en = true
mdio_en_port1 = 0
qp_minimal_timeout_val = 0

  • Generate a new image from the .mlx file and the .ini file:     mlxburn -fw fw-ConnectX2-rel.mlx -conf MT_XXXXXX.ini -wrimage MT_XXXXX.bin
  • Upload the image into the NIC:     flint -d /dev/mst/mtXXXXX_pci_cr0 -i MT_0DD0120009.bin -y  b
  • Beware: if different NICs have different board ids, they will need different .ini files, and they may need different .mlx files (ask Mellanox for help).
Now you can set timeouts as low as you want! But beware of the consequences of unnecessary retry operations: you really need to profile your setup in order to find the best timeout resolution.
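
To make that profiling concrete, here is a small back-of-the-envelope helper (the function is mine; it assumes the standard InfiniBand encoding of the local ACK timeout as 4.096 µs × 2^exponent) for estimating the worst-case detection delay for a given timeout exponent and retry count:

#include <stdio.h>

/* Rough worst-case failure detection delay: each of the (retry_cnt + 1)
 * transmission attempts waits for the local ACK timeout before giving up. */
static double detection_delay_ms(unsigned timeout_exp, unsigned retry_cnt)
{
    double timeout_ms = 0.004096 * (1u << timeout_exp); /* 4.096 us * 2^exp */
    return timeout_ms * (retry_cnt + 1);
}

int main(void)
{
    /* exponent 17 -> ~537 ms per attempt -> ~4.3 s with 7 retries,
       exponent 10 -> ~4.2 ms per attempt -> ~34 ms with 7 retries  */
    printf("%.1f ms\n", detection_delay_ms(17, 7));
    printf("%.1f ms\n", detection_delay_ms(10, 7));
    return 0;
}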


Bonus: Testing your IB setup speeds:

Checking Link Speeds

Run iblinkinfo as root. This will show link speeds of all ports in the network (both on switches and HCAs).
iblinkinfo |grep \"rc
          50   15[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      51    1[  ] "rc41 HCA-1" ( )
 ...
          41   35[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      45    1[  ] "rcnfs HCA-1" ( )
          41   36[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       1    1[  ] "rcmaster HCA-1" ( )
          48   15[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      26    1[  ] "rc02 HCA-1" ( )
          48   16[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       6    1[  ] "rc22 HCA-1" ( )
          48   17[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      23    1[  ] "rc30 HCA-1" ( )
          48   18[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      30    1[  ] "rc10 HCA-1" ( )
          48   19[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      16    1[  ] "rc28 HCA-1" ( )
          48   20[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      40    1[  ] "rc06 HCA-1" ( )
          48   21[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      37    1[  ] "rc18 HCA-1" ( )
          48   22[  ] ==( 4X  2.5 Gbps Active/  LinkUp)==>      36    1[  ] "rc14 HCA-1" ( Could be 10.0 Gbps)
          48   23[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      29    1[  ] "rc16 HCA-1" ( )
...

Measuring Bandwidth

ib_send_bw will measure bandwidth between two hosts using the send/recv verbs. An example follows below.
Src host:
ib_send_bw rc16ib
------------------------------------------------------------------
                    Send BW Test
 Number of qps   : 1
 Connection type : RC
 TX depth        : 300
 CQ Moderation   : 50
 Link type       : IB
 Mtu             : 2048
 Inline data is used up to 0 bytes message
 local address: LID 0x24 QPN 0x80049 PSN 0xe895fd
 remote address: LID 0x1d QPN 0x200049 PSN 0xb960d2
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536     1000           939.34             939.34
------------------------------------------------------------------
Dst host:
ib_send_bw
------------------------------------------------------------------
                    Send BW Test
 Number of qps   : 1
 Connection type : RC
 RX depth        : 600
 CQ Moderation   : 50
 Link type       : IB
 Mtu             : 2048
 Inline data is used up to 0 bytes message
 local address: LID 0x1d QPN 0x200049 PSN 0xb960d2
 remote address: LID 0x24 QPN 0x80049 PSN 0xe895fd
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536     1000           -nan               940.85
------------------------------------------------------------------

Measuring Latency

Use ib_send_lat or ibv_ud_pingpong as above. Note that the two apps may have different defaults for packet sizes, inlining, etc.
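For completeness, a latency run mirrors the bandwidth example above (the host name rc16ib is just reused from that example; the exact output columns depend on your perftest version):
Src host:
ib_send_lat rc16ib
Dst host:
ib_send_lat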