A blog about life, Engineering, Business, Research, and everything else (especially everything else)
Showing posts with label fault tolerance.
Wednesday, March 08, 2017
[Links of the Day] 08/03/2017: Intel blockchain, FAST'17 conference and papers, AWS CloudFormation DevOps tool
After a small hiatus, here is the return of the links of the day.
- Sawtooth Lake: Intel's distributed ledger system. It uses an interesting security mechanism to deliver secure consensus. Sadly, it relies on Intel's proprietary hardware encryption modules to deliver this feature.
- FAST'17: the USENIX File and Storage Technologies conference happened last month. There were a couple of interesting papers, but one piqued my interest: "Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions". The authors look at the impact of single file-system faults on Redis, ZooKeeper, Cassandra, Kafka, RethinkDB, MongoDB, LogCabin, and CockroachDB. It turns out most systems do not handle these types of faults very well: a single node's persistence-layer error can have an adverse ripple effect, as distributed systems seem to have put way too much trust in the reliability of this layer. Sadly, they lack tools for recovering from errors or corruption emerging from file systems.
- Stacker: Remind101's tool for creating and updating AWS CloudFormation stacks. Looks like an interesting alternative to Terraform.
Labels: automation, aws, blockchain, conference, devops, fault tolerance, intel, links of the day, storage
Tuesday, March 15, 2016
[Links of the day] 15/03/2016: AWS cross-region fault tolerance, words of wisdom from AWS's CTO, #NetflixEverywhere Global Architecture
- Build Fault Tolerant Cross-Region AWS VPC: how Rackspace deploys fault-tolerant solutions on top of AWS across multiple regions using VPCs.
- 10 Lessons from 10 Years of AWS: words of wisdom from Werner Vogels, CTO of AWS.
- #NetflixEverywhere Global Architecture: QCon presentation from Netflix's director of operations Josh Evans. The interesting bit is the focus on data replication across data centers (or availability zones in this case). It seems pretty obvious that Netflix went the right way in scaling the resiliency of their product: start with the primitives, then the data, not the other way around. If the data is not available or consistent, there is always a chance to fall back, at a cost; whereas if the services are down, having the data available won't help.
Labels: architecture, aws, fault tolerance, links of the day, netflix
Tuesday, March 01, 2016
[Links of the day] 01/03/2016: DSSD, datacenter design [book] and latent faults [paper]
- DSSD: EMC released its DSSD product (acquired last year) into the wild. Quite a beast: all flash, 10M IOPS, 100 µs latency, 100 GB/s bandwidth, 144 TB in 5U. It uses a PCIe fabric to connect the storage to the compute nodes; however, I expect them to move soon to an InfiniBand / Omni-Path fabric based on the talks they recently gave.
- Datacenter Design and Management: a book that surveys datacenter research from a computer architect's perspective, addressing challenges in applications, design, management, server simulation, and system simulation.
- Unsupervised Latent Faults Detection in Data Centers: talk and paper that look at automatically enabling early detection and handling of performance problems, or latent faults. These faults "fly under the radar" of existing detection systems because they are not acute enough or were not anticipated by maintenance engineers.
[Image: Rolex Deep Sea Sea Dweller (DSSD)]
Labels: book, datacenter, design, fault tolerance, flash, links of the day, network fabric, nvme, paper, pcie
Thursday, September 10, 2015
Links of the day 10/09/2015: Scale + fault tolerance, and the web is dead, long live the web
- IPFS: a lot like the WebTorrent project and others; anyway, interesting to check out, even if some inherent characteristics make it impractical for the dynamic web. Think of Reddit: this would generate so many diffs of the same page... Their manifesto can be found here.
- Scalable High Performance Systems: nice presentation on addressing the interesting new challenges that emerge in the operation of the datacentres that form the infrastructure of cloud services, and in supporting the dynamic workloads of demanding users.
- FTS workshop: funny that it's the first workshop on fault tolerance, while HPC systems have a long history of addressing these issues.
Labels: distributed, fault tolerance, HPC, links of the day, web
Tuesday, December 09, 2014
Links of the day 9/12/2014
Today's links 9/12/2014: #quantum computing, #LLVM conf, #HPC fault tolerance
- Future Of Quantum Computing: Vern Brownell's (CEO, D-Wave) insights into quantum computing and where it is headed.
- LLVM Developers' Meeting: videos and slides from the 2014 conference.
- Fault-tolerant Techniques for HPC: tutorial on techniques and practical use of fault-tolerance approaches for large-scale HPC systems.
- Failures at Petascale: lessons learned from the analysis of the petascale Blue Waters HPC system.
Labels: fault tolerance, HPC, links of the day, llvm, quantum
Wednesday, February 19, 2014
On allowing shorter timeout on Mellanox cards and other tips and tricks
Modifying the minimum Timeout detection on Mellanox cards:
If you want to leverage the connection-timeout detection of Mellanox cards to design a fault-tolerant system, you very quickly realise that the crash-detection tools at your disposal use a resolution an order of magnitude higher than the latency you are aiming for. This has a significant effect on the overall cluster management, fault detection, and fault recovery system you can design. Luckily, there is a workaround.
First, the issue:
Mellanox ConnectX-2 NICs enforce a lower limit on timeouts (specifically, on the IBV_QP_TIMEOUT option). For these cards the minimum timeout value is 500 ms; combined with the default setting of 7 retries, this means that after a timeout (e.g., a crashed server) the transmit buffer is held by the NIC for about 4 seconds before it is returned with an error.
The consequence:
Nodes that maintained a connection with the faulty server can run out of transmit buffers, which either leads to errors or leaves the whole cluster hanging for a couple of seconds... Not really nice.
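At the verbs level this failure mode is easy to recognise once it finally surfaces: the stalled work request completes with a retry-exceeded error. A minimal sketch of that detection path (the helper name and error handling are mine, not from the original setup):

/*
 * Minimal sketch (hypothetical helper): how the ~4 s stall shows up to the
 * application. With the ConnectX-2 floor of 500 ms per attempt and the
 * default of 7 retries, a send posted to a crashed peer only completes,
 * with an error, after roughly 8 * 500 ms = 4 s.
 */
#include <stdio.h>
#include <infiniband/verbs.h>

/* Poll one completion and report whether the peer looks dead. */
static int check_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n = ibv_poll_cq(cq, 1, &wc);
    if (n <= 0)
        return n;                       /* nothing completed yet, or a poll error */

    if (wc.status == IBV_WC_RETRY_EXC_ERR) {
        /* Transport retries exhausted: the remote QP stopped ACKing,
         * which is what a crashed server looks like at this level. */
        fprintf(stderr, "peer unreachable (wr_id=%llu)\n",
                (unsigned long long)wc.wr_id);
        return -1;
    }
    return 1;                           /* the completion was fine */
}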
The solution:
To fix the problem, you need to modify the firmware in the NICs as follows (a verbs-level sketch of how the shorter timeout can then be requested appears after the list):
- Get the appropriate version of the firmware from Mellanox to start with.
- This file needs to be combined with an appropriate .ini file. First, fetch the existing .ini file from the NIC:
flint -d /dev/mst/mtXXXXXX_pci_cr0 dc > MT_XXXXXX.ini
Check /dev/mst to verify the device file name there. In this case the .ini file is named after the board_id printed by ibv_devinfo.
- Edit the .ini file to add a new qp_minimal_timeout_val parameter with a value of zero. It goes in the HCA section, like this:
[HCA]
hca_header_device_id = 0x673c
hca_header_subsystem_id = 0x0018
dpdp_en = true
eth_xfi_en = true
mdio_en_port1 = 0
qp_minimal_timeout_val = 0
- Generate a new image from the .mlx file and the .ini file:
mlxburn -fw fw-ConnectX2-rel.mlx -conf MT_XXXXXX.ini -wrimage MT_XXXXX.bin
- Upload the image into the NIC:
flint -d /dev/mst/mtXXXXX_pci_cr0 -i MT_0DD0120009.bin -y b
- Beware: if different NICs have different board ids, they will need different .ini files, and they may need different .mlx files (ask Mellanox for help).
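Once the firmware accepts qp_minimal_timeout_val = 0, the shorter timeout still has to be requested per queue pair when it transitions to RTS. A minimal verbs sketch, assuming an already-created RC queue pair (the helper name and the example timeout/retry values are illustrative, not from the original post):

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/*
 * Move an RC queue pair to RTS with an aggressive ACK timeout.
 * 'timeout' is an exponent: the wait per attempt is roughly
 * 4.096 us * 2^timeout, so 10 gives ~4 ms instead of the 500 ms floor
 * enforced by the stock ConnectX-2 firmware.
 */
static int qp_to_rts_short_timeout(struct ibv_qp *qp, uint32_t my_psn)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 10;  /* ~4 ms per attempt (example value) */
    attr.retry_cnt     = 3;   /* fewer retries -> faster crash detection */
    attr.rnr_retry     = 7;   /* keep infinite RNR retries for flow control */
    attr.sq_psn        = my_psn;
    attr.max_rd_atomic = 1;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}

With example values like these, a dead peer is reported after a handful of milliseconds per attempt (on the order of tens of milliseconds in total) instead of roughly 4 seconds.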
Bonus: Testing your IB setup speeds:
Checking Link Speeds
Run iblinkinfo as root. This will show link speeds of all ports in the network (both on switches and HCAs).
iblinkinfo | grep \"rc
50   15[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>      51    1[  ] "rc41 HCA-1" ( )
...
41   35[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>      45    1[  ] "rcnfs HCA-1" ( )
41   36[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>       1    1[  ] "rcmaster HCA-1" ( )
48   15[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>      26    1[  ] "rc02 HCA-1" ( )
48   16[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>       6    1[  ] "rc22 HCA-1" ( )
48   17[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>      23    1[  ] "rc30 HCA-1" ( )
48   18[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>      30    1[  ] "rc10 HCA-1" ( )
48   19[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>      16    1[  ] "rc28 HCA-1" ( )
48   20[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>      40    1[  ] "rc06 HCA-1" ( )
48   21[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>      37    1[  ] "rc18 HCA-1" ( )
48   22[  ] ==( 4X  2.5 Gbps Active/ LinkUp)==>      36    1[  ] "rc14 HCA-1" ( Could be 10.0 Gbps)
48   23[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>      29    1[  ] "rc16 HCA-1" ( )
...
Measuring Bandwidth
ib_send_bw will measure bandwidth between two hosts using the send/recv verbs. An example follows below. Src host:
ib_send_bw rc16ib
------------------------------------------------------------------
                    Send BW Test
 Number of qps      : 1
 Connection type    : RC
 TX depth           : 300
 CQ Moderation      : 50
 Link type          : IB
 Mtu                : 2048
 Inline data is used up to 0 bytes message
  local address:  LID 0x24 QPN 0x80049 PSN 0xe895fd
  remote address: LID 0x1d QPN 0x200049 PSN 0xb960d2
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536      1000           939.34             939.34
------------------------------------------------------------------
Dst host:
ib_send_bw
------------------------------------------------------------------
                    Send BW Test
 Number of qps      : 1
 Connection type    : RC
 RX depth           : 600
 CQ Moderation      : 50
 Link type          : IB
 Mtu                : 2048
 Inline data is used up to 0 bytes message
  local address:  LID 0x1d QPN 0x200049 PSN 0xb960d2
  remote address: LID 0x24 QPN 0x80049 PSN 0xe895fd
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536      1000           -nan               940.85
------------------------------------------------------------------
Measuring Latency
Use ib_send_lat or ibv_ud_pingpong as above. Note that the two apps may have different defaults for packet sizes, inlining, etc.
Labels: fault tolerance, firmware, infiniband, latency, Mellanox, performance, rdma, timeout