Modifying the minimum Timeout detection on Mellanox cards:
If you want to leverage the connection timeout detection of Mellanox cards to design a fault-tolerant system, you quickly realize that the crash-detection mechanisms at your disposal operate at a resolution an order of magnitude coarser than the latency you are aiming for. This has a significant effect on the cluster management, fault detection, and fault recovery system you can design. Luckily, there is a workaround.
First, the issue:
Mellanox ConnectX-2 NICs enforce a lower limit on timeouts (specifically, on the IBV_QP_TIMEOUT attribute). On these cards the minimum timeout value is 500 ms; combined with the default setting of 7 retries, this means that after a timeout (e.g., a crashed server) a transmit buffer is held by the NIC for about 4 seconds before it is returned with an error.
The consequence:
Nodes that still hold a connection to the faulty server can run out of transmit buffers, which either leads to errors or leaves the whole cluster hanging for a couple of seconds... Not really nice.
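To see where these knobs live, here is a minimal libibverbs sketch of the RTR->RTS transition where the timeout and retry count get set (the qp handle and the exact values are just for illustration, this is not code from a specific system):

#include <infiniband/verbs.h>

/* Minimal sketch: set the timeout/retry attributes of an RC queue pair
 * during the RTR->RTS transition. 'qp' is assumed to be an RC QP that
 * has already been moved to RTR. */
static int move_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr = {0};
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14; /* requests 4.096 us * 2^14 ~= 67 ms, but the
                              * stock ConnectX-2 firmware clamps this to 500 ms */
    attr.retry_cnt     = 7;  /* the default: 7 retransmission attempts */
    attr.rnr_retry     = 7;
    attr.sq_psn        = 0;
    attr.max_rd_atomic = 1;

    /* Worst case before a send to a dead peer completes in error:
     * (retry_cnt + 1) attempts, each waiting one ACK timeout, i.e.
     * 8 * 0.5 s with the clamp in effect -- the ~4 second stall above. */
    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}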
The solution:
To fix the problem, you need to modify the firmware in the NICs as follows:
- Get the appropriate version of the firmware (.mlx file) from Mellanox to start with.
- This file needs to be combined with an appropriate .ini file. First, fetch the existing .ini file from the NIC:
flint -d /dev/mst/mtXXXXXX_pci_cr0 dc > MT_XXXXXX.ini
Check /dev/mst to verify the device file name there. In this case the .ini file is named after the board_id printed by ibv_devinfo.
- Edit the .ini file to add a new qp_minimal_timeout_val parameter with a value of zero. It goes in the HCA section, like this:
[HCA]
hca_header_device_id = 0x673c
hca_header_subsystem_id = 0x0018
dpdp_en = true
eth_xfi_en = true
mdio_en_port1 = 0
qp_minimal_timeout_val = 0
- Generate a new image from the .mlx file and the .ini file:
mlxburn -fw fw-ConnectX2-rel.mlx -conf MT_XXXXXX.ini -wrimage MT_XXXXX.bin
- Upload the image into the NIC:
flint -d /dev/mst/mtXXXXX_pci_cr0 -i MT_0DD0120009.bin -y b
- Beware: if different NICs have different board ids, they will need different .ini files, and they may need different .mlx files (ask Mellanox for help).
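Once qp_minimal_timeout_val is zero, the timeout you request through IBV_QP_TIMEOUT is actually honored, so it is worth doing the arithmetic on what a given setting buys you. A little back-of-the-envelope helper (the exponents and retry counts below are illustrative assumptions, not Mellanox recommendations):

#include <math.h>
#include <stdio.h>

/* Worst-case time before a send to a dead peer completes in error:
 * (retry_cnt + 1) transmission attempts, each waiting one local ACK
 * timeout of 4.096 us * 2^timeout_exp, subject to any firmware floor. */
static double worst_case_detection_sec(unsigned timeout_exp,
                                       unsigned retry_cnt,
                                       double firmware_floor_sec)
{
    double t = 4.096e-6 * pow(2.0, timeout_exp);
    if (t < firmware_floor_sec)
        t = firmware_floor_sec;
    return (retry_cnt + 1) * t;
}

int main(void)
{
    /* Stock ConnectX-2 firmware: 0.5 s floor, 7 retries -> ~4 s. */
    printf("stock:   %.3f s\n", worst_case_detection_sec(14, 7, 0.5));
    /* Patched firmware (qp_minimal_timeout_val = 0), timeout_exp = 10,
     * 1 retry -> ~8 ms. */
    printf("patched: %.3f s\n", worst_case_detection_sec(10, 1, 0.0));
    return 0;
}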
Bonus: Testing your IB setup speeds:
Checking Link Speeds
Run iblinkinfo as root. This will show link speeds of all ports in the network (both on switches and HCAs).
iblinkinfo | grep \"rc
50   15[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  51   1[ ] "rc41 HCA-1" ( )
...
41   35[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  45   1[ ] "rcnfs HCA-1" ( )
41   36[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>   1   1[ ] "rcmaster HCA-1" ( )
48   15[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  26   1[ ] "rc02 HCA-1" ( )
48   16[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>   6   1[ ] "rc22 HCA-1" ( )
48   17[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  23   1[ ] "rc30 HCA-1" ( )
48   18[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  30   1[ ] "rc10 HCA-1" ( )
48   19[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  16   1[ ] "rc28 HCA-1" ( )
48   20[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  40   1[ ] "rc06 HCA-1" ( )
48   21[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  37   1[ ] "rc18 HCA-1" ( )
48   22[ ] ==( 4X  2.5 Gbps Active/ LinkUp)==>  36   1[ ] "rc14 HCA-1" ( Could be 10.0 Gbps)
48   23[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  29   1[ ] "rc16 HCA-1" ( )
...
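If you would rather check a local HCA's negotiated width and speed from a program instead of parsing iblinkinfo, a minimal libibverbs query looks roughly like this (first device and port 1 are assumptions for the example):

#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);  /* first HCA */
    struct ibv_port_attr port;
    if (ctx && ibv_query_port(ctx, 1, &port) == 0) {     /* port 1 */
        /* active_width/active_speed are encoded values from the IB spec;
         * e.g. width code 2 = 4X, speed code 4 = 10.0 Gbps (QDR). */
        printf("state=%d width_code=%u speed_code=%u lid=0x%x\n",
               (int)port.state, (unsigned)port.active_width,
               (unsigned)port.active_speed, (unsigned)port.lid);
    }
    if (ctx) ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}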
Measuring Bandwidth
ib_send_bw will measure bandwidth between two hosts using the send/recv verbs. An example follows below.
Src host:
ib_send_bw rc16ib
------------------------------------------------------------------
                    Send BW Test
 Number of qps   : 1
 Connection type : RC
 TX depth        : 300
 CQ Moderation   : 50
 Link type       : IB
 Mtu             : 2048
 Inline data is used up to 0 bytes message
 local address:  LID 0x24 QPN 0x80049 PSN 0xe895fd
 remote address: LID 0x1d QPN 0x200049 PSN 0xb960d2
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536      1000           939.34             939.34
------------------------------------------------------------------
Dst host:
ib_send_bw
------------------------------------------------------------------
                    Send BW Test
 Number of qps   : 1
 Connection type : RC
 RX depth        : 600
 CQ Moderation   : 50
 Link type       : IB
 Mtu             : 2048
 Inline data is used up to 0 bytes message
 local address:  LID 0x1d QPN 0x200049 PSN 0xb960d2
 remote address: LID 0x24 QPN 0x80049 PSN 0xe895fd
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536      1000           -nan               940.85
------------------------------------------------------------------
Measuring Latency
Use ib_send_lat or ibv_ud_pingpong as above. Note that the two apps may have different defaults for packet sizes, inlining, etc.