Modifying the minimum Timeout detection on Mellanox cards:
If you want to use the connection timeout detection of Mellanox cards to design a fault-tolerant system, you quickly realize that the crash-detection tools at your disposal work at a time resolution an order of magnitude higher than the latency you are aiming for. This significantly affects the cluster management, fault detection and fault recovery systems you can design. Luckily, there is a workaround.
First, the issue: Mellanox ConnectX-2 NICs enforce a lower limit on timeouts (specifically, the IBV_QP_TIMEOUT option). On these cards the minimum timeout is about 500 ms. Combined with the default setting of 7 retries, this means that after a timeout (e.g., a crashed server) the transmit buffer is held by the NIC for about 4 seconds before it is returned with an error.
Nodes that maintained a connection with the faulty server can run out of transmit buffers, which either leads to errors or leaves the whole cluster hanging for a couple of seconds. Not really nice.
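To see where the figure of about 4 seconds comes from, here is a minimal sketch of the arithmetic. It assumes the standard InfiniBand encoding of IBV_QP_TIMEOUT, where the timeout is 4.096 µs × 2^exponent. The exponent 17 is an assumption, picked because it gives roughly 537 ms, the closest value to the 500 ms quoted above. The function names are made up for illustration, and the exact retry accounting on the NIC may differ slightly.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of any Mellanox tool.

# Timeout for a given IBV_QP_TIMEOUT exponent, in nanoseconds.
# Assumed encoding: 4.096 us * 2^exponent = 4096 ns << exponent.
# (An exponent of 0 conventionally means "no timeout"; not handled here.)
qp_timeout_ns() {
    echo $(( 4096 << $1 ))
}

# Approximate time, in milliseconds, that a send stays with the NIC
# before it is returned with an error: retries * timeout.
hold_time_ms() {
    local exponent=$1 retries=$2
    echo $(( retries * $(qp_timeout_ns "$exponent") / 1000000 ))
}

hold_time_ms 17 7   # exponent 17 (assumed), 7 retries (the default)
```

With these assumptions the result is a bit under 4 seconds, which matches the behaviour described above.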
The solution:
To fix the problem, you need to modify the firmware in the NICs as follows:
- Get from Mellanox the appropriate version of the firmware to start with.
- This file needs to be combined with an appropriate .ini file. First, fetch the existing .ini file from the NIC:
flint -d /dev/mst/mtXXXXXX_pci_cr0 dc > MT_XXXXXX.ini
Check /dev/mst to verify the file name there. In this case the .ini file is named after the
- Edit the .ini file to add a new qp_minimal_timeout_val parameter with a value of zero. It goes in the HCA section, like this:
[HCA]
hca_header_device_id = 0x673c
hca_header_subsystem_id = 0x0018
dpdp_en = true
eth_xfi_en = true
mdio_en_port1 = 0
qp_minimal_timeout_val = 0
- Generate a new image from the .mlx file and the .ini file:
mlxburn -fw fw-ConnectX2-rel.mlx -conf MT_XXXXXX.ini -wrimage MT_XXXXXX.bin
- Upload the image into the NIC:
flint -d /dev/mst/mtXXXXXX_pci_cr0 -i MT_XXXXXX.bin -y b
- Beware: if different NICs have different board ids, they will need different .ini files, and they may need different .mlx files (ask Mellanox for help).
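If you have several NICs to update, the .ini edit from the steps above can be scripted. The following is only a sketch: the function name add_min_timeout is made up, and it assumes the file uses plain [SECTION] headers and key = value lines as in the excerpt above. It prints a copy of the file with qp_minimal_timeout_val = 0 in the [HCA] section, replacing any existing value, and leaves the original file untouched.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the given .ini file with
# qp_minimal_timeout_val = 0 set inside its [HCA] section.
add_min_timeout() {
    awk '
        # Append the parameter once, just before leaving the [HCA] section.
        function emit_param() {
            if (in_hca && !done) {
                print "qp_minimal_timeout_val = 0"
                done = 1
            }
        }
        # A new section header ends the previous section.
        /^\[/ { emit_param(); in_hca = ($0 ~ /^\[HCA\]/) }
        # Replace an existing setting instead of adding a duplicate.
        in_hca && /^[ \t]*qp_minimal_timeout_val[ \t]*=/ {
            print "qp_minimal_timeout_val = 0"
            done = 1
            next
        }
        { print }
        # Covers the case where [HCA] is the last section in the file.
        END { emit_param() }
    ' "$1"
}
```

For example, add_min_timeout MT_XXXXXX.ini > MT_XXXXXX.new.ini writes the edited copy to a new file, which you can diff against the original before passing it to mlxburn.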
Bonus: Testing your IB setup speeds:
Checking Link Speeds
Run iblinkinfo as root. This will show the link speeds of all ports in the network (both on switches and HCAs).
Measuring Bandwidth
ib_send_bw will measure bandwidth between two hosts using the send/recv verbs. An example follows below.
Src host:
Dst host:
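As a rough sketch of how a two-host run is typically started with the perftest tools: the destination host starts the tool with no peer argument and waits, then the source host runs the same tool with the destination's hostname. The helper name perftest_pair and the hostname dst-host are placeholders, not values from this setup. The helper only prints the two commands; it does not run them.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the command to run on each host.
#   $1: a perftest tool, e.g. ib_send_bw or ib_send_lat
#   $2: hostname of the destination host (placeholder below)
perftest_pair() {
    local tool=$1 dst=$2
    # Start this side first; with no peer argument it waits for a connection.
    printf 'Dst host: %s\n' "$tool"
    # This side names the destination and drives the traffic.
    printf 'Src host: %s %s\n' "$tool" "$dst"
}

perftest_pair ib_send_bw dst-host
```

Both tools accept further options (message size, iteration count, and so on); check each tool's --help output rather than relying on defaults.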
Measuring Latency
Use ib_send_lat or ibv_ud_pingpong as above. Note that the two apps may have different defaults for packet sizes, inlining, etc.