Modifying the minimum timeout value on Mellanox cards:
If you want to leverage the connection timeout detection of Mellanox cards to design a fault-tolerant system, you very quickly realize that the crash-detection tools at your disposal work at a time resolution an order of magnitude higher than the latency you are actually aiming for. This has a significant effect on the cluster management, fault detection and fault recovery system you can design. Luckily, there is a workaround.

First, the issue:

Mellanox ConnectX-2 NICs enforce a lower limit on timeouts (specifically, on the IBV_QP_TIMEOUT attribute). On these cards the minimum timeout value is 500 ms. Combined with the default setting of 7 retries, this means that after a failure (e.g., a crashed server) a transmit buffer is held by the NIC for roughly 8 × 500 ms ≈ 4 seconds before it is returned with an error.

The consequence:

Nodes that kept a connection open to the faulty server can run out of transmit buffers, which either leads to errors or leaves the whole cluster hanging for a couple of seconds... Not really nice.
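For context, this timeout is something the application normally chooses when moving a reliably-connected QP to the RTS state. The sketch below (a hypothetical helper; error handling and the rest of the connection setup are omitted) shows where IBV_QP_TIMEOUT and the retry count live in the verbs API. Note that the timeout field is an exponent, the actual timeout being roughly 4.096 µs × 2^value, so the API itself can express far smaller timeouts than the stock ConnectX-2 firmware will accept.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Hypothetical helper: move an RC QP to RTS with an explicit ACK timeout
 * and retry count. 'timeout_exp' is an exponent: the actual timeout is
 * about 4.096 us * 2^timeout_exp (e.g. 14 -> ~67 ms). On an unpatched
 * ConnectX-2 the firmware enforces a ~500 ms floor, so with the default
 * retry count of 7 a send to a dead peer only completes (in error) after
 * roughly 8 * 500 ms = 4 s. */
int move_qp_to_rts(struct ibv_qp *qp, uint32_t sq_psn,
                   uint8_t timeout_exp, uint8_t retry_cnt)
{
    struct ibv_qp_attr attr = {
        .qp_state      = IBV_QPS_RTS,
        .timeout       = timeout_exp,   /* IBV_QP_TIMEOUT: local ACK timeout */
        .retry_cnt     = retry_cnt,     /* transport retries before error    */
        .rnr_retry     = 7,             /* 7 = retry forever on RNR NAKs     */
        .sq_psn        = sq_psn,
        .max_rd_atomic = 1,
    };

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}

Even if you ask for a much smaller exponent here, the card still applies its internal minimum, which is exactly what the firmware change below removes.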
The solution:

To fix the problem, you need to modify the firmware in the NICs as follows:
- Get the appropriate version of the firmware (the .mlx file) from Mellanox to start with.
- This file needs to be combined with an appropriate .ini file. First, fetch the existing .ini file from the NIC:
flint -d /dev/mst/mtXXXXXX_pci_cr0 dc > MT_XXXXXX.ini

Check /dev/mst to verify the file name there. In this case the .ini file is named after the board_id printed by ibv_devinfo.
- Edit the .ini file to add a new qp_minimal_timeout_val parameter with a value of zero. It goes in the [HCA] section, like this:
[HCA]
hca_header_device_id = 0x673c
hca_header_subsystem_id = 0x0018
dpdp_en = true
eth_xfi_en = true
mdio_en_port1 = 0
qp_minimal_timeout_val = 0
- Generate a new image from the .mlx file and the .ini file:

mlxburn -fw fw-ConnectX2-rel.mlx -conf MT_XXXXXX.ini -wrimage MT_XXXXX.bin
- Upload the image into the NIC:
flint -d /dev/mst/mtXXXXX_pci_cr0 -i MT_0DD0120009.bin -y b

- Beware: if different NICs have different board_ids, they will need different .ini files, and they may need different .mlx files (ask Mellanox for help).
Bonus: Testing your IB setup speeds:
Checking Link Speeds
Run iblinkinfo as root. This will show link speeds of all ports in the network (both on switches and HCAs).
iblinkinfo | grep \"rc

5015[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 511[ ] "rc41 HCA-1" ( )
...
4135[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 451[ ] "rcnfs HCA-1" ( )
4136[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 11[ ] "rcmaster HCA-1" ( )
4815[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 261[ ] "rc02 HCA-1" ( )
4816[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 61[ ] "rc22 HCA-1" ( )
4817[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 231[ ] "rc30 HCA-1" ( )
4818[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 301[ ] "rc10 HCA-1" ( )
4819[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 161[ ] "rc28 HCA-1" ( )
4820[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 401[ ] "rc06 HCA-1" ( )
4821[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 371[ ] "rc18 HCA-1" ( )
4822[ ] ==( 4X 2.5 Gbps Active/ LinkUp)==> 361[ ] "rc14 HCA-1" ( Could be 10.0 Gbps)
4823[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 291[ ] "rc16 HCA-1" ( )
...
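Note how the rc14 link above only negotiated 2.5 Gbps per lane, which iblinkinfo flags with "Could be 10.0 Gbps". If you also want to sanity-check the local HCA from inside an application (say, at start-up), the same width/speed information is exposed through the verbs API. A minimal sketch, assuming a single HCA and that port 1 is the one in use:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no IB devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_port_attr port;
    if (!ctx || ibv_query_port(ctx, 1, &port)) {
        fprintf(stderr, "failed to query port 1\n");
        return 1;
    }

    /* active_width and active_speed are encoded values:
     * width 2 = 4X, speed 4 = 10.0 Gbps per lane (QDR). */
    printf("state=%d width_code=%d speed_code=%d\n",
           port.state, port.active_width, port.active_speed);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

Comparing this against what iblinkinfo reports for the switch side is a quick way to spot a link that trained at the wrong speed.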
Measuring Bandwidth

ib_send_bw will measure bandwidth between two hosts using the send/recv verbs. An example follows below.

Src host:
ib_send_bw rc16ib

------------------------------------------------------------------
                    Send BW Test
Number of qps    : 1
Connection type  : RC
TX depth         : 300
CQ Moderation    : 50
Link type        : IB
Mtu              : 2048
Inline data is used up to 0 bytes message
 local address:  LID 0x24 QPN 0x80049 PSN 0xe895fd
 remote address: LID 0x1d QPN 0x200049 PSN 0xb960d2
------------------------------------------------------------------
 #bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]
 65536   1000         939.34           939.34
------------------------------------------------------------------

Dst host:
ib_send_bw

------------------------------------------------------------------
                    Send BW Test
Number of qps    : 1
Connection type  : RC
RX depth         : 600
CQ Moderation    : 50
Link type        : IB
Mtu              : 2048
Inline data is used up to 0 bytes message
 local address:  LID 0x1d QPN 0x200049 PSN 0xb960d2
 remote address: LID 0x24 QPN 0x80049 PSN 0xe895fd
------------------------------------------------------------------
 #bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]
 65536   1000         -nan             940.85
------------------------------------------------------------------

Measuring Latency
Use ib_send_lat or ibv_ud_pingpong as above. Note that the two apps may have different defaults for packet sizes, inlining, etc.
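About the inlining note: with inline sends the CPU copies the payload directly into the work request at post time instead of having the HCA DMA it from registered memory, which is why differing defaults can skew small-message latency numbers. A rough sketch of what that looks like at the verbs level (hypothetical helper; assumes the QP was created with a large enough max_inline_data):

#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a small send with the payload inlined into the work request.
 * With IBV_SEND_INLINE the data is copied by the CPU when the WR is
 * posted, so no memory registration (lkey) is needed for 'buf', but
 * 'len' must not exceed the QP's max_inline_data capability. */
int post_inline_send(struct ibv_qp *qp, const void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = 0,                /* ignored for inline sends */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_INLINE | IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr;

    return ibv_post_send(qp, &wr, &bad_wr);
}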