Wednesday, February 19, 2014

On allowing shorter timeouts on Mellanox cards and other tips and tricks


Modifying the minimum timeout value on Mellanox cards:

If you want to leverage the connection timeout detection of Mellanox cards to design a fault-tolerant system, you very quickly realize that the crash-detection tools at your disposal operate at a time resolution an order of magnitude higher than the latency you are aiming for. This has a significant effect on the cluster management, fault detection, and fault recovery systems you can design. Luckily, there is a workaround.

First the issue : 

Mellanox ConnectX-2 NICs enforce a lower limit on timeouts (specifically, on the IBV_QP_TIMEOUT attribute). On these cards the minimum timeout is about 500 ms; combined with the default setting of 7 retries, this means that after a failure (e.g., a crashed server) a transmit buffer is held by the NIC for about 4 seconds before it is returned with an error.
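For context, the timeout in question is the per-QP ACK timeout that is normally programmed when the queue pair transitions to RTS. The field is an exponent: the HCA waits roughly 4.096 µs × 2^timeout before retransmitting. Here is a minimal sketch of where the knob lives (standard verbs API; the QP handle and PSN are assumed to come from your own connection setup):

#include <infiniband/verbs.h>

/* Sketch: set the ACK timeout and retry count on an RC queue pair while
 * moving it to RTS.  `qp` is assumed to already be in the RTR state and
 * `sq_psn` to come from your usual connection exchange.  The timeout
 * field is an exponent: the HCA waits roughly 4.096 us * 2^timeout per
 * attempt, so 14 is ~67 ms, but the ConnectX-2 firmware discussed here
 * silently enforces a ~500 ms floor unless it is patched as described below. */
static int move_to_rts(struct ibv_qp *qp, uint32_t sq_psn)
{
    struct ibv_qp_attr attr = {
        .qp_state      = IBV_QPS_RTS,
        .timeout       = 14,  /* 4.096 us * 2^14 ~= 67 ms per attempt */
        .retry_cnt     = 7,   /* up to 7 retransmissions on timeout */
        .rnr_retry     = 7,   /* 7 = retry indefinitely on RNR NAKs */
        .sq_psn        = sq_psn,
        .max_rd_atomic = 1,
    };
    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}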

The consequence: 

Nodes that maintained a connection with the faulty server can run out of transmit buffers, which either leads to errors or leaves the whole cluster hanging for a couple of seconds... Not really nice.
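When the retries are finally exhausted, the failure surfaces as an error completion on the sender's completion queue, which is where a fault-detection loop can pick it up. A rough sketch (peer_failed is a hypothetical hook into your cluster-management layer):

#include <infiniband/verbs.h>
#include <stdio.h>

extern void peer_failed(uint32_t qp_num);  /* hypothetical cluster-management hook */

/* Sketch: drain a send CQ and treat "retry exceeded" completions as the
 * signal that the remote peer is gone.  Only at this point does the NIC
 * hand the transmit buffer (identified by wr_id) back to the application. */
static void poll_for_failures(struct ibv_cq *cq)
{
    struct ibv_wc wc[16];
    int n = ibv_poll_cq(cq, 16, wc);

    for (int i = 0; i < n; i++) {
        if (wc[i].status == IBV_WC_RETRY_EXC_ERR) {
            fprintf(stderr, "peer on QP 0x%x unreachable\n", wc[i].qp_num);
            peer_failed(wc[i].qp_num);
        }
        /* Later sends posted to the same QP complete with
         * IBV_WC_WR_FLUSH_ERR once the QP has gone into the error state. */
    }
}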

The solution : 

To fix the problem, you need to modify the firmware in the NICs as follows:
  • Get the appropriate version of the firmware (the .mlx file) from Mellanox to start with.
  • This file needs to be combined with an appropriate .ini file.  First, fetch the existing .ini file from the NIC:
        flint -d /dev/mst/mtXXXXXX_pci_cr0 dc > MT_XXXXXX.ini
    Check /dev/mst to verify the file name there.  In this case the .ini file is named after the board_id printed by ibv_devinfo 
  • Edit the .ini file to add a new qp_minimal_timeout_val parameter with a value of zero. It goes in the HCA section, like this:
[HCA]
hca_header_device_id = 0x673c
hca_header_subsystem_id = 0x0018
dpdp_en = true
eth_xfi_en = true
mdio_en_port1 = 0
qp_minimal_timeout_val = 0

  • Generate a new image from the .mlx file and the .ini file:
        mlxburn -fw fw-ConnectX2-rel.mlx -conf MT_XXXXXX.ini -wrimage MT_XXXXX.bin
  • Upload the image into the NIC:
        flint -d /dev/mst/mtXXXXX_pci_cr0 -i MT_0DD0120009.bin -y b
  • Beware: if different NICs have different board ids, they will need different .ini files, and they may need different .mlx files (ask Mellanox for help).
Now you can set the timeout as low as you want! But beware of the consequences of unnecessary retry operations: you really need to profile your setup in order to find the best timeout resolution.
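To reason about a given setting, it helps to estimate the worst-case detection delay. Under the usual verbs semantics each attempt waits roughly 4.096 µs × 2^timeout, and the NIC makes retry_cnt + 1 attempts before the work request completes with an error. A tiny helper along those lines (an illustrative sketch, not a vendor formula):

#include <stdint.h>

/* Rough worst-case delay (in microseconds) before a dead peer is reported:
 * each attempt waits ~4.096 us * 2^timeout, and the NIC makes
 * retry_cnt + 1 attempts before completing the work request in error. */
static double worst_case_detection_us(uint8_t timeout, uint8_t retry_cnt)
{
    double per_attempt_us = 4.096 * (double)(1u << timeout);
    return per_attempt_us * (retry_cnt + 1);
}

/* Example: timeout = 17 (~537 ms) with retry_cnt = 7 gives ~4.3 s, which
 * matches the ~4 s stall described above; timeout = 10 (~4.2 ms) with a
 * single retry brings the worst case down to under 10 ms. */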


Bonus: Testing your IB setup speeds:

Checking Link Speeds

Run iblinkinfo as root. This will show link speeds of all ports in the network (both on switches and HCAs).
iblinkinfo |grep \"rc
          50   15[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      51    1[  ] "rc41 HCA-1" ( )
 ...
          41   35[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      45    1[  ] "rcnfs HCA-1" ( )
          41   36[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       1    1[  ] "rcmaster HCA-1" ( )
          48   15[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      26    1[  ] "rc02 HCA-1" ( )
          48   16[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       6    1[  ] "rc22 HCA-1" ( )
          48   17[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      23    1[  ] "rc30 HCA-1" ( )
          48   18[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      30    1[  ] "rc10 HCA-1" ( )
          48   19[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      16    1[  ] "rc28 HCA-1" ( )
          48   20[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      40    1[  ] "rc06 HCA-1" ( )
          48   21[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      37    1[  ] "rc18 HCA-1" ( )
          48   22[  ] ==( 4X  2.5 Gbps Active/  LinkUp)==>      36    1[  ] "rc14 HCA-1" ( Could be 10.0 Gbps)
          48   23[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      29    1[  ] "rc16 HCA-1" ( )
...

Measuring Bandwidth

ib_send_bw will measure bandwidth between two hosts using the send/recv verbs. An example follows below.
Src host:
ib_send_bw rc16ib
------------------------------------------------------------------
                    Send BW Test
 Number of qps   : 1
 Connection type : RC
 TX depth        : 300
 CQ Moderation   : 50
 Link type       : IB
 Mtu             : 2048
 Inline data is used up to 0 bytes message
 local address: LID 0x24 QPN 0x80049 PSN 0xe895fd
 remote address: LID 0x1d QPN 0x200049 PSN 0xb960d2
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536     1000           939.34             939.34
------------------------------------------------------------------
Dst host:
ib_send_bw
------------------------------------------------------------------
                    Send BW Test
 Number of qps   : 1
 Connection type : RC
 RX depth        : 600
 CQ Moderation   : 50
 Link type       : IB
 Mtu             : 2048
 Inline data is used up to 0 bytes message
 local address: LID 0x1d QPN 0x200049 PSN 0xb960d2
 remote address: LID 0x24 QPN 0x80049 PSN 0xe895fd
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536     1000           -nan               940.85
------------------------------------------------------------------

Measuring Latency

Use ib_send_lat or ibv_ud_pingpong as above. Note that the two apps may have different defaults for packet sizes, inlining, etc.