
Friday, June 24, 2016

[Links of the day] 24/06/2016: Velocity Conference, Cloud provider infrastructure investment, ConnectX-5

  • Velocity 2016 : Slides and videos of the O'Reilly Velocity 2016 conference. 
  • Infrastructure Investments by Cloud Service Providers : Fantastic work by the RedMonk folks looking at the infrastructure investments of the main cloud service providers and what you can learn from that information. What is interesting is that it clearly shows the lead that AWS and Microsoft have over all the others. Also, SAP is not even on the radar, which speaks volumes about their "cloud" strategy: they seem to be desperately falling behind and might not be able to catch up. 
  • ConnectX-5 : 100 Gbps interfaces are out there, but what is really interesting is the switch-less fabric capability. With the current race for higher port counts as well as the NFV / SDN complexity explosion, we might see the emergence of switch-less infrastructure: moving the switching and other network functionality into the NIC or the server itself (container / VM / unikernel), eliminating altogether the need for dedicated network hardware. 

Wednesday, May 04, 2016

[Links of the day] 04/05/2016 : Openserver Summit & Fortran OpenCoArray


  • OpenCoArray : Fortran is not dead, and the work on coarrays with accelerators demonstrates it.
  • Openserver Summit
    • PCIe 4.0 : Some really nice improvements in the upcoming standard in terms of performance and especially RAS. However, there is no MR-IOV capability yet, which is sorely missing to make PCIe a true contender as a rack-scale fabric. 
    • Azure SmartNIC : Microsoft uses FPGA-based SmartNICs to shorten the update cycle of their Azure cloud fabric. It's a really impressive solution. 
    • Persistent Memory over Fabrics : Mellanox pushing for an RDMA-based persistent memory solution, probably trying to corner the market quickly as the 3D XPoint and Omni-Path solutions from Intel are just around the corner. However, what caught my attention is slide 14: the HGST PCM Remote Access demo. HGST is probably one step away from merging NVM and an RDMA fabric onto a single package. With that, they would be able to compete directly with DSSD at lower cost (following the Ethernet drive model). 

Monday, August 17, 2015

Links of the day 17/08/2015 : #NVMe & #RDMA , #Strategy , Cryptography in hostile environment


  • NVMe over RDMA fabric : interesting bit: PMC-Sierra and Mellanox unveiled NVMe over RDMA fabric as well as peer-direct technology for NVM storage. This opens up a world of possibilities where you could combine GPU - NVM(e) - RDMA without any CPU involvement, literally offloading all the storage operations.
  • Strategy Scenario and the use of mapping : excellent series of posts by Simon Wardley showing how leveraging his mapping technique allows CEOs and CIOs to navigate tortuous strategic decisions. The analysis of the scenario can be found here. 
  • The network is hostile : TL;DR: we don't encrypt enough, and we don't encrypt early enough.




Thursday, November 13, 2014

Links of the day 13 - 11 - 2014

Today's links 13/11/2014: Mellanox ConnectX4, Immutable infrastructure and ARM server
  • ConnectX4 : EDR 100Gb/s InfiniBand and 100Gb/s Ethernet, 150M messages/second - impressive numbers from Mellanox. 
  • Fugue : immutable infrastructure automating the creation and operation of cloud infrastructure, with short-lived and simplified compute instances.
  • Custom Cloud ARM Server : Online Labs designs its own ARM-based server for its cloud infrastructure.

Tuesday, October 21, 2014

Links of the day 21 - 10 - 2014

Today's links 21/10/2014: all about #Linux #networking with a little bit of #HPC distributed #storage

  • State of the Linux network stack : what's new and interesting in the latest kernel release, especially the low-latency device polling
  • KVM Forum : all the videos of this year's KVM Forum. Some interesting talks, especially on the HPC front, and an interesting quote from Vincent Jardin: "if you want to have a high-performance networking or NFV solution, don't use virtualization, use containers"
  • RDMA and ARM : Mellanox brings its RoCE adapter to the Moonshot project. Interesting to see what type of application would leverage such an architecture combination: a lot of small processors with a fast fabric.  
  • IX : a solution that is close to achieving the holy grail of networking - low latency with high throughput (line rate)
  • (Fast Forward) Storage and I/O : Distributed Application Object Storage (DAOS) by Intel for HPC. A lot of flash and burst buffers with Lustre for supercomputers. A very interesting approach to addressing the challenges of future exascale computing platforms.

Wednesday, February 19, 2014

On allowing shorter timeout on Mellanox cards and other tips and tricks


Modifying the minimum Timeout detection on Mellanox cards: 

If you want to leverage the connection timeout detection of Mellanox cards to set up or design a fault-tolerant system, you very quickly realize that the tools at your disposal for detecting a crash use a resolution an order of magnitude higher than the actual latency you are aiming for. This has a significant effect on the overall cluster management, fault detection and fault recovery system you can design. Luckily, there is a workaround. 

First the issue : 

Mellanox ConnectX-2 NICs enforce a lower limit on timeouts (specifically, the IBV_QP_TIMEOUT option). For these cards the minimum timeout value is 500ms; combined with the default setting of 7 retries, this means that after a timeout (e.g., a crashed server) the transmit buffer is held by the NIC for about 4 seconds (8 x 500ms) before it is returned with an error.
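For context, IBV_QP_TIMEOUT is the local ACK timeout programmed when a reliable-connection QP is moved to the RTS state, and it is expressed as an exponent: the actual timeout is 4.096 µs x 2^timeout. Below is a minimal sketch of that call using the standard libibverbs API; the chosen value of 14 and the helper name are illustrative assumptions, not taken from the original setup.

#include <infiniband/verbs.h>

/* Move an RC queue pair to RTS and program its retransmission behaviour.
 * The local ACK timeout is 4.096 us * 2^attr.timeout, so attr.timeout = 14
 * asks for roughly 67 ms; on stock ConnectX-2 firmware the effective value
 * is clamped to ~500 ms until qp_minimal_timeout_val is set to 0 (see below). */
static int move_qp_to_rts(struct ibv_qp *qp, uint32_t sq_psn)
{
    struct ibv_qp_attr attr = {
        .qp_state      = IBV_QPS_RTS,
        .timeout       = 14,  /* illustrative: 4.096 us * 2^14 ~ 67 ms */
        .retry_cnt     = 7,   /* default retry count mentioned above */
        .rnr_retry     = 7,
        .sq_psn        = sq_psn,
        .max_rd_atomic = 1,
    };

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
}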

The consequence: 

The nodes that maintained a connection with the faulty server can run out of transmit buffers, which either leads to errors or leaves the whole cluster hanging for a couple of seconds... Not really nice. 

The solution : 

To fix the problem, you need to modify the firmware in the NICs as follows: 
  • Get from Mellanox the appropriate version of the firmware to start with. 
  • This file needs to be combined with an appropriate .ini file.  First, fetch the existing .ini file from the NIC:
        flint -d /dev/mst/mtXXXXXX_pci_cr0 dc > MT_XXXXXX.ini
    Check /dev/mst to verify the file name there.  In this case the .ini file is named after the board_id printed by ibv_devinfo 
  • Edit the .ini file to add a new qp_minimal_timeout_val parameter with a value of zero. It goes in the HCA section, like this:
[HCA]
hca_header_device_id = 0x673c
hca_header_subsystem_id = 0x0018
dpdp_en = true
eth_xfi_en = true
mdio_en_port1 = 0
qp_minimal_timeout_val = 0

  • Generate a new image from the .mlx file and the .ini file:
        mlxburn -fw fw-ConnectX2-rel.mlx -conf MT_XXXXXX.ini -wrimage MT_XXXXX.bin
  • Upload the image into the NIC:
        flint -d /dev/mst/mtXXXXX_pci_cr0 -i MT_0DD0120009.bin -y b
  • Beware: if different NICs have different board_ids, they will need different .ini files, and they may need different .mlx files (ask Mellanox for help).
Now you can set the timeout as low as you want! But beware of the consequences of unnecessary retry operations: you really need to profile your setup in order to find the best timeout resolution.
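As a rough rule of thumb (consistent with the 4-second figure above), the worst-case fault detection time is about (retry_cnt + 1) x 4.096 µs x 2^timeout. The small, hypothetical helper below just plays with those numbers; it is not part of any Mellanox tool.

#include <stdio.h>

/* Rough worst-case delay before a send is completed with an error:
 * local ACK timeout = 4.096 us * 2^timeout_exp, retried retry_cnt times. */
static double worst_case_detection_ms(unsigned timeout_exp, unsigned retry_cnt)
{
    double ack_timeout_us = 4.096 * (double)(1u << timeout_exp);
    return (retry_cnt + 1) * ack_timeout_us / 1000.0;
}

int main(void)
{
    /* Stock ConnectX-2 clamp: ~500 ms timeout (exponent ~17), 7 retries -> ~4.3 s */
    printf("clamped firmware : %.0f ms\n", worst_case_detection_ms(17, 7));
    /* After the firmware change, e.g. exponent 14 (~67 ms) -> ~0.5 s */
    printf("tuned timeout    : %.0f ms\n", worst_case_detection_ms(14, 7));
    return 0;
}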


Bonus: Testing your IB setup speeds:

Checking Link Speeds

Run iblinkinfo as root. This will show link speeds of all ports in the network (both on switches and HCAs).
iblinkinfo |grep \"rc
          50   15[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      51    1[  ] "rc41 HCA-1" ( )
 ...
          41   35[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      45    1[  ] "rcnfs HCA-1" ( )
          41   36[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       1    1[  ] "rcmaster HCA-1" ( )
          48   15[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      26    1[  ] "rc02 HCA-1" ( )
          48   16[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       6    1[  ] "rc22 HCA-1" ( )
          48   17[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      23    1[  ] "rc30 HCA-1" ( )
          48   18[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      30    1[  ] "rc10 HCA-1" ( )
          48   19[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      16    1[  ] "rc28 HCA-1" ( )
          48   20[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      40    1[  ] "rc06 HCA-1" ( )
          48   21[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      37    1[  ] "rc18 HCA-1" ( )
          48   22[  ] ==( 4X  2.5 Gbps Active/  LinkUp)==>      36    1[  ] "rc14 HCA-1" ( Could be 10.0 Gbps)
          48   23[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      29    1[  ] "rc16 HCA-1" ( )
...

Measuring Bandwidth

ib_send_bw will measure bandwidth between two hosts using the send/recv verbs. An example follows below.
Src host:
ib_send_bw rc16ib
------------------------------------------------------------------
                    Send BW Test
 Number of qps   : 1
 Connection type : RC
 TX depth        : 300
 CQ Moderation   : 50
 Link type       : IB
 Mtu             : 2048
 Inline data is used up to 0 bytes message
 local address: LID 0x24 QPN 0x80049 PSN 0xe895fd
 remote address: LID 0x1d QPN 0x200049 PSN 0xb960d2
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536     1000           939.34             939.34
------------------------------------------------------------------
Dst host:
ib_send_bw
------------------------------------------------------------------
                    Send BW Test
 Number of qps   : 1
 Connection type : RC
 RX depth        : 600
 CQ Moderation   : 50
 Link type       : IB
 Mtu             : 2048
 Inline data is used up to 0 bytes message
 local address: LID 0x1d QPN 0x200049 PSN 0xb960d2
 remote address: LID 0x24 QPN 0x80049 PSN 0xe895fd
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536     1000           -nan               940.85
------------------------------------------------------------------

Measuring Latency

Use ib_send_lat or ibv_ud_pingpong as above. Note that the two apps may have different defaults for packet sizes, inlining, etc.
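For example, mirroring the bandwidth test above (the host name rc16ib is the same illustrative one used there), something like the following should work:
Src host:
ib_send_lat rc16ib
Dst host:
ib_send_lat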