A blog about life, Engineering, Business, Research, and everything else (especially everything else)
Showing posts with label Mellanox. Show all posts
Friday, June 24, 2016
[Links of the day] 24/06/2016: Velocity Conference, Cloud provider infrastructure investment, ConnectX-5
- Velocity 2016: slides and videos from the O'Reilly Velocity 2016 conference.
- Infrastructure Investments by Cloud Service Providers: a fantastic piece of work by the RedMonk folks looking at the infrastructure investments of the main cloud service providers and what you can learn from that information. What is interesting is that it clearly shows the lead AWS and Microsoft have over everyone else. Also, SAP is not even on the radar, which speaks volumes about their "cloud" strategy: they seem to be desperately falling behind and might not be able to catch up.
- ConnectX-5: 100 Gbps interfaces are out there, but what is really interesting is the switch-less fabric capability. With the current race for higher port counts as well as the NFV/SDN complexity explosion, we might see the emergence of switch-less infrastructure: moving switching and other network functionality into the NIC or the server itself (container / VM / unikernel), eliminating altogether the need for dedicated network hardware.
Labels: cloud, conference, links of the day, Mellanox, velocity
Wednesday, May 04, 2016
[Links of the day] 04/05/2016: Openserver Summit & Fortran OpenCoArray
- OpenCoArray: Fortran is not dead, and the work on coarrays with accelerators demonstrates it.
- Openserver Summit: a few highlights:
- PCIe 4.0: some really nice improvements in the upcoming standard in terms of performance and especially RAS. However, there is no MR-IOV capability yet, which is sorely missing for PCIe to become a true contender at the rack-scale fabric level.
- Azure SmartNIC: Microsoft uses FPGA-based SmartNICs to shorten the update cycle of its Azure cloud fabric. It is a really impressive solution.
- Persistent Memory over Fabrics: Mellanox is pushing for an RDMA-based persistent memory solution, probably trying to corner the market quickly as the 3D XPoint and Omni-Path solutions from Intel are just around the corner. However, what caught my attention is slide 14: the HGST PCM Remote Access Demo. HGST is probably one step away from merging NVM and an RDMA fabric onto a single package. With that, they would be able to compete directly with DSSD at a lower cost (following the Ethernet drive model).
Monday, August 17, 2015
Links of the day 17/08/2015: #NVMe & #RDMA, #Strategy, Cryptography in hostile environments
- NVMe over RDMA fabric: interesting bit: PMC-Sierra and Mellanox unveiled NVMe over RDMA fabrics as well as peer-direct technology for NVM storage. This opens up a whole world of possibilities where you could combine GPU, NVM(e), and RDMA without CPU involvement, literally offloading all the storage operations.
- Strategy Scenario and the use of mapping: excellent series of posts by Simon Wardley showing how leveraging his mapping technique allows CEOs and CIOs to navigate tortuous strategic decisions. The analysis of the scenario can be found here.
- The network is hostile: TL;DR: we don't encrypt enough, and not early enough.
Labels: cryptography, links of the day, Mellanox, network, nvme, rdma, strategy
Thursday, November 13, 2014
Links of the day 13 - 11 - 2014
Today's links 13/11/2014: Mellanox ConnectX4, Immutable infrastructure and ARM server
- ConnectX-4: EDR 100Gb/s InfiniBand and 100Gb/s Ethernet, 150M messages/second; impressive numbers from Mellanox.
- Fugue: immutable infrastructure, automating the creation and operation of cloud infrastructure with short-lived, simplified compute instances.
- Custom Cloud ARM Server: Online Labs designs its own ARM-based server for its cloud infrastructure.
Labels: arm, infrastructure, links of the day, Mellanox
Tuesday, October 21, 2014
Links of the day 21 - 10 - 2014
Today's links 21/10/2014: all about #Linux #networking with a little bit of #HPC distributed #storage
- State of the Linux network stack: what's new and interesting in the latest kernel release, especially the low-latency device polling.
- KVM Forum: all the videos of this year's KVM Forum. Some interesting talks, especially on the HPC front, and an interesting quote from Vincent Jardin: "if you want a high-performance networking or NFV solution, don't use virtualization, use containers".
- RDMA and ARM: Mellanox brings its RoCE adapters to the Moonshot project. Interesting to see what type of application would leverage such an architecture combination: a lot of small processors with a fast fabric.
- IX: a solution that is close to achieving the holy grail of networking: low latency with high throughput (line rate).
- (Fast Forward) Storage and I/O: Distributed Application Object Storage (DAOS) by Intel for HPC. A lot of flash and burst buffers with Lustre for supercomputers. A very interesting approach to addressing the challenges of future exascale computing platforms.
Labels: arm, HPC, intel, kernel, kvm, links of the day, linux, Mellanox, moonshot, networking, rdma, storage
Wednesday, February 19, 2014
On allowing shorter timeout on Mellanox cards and other tips and tricks
Modifying the minimum Timeout detection on Mellanox cards:
If you want to leverage the connection timeout detection of Mellanox cards to design a fault-tolerant system, you very quickly realize that the crash-detection tools at your disposal use a resolution an order of magnitude higher than the actual latency you are aiming for. This has a significant effect on the overall cluster management, fault detection, and fault recovery system you can design. Luckily, there is a way to work around the problem.
First, the issue:
Mellanox ConnectX-2 NICs enforce a lower limit on timeouts (specifically, the IBV_QP_TIMEOUT option). On ConnectX-2 the minimum timeout value is 500ms; combined with the default setting of 7 retries, this means that after a timeout (e.g., a crashed server) the transmit buffer is held by the NIC for about 4 seconds before it is returned with an error.
The consequence:
Nodes that maintained a connection with the faulty server can run out of transmit buffers, which either leads to errors or leaves the whole cluster hanging for a couple of seconds... Not really nice.
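For context, the timeout in question is the QP's local ACK timeout, which an application requests through libibverbs when moving the queue pair to RTS; the HCA interprets the value as an exponent (roughly 4.096 µs × 2^timeout). Below is a minimal sketch of that call, assuming the QP has already been created and brought to RTR elsewhere; it is illustrative, not code from the original setup:

/*
 * Minimal sketch (not from the original post): requesting the ACK timeout and
 * retry count on a reliably-connected QP when moving it to RTS.
 * The HCA interprets attr.timeout as an exponent:
 * local ACK timeout = 4.096 us * 2^timeout.
 */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

static int move_qp_to_rts(struct ibv_qp *qp, uint32_t sq_psn)
{
    struct ibv_qp_attr attr = {0};

    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 10;   /* requests ~4 ms (4.096 us * 2^10) */
    attr.retry_cnt     = 7;    /* transport retries before the WR fails */
    attr.rnr_retry     = 7;    /* 7 means "retry forever" on RNR NAKs */
    attr.sq_psn        = sq_psn;
    attr.max_rd_atomic = 1;

    int rc = ibv_modify_qp(qp, &attr,
                           IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                           IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                           IBV_QP_MAX_QP_RD_ATOMIC);
    if (rc)
        fprintf(stderr, "ibv_modify_qp(RTS) failed: %d\n", rc);
    return rc;
}

Whatever exponent is requested here, the ConnectX-2 firmware rounds the effective timeout up to roughly 500 ms, which is exactly the limit the firmware change below removes.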
The solution:
To fix the problem, you need to modify the firmware in the NICs as follows:
- Get the appropriate version of the firmware (.mlx file) from Mellanox to start with.
- This file needs to be combined with an appropriate .ini file. First, fetch the existing .ini file from the NIC:
flint -d /dev/mst/mtXXXXXX_pci_cr0 dc > MT_XXXXXX.ini
Check /dev/mst to verify the device file name there. In this case the .ini file is named after the board_id printed by ibv_devinfo.
- Edit the .ini file to add a new qp_minimal_timeout_val parameter with a value of zero. It goes in the HCA section, like this:
[HCA]
hca_header_device_id = 0x673c
hca_header_subsystem_id = 0x0018
dpdp_en = true
eth_xfi_en = true
mdio_en_port1 = 0
qp_minimal_timeout_val = 0
- Generate a new image from the .mlx file and the .ini file:
mlxburn -fw fw-ConnectX2-rel.mlx -conf MT_XXXXXX.ini -wrimage MT_XXXXX.bin
- Upload the image into the NIC:
flint -d /dev/mst/mtXXXXX_pci_cr0 -i MT_0DD0120009.bin -y b
- Beware: if different NICs have different board ids, they will need different .ini files, and they may need different .mlx files (ask Mellanox for help).
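Once the image is burned, it does not hurt to sanity check that the card actually reports the new firmware before relying on the shorter timeouts. A sketch (not part of the original procedure; the device name and output will vary on your setup):
flint -d /dev/mst/mtXXXXX_pci_cr0 q
This queries the firmware currently on the card (version, PSID, etc.); reboot or reload the driver if the running version does not match the image you just burned.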
Bonus: Testing your IB setup speeds:
Checking Link Speeds
Run iblinkinfo as root. This will show link speeds of all ports in the network (both on switches and HCAs).
iblinkinfo | grep \"rc
  50   15[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  51    1[  ] "rc41 HCA-1" ( )
  ...
  41   35[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  45    1[  ] "rcnfs HCA-1" ( )
  41   36[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>   1    1[  ] "rcmaster HCA-1" ( )
  48   15[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  26    1[  ] "rc02 HCA-1" ( )
  48   16[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>   6    1[  ] "rc22 HCA-1" ( )
  48   17[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  23    1[  ] "rc30 HCA-1" ( )
  48   18[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  30    1[  ] "rc10 HCA-1" ( )
  48   19[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  16    1[  ] "rc28 HCA-1" ( )
  48   20[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  40    1[  ] "rc06 HCA-1" ( )
  48   21[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  37    1[  ] "rc18 HCA-1" ( )
  48   22[  ] ==( 4X  2.5 Gbps Active/ LinkUp)==>  36    1[  ] "rc14 HCA-1" ( Could be 10.0 Gbps)
  48   23[  ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  29    1[  ] "rc16 HCA-1" ( )
  ...
Measuring Bandwidth
ib_send_bw will measure bandwidth between two hosts using the send/recv verbs. An example follows below.
Src host:
ib_send_bw rc16ib
------------------------------------------------------------------
                    Send BW Test
 Number of qps   : 1
 Connection type : RC
 TX depth        : 300
 CQ Moderation   : 50
 Link type       : IB
 Mtu             : 2048
 Inline data is used up to 0 bytes message
 local address:  LID 0x24 QPN 0x80049  PSN 0xe895fd
 remote address: LID 0x1d QPN 0x200049 PSN 0xb960d2
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536      1000           939.34             939.34
------------------------------------------------------------------
Dst host:
ib_send_bw
------------------------------------------------------------------
                    Send BW Test
 Number of qps   : 1
 Connection type : RC
 RX depth        : 600
 CQ Moderation   : 50
 Link type       : IB
 Mtu             : 2048
 Inline data is used up to 0 bytes message
 local address:  LID 0x1d QPN 0x200049 PSN 0xb960d2
 remote address: LID 0x24 QPN 0x80049  PSN 0xe895fd
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 65536      1000           -nan               940.85
------------------------------------------------------------------
Measuring Latency
Use ib_send_lat or ibv_ud_pingpong as above. Note that the two apps may have different defaults for packet sizes, inlining, etc.
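As a quick illustration (same client/server pattern as the bandwidth test above; rc16ib is just the example node used earlier), run the tool with no argument on the destination and point the source at it:
Dst host: ib_send_lat
Src host: ib_send_lat rc16ib
The latency tests use small messages by default (override with -s), so the reported numbers reflect per-message latency rather than the 64 KB transfers used in the bandwidth run.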
Labels: fault tolerance, firmware, infiniband, latency, Mellanox, performance, rdma, timeout