This page explains how to set up, diagnose, and benchmark InfiniBand networks.
InfiniBand (abbreviated IB) is an alternative to Ethernet and Fibre Channel. IB provides high bandwidth and low latency. IB can transfer data directly between a storage device on one machine and userspace on another machine, bypassing the overhead of a system call. IB adapters handle the networking protocols themselves, unlike Ethernet networking protocols, which are run on the CPU. This leaves the OS and CPU free while high-bandwidth transfers take place; with 10Gb+ Ethernet, that CPU load can be a real problem.
IB hardware is made by Mellanox (which merged with Voltaire, and is heavily backed by Oracle) and Intel (which acquired QLogic's IB division in 2012). IB is most often used by supercomputers, clusters, and data centers. IBM, HP, and Cray are also members of the InfiniBand Steering Committee. Facebook, Twitter, eBay, YouTube, and PayPal are examples of IB users.
IB software is developed under the OpenFabrics Open Source Alliance.
Affordable used equipment
Because large businesses benefit so much from jumping to newer versions, and because of the maximum length limitations of passive IB cabling, the high cost of active IB cabling, and a setup that is more technically complex than Ethernet, the used IB market is heavily saturated. This allows used IB devices to be used affordably at home or in smaller businesses for their internal networks.
Signal transfer rates
In the beginning, IB transfer rates corresponded to the maximum supported by PCI Express (abbreviated PCIe). Later, as PCIe made less progress, the transfer rates corresponded to other I/O technologies, and the number of PCIe lanes per port was increased instead. IB launched using SDR (Single Data Rate) with a signaling rate of 2.5Gb/s per lane (corresponding with PCI Express v1.0), and has added: DDR (Double Data Rate) at 5Gb/s (PCI Express v2.0); QDR (Quad Data Rate) at 10Gb/s (matching the throughput of PCI Express 3.0, which relies on improved encoding rather than a higher signaling rate); and FDR (Fourteen Data Rate) at 14.0625Gb/s (matching 16GFC Fibre Channel). IB is now delivering EDR (Enhanced Data Rate) at 25Gb/s (matching 25Gb Ethernet). HDR (High Data Rate) at 50Gb/s is planned for around 2017.
Because SDR, DDR, and QDR versions use 8b/10b encoding (8 bits of data take 10 bits of signaling), effective throughput for these is lowered to 80% of the signaling rate: SDR at 2Gb/s/lane; DDR at 4Gb/s/lane; and QDR at 8Gb/s/lane. Starting with FDR, IB uses 64b/66b encoding, allowing a higher ratio of effective throughput to signaling rate of 96.97%: FDR at 13.64Gb/s/lane; EDR at 24.24Gb/s/lane; and HDR at 48.48Gb/s/lane.
IB devices are capable of sending data over multiple lanes, though commercial products standardized around 4 lanes per cable.
With the common 4X devices, this gives total effective throughputs of: SDR at 8Gb/s; DDR at 16Gb/s; QDR at 32Gb/s; FDR at 54.54Gb/s; EDR at 96.97Gb/s; and HDR at 193.94Gb/s.
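The throughput figures above follow from one multiplication: signaling rate, times the encoding ratio, times the number of lanes. The sketch below reproduces it; the function name `effective_gbps` and its argument order are made up for illustration. It rounds to two decimals, so FDR prints 54.55 where the text truncates to 54.54.

```shell
#!/bin/sh
# effective_gbps RATE DATA_BITS CODED_BITS LANES
# Hypothetical helper (not part of any IB package): prints the effective
# throughput in Gb/s for a per-lane signaling rate, a line encoding
# (DATA_BITS of payload per CODED_BITS on the wire), and a lane count.
effective_gbps() {
    awk -v rate="$1" -v data="$2" -v coded="$3" -v lanes="$4" \
        'BEGIN { printf "%.2f\n", rate * data / coded * lanes }'
}

effective_gbps 2.5 8 10 4        # SDR 4X, 8b/10b
effective_gbps 10 8 10 4         # QDR 4X, 8b/10b
effective_gbps 14.0625 64 66 4   # FDR 4X, 64b/66b
effective_gbps 25 64 66 4        # EDR 4X, 64b/66b
```

Passing 1 as the lane count gives the per-lane figures instead.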
IB's latency is incredibly small: SDR (5us); DDR (2.5us); QDR (1.3us); FDR (0.7us); EDR (0.5us); and HDR (< 0.5us). For comparison 10Gb Ethernet is more like 7.22us, ten times more than FDR's latency.
IB devices are almost always backwards compatible. Connections should be established at the lowest common denominator. A DDR adapter meant for a PCI Express 8x slot should work in a PCI Express 4x slot (with half the bandwidth).
IB passive copper cables can be up to 7 meters using up to QDR, and 3 meters using FDR.
IB active fiber (optical) cables can be up to 300 meters using up to FDR (only 100 meters on FDR10).
Mellanox MetroX devices exist which allow up to 80 kilometer connections. Latency increases by about 5us per kilometer.
An IB cable can be used to directly link two computers without a switch; IB cross-over cables do not exist.
Adapters, switches, routers, and bridges/gateways must be specifically made for IB.
- HCA (Host Channel Adapter)
- Like an Ethernet NIC (Network Interface Card). Connects the IB cable to the PCI Express bus, at the full speed of the bus if the proper generation of HCA is used. It is an end node on an IB network, executes transport-level functions, and supports the IB verbs interface.
- Switch
- Like an Ethernet switch. Moves packets from one link to another on the same IB subnet.
- Router
- Like an Ethernet router. Moves packets between different IB subnets.
- Bridge/Gateway
- A standalone piece of hardware, or a computer performing this function. Bridges IB and Ethernet networks.
GUIDs are like Ethernet MAC addresses, but a device has multiple GUIDs. They are assigned by the hardware manufacturer and remain the same through reboots. They are 64-bit addresses (24-bit manufacturer prefix and 40-bit device identifier), and are given to adapters, switches, routers, and bridges/gateways.
- Node GUID
- Identifies the HCA, Switch, or Router
- Port GUID
- Identifies a port on an HCA, Switch, or Router (even an HCA often has multiple ports)
- System GUID
- Allows treating multiple GUIDs as one entity
- LID (Local IDentifier)
- 16-bit addresses, assigned by the Subnet Manager when it picks up the device. Used for routing packets. Not persistent through reboots.
- SM (Subnet Manager)
- Actively manages an IB subnet. Can be implemented as a software program on a computer connected to the IB network, built in to an IB switch, or as a specialized IB device. Initializes and configures everything else on the subnet, including assigning LIDs (Local IDentifiers). Establishes traffic paths through the subnet. Isolates faults. Prevents unauthorized Subnet Managers. You can have multiple switches all on one subnet, under one Subnet Manager. You can have redundant Subnet Managers on one subnet, but only one can be active at a time.
- MAD (MAnagement Datagram)
- Standard message format for subnet manager to and from IB device communication, carried by a UD (Unreliable Datagram).
- UD (Unreliable Datagram)
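The GUID layout described in the list above (a 24-bit manufacturer prefix followed by a 40-bit device identifier) can be made concrete by splitting a GUID at the sixth hex digit. This is only a sketch: `split_guid` is a hypothetical helper, and the sample value is the Node GUID from this article's ibstat example.

```shell
#!/bin/sh
# split_guid GUID
# Hypothetical helper: splits a 64-bit GUID (16 hex digits, optional 0x
# prefix) into its 24-bit manufacturer prefix and 40-bit device identifier.
split_guid() {
    guid=${1#0x}
    if [ "${#guid}" -ne 16 ]; then
        echo "expected 16 hex digits, got: $1" >&2
        return 1
    fi
    prefix=$(printf '%s' "$guid" | cut -c1-6)    # 6 hex digits = 24 bits
    device=$(printf '%s' "$guid" | cut -c7-16)   # 10 hex digits = 40 bits
    echo "prefix=$prefix device=$device"
}

split_guid 0x0002c90300002f78
```

Node, port, and system image GUIDs of the same adapter normally share the prefix and differ only in the last digits of the device part.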
First install rdma-core which contains all core libraries and daemons.
Running the most recent firmware can give significant performance increases, and fix connectivity issues.
- Install mstflint (AUR)
- Determine your adapter's PCI device ID (in this example, "05:00.0" is the adapter's PCI device ID)
$ lspci | grep Mellanox
05:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)
- Determine what firmware version your adapter has, and your adapter's PSID (more specific than just a model number - specific to a compatible set of revisions)
# mstflint -d <adapter PCI device ID> query
...
FW Version: 2.7.1000
...
PSID: MT_04A0110002
- Check latest firmware version
- Visit Mellanox's firmware download page (this guide incorporates this link's "firmware burning instructions", using its mstflint option)
- Choose the category of device you have
- Locate your device's PSID on their list, that mstflint gave you
- Examine the Firmware Image filename to see if it is more recent than your adapter's FW Version, e.g.
fw-25408-2_9_1000-MHGH28-XTC_A1.bin.zip is version 2.9.1000
- If there is a more recent version, download the new firmware and burn it to your adapter
$ unzip <firmware .bin.zip file name>
# mstflint -d <adapter PCI device ID> -i <firmware .bin file name> burn
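The version comparison in the steps above can be scripted. The sketch below assumes the image name follows the `fw-<id>-<a>_<b>_<c>-...` layout of the example filename; both function names are hypothetical, and `sort -V` requires GNU coreutils.

```shell
#!/bin/sh
# fw_version_from_filename NAME
# Hypothetical helper: turns the "2_9_1000" part of a firmware image name
# into "2.9.1000". Prints nothing if the name has a different layout.
fw_version_from_filename() {
    printf '%s\n' "$1" |
        sed -n -E 's/^fw-[^-]+-([0-9]+)_([0-9]+)_([0-9]+)-.*$/\1.\2.\3/p'
}

# is_newer CANDIDATE INSTALLED
# Succeeds when CANDIDATE sorts strictly after INSTALLED as a version string.
is_newer() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

candidate=$(fw_version_from_filename fw-25408-2_9_1000-MHGH28-XTC_A1.bin.zip)
if is_newer "$candidate" 2.7.1000; then
    echo "$candidate is newer than 2.7.1000"
fi
```

Here 2.7.1000 stands in for the FW Version that mstflint reported; substitute your own value.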
Search for the model number (or a substring) over at Intel Download Center and follow the instructions. The downloaded software will probably need to be run from RHEL/CentOS or SUSE/OpenSUSE.
Edit /etc/rdma/modules/rdma.conf and /etc/rdma/modules/infiniband.conf to your liking. Then load the kernel modules listed in these files, such as ib_ipoib, or just reboot the system. (Although there should be no need to do this, start and enable both rdma-load-modules@rdma.service and rdma-load-modules@infiniband.service if kernel modules are not loaded correctly. Rebooting the system will be fine.)
Changes to /etc/rdma/modules/rdma.conf only take effect at most once per boot, when rdma-load-modules@*.service is started for the first time. Restarting rdma-load-modules@*.service has no effect.
Each IB network requires at least one subnet manager. Without one, devices may show having a link, but will never change state from Initializing to Active. A subnet manager often (typically every 5 or 30 seconds) checks the network for new adapters and adds them to the routing tables. If you have an IB switch with an embedded subnet manager, you can use that, or you can keep it disabled and use a software subnet manager instead. Dedicated IB subnet manager devices also exist.
If the port is in the physical state Sleep (can be verified with ibstat), then it first needs to be enabled by running ibportstate --Direct 0 1 enable for it to wake up. This may need to be automated at boot if the ports at both ends of the link are sleeping.
Software subnet manager
On one system:
- Install opensm (AUR)
- Correct the systemd file /usr/lib/systemd/system/opensm.service as described below.
- Start and enable opensm.service
The systemd configuration currently shipped with opensm is not compatible with RDMA's systemd configuration. Edit the following 2 lines in /usr/lib/systemd/system/opensm.service (the commented lines are the original contents):
# Requires=rdma.service
Requires=rdma-load-modules@rdma.service
# After=rdma.service
After=rdma-load-modules@rdma.service
All of your connected IB ports should now be in a (port) state of Active, and a physical state of LinkUp. You can check this by running ibstat.
... (look at the ports shown you expect to be connected)
State: Active
Physical state: LinkUp
...
Or by examining the /sys filesystem:
$ cat /sys/class/infiniband/kernel_module/ports/port_number/phys_state
$ cat /sys/class/infiniband/kernel_module/ports/port_number/state
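To check every port at once rather than one file at a time, the two sysfs files can be read in a loop. This is a minimal sketch: `ib_port_states` is a hypothetical helper, and the optional directory argument exists only so it can be pointed at a copy of the tree. On typical kernels each file holds a numeric code followed by a name, such as "4: ACTIVE" and "5: LinkUp".

```shell
#!/bin/sh
# ib_port_states [SYSFS_DIR]
# Prints "<device> port <N>: <state> | <phys_state>" for every port found.
# SYSFS_DIR defaults to /sys/class/infiniband.
ib_port_states() {
    base=${1:-/sys/class/infiniband}
    for port_dir in "$base"/*/ports/*; do
        # Skip the unexpanded glob when no IB device is present.
        [ -r "$port_dir/state" ] || continue
        device=${port_dir#"$base"/}   # e.g. mlx4_0/ports/1
        device=${device%%/*}          # e.g. mlx4_0
        port=${port_dir##*/}          # e.g. 1
        printf '%s port %s: %s | %s\n' "$device" "$port" \
            "$(cat "$port_dir/state")" "$(cat "$port_dir/phys_state")"
    done
}

ib_port_states
```

On a machine without IB hardware the loop finds nothing and prints nothing.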
You can create a virtual Ethernet Adapter that runs on the HCA. This is intended so programs designed to work with TCP/IP but not IB, can (indirectly) use IB networks. Performance is negatively affected due to sending all traffic through the normal TCP stack; requiring system calls, memory copies, and network protocols to run on the CPU rather than on the HCA.
The IB interface will appear when the module ib_ipoib is loaded. The simplest way to make it appear is to add the line ib_ipoib to /etc/rdma/modules/infiniband.conf, then reboot the system. After booting the system with the module ib_ipoib, links with a name like ibp16s0 should be confirmed with the command ip link.
Detailed configuration is possible for the IB interface (e.g. naming it ib0 and assigning IP addresses like a traditional Ethernet adapter).
IPoIB can run in datagram (default) or connected mode. Connected mode allows you to set a higher MTU, but increases TCP latency for short messages by about 5% compared with datagram mode.
To see the current mode used:
$ cat /sys/class/net/interface/mode
In datagram mode, UD (Unreliable Datagram) transport is used, which typically forces the MTU to be 2044 bytes. Technically, the MTU is the IB L2 MTU minus 4 bytes for the IPoIB encapsulation header, which usually works out to 2044 bytes.
In connected mode, RC (Reliable Connected) transport is used, which allows an MTU up to the maximum IP packet size, 65520 bytes.
To see your MTU:
$ ip link show interface
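Both values can be read together from sysfs, and the datagram-mode MTU can be derived from the IB link MTU. The sketch below uses two hypothetical helpers; the optional second argument of `ipoib_summary` exists only so the function can be pointed at a copy of /sys/class/net.

```shell
#!/bin/sh
# ipoib_summary INTERFACE [NET_DIR]
# Prints the IPoIB mode and MTU of INTERFACE as read from sysfs.
# NET_DIR defaults to /sys/class/net.
ipoib_summary() {
    iface=$1
    net_dir=${2:-/sys/class/net}
    mode_file=$net_dir/$iface/mode
    mtu_file=$net_dir/$iface/mtu
    if [ ! -r "$mode_file" ] || [ ! -r "$mtu_file" ]; then
        echo "$iface: no readable mode/mtu files, is it an IPoIB interface?" >&2
        return 1
    fi
    echo "$iface: mode=$(cat "$mode_file") mtu=$(cat "$mtu_file")"
}

# datagram_mtu IB_MTU
# Datagram-mode MTU is the IB link MTU minus the 4-byte IPoIB header.
datagram_mtu() {
    echo $(( $1 - 4 ))
}

datagram_mtu 2048    # prints 2044
# Usage on a real system, where ib0 is an assumed interface name:
#   ipoib_summary ib0
```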
Finetuning connection mode and MTU
You only need ipoibmodemtu if you want to change the default connection mode and/or MTU.
- Install and set up TCP/IP over IB (IPoIB)
- Install ipoibmodemtu (AUR)
- Configure it via /etc/ipoibmodemtu.conf, which contains instructions on how to do so
- It defaults to setting a single IB port to connected mode and adjusting its MTU
- Start and enable
Different setups will see different results. Some people see a gigantic (double+) speed increase by using connected mode and MTU 65520, and a few see about the same or even worse speeds. Use qperf and iperf to finetune your system.
Using the qperf examples given in this article, here are example results from an SDR network (8 theoretical Gb/s) with various finetuning:
Soft RoCE (RXE)
Soft RoCE is a software implementation of RoCE that allows using InfiniBand over any Ethernet adapter.
- Install ethtool
- Run rxe_cfg start to load RXE modules and configure persistent instances.
- Run rxe_cfg add ethN to configure an RXE instance on Ethernet device ethN.
You should now have an rxe0 device:
# rxe_cfg status
Name    Link  Driver      Speed  NMTU  IPv4_addr        RDEV  RMTU
enp1s0  yes   virtio_net         1500  192.168.122.211  rxe0  1024  (3)
Remote data storage
You can share physical or virtual devices from a target (host/server) to an initiator (guest/client) system over an IB network, using iSCSI, iSCSI with iSER, or SRP. These methods differ from traditional file sharing (e.g. Samba or NFS) because the initiator system views the shared device as its own block level device, rather than a traditionally mounted network shared folder.
The disadvantage is that only one system can use each shared device at a time; trying to mount a shared device on the target or another initiator system will fail (an initiator system can certainly run traditional file sharing on top).
The advantages are higher bandwidth, more control, and even having an initiator's root filesystem physically located remotely (remote booting).
targetcli acts like a shell that presents its complex (and not worth creating by hand) /etc/target/saveconfig.json as a pseudo-filesystem.
Installing and using
On the target system:
- Install targetcli-fb (AUR)
- Start and enable
- In any pseudo-directory, you can run help to see the commands available in that pseudo-directory, or help followed by a command (such as help create) for more detailed help
- Tab-completion is also available for many commands
- Run ls to see the entire pseudo-filesystem at and below the current pseudo-directory
Enter the configuration shell:
# targetcli
Within targetcli, set up a backstore for each device or virtual device to share:
- To share an actual block device, run:
cd /backstores/block; and
create name dev
- To share a file as a virtual block device, run:
cd /backstores/fileio; and
create name file
- To share a physical SCSI device as a pass-through, run:
cd /backstores/pscsi; and
create name dev
- To share a RAM disk, run:
cd /backstores/ramdisk; and
create name size
- Where name is for the backstore's name
- Where dev is the block device to share (e.g. /dev/disk/by-id/XXX, or an LVM logical volume)
- Where file is the file to share (i.e.
- Where size is the size of the RAM disk to create (i.e. 512MB, 20GB)
iSCSI allows storage devices and virtual storage devices to be used over a network. For IB networks, the storage can either work over IPoIB or iSER.
There is a lot of overlap with the iSCSI Target, iSCSI Initiator, and iSCSI Boot articles, but the necessities will be discussed since much needs to be customized for usage over IB.
Perform the target system instructions first, which will direct you when to temporarily switch over to the initiator system instructions.
- On the target and initiator systems, install TCP/IP over IB
- On the target system, for each device or virtual device you want to share, in targetcli:
- Create a backstore
- For each backstore, create an IQN (iSCSI Qualified Name) (the name other systems' configurations will see the storage as)
cd /iscsi; and create. It will give you a randomly_generated_target_name
- Set up the TPG (Target Portal Group), automatically created in the last step as tpg1
- Create a lun (Logical Unit Number)
cd randomly_generated_target_name/tpg1/luns; and create storage_object, where storage_object is a full path to an existing storage object
- Create an acl (Access Control List)
cd ../acls; and create wwn, where wwn is the initiator system's IQN (iSCSI Qualified Name), aka its WWN (World Wide Name)
- Get the wwn from the initiator system, not this target system (after installing open-iscsi on it)
- Save and exit by running:
- On the initiator system:
- Install open-iscsi
- At this point, you can obtain this initiator system's IQN (iSCSI Qualified Name), aka its wwn (World Wide Name), for setting up the target system's acl
- pacman should have displayed >>> Setting Initiatorname wwn
- Otherwise, run: cat /etc/iscsi/initiatorname.iscsi to see it
- Start and enable
- To automatically login to discovered targets at boot, before discovering targets, set node.startup = automatic in the open-iscsi configuration
- Discover online targets. Run iscsiadm -m discovery -t sendtargets -p portal as root, where portal is an IP (v4 or v6) address or hostname
- If using a hostname, make sure it routes to the IB IP address rather than Ethernet - it may be beneficial to just use the IB IP address
- To automatically login to discovered targets at boot, Start and enable
- To manually login to discovered targets, run iscsiadm -m node -L all as root.
- View which block device ID was given to each target logged into. Run iscsiadm -m session -P 3 | grep Attached as root. The block device ID will be the last line in the tree for each target (-P is the print command, its option is the verbosity level, and only level 3 lists the block device IDs)
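When copying the initiator's IQN into the target's acl, it can help to print only the IQN rather than the whole file. This sketch assumes /etc/iscsi/initiatorname.iscsi holds a line of the form `InitiatorName=<iqn>`; `initiator_iqn` is a hypothetical helper name.

```shell
#!/bin/sh
# initiator_iqn [FILE]
# Prints whatever follows "InitiatorName=" in FILE
# (default: /etc/iscsi/initiatorname.iscsi).
initiator_iqn() {
    sed -n 's/^InitiatorName=//p' "${1:-/etc/iscsi/initiatorname.iscsi}"
}

# Only print the IQN when open-iscsi has already created the file.
if [ -r /etc/iscsi/initiatorname.iscsi ]; then
    initiator_iqn
fi
```

The printed value is what the target system expects as wwn in `create wwn`.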
iSER (iSCSI Extensions for RDMA) takes advantage of IB's RDMA protocols, rather than using TCP/IP. It eliminates TCP/IP overhead, and provides higher bandwidth, zero copy time, lower latency, and lower CPU utilization.
Follow the iSCSI Over IPoIB instructions, with the following changes:
- If you wish, instead of installing IPoIB, you can just install RDMA for loading kernel modules
- On the target system, after everything else is set up, while still in targetcli, enable iSER on the target:
- Run cd /iscsi/iqn/tpg1/portals/0.0.0.0:3260 for each iqn you want to have use iSER rather than IPoIB
- Where iqn is the randomly generated target name
- Save and exit by running:
- On the initiator system, when running iscsiadm to discover online targets, use the additional argument -I iser, and when you login to them, you should see:
Logging in to [iface: iser...
Adding to /etc/fstab
The last time you discovered targets, automatic login must have been turned on.
Add your mount entry to /etc/fstab as if it were a local block device, except add a _netdev option to avoid attempting to mount it before network initialization.
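As an illustration of such an entry, the sketch below prints an fstab line with `_netdev` added to the options. The helper name, the device path, the mount point, and the filesystem type are all placeholders, not values taken from this article.

```shell
#!/bin/sh
# fstab_netdev_line DEVICE MOUNTPOINT FSTYPE
# Hypothetical helper: prints an fstab entry that uses default options plus
# _netdev, with dump and fsck pass both set to 0.
fstab_netdev_line() {
    printf '%s %s %s defaults,_netdev 0 0\n' "$1" "$2" "$3"
}

# Placeholder values; review the line before appending it to /etc/fstab.
fstab_netdev_line /dev/disk/by-id/EXAMPLE /mnt/iscsi ext4
```

Using a persistent name such as a /dev/disk/by-id path or a UUID is safer than /dev/sdX, because the block device name given to an iSCSI target can change between logins.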
An IB subnet can be partitioned for different customers or applications, giving security and quality of service guarantees. Each partition is identified by a PKEY (Partition Key).
SDP (Sockets Direct Protocol)
Use librdmacm (successor to rsockets and libsdp) and LD_PRELOAD to intercept non-IB programs' socket calls, and transparently (to the program) send them over IB via RDMA. This dramatically speeds up programs built for TCP/IP, much more than can be achieved by using IPoIB. It avoids the need to change the program's source code to work with IB and can even be used for closed source programs. It does not work for programs that statically link in socket libraries.
Diagnosing and benchmarking
All IB specific tools are included in rdma-core and ibutils (AUR).
ibstat - View a computer's IB GUIDs
ibstat will show you detailed information about each IB adapter in the computer it is run on, including: model number; number of ports; firmware and hardware version; node, system image, and port GUIDs; and port state, physical state, rate, base lid, lmc, SM lid, capability mask, and link layer.
CA 'mlx4_0'
    CA type: MT25418
    Number of ports: 2
    Firmware version: 2.9.1000
    Hardware version: a0
    Node GUID: 0x0002c90300002f78
    System image GUID: 0x0002c90300002f7b
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 20
        Base lid: 3
        LMC: 0
        SM lid: 3
        Capability mask: 0x0251086a
        Port GUID: 0x0002c90300002f79
        Link layer: InfiniBand
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510868
        Port GUID: 0x0002c90300002f7a
        Link layer: InfiniBand
This example shows a Mellanox Technologies (MT) adapter. Its PCI Device ID is reported (25418), rather than the model number or part number. It shows a state of "Active", which means it is properly connected to a subnet manager. It shows a physical state of "LinkUp", which means it has an electrical connection via cable, but is not necessarily properly connected to a subnet manager. It shows a total rate of 20 Gb/s (which for this card is from a 5.0 Gb/s signaling rate and a 4X link width, i.e. 4 lanes). It shows the subnet manager assigned the port a lid of 3.
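On hosts with several adapters or ports, the state and physical state lines can be condensed into one line per port. The sketch below is a hypothetical filter, not part of rdma-core; it assumes ibstat output shaped like the example in this section and is fed a shortened copy of it here.

```shell
#!/bin/sh
# ibstat_port_states
# Reads ibstat output on stdin and prints one line per port:
# "<CA> port <N>: <State> / <Physical state>".
ibstat_port_states() {
    awk -v q="'" '
        # "CA <quoted name>" starts a new adapter; "CA type:" is skipped.
        $1 == "CA" && index($2, q) == 1 { ca = $2; gsub(q, "", ca) }
        # "Port 1:" starts a port; "Port GUID:" does not match.
        $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:$/, "", port) }
        $1 == "State:" { state = $2 }
        $1 == "Physical" && $2 == "state:" {
            printf "%s port %s: %s / %s\n", ca, port, state, $3
        }
    '
}

printf '%s\n' \
    "CA 'mlx4_0'" \
    "    CA type: MT25418" \
    "    Port 1:" \
    "        State: Active" \
    "        Physical state: LinkUp" \
    "    Port 2:" \
    "        State: Down" \
    "        Physical state: Polling" |
    ibstat_port_states
```

On a real system the same filter would be used as `ibstat | ibstat_port_states`.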
ibhosts - View all hosts on IB network
ibhosts will show you the Node GUIDs, number of ports, and device names, for each host on the IB network.
Ca : 0x0002c90300002778 ports 2 "MT25408 ConnectX Mellanox Technologies"
Ca : 0x0002c90300002f78 ports 2 "hostname mlx4_0"
ibswitches - View all switches on IB network
ibswitches will show you the Node GUIDs, number of ports, and device names, for each switch on the IB network. If you are running with direct connections only, it will show nothing.
iblinkinfo - View link information on IB network
iblinkinfo will show you the device names, Port GUIDs, link width (number of lanes), signal transfer rates, state, physical state, and what it is connected to.
CA: MT25408 ConnectX Mellanox Technologies:
    0x0002c90300002779 4 1[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 3 1[ ] "kvm mlx4_0" ( )
CA: hostname mlx4_0:
    0x0002c90300002f79 3 1[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 4 1[ ] "MT25408 ConnectX Mellanox Technologies" ( )
This example shows two adapters directly connected without a switch, using a 5.0 Gb/s signal transfer rate and a 4X link width (4 lanes).
ibping - Ping another IB device
ibping will attempt pinging another IB GUID.
ibping must be run in server mode on one computer.
# ibping -S
And in client mode on another. It is pinging a specific port, so it cannot take a CA name, or a Node or System GUID. It requires -G with a Port GUID, or -L with a Lid.
# ibping -G 0x0002c90300002779
-or-
# ibping -L 1
Pong from hostname.(none) (Lid 1): time 0.053 ms
Pong from hostname.(none) (Lid 1): time 0.074 ms
^C
--- hostname.(none) (Lid 4) ibping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1630 ms
rtt min/avg/max = 0.053/0.063/0.074 ms
If you are running IPoIB, you can use regular ping, which pings through the TCP/IP stack. ibping uses IB interfaces, and does not use the TCP/IP stack.
ibdiagnet - Show diagnostic information for entire subnet
ibdiagnet will show you potential problems on your subnet. You can run it without options.
-lw <1x|4x|12x> specifies the expected link width (number of lanes) for your computer's adapter, so it can check if it is running as intended.
-ls <2.5|5|10> specifies the expected link speed (signaling rate) for your computer's adapter, so it can check if it is running as intended, but it does not yet support options faster than 10 for FDR+ devices.
-c <count> overrides the default number of packets to be sent (10).
# ibdiagnet -lw 4x -ls 5 -c 1000
Loading IBDIAGNET from: /usr/lib/ibdiagnet1.5.7
-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
Loading IBDM from: /usr/lib/ibdm1.5.7
-I- Using port 1 as the local port.
-I- Discovering ... 2 nodes (0 Switches & 2 CA-s) discovered.
-I---------------------------------------------------
-I- Bad Guids/LIDs Info
-I---------------------------------------------------
-I- No bad Guids were found
-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found
-I---------------------------------------------------
-I- General Device Info
-I---------------------------------------------------
-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-I- No illegal PM counters values were found
-I---------------------------------------------------
-I- Links With links width != 4x (as set by -lw option)
-I---------------------------------------------------
-I- No unmatched Links (with width != 4x) were found
-I---------------------------------------------------
-I- Links With links speed != 5 (as set by -ls option)
-I---------------------------------------------------
-I- No unmatched Links (with speed != 5) were found
-I---------------------------------------------------
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
-I---------------------------------------------------
-I- PKey:0x7fff Hosts:2 full:2 limited:0
-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:20Gbps > group-rate:10Gbps
-I---------------------------------------------------
-I- Bad Links Info
-I- No bad link were found
-I---------------------------------------------------
----------------------------------------------------------------
-I- Stages Status Report:
    STAGE                          Errors  Warnings
    Bad GUIDs/LIDs Check           0       0
    Link State Active Check        0       0
    General Devices Info Report    0       0
    Performance Counters Report    0       0
    Specific Link Width Check      0       0
    Specific Link Speed Check      0       0
    Partitions Check               0       0
    IPoIB Subnets Check            0       1
Please see /tmp/ibdiagnet.log for complete log
----------------------------------------------------------------
-I- Done. Run time was 0 seconds.
qperf - Measure performance over RDMA or TCP/IP
qperf can measure bandwidth and latency over RDMA (SDP, UDP, UD, and UC) or TCP/IP (including IPoIB).
qperf must be run in server mode on one computer.
And in client mode on another. SERVERNODE can be a hostname, or for IPoIB a TCP/IP address. There are many tests. Some of the most useful are below.
$ qperf SERVERNODE [OPTIONS] TESTS
TCP/IP over IPoIB
$ qperf 192.168.2.2 tcp_bw tcp_lat
tcp_bw:
    bw = 701 MB/sec
tcp_lat:
    latency = 19.8 us
iperf - Measure performance over TCP/IP
iperf is not an IB aware program, and is meant to test over TCP/IP or UDP. Even though qperf can test your IB TCP/IP performance using IPoIB, iperf is still another program you can use.
iperf must be run in server mode on one computer.
$ iperf3 -s
And in client mode on another.
$ iperf3 -c 192.168.2.2
[  4] local 192.168.2.1 port 20139 connected to 192.168.2.2 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   639 MBytes  5.36 Gbits/sec
...
[  4]   9.00-10.00  sec   638 MBytes  5.35 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-10.00  sec  6.23 GBytes  5.35 Gbits/sec  sender
[  4]   0.00-10.00  sec  6.23 GBytes  5.35 Gbits/sec  receiver
iperf Done.
iperf shows Transfer in base 2 units (1 GByte = 2^30 bytes) and Bandwidth in base 10 units (1 Gbit = 10^9 bits). So, this example shows 6.23GB (base 2) in 10 seconds. That is 6.69GB (base 10) in 10 seconds (6.23 * 2^30 / 10^9). That is 5.35 Gb/s (base 10), as shown by iperf (6.23 * 2^30 / 10^9 * 8 / 10). That is about 669 MB/s (base 10), which is roughly the speed that qperf reported (6.23 * 2^30 / 10^6 / 10).
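The conversion can be checked with a short calculation. This sketch assumes that iperf3's Transfer column uses binary prefixes (1 GByte = 2^30 bytes) and that its Bandwidth column uses decimal ones, which matches the per-second lines of the output above (639 MBytes in one second shown as 5.36 Gbits/sec). Both function names are hypothetical.

```shell
#!/bin/sh
# gib_to_gbps GIBIBYTES SECONDS
# Converts a transfer given in 2^30-byte units over SECONDS into Gb/s (10^9 bits/s).
gib_to_gbps() {
    awk -v gib="$1" -v secs="$2" \
        'BEGIN { printf "%.2f\n", gib * 1073741824 * 8 / secs / 1000000000 }'
}

# gib_to_mbytes_per_sec GIBIBYTES SECONDS
# Same transfer expressed in MB/s (10^6 bytes/s), rounded to a whole number.
gib_to_mbytes_per_sec() {
    awk -v gib="$1" -v secs="$2" \
        'BEGIN { printf "%.0f\n", gib * 1073741824 / secs / 1000000 }'
}

gib_to_gbps 6.23 10            # prints 5.35
gib_to_mbytes_per_sec 6.23 10  # prints 669
```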
Common problems / FAQ
Link, physical state and port state
- See if the IB hardware modules are recognized by the system. If you have an Intel adapter, add Intel to the search pattern; if you have other Intel hardware, you will have to look through a few extra lines:
# dmesg | grep -Ei "Mellanox|InfiniBand|QLogic|Voltaire"
[ 6.287556] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
[ 8.686257] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v2.2-1 (Feb 2014)
$ ls -l /sys/class/infiniband
mlx4_0 -> ../../devices/pci0000:00/0000:00:03.0/0000:05:00.0/infiniband/mlx4_0
If nothing is shown, your kernel is not recognizing your adapter. This example shows approximately what you will see if you have a Mellanox ConnectX adapter, which uses the mlx4_0 kernel module.
- Check the port and physical states. Either run ibstat (look at the port shown that you expect to be connected) or examine /sys:
$ cat /sys/class/infiniband/<kernel module>/ports/<port number>/phys_state
$ cat /sys/class/infiniband/<kernel module>/ports/<port number>/state
The physical state should be "LinkUp". If it is not, your cable likely is not plugged in, is not connected to anything on the other end, or is defective. The (port) state should be "Active". If it is "Initializing" or "INIT", your subnet manager does not exist, is not running, or has not added the port to the network's routing tables.
- Can you successfully ibping, which uses IB directly rather than IPoIB? Can you successfully ping, if you are running IPoIB?
- Consider upgrading firmware.
getaddrinfo failed: Name or service not known
- Run ibhosts to see the CA names at the end of each line in quotes.
- Start by double-checking your expectations.
How have you determined you have a speed problem? Are you using qperf or iperf, which both transmit data to and from memory rather than hard drives? Or are you benchmarking actual file transfers, which rely on your hard drives? Unless you are running RAID to boost speed, even with the fastest SSDs available in mid 2015, a single drive (or sometimes even multiple ones) will bottleneck your IB transfer speeds. Are you using RDMA or TCP/IP via IPoIB? There is a performance hit for using IPoIB instead of RDMA.
- Check your link speeds. Run ibstat (look at the Rate shown on the port you are using), iblinkinfo (look at the middle part formatted like "4X 5.0 Gbps"), or examine /sys:
$ cat /sys/class/infiniband/<kernel module>/ports/<port number>/rate
20 Gb/sec (4X DDR)
Does this match your expected bandwidth and link width (number of lanes)?
- Check diagnostic information for the entire subnet by running ibdiagnet (see "ibdiagnet - Show diagnostic information for entire subnet"). Make sure to use -ls with the proper signaling rate, which is likely the advertised speed of your card divided by 4.
# ibdiagnet -lw <expected link width> -ls <expected signaling rate> -c 1000
- Consider upgrading firmware.
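The "advertised speed divided by 4" rule can be applied directly to the contents of the sysfs rate file. The sketch below is a hypothetical helper that assumes the "<total> Gb/sec (<width>X <name>)" format shown above. It is only meaningful for SDR, DDR, and QDR links, since -ls accepts nothing faster than 10.

```shell
#!/bin/sh
# ibdiagnet_flags "RATE STRING"
# Turns a rate string such as "20 Gb/sec (4X DDR)" into the matching
# -lw/-ls arguments by dividing the total rate by the link width.
ibdiagnet_flags() {
    printf '%s\n' "$1" | awk '
        match($0, /\([0-9]+X/) {
            # Digits between "(" and "X" are the link width.
            width = substr($0, RSTART + 1, RLENGTH - 2)
            printf "-lw %sx -ls %s\n", width, $1 / width
        }'
}

ibdiagnet_flags "20 Gb/sec (4X DDR)"   # prints: -lw 4x -ls 5
```

The printed arguments can then be passed to ibdiagnet together with -c.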