Thursday, August 25, 2011

Network interface bonding and trunking VLANs with different MTUs on RedHat Linux

(Last updated on Aug 30, 2011)




This post describes the configuration of a Red Hat Enterprise Linux Server 6.1 host (hardware: HP DL380 G7, two Intel CPUs) connected through two 10 Gbit/s Ethernet interfaces (HP NC522SFP Dual Port 10GbE Server Adapter Board), with interface bonding, high availability, bandwidth aggregation and multiple VLAN transport.

The two 10 Gbit/s physical network interfaces have to be connected to two FEX (Nexus Fabric Extender "switches") uplinked to two different Nexus 7000 cores, in a high availability configuration that also allows both links to be used simultaneously and to transport different VLANs.
The different VLANs then have to be mapped to different Linux logical subinterfaces.

The two physical interfaces are joined to form a bond, called bond0.
A specific channel configuration exists on the Cisco Nexus switches, allowing these ports to form a single trunk channel and to carry tagged frames from the selected VLANs over it, in compliance with 802.1Q.

Here are the settings on the Cisco Nexus Side (only one "side" is shown, the other is symmetric):

Core to FEX2232

  interface port-channel131
    switchport
    switchport mode fex-fabric
    fex associate 131
    mtu 9216



FEX to server

  interface Ethernet131/1/5
    description srv
    switchport
    switchport mode trunk
    switchport trunk allowed vlan 10-13
    flowcontrol send off
    channel-group 88 mode active
    no shutdown





On the Red Hat Linux side, the two physical interfaces are eth0 and eth2. I checked with ethtool that both see the link and that the speed and duplex settings are correct (autonegotiation is disabled both on the switch and on the server).

The two physical interfaces are joined in bond0, using bonding mode 4 (802.3ad, LACP), which matches the "channel-group 88 mode active" setting on the switch ports.

Here are the configuration files of the physical and logical interfaces, in /etc/sysconfig/network-scripts.
bond0.10 and bond0.11 are the two logical interfaces sitting on VLANs 10 and 11.
bond0.10 has MTU 1500.
bond0.11 has MTU 9000, to allow more efficient transfer of large blocks of data on the storage VLAN.
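
One way to confirm that jumbo frames really pass end to end on the storage VLAN is a ping with the don't-fragment bit set and the largest payload that fits in one frame. The sketch below only computes that payload and prints the command. The peer address is an assumption (replace it with a real host on the same VLAN), and the arithmetic assumes IPv4 without IP options.

```shell
# Largest ICMP echo payload that fits in one frame of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
ping_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Hypothetical peer on the storage VLAN; replace with a real host.
peer=192.168.148.86

# -M do sets the don't-fragment bit, so a smaller path MTU shows up as an error
# instead of being hidden by fragmentation. The command is printed, not run.
echo "ping -M do -c 3 -s $(ping_payload 9000) $peer"
```

If the printed command fails on a live system while a ping with payload 1472 works, some device on the path is probably not passing 9000-byte frames.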

-------
[root@srv network-scripts]# cat ifcfg-eth0
DEVICE=eth0
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
HWADDR=78:e3:b5:f4:e0:50
TYPE=Ethernet
IPV6INIT=no

[root@srv network-scripts]# cat ifcfg-eth2
DEVICE=eth2
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
HWADDR=78:E3:B5:F4:76:E0
TYPE=Ethernet
IPV6INIT=no


[root@srv network-scripts]# cat ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
TYPE=Ethernet
BONDING_OPTS="mode=4 miimon=100"
IPV6INIT=no
USERCTL=no
MTU=9000


[root@srv network-scripts]# cat ifcfg-bond0.10
IPADDR=192.168.145.248
NETMASK=255.255.254.0
DEVICE=bond0.10
ONBOOT=yes
VLAN=yes
TYPE=Ethernet
HWADDR=78:e3:b5:f4:e0:50
BOOTPROTO=none
GATEWAY=192.168.144.1
IPV6INIT=no
USERCTL=no
MTU=1500

[root@srv network-scripts]# cat ifcfg-bond0.11
IPADDR=192.168.148.85
NETMASK=255.255.255.0
DEVICE=bond0.11
ONBOOT=yes
VLAN=yes
TYPE=Ethernet
BOOTPROTO=none
IPV6INIT=no
USERCTL=no
MTU=9000




[root@srv network-scripts]# ifconfig eth0 && ifconfig eth2 && ifconfig bond0 && ifconfig bond0.10 && ifconfig bond0.11
eth0      Link encap:Ethernet  HWaddr 78:E3:B5:F4:E0:50
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:30750 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12992 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3590043 (3.4 MiB)  TX bytes:3577302 (3.4 MiB)
          Interrupt:53

eth2      Link encap:Ethernet  HWaddr 78:E3:B5:F4:E0:50
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:36472 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7248 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:27483898 (26.2 MiB)  TX bytes:906669 (885.4 KiB)
          Interrupt:61

bond0     Link encap:Ethernet  HWaddr 78:E3:B5:F4:E0:50
          inet6 addr: fe80::7ae3:b5ff:fef4:e050/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
          RX packets:67224 errors:0 dropped:0 overruns:0 frame:0
          TX packets:20242 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:31074081 (29.6 MiB)  TX bytes:4484967 (4.2 MiB)

bond0.10  Link encap:Ethernet  HWaddr 78:E3:B5:F4:E0:50
          inet addr:192.168.145.248  Bcast:192.168.145.255  Mask:255.255.254.0
          inet6 addr: fe80::7ae3:b5ff:fef4:e050/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:51880 errors:0 dropped:0 overruns:0 frame:0
          TX packets:13135 errors:0 dropped:10 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:8963296 (8.5 MiB)  TX bytes:3972360 (3.7 MiB)

bond0.11  Link encap:Ethernet  HWaddr 78:E3:B5:F4:E0:50
          inet addr:192.168.148.85  Bcast:192.168.148.255  Mask:255.255.255.0
          inet6 addr: fe80::7ae3:b5ff:fef4:e050/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
          RX packets:14599 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6891 errors:0 dropped:7 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:20770093 (19.8 MiB)  TX bytes:486535 (475.1 KiB)



[root@srv network-scripts]# ethtool eth0
Settings for eth0:
        Supported ports: [ FIBRE ]
        Supported link modes:   10000baseT/Full
        Supports auto-negotiation: No
        Advertised link modes:  10000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Speed: 10000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: off
        Supports Wake-on: g
        Wake-on: g
        Current message level: 0x00000005 (5)
        Link detected: yes

[root@srv network-scripts]# ethtool eth2
Settings for eth2:
        Supported ports: [ FIBRE ]
        Supported link modes:   10000baseT/Full
        Supports auto-negotiation: No
        Advertised link modes:  10000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Speed: 10000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: off
        Supports Wake-on: g
        Wake-on: g
        Current message level: 0x00000005 (5)
        Link detected: yes


-------


With this configuration, the server correctly keeps its settings across reboots, and all the interfaces have the correct MTU sizes.

The MTU only needs to be defined for bond0 and for the relevant subinterfaces. The ifcfg files of the physical interfaces carry no MTU setting: as the ifconfig output shows, eth0 and eth2 take their MTU of 9000 from bond0.
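
The reason bond0 itself is set to 9000 is that a VLAN subinterface cannot be given an MTU larger than that of the device it sits on. A minimal sketch of that constraint follows, using the values from the configuration files above. The function name is made up for illustration.

```shell
# Succeeds when a VLAN subinterface MTU fits within its parent's MTU.
# vlan_mtu_fits is a hypothetical helper, not part of any tool.
vlan_mtu_fits() {
    parent_mtu=$1
    vlan_mtu=$2
    [ "$vlan_mtu" -le "$parent_mtu" ]
}

# Values taken from ifcfg-bond0, ifcfg-bond0.10 and ifcfg-bond0.11.
bond0_mtu=9000
for entry in bond0.10:1500 bond0.11:9000; do
    name=${entry%%:*}
    mtu=${entry##*:}
    if vlan_mtu_fits "$bond0_mtu" "$mtu"; then
        echo "$name: MTU $mtu fits within bond0 MTU $bond0_mtu"
    else
        echo "$name: MTU $mtu exceeds bond0 MTU $bond0_mtu"
    fi
done
```

On a running system the current values can be read from /sys/class/net/<interface>/mtu instead of being hard-coded.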

The output of ethtool was consistent; that of mii-tool was not.
ethtool is the command to use.

Performance measurement:
With this setup, I was able to reach an end-to-end TCP throughput of 9.8 Gbit/s.


Here is the nuttcp test output.
[root@srv network-scripts]# nuttcp -t -v -T 60 192.168.148.85
  nuttcp-t: v6.1.2: socket
  nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 192.168.148.85
  nuttcp-t: time limit = 60.00 seconds
  nuttcp-t: connect to 192.168.148.85 with mss=8948, RTT=0.277 ms
  nuttcp-t: send window size = 28440, receive window size = 87380
  nuttcp-t: available send window = 21330, available receive window = 65535
  nuttcp-t: 70086.0000 MB in 60.00 real seconds = 1196131.87 KB/sec = 9798.7123 Mbps
  nuttcp-t: retrans = 0
  nuttcp-t: 1121376 I/O calls, msec/call = 0.05, calls/sec = 18689.56
  nuttcp-t: 0.2user 29.7sys 1:00real 49% 0i+0d 468maxrss 0+2pf 25844+75csw
 
  nuttcp-r: v6.1.2: socket
  nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
  nuttcp-r: accept from 192.168.148.86
  nuttcp-r: send window size = 28440, receive window size = 87380
  nuttcp-r: available send window = 21330, available receive window = 65535
  nuttcp-r: 70086.0000 MB in 60.00 real seconds = 1196065.83 KB/sec = 9798.1712 Mbps
  nuttcp-r: 2319818 I/O calls, msec/call = 0.03, calls/sec = 38661.42
  nuttcp-r: 0.4user 33.9sys 1:00real 57% 0i+0d 330maxrss 0+17pf 1103994+76csw
  [root@srv network-scripts]#

This is quite a good result!
During the nuttcp test, the servers saw about 50,000 interrupts per second and a very low average load.
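
Two numbers in the nuttcp output can be cross-checked with simple arithmetic. The sketch below assumes IPv4 and that TCP timestamps are enabled (12 bytes of options per segment), and uses the fact that nuttcp's MB and KB/sec figures are consistent with 1 MB = 2^20 bytes.

```shell
# TCP payload per segment: MTU minus 20 bytes of IPv4 header,
# 20 bytes of TCP header and 12 bytes of TCP timestamp option (assumed enabled).
mtu=9000
mss=$(( mtu - 20 - 20 - 12 ))
echo "mss=$mss"

# Convert 70086 MB (2^20 bytes each) transferred in 60 s to Mbit/s,
# using integer arithmetic, so the result is truncated.
mbps=$(( 70086 * 1048576 * 8 / 60 / 1000000 ))
echo "throughput=$mbps Mbit/s"
```

This prints mss=8948 and throughput=9798 Mbit/s, in line with the "mss=8948" and "9798.7123 Mbps" lines reported by nuttcp.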

Here is the vmstat 1 output of the transmitting server during the test:

  procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
   r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
   0  0      0 15100956  44948 525388    0    0     0     0  280  152  0  0 100  0  0
   0  0      0 15100956  44948 525388    0    0     0     0  159   73  0  0 100  0  0
   1  0      0 15096676  44948 525388    0    0     0     0 44407  536  0  2 98  0  0
   1  0      0 15097296  44956 525384    0    0     0    44 51087  455  0  6 94  0  0
   1  0      0 15097792  44956 525388    0    0     0     0 51025  556  0  2 98  0  0
   1  0      0 15098056  44956 525388    0    0     0     0 50067  656  0  7 93  0  0
   1  0      0 15098304  44956 525388    0    0     0     0 49978  774  0  2 98  0  0
   1  0      0 15098676  44956 525388    0    0     0     0 50264  769  0  6 94  0  0
   1  0      0 15098676  44956 525388    0    0     0  1076 50428  843  0  2 98  0  0
   1  0      0 15098676  44956 525388    0    0     0     0 50832 1202  0  5 95  0  0
   0  0      0 15099296  44956 525388    0    0     0     0 51193 1350  0  2 98  0  0
   1  0      0 15099544  44956 525388    0    0     0     0 51228 1198  0  7 93  0  0
   0  0      0 15102948  44956 525388    0    0     0     0 6756  681  0  0 100  0  0
   0  0      0 15103072  44964 525380    0    0     0   360  684  462  0  0 100  0  0
   0  0      0 15103400  44964 525388    0    0     0     0  234  147  0  0 100  0  0




Marco    ( @mgua )


11 comments:

  1. Thanks.
    May I ask a question?
    What would the configuration files look like if, for example:
    1. VLAN10: 95.10.10.145/24,
    192.168.30.1/24
    2. VLAN20: 192.168.0.10/24

    Bonding should be set up as well.


    PS. VLAN10 contains two networks.
    Thank you.

  2. Try with
    ip addr add (second ip/mask) dev bond0.10


    M

  3. Hi,

    Pardon my naivety.

    I have an Intel (ixgbe) dual port 10GbE card, and my expectation was that if I bond the two ports I would, at least theoretically, get a throughput greater than 10 Gbit/s. Is this not so?

    Please advise.

  4. @stephen,

    Depending on the switch configuration, an overall throughput above 10 Gbit/s is possible, but only when talking to several different hosts.

    Normally the switch selects the path with a layer-2 hash function over the MAC addresses, which limits the bandwidth between any given pair of hosts to a single link.

    More sophisticated layer-3 switches can distribute traffic across links in a more complex way, taking IP addresses and ports into account; in that case a single session is still bounded by the speed of one link.

    Some documents say that, at least for NFS, you can use multiple (overlapping) mounts of a NAS storage over TCP/IP; the effective bandwidth can then exceed the speed of one link, provided the switch hash function selects different physical links based on the TCP port number.

    In general, however, these circumstances cannot be expected in every setup and are a matter of fine tuning.
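
    To illustrate why a pair of hosts always lands on the same link, here is a simplified sketch of a layer-2 hash: XOR of the two MAC addresses, modulo the number of links. It follows the "layer2" transmit policy as described in the Linux bonding documentation; the hash actually used by a given switch is vendor-specific, and the function name and MAC addresses below are made up.

    ```shell
    # Simplified layer-2 hash: XOR the last byte of both MAC addresses,
    # then take the result modulo the number of links.
    # l2_link is a hypothetical helper for illustration only.
    l2_link() {
        src_last=${1##*:}
        dst_last=${2##*:}
        echo $(( (0x$src_last ^ 0x$dst_last) % $3 ))
    }

    # Hypothetical MAC addresses: one source host, two different destinations.
    l2_link 00:00:00:00:00:0a 00:00:00:00:00:0b 2   # prints 1
    l2_link 00:00:00:00:00:0a 00:00:00:00:00:0c 2   # prints 0
    ```

    The same pair of addresses always gives the same link index, whatever the number of sessions, so traffic spreads over both links only when several different peers are involved.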


    marco (@mgua)


  5. I do not quite understand what switch-side configuration is needed here. Do I need to configure both switch ports as trunk ports? I mean the ports connected to the server's eth0 and eth1 interfaces. Or do I need to do something else? Waiting for your reply :)

    Btw, such a nice post!



    Regds,
    Ashok

  6. @ashok,

    As I wrote (FEX to server configuration), the two interfaces on the switches have to be set up in trunk mode, in order to transport the different VLANs.

    The server is connected with two physical interfaces (eth0 and eth2) to two FEX switches. The switch-side port configurations are the same.


    Marco

  7. Hi Marco,

    I set up bonding. We have three VLANs with IDs 15, 99 and 1. The default is 1, which is untagged. The IP for bond0 is the same one assigned to the blade server in HP-OA. The IPs in VLANs 15 and 99 are unique.

    Can you please confirm whether the bonding is done correctly?

    Actually, I have never done bonding with VLANs, which is why I am confused.

    I’ll really appreciate your help.

    Thank you,


    #####Output of /proc/net/bonding/bond0
    #########

    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: eth0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0

    Slave Interface: eth0
    MII Status: up
    Link Failure Count: 0
    Permanent HW addr: e8:39:35:bb:de:70

    Slave Interface: eth1
    MII Status: up
    Link Failure Count: 0
    Permanent HW addr: e8:39:35:bb:de:74

    #####Output of ifconfig

    bond0 Link encap:Ethernet HWaddr E8:39:35:BB:DE:70
    inet addr:106.40.129.90 Bcast:106.40.129.255 Mask:255.255.255.0
    inet6 addr: fe80::ea39:35ff:febb:de70/64 Scope:Link
    UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
    RX packets:3014 errors:0 dropped:0 overruns:0 frame:0
    TX packets:597 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:0
    RX bytes:290054 (283.2 KiB) TX bytes:89502 (87.4 KiB)

    bond0.15 Link encap:Ethernet HWaddr E8:39:35:BB:DE:70
    inet addr:192.168.10.50 Bcast:192.168.10.255 Mask:255.255.255.0
    inet6 addr: fe80::ea39:35ff:febb:de70/64 Scope:Link
    UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
    RX packets:0 errors:0 dropped:0 overruns:0 frame:0
    TX packets:33 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:0
    RX bytes:0 (0.0 b) TX bytes:6002 (5.8 KiB)

    bond0.99 Link encap:Ethernet HWaddr E8:39:35:BB:DE:70
    inet addr:132.197.247.248 Bcast:132.197.247.255 Mask:255.255.255.224
    inet6 addr: fe80::ea39:35ff:febb:de70/64 Scope:Link
    UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
    RX packets:0 errors:0 dropped:0 overruns:0 frame:0
    TX packets:37 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:0
    RX bytes:0 (0.0 b) TX bytes:6294 (6.1 KiB)

    eth0 Link encap:Ethernet HWaddr E8:39:35:BB:DE:70
    UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
    RX packets:257 errors:0 dropped:0 overruns:0 frame:0
    TX packets:433 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:1000
    RX bytes:32724 (31.9 KiB) TX bytes:65080 (63.5 KiB)

    eth1 Link encap:Ethernet HWaddr E8:39:35:BB:DE:70
    UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
    RX packets:2757 errors:0 dropped:0 overruns:0 frame:0
    TX packets:164 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:1000
    RX bytes:257330 (251.2 KiB) TX bytes:24422 (23.8 KiB)

    lo Link encap:Local Loopback
    inet addr:127.0.0.1 Mask:255.0.0.0
    inet6 addr: ::1/128 Scope:Host
    UP LOOPBACK RUNNING MTU:16436 Metric:1
    RX packets:1519 errors:0 dropped:0 overruns:0 frame:0
    TX packets:1519 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:0
    RX bytes:4707652 (4.4 MiB) TX bytes:4707652 (4.4 MiB)

    ####Device ifcfg-bond0

    DEVICE=bond0
    IPADDR=106.40.129.90
    NETMASK=255.255.255.0
    BOOTPROTO=none
    ONBOOT=YES
    BONDING_OPTS='mode=1 miimon=100'

  8. #####Device ifcfg-bond0.15

    DEVICE=bond0.15
    ONBOOT=yes
    IPADDR=192.168.10.50
    NETMASK=255.255.255.0
    USERCTL=no
    VLAN=yes

    ####Device ifcfg-bond0.99

    DEVICE=bond0.99
    ONBOOT=yes
    IPADDR=132.197.247.248
    NETMASK=255.255.255.224
    USERCTL=no
    VLAN=yes

    ###Output of /etc/modprobe.conf

    alias eth0 be2net
    alias eth1 be2net
    alias eth2 be2net
    alias eth3 be2net
    alias eth4 be2net
    alias eth5 be2net
    alias eth6 be2net
    alias eth7 be2net
    alias scsi_hostadapter cciss
    alias scsi_hostadapter1 usb-storage
    alias bond0 bonding
    alias bond1 bonding
    alias bond2 bonding
    alias bond3 bonding
    options bonding max_bonds=3

  9. Marco,

    I have a very similar setup to achieve with an added twist that I cannot figure out from the server side. I set up my config very similar to yours.
    2 NICs bonded together:
    eth3 + eth5 = bond0.
    two vlans to attach the bond to:
    vlan 240 & vlan 836.

    So I have
    ifcfg-eth3
    ifcfg-eth5
    ifcfg-bond0
    ifcfg-bond0.240 (IP 123.145.154.1)
    ifcfg-bond0.836 (IP 100.133.144.5)

    Now I need to add a second IP address to each VLAN, and this is where I am stuck. How do I set up the next pair of files? If I follow the logic of adding a second IP to a normal interface, it would be ifcfg-bond0.240:1 and ifcfg-bond0.836:1. I have tried this and it does not work. I have tried a lot of other ideas, but I have only found ways NOT to do it...

    Do you know how this is accomplished?

  10. D. A. Holton:

    ifcfg-bond0.240:0 for the first address
    ifcfg-bond0.240:1 for the second address

    these are "interface aliases"

    David
