| |
| Linux Ethernet Bonding Driver mini-howto |
| |
| Initial release : Thomas Davis <tadavis at lbl.gov> |
| Corrections, HA extensions : 2000/10/03-15 : |
| - Willy Tarreau <willy at meta-x.org> |
| - Constantine Gavrilov <const-g at xpert.com> |
| - Chad N. Tindel <ctindel at ieee dot org> |
| - Janice Girouard <girouard at us dot ibm dot com> |
| - Jay Vosburgh <fubar at us dot ibm dot com> |
| |
| Note : |
| ------ |
| The bonding driver originally came from Donald Becker's beowulf patches for |
| kernel 2.0. It has changed quite a bit since, and the original tools from |
| extreme-linux and beowulf sites will not work with this version of the driver. |
| |
| For new versions of the driver, patches for older kernels and the updated |
| userspace tools, please follow the links at the end of this file. |
| |
| |
| Table of Contents |
| ================= |
| |
| Installation |
| Bond Configuration |
| Module Parameters |
| Configuring Multiple Bonds |
| Switch Configuration |
| Verifying Bond Configuration |
| Frequently Asked Questions |
| High Availability |
| Promiscuous Sniffing notes |
| 8021q VLAN support |
| Limitations |
| Resources and Links |
| |
| |
| Installation |
| ============ |
| |
| 1) Build kernel with the bonding driver |
| --------------------------------------- |
| For the latest version of the bonding driver, use kernel 2.4.12 or above |
| (otherwise you will need to apply a patch). |
| |
| Configure kernel with `make menuconfig/xconfig/config', and select "Bonding |
| driver support" in the "Network device support" section. It is recommended |
| to configure the driver as a module since it is currently the only way to |
| pass parameters to the driver and configure more than one bonding device. |
| |
| Build and install the new kernel and modules. |
| |
| 2) Get and install the userspace tools |
| -------------------------------------- |
| This version of the bonding driver requires an updated ifenslave program. The |
| original one from extreme-linux and beowulf will not work. Kernels 2.4.12 |
| and above include the updated version of ifenslave.c in the |
| Documentation/networking directory. For older kernels, please follow the |
| links at the end of this file. |
| |
| IMPORTANT: If you are running Red Hat 7.1 or later, be careful, because |
| /usr/include/linux is no longer a symbolic link to |
| /usr/src/linux/include/linux. If you build ifenslave against the headers in |
| /usr/include/linux, the build will appear to succeed but your bond won't work. |
| The purpose of the -I option on the ifenslave compile line is to make sure it |
| uses /usr/src/linux/include/linux/if_bonding.h instead of the version from |
| /usr/include/linux. |
| |
| To install ifenslave.c, do: |
| # gcc -Wall -Wstrict-prototypes -O -I/usr/src/linux/include ifenslave.c -o ifenslave |
| # cp ifenslave /sbin/ifenslave |
| |
| |
| Bond Configuration |
| ================== |
| |
| You will need to add at least the following line to /etc/modprobe.conf |
| so the bonding driver will automatically load when the bond0 interface is |
| configured. Refer to the modprobe.conf manual page for specific modprobe.conf |
| syntax details. The Module Parameters section of this document describes each |
| bonding driver parameter. |
| |
| alias bond0 bonding |
| |
| Use standard distribution techniques to define the bond0 network interface. For |
| example, on modern Red Hat distributions, create an ifcfg-bond0 file in |
| the /etc/sysconfig/network-scripts directory that resembles the following: |
| |
| DEVICE=bond0 |
| IPADDR=192.168.1.1 |
| NETMASK=255.255.255.0 |
| NETWORK=192.168.1.0 |
| BROADCAST=192.168.1.255 |
| ONBOOT=yes |
| BOOTPROTO=none |
| USERCTL=no |
| |
| (use appropriate values for your network above) |
| |
| All interfaces that are part of a bond should have SLAVE and MASTER |
| definitions. For example, in the case of Red Hat, if you wish to make eth0 and |
| eth1 a part of the bonding interface bond0, their config files (ifcfg-eth0 and |
| ifcfg-eth1) should resemble the following: |
| |
| DEVICE=eth0 |
| USERCTL=no |
| ONBOOT=yes |
| MASTER=bond0 |
| SLAVE=yes |
| BOOTPROTO=none |
| |
| Use DEVICE=eth1 in the ifcfg-eth1 config file. If you configure a second |
| bonding interface (bond1), use MASTER=bond1 in the config file to make the |
| network interface be a slave of bond1. |
| |
| Restart the networking subsystem or just bring up the bonding device if your |
| administration tools allow it. Otherwise, reboot. On Red Hat distros you can |
| issue `ifup bond0' or `/etc/rc.d/init.d/network restart'. |
| |
| If the administration tools of your distribution do not support |
| master/slave notation in configuring network interfaces, you will need to |
| manually configure the bonding device with the following commands: |
| |
| # /sbin/ifconfig bond0 192.168.1.1 netmask 255.255.255.0 \ |
| broadcast 192.168.1.255 up |
| |
| # /sbin/ifenslave bond0 eth0 |
| # /sbin/ifenslave bond0 eth1 |
| |
| (use appropriate values for your network above) |
| |
| You can then create a script containing these commands and place it in the |
| appropriate rc directory. |
| |
| If you specifically need all network drivers loaded before the bonding driver, |
| adding the following line to modprobe.conf will cause the network driver for |
| eth0 and eth1 to be loaded before the bonding driver. |
| |
| install bond0 /sbin/modprobe -a eth0 eth1 && /sbin/modprobe bonding |
| |
| Be careful not to reference bond0 itself at the end of the line, or modprobe |
| will die in an endless recursive loop. |
| |
| If running SNMP agents, the bonding driver should be loaded before any network |
| drivers participating in a bond. This requirement is due to the interface |
| index (ipAdEntIfIndex) being associated with the first interface found with a |
| given IP address. That is, there is only one ipAdEntIfIndex for each IP |
| address. For example, if eth0 and eth1 are slaves of bond0 and the driver for |
| eth0 is loaded before the bonding driver, the interface for the IP address |
| will be associated with the eth0 interface. This configuration is shown below: |
| the IP address 192.168.1.1 has an interface index of 2, which indexes to eth0 |
| in the ifDescr table (ifDescr.2). |
| |
| interfaces.ifTable.ifEntry.ifDescr.1 = lo |
| interfaces.ifTable.ifEntry.ifDescr.2 = eth0 |
| interfaces.ifTable.ifEntry.ifDescr.3 = eth1 |
| interfaces.ifTable.ifEntry.ifDescr.4 = eth2 |
| interfaces.ifTable.ifEntry.ifDescr.5 = eth3 |
| interfaces.ifTable.ifEntry.ifDescr.6 = bond0 |
| ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5 |
| ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 |
| ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4 |
| ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 |
| |
| This problem is avoided by loading the bonding driver before any network |
| drivers participating in a bond. Below is an example of loading the bonding |
| driver first: the IP address 192.168.1.1 is correctly associated with |
| ifDescr.2, which is now bond0. |
| |
| interfaces.ifTable.ifEntry.ifDescr.1 = lo |
| interfaces.ifTable.ifEntry.ifDescr.2 = bond0 |
| interfaces.ifTable.ifEntry.ifDescr.3 = eth0 |
| interfaces.ifTable.ifEntry.ifDescr.4 = eth1 |
| interfaces.ifTable.ifEntry.ifDescr.5 = eth2 |
| interfaces.ifTable.ifEntry.ifDescr.6 = eth3 |
| ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6 |
| ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 |
| ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5 |
| ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 |
| |
| While some distributions may not report the interface name in ifDescr, |
| the association between the IP address and IfIndex remains and SNMP |
| functions such as Interface_Scan_Next will report that association. |
| |
| |
| Module Parameters |
| ================= |
| |
| Optional parameters for the bonding driver can be supplied as command line |
| arguments to the insmod command. Typically, these parameters are specified in |
| the file /etc/modprobe.conf (see the manual page for modprobe.conf). The |
| available bonding driver parameters are listed below. If a parameter is not |
| specified, the default value is used. When initially configuring a bond, it |
| is recommended that "tail -f /var/log/messages" be run in a separate window to |
| watch for bonding driver error messages. |
| |
| It is critical that either the miimon or arp_interval and arp_ip_target |
| parameters be specified, otherwise serious network degradation will occur |
| during link failures. |
| |
| arp_interval |
| |
| Specifies the ARP monitoring frequency in milli-seconds. |
| If ARP monitoring is used in a load-balancing mode (mode 0 or 2), the |
| switch should be configured in a mode that evenly distributes packets |
| across all links - such as round-robin. If the switch is configured to |
| distribute the packets in an XOR fashion, all replies from the ARP |
| targets will be received on the same link which could cause the other |
| team members to fail. ARP monitoring should not be used in conjunction |
| with miimon. A value of 0 disables ARP monitoring. The default value |
| is 0. |
| |
| arp_ip_target |
| |
| Specifies the IP addresses to use when arp_interval is > 0. These |
| are the targets of the ARP requests sent to determine the health of |
| the link to the targets. Specify these values in ddd.ddd.ddd.ddd |
| format. Multiple IP addresses must be separated by a comma. At least |
| one IP address needs to be given for ARP monitoring to work. The |
| maximum number of targets that can be specified is 16. |
| |
| downdelay |
| |
| Specifies the delay time in milli-seconds to disable a link after a |
| link failure has been detected. This should be a multiple of miimon |
| value, otherwise the value will be rounded. The default value is 0. |
| |
| lacp_rate |
| |
| Option specifying the rate at which we'll ask our link partner to |
| transmit LACPDU packets in 802.3ad mode. Possible values are: |
| |
| slow or 0 |
| Request partner to transmit LACPDUs every 30 seconds (default) |
| |
| fast or 1 |
| Request partner to transmit LACPDUs every 1 second |
| |
| max_bonds |
| |
| Specifies the number of bonding devices to create for this |
| instance of the bonding driver. E.g., if max_bonds is 3, and |
| the bonding driver is not already loaded, then bond0, bond1 |
| and bond2 will be created. The default value is 1. |
| |
| miimon |
| |
| Specifies the frequency in milli-seconds that MII link monitoring |
| will occur. A value of zero disables MII link monitoring. A value |
| of 100 is a good starting point. See High Availability section for |
| additional information. The default value is 0. |
| |
| mode |
| |
| Specifies one of the bonding policies. The default is |
| round-robin (balance-rr). Possible values are (you can use |
| either the text or numeric option): |
| |
| balance-rr or 0 |
| |
| Round-robin policy: Transmit in a sequential order |
| from the first available slave through the last. This |
| mode provides load balancing and fault tolerance. |
| |
| active-backup or 1 |
| |
| Active-backup policy: Only one slave in the bond is |
| active. A different slave becomes active if, and only |
| if, the active slave fails. The bond's MAC address is |
| externally visible on only one port (network adapter) |
| to avoid confusing the switch. This mode provides |
| fault tolerance. |
| |
| balance-xor or 2 |
| |
| XOR policy: Transmit based on [(source MAC address |
| XOR'd with destination MAC address) modulo slave |
| count]. This selects the same slave for each |
| destination MAC address. This mode provides load |
| balancing and fault tolerance. |
| |
| broadcast or 3 |
| |
| Broadcast policy: transmits everything on all slave |
| interfaces. This mode provides fault tolerance. |
| |
| 802.3ad or 4 |
| |
| IEEE 802.3ad Dynamic link aggregation. Creates aggregation |
| groups that share the same speed and duplex settings. |
| Transmits and receives on all slaves in the active |
| aggregator. |
| |
| Pre-requisites: |
| |
| 1. Ethtool support in the base drivers for retrieving the |
| speed and duplex of each slave. |
| |
| 2. A switch that supports IEEE 802.3ad Dynamic link |
| aggregation. |
| |
| balance-tlb or 5 |
| |
| Adaptive transmit load balancing: channel bonding that does |
| not require any special switch support. The outgoing |
| traffic is distributed according to the current load |
| (computed relative to the speed) on each slave. Incoming |
| traffic is received by the current slave. If the receiving |
| slave fails, another slave takes over the MAC address of |
| the failed receiving slave. |
| |
| Prerequisite: |
| |
| Ethtool support in the base drivers for retrieving the |
| speed of each slave. |
| |
| balance-alb or 6 |
| |
| Adaptive load balancing: includes balance-tlb + receive |
| load balancing (rlb) for IPV4 traffic and does not require |
| any special switch support. The receive load balancing is |
| achieved by ARP negotiation. The bonding driver intercepts |
| the ARP Replies sent by the server on their way out and |
| overwrites the src hw address with the unique hw address of |
| one of the slaves in the bond such that different clients |
| use different hw addresses for the server. |
| |
| Receive traffic from connections created by the server is |
| also balanced. When the server sends an ARP Request the |
| bonding driver copies and saves the client's IP information |
| from the ARP. When the ARP Reply arrives from the client, |
| its hw address is retrieved and the bonding driver |
| initiates an ARP reply to this client assigning it to one |
| of the slaves in the bond. A problematic outcome of using |
| ARP negotiation for balancing is that each time an ARP |
| request is broadcast it uses the hw address of the |
| bond. Hence, clients learn the hw address of the bond and |
| the balancing of receive traffic collapses to the current |
| slave. This is handled by sending updates (ARP Replies) to |
| all the clients with their assigned hw address such that |
| the traffic is redistributed. Receive traffic is also |
| redistributed when a new slave is added to the bond and |
| when an inactive slave is re-activated. The receive load is |
| distributed sequentially (round robin) among the group of |
| highest speed slaves in the bond. |
| |
| When a link is reconnected or a new slave joins the bond |
| the receive traffic is redistributed among all active |
| slaves in the bond by initiating ARP Replies with the |
| selected mac address to each of the clients. The updelay |
| modprobe parameter must be set to a value equal to or greater |
| than the switch's forwarding delay so that the ARP Replies |
| sent to the clients will not be blocked by the switch. |
| |
| Prerequisites: |
| |
| 1. Ethtool support in the base drivers for retrieving the |
| speed of each slave. |
| |
| 2. Base driver support for setting the hw address of a |
| device also when it is open. This is required so that there |
| will always be one slave in the team using the bond hw |
| address (the curr_active_slave) while having a unique hw |
| address for each slave in the bond. If the curr_active_slave |
| fails, its hw address is swapped with that of the new |
| curr_active_slave that was chosen. |
| |
| primary |
| |
| A string (eth0, eth2, etc) naming the primary device. If this |
| value is entered, and the device is on-line, it will be used first |
| as the output media. Only when this device is off-line will |
| alternate devices be used. Otherwise, once a failover is detected |
| and a new default output is chosen, it will remain the output media |
| until it too fails. This is useful when one slave is preferred |
| over another, e.g. when one slave is 1000Mbps and another is |
| 100Mbps. If the 1000Mbps slave fails and is later restored, it may |
| be preferred that the faster slave gracefully become the active |
| slave, without deliberately failing the 100Mbps slave. Specifying a |
| primary is only valid in active-backup mode. |
| |
| updelay |
| |
| Specifies the delay time in milli-seconds to enable a link after a |
| link up status has been detected. This should be a multiple of miimon |
| value, otherwise the value will be rounded. The default value is 0. |
| |
| use_carrier |
| |
| Specifies whether or not miimon should use MII or ETHTOOL |
| ioctls vs. netif_carrier_ok() to determine the link status. |
| The MII or ETHTOOL ioctls are less efficient and utilize a |
| deprecated calling sequence within the kernel. The |
| netif_carrier_ok() relies on the device driver to maintain its |
| state with netif_carrier_on/off; at this writing, most, but |
| not all, device drivers support this facility. |
| |
| If bonding insists that the link is up when it should not be, |
| it may be that your network device driver does not support |
| netif_carrier_on/off. This is because the default state for |
| netif_carrier is "carrier on." In this case, disabling |
| use_carrier will cause bonding to revert to the MII / ETHTOOL |
| ioctl method to determine the link state. |
| |
| A value of 1 enables the use of netif_carrier_ok(), a value of |
| 0 will use the deprecated MII / ETHTOOL ioctls. The default |
| value is 1. |
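| |
| As a rough illustration of the updelay and downdelay guidance above, the |
| following shell sketch checks whether a delay is a whole multiple of miimon. |
| The function name check_delay is hypothetical; it is not part of the bonding |
| driver or of ifenslave. The sketch makes no assumption about which way the |
| driver rounds and only reports the two neighbouring multiples. |
| |
```shell
# check_delay NAME DELAY_MS MIIMON_MS
# Hypothetical helper, not part of the bonding driver or ifenslave.
# Reports whether DELAY_MS is a whole multiple of MIIMON_MS.
check_delay() {
    if [ "$3" -le 0 ]; then
        echo "$1: miimon must be greater than 0 for this check"
        return 1
    fi
    if [ $(( $2 % $3 )) -eq 0 ]; then
        echo "$1=$2 is a multiple of miimon=$3"
        return 0
    fi
    # Integer division truncates, giving the multiple just below DELAY_MS.
    lower=$(( $2 / $3 * $3 ))
    echo "$1=$2 is not a multiple of miimon=$3 (neighbouring multiples: $lower, $(( lower + $3 )))"
    return 1
}

# Example calls; the second one is expected to report a mismatch.
check_delay updelay 200 100
check_delay downdelay 250 100 || true
```
| |
| Running such a check before editing the options line avoids relying on the |
| driver's rounding of a value that was not intended. |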
| |
| |
| Configuring Multiple Bonds |
| ========================== |
| |
| If several bonding interfaces are required, either specify the max_bonds |
| parameter (described above), or load the driver multiple times. Using |
| the max_bonds parameter is less complicated, but has the limitation that |
| all bonding instances created will have the same options. Loading the |
| driver multiple times allows each instance of the driver to have differing |
| options. |
| |
| For example, to configure two bonding interfaces, one with mii link |
| monitoring performed every 100 milliseconds, and one with ARP link |
| monitoring performed every 200 milliseconds, the /etc/conf.modules should |
| resemble the following: |
| |
| alias bond0 bonding |
| alias bond1 bonding |
| |
| options bond0 miimon=100 |
| options bond1 -o bonding1 arp_interval=200 arp_ip_target=10.0.0.1 |
| |
| Configuring Multiple ARP Targets |
| ================================ |
| |
| While ARP monitoring can be done with just one target, it can be useful |
| in a High Availability setup to have several targets to monitor. In the |
| case of just one target, the target itself may go down or have a problem |
| making it unresponsive to ARP requests. Having an additional target (or |
| several) increases the reliability of the ARP monitoring. |
| |
| Multiple ARP targets must be separated by commas as follows: |
| |
| # example options for ARP monitoring with three targets |
| alias bond0 bonding |
| options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9 |
| |
| For just a single target the options would resemble: |
| |
| # example options for ARP monitoring with one target |
| alias bond0 bonding |
| options bond0 arp_interval=60 arp_ip_target=192.168.0.100 |
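| |
| The options lines above can also be generated rather than typed by hand. The |
| sketch below joins a list of targets with commas and refuses more than the 16 |
| targets allowed by arp_ip_target. The function name arp_options is |
| hypothetical and exists only for this illustration. |
| |
```shell
# arp_options BOND INTERVAL_MS TARGET [TARGET...]
# Hypothetical helper: prints an "options" line for ARP monitoring.
arp_options() {
    bond=$1
    interval=$2
    shift 2
    # The arp_ip_target description allows at least 1 and at most 16 targets.
    if [ "$#" -lt 1 ] || [ "$#" -gt 16 ]; then
        echo "need between 1 and 16 ARP targets, got $#" >&2
        return 1
    fi
    targets=$1
    shift
    # Append the remaining targets, separated by commas.
    for ip in "$@"; do
        targets="$targets,$ip"
    done
    echo "options $bond arp_interval=$interval arp_ip_target=$targets"
}

# Reproduces the three-target example shown earlier in this section.
arp_options bond0 60 192.168.0.1 192.168.0.3 192.168.0.9
```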
| |
| Potential Problems When Using ARP Monitor |
| ========================================= |
| |
| 1. Driver support |
| |
| The ARP monitor relies on the network device driver to maintain two |
| statistics: the last receive time (dev->last_rx), and the last |
| transmit time (dev->trans_start). If the network device driver does |
| not update one or both of these, then the typical result will be that, |
| upon startup, all links in the bond will immediately be declared down, |
| and remain that way. A network monitoring tool (tcpdump, e.g.) will |
| show ARP requests and replies being sent and received on the bonding |
| device. |
| |
| The possible resolutions for this are to (a) fix the device driver, or |
| (b) discontinue the ARP monitor (using miimon as an alternative, for |
| example). |
| |
| 2. Adventures in Routing |
| |
| When bonding is set up with the ARP monitor, it is important that the |
| slave devices not have routes that supersede routes of the master (or, |
| generally, not have routes at all). For example, suppose the bonding |
| device bond0 has two slaves, eth0 and eth1, and the routing table is |
| as follows: |
| |
| Kernel IP routing table |
| Destination Gateway Genmask Flags MSS Window irtt Iface |
| 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth0 |
| 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth1 |
| 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 bond0 |
| 127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo |
| |
| In this case, the ARP monitor (and ARP itself) may become confused, |
| because ARP requests will be sent on one interface (bond0), but the |
| corresponding reply will arrive on a different interface (eth0). This |
| reply looks to ARP as an unsolicited ARP reply (because ARP matches |
| replies on an interface basis), and is discarded. This will likely |
| still update the receive/transmit times in the driver, but will lose |
| packets. |
| |
| The resolution here is simply to ensure that slaves do not have routes |
| of their own, and if for some reason they must, those routes do not |
| supersede routes of their master. This should generally be the case, |
| but unusual configurations or errant manual or automatic static route |
| additions may cause trouble. |
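| |
| One way to spot the situation described above is to scan the routing table |
| for entries whose interface is a slave. The sketch below reads text in the |
| format of the routing table shown earlier (for example the output of |
| "route -n") on standard input. The function name slave_routes is hypothetical, |
| and the sketch assumes the interface name is the last column and that the |
| first two lines are headers. |
| |
```shell
# slave_routes SLAVE [SLAVE...]
# Hypothetical helper: reads a routing table on stdin and prints the
# destination and interface of every route that uses one of the slaves.
slave_routes() {
    # Pad with spaces so that "eth1" does not match "eth10".
    slaves=" $* "
    awk -v slaves="$slaves" '
        NR > 2 && index(slaves, " " $NF " ") { print $1, $NF }
    '
}

# Typical use on a live system (not run here):
#   route -n | slave_routes eth0 eth1
```
| |
| Any line printed by this check names a route that should be removed or moved |
| to the master device. |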
| |
| Switch Configuration |
| ==================== |
| |
| While the switch does not need to be configured when the active-backup, |
| balance-tlb or balance-alb policies (mode=1,5,6) are used, it does need to |
| be configured for the round-robin, XOR, broadcast, or 802.3ad policies |
| (mode=0,2,3,4). |
| |
| |
| Verifying Bond Configuration |
| ============================ |
| |
| 1) Bonding information files |
| ---------------------------- |
| The bonding driver information files reside in the /proc/net/bonding directory. |
| |
| Sample contents of /proc/net/bonding/bond0 after the driver is loaded with |
| parameters of mode=0 and miimon=1000 are shown below. |
| |
| Bonding Mode: load balancing (round-robin) |
| Currently Active Slave: eth0 |
| MII Status: up |
| MII Polling Interval (ms): 1000 |
| Up Delay (ms): 0 |
| Down Delay (ms): 0 |
| |
| Slave Interface: eth1 |
| MII Status: up |
| Link Failure Count: 1 |
| |
| Slave Interface: eth0 |
| MII Status: up |
| Link Failure Count: 1 |
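| |
| For scripted checks, the per-slave status can be pulled out of this file with |
| awk. The sketch below assumes the layout of the sample above (a "Slave |
| Interface:" line followed by an "MII Status:" line); the function name |
| bond_slave_status is hypothetical, and other driver versions may format the |
| file differently. |
| |
```shell
# bond_slave_status [FILE]
# Hypothetical helper: prints "<slave> <MII status>" for each slave listed
# in a bonding information file (default: /proc/net/bonding/bond0).
bond_slave_status() {
    awk -F': ' '
        /^Slave Interface:/ { slave = $2 }
        # The bond-wide "MII Status" line comes before any slave and is skipped.
        /^MII Status:/ && slave != "" { print slave, $2; slave = "" }
    ' "${1:-/proc/net/bonding/bond0}"
}

# Typical use on a live system (not run here):
#   bond_slave_status /proc/net/bonding/bond0
```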
| |
| 2) Network verification |
| ----------------------- |
| The network configuration can be verified using the ifconfig command. In |
| the example below, the bond0 interface is the master (MASTER) while eth0 and |
| eth1 are slaves (SLAVE). Notice all slaves of bond0 have the same MAC address |
| (HWaddr) as bond0 for all modes except TLB and ALB, which require a unique MAC |
| address for each slave. |
| |
| [root]# /sbin/ifconfig |
| bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 |
| inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0 |
| UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 |
| RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0 |
| TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0 |
| collisions:0 txqueuelen:0 |
| |
| eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 |
| inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0 |
| UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 |
| RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0 |
| TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0 |
| collisions:0 txqueuelen:100 |
| Interrupt:10 Base address:0x1080 |
| |
| eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 |
| inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0 |
| UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 |
| RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0 |
| TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0 |
| collisions:0 txqueuelen:100 |
| Interrupt:9 Base address:0x1400 |
| |
| |
| Frequently Asked Questions |
| ========================== |
| |
| 1. Is it SMP safe? |
| |
| Yes. The old 2.0.xx channel bonding patch was not SMP safe. |
| The new driver was designed to be SMP safe from the start. |
| |
| 2. What type of cards will work with it? |
| |
| Any Ethernet type cards (you can even mix cards - an Intel |
| EtherExpress PRO/100 and a 3Com 3c905b, for example). |
| You can even bond together Gigabit Ethernet cards! |
| |
| 3. How many bonding devices can I have? |
| |
| There is no limit. |
| |
| 4. How many slaves can a bonding device have? |
| |
| Limited by the number of network interfaces Linux supports and/or the |
| number of network cards you can place in your system. |
| |
| 5. What happens when a slave link dies? |
| |
| If your ethernet cards support MII or ETHTOOL link status monitoring |
| and the MII monitoring has been enabled in the driver (see description |
| of module parameters), there will be no adverse consequences. This |
| release of the bonding driver knows how to get the MII information and |
| enables or disables its slaves according to their link status. |
| See section on High Availability for additional information. |
| |
| For ethernet cards not supporting MII status, the arp_interval and |
| arp_ip_target parameters must be specified for bonding to work |
| correctly. If packets have not been sent or received during the |
| specified arp_interval duration, an ARP request is sent to the |
| targets to generate send and receive traffic. If after this |
| interval, either the successful send and/or receive count has not |
| incremented, the next slave in the sequence will become the active |
| slave. |
| |
| If neither miimon nor arp_interval is configured, the bonding |
| driver will not handle this situation very well. The driver will |
| continue to send packets but some packets will be lost. Retransmits |
| will cause serious degradation of performance (in the case when one |
| of two slave links fails, 50% packets will be lost, which is a serious |
| problem for both TCP and UDP). |
| |
| 6. Can bonding be used for High Availability? |
| |
| Yes, if you use MII monitoring and ALL your cards support MII link |
| status reporting. See section on High Availability for more |
| information. |
| |
| 7. Which switches/systems does it work with? |
| |
| In round-robin and XOR mode, it works with systems that support |
| trunking: |
| |
| * Many Cisco switches and routers (look for EtherChannel support). |
| * SunTrunking software. |
| * Alteon AceDirector switches / WebOS (use Trunks). |
| * BayStack Switches (trunks must be explicitly configured). Stackable |
| models (450) can define trunks between ports on different physical |
| units. |
| * Linux bonding, of course ! |
| |
| In 802.3ad mode, it works with systems that support IEEE 802.3ad |
| Dynamic Link Aggregation: |
| |
| * Extreme networks Summit 7i (look for link-aggregation). |
| * Many Cisco switches and routers (look for LACP support; this may |
| require an upgrade to your IOS software; LACP support was added |
| by Cisco in late 2002). |
| * Foundry Big Iron 4000 |
| |
| In active-backup, balance-tlb and balance-alb modes, it should work |
| with any Layer-II switch. |
| |
| |
| 8. Where does a bonding device get its MAC address from? |
| |
| If not explicitly configured with ifconfig, the MAC address of the |
| bonding device is taken from its first slave device. This MAC address |
| is then passed to all following slaves and remains persistent (even if |
| the first slave is removed) until the bonding device is brought |
| down or reconfigured. |
| |
| If you wish to change the MAC address, you can set it with ifconfig: |
| |
| # ifconfig bond0 hw ether 00:11:22:33:44:55 |
| |
| The MAC address can be also changed by bringing down/up the device |
| and then changing its slaves (or their order): |
| |
| # ifconfig bond0 down ; modprobe -r bonding |
| # ifconfig bond0 .... up |
| # ifenslave bond0 eth... |
| |
| This method will automatically take the address from the next slave |
| that will be added. |
| |
| To restore your slaves' MAC addresses, you need to detach them |
| from the bond (`ifenslave -d bond0 eth0'). The bonding driver will then |
| restore the MAC addresses that the slaves had before they were enslaved. |
| |
| 9. Which transmit policies can be used? |
| |
| Round-robin, based on the order of enslaving: the output device |
| is selected based on the next available slave, regardless of |
| the source and/or destination of the packet. |
| |
| Active-backup policy that ensures that one and only one device will |
| transmit at any given moment. Active-backup policy is useful for |
| implementing high availability solutions using two hubs (see |
| section on High Availability). |
| |
| XOR, based on (src hw addr XOR dst hw addr) % slave count. This |
| policy selects the same slave for each destination hw address. |
| |
| Broadcast policy transmits everything on all slave interfaces. |
| |
| 802.3ad, based on XOR but distributes traffic among all interfaces |
| in the active aggregator. |
| |
| Transmit load balancing (balance-tlb) balances the traffic |
| according to the current load on each slave. The balancing is |
| client-based and the least loaded slave is selected for each new |
| client. The load of each slave is calculated relative to its speed |
| and enables load balancing in mixed speed teams. |
| |
| Adaptive load balancing (balance-alb) uses the Transmit load |
| balancing for the transmit load. The receive load is balanced only |
| among the group of highest speed active slaves in the bond. The |
| load is distributed with round-robin i.e. next available slave in |
| the high speed group of active slaves. |
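| |
| The XOR policy formula above can be tried out in the shell. The sketch below |
| applies (src hw addr XOR dst hw addr) % slave count to the full 48-bit |
| addresses, exactly as the formula is written in this document. It is an |
| illustration only, not the driver's code, which may hash differently; the |
| function name xor_slave is hypothetical, and the sketch assumes a shell whose |
| arithmetic is at least 48 bits wide (bash and dash on common platforms). |
| |
```shell
# xor_slave SRC_MAC DST_MAC SLAVE_COUNT
# Hypothetical helper: prints the slave index chosen by the formula
# (src hw addr XOR dst hw addr) % slave count.
xor_slave() {
    # Strip the colons so each address becomes one 48-bit hex number.
    src=$(printf '%s' "$1" | tr -d ':')
    dst=$(printf '%s' "$2" | tr -d ':')
    echo $(( (0x$src ^ 0x$dst) % $3 ))
}

# Two destinations seen from the same source, with two slaves in the bond.
xor_slave 00:C0:F0:1F:37:B4 00:11:22:33:44:55 2
xor_slave 00:C0:F0:1F:37:B4 00:11:22:33:44:56 2
```
| |
| With these inputs the two destinations map to different slave indexes, while |
| repeating a call with the same destination always yields the same index, |
| which is the property the XOR policy description relies on. |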
| |
| High Availability |
| ================= |
| |
| To implement high availability using the bonding driver, the driver needs to be |
| compiled as a module, because currently it is the only way to pass parameters |
| to the driver. This may change in the future. |
| |
| High availability is achieved by using MII or ETHTOOL status reporting. You |
| need to verify that all your interfaces support MII or ETHTOOL link status |
| reporting. On Linux kernel 2.2.17, all the 100 Mbps capable drivers and |
| yellowfin gigabit driver support MII. To determine if ETHTOOL link reporting |
| is available for interface eth0, type "ethtool eth0" and the "Link detected:" |
| line should contain the correct link status. If your system has an interface |
| that does not support MII or ETHTOOL status reporting, a failure of its link |
| will not be detected! A message indicating that MII and ETHTOOL are not |
| supported by a network driver is logged when the bonding driver is loaded |
| with a non-zero miimon value. |
| |
| The bonding driver can regularly check all its slaves' links using the ETHTOOL |
| IOCTL (ETHTOOL_GLINK command) or by checking the MII status registers. The |
| check interval is specified by the module argument "miimon" (MII monitoring). |
| It takes an integer that represents the checking time in milliseconds. It |
| should not come too close to (1000/HZ) (10 milli-seconds on i386) because it |
| may then reduce the system interactivity. A value of 100 seems to be a good |
| starting point. It means that a dead link will be detected at most 100 |
| milli-seconds after it goes down. |
| |
| Example: |
| |
| # modprobe bonding miimon=100 |
| |
| Or, put the following line in /etc/modprobe.conf: |
| |
| options bond0 miimon=100 |
| |
| There are currently two policies for high availability. They are dependent on |
| whether: |
| |
| a) hosts are connected to a single host or switch that supports trunking |
| |
| b) hosts are connected to several different switches or a single switch that |
| does not support trunking |
| |
| |
| 1) High Availability on a single switch or host - load balancing |
| ---------------------------------------------------------------- |
| It is the easiest to set up and to understand. Simply configure the |
| remote equipment (host or switch) to aggregate traffic over several |
| ports (Trunk, EtherChannel, etc.) and configure the bonding interfaces. |
| If the module has been loaded with the proper MII option, it will work |
| automatically. You can then try to remove and restore different links |
| and see in your logs what the driver detects. When testing, you may |
| encounter problems on some buggy switches that disable the trunk for a |
| long time if all ports in a trunk go down. This is a fault of the switch, |
| not of Linux (reboot the switch to confirm). |
| |
| Example 1 : host to host at twice the speed |
| |
| +----------+ +----------+ |
| | |eth0 eth0| | |
| | Host A +--------------------------+ Host B | |
| | +--------------------------+ | |
| | |eth1 eth1| | |
| +----------+ +----------+ |
| |
| On each host : |
| # modprobe bonding miimon=100 |
| # ifconfig bond0 addr |
| # ifenslave bond0 eth0 eth1 |
| |
| Example 2 : host to switch at twice the speed |
| |
| +----------+ +----------+ |
| | |eth0 port1| | |
| | Host A +--------------------------+ switch | |
| | +--------------------------+ | |
| | |eth1 port2| | |
| +----------+ +----------+ |
| |
| On host A : On the switch : |
| # modprobe bonding miimon=100 # set up a trunk on port1 |
| # ifconfig bond0 addr and port2 |
| # ifenslave bond0 eth0 eth1 |
| |
| |
| 2) High Availability on two or more switches (or a single switch without |
| trunking support) |
| --------------------------------------------------------------------------- |
| This mode is more problematic because it relies on the fact that there |
| are multiple ports and the host's MAC address should be visible on one |
| port only to avoid confusing the switches. |
| |
| If you need to know which interface is the active one, and which ones are |
| backup, use ifconfig. All backup interfaces have the NOARP flag set. |
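
As a rough sketch of that check, the helper below scans ifconfig output for
slaves carrying the NOARP flag. It assumes, as stated above, that backup
slaves have NOARP set; backup_slaves is a hypothetical name, and the parsing
is only a best-effort match for common ifconfig output layouts.

```shell
#!/bin/sh
# Hypothetical helper: reads "ifconfig" output on stdin and prints the name
# of every interface whose flags include both NOARP and SLAVE.
backup_slaves() {
    awk '
        # A line starting in column one opens a new interface stanza.
        /^[^ \t]/ { name = $1; sub(/:$/, "", name) }
        /NOARP/ && /SLAVE/ { print name }
    '
}

# Usage:
#   ifconfig | backup_slaves
```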
| |
| To use this mode, pass "mode=1" to the module at load time : |
| |
| # modprobe bonding miimon=100 mode=active-backup |
| |
| or: |
| |
| # modprobe bonding miimon=100 mode=1 |
| |
| Or, put in your /etc/modprobe.conf : |
| |
| options bond0 miimon=100 mode=active-backup |
| |
| Example 1: Using multiple hosts and multiple switches to build a "no single |
| point of failure" solution. |
| |
| |
| | | |
| |port3 port3| |
| +-----+----+ +-----+----+ |
| | |port7 ISL port7| | |
| | switch A +--------------------------+ switch B | |
| | +--------------------------+ | |
| | |port8 port8| | |
| +----++----+ +-----++---+ |
| port2||port1 port1||port2 |
| || +-------+ || |
| |+-------------+ host1 +---------------+| |
| | eth0 +-------+ eth1 | |
| | | |
| | +-------+ | |
| +--------------+ host2 +----------------+ |
| eth0 +-------+ eth1 |
| |
| In this configuration, there is an ISL - Inter Switch Link (could be a trunk), |
| several servers (host1, host2 ...), each attached to both switches, and one or |
| more ports to the outside world (port3...). One and only one slave on each host |
| is active at a time, while all links are still monitored (the system can |
| detect a failure of active and backup links). |
| |
| Each time a host changes its active interface, it sticks to the new one until |
| it goes down. In this example, the hosts are negligibly affected by the |
| expiration time of the switches' forwarding tables. |
| |
| If host1 and host2 have the same functionality and are used in load balancing |
| by another external mechanism, it is good to have host1's active interface |
| connected to one switch and host2's to the other. Such a system will survive |
| a failure of a single host, cable, or switch. The worst thing that may happen |
| in the case of a switch failure is that half of the hosts will be temporarily |
| unreachable until the other switch expires its tables. |
| |
| Example 2: Using multiple ethernet cards connected to a switch to configure |
| NIC failover (switch is not required to support trunking). |
| |
| |
| +----------+ +----------+ |
| | |eth0 port1| | |
| | Host A +--------------------------+ switch | |
| | +--------------------------+ | |
| | |eth1 port2| | |
| +----------+ +----------+ |
| |
| On host A : On the switch : |
| # modprobe bonding miimon=100 mode=1 # (optional) minimize the time |
| # ifconfig bond0 addr # for table expiration |
| # ifenslave bond0 eth0 eth1 |
| |
| Each time the host changes its active interface, it sticks to the new one until |
| it goes down. In this example, the host is strongly affected by the expiration |
| time of the switch forwarding table. |
| |
| |
| 3) Adapting to your switches' timing |
| ------------------------------------ |
| If your switches take a long time to go into backup mode, it may be |
| desirable not to activate a backup interface immediately after a link goes |
| down. It is possible to delay the moment at which a link will be |
| completely disabled by passing the module parameter "downdelay" (in |
| milliseconds, must be a multiple of miimon). |
| |
| When a switch reboots, it is possible that its ports report "link up" status |
| before they become usable. This could fool a bond device by causing it to |
| use some ports that are not ready yet. It is possible to delay the moment at |
| which an active link will be reused by passing the module parameter "updelay" |
| (in milliseconds, must be a multiple of miimon). |
| |
| A similar situation can occur when a host re-negotiates a lost link with the |
| switch (a case of cable replacement). |
| |
| A special case is when a bonding interface has lost all slave links. Then the |
| driver will immediately reuse the first link that goes up, even if the updelay |
| parameter was specified. (If there are slave interfaces in the "updelay" state, |
| the interface that first went into that state will be immediately reused.) This |
| reduces down-time if the value of updelay has been overestimated. |
| |
| Examples : |
| |
| # modprobe bonding miimon=100 mode=1 downdelay=2000 updelay=5000 |
| # modprobe bonding miimon=100 mode=balance-rr downdelay=0 updelay=5000 |
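
Because downdelay and updelay must be multiples of miimon, it can help to
validate the values before loading the module. The following is only a
sketch; bond_options is a hypothetical helper, and mode=1 is hard-coded
purely for illustration.

```shell
#!/bin/sh
# Hypothetical helper: checks that downdelay and updelay are multiples of
# miimon and, if so, prints an option string suitable for modprobe.
# Arguments: miimon downdelay updelay (all in milliseconds).
bond_options() {
    miimon=$1
    downdelay=$2
    updelay=$3
    if [ "$miimon" -le 0 ]; then
        echo "miimon must be a positive number of milliseconds" >&2
        return 1
    fi
    for d in "$downdelay" "$updelay"; do
        if [ $(( d % miimon )) -ne 0 ]; then
            echo "delay $d is not a multiple of miimon=$miimon" >&2
            return 1
        fi
    done
    echo "miimon=$miimon mode=1 downdelay=$downdelay updelay=$updelay"
}

# Usage (as root):
#   opts=$(bond_options 100 2000 5000) && modprobe bonding $opts
```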
| |
| |
| Promiscuous Sniffing notes |
| ========================== |
| |
| If you wish to bond channels together for a network sniffing |
| application --- you wish to run tcpdump, or ethereal, or an IDS like |
| snort, with its input aggregated from multiple interfaces using the |
| bonding driver --- then you need to handle the Promiscuous interface |
| setting by hand. Specifically, when you "ifconfig bond0 up" you |
| must add the promisc flag there; it will be propagated down to the |
| slave interfaces at ifenslave time; a full example might look like: |
| |
| ifconfig bond0 promisc up |
| for if in eth1 eth2 ...;do |
| ifconfig $if up |
| ifenslave bond0 $if |
| done |
| snort ... -i bond0 ... |
| |
| Ifenslave also wants to propagate addresses from interface to |
| interface, appropriately for its design functions in HA and channel |
| capacity aggregating; but it works fine for unnumbered interfaces; |
| just ignore all the warnings it emits. |
| |
| |
| 8021q VLAN support |
| ================== |
| |
| It is possible to configure VLAN devices over a bond interface using the 8021q |
| driver. However, only packets coming from the 8021q driver and passing through |
| bonding will be tagged by default. Self generated packets, like bonding's |
| learning packets or ARP packets generated by either ALB mode or the ARP |
| monitor mechanism, are tagged internally by bonding itself. As a result, |
| bonding has to "learn" what VLAN IDs are configured on top of it, and it uses |
| those IDs to tag self generated packets. |
| |
| For simplicity, and to support the use of adapters that can do VLAN |
| hardware acceleration offloading, the bonding interface declares itself as |
| fully hardware offloading capable. It gets the add_vid/kill_vid notifications |
| to gather the necessary information, and it propagates those actions to the |
| slaves. |
| In case of mixed adapter types, hardware accelerated tagged packets that should |
| go through an adapter that is not offloading capable are "un-accelerated" by the |
| bonding driver so the VLAN tag sits in the regular location. |
| |
| VLAN interfaces *must* be added on top of a bonding interface only after |
| enslaving at least one slave. This is because until the first slave is added the |
| bonding interface has a HW address of 00:00:00:00:00:00, which will be copied by |
| the VLAN interface when it is created. |
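
One way to guard against this ordering mistake is to look at the bond's HW
address before creating a VLAN on top of it. The sketch below assumes the
address can be read from /sys/class/net/bond0/address (sysfs is not
available on older kernels) and that VLANs are created with vconfig;
safe_to_add_vlan is a hypothetical helper and VLAN ID 100 is an arbitrary
example.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the given HW address is set, i.e.
# it is neither empty nor the all-zero address of a bond with no slaves.
safe_to_add_vlan() {
    case "$1" in
        ""|00:00:00:00:00:00) return 1 ;;
        *) return 0 ;;
    esac
}

# Usage (as root, after "ifenslave bond0 eth0"):
#   mac=$(cat /sys/class/net/bond0/address)
#   if safe_to_add_vlan "$mac"; then
#       vconfig add bond0 100
#   else
#       echo "bond0 has no slaves yet; enslave one first" >&2
#   fi
```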
| |
| Note that a problem occurs if all slaves are released from a bond that still |
| has VLAN interfaces on top of it. When new slaves are added later, the |
| bonding interface gets its HW address from the first slave, which might not |
| match that of the VLAN interfaces. It is recommended either to remove all |
| VLANs and then re-add them, or to manually set the bonding interface's HW |
| address so it matches the VLAN's. (Note: changing a VLAN interface's HW address |
| would set the underlying device -- i.e. the bonding interface -- to promiscuous |
| mode, which might not be what you want). |
| |
| |
| Limitations |
| =========== |
| The main limitations are : |
| - only the link status is monitored. If the switch on the other side is |
| partially down (e.g. doesn't forward anymore, but the link is OK), the link |
| won't be disabled. Another way to check for a dead link could be to count |
| incoming frames on a heavily loaded host. This is not applicable to small |
| servers, but may be useful when the front switches send multicast |
| information on their links (e.g. VRRP), or even health-check the servers. |
| Use the arp_interval/arp_ip_target parameters to count incoming/outgoing |
| frames. |
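
As an illustration of the arp_interval/arp_ip_target parameters mentioned
above, the sketch below builds an options line in the style used elsewhere
in this file. arp_monitor_options is a hypothetical helper, and the interval
and target addresses in the usage note are placeholders to be replaced with
values that suit your network.

```shell
#!/bin/sh
# Hypothetical helper: prints a modprobe.conf-style options line for ARP
# monitoring. Arguments: interval in milliseconds, then one or more target
# IP addresses (joined with commas for arp_ip_target).
arp_monitor_options() {
    interval=$1
    shift
    if [ $# -lt 1 ]; then
        echo "at least one arp_ip_target address is required" >&2
        return 1
    fi
    targets=$(IFS=,; echo "$*")
    echo "options bond0 arp_interval=$interval arp_ip_target=$targets"
}

# Usage:
#   arp_monitor_options 1000 192.168.0.1 192.168.0.2
```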
| |
| |
| |
| Resources and Links |
| =================== |
| |
| Current development on this driver is posted to: |
| - http://www.sourceforge.net/projects/bonding/ |
| |
| Donald Becker's Ethernet Drivers and diag programs may be found at : |
| - http://www.scyld.com/network/ |
| |
| You will also find a lot of information regarding Ethernet, NWay, MII, etc. at |
| www.scyld.com. |
| |
| Patches for 2.2 kernels are at Willy Tarreau's site : |
| - http://wtarreau.free.fr/pub/bonding/ |
| - http://www-miaif.lip6.fr/~tarreau/pub/bonding/ |
| |
| To get the latest information about Linux Kernel development, please consult |
| the Linux Kernel Mailing List Archives at : |
| http://www.ussg.iu.edu/hypermail/linux/kernel/ |
| |
| -- END -- |