ARP – Problems and Solutions

In summary, there are three major issues with ARP flooding:

  • Eating link bandwidth
  • Eating CPU at routers
  • Eating CPU at end hosts

To begin with, some findings from Address Resolution Statistics:

To extract maximum performance from data center applications, it is important to optimize and tune all the layers in the data center stack. One critical area that requires particular attention is the link-layer address resolution protocol that maps an IP address to a specific hardware address at the edge of the network.

In the case of ARP, all address resolution request messages are broadcast and these will be received and processed by all nodes on the network. In the case of ND, address resolution messages are sent via multicast and therefore may have a lower overall impact on the network even though the number of messages exchanged is the same.

The results indicate that address resolution traffic scales in a linear fashion with the number of hosts in the network. This linear scaling applies both to ARP and ND traffic, though the raw ARP traffic rate was considerably higher than the ND traffic rate.
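To get a feel for the absolute numbers behind that linear scaling, here is a quick back-of-envelope sketch. The per-host request rate and frame size are my assumptions, not figures from the study; the point is simply that flooded traffic grows with host count:

```python
# Back-of-envelope estimate of flooded ARP traffic (assumed numbers,
# not taken from the Address Resolution Statistics study).
ARP_FRAME_BYTES = 60                 # minimum Ethernet frame carrying an ARP PDU
requests_per_host_per_sec = 0.5      # assumed average per-host ARP rate

for n_hosts in (1_000, 10_000, 100_000):
    flooded = n_hosts * requests_per_host_per_sec    # frames/s seen on every link
    mbps = flooded * ARP_FRAME_BYTES * 8 / 1e6
    print(f"{n_hosts:>7} hosts: {flooded:>7.0f} ARP/s flooded (~{mbps:.2f} Mbps)")
```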

Next up is RFC 6820:

Large amounts of broadcast traffic pose a particular burden because every device (switch, host, and router) must process and possibly act on such traffic.

Server virtualization provides numerous benefits, including higher utilization, increased data security, reduced user downtime, and even significant power conservation, along with the promise of a more flexible and dynamic computing environment.

If a VM moves to a new IP subnet, its address must change, and clients will need to be made aware of that change. From a VM management perspective, management is simplified if all servers are on a single large L2 network.

ARP uses broadcast, whereas ND uses multicast. When querying for a target IP address, ND maps the target address into an IPv6 Solicited Node multicast address. Using multicast rather than broadcast has the benefit that the multicast frames do not necessarily need to be sent to all parts of the network, i.e., the frames can be sent only to segments where listeners for the Solicited Node multicast address reside.
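Since the whole mechanism hinges on that address mapping, here is a small sketch of the computation in standard-library Python (the helper name is mine; the mapping itself is the standard ff02::1:ffXX:XXXX group and its 33:33 Ethernet prefix):

```python
import ipaddress

def solicited_node(target: str):
    """Map an IPv6 address to its Solicited-Node multicast group and
    the Ethernet multicast MAC that the group maps onto."""
    addr = ipaddress.IPv6Address(target)
    low24 = int(addr) & 0xFFFFFF                          # last 24 bits of the target
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))   # Solicited-Node prefix
    group = ipaddress.IPv6Address(base | low24)
    low32 = int(group) & 0xFFFFFFFF                       # MAC is 33:33 + low 32 bits
    mac = "33:33:" + ":".join(f"{(low32 >> s) & 0xFF:02x}" for s in (24, 16, 8, 0))
    return str(group), mac

print(solicited_node("2001:db8::4839:1bff:fe2c:70aa"))
# -> ('ff02::1:ff2c:70aa', '33:33:ff:2c:70:aa')
```

Only hosts subscribed to that specific group (plus any segment where snooping is not in effect) ever see the Neighbor Solicitation, which is exactly the advantage over ARP's all-stations broadcast.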

When the L3 domain extends only to the aggregation switches, hosts in any of the IP subnets configured on the aggregation switches are reachable via L2 through any access switch, provided the access switches enable all the VLANs. Such a topology allows a greater level of flexibility, as servers attached to any access switch can run any VMs that have been provisioned with IP addresses configured on the aggregation switches. In such an environment, VMs can migrate between racks without IP address changes.

Depending on whether the L3 boundary sits at the access, aggregation, or core layer, the VM migration flexibility varies. Having L3 only in the core has the largest ARP/ND impact, as the whole data center below the core becomes one large L2 domain.

The use of overlays in the data center network can be a useful design mechanism to help manage a potential bottleneck at the L2/L3 boundary by redefining where that boundary exists.

The RFC talks about three kinds of deployments: a large web server farm, a multi-tenant cloud hosting environment, and a high-performance computing cluster. These three scenarios pose different requirements for network design.

If the expected use of the data center is to serve as a large web server farm, where thousands of nodes are doing similar things and the traffic pattern is largely in and out of a large data center, an access layer with EoR switches might be used, as it minimizes complexity, allows for servers and databases to be located in the same L2 domain, and provides for maximum density.

In order to host a multi-tenant cloud hosting service and isolate inter-customer traffic, smaller L2 domains might be preferred. The multi-tenant nature of the cloud hosting application requires a smaller and more compartmentalized access layer. A multi-tenant environment might also require the use of L3 all the way to the access-layer ToR switch.

A high-performance compute cluster, where most of the traffic is expected to stay within the cluster but there is a high degree of crosstalk between the nodes, would once again call for a large access layer in order to minimize the requirements at the aggregation layer.

Some workarounds:

Sending out periodic gratuitous ARPs can effectively prevent nodes from needing to send ARP Requests intended to revalidate stale entries for a router. The net result is an overall reduction in the number of ARP queries routers receive.
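For concreteness, here is a minimal sketch of what such a gratuitous ARP announcement looks like on the wire, built with the Python standard library. The function name is mine, and the MAC/IP values and interface in the usage comment are placeholders:

```python
import socket
import struct

def garp_frame(src_mac: bytes, src_ip: bytes, reply: bool = False) -> bytes:
    """Build a gratuitous ARP: sender IP == target IP, broadcast dest MAC."""
    eth = struct.pack("!6s6sH", b"\xff" * 6, src_mac, 0x0806)  # Ethernet II, ARP
    arp = struct.pack("!HHBBH6s4s6s4s",
                      1, 0x0800, 6, 4,       # Ethernet, IPv4, MAC len, IP len
                      2 if reply else 1,     # opcode: reply (2) or request (1)
                      src_mac, src_ip,       # sender MAC / IP
                      b"\x00" * 6, src_ip)   # target MAC ignored; target IP == sender
    return eth + arp

# Sending is platform specific; on Linux a raw AF_PACKET socket works:
# s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
# s.bind(("eth0", 0))
# s.send(garp_frame(b"\x02\x00\x00\x00\x00\x01", socket.inet_aton("192.0.2.1")))
```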

Unfortunately, the IPv4 mitigation technique of sending gratuitous ARPs does not work in IPv6. The ND specification specifically states that gratuitous ND “updates” cannot cause an ND entry to be marked “valid”. Rather, such entries are marked “probe”, which causes the receiving node to (eventually) generate a probe back to the sender, which in this case is precisely the behavior that the router is trying to prevent!

RFC 7342 proposes some solutions:

The classic non-virtualized solution for VM mobility:

In order to allow a physical server to be loaded with VMs in different subnets or allow VMs to be moved to different server racks without IP address reconfiguration, the networks need to enable multiple broadcast domains (many VLANs) on the interfaces of L2/L3 boundary routers and Top-of-Rack (ToR) switches and allow some subnets to span multiple router ports.

Improvements in technology to mitigate flooding issues:

Since the majority of data center servers are moving towards 1G or 10G ports, the bandwidth taken by ARP/ND messages, even when flooded to all physical links, becomes negligible compared to the link bandwidth.  In addition, IGMP/MLD (Internet Group Management Protocol and Multicast Listener Discovery) snooping can further reduce the ND multicast traffic to some physical link segments.

As modern servers’ computing power increases, the processing consumed by a large volume of ARP broadcast messages becomes less significant to servers. For example, lab testing shows that 2000 ARP requests per second take only 2% of a single core on a server. Therefore, the impact of ARP broadcasts on end stations is not significant on today’s servers.
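Taking the RFC’s figure at face value, a one-line sanity check shows the per-request budget that 2% works out to (the core clock speed below is my assumption):

```python
# Sanity-checking the RFC's figure (the core clock speed is an assumption).
arp_rate = 2000        # ARP requests per second, from the example above
cpu_fraction = 0.02    # 2% of a single core
core_hz = 2.4e9        # assumed clock rate of one server core

per_request_s = cpu_fraction / arp_rate
print(f"~{per_request_s * 1e6:.0f} us (~{per_request_s * core_hz:,.0f} cycles) per ARP")
# -> ~10 us (~24,000 cycles) per ARP
```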

But the major impact is on routers’ CPU, which is what this RFC aims to mitigate. One solution already described above is for the router to send periodic GARP requests so that hosts can refresh their cache.

In order to reduce the burden on the router of resolving ARP for hosts on the local subnet, ARP snooping is generally enabled: ARP traffic not destined to the router is also trapped, and the sender’s IP-to-MAC binding is learnt from it. That way, traffic destined to a host need not be queued while the router resolves the address on demand.
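A toy model of that fast-path/slow-path split, with data structures and names of my own invention:

```python
# Toy model of ARP snooping on a router (data structures and names are mine).
arp_cache: dict[str, str] = {}   # IP -> MAC

def on_snooped_arp(sender_ip: str, sender_mac: str) -> None:
    """Called for every ARP PDU the router traps, including ones not
    addressed to the router itself."""
    arp_cache[sender_ip] = sender_mac          # learn/refresh the binding

def forward(dst_ip: str, packet: bytes) -> None:
    mac = arp_cache.get(dst_ip)
    if mac is None:
        # Only on a cache miss must the packet be queued while the
        # router sends its own ARP Request and waits for the reply.
        print(f"miss: queue packet, ARP for {dst_ip}")
    else:
        print(f"hit: forward straight to {mac}")

on_snooped_arp("192.0.2.7", "02:00:00:00:00:07")  # learnt from host-to-host ARP
forward("192.0.2.7", b"payload")                  # hit: no queuing needed
```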

Caveat about IPv6:

Any solutions that relax the bidirectional requirement of IPv6 ND disable the security that the two-way ND communication exchange provides.

Overlays:

Overlay networks hide the VMs’ addresses from the interior switches and routers, thereby greatly reducing the number of addresses exposed to the interior switches and routers.

There is a draft on ARP reduction that has some more info:

The primary concerns that have influenced network architectures in the data center have been keeping broadcast domains manageable and spanning tree domains contained.

In order for the mobility to be non-disruptive to other hosts that have communication in progress with the VM being moved, the VM must retain its MAC address and IP address. Because of the requirement to retain the MAC and IP address, it is desirable to develop network architectures that would offer the least restrictions in terms of server mobility.

TRILL only solves the problem of spanning trees/accidental loops in the network; the broadcast problem still remains. For ARP reduction, the draft’s author proposes something different:

If the destination IP address in the Request is not present in the ARP table, the original ARP Request PDU is broadcast to all switch ports that are members of the same VLAN, except the source port the Request was received on. However, if the requested (destination) IP address is present in the ARP table, a unicast ARP Reply PDU is prepared and sent to the switch port from which the ARP Request was received, and the original ARP Request PDU is dropped.
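As a sketch, the proposed ToR behavior boils down to a simple lookup-then-branch (function and action names are mine, not the draft’s):

```python
# The draft's ToR switch ARP handling, as I read it (names are mine).
def handle_arp_request(arp_table: dict[str, str], target_ip: str, in_port: int):
    mac = arp_table.get(target_ip)
    if mac is None:
        # Unknown target: keep classic behavior and flood the original
        # Request to every port in the VLAN except the ingress port.
        return ("flood_except", in_port)
    # Known target: answer with a unicast ARP Reply on the ingress port
    # and drop the original broadcast Request.
    return ("unicast_reply_on", in_port, mac)

print(handle_arp_request({"10.0.0.5": "02:aa:bb:cc:dd:05"}, "10.0.0.5", 3))
# -> ('unicast_reply_on', 3, '02:aa:bb:cc:dd:05')
```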

One problem with the above is that when the original broadcast request is dropped, other hosts will not be able to refresh their ARP caches with the sender’s IP/MAC information. Their entries will age out and they will flood again, causing CPU burden on the ToR switch (as mentioned in the draft).

A GARP request is sent for DAD, while a GARP reply is sent to keep the entry refreshed in the ARP cache. Why use both? Why can’t a single GARP request serve both purposes? In either case, the draft proposes the following way to handle a GARP PDU:

If the IP address is new, or exists but with a different hardware address, then the Gratuitous ARP PDU is forwarded out; otherwise the PDU is discarded.
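That rule is compact enough to state as code; a sketch with structure of my own (the draft specifies the behavior, not this form):

```python
# The draft's gratuitous-ARP rule as a sketch (structure is mine).
def handle_garp(arp_table: dict[str, str], ip: str, mac: str) -> str:
    if ip not in arp_table or arp_table[ip] != mac:
        arp_table[ip] = mac    # new host, or one that moved/changed MAC
        return "forward"       # flood so other hosts learn the change
    return "drop"              # binding unchanged: suppress the flood
```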

That’s about it. I have a spaghetti of things in my mind about all this stuff; I will slowly let it sink in. Meanwhile, it looks like the whole VXLAN thing came about with a draft published in 2011 on “virtual machine mobility in L3 networks”.
