ARP – Problems and Solutions

In summary, there are three major issues with ARP flooding:

  • Eating link bandwidth
  • Eating CPU at routers
  • Eating CPU at end hosts

To begin with, Address Resolution Statistics:

To extract maximum performance from these applications it is important to optimize and tune all the layers in the data center stack. One critical area that requires particular attention is the link-layer address resolution protocol that maps an IP address with the specific hardware address at the edge of the network.

In the case of ARP, all address resolution request messages are broadcast and these will be received and processed by all nodes on the network. In the case of ND, address resolution messages are sent via multicast and therefore may have a lower overall impact on the network even though the number of messages exchanged is the same.

The results indicate that address resolution traffic scales in a linear fashion with the number of hosts in the network. This linear scaling applies both to ARP as well as ND traffic though raw ARP traffic rate was considerably higher than ND traffic rate.

Next is RFC 6820

Large amounts of broadcast traffic pose a particular burden because every device (switch, host, and router) must process and possibly act on such traffic.

Server virtualization provides numerous benefits, including higher utilization, increased data security, reduced user downtime, and even significant power conservation, along with the promise of a more flexible and dynamic computing environment.

If a VM moves to a new IP subnet, its address must change, and clients will need to be made aware of that change. From a VM management perspective, management is simplified if all servers are on a single large L2 network.

ARP uses broadcast, whereas ND uses multicast. When querying for a target IP address, ND maps the target address into an IPv6 Solicited Node multicast address. Using multicast rather than broadcast has the benefit that the multicast frames do not necessarily need to be sent to all parts of the network, i.e., the frames can be sent only to segments where listeners for the Solicited Node multicast address reside.
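
The mapping from a target address to its Solicited-Node multicast group can be sketched in a few lines of Python. This is only an illustration: the function names and the 2001:db8:: example address are made up for the sketch, while the ff02::1:ff00:0/104 prefix and the 33:33 Ethernet multicast prefix are the standard IPv6 values.

```python
import ipaddress


def solicited_node_multicast(target: str) -> ipaddress.IPv6Address:
    """Return the Solicited-Node multicast address for an IPv6 unicast target."""
    # Keep only the low 24 bits of the target address...
    low24 = int(ipaddress.IPv6Address(target)) & 0xFFFFFF
    # ...and append them to the well-known prefix ff02::1:ff00:0/104.
    prefix = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(prefix | low24)


def multicast_mac(group: ipaddress.IPv6Address) -> str:
    """Map an IPv6 multicast group to its Ethernet address (33:33 + low 32 bits)."""
    low32 = int(group) & 0xFFFFFFFF
    octets = [0x33, 0x33] + list(low32.to_bytes(4, "big"))
    return ":".join(f"{o:02x}" for o in octets)


# Hypothetical target address from the documentation prefix.
group = solicited_node_multicast("2001:db8::1:2:3456")
print(group, multicast_mac(group))
```

Only hosts whose addresses share the same low 24 bits join a given group, which is why a snooping switch can keep most ND queries away from most ports.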

When the L3 domain extends only to aggregation switches, hosts in any of the IP subnets configured on the aggregation switches can be reachable via L2 through any access switches if access switches enable all the VLANs. Such a topology allows a greater level of flexibility, as servers attached to any access switch can run any VMs that have been provisioned with IP addresses configured on the aggregation switches. In such an environment, VMs can migrate between racks without IP address changes.

Depending on the L3 boundary being present at the access/aggregation/core layer, the VM migration flexibility varies. Having L3 in the core only will affect the ARP/ND impact the most as the whole data center below the core becomes one large L2 domain.

The use of overlays in the data center network can be a useful design mechanism to help manage a potential bottleneck at the L2/L3 boundary by redefining where that boundary exists.

The RFC talks about the kinds of traffic – large web server farm, multi-tenant cloud hosting environment, high performance computing cluster. These three scenarios pose different requirements for network design.

If the expected use of the data center is to serve as a large web server farm, where thousands of nodes are doing similar things and the traffic pattern is largely in and out of a large data center, an access layer with EoR switches might be used, as it minimizes complexity, allows for servers and databases to be located in the same L2 domain, and provides for maximum density.

In order to host a multi-tenant cloud hosting service, to isolate inter-customer traffic, smaller L2 domains might be preferred. The multi-tenant nature of the cloud hosting application requires a smaller and more compartmentalized access layer. A multi-tenant environment might also require the use of L3 all the way to the access-layer ToR switch.

A high-performance compute cluster, where most of the traffic is expected to stay within the cluster but at the same time there is a high degree of crosstalk between the nodes, would once again call for a large access layer in order to minimize the requirements at the aggregation layer.

Some workarounds:

Sending out periodic gratuitous ARPs can effectively prevent nodes from needing to send ARP Requests intended to revalidate stale entries for a router. The net result is an overall reduction in the number of ARP queries routers receive.

Unfortunately, the IPv4 mitigation technique of sending gratuitous ARPs does not work in IPv6. The ND specification specifically states that gratuitous ND “updates” cannot cause an ND entry to be marked “valid”. Rather, such entries are marked “probe”, which causes the receiving node to (eventually) generate a probe back to the sender, which in this case is precisely the behavior that the router is trying to prevent!

RFC 7342 proposes some solutions:

The classic non-virtualized solution for VM mobility:

In order to allow a physical server to be loaded with VMs in different subnets or allow VMs to be moved to different server racks without IP address reconfiguration, the networks need to enable multiple broadcast domains (many VLANs) on the interfaces of L2/L3 boundary routers and Top-of-Rack (ToR) switches and allow some subnets to span multiple router ports.

Improvements in technology to mitigate flooding issues:

Since the majority of data center servers are moving towards 1G or 10G ports, the bandwidth taken by ARP/ND messages, even when flooded to all physical links, becomes negligible compared to the link bandwidth.  In addition, IGMP/MLD (Internet Group Management Protocol and Multicast Listener Discovery) snooping can further reduce the ND multicast traffic to some physical link segments.

As modern servers’ computing power increases, the processing taken by a large amount of ARP broadcast messages becomes less significant to servers. For example, lab testing shows that 2000 ARP requests per second only takes 2% of a single-core CPU server.  Therefore, the impact of ARP broadcasts to end stations is not significant on today’s servers.
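
The bandwidth claim above can be sanity-checked with rough arithmetic. The sketch below is my own back-of-the-envelope calculation, not something from the RFC: it assumes each ARP request occupies one minimum-size Ethernet frame on the wire, and it reuses the 2000 requests per second from the lab test as an assumed flood rate.

```python
# Assumption: an ARP request is padded to the 64-byte Ethernet minimum frame,
# and each frame also costs an 8-byte preamble/SFD and a 12-byte inter-frame gap.
WIRE_BYTES_PER_ARP = 64 + 8 + 12


def arp_load_fraction(requests_per_sec: float, link_bps: float) -> float:
    """Fraction of a link's capacity consumed by flooded ARP requests."""
    return requests_per_sec * WIRE_BYTES_PER_ARP * 8 / link_bps


for name, bps in (("1G", 1e9), ("10G", 1e10)):
    print(f"{name}: {arp_load_fraction(2000, bps):.4%} of link capacity")
```

Under these assumptions the flood costs a bit over 1 Mbit/s, roughly a tenth of a percent of a 1G link, which is consistent with calling it negligible.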

But the major impact is on routers’ CPU, which is what this RFC aims to mitigate. One solution already described above is for the router to send periodic GARP requests so that hosts can refresh their cache.

In order to reduce the burden on the router of resolving ARP for the hosts connected to the local subnet, ARP snooping is generally enabled so that ARP traffic that is not destined to the router is also trapped and its address mappings learnt. That way, traffic destined to a host need not be queued while the router resolves the host's address.

Caveat about IPv6:

Any solutions that relax the bidirectional requirement of IPv6 ND disable the security that the two-way ND communication exchange provides.

Overlays:

Overlay networks hide the VMs’ addresses from the interior switches and routers, thereby greatly reducing the number of addresses exposed to the interior switches and routers.

There is a draft on ARP reduction that has some more info:

The primary concerns that have influenced network architectures in the data center have been keeping broadcast domains manageable and spanning tree domains contained.

In order for the mobility to be non-disruptive to other hosts that have communication in progress with the VM being moved, the VM must retain its MAC address and IP address. Because of the requirement to retain the MAC and IP address, it is desirable to develop network architectures that would offer the least restrictions in terms of server mobility.

TRILL only solves the problem of spanning trees/accidental loops in the network. The broadcast problem still remains. For ARP reduction, the draft proposes something different:

If the destination IP address in the request is not present in the ARP table, then the original ARP request PDU is broadcast to all the switch ports that are member of the same VLAN except the source port that the Request was received from. However, if the requested (destination) IP address is present in the ARP table, a unicast ARP Reply PDU is prepared and sent to the switch port from which the ARP Request was received and original ARP request PDU is dropped.
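
The request handling described above can be sketched as follows. This is a toy model under my own assumptions, not the draft's implementation: the class and field names are invented, and learning the sender's mapping from every request is an assumption about how the switch's ARP table gets populated.

```python
from dataclasses import dataclass, field


@dataclass
class ArpRequest:
    sender_ip: str
    sender_mac: str
    target_ip: str
    vlan: int
    in_port: int


@dataclass
class ProxySwitch:
    vlan_ports: dict                                # vlan -> set of member ports
    arp_table: dict = field(default_factory=dict)   # (vlan, ip) -> mac

    def handle_request(self, req: ArpRequest):
        target_mac = self.arp_table.get((req.vlan, req.target_ip))
        # Assumption: the switch snoops the sender's mapping from the request.
        self.arp_table[(req.vlan, req.sender_ip)] = req.sender_mac
        if target_mac is None:
            # Unknown target: flood to every VLAN member except the source port.
            return ("flood", sorted(self.vlan_ports[req.vlan] - {req.in_port}))
        # Known target: answer with a unicast reply and drop the broadcast.
        return ("reply", req.in_port, target_mac)


sw = ProxySwitch(vlan_ports={10: {1, 2, 3}})
print(sw.handle_request(ArpRequest("10.0.0.1", "aa:aa:aa:aa:aa:01", "10.0.0.2", 10, 1)))
print(sw.handle_request(ArpRequest("10.0.0.2", "aa:aa:aa:aa:aa:02", "10.0.0.1", 10, 2)))
```

The first request is flooded because the table is empty; the second is answered locally because the switch learnt 10.0.0.1 from the first one.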

One problem with the above is that when the original broadcast request is dropped, other hosts will not be able to refresh their ARP caches with the sender's IP/MAC information. Their entries will age out and they will flood again. This will put a CPU burden on the ToR switch (as mentioned in the draft).

A GARP request is sent for DAD, while a GARP reply is sent to keep the entry refreshed in the ARP cache. Why use both? Why can we not use just a GARP request? In either case, the draft proposes the following way to handle a GARP PDU:

If the IP address is new, or exists but with a different hardware address, then the Gratuitous ARP PDU is forwarded out; otherwise the PDU is discarded.
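
That GARP rule amounts to a small filter. A minimal sketch, assuming the switch keeps a plain IP-to-MAC table (the function name and the dictionary layout are mine, not the draft's):

```python
def should_forward_garp(arp_table: dict, ip: str, mac: str) -> bool:
    """Return True if a Gratuitous ARP for (ip, mac) carries new information."""
    if arp_table.get(ip) == mac:
        # Same mapping as before: nothing new, discard the PDU.
        return False
    # New IP, or known IP with a different hardware address: record and forward.
    arp_table[ip] = mac
    return True


table = {}
print(should_forward_garp(table, "10.0.0.5", "aa:aa:aa:aa:aa:05"))  # first sighting
print(should_forward_garp(table, "10.0.0.5", "aa:aa:aa:aa:aa:05"))  # repeat
print(should_forward_garp(table, "10.0.0.5", "aa:aa:aa:aa:aa:99"))  # MAC changed
```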

That’s about it. I have a spaghetti of things in my mind about all this stuff. I will slowly let it sink in. Meanwhile, it looks like the VXLAN thing came about with this draft published in 2011 on “virtual machine mobility in L3 networks”.

RFC 1027 – ProxyARP

I was motivated to read more about proxy ARP when I was asked what the use case for proxy ARP is. I stumbled upon RFC 1027, which essentially says this:

Therefore a method for hiding the existence of subnets from hosts was highly desirable.  Since all the local area networks supported ARP, an ARP-based method (commonly known as “Proxy ARP” or the “ARP hack”) was chosen.

The physical networks of host A and B need not be connected to the same gateway. All that is necessary is that the networks be reachable from the gateway.
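
The gateway's decision can be sketched roughly like this. It is a simplified model of my own: the class, the interface names, and the addresses are hypothetical, and it only considers directly connected subnets, whereas the RFC only requires that the target network be reachable from the gateway.

```python
import ipaddress


class ProxyArpGateway:
    def __init__(self, interfaces: dict):
        # interfaces: name -> (gateway MAC on that interface, attached subnet)
        self.interfaces = {
            name: (mac, ipaddress.ip_network(net))
            for name, (mac, net) in interfaces.items()
        }

    def route_interface(self, ip: str):
        """Return the interface whose subnet contains ip, or None."""
        addr = ipaddress.ip_address(ip)
        for name, (_, net) in self.interfaces.items():
            if addr in net:
                return name
        return None

    def answer(self, in_iface: str, target_ip: str):
        """MAC to put in a proxy ARP reply, or None to stay silent."""
        out_iface = self.route_interface(target_ip)
        if out_iface is None or out_iface == in_iface:
            # Unreachable, or the target is on the asking host's own wire
            # and can answer for itself.
            return None
        return self.interfaces[in_iface][0]


gw = ProxyArpGateway({
    "eth0": ("02:00:00:00:00:a0", "10.1.1.0/24"),
    "eth1": ("02:00:00:00:00:a1", "10.1.2.0/24"),
})
print(gw.answer("eth0", "10.1.2.7"))  # host B is behind eth1: reply with eth0's MAC
print(gw.answer("eth0", "10.1.1.9"))  # same wire as the requester: no reply
```

Host A then sends its traffic for B to the gateway's MAC without ever knowing that a subnet boundary exists.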

eventfd

I hadn’t heard about the eventfd feature in Linux until I came across it in our code base and began to wonder what it is. This article about eventfd gives a good introduction to it. The gist:

  • This is similar to a pipe in that two related processes can signal each other and synchronize themselves.
  • Unlike pipe(), this has only one file descriptor. One can do both read() and write() on the file descriptor.
  • As a bonus, it can also act like a mutex/semaphore. The eventfd object holds a counter: each write() adds its value to the counter, and a read() returns the counter and resets it to zero. A read() while the counter is zero blocks the reader until some writer comes along and write()s a value. With the EFD_SEMAPHORE flag, each read() instead decrements the counter by one, so an initial value of 1 behaves like a mutex and a larger value like a counting semaphore.
  • One advantage of using this as a mutex/semaphore is that there is no explicit need for locking.
  • pipe() is a bit confusing with two file descriptors: one process has to close one end, the other process has to close the other end, and so on. eventfd is pretty simple.
  • Though pipe() can be used to transfer data between processes, eventfd cannot be used for that. In a sense, pipe() should be left for piping data and eventfd should be kept for signalling purposes. Separation of functions.
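
A small sketch of the counter and semaphore behaviour, using Python's os.eventfd wrappers (these exist only on Linux with Python 3.10 or newer, hence the hasattr guard). EFD_NONBLOCK is used here so that an empty read raises instead of blocking; the function name and the result keys are made up for the demo.

```python
import os


def eventfd_demo() -> dict:
    """Exercise a counter-mode and a semaphore-mode eventfd and report what read() sees."""
    results = {}

    fd = os.eventfd(0, os.EFD_NONBLOCK)          # counter starts at 0
    try:
        os.eventfd_write(fd, 3)                  # counter = 3
        os.eventfd_write(fd, 4)                  # counter = 7
        results["counter_read"] = os.eventfd_read(fd)   # returns 7, resets to 0
        try:
            os.eventfd_read(fd)                  # counter is 0: would block
            results["second_read_blocks"] = False
        except BlockingIOError:
            results["second_read_blocks"] = True
    finally:
        os.close(fd)

    sem = os.eventfd(2, os.EFD_NONBLOCK | os.EFD_SEMAPHORE)   # two "permits"
    try:
        # In semaphore mode every read returns 1 and decrements the counter.
        results["semaphore_reads"] = [os.eventfd_read(sem), os.eventfd_read(sem)]
    finally:
        os.close(sem)

    return results


if hasattr(os, "eventfd"):
    print(eventfd_demo())
```

In real use the descriptor would be shared between processes or threads (for example across fork()), with one side blocking in read() until the other side write()s.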

ARP

RFC 826 has some tidbits:

  • When an ARP request is received, the receiver does the following (important) things:
    • If the sender’s <IP, MAC> is already present in the cache, update it. This happens irrespective of whether the ARP request is for the receiver or not.
    • If the sender’s pair is not present in the cache, then add it to the cache only if the ARP request is intended for the receiver itself. This requirement is for hosts, but not for routers/switches, because it makes sense for a host to cache only the entries of hosts that it is talking to. One more subtle thing to note here is that the sender’s information is populated first and only then is the opcode looked at. This, I believe, is an implementation convenience. To expand: the information in an ARP reply comes in the sender fields (naturally). So, the first thing to do is to populate the sender’s information into the cache and then check whether the packet is a reply or a request. This way, we learn the other host’s entry as well.
  • When an ARP request is sent out, an incomplete entry is kept in the ARP table (to be filled in later). So, when a response comes back, the entry is updated even before checking “am I the target protocol address?”. So, the target protocol address is not really needed in the ARP reply, but is kept just in case.
  • Typically when a host moves, its ARP table is cleared, but others’ are not cleared.
  • Perhaps failure to initiate a connection should inform the Address Resolution module to delete the information on the basis that the host is not reachable, possibly because it is down or the old translation is no longer valid.
  • The suggested algorithm for receiving address resolution packets tries to lessen the time it takes for recovery if a host does move, because we update the entry at every ARP broadcast request packet if we already have that entry.
  • ARP refresh is generally unicast to the target so as to not disturb other hosts in the network.
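
The receive steps from RFC 826 described in the list above can be sketched as a single function. This is a simplified model: the packet is a plain dict keyed by the RFC's field abbreviations (spa/sha for the sender, tpa/tha for the target), the cache is a dict from IP to MAC, and the hardware-type and protocol-type checks are left out.

```python
REQUEST, REPLY = 1, 2


def receive_arp(cache: dict, my_ip: str, my_mac: str, pkt: dict):
    """Process one ARP packet; return a reply packet to send, or None."""
    merged = False
    # Step 1: if we already know the sender, refresh the entry, whoever the target is.
    if pkt["spa"] in cache:
        cache[pkt["spa"]] = pkt["sha"]
        merged = True
    # Step 2: everything else applies only if the packet is addressed to us.
    if pkt["tpa"] != my_ip:
        return None
    # Step 3: a sender we did not know is learnt only now.
    if not merged:
        cache[pkt["spa"]] = pkt["sha"]
    # Step 4: the opcode is examined last; only requests get an answer.
    if pkt["op"] == REQUEST:
        return {"op": REPLY, "sha": my_mac, "spa": my_ip,
                "tha": pkt["sha"], "tpa": pkt["spa"]}
    return None


cache = {}
req = {"op": REQUEST, "sha": "aa:aa:aa:aa:aa:01", "spa": "10.0.0.1",
       "tha": "00:00:00:00:00:00", "tpa": "10.0.0.2"}
print(receive_arp(cache, "10.0.0.9", "aa:aa:aa:aa:aa:09", req), cache)  # bystander learns nothing
print(receive_arp(cache, "10.0.0.2", "aa:aa:aa:aa:aa:02", req), cache)  # target learns and replies
```

Because the cache update happens before the opcode check, the same code path handles both requests and replies, which is the implementation convenience mentioned above.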

RFC 5227 adds some interesting things over RFC 826:

  • When we are doing an ARP probe (broadcast ARP request) to see whether an IP address is owned by somebody or not, we have to keep the sender’s IP address as zero because if the IP address is already owned by somebody else, then our probe will poison the caches of other hosts.
  • An ARP announcement (broadcast ARP request) is similar to an ARP probe, but the sender and the target IP addresses are the same. (GARP)
  • An ARP broadcast request is an “assertion” and a “question” – the sender fields assert its information, the target fields question information.
  • It is probably impossible to ask any truly pure question; asking any question necessarily invites speculation about why the interrogator wants to know the answer.
  • In some applications of IPv4 Address Conflict Detection (ACD), it may be advantageous to deliver ARP Replies using broadcast instead of unicast because this allows address conflicts to be detected sooner than might otherwise happen – the increased broadcast traffic is a worthwhile tradeoff for the increased reliability and fault tolerance.
  • RFC 826 implies that replies to ARP Requests are usually delivered using unicast, but it is also acceptable to deliver ARP Replies using broadcast.
  • The RFC has a nice definition of two hosts residing on the same link – when a packet is sent from one host to another, it arrives unmodified (in its entirety) to the destination and a broadcast packet reaches all the nodes in the link. Actually link layer header may be modified (in non-Ethernet [token ring] cases), but IP header and IP payload should not be modified. That is a more precise definition
  • Probing an address takes place during many events – not just IP address configuration – link toggle, link connected etc.
  • For ARP Probe – the initial random delay helps ensure that a large number of hosts powered on at the same time do not all send their initial probe packets simultaneously.
  • While we are doing a probe:
    • If an ARP request/reply comes with a “sender” address that is the same as the probe address, ACD (Address Conflict Detection) cries foul
    • If an ARP probe comes for the same address but with different “sender” hardware address, then ACD cries foul. This is the reason why sender hardware address “must” be populated. If, because of some loop, the ARP probe packet comes back to us, then this will save us from crying foul.
  • ARP probe rate limit can kick in if conflict detection crosses a threshold
  • ARP Announcement is the same as ARP Probe but with the sender and target fields having the same IP address. The opcode is NOT a reply, but a request
  • When a host wants to defend an address (and another host also wants to do the same), if we send another announcement in return for every ARP announcement made, there will be an endless loop.
  • Before abandoning an address due to a conflict, hosts SHOULD actively attempt to reset any existing connections using that address.
  • ARP replies as broadcast is desirable when we want to quickly resolve conflicts – like when two hosts are partitioned and initially assume that both have unique IP address. This is mandatory in ACD of dynamically configured IPv4 link local address.
  • Broadcast ARP Replies SHOULD NOT be used universally. Broadcast ARP Replies should be used where the benefit of faster conflict detection outweighs the cost of increased broadcast traffic and increased packet processing load on the participant network hosts.
  • The “principle of least surprise” dictates that where there are two or more ways to solve a networking problem that are otherwise equally good, the one with the fewest unusual properties is the one likely to have the fewest interoperability problems with existing implementations – because of which ARP announcements are “requests” and not “replies”. In summary, there are more ways that an incorrect ARP implementation might plausibly reject an ARP Reply (which usually occurs as a result of being solicited by the client) than an ARP Request (which is already expected to occur unsolicited)
  • What Stevens describes as Gratuitous ARP is the exact same packet that this document refers to by the more descriptive term ‘ARP Announcement’
  • GARP is passive, not active. An ARP Announcement is, in fact, active.
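
To make the probe and announcement formats from the list above concrete, here is a sketch that builds the 28-byte ARP payloads with struct and applies the two conflict checks used while probing. The helper names and the example addresses are mine; the field layout (hardware type 1, protocol type 0x0800, opcode 1 for request) is the standard Ethernet/IPv4 ARP layout, and nothing is actually sent on a network.

```python
import socket
import struct

ARP_FMT = "!HHBBH6s4s6s4s"   # htype, ptype, hlen, plen, op, sha, spa, tha, tpa
ARP_REQUEST = 1
ZERO_MAC = bytes(6)
ZERO_IP = bytes(4)


def _arp(op: int, sha: bytes, spa: str, tha: bytes, tpa: str) -> bytes:
    return struct.pack(ARP_FMT, 1, 0x0800, 6, 4, op,
                       sha, socket.inet_aton(spa), tha, socket.inet_aton(tpa))


def arp_probe(my_mac: bytes, candidate_ip: str) -> bytes:
    """Request with an all-zero sender IP, so nobody's cache gets poisoned."""
    return _arp(ARP_REQUEST, my_mac, "0.0.0.0", ZERO_MAC, candidate_ip)


def arp_announcement(my_mac: bytes, my_ip: str) -> bytes:
    """Request (not reply) whose sender and target IP are both our address."""
    return _arp(ARP_REQUEST, my_mac, my_ip, ZERO_MAC, my_ip)


def conflicts_with_probe(pkt: bytes, my_mac: bytes, candidate_ip: str) -> bool:
    """Should ACD cry foul on receiving pkt while we probe candidate_ip?"""
    _, _, _, _, op, sha, spa, _, tpa = struct.unpack(ARP_FMT, pkt)
    candidate = socket.inet_aton(candidate_ip)
    if spa == candidate:
        return True                       # someone already uses the address
    is_probe = op == ARP_REQUEST and spa == ZERO_IP and tpa == candidate
    # Another host probing the same address conflicts; our own looped-back probe does not.
    return is_probe and sha != my_mac


me = bytes.fromhex("020000000001")
other = bytes.fromhex("020000000002")
print(conflicts_with_probe(arp_probe(other, "169.254.7.7"), me, "169.254.7.7"))
print(conflicts_with_probe(arp_probe(me, "169.254.7.7"), me, "169.254.7.7"))
```

The sender hardware address is what lets the second call return False: a probe that loops back to us carries our own MAC and is ignored.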