Posted in software

Custom ASICs or General Purpose ASICs For Networking?

If you read papers such as this, you will see that it is hard to beat custom ASICs for packet routing at line rate with general-purpose x86 processors. So, given this constraint, how do we make the networking world more open, like the server world? If we look at the x86 world, what is standardized is the ISA (Instruction Set Architecture), on top of which layers upon layers of multi-source stacks have been built.

Is it possible to standardize such an ISA for custom ASICs? Even though they are custom, all of these ASICs (whether closed ASICs from Cisco and Brocade, or commodity ASICs from Broadcom, Cavium, etc.) do one specific thing: network-related processing. Given this, is it not possible to standardize the ISA? Maybe it is not so straightforward to achieve, which is why people ended up with things such as OpenFlow to standardize the interface to the hardware layer.
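To make the idea concrete, here is a hypothetical sketch in C of the kind of hardware-neutral match-action rule that OpenFlow-style interfaces standardize. The struct and its fields are made up purely for illustration, not taken from any actual ASIC SDK or the OpenFlow spec:

    #include <stdint.h>

    /* Hypothetical match-action entry: the kind of vendor-neutral
     * abstraction an OpenFlow-style interface standardizes, so that
     * controller software can program any vendor's ASIC the same way. */
    struct flow_rule {
        uint8_t  dst_mac[6];   /* match: destination MAC (all-zero = wildcard) */
        uint32_t dst_ip;       /* match: destination IPv4 address */
        uint32_t dst_ip_mask;  /* match: prefix mask (0 = wildcard) */
        uint16_t out_port;     /* action: forward out of this port */
        uint8_t  drop;         /* action: drop instead of forwarding */
    };

The point is that the controller only ever sees match fields and actions; how the rule lands in TCAM or SRAM stays the vendor's problem.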

Anyway, the article linked above (by James Hamilton) gives a very good contrast between what happened in the server world and where the networking world is and how it is moving. As the Cumulus folks said, history does not repeat itself, but it does rhyme! 🙂

Posted in software

select/poll/epoll/kqueue

This article gives a very nice summary of the pros and cons of the different I/O multiplexing mechanisms provided by POSIX operating systems (like Linux and FreeBSD).

We will take select as level 0 to start with (unfortunately, it is still used by tons of legacy code). The basic problems of select are:

  • A fixed-size bitmap (FD_SETSIZE, typically 1024) limits which fds you can specify
  • Stateless – the interest set must be rebuilt on every call, because once select returns, only the active (result) fds remain set
  • Both the kernel (when select is called) and the process (when select returns) must scan through all the fds to see which are set – unnecessary work
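Below is a minimal sketch in C of a select() loop over a set of already-open fds, annotated with the three problems above. The select_loop helper and buffer size are made up for illustration:

    #include <stdio.h>
    #include <sys/select.h>
    #include <unistd.h>

    /* A minimal select() loop over already-open descriptors,
     * illustrating all three problems at once. */
    void select_loop(int *fds, int nfds)
    {
        for (;;) {
            fd_set readfds;
            int i, maxfd = -1;

            /* Problem 2: stateless - rebuild the whole interest set
             * on every iteration, because select() overwrites it. */
            FD_ZERO(&readfds);
            for (i = 0; i < nfds; i++) {
                FD_SET(fds[i], &readfds);  /* Problem 1: fds[i] must be < FD_SETSIZE */
                if (fds[i] > maxfd)
                    maxfd = fds[i];
            }

            if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0) {
                perror("select");
                return;
            }

            /* Problem 3: scan every fd to find the ready ones. */
            for (i = 0; i < nfds; i++) {
                if (FD_ISSET(fds[i], &readfds)) {
                    char buf[512];
                    read(fds[i], buf, sizeof(buf));  /* handle the event */
                }
            }
        }
    }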

poll tries to solve these to some extent:

  • The fixed-size bitmap is replaced by a variable-length array (of struct pollfd)
  • Interest and result are tracked in two separate fields (events and revents) – interest is given through one and the result is returned in the other – so the interest set does not need to be re-initialized every time we call poll
  • The kernel does not need to scan through all possible fds, since only the interested fds are passed to it (no bitmap, just an array of interested fds). However, when poll returns, the process still has to scan through all its fds to find which are active

However:

  • poll is still stateless – the interest array must be copied on every call, from user space to kernel space and vice versa
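A sketch of the same loop with poll(); again illustrative only, and the 256-entry cap is an arbitrary assumption for brevity:

    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    /* The same loop with poll(): the interest set is built once
     * (events vs. revents), but the whole array is still copied to
     * the kernel and scanned by the process on every iteration. */
    void poll_loop(int *fds, int nfds)
    {
        struct pollfd pfds[256];  /* assumption: nfds <= 256 for this sketch */
        int i;

        /* Interest set is initialized once, not per call. */
        for (i = 0; i < nfds; i++) {
            pfds[i].fd = fds[i];
            pfds[i].events = POLLIN;
        }

        for (;;) {
            if (poll(pfds, nfds, -1) < 0) {  /* still copies pfds in and out */
                perror("poll");
                return;
            }
            /* The process still scans all entries to find the active ones. */
            for (i = 0; i < nfds; i++) {
                if (pfds[i].revents & POLLIN) {
                    char buf[512];
                    read(pfds[i].fd, buf, sizeof(buf));
                }
            }
        }
    }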

Then came along epoll:

  • It made the transition to stateful event tracking – a program registers an fd once, and it is tracked until the program removes it.
  • So there is no need to send the list of interested events to the kernel on every call, nor does the kernel need to scan that list every time – it is already maintained inside the kernel.
  • A separate function (epoll_wait) blocks waiting for any event to happen, and when it returns, the process scans only the result events, not all registered events (as poll requires)
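A minimal epoll sketch (Linux-specific; the helper name and the 64-event batch size are arbitrary):

    #include <stdio.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    /* epoll: register each fd once; epoll_wait() returns only the
     * ready events, so the process never scans the full interest set. */
    void epoll_loop(int *fds, int nfds)
    {
        struct epoll_event ev, events[64];
        int i, n, epfd = epoll_create1(0);

        if (epfd < 0) {
            perror("epoll_create1");
            return;
        }

        /* Stateful: the kernel keeps the interest list after this. */
        for (i = 0; i < nfds; i++) {
            ev.events = EPOLLIN;
            ev.data.fd = fds[i];
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
                perror("epoll_ctl");
                return;
            }
        }

        for (;;) {
            n = epoll_wait(epfd, events, 64, -1);  /* blocks for events */
            if (n < 0) {
                perror("epoll_wait");
                return;
            }
            /* Scan only the n ready events, not all nfds descriptors. */
            for (i = 0; i < n; i++) {
                char buf[512];
                read(events[i].data.fd, buf, sizeof(buf));
            }
        }
    }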

Then came along kqueue:

  • kqueue is an even more generic framework: not just sockets, but also timers, signals, regular files, processes, etc. can be tracked for events.
  • The framework is easily extensible to new event types that may be implemented in the future.
  • In other words, kqueue is a one-stop call for tracking all kinds of events. The only catch is that it is not available on Linux, only on FreeBSD (and the other BSDs). 🙂
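And a minimal kqueue sketch (FreeBSD/macOS) that tracks a socket and a recurring timer through the same queue; the helper name and the 1-second interval are arbitrary:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <stdio.h>
    #include <unistd.h>

    /* kqueue: one queue tracks heterogeneous events - here a socket
     * read filter and a recurring 1-second timer in the same call. */
    void kqueue_loop(int sockfd)
    {
        struct kevent changes[2], events[2];
        int i, n, kq = kqueue();

        if (kq < 0) {
            perror("kqueue");
            return;
        }

        /* Register once: a socket read filter and a 1000 ms timer. */
        EV_SET(&changes[0], sockfd, EVFILT_READ,  EV_ADD, 0, 0,    NULL);
        EV_SET(&changes[1], 1,      EVFILT_TIMER, EV_ADD, 0, 1000, NULL);
        if (kevent(kq, changes, 2, NULL, 0, NULL) < 0) {
            perror("kevent register");
            return;
        }

        for (;;) {
            n = kevent(kq, NULL, 0, events, 2, NULL);  /* wait for any event */
            if (n < 0) {
                perror("kevent wait");
                return;
            }
            for (i = 0; i < n; i++) {
                if (events[i].filter == EVFILT_READ) {
                    char buf[512];
                    read((int)events[i].ident, buf, sizeof(buf));
                } else if (events[i].filter == EVFILT_TIMER) {
                    printf("timer fired\n");
                }
            }
        }
    }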


Posted in software

The MLAG Components

When we want to pair two switches into an MLAG, the following things should be kept in mind:

  • Both switches should advertise the same LACP system ID to the partner
  • Northbound or southbound BUM traffic should not be sent twice to the LACP partner (perhaps via rules in Linux to drop the duplicate traffic?)
  • MAC table should be synchronized between the two switches
  • MAC learning should be such that MAC addresses learned from the LACP partner do not bounce between the two switches
  • The neighbor (ARP/ND) table should be in sync between the two switches (since this is not purely L3, but somewhat L2 too)
  • Since LACP is used, the links between the two switches and the partner must be direct (LACP is link-local and does not run across intermediate devices)

The links to the switches can be many and asymmetrical. When two pairs of switches form an MLAG, the result is a fully meshed network (all connecting to all), with each pair advertising the same system MAC address (when a host sits in place of a switch pair, the bonding driver takes care of presenting this same system MAC address).

Nice summary here. Apparently, Cumulus Linux uses “clagd” to periodically synchronize the MAC database to the peer switch. Among other things, “clagd” also sets up the peering relationship and determines the primary/secondary roles for sending BPDUs. There is also a presentation by Cumulus on MLAGs, which is a simple but nice read.
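To make the MAC synchronization idea concrete, here is a purely illustrative C struct for the kind of record one MLAG peer might send the other. This is not clagd's actual message format (which may differ entirely), just a sketch of the information each peer needs:

    #include <stdint.h>

    /* Purely illustrative - not clagd's actual wire format. A MAC-table
     * sync record that one MLAG peer might send the other, so both
     * switches can forward correctly for MACs learned on either side. */
    struct mac_sync_entry {
        uint8_t  mac[6];    /* MAC address learned locally */
        uint16_t vlan_id;   /* VLAN it was learned in */
        uint32_t ifindex;   /* local port it was learned on */
        uint8_t  add;       /* 1 = add/refresh, 0 = withdraw (aged out) */
    };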

Posted in software

On Maturity of an Industry..

From here:

..bare-metal networking is more than being affordable; it’s about giving customers degrees of freedom, transparency, and choice that they deserve in a mature industry.

In the dark ages of computing (aka 1983), a customer running IBM DB2 had to buy an IBM mainframe (complete with cables, disks, power distribution, memory, and IO) to go with the application. The compute industry has matured to a point where DB2 runs on hardware ranging from mainframes through p-series down into non-IBM x86 platforms hosted on operating systems including z/OS, Unix, Linux, and Windows. Application independent from OS independent from hardware; degrees of freedom and choice.

We view the natural state of the networking supply chain to be one by which customers are able to purchase their networking hardware as “close to the source” as they’d like, starting at original manufacturers all the way through well known Enterprise IT providers. Regardless of the hardware source, customers are able to deploy the networking software of their choice (CumulusLinux for instance :-). There are zero technical barriers to this model, and we do it around compute every day. This allows customers to make a safe capital investment in infrastructure without being locked into any one vendor; suppliers get to earn their spot every cycle.

The attributes of transparency, choice, and degrees of freedom, not price, are driving all of the mega-scale customers to bare-metal networking solutions, whether they do it in house or leverage companies like Cumulus Networks. Make no mistake, ALL mega-scales have revolted and are somewhere on the path towards independence.

And finally:

Mass availability and access to best-of-breed software and a value chain of features on top of a bare metal switch, as opposed to a proprietary stack pre-integrated custom ASIC solution (which you are stuck with) is NOT about changing the game. It’s what you need to do to STAY in the game. It’s time to scale up, bring the network out of the dark ages, and welcome the unbundled OS.

Posted in software

The Modern Data Center

From the Cumulus blog:

The modern data center is a collection of compute elements connected via a network. What used to be a thread is now a process running inside a virtual machine or a physical machine, and storage lives locally in a server or on a SAN. So what is needed is simply an ability to consistently and coherently connect the compute elements to each other and to storage irrespective of where they live and how they are physically connected. And voila, you get fast application delivery!

To achieve this goal, you just need two layers:

  1. A physical connectivity layer that uses standard mechanisms (let’s say IP/ethernet) and allows all end ports to connect to each other with massive bandwidth, multiple paths and QoS.

  2. A controller that connects subsets of ports to each other exclusively and securely using logical ports layered on top of the physical connectivity layer.

Cumulus Linux is the first Linux Distribution for networking gear and it’s finally bringing the economics of Linux to networking. It takes an industry standard networking box and makes it look like a server with a large number of 1G, 10G, or 40G NICs connected at wire-rate. The operating system accelerates the datapath in networking hardware using the existing Linux kernel constructs. So, Cumulus Linux IS Linux and customers can use the best automation, monitoring and orchestration tools onto their networking gear.

Posted in software

What Cumulus Linux is About

From their blog:

Cumulus Linux is a Linux distribution which has support for networking co-processors. It is not simply a networking device which uses Linux as a base OS. Linux is the OS. All of the device’s interfaces are standard Linux interfaces. So when you type “ifconfig”, you’ll see all of the interfaces, just like on a Linux server. Want to bring a link up? You can use “ip link set”. “brctl” sets up and configures bridges. Standard open-source routing protocol suites, like Quagga, can be used.

This also means that you can do things like you would on any other Linux device, like mount remote file systems, or install and use standard monitoring, administration, and reporting tools. Puppet, collectd, Nagios, bwm-ng, and most other Linux-based tools can be easily downloaded and installed using apt-get. You could even run custom scripts written in Python, or Perl, or C and, for example, put them in a crontab to periodically run. And since the network co-processor is handling all of the datapath forwarding functions, the switch operates at wirespeed.

This is Cumulus Linux.

Posted in software

On How to Build a Product

From the Cumulus Networks blog:

Capacity should be fast, easy, and affordable. That’s the mantra that’s driven me for the last two decades. I’ve had the privilege of working on quite a few extremely popular products in my career and that principle has been the foundation I apply to each challenge. In each case, we’ve looked at a problem, noted the state of the art, studied the history, and understood cruft around current solutions. Then we carefully remove obsolete concepts, leverage evolved technologies, and innovate in meaningful ways to make capacity fast, easy, and affordable.