MigratoryData has achieved 12 million concurrent connections, that too with a JVM!! Time to rethink the JVM. Some of the Linux parameters that were tuned are:
- The maximum number of open files the system can handle (the system-wide fs.file-max, plus /proc/sys/fs/nr_open, which caps how high a single process’s limit can be raised)
- The maximum number of open files a process can have (by tuning it with “ulimit -n”; see the sketch after this list)
- Increase the range of ephemeral ports on a box (the default range being 32768 to 61000) to something like 500 to 65535, with “sysctl -w net.ipv4.ip_local_port_range”. On the server side, it is the same listening address and listening port that gets paired with each client endpoint address
- Once a connection is established, the quadruple that uniquely identifies it is <serverIP, serverListeningPort, clientIP, clientEphemeralPort>
- So we don’t need that many port numbers on the server side; we only need that many open sockets
- Recycle TCP’s TIME_WAIT sockets faster by enabling “/proc/sys/net/ipv4/tcp_tw_recycle”
- Amount of memory allocated to tcp/udp with “sysctl -w net.ipv4.[tcp|udp]_mem”
- Balance interrupt processing by pinning each tx/rx queue of the NIC to a different core (now is the time to read more about the Intel 10G Ethernet adapter), by writing the appropriate bitmask to /proc/irq/<interruptNum>/smp_affinity
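As a tiny illustration of the per-process limit in the “ulimit -n” bullet above, here is a minimal C sketch that raises RLIMIT_NOFILE for the current process. The target of 1,000,000 fds is just an example value (not what MigratoryData used), and the system-wide knobs like fs.nr_open still have to be raised separately via sysctl.

```c
/* Minimal sketch: raise this process's open-file limit (the programmatic
 * equivalent of "ulimit -n"). The target of 1,000,000 fds is an arbitrary
 * example; it only succeeds if /proc/sys/fs/nr_open and the hard limit
 * granted to the process allow it. */
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("current soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    rl.rlim_cur = 1000000;          /* desired soft limit (example value)   */
    if (rl.rlim_max < rl.rlim_cur)  /* hard limit must be >= soft limit;    */
        rl.rlim_max = rl.rlim_cur;  /* raising it needs CAP_SYS_RESOURCE    */

    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    printf("open-file limit raised to %llu\n",
           (unsigned long long)rl.rlim_cur);
    return 0;
}
```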
One thing to note is that with all these connections, CPU was not the bottleneck, but memory was. In fact, CPU was still at around 47% when the server reached 12M connections, but memory was filled to the brim.
Besides this, some JVM tuning was also done, which is beyond the scope of my mind. :-) All in all, they broke the C10M barrier with a properly tuned stock kernel!! Which is a magnificent achievement.
If you read some papers such as this, you will see that it is a bit hard to beat custom ASICs for packet routing at line rate with general-purpose x86 CPUs. So, how do we make the networking world more open, like the server world, given this constraint? If we look at the x86 world, what is standardized is an ISA (Instruction Set Architecture), on top of which layers upon layers of multi-source stacks have been built.
Is it possible to standardize such an ISA for custom ASICs? Even though they are custom, all of the ASICs (whether closed ASICs from Cisco and Brocade, or commodity ASICs from Broadcom, Cavium, etc.) do one specific thing: network-related processing. Given this, is it not possible to standardize the ISA? Maybe it is not so straightforward to achieve, so people ended up with things such as OpenFlow to standardize the interface to the hardware layer.
Anyway, the article linked above (by James Hamilton) gives a very good contrast between what happened in the server world and where the networking world is now and how it is moving. As the Cumulus folks said, history does not repeat, but it does rhyme! 🙂
This article gives a very nice summary of the pros and cons of the different I/O multiplexing mechanisms provided by POSIX operating systems (like Linux/FreeBSD).
We will assume here that select is level 0 to start with (and unfortunately, it is still being used by tons of legacy code). The basic problems of select are:
- A fixed-size bitmap with which you specify the fds
- Stateless: every call has to set up the full interest set again, because once select returns, only the active (result) fds remain set
- Both the kernel (when select is called) and the process (when select returns) scan through all the fds to see which are set, which is unnecessary work
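To make the statelessness concrete, here is a minimal select() sketch that just watches stdin (a stand-in for a real socket): the fd_set has to be rebuilt before every call, and every candidate fd has to be tested with FD_ISSET afterwards.

```c
/* Minimal select() sketch watching stdin; the interest set must be rebuilt
 * on every iteration because select() overwrites it with the result. */
#include <stdio.h>
#include <unistd.h>
#include <sys/select.h>

int main(void) {
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);              /* rebuild the whole interest set  */
        FD_SET(STDIN_FILENO, &readfds); /* ...every single time            */
        int maxfd = STDIN_FILENO;

        int n = select(maxfd + 1, &readfds, NULL, NULL, NULL);
        if (n < 0) { perror("select"); return 1; }

        /* Scan every fd up to maxfd to find the active ones. */
        for (int fd = 0; fd <= maxfd; fd++) {
            if (FD_ISSET(fd, &readfds)) {
                char buf[256];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) return 0;   /* EOF or error: stop              */
                printf("read %zd bytes from fd %d\n", r, fd);
            }
        }
    }
}
```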
poll tries to solve some of these problems:
- The fixed-size bitmap is replaced by a variable-length array
- Two separate fields track interest and result: interest is given through one field (events) and the result is returned in another (revents). The advantage here is that we don’t need to re-initialize the whole interest set every time we call poll
- The kernel does not need to scan through all the fds, since only the interested fds are passed to it (no bitmap, just an array of interested fds). However, when poll returns, the process still has to scan through all the fds to find which ones are active
- poll is still stateless: the interested fds are copied every time from user space to kernel space and vice versa
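A corresponding minimal poll() sketch, again watching only stdin as a stand-in for real sockets: interest (events) and result (revents) live in separate fields, so the array is set up once, but it is still copied to the kernel on every call and the process still walks the whole array afterwards.

```c
/* Minimal poll() sketch: events (interest) and revents (result) are separate,
 * but the whole pollfd array is still handed to the kernel on every call. */
#include <stdio.h>
#include <unistd.h>
#include <poll.h>

int main(void) {
    struct pollfd fds[1];
    fds[0].fd = STDIN_FILENO;       /* set the interest set once...         */
    fds[0].events = POLLIN;

    for (;;) {
        int n = poll(fds, 1, -1);   /* ...but it is copied in on every call */
        if (n < 0) { perror("poll"); return 1; }

        /* The process still scans the whole array for results. */
        for (int i = 0; i < 1; i++) {
            if (fds[i].revents & POLLIN) {
                char buf[256];
                ssize_t r = read(fds[i].fd, buf, sizeof(buf));
                if (r <= 0) return 0;
                printf("read %zd bytes from fd %d\n", r, fds[i].fd);
            }
        }
    }
}
```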
Then came along epoll:
- It made the transition to stateful event tracking: programs register an fd once and it will be tracked until the program wishes to remove it.
- So there is no need to send the list of interested events to the kernel on every call, nor does the kernel need to scan the interest list every time; it is already maintained.
- A separate function (epoll_wait) blocks waiting for any event to happen, and when it returns, the process scans only the result events, not all registered events (as poll does)
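A minimal epoll sketch for comparison (Linux-only, again using stdin as a stand-in for a socket): the fd is registered once with epoll_ctl, and epoll_wait hands back only the ready events, so there is no per-call copy of the interest set and nothing to rescan.

```c
/* Minimal epoll sketch: register stdin once, then wait; only ready events
 * come back, so there is nothing to rescan. (Linux-specific.) */
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void) {
    int epfd = epoll_create1(0);
    if (epfd < 0) { perror("epoll_create1"); return 1; }

    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = STDIN_FILENO;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) < 0) {
        perror("epoll_ctl");            /* register the fd exactly once    */
        return 1;
    }

    for (;;) {
        struct epoll_event events[16];
        int n = epoll_wait(epfd, events, 16, -1);
        if (n < 0) { perror("epoll_wait"); return 1; }

        for (int i = 0; i < n; i++) {   /* only ready events are returned  */
            char buf[256];
            ssize_t r = read(events[i].data.fd, buf, sizeof(buf));
            if (r <= 0) return 0;
            printf("read %zd bytes from fd %d\n", r, events[i].data.fd);
        }
    }
}
```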
Then came along kqueue:
- kqueue is an even more generic framework where not just sockets, but timers, signals, regular files, processes, etc. can be tracked for events.
- The framework is such that it is easily extensible to any future events that may get implemented.
- In other words, kqueue is a one-stop call for tracking all kinds of events. Just that it is not available on Linux yet, only on FreeBSD. 🙂
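For completeness, a minimal kqueue sketch (FreeBSD/macOS) showing the "all kinds of events in one place" idea: it watches stdin for readability and a one-shot 5-second timer through the same kevent() call. The timer id and interval are arbitrary example values.

```c
/* Minimal kqueue sketch (FreeBSD/macOS): one kevent() call can watch a
 * descriptor (here stdin) for readability and a timer at the same time. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

int main(void) {
    int kq = kqueue();
    if (kq < 0) { perror("kqueue"); return 1; }

    struct kevent changes[2];
    /* Readability on stdin, plus a one-shot timer (id 1) after 5000 ms. */
    EV_SET(&changes[0], STDIN_FILENO, EVFILT_READ, EV_ADD, 0, 0, NULL);
    EV_SET(&changes[1], 1, EVFILT_TIMER, EV_ADD | EV_ONESHOT, 0, 5000, NULL);

    /* Register both events and block until at least one of them fires. */
    struct kevent events[2];
    int n = kevent(kq, changes, 2, events, 2, NULL);
    if (n < 0) { perror("kevent"); return 1; }

    for (int i = 0; i < n; i++) {
        if (events[i].filter == EVFILT_READ)
            printf("fd %d is readable\n", (int)events[i].ident);
        else if (events[i].filter == EVFILT_TIMER)
            printf("timer %d fired\n", (int)events[i].ident);
    }
    return 0;
}
```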
When we want to connect two switches and make them an MLAG pair, the following things should be kept in mind:
- Both the switches should be advertising the same system id to the partner
- Northbound or southbound BUM traffic should not be sent twice to the LACP partner (some rules possible in Linux to drop the traffic?)
- MAC table should be synchronized between the two switches
- MAC learning should be such that MAC addresses learned from the LACP partner do not bounce between the two switches
- Neighbor table should be in sync between the two switches (since it is not really L3, but somewhat L2 too)
- Since we use LACP, the connection between two switches and the partner must be direct
The links to the switches can be many and asymmetrical, and when two pairs of switches form an MLAG with each other, it becomes a fully meshed network (all connecting to all), with each pair presenting the same system MAC address (when a host is there instead of a pair of switches, this same system MAC address is taken care of by the bonding driver).
Nice summary here. Apparently, Cumulus Linux uses “clagd” to periodically synchronize the MAC database to the peer switch. “clagd”, among other things, sets up the peering relationship, determines the primary/secondary roles for sending BPDUs, etc. There is also a presentation by Cumulus on MLAGs which is a simple but nice read.
Given that officially it is not yet available on Ubuntu, this article gives some hints on how to install it.
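To make the clagd pieces a bit more concrete, here is a rough sketch of what the relevant bits of /etc/network/interfaces could look like on one of the two Cumulus switches. The port names, addresses, clag-id and the shared system MAC below are made-up example values, so treat this as an illustration rather than a working config.

```
# Peer link bond between the two MLAG switches (port names are examples)
auto peerlink
iface peerlink
    bond-slaves swp49 swp50

# VLAN interface on the peer link that clagd uses for its peering session
auto peerlink.4094
iface peerlink.4094
    address 169.254.1.1/30
    clagd-peer-ip 169.254.1.2        # the other switch's address on this VLAN
    clagd-sys-mac 44:38:39:ff:00:01  # same shared system MAC on both switches
    clagd-backup-ip 10.0.0.2         # out-of-band check if the peer link dies

# Dual-connected bond towards the LACP partner (host or another MLAG pair)
auto bond1
iface bond1
    bond-slaves swp1
    clag-id 1                        # must match on both MLAG switches
```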
If you want to understand why vMotion has a problem across an IP fabric and what to do about it, look no further.