The benchmark for “software” isn’t what you get out of a Linux (or other OS) network stack – that’s 100 times slower than the theoretical speed of the hardware. Instead, the benchmark is the same hardware running different software, such as the open-source PF_RING driver or Intel’s DPDK.
Some excerpts from the original post:
- Computers now have 1000 times the memory they had 15 years ago, when they first started handling 10 thousand connections.
- The limits of scalability aren’t our hardware, but the software, specifically, the operating system.
- The first problem is that there is no “fast path” for data. It’s a conflict between a “multi-user” design and a “single-purpose” design.
- The second problem is multi-core scalability. Old code is written according to “multi-threaded” or “multi-tasking” principles, where multiple tasks must share a single CPU. New code must be written using different principles, where a single task must be split across multiple CPUs.
- The third problem is memory. In the last 15 years, the size of main memory has gone up a thousand times, and it’s now practical to get 128 gigabytes in a cheap desktop computer. Yet L1 and L2 caches have remained the same size, with L3 (or last-level) caches only going up 10 times. This means at scale, every pointer is a cache miss. To fix this, code must pay attention to the issues of caching and paging.
- The solution to all these problems is to partition the machine into two parts: the “control” part and the “data” part. The data-partition grabs the network cards it wants and replaces the drivers with ones that bypass the kernel. The data-partition grabs all the CPU cores it wants and removes them from the operating system scheduler, so no other tasks will run on them. The data-partition grabs the bulk of the RAM and manages it itself.
- A decade ago, engineers tackled what they called “the c10k problem” of making servers handle 10 thousand simultaneous connections. The conclusion was to fix the kernels, and write applications in an asynchronous manner. Today with C10M, we start with asynchronous code, but instead of better kernels, we move our code completely out of the kernel.
The limits of scalability aren’t our hardware, but the software, specifically, the operating system.
Can you believe that moving network processing out of the kernel can actually increase its performance?
It’s still hard to believe, but Robert Graham is making exactly this point. In his own words:
The way network stacks work today is to let the kernel do all the heavy lifting. It starts with kernel drivers for Ethernet cards, which pass packets to the kernel’s TCP/IP stack. Upon reception, a packet must make an arduous climb up the network stack until it finally escapes to user-mode.
Instead of moving everything into the kernel, we can move everything into user-mode. This is done first by rewriting the network driver. Instead of a network driver that hands off packets to the kernel, you change the driver so that it doesn’t. Instead, you map the packet buffers into user-mode space.
In a recent benchmark, Intel demonstrated a system using an 8-core 2.0-GHz 1-socket server forwarding packets at a rate of 80 million packets/second. That means receiving the packet, processing it in user-mode (outside the kernel), and retransmitting it. That works out to 200 clock cycles per packet.
For example, I was having a discussion about DNS servers on Twitter. I was throwing around the number of “10 million DNS requests per second”. The other person said that this was impossible, because you’d hit the packets-per-second performance limit of the system. As the Intel benchmarks show, this is actually 12% of the packet limit of the system.
The general concept we are working toward is the difference between the “control plane” and the “data plane”. A hundred years ago, telephones were just copper wires. You needed a switchboard operator to connect your copper wire to your destination copper wire. With the digital revolution in the late 1960s and early 1970s, wires became streams of bits, and switchboards became computers. AT&T designed Unix to control how data was transmitted, but not to handle data transmission itself. Thus, operating system kernels are designed to carry the slow rate of control information, not the fast rate of data. That’s why you get a 100-to-1 performance difference between custom drivers and kernel drivers.
For example, back in the year 2000 at DefCon, I brought up the fact that my intrusion detection system (IDS) running on Windows on my tiny 11-inch notebook could handle a full 148,800 packets/second. Back then kernel drivers caused an interrupt for every packet, and hardware was limited to about 15,000 interrupts-per-second. People had a hard enough time accepting the 10x performance increase – that a tiny notebook could outperform the fastest big-iron server hardware, and that Windows could outperform their favorite operating systems like Solaris or Linux. They couldn’t grasp the idea of simply turning off interrupts, and that if you bypass the operating system, it doesn’t matter whether you are running Windows or Linux. These days, with PF_RING having a similar architecture to the BlackICE drivers, this is more understandable, but back then it was unbelievable.
Robert Graham gives a very interesting perspective to performance and scalability.
Performance is the time taken to get a job done. For a webserver, for example, it can be measured in requests per second: the higher the number of requests per second, the less time each job takes.
Scalability is the ability to handle more jobs without affecting the measured performance. For example, suppose the measured performance is one second per job (request). If the system can handle 1000 jobs at one second per job, but adding more jobs increases the time taken per job (for whatever reason), then the system is scalable up to 1000 jobs. Because the measured performance drops beyond 1000 simultaneous jobs, we say the system is not scalable past that point (assuming performance cannot be compromised).
The average time taken to get a job done may go up for many reasons when more jobs are added. For example, the time a job spends waiting in a queue to be processed is one of the main factors affecting the response time and throughput of the system. What can be done to improve scalability? We can add more cores, more threads, and so on.
For example, Apache runs one thread per connection. So keeping 10,000 connections active with the webserver may not be feasible, because the system may not allow Apache to create that many threads. For Apache to handle 10,000 simultaneous connections, it has to revamp its architecture.
Generally when people talk about scalability, their intention is to keep the performance fixed at a desired level and still able to handle more load.
If a program is multi-threaded, that does not mean it is suitable for multi-core. Multi-threaded programs are (were) written so that threads synchronize among themselves, using various locking mechanisms such as mutexes, spin locks (within the kernel), semaphores, etc. In most user-level programs (for example, those that use pthreads), whenever a thread wants to synchronize, it tries to acquire a lock. If the lock is already held, the acquiring thread goes to sleep in a queue. Sleeping means the kernel takes over: it does a context switch, runs a scheduling algorithm to find the best candidate to run, and runs it.
On a single CPU (or a CPU with a single core), this is necessary, because without it the other thread could never run and release the lock. But on a multi-core CPU, that other thread may be running on another core while the thread on the current core is being put to sleep; the lock may be released, and the sleeper ready again, much sooner. On a multi-core CPU, this context switching poses so much overhead that scalability is not linear but sub-linear. That is, the kernel gets in the way of multi-threaded programs with synchronization so much that performance begins to hurt after adding just a few cores.
Robert Graham at ErrataSec has a very nice article on multi-threaded vs multi-core programs. He mentions that the performance of a system peaks at 4 cores and begins to decline after that.
And the comments section mentioned “Erlang”, which works very well on multi-core platforms. I should take a look at it.
Some very important points to note from that post:
Unable to increase clock speeds, chip companies have been adding extra logic to increase throughput – multiple instructions per clock cycle, multiple cores, etc.
Multi-threaded programs are not multi-core ready.
For multi-core, we need more independent execution, rather than mutually synchronizing models.
Avoid sharing as much as possible.
Do a “Divide and Conquer” w.r.t. sharing – that is, maintain your own states, and merge/join them when needed to get the required state.
Use lock free versions of data structures. Use stuff such as RCU.
Two basic models in multi-core: pipelining and the worker-thread model. The former is like an assembly line, while the latter is like a bunch of robots, each doing everything from start to end. When there is something to share, put a pipeline stage there. When work can be done independently, the worker-thread model is desirable.