HTTP and Page Load Times

Some salient points to be noted from this article on improving the page load times:

  • Having KeepAlives on has two advantages: the extra time for the TCP three-way handshake is not needed, and slow start won't happen again, which means the current congestion window is used to get the data from the server.
  • With ADSL connections (typical downstream to upstream ratios being 5:1), if the requests are large, the upload bandwidth can become a bottleneck, which increases the page load time.
  • If we have pipelining enabled, the latency part of the pipe between the client and server can be reasonably hidden.
  • By having more than one connection between the client and server, content can be downloaded in parallel.
  • Use of AJAX will reduce the request/response sizes and hence things will download faster.
  • The disadvantage of a KeepAlive connection is at the server side – the connection resources (which are limited) are held up there.
  • KeepAlive is an HTTP layer concept, so the server maintains a timer and closes the connection once the KeepAlive timeout expires.
  • Serving static content from a separate server meant for it will take the load off the dynamic content server.
  • There is a technique called CSS Sprites that can be used to combine multiple small images into a single file so that all the images can be downloaded in one request thus taking latency out of the picture to some extent.
  • When different hostnames are used (even if the IP address backing them is the same), browsers tend to open separate connections for each hostname. So, addressing different resources with different hostnames will also reduce the page load time: the more hostnames a page uses, the more downloads run in parallel and the lower the average latency per object.
  • Preferably, an image that was loaded from a specific hostname should be loaded from that same hostname every time, because the contents may already be cached under that URL.
  • Content can not only be cached at the server side, but can also be cached at the browser side.
  • Apparently the “Expires” header can be used to say how long an object can be cached, but which objects it can be used for is something that still needs to be figured out.
  • In general, a “?” in the URL will make caches not cache the response.
  • Setting another domain to serve static content will also make the headers small because cookies and other such stuff need not be sent along to this domain.
  • Conditional GETs are also available: the browser sends back an object-specific validator it received earlier (such as an ETag or Last-Modified value), and if the object is unchanged the server replies with “304 Not Modified” so that the browser need not load the whole object again.
  • On the whole, the very desired feature – pipelining – is disabled by default on browsers (and Chrome is a bit worse – it does not support it at all) for reasons unknown.
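
The conditional GET exchange mentioned in the list can be sketched roughly as follows. This is a minimal sketch, not how any particular web server implements it: the helper names `make_etag` and `handle_get` are made up for illustration, and the ETag is assumed to be a truncated SHA-256 of the body.

```python
import hashlib

def make_etag(body: bytes) -> str:
    """Derive a validator from the content (hypothetical scheme: truncated SHA-256)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def handle_get(body: bytes, request_headers: dict) -> tuple:
    """Return (status, headers, payload) for a GET on a resource with this body."""
    etag = make_etag(body)
    if request_headers.get("If-None-Match") == etag:
        # The browser's cached copy is still current: send headers only.
        return 304, {"ETag": etag}, b""
    # No validator, or a stale one: send the whole object again.
    return 200, {"ETag": etag}, body

css = b"body { color: black; }"

# First visit: the browser has no validator yet, so the full object comes back.
status, headers, payload = handle_get(css, {})

# Revisit: the browser echoes the ETag it stored alongside the cached copy.
status2, _, payload2 = handle_get(css, {"If-None-Match": headers["ETag"]})

print(status, len(payload), status2, len(payload2))
```

The second response carries no body at all, so on a revisit only the request and a handful of headers cross the slow part of the pipe.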

SSD controllers and File Systems

ACM Queue carries an interesting article on the effects of deduplication on file system reliability.

The problem starts with the flash disk controller doing deduplication to avoid writing the same block twice on the disk. This is beneficial because it saves space and also reduces the number of write operations to the flash, directly improving its life.

The glitch is that file systems store redundant copies of the superblock and other metadata blocks on the disk for reliability: in case one copy goes bad, the other copy can be read. With flash controllers doing deduplication, only one physical copy is present on the disk. This means that if that physical copy goes bad, all logical copies go bad. Which is bad.

One possible solution to this problem would be to have something in the block that is different in the duplicate copies – this makes file system operations a bit slower, but still achievable, in my opinion.
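
To make the idea concrete, here is a toy model in Python. It is only a sketch under assumptions: `DedupStore` stands in for a controller that deduplicates by content hash, and `tag_copy` is a hypothetical way for the file system to make each replica differ by prefixing a copy index. Real controllers and file systems will differ in the details.

```python
import hashlib

class DedupStore:
    """Toy model of a controller that stores each distinct block only once."""

    def __init__(self):
        self.physical = {}  # content hash -> block bytes
        self.logical = {}   # logical block address -> content hash

    def write(self, lba: int, block: bytes) -> None:
        digest = hashlib.sha256(block).hexdigest()
        # Identical content maps to the same physical block.
        self.physical.setdefault(digest, block)
        self.logical[lba] = digest

    def physical_copies(self) -> int:
        return len(self.physical)

def tag_copy(block: bytes, copy_index: int) -> bytes:
    """Prefix a per-copy tag so that replicas are never byte-identical."""
    return copy_index.to_bytes(4, "big") + block

superblock = b"fs-metadata" * 8
replica_addresses = (0, 1000, 2000)

# Identical replicas collapse into a single physical block.
plain = DedupStore()
for lba in replica_addresses:
    plain.write(lba, superblock)

# Tagged replicas survive deduplication as separate physical blocks.
tagged = DedupStore()
for i, lba in enumerate(replica_addresses):
    tagged.write(lba, tag_copy(superblock, i))

print(plain.physical_copies(), tagged.physical_copies())  # prints "1 3"
```

The cost is that the file system has to strip (and ideally verify) the tag on every read of a replica, which is the "bit slower" part.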

So, hardware-based deduplication has its own share of issues that need to be tackled at the file system layer.

Browser Security from Google Chrome

ACM Queue has an interesting article on the security measures taken in the Google Chrome web browser to thwart attempts to attack and exploit the weaknesses of a browser.

The article nicely summarizes the three main things to achieve the above goal:

  • Mitigating or nullifying the actions that are caused by vulnerabilities.
  • Push updates frequently.
  • Warn users about malicious sites with the help of a global database of malicious sites.

The first part consists of two things: try preventing the damage in the first place and, if the damage happens, keep it isolated so that it won’t have any side effects. Measures can be taken to prevent malicious code execution with the help of OS/hardware/tools. The techniques include:

  • Data Execution Prevention: Mark pages that hold the heap, stack etc. with the NX (not executable) flag, so that when a buffer overflow or a similar flaw is exploited to plant code on the stack or heap, that code cannot be executed. The process will just crash.
  • Stack overflow check: A small random value is placed on the stack between the local buffers and the return address. Before returning, that value is checked. If it has changed, a buffer on the stack has overflowed. This feature is provided by the compiler. This is so simple a technique and I wonder why modern compilers do not have this feature by default.
  • Address Space Layout Randomization: This seems to be a new feature where the data/stack/heap sections start at different addresses on each run, unlike the current way of starting them at well-known virtual addresses in the process address space. This makes locating those sections difficult for an attacker.
  • Heap Corruption Detection: This is not very cleanly achievable unless the virtual machine supports it as a native feature.
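
The stack overflow check from the list above can be simulated in a few lines of Python. This is only an illustration of the idea, not what a compiler actually emits: the frame layout, the sizes, and the helper names (`make_frame`, `unchecked_copy`, `canary_intact`) are all assumptions made for this sketch.

```python
import secrets

BUF_SIZE = 8      # size of the local buffer (assumed)
CANARY_SIZE = 4   # size of the random guard value (assumed)

def make_frame(return_address: bytes):
    """Lay out a simulated frame: [buffer][canary][return address]."""
    canary = secrets.token_bytes(CANARY_SIZE)
    frame = bytearray(BUF_SIZE) + canary + return_address
    return frame, canary

def unchecked_copy(frame: bytearray, data: bytes) -> None:
    """Like strcpy: writes from the start of the buffer with no bounds check."""
    n = min(len(data), len(frame))
    frame[:n] = data[:n]

def canary_intact(frame: bytearray, canary: bytes) -> bool:
    """The check done just before the function would return."""
    return bytes(frame[BUF_SIZE:BUF_SIZE + CANARY_SIZE]) == canary

# Input that fits the buffer leaves the guard value alone.
frame, canary = make_frame(b"\x00\x40\x10\x00")
unchecked_copy(frame, b"A" * BUF_SIZE)
print("fits:", canary_intact(frame, canary))

# Oversized input has to run through the guard value to reach the return address.
frame, canary = make_frame(b"\x00\x40\x10\x00")
unchecked_copy(frame, b"A" * 16)
print("overflow:", canary_intact(frame, canary))
```

Because the guard value is random, an attacker who wants to reach the return address cannot simply write the right bytes over it, short of guessing or leaking it.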

The main security vulnerabilities seem to crop up in the rendering engine, where the JavaScript code is executed, page rendering is done etc. Chrome runs the rendering engine inside a sandbox so nothing explodes out of it. This is another way to prevent vulnerabilities from showing side effects.

That completes the first part. The second part is about pushing patches painlessly to the clients. While it is still not possible to apply patches without restarting the browser (what, huh! Linux has a way to apply kernel patches without rebooting the kernel), Google has still come a long way to make it simpler. The updates that are pushed are incredibly small because of their smart diff tool, Courgette. The net effect is that updates can be pushed faster and more often, which means vulnerabilities are fixed sooner.

The last part of the job is to inform the user beforehand about visiting a potentially malicious site. This job is technically simple when compared to the above two jobs. Collaborate with a site (StopBadware.org) and keep an updated list of malicious sites. There is no need to push the user's URL to that website; the browser can download the list (or a hashed form of the list) and check locally whether the user is entering a malicious website. This is the simplest of the three jobs. Prevention is better than handling which is better than cure.
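
A local check against a downloaded list could look roughly like the sketch below. It is not Chrome's actual protocol: the use of SHA-256, the 4-byte prefix length, and the function names are assumptions made for illustration, and the hostnames are made up.

```python
import hashlib

PREFIX_LEN = 4  # bytes of each digest kept in the local list (assumed value)

def host_prefix(host: str) -> bytes:
    """Hash the hostname and keep only a short prefix of the digest."""
    return hashlib.sha256(host.lower().encode("utf-8")).digest()[:PREFIX_LEN]

def build_local_list(bad_hosts) -> set:
    """What the list provider would ship: prefixes only, no readable hostnames."""
    return {host_prefix(h) for h in bad_hosts}

def looks_malicious(host: str, local_list: set) -> bool:
    """Check done entirely inside the browser; the visited host is never sent out."""
    return host_prefix(host) in local_list

local_list = build_local_list(["evil.example", "phish.example"])

print(looks_malicious("EVIL.example", local_list))
print(looks_malicious("good.example", local_list))
```

Short prefixes keep the download small, but two different hosts can share a prefix, so a match in a scheme like this would need a confirmation step before a warning is shown.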

One thing worth noting is the extent of automated testing done by Chrome engineers to assure the quality of the product. In their own words:

The Google Chrome team has put significant effort into automating step 3 as much as possible. The team has inherited more than 10,000 tests from the WebKit project that ensure the Web platform features are working properly. These tests, along with thousands of other tests for browser-level features, are run after every change to the browser’s source code.

In addition to these regression tests, browser builds are tested on 1 million Web sites in a virtual-machine farm called ChromeBot. ChromeBot monitors the rendering of these sites for memory errors, crashes, and hangs. Running a browser build through ChromeBot often exposes subtle race conditions and other low-probability events before shipping the build to users.

All in all, professional act!

On the JPMC outage

The blogosphere is abuzz about the JPMC outage (1, 2, 3). The basic reason people cite for the long recovery time is a big, ambitious database design: everything (even the less critical data) was stuffed into one database, which then takes a long time to recover.

The basic reason the outage occurred in the first place is a software bug: Oracle corrupted some files. Besides, this corruption reached the mirror image too, because of which the tape backup had to be brought in.

I was wondering whether it would have done some good if the standby mirror were a versioning volume/filesystem, so that corruptions could be got rid of and an old copy restored almost immediately. Is there any difficulty with that? I am sure this versioning can be taken care of without exposing any extra detail to the higher layers.
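
Here is a rough sketch of what I mean, as a toy in Python. The class `VersionedVolume` and its methods are hypothetical, and a real volume manager would share unchanged blocks copy-on-write rather than copy the whole block map for each snapshot.

```python
class VersionedVolume:
    """Toy block volume that keeps immutable snapshots of its block map."""

    def __init__(self):
        self.blocks = {}     # block number -> bytes (current state)
        self.snapshots = []  # frozen block maps, oldest first

    def write(self, block_no: int, data: bytes) -> None:
        self.blocks[block_no] = data

    def read(self, block_no: int) -> bytes:
        return self.blocks[block_no]

    def snapshot(self) -> int:
        """Freeze the current state and return its snapshot id."""
        self.snapshots.append(dict(self.blocks))
        return len(self.snapshots) - 1

    def rollback(self, snapshot_id: int) -> None:
        """Discard everything written after the given snapshot."""
        self.blocks = dict(self.snapshots[snapshot_id])

mirror = VersionedVolume()
mirror.write(7, b"good data")
known_good = mirror.snapshot()

# The corruption on the primary gets replicated to the mirror as an ordinary write.
mirror.write(7, b"\x00garbage\x00")

# Recovery is a rollback on the mirror instead of a restore from tape.
mirror.rollback(known_good)
print(mirror.read(7))  # prints b'good data'
```

The layers above only ever see `read` and `write`, so the versioning stays hidden from them; the open question is how often to snapshot and how much history to keep.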

Scalability

From this website:

Scalability is the ability to keep solving a problem as the size of the problem increases.

Scale is measured relative to your requirements. As long as you can scale enough to solve your problem then you have scale. If you can handle the number of objects and events required for your application then you can scale. It doesn’t really matter what the numbers are.

Scaling often creates a difference in kind for potential solutions. The solution you need to handle a small problem is not the same as you need to handle a large problem. If you incrementally try to evolve one into the other you can be in for a rude surprise, because it won’t work as you pass through different points of discontinuity.

Scale is not language or framework specific. It is a matter of approach and design.

The Power of Negative Thinking

Huh what? Negative thinking has power? You might wonder. One Talin explains that negative thinking is very essential for engineers. He explains:

We know what happens when engineers only think positively:

  • They build bridges that fall down
  • They build trains that derail
  • They build space shuttles that blow up.

He goes on further:

In order to insure the integrity of their work, engineers must ruthlessly and relentlessly hunt down and eliminate their own errors. They must have a total commitment to the task of cleansing their design of even the smallest flaw. And they must resist the human temptation to take the short view, to say good enough too soon. Instead, they maintain an unreasonable persistence and patience, utilizing negative thinking, pessimism, and, perhaps, paranoia to assume there are still flaws remaining in the design even when there aren’t. Perhaps words such as paranoid and pessimistic aren’t correct, but they are words actual working engineers have used to describe themselves and their feelings towards their craft. What if the unthinkable really does happen? they say. What if all the backup systems fail?

He asks, “How do engineers develop these attitudes (that of paranoia and pessimism)?” He says engineers have an entire mythology for it:

One particularly strong mythic motif is exemplified by Murphy’s Law, coined by a test engineer at Edwards Air Force Base: If it can go wrong, it will go wrong. There is an entire panoply of corollary and associated laws, enough to fill a medium-sized book. The common thread that binds these laws together is the assumption there is something far more perverse than just the laws of random chance at work. It is a superstition, not only the source of much humor but also much utility.

A nice quote to be carried away from his work is:

Systems that work perfectly for the first time can make master engineers tremble with fear.
