Loops Per Jiffy to mdelay, udelay and ndelay

What do we do with LPJ? You can use that value and create delays in execution without sleeping (or blocking). But, what if we want to sleep for lesser durations? Like microseconds or nanoseconds? Then we need to do some arithmetic.

loops_per_jiffy = x;
loops_per_sec = x * HZ (there are HZ jiffies in a sec)
loops_per_usec = x * HZ * (1/1000000)
loops_per_n_usec = n * loops_per_usec

The basic problem with this is – loops_per_usec falls below zero and only integer operations are fast and it becomes zero.  Lets take an example:

loops_per_jiffy = 3500
HZ = 1000
loops_per_usec = 3500 * 1000 * (1/1000000) = 3.5

There is one problem with the above calculation. Linux does not use floating point arithmetic. Because floating point arithmetic is slow. If we use integer arithmetic – then we will have loops_per_usec as zero because of the (1/1000000) component.

So, how to make it non-zero? The path taken by kernel is as follows: it multiplies loops_per_usec with 2^32.  Because of that, whole product moves left by 32 bit and then you take the most significant 32 bits.

The asm-ppc has this in its code:

/*
 * Note that 19 * 226 == 4294 ==~ 2^32 / 10^6, so
 * loops = (4294 * usecs * loops_per_jiffy * HZ) / 2^32.
 *
 * The mulhwu instruction gives us loops = (a * b) / 2^32.
 * We choose a = usecs * 19 * HZ and b = loops_per_jiffy * 226
 * because this lets us support a wide range of HZ and
 * loops_per_jiffy values without either a or b overflowing 2^32.
 * Thus we need usecs * HZ <= (2^32 - 1) / 19 = 226050910 and
 * loops_per_jiffy <= (2^32 - 1) / 226 = 19004280
 * (which corresponds to ~3800 bogomips at HZ = 100).
 *  -- paulus
 */
#define __MAX_UDELAY    (226050910UL/HZ)    /* maximum udelay argument */

extern __inline__ void __udelay(unsigned int x)
{
    unsigned int loops;

    __asm__("mulhwu %0,%1,%2" : "=r" (loops) :
        "r" (x), "r" (loops_per_jiffy * 226));
    __delay(loops);
}

#define udelay(n) (__builtin_constant_p(n)? \
    ((n) > __MAX_UDELAY? __bad_udelay(): __udelay((n) * (19 * HZ))) : \
    __udelay((n) * (19 * HZ)))

PPC’s mulhw instruction takes two numbers, multiplies them and divides them by 2^32. So, the code takes all the numbers and segregates them into a and b such that a and b does not overflow for the maximum possible value of udelay. With these constraints, __MAX_UDELAY is defined. So, the __udelay() function then taken the number of usecs, makes sanity checks and then passes on “a” as the parameter to the function. The function already has “b” with it, takes a and uses mulhw to get us the loops which is then passed to __delay function to produce so much delay.

And for the ndelay calculation, the code looks like this:

#define __MAX_NDELAY    (4294967295UL/HZ)   /* maximum ndelay argument */

extern __inline__ void __ndelay(unsigned int x)
{
    unsigned int loops;

    __asm__("mulhwu %0,%1,%2" : "=r" (loops) :
        "r" (x), "r" (loops_per_jiffy * 5));
    __delay(loops);
}

#define ndelay(n) (__builtin_constant_p(n)? \
    ((n) > __MAX_NDELAY? __bad_ndelay(): __ndelay((n) * HZ)) : \
    __ndelay((n) * HZ))

The only thing that needs to be observed here is that – since 2^32/10^6 is 4294, 2^32/10^9 is 4.2. This is the only change that occurs. a and b are again separated so that max ndelay can be achieved. loops_per_jiffy is now multiplied with 5 (ceiling of 4.294).

Calibrating Loops Per Jiffy in Linux Kernel

I was going through the kernel code of calibrating the delay loop and was just amazed at the beauty of it. Thought, would share my joy here. Here is the source code for the warm up before calculation:

        while ((loops_per_jiffy <<= 1) != 0) {
            /* wait for "start of" clock tick */
            ticks = jiffies;
            while (ticks == jiffies)
                /* nothing */;
            /* Go .. */
            ticks = jiffies;
            __delay(loops_per_jiffy);
            ticks = jiffies - ticks;
            if (ticks)
                break;
        }

The above loop tries to set the most possible significant bit in loops_per_jiffy variable such that the resultant number (loops_per_jiffy) when passed to __delay function, will cause the __delay to take more than one jiffy to execute. In other words, starting from LSB, it moves the bit to the left (in effect doubling loops_per_jiffy) in every iteration such until this number is big enough to cause __delay function to take more than one jiffy. The void while loop inside is basically to let go of the current jiffy – that is, to start at the beginning of the next jiffy time window, by letting go of the current jiffy window (we don’t know where in the current jiffy window we are so if we start __delay now, then our loops_per_jiffy calculation is skewed).

Lets take a small notation here: lpj_x is loops_per_jiffy variable with only bit x set. And LPJ be the closest value of the variable that can cause the __delay function to lapse exactly jiffy unit of time.

When we come out of loop, the following assertion is true: lpj_x causes __delay to cross a jiffy and lpj_(x-1) does not. Does this mean that lpj_x is the exact delay that can cause a jiffy to pass? No. That exact LPJ is somewhere between lpj_x and lpj_(x-1). So, starting from here we have to find LPJ further.

And here in comes another piece of beauty.

        loops_per_jiffy >>= 1;
        loopbit = loops_per_jiffy;
        while (lps_precision-- && (loopbit >>= 1)) {
            loops_per_jiffy |= loopbit;
            ticks = jiffies;
            while (ticks == jiffies)
                /* nothing */;
            ticks = jiffies;
            __delay(loops_per_jiffy);
            if (jiffies != ticks)   /* longer than 1 tick */
                loops_per_jiffy &= ~loopbit;
        }

Since lpj_x is a loose upper bound on LPJ, we proceed to calculate a tighter upper bound. The above loop does it. To understand this, consider x to be 3. Then lpj_3 8 – it is 1000 in binary. And lpj_2 4 is 100. The more precise LPJ is between lpj_3 and lpj_2. We would start from 4 and proceed to 8. We start from lpj_2 and set further lsbs and see it comes close. That is, we start with 4, make it 6 and then 7 (100, 110 and 111). Although we miss out on some numbers, this is apparently considered to be good enough an approximation.

That is what the loop does exactly. As usual, the inner void while loop is to align ourselves to the beginning of a jiffy window.

At the end, we will have lpj close to LPJ.

The only other magic sauce here seems to be __delay. It is just an instruction that is used to some CPU intensive (or CPU only) operation for so many iterations. We use that as a tool to tune loops_per_jiffy.

Converting rm to mp3

is child’s play with Linux and mplayer.

  • mplayer -quiet -vo null -ao pcm:waveheader:file=”abc.wav” abc.rm
  • lame abc.wav abc.mp3

Follow

Get every new post delivered to your Inbox.