The AIM benchmark attempts to measure system throughput by running a large number of tasks (perhaps thousands of them), each of which is exercising some part of the kernel. Yanmin Zhang reported that his AIM results got about 40 per cent worse under the 2.6.26-rc1 kernel. He took the trouble to bisect the problem; the guilty patch turned out to be the generic semaphores code. Reverting that patch made the performance regression go away - at the cost of restoring over 7,000 lines of old, unlamented code. The thought of bringing back the previous semaphore implementation was enough to inspire a few people to look more deeply at the problem.
It did not take too long to narrow the focus to the BKL, which was converted to a semaphore a few years ago. That part of the process was easy - there aren't a whole lot of other semaphores left in the kernel, especially in performance-critical places. But the BKL stubbornly remains in a number of core places, including the fcntl() system call, a number of ioctl() implementations, the TTY code, and open() for char devices. That's enough for a badly-performing BKL to create larger problems, especially when running VFS-heavy benchmarks with a lot of contention.
Ingo Molnar tracked down the problem in the new semaphore code. In short: the new semaphore code is too fair for its own good. When a semaphore is released, and there is another thread waiting for it, the semaphore is handed over to the new thread (which is then made runnable) at that time. This approach ensures that threads obtain the semaphore in something close to the order in which they asked for it.
The problem is that fairness can be expensive. The thread waiting for the semaphore may be on another processor, its cache could be cold, and it might be at a low enough priority that it will not even begin running for some time. Meanwhile, another thread may request the semaphore, but it will get put at the end of the queue behind the new owner, which may not be running yet. The result is a certain amount of dead time where no running thread holds the semaphore. And, in fact, Yanmin's experience with the AIM benchmark showed this: his system was running idle almost 50 per cent of the time.
The solution is to bring in a technique from the older semaphore code: lock stealing. If a thread tries to acquire a semaphore, and that semaphore is available, that thread gets it regardless of whether a different thread is patiently waiting in the queue. Or, in other words, the thread at the head of the queue only gets the semaphore once it starts running and actually claims it; if it's too slow, somebody else might get there first. In human interactions, this sort of behavior is considered impolite (in some cultures, at least), though it is far from unknown. In a multiprocessor computer, though, it makes the difference between acceptable and unacceptable performance - even a thread which gets its lock stolen will benefit in the long run.
Interestingly, the patch which implements this change was merged into the mainline, then reverted before 2.6.26-rc2 came out. The initial reason for the revert was that the patch broke semaphores in other situations; for some usage patterns, the semaphore code could fail to wake a thread when the semaphore became available. This bug could certainly have been fixed, but it appears that things will not go that way - there is a bit more going on here.
What is happening instead is that Linus has committed a patch which simply turns the BKL into a spinlock. By shorting out the semaphore code entirely, this patch fixes the AIM regression while leaving the slow (but fair) semaphore code in place. This change also makes the BKL non-preemptible, which will not be entirely good news for those who are concerned with latency issues - especially the real time tree.