The kernel can't know if you want those low-priority processes to use all the CPU power on the system, or if you want them to pile up on one CPU and save power on the rest. Developers debate the best way to set the system's power rules.
The schedmcpower_savings parameter (cleverly hidden under /sys/devices/system/cpu) was introduced in the 2.6.18 kernel. If this parameter is set to one (the default is zero), it changes the scheduler load balancing code in an interesting way: it makes an ongoing effort to gather together processes on the smallest number of CPUs. If the system is not heavily loaded, this policy will result in some processors being entirely idle; those processors can then be put into a deep sleep and left there for some time. And that, of course, results in lower power consumption, which is a good thing.
Vaidyanathan Srinivasan recently noted that, while this policy works well in a number of situations, there are others where things could be better. The schedmcpower_savings policy is relatively conservative in how it loads processes onto CPUs, taking care to not overload those CPUs and create excessive latency for applications. As a result, the workload on a large system can still end up spread out more widely than might be optimal, especially if the workload is bursty. In response, Vaidyanathan suggests making the power savings policy more flexible, with the system administrator being able to select a combination of power savings and latency which works well for the workload. On systems where power savings matters a lot, a more aggressive mode (which would pack processes more tightly into CPUs) could be chosen.
This suggestion was controversial. Nobody disputes the idea that smarter power savings policy would be a good idea. But there is resistance to the idea of creating more tuning knobs to control this policy; instead, it is felt, the kernel should work out the optimal policy on its own. As Andi Kleen puts it:
Tunables are basically "we give up, let's push the problem to the user" which is not nice. I suspect a lot of users won't even know if their workloads are bursty or not. Or they might have workloads which are both bursty and not bursty.
There are a couple of answers to that objection. One is that the system cannot know, on its own, what priorities the users and/or administrators have. Those priorities could even change over time, with performance being emphasized during peak times and low power usage otherwise. Additionally, not all users see "performance" the same way; some want responsiveness and low latency, while others place a higher priority on throughput. If the system cannot simultaneously optimize all of those parameters, it will need guidance from somewhere to choose the best policy.
And that's where the other answer comes in: that guidance could come from user space. Special-purpose software running on large installations can monitor the performance of important applications and adjust resources (and policies) to get the desired results. Or, in a somewhat different vision, individual applications could register their performance needs and expected behavior. In this case, the kernel is charged with somehow mediating between applications with different expectations and coming up with a reasonable set of policies.