Linux load averages, for example from the top and uptime commands, can be massively incorrect on the low side. WWW.Smythies.com
Note (2012.06.24): Test results page for another proposed patch to improve reported load averages.
Note (2012.05.28): The HTML (and more up-to-date) version of the PNG below.
Note (2012.05.22): This revised PNG provides information about high reported load averages with the commit referenced below. It is best viewed at a zoom of 1:1; scroll down as you look at the graphs and read the text.
Note (2012.05.09): The patch was included in the Ubuntu Precise Pangolin 12.04 LTS release.
This site has noticed a huge increase in search query traffic to this web page, with search parameters such as "high load average on Ubuntu 12.04".
An issue has been raised against this patch indicating incorrect high reported load averages under conditions of extremely light load and high enter/exit idle frequency conditions (see the references section at the bottom of this web page).
This condition is most apparent with the Ubuntu desktop edition (which I don't use, but I have verified these erroneous high reported load averages by modifying my test program).
Some tests were done with low load averages. This PNG file write-up was sent to the firstname.lastname@example.org e-mail list and posted to the two Ubuntu Launchpad bug reports.
It is not clear to me if there is a solution within the current context of how load averages are calculated with tickless kernels.
Note (2012.03.30): This issue is fixed as of Commit-ID: c308b56b5398779cd3da0f62ab26b0453494c3d4
The patch (as described herein) was test backported to Ubuntu Kernels, one was: 3.2.0-20-generic #33~lp838811
The problem: Under conditions of several CPU-intensive processes per second, the kernel load average numbers are completely wrong.
For example, for an 8 cpu computer, it is fairly easy to demonstrate an actual load average of about 7.9 showing as a load average of 0.0.
As long as each CPU has some minimal idle time at a frequency greater than 25 hertz (or 10 hertz for version 10.10), the reported load average will be 0.0. (Actually, they don't even have to be different processes; they just have to have some idle time.)
The problem has been verified on Ubuntu server edition 10.10, 11.10, the development version of 12.04 and the 3.3RC5 kernel:
Linux doug-64 2.6.35-31-server #63-Ubuntu SMP Mon Nov 28 21:03:37 UTC 2011 x86_64 GNU/Linux
Linux test-smy 3.0.0-14-generic-pae #23-Ubuntu SMP Mon Nov 21 22:07:10 UTC 2011 i686 i686 i386 GNU/Linux
Linux s15 3.0.0-15-server #26-Ubuntu SMP Fri Jan 20 19:07:39 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Linux test-smy 3.2.0-16-generic-pae #25-Ubuntu SMP Tue Feb 14 04:00:45 UTC 2012 i686 i686 i386 GNU/Linux (3.2.0-17-26 also)
I do not know if the problem exists in desktop editions (but it must, as the code is the same).
The problem can be fixed by re-compiling the kernel with CONFIG_NO_HZ=n (which disables tickless, or dynamic, ticks at the build level).
What should happen: The kernel code should be fixed. A proposed fix is detailed further down herein.
The default kernel has CONFIG_NO_HZ=y, and the typical user's desire is to be able to use the stock kernel, as it makes keeping up with updates and such so much easier.
As of 2012.03.01, another patch was proposed by the kernel.org maintainers of this code area (Peter Zijlstra and Ingo Molnar; Peter made the patch, "peter02").
As expected, and as noted in my proposed patch comments, my proposed patch did not account for a very long system idle, which I could not create on my systems during testing.
Some test results with the "peter02" patch are added herein.
Linux Load Averages: There is an abundance of inaccurate information about Linux load averages. While these web notes are not intended as a basic lesson or definition, a couple of things should be clarified.
What is wrong?: With the default server kernel, if there is any CPU idle time, it appears as though processes of high CPU use but short duration are not included in the kernel load averages. Actually, they are included, but then get clobbered when new idle information is folded in at the incorrect time. Processes of longer duration are only partially included in the kernel load averages (only sometimes clobbered).
Method of test 1: Make a simple program to consume some CPU time, and make a simple script to call that program in a forever loop. Each call will have a separate PID (Process Identification). As long as there is some CPU idle time (which, for a single-CPU system, the user might need to create via a short sleep in the script) and the user's kernel has this issue, the reported load averages will be incorrect as long as the process frequency is high enough.
Example Program as a text file. (I never use this method anymore)
Example script as a text file.
An Example script to accumulate load averages over time. Possibly for subsequent graphing. (or just do it manually via the "uptime" or "top" commands.)
Method of test 2: Make a program to spawn several child processes that consume CPU time but also include a little sleep (idle). This allows us to easily create loads that show this issue, up to the number of CPUs minus a small amount; i.e. for an 8 CPU system the load can be driven to about 7.9 while still showing the issue.
This has become the preferred method of test.
Program as a text file. (I always use this method now, but have made some changes to the program.)
Example 1 (my 10.10 server, with 2 CPUs) (I suspect that 10.10 uses a 100 Hertz tick rate):
Process frequency: 1.03 Hertz: load average given is 0.10 too low.
Process frequency: 2.08 Hertz: load average given is 0.23 too low.
Process frequency: 4.17 Hertz: load average given is 0.37 too low.
Process frequency: 8.33 Hertz: load average given is 0.78 too low.
Process frequency: 9 Hertz: load average given is 0.87 too low. (And very noisy; see the example 2 beat frequency graph below.)
Process frequency: 11 Hertz: load average given is 1.00 too low. (Load showed as 0.)
Process frequency: 17 Hertz: load average given is 1.00 too low. (Load showed as 0.)
Note: Of course, it is ridiculous to list load averages to 2 decimal places, as I have done above. They are all very long term (at least 2 hours) averages.
Example 2 (my (newer) 11.10 server, with 8 CPUs (intel i7 processor)) (11.10 uses a 250 Hertz tick rate):
The below shows a beat frequency:
So why do we see the 10 and 25 hertz intercepts in the above graphs? The dynamic tick, or tickless, mode (compile time: CONFIG_NO_HZ=y) allows a 10 tick delay after a load update time for CPUs that might be idle, and in tickless mode, to catch up. This creates the 10 and 25 hertz intercepts seen above (100/10 and 250/10). Otherwise there is aliasing and such, causing the wild swings seen in the beat frequency graph. The system is trying to determine load information from a 5 second sample rate, yet events can be happening at 100s of hertz, so it just is not possible. However, it can be greatly improved by preventing any idle update during the 10 tick delay.
To make, what has turned out to be, a very long story short:
Original Proposed Patch: I think there should be only 1 inclusion of idle information into the load calculations during the 10 tick catch-up grace period. Thus, after the first inclusion, any further idle time during the grace period is not included until next time.
Proposed Patch variant 2: I think there should be no inclusion of idle information into the load calculations during the 10 tick catch-up grace period. Thus, if there is any idle time, no matter how short, during the grace period, it will not be included until next time.
I have not been able to detect any difference between the two solutions over, now, months of tests. It makes some sense because, at least for my servers, the one grace-period call to idle_fold is almost always zero.
Note: I have used kernel 3.0.0-15 for all of this work. It was the current kernel when I started and I got the source via "apt-get source linux-image-$(uname -r)".
The output of diff between the original sched.c code and my proposed code. (Hopefully in the proper format for launchpad.)
The output of diff between the original sched.c code and my proposed code variant 2. (Hopefully in the proper format for launchpad.)
E-mail from Peter Zijlstra of kernel.org with his proposed patch.
The output of diff between the original sched.c code and Peter patch 02. (After a separate e-mail from Peter suggesting the change.)
The output of diff between the original Peter patch (from above e-mail) and Peter patch 02.
Test results of the proposed patches:
~10 Hertz sleep frequency per process:
First do a control test for the above ~10 Hertz sleep rate:
Then re-do the above ~10 Hertz sleep rate test:
The above graph shows that things are much better. Whether it can be improved beyond what the proposed patch does, I don't know (and I don't have more time to spend on it).
~25 Hertz sleep frequency per process:
Now, do a ~25 Hertz sleep rate test: First, a control test with CONFIG_NO_HZ=y and no patch (I.E. a stock kernel):
Do a control test with no code changes, but compiled with CONFIG_NO_HZ=n:
Next, with CONFIG_NO_HZ=y (dynamic or tickless or whatever one calls it.) with the proposed modification to the calc_load area of kernel/sched.c
Next, with CONFIG_NO_HZ=y with the peter02 patch. (Command: c/waiter 2 51)
~95 Hertz sleep frequency per process:
O.K., so now let's ramp up the sleep frequency some, well into the area where the load average would always be zero without the patch. First, do a control test with no code changes, but compiled with CONFIG_NO_HZ=n:
Next, with CONFIG_NO_HZ=y (dynamic or tickless or whatever one calls it.) with the proposed modification to the calc_load area of kernel/sched.c
The reported load average is much closer to the real load average than it was before the patch (which would have reported 0). There might still be room for improvement, but recall that this is a sampled system, with events happening at a much, much higher frequency than the sampling frequency, which tends to lead to signal aliasing.
Still, attempts were made for further improvements, and a proposed patch variant 2 was derived (which ends up no different, but I didn't know it at the time). So now, with CONFIG_NO_HZ=y with the proposed patch variant 2.
Next, with CONFIG_NO_HZ=y with the peter02 patch. (Command: c/waiter 2 200)
~421 Hertz sleep frequency per process:
Now increase the sleep frequency considerably, beyond one idle enter/exit per tick. First, the control test with no code changes, but compiled with CONFIG_NO_HZ=n:
Next, with CONFIG_NO_HZ=y with the proposed modification to the calc_load area of kernel/sched.c
And finally, with CONFIG_NO_HZ=y with the proposed modification, variant 2, to the calc_load area of kernel/sched.c
Next, with CONFIG_NO_HZ=y with the peter02 patch. Note: Test terminated early so other tests could be done. (Command: c/waiter 2 1000)
~250 Hertz sleep frequency per process:
Some of the previous results suggest a race condition or similar. This test was just an attempt to see if a beat frequency could be discovered close to the tick frequency.
Test one, with CONFIG_NO_HZ=y with the peter02 patch. (Command: c/waiter 2 555)
Test two. (Commands: c/waiter 1 552, c/waiter 1 553, c/waiter 1 554, c/waiter 1 555)
Comparative test three. Kernel: 3.2.0-20-generic (bradf@tangerine) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu3) ) #33~lp838811 SMP Thu Mar 29 15:04:04 UTC 2012. (Command: c/waiter 2 555)
Implements Commit-ID: c308b56b5398779cd3da0f62ab26b0453494c3d4
~100 Hertz sleep frequency per process:
Tests were done, with CONFIG_NO_HZ=y with the peter02 patch. (Command: c/waiter 2 209)
(Commands: c/waiter 1 206, c/waiter 1 207, c/waiter 1 208, c/waiter 1 209)
Another test but with test method 1:
Now, a test using Method of test 1, a high number of different processes per second (29.9 per second, in this case). No control sample was taken.
The run time option nohz=off makes no difference for this issue. The load calculation code in kernel/sched.c is different depending on the compile time option CONFIG_NO_HZ, and the issue is in the code.
Similarly, for the run time option highres=off. For completeness, the grub line (for my system) for these options is:
GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 quiet nohz=off highres=off"
Myself, I think there is a valid argument questioning why the Ubuntu server versions should default to tickless kernels. The claim is that it is to save power. However, I cannot detect any power difference on my system, nor have I been able to find any test results that show significant power savings.
On my old, old, old 200 MHz P4 test computer with 128 megabytes of memory, the kernel compile took 33 hours, 4 minutes, and 57.6 seconds before it gave up with an out-of-memory error.
My new 3.4 GHz i7 computer takes 13 minutes and 10 seconds to compile the kernel.
I found this code area (kernel/sched.c - calc_load) extremely confusing and difficult to follow.
In kernel 3.3-rc2 this stuff has been moved to kernel/sched/core.c, but it looks the same; i.e. I think this patch could be forward-migrated.
Kernel 3.3-rc5 is the same.
Sometimes when the waiter.c program is run (test method 2), it takes quite a while for the CPUs to kick into a higher clock rate, perhaps due to the incorrect low load averages. This affects the overall average sleep frequency calculation.
Why does the waiter.c program take almost 2 times as long to execute the loops at high enough sleep frequencies (depends on sleep duration)? If the sleep time percentage becomes large enough, about 50% I think, then the CPUs will throttle back to a lower clock rate. Consider making the sleep duration a run-time variable.
However, it still remains a question as to why the CPUs do not throttle up quickly sometimes.
Additional note: The default CPU governor system is the "ondemand" method, and the default "up_threshold" is 95%, and the default "down_threshold" is 20%.
See also: CPUFreq Governors in Linux Kernel
Note: For my part of it, I made a mistake when I was testing my original patch proposal and the one from Peter Z. I did notice a huge change in load averages when the idle / not idle duty cycle went low enough. However, I attributed it all to the CPUs throttling back to a lower clock frequency. I continued testing with a preference to always having a high enough load to keep the CPUs at maximum frequency. My mistake was not realizing the contribution from incorrectly reported load averages at lower loads, in addition to the CPU clock frequency changes. See other web notes in this area with newer test results. The frequency governors are often now set to "powersave" mode for all CPUs while doing reported load average testing.
The Launchpad bug that I added to. (main reference)
The newer launchpad bug complaining of high load averages with the patch.
Another newer launchpad bug complaining of high load averages.
Another launchpad bug report. (this one has very relevant information and insight)
Ubuntu forums discussion with links to other references.
Debian bug report.
An interesting, but older, article.