Linux Intel Pstate driver. Method to reproduce long deferrable timer delay. WWW.Smythies.com
Note: there are multiple issues between the intel_pstate driver and the deferrable timers. This web page is only about 1 of them.
There seems to be a problem with the deferrable timers and periodic workflows that can result in the driver incorrectly pushing the CPU frequency downwards.
For the periodic cases where the CPU wakes up and does its chunk of work and goes back to sleep all within one jiffy boundary, the deferred timer system does not
realize this situation has occurred and the driver is not called. Eventually (after as long as 4 seconds), perhaps for some other reason, the driver is called,
and the duration method for detecting a relatively idle CPU incorrectly takes action to force the CPU target pstate downwards.
What we expected to happen was upon wakeup (or before going back to sleep) the system would realize that there is a deferrable timer call pending and perform
the call. This never seems to happen under these conditions. It is understood, and of course expected, that the CPU will not be woken up just to service the deferrable
timer.
A perfect example (1000 Hz kernel):
Consider a periodic workflow that has 600 microseconds of work to do at 100 hertz, and the job is started asynchronously with respect to the current jiffy count.
Then it is 40% probable that the CPU frequency will be pushed to minimum, whereas there is actually enough load that it should increase significantly from the minimum.
The intel_pstate driver gets called much much much more infrequently than it should.
What about the not perfect case?:
What about the more generic, but still periodic, case?
Well, then it comes down to beat frequencies between the
process and the jiffy boundaries. A great many cases were run with
different work/sleep frequencies and there are lots of
cases where the driver call is deferred enough to cause
the duration code to engage incorrectly. However in these
scenarios the CPU frequency will oscillate in step with
the beat frequency.
Note: Yes, fixed time work chunks does not consider the
more realistic scenario where the work time would scale
with CPU frequency. That just changes the threshold for
the issue, but not the root issue itself.
A Method to reproduce the issue:
Note: It is assumed that the reader knows how to make sure their system is using the intel pstate driver and that it is in powersave mode.
From the references section below, get and compile the consume program.
Extract this script, modify it for your particular kernel and your particular paths, and run it several times. On 40% of the runs the issue of discussion herein will occur.
doug@s15:~/temp$ cat set_cpu_timer_stats #! /bin/bash # Reset the counters echo "0" > /proc/timer_stats echo "1" > /proc/timer_stats # For a 1000 Hz kernel and for 40% probability of issue: # Apply a 6 percent load at 100 hertz for 20 seconds # taskset -c 5 /home/doug/c/consume 6.0 100.0 20.0 & # For a 250 Hz kernel and for 40% probability of issue: # Apply a 30 percent load at 125 hertz for 20 seconds taskset -c 5 /home/doug/c/consume 30.0 125.0 20.0 & sleep 10 # What is the CPU 5 frequency? echo -n "CPU 5 frequency (~ at or near minimum means forced low due to excessive timer deferral, ~~higher means O.K.): " cat /sys/devices/system/cpu/cpu5/cpufreq/cpuinfo_cur_freq sleep 11 cat /proc/timer_stats | grep intel echo "0" > /proc/timer_stats
Example: Notice the timer counts between the run that works properly and the one that doesn't
doug@s15:~/temp$ sudo ./set_cpu_timer_stats
CPU 5 frequency (~ at or near minimum means forced low due to excessive timer defarral, ~~higher means O.K.): 2287695
consume: 30.0 125.0 20.0 PID: 4125 Elapsed: 20000097 Now: 1403282894670797
1256D, 0 swapper/5 intel_pstate_timer_func (intel_pstate_timer_func)
27D, 0 swapper/0 intel_pstate_timer_func (intel_pstate_timer_func)
16D, 0 swapper/7 intel_pstate_timer_func (intel_pstate_timer_func)
25D, 0 swapper/4 intel_pstate_timer_func (intel_pstate_timer_func)
16D, 0 swapper/1 intel_pstate_timer_func (intel_pstate_timer_func)
31D, 0 swapper/2 intel_pstate_timer_func (intel_pstate_timer_func)
6D, 0 swapper/3 intel_pstate_timer_func (intel_pstate_timer_func)
4D, 0 swapper/6 intel_pstate_timer_func (intel_pstate_timer_func)
doug@s15:~/temp$ sudo ./set_cpu_timer_stats
CPU 5 frequency (~ at or near minimum means forced low due to excessive timer defarral, ~~higher means O.K.): 1599992
consume: 30.0 125.0 20.0 PID: 4133 Elapsed: 20000097 Now: 1403282921060854
17D, 0 swapper/7 intel_pstate_timer_func (intel_pstate_timer_func)
23D, 0 swapper/5 intel_pstate_timer_func (intel_pstate_timer_func)
18D, 0 swapper/1 intel_pstate_timer_func (intel_pstate_timer_func)
28D, 0 swapper/4 intel_pstate_timer_func (intel_pstate_timer_func)
34D, 0 swapper/0 intel_pstate_timer_func (intel_pstate_timer_func)
6D, 0 swapper/3 intel_pstate_timer_func (intel_pstate_timer_func)
36D, 0 swapper/2 intel_pstate_timer_func (intel_pstate_timer_func)
6D, 0 swapper/6 intel_pstate_timer_func (intel_pstate_timer_func)
Another Example: Notice the timer counts between the run that works properly and the one that doesn't
chad@A54C-NB91:~$ uname -r
3.15.0-031500-generic
chad@A54C-NB91:~$ cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq
2300000
chad@A54C-NB91:~$ cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq
800000
chad@A54C-NB91:~$ cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
powersave
chad@A54C-NB91:~$ sudo /tmp/script
CPU 1 frequency (~ 1.6 GHz means forced low, ~~higher means O.K.): 1938648
consume: 30.0 125.0 20.0 PID: 14153 Elapsed: 20000130 Now: 1403151105471872
1338D, 13345 consume intel_pstate_timer_func (intel_pstate_timer_func)
435D, 0 swapper/0 intel_pstate_timer_func (intel_pstate_timer_func)
chad@A54C-NB91:~$ sudo /tmp/script
CPU 1 frequency (~ 1.6 GHz means forced low, ~~higher means O.K.): 2117707
consume: 30.0 125.0 20.0 PID: 14732 Elapsed: 20000140 Now: 1403151134988980
1359D, 13345 consume intel_pstate_timer_func (intel_pstate_timer_func)
434D, 0 swapper/0 intel_pstate_timer_func (intel_pstate_timer_func)
chad@A54C-NB91:~$ sudo /tmp/script
CPU 1 frequency (~ 1.6 GHz means forced low, ~~higher means O.K.): 807156
consume: 30.0 125.0 20.0 PID: 17753 Elapsed: 20000110 Now: 1403151280566414
556D, 0 swapper/0 intel_pstate_timer_func (intel_pstate_timer_func)
514D, 13345 consume intel_pstate_timer_func (intel_pstate_timer_func)
chad@A54C-NB91:~$ sudo /tmp/script
CPU 1 frequency (~ 1.6 GHz means forced low, ~~higher means O.K.): 799968
consume: 30.0 125.0 20.0 PID: 20732 Elapsed: 20000095 Now: 1403151426773978
727D, 13345 consume intel_pstate_timer_func (intel_pstate_timer_func)
913D, 0 swapper/0 intel_pstate_timer_func (intel_pstate_timer_func)
chad@A54C-NB91:~$ sudo /tmp/script
CPU 1 frequency (~ 1.6 GHz means forced low, ~~higher means O.K.): 1967128
consume: 30.0 125.0 20.0 PID: 21410 Elapsed: 20000074 Now: 1403151461013224
1384D, 13345 consume intel_pstate_timer_func (intel_pstate_timer_func)
681D, 0 swapper/0 intel_pstate_timer_func (intel_pstate_timer_func)
chad@A54C-NB91:~$ sudo /tmp/script
CPU 1 frequency (~ 1.6 GHz means forced low, ~~higher means O.K.): 1826613
consume: 30.0 125.0 20.0 PID: 22007 Elapsed: 20000251 Now: 1403151491423607
1393D, 13345 consume intel_pstate_timer_func (intel_pstate_timer_func)
673D, 0 swapper/0 intel_pstate_timer_func (intel_pstate_timer_func)
Another Example: This time the wrong kernel is loaded, but we can still see when it the deferred counters are working properly or not via the counters
user@MS-7851:~$ cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
powersave
user@MS-7851:~$ cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq
3000000
user@MS-7851:~$ cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq
800000
shawn@MS-7851:~$
user@MS-7851:~$ sudo /tmp/script
CPU 1 frequency (~ 1.6 GHz means forced low, ~~higher means O.K.): 826875
consume: 30.0 125.0 20.0 PID: 9911 Elapsed: 20000100 Now: 1403195243564286
138D, 0 swapper/1 intel_pstate_timer_func (intel_pstate_timer_func)
182D, 2256 xfce4-terminal intel_pstate_timer_func (intel_pstate_timer_func)
user@MS-7851:~$ sudo /tmp/script
CPU 1 frequency (~ 1.6 GHz means forced low, ~~higher means O.K.): 826875
consume: 30.0 125.0 20.0 PID: 10572 Elapsed: 20000101 Now: 1403195275507036
1329D, 0 swapper/1 intel_pstate_timer_func (intel_pstate_timer_func)
388D, 2256 xfce4-terminal intel_pstate_timer_func (intel_pstate_timer_func)
user@MS-7851:~$ sudo /tmp/script
CPU 1 frequency (~ 1.6 GHz means forced low, ~~higher means O.K.): 826875
consume: 30.0 125.0 20.0 PID: 11131 Elapsed: 20000101 Now: 1403195302117907
168D, 2256 xfce4-terminal intel_pstate_timer_func (intel_pstate_timer_func)
1304D, 0 swapper/1 intel_pstate_timer_func (intel_pstate_timer_func)
user@MS-7851:~$ sudo /tmp/script
CPU 1 frequency (~ 1.6 GHz means forced low, ~~higher means O.K.): 1014375
consume: 30.0 125.0 20.0 PID: 11619 Elapsed: 20000100 Now: 1403195325361680
172D, 2256 xfce4-terminal intel_pstate_timer_func (intel_pstate_timer_func)
1284D, 0 swapper/1 intel_pstate_timer_func (intel_pstate_timer_func)
user@MS-7851:~$ sudo /tmp/script
CPU 1 frequency (~ 1.6 GHz means forced low, ~~higher means O.K.): 826875
consume: 30.0 125.0 20.0 PID: 12067 Elapsed: 20000095 Now: 1403195347599928
158D, 0 swapper/1 intel_pstate_timer_func (intel_pstate_timer_func)
190D, 2256 xfce4-terminal intel_pstate_timer_func (intel_pstate_timer_func)
user@MS-7851:~$ sudo /tmp/script
CPU 1 frequency (~ 1.6 GHz means forced low, ~~higher means O.K.): 826875
consume: 30.0 125.0 20.0 PID: 12513 Elapsed: 20000104 Now: 1403195369729762
173D, 2256 xfce4-terminal intel_pstate_timer_func (intel_pstate_timer_func)
1285D, 0 swapper/1 intel_pstate_timer_func (intel_pstate_timer_func)
user@MS-7851:~$ sudo /tmp/script
CPU 1 frequency (~ 1.6 GHz means forced low, ~~higher means O.K.): 826875
consume: 30.0 125.0 20.0 PID: 12981 Elapsed: 20000103 Now: 1403195392962285
161D, 2256 xfce4-terminal intel_pstate_timer_func (intel_pstate_timer_func)
1288D, 0 swapper/1 intel_pstate_timer_func (intel_pstate_timer_func)
References (files are often as a text file):
Consume Program (Changed since earlier writeups).
Kernel configuration (for the 250 hertz version of the kernel).
Note 1: Tests were also done with CONFIG_NO_HZ_FULL not set, no difference.
Note 2: Also, and as a sanity check, the boot parameter "nohz=off" was tested, and everything worked fine (as expected).
Script to run the test program and then get the timer stats (as above, but perhaps more recent).
Script to run the timer stats test script forever.