Sunday, May 10, 2015

About the hotplug affinity enhancement

This enhancement comes from investigating an issue reported by Brandon BerHent, who back-ported the -gc branch to 3.10 for the Android system and built a customized kernel for the Nexus 6. It's a very cool thing, and I got to say "Hello Moto", which recalls the memory of my first cell phone.

The Android system, unlike the PC platform, seems to use the cpu hotplug mechanism a lot for its power-saving functionality. When I looked at the cpu hotplug code, I noticed the behavior below.

p5qe ~ # schedtool -a 0x02 1388
p5qe ~ # schedtool 1388
PID  1388: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x2
p5qe ~ # cat /sys/devices/system/cpu/cpu1/online
1
p5qe ~ # echo 0 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1388
PID  1388: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x1
p5qe ~ # cat /sys/devices/system/cpu/cpu1/online
0
p5qe ~ # echo 1 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1388
PID  1388: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x3

As you can see, after cpu 1 goes offline and then comes back online, the task's affinity changes from 0x2 to 0x3, which includes the newly onlined cpu 1 but no longer matches the affinity the task was originally set to run with. The most interesting thing is that this is not just BFS behavior; it's the same with mainline CFS.
Normally, on the PC platform, this is not a big problem, as there are not many cpu hotplug events apart from suspend/resume. But if a small enhancement can maintain the task's original affinity intent, why not? Below is the behavior with the enhancement.

p5qe ~ # schedtool 1375
PID  1375: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0xf
p5qe ~ # schedtool -a 0x2 1375
p5qe ~ # schedtool 1375
PID  1375: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x2
p5qe ~ # echo 0 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1375
PID  1375: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x1
p5qe ~ # dmesg | tail
[    9.771522] zram3: detected capacity change from 0 to 268435456
[    9.783513] Adding 262140k swap on /dev/zram0.  Priority:10 extents:1 across:262140k SSFS
[    9.785789] Adding 262140k swap on /dev/zram1.  Priority:10 extents:1 across:262140k SSFS
[    9.788066] Adding 262140k swap on /dev/zram2.  Priority:10 extents:1 across:262140k SSFS
[    9.790311] Adding 262140k swap on /dev/zram3.  Priority:10 extents:1 across:262140k SSFS
[   12.103469] sky2 0000:02:00.0 eth1: Link is up at 1000 Mbps, full duplex, flow control both
[   25.360122] random: nonblocking pool is initialized
[  105.757001] Renew affinity for 198 processes to cpu 1
[  105.757001] kvm: disabling virtualization on CPU1
[  105.757140] smpboot: CPU 1 is now offline
p5qe ~ # echo 1 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1375
PID  1375: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x2
p5qe ~ # dmesg | tail
[    9.790311] Adding 262140k swap on /dev/zram3.  Priority:10 extents:1 across:262140k SSFS
[   12.103469] sky2 0000:02:00.0 eth1: Link is up at 1000 Mbps, full duplex, flow control both
[   25.360122] random: nonblocking pool is initialized
[  105.757001] Renew affinity for 198 processes to cpu 1
[  105.757001] kvm: disabling virtualization on CPU1
[  105.757140] smpboot: CPU 1 is now offline
[  137.348718] x86: Booting SMP configuration:
[  137.348722] smpboot: Booting Node 0 Processor 1 APIC 0x1
[  137.359727] kvm: enabling virtualization on CPU1
[  137.363338] Renew affinity for 203 processes to cpu 1
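
In rough sketch form, the idea is simply to remember the affinity the user asked for and to re-apply it to the affected tasks when a cpu comes back online. The snippet below is only an illustration of that idea, not the real code: the per-task mask cpus_preferred and the helper name are made up for this sketch, and the actual commit linked below differs in detail (as the dmesg output shows, it also does a renewal pass on the offline side).

#include <linux/cpumask.h>
#include <linux/sched.h>

/* Illustrative sketch only: when a cpu comes back online, widen the
 * effective mask of every task whose requested mask contains that cpu.
 * "cpus_preferred" is a hypothetical per-task mask holding what the user
 * last requested via sched_setaffinity(). */
static void renew_affinity_on_online(int cpu)
{
    struct task_struct *p;
    int count = 0;

    read_lock(&tasklist_lock);
    for_each_process(p) {
        /* The task asked for this cpu but lost it while the cpu was offline. */
        if (cpumask_test_cpu(cpu, &p->cpus_preferred) &&
            !cpumask_test_cpu(cpu, &p->cpus_allowed)) {
            cpumask_set_cpu(cpu, &p->cpus_allowed);
            count++;
        }
    }
    read_unlock(&tasklist_lock);

    printk(KERN_INFO "Renew affinity for %d processes to cpu %d\n",
           count, cpu);
}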

This enhancement changes the default behavior of the kernel/system. I have tested it for a while with different use cases and it all looks good, so I am marking this change version 1. If you have any comments or concerns, please let me know and I'll look into them.

Here is the commit for this enhancement.

BR Alfred

Edit: Just pushed a minor fix for when CONFIG_HOTPLUG_CPU is not enabled.
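
(Illustration only, reusing the hypothetical helper from the sketch above: when CONFIG_HOTPLUG_CPU is not set, the renewal hook just compiles away to a no-op, so the rest of the code can call it unconditionally.)

#ifdef CONFIG_HOTPLUG_CPU
static void renew_affinity_on_online(int cpu);
#else
static inline void renew_affinity_on_online(int cpu) { }
#endif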

12 comments:

  1. I am using this now. I will report back if there are any issues. Thanks :)

  2. OMG, so many kernels to test this week. ;-)

    From bootup this kernel works very well. It may be that it already balances processes <-> cores better from the very start. As I often do suspend/resume to disk, any advantage is highly appreciated!

    {I still need to investigate why TuxOnIce is so unreliable with 4.0.2 + gc. So I'd only come back to further VRQ testing after this is done.}

    BR, Manuel Krause

    Replies
    1. You should not feel any difference from this commit, as the PC platform doesn't do cpu hotplug much.
      It just keeps the original affinity across cpu offline/online.

      For your investigation, I'd like to suggest you bisect which commit in -gc introduces the unreliability.

  3. Yes, I've already thought about bisecting this, and have already begun.
    But I have not yet found a sensible testing scenario.

    When resume from hibernation fails, with no_console_suspend added to the kernel command line, the last displayed line I get is "Disabling non-boot CPUs ..." (which would be followed by further output on a failure-free resume). This is at the point where TuxOnIce's UI was shown in the small console format before and SHOULD then (normally) switch to the widescreen format.

    How should I set up a sensible testing scenario?
    * normally the problem gets worse with /dev/shm usage by files > 1G and my browser usage
    * not sure how many bootups and iterations are needed to be conclusive
    * last night, with the full -gc patches, I had 5 good resumes at first, plus 17 attempts to get the 6th, then aborted,
    * today, with only the first 09 -gc patches, I had 7 failures to load the image, then succeeded, and afterwards 5 further suspend/resume cycles succeeded.

    Maybe you've got some ideas on how to make my testing more valuable;
    thanks in advance,
    BR Manuel Krause

    Replies
    1. ...and even with patches 01..02 only, which is pure BFS, TuxOnIce can let me down in an additional trial. BR, MK

    2. How about the behaviour with pure BFS? The first 2 commits in -gc are just sync-up commits; the impact should be very minor.

    3. Sorry that I've apparently been confusing with my numbering scheme. Patch 01 is always pure BFS, patch 02 then is "bfs: 0462 v4.0.2 sync up." and so on.
      I need more testing to verify whether pure BFS really makes TuxOnIce more reliable.

    4. Just an intermediate update on my testing:
      After several successful TuxOnIce resumes from disk with pure BFS, I added the next patch, "bfs: [Sync] __schedule() and io_schedule_timeout()", my No. 03, and failed to get the image within 14 attempts (then I aborted). Then I rebooted into the pure BFS kernel and needed 24 attempts to get the image loaded successfully. It's a weird issue to test, as I can also get many positive results (or at least very few attempts to get the image).
      What is confirmed now, at least, is that the issue appears with pure BFS, too. ATM I don't even want to / can't say that the issue gets worse with your -gc patches' optimisations, as I can't produce scientific-like statistics for that.
      Together with your considerations on VRQ testing regarding timings and gfx -- thank you for your comprehensive explanations in your last email -- I'll repeat the tests with my gfx driver compiled in.

      BR, Manuel

    5. So! Now that the issue is solved for me -- I want to say a big THANK YOU for your indirect encouragement to test different .config options.
      I was able to eliminate the unreliability of TuxOnIce on my system with only one working combination:
      Compile DRM into the kernel and the i915 gfx driver as a module. (Both into the kernel or both as modules cause the known unpredictable TuxOnIce resume results.) I've tested this working solution with 4.0.4 and pure BFS, with 4.0.4 and all current -gc patches, and I've also tested it on 3.19.8 with -gc. And, yes, they're all working well.
      I wish I had tried this before... sooo much wasted time over months.

      Thank you,
      Manuel Krause

    6. Happy to hear that it's solved for you. Graphics seems to be the most timing-sensitive part of the kernel, and sometimes you have to work around it. :)
      And that may also apply to your experience with the -vrq branch, too.

  4. Has someone already had a look at this?
    "sched: always use blk_schedule_flush_plug in io_schedule_out"
    https://github.com/torvalds/linux/commit/10d784eae2b41e25d8fc6a88096cd27286093c84

    The code in question is there in bfs.c with (curr), and in -gc with (current), too.

    BR, Manuel Krause

    Replies
    1. With the for-BFS-modified patch applied, it has been running fine for some days now.

      BR, Manuel
