Sunday, May 10, 2015

About the hotplug affinity enhancement

This enhancement comes from investigating an issue reported by Brandon BerHent, who back-ported the -gc branch to 3.10 for the Android system and built a customized kernel for the Nexus 6. It's a very cool thing, and I got to say "Hello Moto", which recalls the memory of my first cell phone.

The Android system, unlike the PC platform, seems to use the cpu hotplug mechanism a lot for its power-saving functionality. When I looked at the cpu hotplug code, I noticed the behavior below.

p5qe ~ # schedtool -a 0x02 1388
p5qe ~ # schedtool 1388
PID  1388: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x2
p5qe ~ # cat /sys/devices/system/cpu/cpu1/online
1
p5qe ~ # echo 0 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1388
PID  1388: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x1
p5qe ~ # cat /sys/devices/system/cpu/cpu1/online
0
p5qe ~ # echo 1 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1388
PID  1388: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x3

As you can see, after cpu 1 goes offline and then comes back online, the task's affinity changes from 0x2 to 0x3, which includes the newly onlined cpu 1 but no longer matches the affinity the task was originally set to run with. The most interesting thing is that this is not just BFS behavior; it's the same with mainline CFS.
Normally, on the PC platform, this is not a big problem, as there are not many cpu hotplug events apart from suspend/resume. But if a small enhancement can maintain the task's original affinity intent, why not? Below is the behavior with the enhancement.

p5qe ~ # schedtool 1375
PID  1375: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0xf
p5qe ~ # schedtool -a 0x2 1375
p5qe ~ # schedtool 1375
PID  1375: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x2
p5qe ~ # echo 0 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1375
PID  1375: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x1
p5qe ~ # dmesg | tail
[    9.771522] zram3: detected capacity change from 0 to 268435456
[    9.783513] Adding 262140k swap on /dev/zram0.  Priority:10 extents:1 across:262140k SSFS
[    9.785789] Adding 262140k swap on /dev/zram1.  Priority:10 extents:1 across:262140k SSFS
[    9.788066] Adding 262140k swap on /dev/zram2.  Priority:10 extents:1 across:262140k SSFS
[    9.790311] Adding 262140k swap on /dev/zram3.  Priority:10 extents:1 across:262140k SSFS
[   12.103469] sky2 0000:02:00.0 eth1: Link is up at 1000 Mbps, full duplex, flow control both
[   25.360122] random: nonblocking pool is initialized
[  105.757001] Renew affinity for 198 processes to cpu 1
[  105.757001] kvm: disabling virtualization on CPU1
[  105.757140] smpboot: CPU 1 is now offline
p5qe ~ # echo 1 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1375
PID  1375: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x2
p5qe ~ # dmesg | tail
[    9.790311] Adding 262140k swap on /dev/zram3.  Priority:10 extents:1 across:262140k SSFS
[   12.103469] sky2 0000:02:00.0 eth1: Link is up at 1000 Mbps, full duplex, flow control both
[   25.360122] random: nonblocking pool is initialized
[  105.757001] Renew affinity for 198 processes to cpu 1
[  105.757001] kvm: disabling virtualization on CPU1
[  105.757140] smpboot: CPU 1 is now offline
[  137.348718] x86: Booting SMP configuration:
[  137.348722] smpboot: Booting Node 0 Processor 1 APIC 0x1
[  137.359727] kvm: enabling virtualization on CPU1
[  137.363338] Renew affinity for 203 processes to cpu 1
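
In rough sketch form, the idea is simply to remember the affinity the user asked for and to re-apply it to the affected tasks when a cpu comes back online. The snippet below is only an illustration of that idea, not the real code: the per-task mask cpus_preferred and the helper name are made up for this sketch, and the actual commit linked below differs in detail (as the dmesg output shows, it also does a renewal pass on the offline side).

#include <linux/cpumask.h>
#include <linux/sched.h>

/* Illustrative sketch only: when a cpu comes back online, widen the
 * effective mask of every task whose requested mask contains that cpu.
 * "cpus_preferred" is a hypothetical per-task mask holding what the user
 * last requested via sched_setaffinity(). */
static void renew_affinity_on_online(int cpu)
{
    struct task_struct *p;
    int count = 0;

    read_lock(&tasklist_lock);
    for_each_process(p) {
        /* The task asked for this cpu but lost it while the cpu was offline. */
        if (cpumask_test_cpu(cpu, &p->cpus_preferred) &&
            !cpumask_test_cpu(cpu, &p->cpus_allowed)) {
            cpumask_set_cpu(cpu, &p->cpus_allowed);
            count++;
        }
    }
    read_unlock(&tasklist_lock);

    printk(KERN_INFO "Renew affinity for %d processes to cpu %d\n",
           count, cpu);
}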

This enhancement changes the default behavior of the kernel/system. I have tested it for a while with different use cases and it all looks good, so I am marking this change version 1. If you have any comments or concerns, please let me know and I'll look into them.

Here is the commit for this enhancement.

BR Alfred

Edit: Just pushed a minor fix for when CONFIG_HOTPLUG_CPU is not enabled.
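
(Illustration only, reusing the hypothetical helper from the sketch above: when CONFIG_HOTPLUG_CPU is not set, the renewal hook just compiles away to a no-op, so the rest of the code can call it unconditionally.)

#ifdef CONFIG_HOTPLUG_CPU
static void renew_affinity_on_online(int cpu);
#else
static inline void renew_affinity_on_online(int cpu) { }
#endif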

12 comments:

  1. I am using this now. I will report back if there are any issues. Thanks :)

  2. OMG, so many kernels to test this week. ;-)

    From bootup this kernel works very well. It may be that it already balances processes <-> cores better from the very start. As I often do suspend/resume to disk, any advantage is highly appreciated!

    {I still need to investigate why TuxOnIce is so unreliable with 4.0.2 + gc. So I'd only come back to further VRQ testing after this is done.}

    BR, Manuel Krause

    Replies
    1. You should not feel any difference from this commit, as the PC platform doesn't do cpu hotplug much.
      It just keeps the original affinity across cpu offline/online.

      For your investigation, I'd like to suggest you bisect which commit in -gc introduces the unreliability.

  3. Yes, I've already thought about bisecting this, and have already begun.
    But I have not yet found a sensible testing scenario.

    When resume from hibernation fails, with no_console_suspend added to the kernel command line, the last displayed line I get is "Disabling non-boot CPUs ..." (which would be followed by further output on a failure-free resume). This is at the point where TuxOnIce's UI was shown in the small console format before and SHOULD then (normally) switch to the widescreen format.

    How should I set up a sensible testing scenario?
    * normally the problem gets worse with /dev/shm usage by files > 1G and my browser usage
    * not sure how many bootups and iterations are needed to be conclusive
    * last night, with the full -gc patches, I had 5 good resumes at first, plus 17 attempts to get the 6th, then aborted,
    * today, with only the first 09 -gc patches, I had 7 failures to load the image, then succeeded, and afterwards 5 further suspend/resume cycles succeeded.

    Maybe you've got some ideas on how to make my testing more valuable;
    thanks in advance,
    BR Manuel Krause

    Replies
    1. ...and even with patches 01..02 only, which is pure BFS, TuxOnIce can let me down in an additional trial. BR, MK

    2. How about the behaviour with pure BFS? The first 2 commits in -gc are just sync-up commits; the impact should be very minor.

    3. Sorry that I've apparently been confusing with my numbering scheme. Patch 01 is always pure BFS, patch 02 then is "bfs: 0462 v4.0.2 sync up." and so on.
      I need more testing to verify whether pure BFS really makes TuxOnIce more reliable.

    4. Just an intermediate update on my testing:
      After several successful TuxOnIce resumes from disk with pure BFS, I added the next patch, "bfs: [Sync] __schedule() and io_schedule_timeout()", my No. 03, and failed to get the image within 14 attempts (then I aborted). Then I rebooted into the pure BFS kernel and needed 24 attempts to get the image loaded successfully. It's a weird issue to test, as I can also get many positive results (or at least very few attempts to get the image).
      What is confirmed now, at least, is that the issue appears with pure BFS, too. ATM I don't even want to / can't say that the issue gets worse with your -gc patches' optimisations, as I can't produce scientific-like statistics for that.
      Together with your considerations on VRQ testing regarding timings and gfx -- thank you for your comprehensive explanations in your last email -- I'll repeat the tests with my gfx driver compiled in.

      BR, Manuel

    5. So! Now that the issue is solved for me -- I want to say a big THANK YOU for your indirect encouragement to test different .config options.
      I was able to eliminate the unreliability of TuxOnIce on my system with only one working combination:
      Compile DRM into the kernel and the i915 gfx driver as a module. (Both into the kernel or both as modules cause the known unpredictable TuxOnIce resume results.) I've tested this working solution with 4.0.4 and pure BFS, with 4.0.4 and all current -gc patches, and I've also tested it on 3.19.8 with -gc. And, yes, they're all working well.
      I wish I had tried this before... sooo much wasted time over months.

      Thank you,
      Manuel Krause

    6. Happy to hear that it's solved for you. Graphics seems to be the most timing-sensitive part of the kernel, and sometimes you have to work around it. :)
      And that may also apply to your experience with the -vrq branch, too.

  4. Has someone already had a look at this?
    "sched: always use blk_schedule_flush_plug in io_schedule_out"
    https://github.com/torvalds/linux/commit/10d784eae2b41e25d8fc6a88096cd27286093c84

    The code in question is there in bfs.c with (curr), and in -gc with (current), too.

    BR, Manuel Krause

    Replies
    1. With the for-BFS-modified patch applied, it has been running fine for some days now.

      BR, Manuel
