Sunday, May 10, 2015

About hotplug affinity enhancement

This enhancement comes from investigating an issue reported by Brandon BerHent, who back-ported the -gc branch to 3.10 for Android and builds customized kernels for the Nexus 6. It's a very cool thing, and I got to say "Hello Moto" again, which brings back memories of my first cell phone.

Android, unlike the PC platform, seems to use the CPU hotplug mechanism a lot for power saving. When I looked at the CPU hotplug code, I noticed the behavior below.

p5qe ~ # schedtool -a 0x02 1388
p5qe ~ # schedtool 1388
PID  1388: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x2
p5qe ~ # cat /sys/devices/system/cpu/cpu1/online
1
p5qe ~ # echo 0 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1388
PID  1388: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x1
p5qe ~ # cat /sys/devices/system/cpu/cpu1/online
0
p5qe ~ # echo 1 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1388
PID  1388: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x3

As you can see, after CPU 1 goes offline and then comes back online, the task's affinity changes from 0x2 to 0x3. That mask includes the newly onlined CPU 1, but it is no longer what the task was originally set to run on. The most interesting part is that this is not just BFS behavior; mainline CFS does the same.
Normally this is not a big problem on a PC, since there are few CPU hotplug events apart from suspend/resume. But if a small enhancement can preserve a task's original affinity intent, why not? Below is the behavior with the enhancement.

p5qe ~ # schedtool 1375
PID  1375: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0xf
p5qe ~ # schedtool -a 0x2 1375
p5qe ~ # schedtool 1375
PID  1375: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x2
p5qe ~ # echo 0 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1375
PID  1375: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x1
p5qe ~ # dmesg | tail
[    9.771522] zram3: detected capacity change from 0 to 268435456
[    9.783513] Adding 262140k swap on /dev/zram0.  Priority:10 extents:1 across:262140k SSFS
[    9.785789] Adding 262140k swap on /dev/zram1.  Priority:10 extents:1 across:262140k SSFS
[    9.788066] Adding 262140k swap on /dev/zram2.  Priority:10 extents:1 across:262140k SSFS
[    9.790311] Adding 262140k swap on /dev/zram3.  Priority:10 extents:1 across:262140k SSFS
[   12.103469] sky2 0000:02:00.0 eth1: Link is up at 1000 Mbps, full duplex, flow control both
[   25.360122] random: nonblocking pool is initialized
[  105.757001] Renew affinity for 198 processes to cpu 1
[  105.757001] kvm: disabling virtualization on CPU1
[  105.757140] smpboot: CPU 1 is now offline
p5qe ~ # echo 1 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1375
PID  1375: PRIO   0, POLICY N: SCHED_NORMAL  , NICE   0, AFFINITY 0x2
p5qe ~ # dmesg | tail
[    9.790311] Adding 262140k swap on /dev/zram3.  Priority:10 extents:1 across:262140k SSFS
[   12.103469] sky2 0000:02:00.0 eth1: Link is up at 1000 Mbps, full duplex, flow control both
[   25.360122] random: nonblocking pool is initialized
[  105.757001] Renew affinity for 198 processes to cpu 1
[  105.757001] kvm: disabling virtualization on CPU1
[  105.757140] smpboot: CPU 1 is now offline
[  137.348718] x86: Booting SMP configuration:
[  137.348722] smpboot: Booting Node 0 Processor 1 APIC 0x1
[  137.359727] kvm: enabling virtualization on CPU1
[  137.363338] Renew affinity for 203 processes to cpu 1
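
Those "Renew affinity for ... processes to cpu 1" lines in dmesg come from walking the task list on the hotplug event and re-applying each task's saved affinity. Roughly, the online half of the idea looks like the sketch below; it is simplified, the per-task cpus_preferred field and the function name are only illustrative stand-ins, and the real code is in the commit linked below.

/*
 * Illustrative sketch only, not the actual commit.  cpus_preferred is a
 * hypothetical per-task copy of the affinity mask the user last asked
 * for; it would be recorded in sched_setaffinity() and consulted again
 * here, from the CPU_ONLINE hotplug path.  Locking and migration
 * details are simplified.
 */
static void renew_affinity_on_online(unsigned int cpu)
{
	struct task_struct *p;
	int renewed = 0;

	read_lock(&tasklist_lock);
	for_each_process(p) {
		/* The task wanted this CPU but lost it while it was offline. */
		if (cpumask_test_cpu(cpu, &p->cpus_preferred) &&
		    !cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) {
			set_cpus_allowed_ptr(p, &p->cpus_preferred);
			renewed++;
		}
	}
	read_unlock(&tasklist_lock);

	pr_info("Renew affinity for %d processes to cpu %u\n", renewed, cpu);
}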

This enhancement changes the default behavior of the kernel/system. I have tested it for a while with different use cases and everything looks good, so I am marking this change version 1. If you have any comments or concerns, please let me know and I'll look into them.

Here is the commit of this enhancement.

BR Alfred

Edit: Just pushed a minor fix for when CONFIG_HOTPLUG_CPU is not enabled.
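
(For context: the renew logic only makes sense with CPU hotplug compiled in, so the fix is essentially along the lines of guarding the hotplug-only code; the function name here is again just the illustrative one from the sketch above, not the real symbol.)

#ifdef CONFIG_HOTPLUG_CPU
static void renew_affinity_on_online(unsigned int cpu);
#else
/* No hotplug events without CONFIG_HOTPLUG_CPU; compile the hook away. */
static inline void renew_affinity_on_online(unsigned int cpu) { }
#endif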

v4.0.2-gc updates

Here comes the sync update for the 4.0 -gc branch:

1. Add ck's sync patch upon 0462 for v4.0.2.
2. Fix a return type error for SMT_NICE; thanks to pf for pointing it out.
3. bfs: hotplug affinity enhancement, v1.
    This is a bit of a long story and it is basically not aimed at PCs; I'll open another topic for it later.

I have also enabled the GitHub repo, so the code is on both GitHub and BitBucket; feel free to pick whichever you like. Have fun!

Monday, May 4, 2015

Sanity Test for -gc & -vrq branches for Linux 4.0

Here are the sanity test results of BFS, the -gc branch and the -vrq branch. No regression was found on the -gc branch, which still does better than the original BFS at 50% and 100% workload.

For the -vrq branch, there is no huge improvement over the -gc branch. Performance at 50% and 300% workload is almost the same, and there is even a little regression at 100% workload; the only good news is the improvement at 150% workload.

The reasons why -vrq doesn't deliver the performance I expected are:
1. The new lock strategy introduced some additional rq lock sections (see the sketch after this list).
2. grq lock contention doesn't seem to be a major problem on systems with few cores, at least on my test hardware platform (4 cores).
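
To make point 1 concrete, here is a purely conceptual sketch, not the real BFS or -vrq code; do_local_bookkeeping() and pick_and_switch() are placeholders. It only illustrates why splitting the work across two locks adds lock sections that pay off only when the global lock is actually contended.

/* BFS style: one global runqueue lock covers everything,
 * so a pass through the scheduler takes one lock/unlock pair. */
static void bfs_style_schedule(struct rq *rq)
{
	raw_spin_lock(&grq.lock);
	do_local_bookkeeping(rq);
	pick_and_switch(rq);
	raw_spin_unlock(&grq.lock);
}

/* -vrq style: local work moves under a per-rq lock, so the same pass
 * now takes two lock/unlock pairs.  On a 4-core box the grq lock is
 * rarely contended, so the extra lock sections cost more than the
 * reduced contention saves. */
static void vrq_style_schedule(struct rq *rq)
{
	raw_spin_lock(&rq->lock);
	do_local_bookkeeping(rq);
	raw_spin_unlock(&rq->lock);

	raw_spin_lock(&grq.lock);
	pick_and_switch(rq);
	raw_spin_unlock(&grq.lock);
}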

I wish I had a chance to get hold of some 30+ core monsters to prove that all the code in -vrq is worth it. Until then, I'll continue with the unfinished features in -vrq, like cache_count, and see how much performance can be gained from those open doors.

BFS0462:
>>>>>50% workload
>>>>>round 1
real    5m21.850s
user    9m55.977s
sys     0m41.537s
>>>>>round 2
real    5m21.653s
user    9m55.750s
sys     0m41.411s
>>>>>round 3
real    5m21.973s
user    9m56.570s
sys     0m41.192s
>>>>>100% workload
>>>>>round 1
real    2m52.203s
user    10m8.151s
sys     0m43.575s
>>>>>round 2
real    2m52.050s
user    10m8.423s
sys     0m43.515s
>>>>>round 3
real    2m50.865s
user    10m8.283s
sys     0m43.700s
>>>>>150% workload
>>>>>round 1
real    2m56.355s
user    10m29.334s
sys     0m44.955s
>>>>>round 2
real    2m56.189s
user    10m29.469s
sys     0m44.782s
>>>>>round 3
real    2m56.264s
user    10m29.485s
sys     0m44.845s
>>>>>300% workload
>>>>>round 1
real    3m0.412s
user    10m42.805s
sys     0m46.352s
>>>>>round 2
real    3m1.408s
user    10m42.618s
sys     0m46.341s
>>>>>round 3
real    3m0.287s
user    10m43.304s
sys     0m46.244s

linux-4.0.y-gc:
>>>>>50% workload
>>>>>round 1
real    5m18.823s
user    9m50.911s
sys     0m41.302s
>>>>>round 2
real    5m19.032s
user    9m51.597s
sys     0m40.984s
>>>>>round 3
real    5m18.960s
user    9m51.490s
sys     0m41.046s
>>>>>100% workload
>>>>>round 1
real    2m51.085s
user    10m8.806s
sys     0m43.699s
>>>>>round 2
real    2m50.870s
user    10m8.108s
sys     0m44.142s
>>>>>round 3
real    2m50.839s
user    10m8.290s
sys     0m43.979s
>>>>>150% workload
>>>>>round 1
real    2m56.285s
user    10m30.045s
sys     0m44.629s
>>>>>round 2
real    2m56.286s
user    10m30.054s
sys     0m44.866s
>>>>>round 3
real    2m56.333s
user    10m30.379s
sys     0m44.425s
>>>>>300% workload
>>>>>round 1
real    3m0.427s
user    10m43.455s
sys     0m46.739s
>>>>>round 2
real    3m0.222s
user    10m43.341s
sys     0m46.519s
>>>>>round 3
real    3m0.244s
user    10m43.029s
sys     0m46.608s

linux-4.0.y-vrq:
>>>>>50% workload
>>>>>round 1
real    5m18.905s
user    9m51.214s
sys     0m40.890s
>>>>>round 2
real    5m18.994s
user    9m51.203s
sys     0m41.029s
>>>>>round 3
real    5m18.818s
user    9m51.266s
sys     0m40.819s
>>>>>100% workload
>>>>>round 1
real    2m51.414s
user    10m7.739s
sys     0m43.785s
>>>>>round 2
real    2m51.146s
user    10m7.449s
sys     0m43.848s
>>>>>round 3
real    2m51.103s
user    10m7.721s
sys     0m43.499s
>>>>>150% workload
>>>>>round 1
real    2m54.407s
user    10m21.732s
sys     0m44.407s
>>>>>round 2
real    2m54.436s
user    10m21.212s
sys     0m44.824s
>>>>>round 3
real    2m55.156s
user    10m21.279s
sys     0m44.796s
>>>>>300% workload
>>>>>round 1
real    3m0.549s
user    10m43.723s
sys     0m46.342s
>>>>>round 2
real    3m0.475s
user    10m44.249s
sys     0m45.982s
>>>>>round 3
real    3m0.393s
user    10m44.088s
sys     0m46.114s