Saturday, August 30, 2014

GC update & VRQ dev cycle


It seems that mainline 3.16.2 will not land this week, so I have updated and tagged 3.16.1-gc.

Notable changes:
-- sync up with v3.16.1
-- sync up with BFS 0456
-- resched_best_idle() bug fix, which gives about a 3% performance improvement



The VRQ solution is based on BFS. To avoid extra sync-up work with BFS and mainline updates, I have decided to take the following strategy, lined up with the stable kernel update cycle.

mainline .0 .1 .2 releases      ---- sync up with mainline and new BFS releases.
mainline .3 .4 .5 .6 releases   ---- put new VRQ changes in; Baseline frozen, VRQ releases frozen.
mainline .7 .8 ... releases     ---- testing and benchmarking.

Regression investigation and resched_best_idle issue found

After reverting the commit that I found by bisecting, I got results approaching the Baseline, but the grq lock session was still there to work on.

I played with the grq lock in the idle task path in several ways:
1. Placing grq_lock() before update_clocks(): results approach the Baseline.
2. Placing grq_lock() after update_clocks(): the regression comes back.
3. A draft grq-lock-free idle fast path: the regression comes back.

These code changes don't plausibly explain the regression to me, and I believe the grq lock may just be an illusion hiding the real issue.

Looking back at the unusual sched_count and sched_goidle values in schedstat: sched_goidle is easier to trace than sched_count, and there are three code paths that lead to sched_goidle, which are

1. idle == prev, !qnr
2. idle != prev and deactivate, !qnr
3. idle != prev and prev needs other cpus, !qnr
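The three paths above can be written as a small decision function. This is a userspace sketch, not the actual bfs code; the flag names (prev_is_idle, deactivate, needs_other_cpus, qnr) are mine:

```c
#include <stdbool.h>

/* Userspace model of the three code paths that can lead to
 * sched_goidle in schedule().  In all of them nothing is queued
 * in grq (!qnr), so the scheduler picks idle as next. */
enum goidle_path { PATH_NONE, PATH_IDLE, PATH_DEACTIVATE, PATH_NEEDOTHER };

enum goidle_path classify_goidle(bool prev_is_idle, bool deactivate,
                                 bool needs_other_cpus, bool qnr)
{
    if (qnr)                  /* something is queued: won't go idle */
        return PATH_NONE;
    if (prev_is_idle)         /* 1. idle == prev, !qnr */
        return PATH_IDLE;
    if (deactivate)           /* 2. idle != prev and deactivate, !qnr */
        return PATH_DEACTIVATE;
    if (needs_other_cpus)     /* 3. idle != prev, prev needs other cpus, !qnr */
        return PATH_NEEDOTHER;
    return PATH_NONE;         /* prev keeps running or another task runs */
}
```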

so I wrote debug code to find out how these three paths contribute to sched_goidle. Here are the results:

             idle     deactivate   needother
Baseline     9727     56106        0
VRQ          27764    61276        0

It looks like schedule() is called while the idle task is running and no task is actually queued in grq, so the scheduler has to run idle again. This idle->idle code path is inefficient; it should be hit as little as possible.

One suspicious piece of code that causes schedule() to take the idle->idle path is the resched_best_idle(p) call. Currently, resched_best_idle(p) is called whenever prev != next. I wrote a patch to remove a duplicated resched_best_idle() call in 3.15.x. This time, I decided to take a close look at the conditions under which resched_best_idle() should be called.

#1 prev == idle
#2 !#1 and deactivate
#3 !#1 and !#2 and need_other_cpus
#4 !#1..3 and qnr
#5 !#1..3 and !qnr

                      #1   #2   #3   #4   #5
resched_best_idle     N    N    Y    ?    N
next != prev          ?    Y    Y    ?    N

A "?" means the result depends on which next task is fetched from grq.

Obviously, the current next != prev condition can't cover #1 and #2, which causes unnecessary schedule() calls. Take the 50% jobs/cores ratio test for example: when a task is deactivated and the next task has not been generated yet, the scheduler chooses idle to run. But resched_best_idle(prev) then causes schedule() to run again on this rq, which hits the idle->idle code path.
That is a lot of talk, but the code change is very simple: for the baseline, just one line is added.
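As an illustration of what that condition guards, here is a userspace model built from the table above: besides next != prev, cases #1 (prev == idle) and #2 (deactivate) must be excluded. This is my reading of the fix; the actual one-line change in the patch may express it differently:

```c
#include <stdbool.h>

/* Userspace model of the fixed condition for calling
 * resched_best_idle(prev).  Per the table: it should fire in case #3
 * (and possibly #4), never in #1, #2 or #5.  Not the actual bfs code. */
bool should_resched_best_idle(bool next_ne_prev, bool prev_is_idle,
                              bool deactivate)
{
    /* next != prev alone wrongly covers #1 and #2 as well */
    return next_ne_prev && !prev_is_idle && !deactivate;
}
```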

Below are the 50% jobs/cores ratio throughput test results for the baseline and VRQ with the fix:

               sched_count   sched_goidle   ttwu_count   ttwu_local   real
Baseline+fix   200751        52479          85490        39002        5m22.097s
VRQ+fix        202821        51795          85278        36583        5m23.010s

               idle     deactivate
Baseline+fix   8024     44455
VRQ+fix        11379    40388

The fix is good for both the baseline and VRQ. Compared to the baseline without the fix, which took about 5m33s, 10~11 seconds are saved; that's about a 3% improvement.

Tuesday, August 26, 2014

VRQ 0.1 schedstat result and regression on 50% ratio test

The first-stage VRQ 50% jobs/cores ratio test shows about a 3-second regression compared to the Baseline, so I decided to find out the cause.

First, I looked at the schedstat output collected during the test. Below are the results for the Baseline code-base and for VRQ.

Baseline:
sched_count   sched_goidle   ttwu_count   ttwu_local   rq_cpu_time    rq_sched_info.run_delay   rq_sched_info.pcount
251973        61299          83682        39699        181334347343   372048261887              97709
domain0 19012
domain1 24971

VRQ:
288154        77212          85273        43396        174958346976   317271010776              107263
domain0 18564
domain1 23882

The most remarkable difference is VRQ's larger sched_count and sched_goidle. The stat numbers don't tell what caused this, so I had to bisect to find out.

Finally, it turns out that the commit which divides update_clocks() into update_rq_clock() and update_grq_clock() causes the regression. It was a mistake to separate these two too far apart, as bfs uses jiffies to adjust ndiff. After reverting that commit (part of it), results approaching the Baseline come back:

sched_count   sched_goidle   ttwu_count   ttwu_local   rq_cpu_time
267428        66304          82625        40292        171637450209

Tuesday, August 19, 2014

Variable Run Queue Locking(VRQL) for BFS

One of the concerns about BFS is lock contention on the single grq lock. As I read mainline and bfs code these days, an idea came to me for a multiple-lock solution for BFS. In brief: introduce a per-cpu run queue lock "like" mainline does, to guard rq data access and minimize grq locking sessions, and by this means reduce lock contention. Depending on what kind of data the code needs to access, the rq lock, the grq lock, or both rq and grq locks need to be held; that's where the name VRQL (variable run queue locking) comes from.

How the grq lock works currently
Currently BFS uses a single grq lock for almost everything, including task info access, rq structure access, and grq access. It's a simple design, but every operation on a task, rq, or grq needs to acquire the grq lock.

The run queue lock
In VRQL, a raw_spin_lock is added to the current bfs run queue structure, to guard run queue structure data access across cpus. Meanwhile, local run queue data access remains safe with preemption and interrupts disabled.
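A rough userspace model of that structure change might look like this. Field and type names are illustrative, not the actual bfs structures; in the kernel the new lock is a raw_spinlock_t inside the bfs rq struct, and pthread mutexes merely stand in for spinlocks here:

```c
#include <pthread.h>

/* Userspace model: each cpu's run queue gains its own lock, while the
 * single global grq keeps the original grq lock. */
#define NR_CPUS 4

struct rq {
    pthread_mutex_t lock;   /* stands in for the new per-rq raw_spinlock_t */
    int cpu;
    /* ... curr, clocks, schedstat counters ... */
};

struct global_rq {
    pthread_mutex_t lock;   /* the original global grq lock */
    int nr_queued;
};

struct rq runqueues[NR_CPUS];
struct global_rq grq;

void init_queues(void)
{
    pthread_mutex_init(&grq.lock, NULL);
    for (int i = 0; i < NR_CPUS; i++) {
        pthread_mutex_init(&runqueues[i].lock, NULL);
        runqueues[i].cpu = i;
    }
}
```

The point of the model is that two cpus can now hold their own rq locks at the same time, which a single grq lock never allowed.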

Lock strategy for cross cpu data accessing

After the run queue lock is introduced, the "all grq" locking strategy is changed as below.

a. To access task structure info: if the task is currently running, the task's rq lock needs to be held, which prevents the task from switching out of the cpu/rq; otherwise, the grq lock needs to be held, which prevents the task from being taken from or put into grq.
The task_access_lock...()/task_access_unlock...() inlined routines are introduced for this purpose.

b. To access the run queue structure, the rq's lock needs to be held.
For the convenience of accessing BOTH a task's structure and the task's rq structure, the task_vrq_lock_...()/task_vrq_unlock...() inlined routines are introduced, which first lock the task's rq lock and then acquire the grq lock if the task is not running.

c. To access the grq structure, the grq lock needs to be held.
The rq_grq_lock...()/rq_grq_unlock...() inline routines are introduced to lock down both the given rq and grq.
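Rule (a) can be sketched as follows. The helper names echo the routines mentioned above, but the bodies are my userspace illustration (pthread mutexes in place of spinlocks), not the actual bfs code:

```c
#include <pthread.h>

struct rq {
    pthread_mutex_t lock;
};

struct global_rq {
    pthread_mutex_t lock;
};

struct task {
    int on_cpu;        /* nonzero if currently running on some cpu */
    struct rq *rq;     /* the rq it runs on, valid when on_cpu */
};

struct global_rq grq = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Lock whichever lock pins this task, and return it so the caller
 * can unlock the right one. */
pthread_mutex_t *task_access_lock(struct task *p)
{
    if (p->on_cpu) {
        pthread_mutex_lock(&p->rq->lock);   /* prevents switching out */
        return &p->rq->lock;
    }
    pthread_mutex_lock(&grq.lock);          /* prevents grq insert/remove */
    return &grq.lock;
}

void task_access_unlock(pthread_mutex_t *lock)
{
    pthread_mutex_unlock(lock);
}
```

Note the sketch omits a subtlety a real implementation must handle: p->on_cpu can change between the check and the lock acquisition, so the state has to be rechecked after locking.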

Minimize grq locking sessions
Adding the above locking strategy will not automatically improve performance. Instead, it adds locking overhead that hurts performance. The performance gain comes from identifying and narrowing the grq sessions, in both the number of sessions and the run time spent in them.

To keep things simple and avoid surprises when playing with the grq sessions, a refactoring approach is used. That means changing only the code itself but *NOT* the bfs functionality logic, as far as possible.

First Stage
In this stage, the first-priority task is to apply the new locking strategy to bfs and make it work. The second priority is refactoring schedule(), as it is the hot spot of the scheduling core code. For the rest of the code, a conservative rule is used.

There are 4 concepts in schedule():
a. the rq/cpu on which schedule() happens.
b. prev task: the task the rq is running (rq->curr) when schedule() starts.
c. next task: the task the rq will run after schedule().
d. grq: stores the queued tasks; next comes from it, and prev is put back into it when a switch happens.

Changes in schedule()
First of all, the grq lock is replaced by rq->lock when the locking strategy is applied.

In the first part of schedule(), from the need_resched label to the blk_schedule_flush_plug() block, most code is safe under rq lock guarding, as only the prev task's data structures are accessed.
Only try_to_wake_up_local() needs the grq lock in this part. A debug test shows it is not a hot path of schedule() (fewer than 50 hits during a system boot plus a suspend/resume cycle), so it is simply wrapped in a grq_lock()/grq_unlock() pair.

In the second half of schedule() (the idle != prev block, the fetch-next block, and the context_switch block), to narrow the grq locking session, the spots that need read/write access to grq are highlighted and then refactored. The code tells what has finally been done. In brief: update_clocks() has been refactored into 2 calls, update_rq_clock() and update_grq_clock(); the latter only needs to be called when idle != prev, before check_deadline(). update_cpu_clock_switch() has been divided into two versions, with _idle and _nonidle suffixes, called in their individual branches. The grq locking session ends before context_switch().
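The clock-update split can be modeled in userspace with counters standing in for the real clock updates; the branch structure is taken from the description above, everything else is illustrative:

```c
#include <stdbool.h>

/* Userspace model: update_rq_clock() runs on every schedule() under
 * rq->lock; update_grq_clock() is only needed on the idle != prev
 * branch, before deadline checking.  Not the actual bfs code. */
struct rq { unsigned long clock_updates; };

unsigned long grq_clock_updates;

void update_rq_clock(struct rq *rq) { rq->clock_updates++; }
void update_grq_clock(void)         { grq_clock_updates++; }

/* Sketch of the branch structure in schedule()'s second half. */
void schedule_second_half(struct rq *rq, bool prev_is_idle)
{
    update_rq_clock(rq);          /* always, under rq->lock */
    if (!prev_is_idle) {
        update_grq_clock();       /* only when idle != prev */
        /* check_deadline(prev); deactivate or requeue prev ... */
    }
    /* pick next from grq; grq lock session ends before context_switch() */
}
```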

All this refactoring should *NOT* change the original bfs functionality logic, *EXCEPT* that the prev task needs to be taken care of: prev is returned to grq before context_switch(), so once the grq lock is released before context_switch(), prev may be fetched and run by another rq/cpu while this rq is actually not done with it yet. There are 2 options to solve this:

1. Dequeue the prev task from grq and hold it within the rq during context_switch(); once context_switch() is done, return prev to grq.
2. Hold the grq lock during context_switch(), so prev stays safely in grq.

Currently option 1 is used; the final choice will be made later based on further testing and optimization.
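Option 1 can be sketched as below, with tasks reduced to integer ids and an array standing in for grq. This is a userspace illustration of the dequeue/hold/return idea only, not the actual bfs queue code:

```c
#include <stdbool.h>

#define MAX_QUEUED 16

/* Userspace model of option 1: prev is taken out of grq before the
 * grq lock is dropped for context_switch(), so no other cpu can fetch
 * it mid-switch; it is returned afterwards.  0 marks an empty slot. */
int grq_tasks[MAX_QUEUED];

bool grq_contains(int id)
{
    for (int i = 0; i < MAX_QUEUED; i++)
        if (grq_tasks[i] == id)
            return true;
    return false;
}

void grq_add(int id)
{
    for (int i = 0; i < MAX_QUEUED; i++)
        if (grq_tasks[i] == 0) {
            grq_tasks[i] = id;
            return;
        }
}

/* Called under the grq lock, before it is dropped: while prev is out
 * of grq, no other cpu can pick it up. */
void isolate_prev(int id)
{
    for (int i = 0; i < MAX_QUEUED; i++)
        if (grq_tasks[i] == id)
            grq_tasks[i] = 0;
}

/* Called once context_switch() is done: prev becomes fetchable again. */
void return_prev(int id)
{
    grq_add(id);
}
```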

Scalability & Critical Spots
In VRQL, the run queue lock is a per-cpu design, so it scales as the cpu count goes up. The grq lock sessions are still the critical spots of the scheduling code.


a. Four jobs/cores ratio tests are designed to exercise the possible code paths in schedule(): 50%, 100% and 150% jobs/cores ratio tests, which simply modify -j<num> according to the cpu core count of the tested machine, and 100% + 50%, which runs a 50% ratio mprime test at SCHED_IDLEPRIO priority using schedtool, then runs a 100% ratio kernel compilation.
b. Kernel compilation throughput tests are done on an Intel Xeon L5420 (quad core).
c. Two different bfs code-base kernels are used:
1. 3.16_Baseline   3.16 kernel with BFS 0449 + my bfs patch set
2. 3.16_VRQ        3.16 kernel with BFS 0449 + my bfs patch set + VRQ locking patch

Test steps:
1. Boot the target kernel into text mode.
2. Do a first-time compile of the kernel code (the first compile always takes longer; it is used to warm the file cache).
3. Compile the kernel code with -j<num> and record the real time used; <num> is chosen from the jobs/cores ratio under test and the machine in use.
4. Repeat step 3 ten times, exclude the max and min times recorded, and average the rest.
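The reduction in step 4 can be written out in code. A small illustration of the calculation only, not part of any kernel code:

```c
/* Given the recorded "real" times (in seconds), drop one max and one
 * min and average the remaining values, as in step 4 above.
 * Requires n >= 3. */
double trimmed_mean(const double *times, int n)
{
    double sum = 0.0, min = times[0], max = times[0];
    for (int i = 0; i < n; i++) {
        sum += times[i];
        if (times[i] < min) min = times[i];
        if (times[i] > max) max = times[i];
    }
    return (sum - min - max) / (n - 2);
}
```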

                 50%          100%         150%         100%+50%
3.16_Baseline    5m32.978s    2m50.706s    2m56.419s    2m50.656s
3.16_VRQ         5m35.576s    2m50.432s    2m56.642s    2m50.731s

The first-stage tests show that the performance improvement gained is comparable to the overhead of the additional locking. Two notable items:
1. In the 50% jobs/cores ratio test, because not all cores are used, the performance gained from refining grq locking is less than the additional locking overhead.
2. Although VRQ is worse than the Baseline in the 50% test, it gets results equal to the Baseline in the 100% test, which shows VRQ gains performance by avoiding some lock contention, as it is designed to.

Hopefully, in the coming stages of VRQ coding, the VRQ tests will get better results.

The Code
A new branch has been created for VRQ. The code should be considered highly experimental and should only be used for testing.

3.16_Baseline at commit: badf57e bfs: Fix runqueues decarelation.
3.16_VRQ at commit: ebe6b8e bfs: (5/6) Refine schedule() grq locking session, cnt.

VRQ related commits are:
ebe6b8e bfs: (5/6) Refine schedule() grq locking session, cnt.
d59b75c bfs: Make grq cacheline aligned.
e5de0ef bfs: (6/6) Implement task_vrq_lock...() grq_test_lock() logic.
307cc5e bfs: (5/6) Refine schedule() grq locking session.
e2f1be3 bfs: (4/6) use rq&grq double locking.
a61ee81 bfs: (3/6) isolate prev task from grq.
9aad936 bfs: (2/6) other code use grq&rq double locking.
482345f bfs: (1/6) schedule() use grq&rq double locking.

What next
1. It would be nice to have feedback from more testers, especially those running multiple cores or cores with SMT, and owners of AMD cpus.
2. Sync up with 3.16 code and the new BFS release.
3. Identify what the next stage of code changes should focus on, and move on.