Saturday, August 30, 2014

Regression investigation and resched_best_idle issue found

After reverting the commit that I bisected to, I got results close to the baseline, but the grq lock issue is still there to work on.

I played with the grq lock in the idle task code path in several ways:
1. Placing grq_lock() before update_clocks(): results close to the baseline.
2. Placing grq_lock() after update_clocks(): the regression comes back.
3. A draft grq-lock-free idle fast path solution: the regression comes back.

These code changes shouldn't contribute to the regression as far as I can tell, and I believe the grq lock may just be an illusion created by a hidden issue.

Looking back at the unusual sched_count and sched_goidle values in schedstat: sched_goidle is easier to trace than sched_count, and there are three code paths that lead to sched_goidle, which are

1. idle == prev, !qnr
2. idle != prev and deactivate, !qnr
3. idle != prev and prev needs other cpus, !qnr

So I wrote some debug code to find out how these three paths contribute to sched_goidle. Here are the results:

              idle    deactivate   needother
Baseline      9727    56106        0
VRQ           27764   61276        0

It looks like schedule() is called while the idle task is running even though no task is actually queued in grq, so the scheduler has to pick idle again. This idle->idle code path is inefficient; it should be hit as rarely as possible.
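The three sched_goidle paths above can be modeled with a small toy classifier. This is not kernel code; it is a user-space sketch written for this post, and all names in it are made up for illustration:

```c
#include <assert.h>

/* Toy model of the three code paths in schedule() that end in a
 * sched_goidle hit, matching the labels in the table above. */

enum goidle_path { PATH_NONE, PATH_IDLE, PATH_DEACTIVATE, PATH_NEEDOTHER };

struct sched_event {
	int prev_is_idle;     /* schedule() entered with idle running */
	int deactivate;       /* prev is going to sleep */
	int need_other_cpus;  /* prev may only run on other cpus */
	int qnr;              /* a runnable task is queued in grq */
};

/* Classify which path, if any, makes schedule() pick idle as next. */
static enum goidle_path classify_goidle(const struct sched_event *e)
{
	if (e->qnr)
		return PATH_NONE;       /* something to run: no goidle */
	if (e->prev_is_idle)
		return PATH_IDLE;       /* the wasteful idle->idle path */
	if (e->deactivate)
		return PATH_DEACTIVATE;
	if (e->need_other_cpus)
		return PATH_NEEDOTHER;
	return PATH_NONE;               /* prev just keeps running */
}
```

The "idle" column in the stats corresponds to PATH_IDLE here: schedule() was entered with idle running and picked idle again.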

One suspicious piece of code that sends schedule() down the idle->idle path is the resched_best_idle(p) call. Currently, resched_best_idle(p) is called whenever prev != next. I wrote a patch to remove a duplicated resched_best_idle() call in 3.15.x. This time, I decided to take a closer look at the conditions under which resched_best_idle() should be called.

#1 prev == idle
#2 !#1 and deactivate
#3 !#1 and !#2 and need_other_cpus
#4 !#1..3 and qnr
#5 !#1..3 and !qnr

                     #1  #2  #3  #4  #5
resched_best_idle    N   N   Y   ?   N
next != prev         ?   Y   Y   ?   N

? means that the result depends on which next task is fetched from grq.
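The truth table above can be encoded as two small checks, which makes the mismatch easy to see. Again a user-space sketch with made-up names, where 1 = Y, 0 = N, and -1 = ? (depends on which task is fetched from grq):

```c
#include <assert.h>

struct sched_case {
	int prev_is_idle;    /* #1 */
	int deactivate;      /* #2 */
	int need_other_cpus; /* #3 */
	int qnr;             /* #4 vs #5 */
};

/* Should resched_best_idle() actually run for this case? */
static int resched_best_idle_wanted(const struct sched_case *c)
{
	if (c->prev_is_idle)    return 0;   /* #1: N */
	if (c->deactivate)      return 0;   /* #2: N */
	if (c->need_other_cpus) return 1;   /* #3: Y */
	if (c->qnr)             return -1;  /* #4: ? */
	return 0;                           /* #5: N */
}

/* What the current condition, next != prev, evaluates to. */
static int next_ne_prev(const struct sched_case *c)
{
	if (c->prev_is_idle)    return -1;  /* #1: ? */
	if (c->deactivate)      return 1;   /* #2: Y */
	if (c->need_other_cpus) return 1;   /* #3: Y */
	if (c->qnr)             return -1;  /* #4: ? */
	return 0;                           /* #5: N */
}
```

Case #2 is the clearest mismatch: next != prev is always true there, yet resched_best_idle() is never wanted.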

Obviously, the current next != prev condition can't exclude #1 and #2, which causes unnecessary schedule() calls. Take the 50% job/core ratio test for example: when a task is deactivated and the next task hasn't been generated yet, the scheduler will choose idle to run. But the resched_best_idle(prev) call will cause schedule() to run again on this rq, which hits the idle->idle code path.
That's a lot of talk, but the code change is very simple: for the baseline, just one line is added.
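I'm not reproducing the actual patch here, but based on the table the guard plausibly has the shape sketched below: keep the existing next != prev test and additionally rule out cases #1 and #2, where prev is not runnable and kicking an idle cpu to pick it up is pointless. The function name and parameters are illustrative, not the real kernel code:

```c
#include <assert.h>

/* Sketch of the fixed condition: only kick an idle cpu to pick up
 * prev when prev is still runnable, i.e. exclude case #1 (prev is
 * idle) and case #2 (prev was deactivated). */
static int should_call_resched_best_idle(int next_ne_prev,
					 int prev_is_idle,
					 int deactivate)
{
	return next_ne_prev && !prev_is_idle && !deactivate;
}
```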

Below are the 50% job/core ratio throughput test results for the baseline and VRQ with the fix.

               sched_count  sched_goidle  ttwu_count  ttwu_local  real
Baseline+fix   200751       52479         85490       39002       5m22.097s
VRQ+fix        202821       51795         85278       36583       5m23.010s

               idle    deactivate
Baseline+fix   8024    44455
VRQ+fix        11379   40388

The fix is good for both the baseline and VRQ. Compared to the baseline without the fix, which took about 5m33s, 10~11 seconds are saved, which is about a 3% improvement.
