After reverting the commit that I bisected to, I got a result close to the Baseline, but the grq lock issue is still there to work on.
I played with the grq lock in the idle task path in several ways (see the sketch after this list):
1. Placing grq_lock() before update_clocks(): the result is close to the Baseline.
2. Placing grq_lock() after update_clocks(): the regression comes back.
3. A draft grq-lock-free idle fast path: the regression comes back.
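For reference, here is a minimal user-space sketch of the two lock placements from items 1 and 2, with pthread stubs standing in for the kernel's grq lock and update_clocks(); the names mirror this post, not the real BFS/VRQ code.

    #include <pthread.h>

    /* Stand-ins for the kernel primitives; illustrative only. */
    static pthread_mutex_t grq_mutex = PTHREAD_MUTEX_INITIALIZER;
    static void grq_lock(void)      { pthread_mutex_lock(&grq_mutex); }
    static void grq_unlock(void)    { pthread_mutex_unlock(&grq_mutex); }
    static void update_clocks(void) { /* per-rq clock bookkeeping */ }

    /* Variant 1: grq_lock() before update_clocks() -- close to Baseline. */
    static void idle_path_lock_first(void)
    {
        grq_lock();
        update_clocks();            /* clock update done under the lock */
        /* ... pick next task or go idle ... */
        grq_unlock();
    }

    /* Variant 2: grq_lock() after update_clocks() -- regression returns. */
    static void idle_path_lock_later(void)
    {
        update_clocks();            /* clock update outside the lock */
        grq_lock();
        /* ... pick next task or go idle ... */
        grq_unlock();
    }

    int main(void)
    {
        idle_path_lock_first();
        idle_path_lock_later();
        return 0;
    }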
It doesn't make sense to me that these code changes contribute to the regression, and I believe the grq lock may just be masking a hidden issue.
Looking back at the unusual sched_count and sched_goidle values in schedstat: sched_goidle is easier to trace than sched_count, and there are three code paths that lead to sched_goidle:
1. idle == prev, !qnr
2. idle != prev and deactivate, !qnr
3. idle != prev and prev needs other cpus, !qnr
So I wrote debug code to find out how much each of these three paths contributes to sched_goidle; here is the result:
             idle    deactivate  needother
Baseline     9727    56106       0
VRQ          27764   61276       0
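The debug code can be as simple as one counter per path. Below is a self-contained user-space sketch of the idea, not the actual kernel instrumentation; all names are invented for the sketch.

    #include <stdio.h>
    #include <stdbool.h>

    /* One counter per sched_goidle path; hypothetical names. */
    static unsigned long goidle_idle, goidle_deactivate, goidle_needother;

    /* Classify one schedule() pass that ends up running idle (!qnr). */
    static void account_goidle(bool prev_is_idle, bool deactivate,
                               bool need_other_cpus)
    {
        if (prev_is_idle)
            goidle_idle++;          /* path 1: idle == prev, !qnr */
        else if (deactivate)
            goidle_deactivate++;    /* path 2: prev deactivated, !qnr */
        else if (need_other_cpus)
            goidle_needother++;     /* path 3: prev needs other cpus, !qnr */
    }

    int main(void)
    {
        account_goidle(true, false, false);   /* one hit per path, as a demo */
        account_goidle(false, true, false);
        account_goidle(false, false, true);
        printf("idle=%lu deactivate=%lu needother=%lu\n",
               goidle_idle, goidle_deactivate, goidle_needother);
        return 0;
    }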
It looks like schedule() is called while the idle task is running even though no task is queued in the grq, so the scheduler has to pick idle again. This idle->idle code path is inefficient; it should be hit as rarely as possible.
One suspect that can send schedule() down the idle->idle code path is the resched_best_idle(p) call. Currently, resched_best_idle(p) is called when prev is not equal to next. I wrote a patch to remove the duplicated resched_best_idle() call in 3.15.x; this time, I decided to take a close look at the conditions under which resched_best_idle() should be called.
#1 prev == idle
#2 !#1 and deactivate
#3 !#1 and !#2 and need_other_cpus
#4 !#1..3 and qnr
#5 !#1..3 and !qnr
                    #1   #2   #3   #4   #5
resched_best_idle   N    N    Y    ?    N
next != prev        ?    Y    Y    ?    N
? means that the result depends on which next task is fetched from the grq.
Obviously, the current next != prev condition gets #1 and #2 wrong: it can fire resched_best_idle() when it should not, causing unnecessary schedule() calls. Take the 50% job/core ratio test for example: when a task is deactivated and the next task has not been generated yet, the scheduler will choose idle to run. But resched_best_idle(prev) will cause schedule() to run again on this rq, which hits the idle->idle code path.
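To make the failure mode concrete, here is a toy model of case #2 under the current guard. It is a deliberately simplified stand-in for schedule(), not the real code, and the next != prev test is approximated as "the running task changed".

    #include <stdio.h>
    #include <stdbool.h>

    static int idle_to_idle;    /* schedule() passes that go idle -> idle */

    /* One simplified schedule() pass: if nothing is queued (!qnr), idle
     * runs next. Returns true when the "next != prev" guard would fire
     * resched_best_idle() and kick this rq into another pass. */
    static bool schedule_pass(bool prev_is_idle, bool qnr)
    {
        bool next_is_idle = !qnr;   /* grq empty -> pick idle */

        if (prev_is_idle && next_is_idle)
            idle_to_idle++;

        /* approximation of "next != prev": the running task changed */
        return next_is_idle != prev_is_idle;
    }

    int main(void)
    {
        /* Pass 1, case #2: prev deactivates so it is not requeued and
         * the grq stays empty; the guard still fires. */
        bool kicked = schedule_pass(false, false);

        /* Pass 2: the kick re-runs schedule() with idle as prev and
         * still nothing queued -> the wasteful idle -> idle pass. */
        if (kicked)
            schedule_pass(true, false);

        printf("idle->idle passes: %d\n", idle_to_idle);   /* prints 1 */
        return 0;
    }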
That is a lot of talk, but the code change is very simple; for the baseline, just one line is added (a guess at its shape is sketched below).
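I won't reproduce the patch here, but based on the table above, a plausible shape for that one line is a stronger guard that skips cases #1 and #2. The following compilable sketch is my reconstruction, not the actual patch; the stub types and maybe_kick() helper are invented for illustration.

    #include <stdio.h>
    #include <stdbool.h>

    /* Stub task type; only the guard logic matters here. */
    struct task { const char *name; };

    static void resched_best_idle(struct task *p)
    {
        printf("kick an idle cpu for %s\n", p->name);
    }

    static void maybe_kick(struct task *prev, struct task *next,
                           struct task *idle, bool deactivate)
    {
        /* old guard:  if (next != prev) resched_best_idle(prev);
         * new guard also skips #1 (prev == idle) and #2 (deactivate): */
        if (next != prev && prev != idle && !deactivate)
            resched_best_idle(prev);
    }

    int main(void)
    {
        struct task idle = { "idle" }, a = { "taskA" };

        maybe_kick(&a, &idle, &idle, true);    /* case #2: no kick now */
        maybe_kick(&a, &idle, &idle, false);   /* e.g. case #3: still kicks */
        return 0;
    }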
Below are the 50% job/core ratio throughput test results for the baseline and VRQ with the fix:
              sched_count  sched_goidle  ttwu_count  ttwu_local  real
Baseline+fix  200751       52479         85490       39002       5m22.097s
VRQ+fix       202821       51795         85278       36583       5m23.010s

              idle   deactivate
Baseline+fix  8024   44455
VRQ+fix       11379  40388
The fix is good for both the baseline and VRQ. Compared to the baseline without the fix, which took about 5m33s, 10~11 seconds are saved (roughly 11s out of 333s), which is about a 3% improvement.