Tuesday, August 26, 2014

VRQ 0.1 schedstat result and regression on 50% ratio test

The first-stage VRQ test at a 50% job/core ratio shows a regression of about 3 seconds compared to the Baseline, so I decided to find out the cause.

First, I looked at the schedstat output collected during the test. Below are the results for the Baseline code-base and for VRQ; the first block is the Baseline and the second is VRQ.

sched_count  sched_goidle  ttwu_count  ttwu_local  rq_cpu_time   rq_sched_info.run_delay  rq_sched_info.pcount
251973       61299         83682       39699       181334347343  372048261887             97709
domain0 19012
domain1 24971

288154       77212         85273       43396       174958346976  317271010776             107263
domain0 18564
domain1 23882

The most notable difference is that VRQ has a larger sched_count and sched_goidle. The numbers alone don't tell what caused this, so I had to bisect to find out.
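
To put the difference in proportion, here is a small sketch that turns the two schedstat rows into relative changes. It assumes the first row is the Baseline and the second is VRQ, which follows the order in which they are introduced; the helper percent_change() is only a name made up for this illustration.

```python
# Counter values copied from the two schedstat rows above.
# Assumption: first row = Baseline, second row = VRQ.
FIELDS = [
    "sched_count",
    "sched_goidle",
    "ttwu_count",
    "ttwu_local",
    "rq_cpu_time",
    "rq_sched_info.run_delay",
    "rq_sched_info.pcount",
]
baseline = [251973, 61299, 83682, 39699, 181334347343, 372048261887, 97709]
vrq = [288154, 77212, 85273, 43396, 174958346976, 317271010776, 107263]


def percent_change(old, new):
    """Relative change from old to new, in percent (hypothetical helper)."""
    return (new - old) / old * 100.0


# Map each counter name to its change relative to the Baseline.
changes = {
    name: percent_change(b, v) for name, b, v in zip(FIELDS, baseline, vrq)
}

for name, pct in changes.items():
    print(f"{name:<26} {pct:+7.2f}%")
```

With these numbers, sched_count is up by roughly 14% and sched_goidle by roughly 26%, while ttwu_count moves by only about 2%.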

Finally, it turns out that the commit which divides update_clocks() into update_rq_clock() and update_grq_clock() causes the regression. It is a mistake to separate these two calls too far apart, because BFS uses jiffies to adjust ndiff. After reverting part of that commit, the result moves closer to the Baseline.

sched_count  sched_goidle  ttwu_count  ttwu_local  rq_cpu_time   rq_sched_info.run_delay
267428       66304         82625       40292       171637450209

