Alfred Chen's Blog: 2014

Wednesday, December 31, 2014

VRQ 0.3 updates

On the last day of 2014, I would like to announce the 0.3 update of the VRQ solution for BFS.

It is almost a complete rework, done to address the lock dependency issues caused by the introduction of the rq lock in VRQ. I would like to do more clean-up during the 3.18 release cycle; if it goes well, the code will be moved to the -gc branch.

Here are the test results for CFS, the original BFS, the Baseline (the code VRQ is based on) and VRQ. They show that after three releases, VRQ is better than the original BFS in all kinds of workloads and comparable to CFS (again? If I remember correctly, BFS was better than CFS in the compile test some releases ago). Most important of all, VRQ opens the opportunity for further improvements, and I will give these new ideas a try next year.

Happy New Year 2015.

#1 50% task/cores ratio workload test

3.18_CFS_50
5m18.385s
5m17.836s
5m17.783s
5m17.765s
5m17.730s
5m17.663s
5m17.596s
5m17.566s
5m17.487s
5m17.455s

3.18_0460_50
5m20.448s
5m20.384s
5m20.374s
5m20.286s
5m20.245s
5m20.192s
5m20.166s
5m20.155s
5m20.150s
5m20.093s

3.18_Baseline_50
5m17.975s
5m17.552s
5m17.512s
5m17.492s
5m17.485s
5m17.482s
5m17.480s
5m17.475s
5m17.457s
5m17.342s

3.18_VRQ_50
5m17.350s
5m17.339s
5m17.318s
5m17.297s
5m17.284s
5m17.282s
5m17.248s
5m17.220s
5m17.209s
5m17.145s

#2 100% task/cores ratio workload test

3.18_CFS_100
2m50.047s
2m50.037s
2m49.825s
2m49.750s
2m49.749s
2m49.744s
2m49.706s
2m49.703s
2m49.673s
2m49.633s

3.18_0460_100
2m51.933s
2m50.640s
2m50.431s
2m50.424s
2m50.421s
2m50.386s
2m50.362s
2m50.267s
2m50.248s
2m50.129s

3.18_Baseline_100
2m51.862s
2m50.527s
2m50.506s
2m50.400s
2m50.370s
2m50.361s
2m50.283s
2m50.282s
2m50.189s
2m50.146s

3.18_VRQ_100
2m50.944s
2m49.812s
2m49.797s
2m49.700s
2m49.683s
2m49.672s
2m49.653s
2m49.649s
2m49.613s
2m49.506s

#3 150% task/cores ratio workload test

3.18_CFS_150
2m53.382s
2m53.366s
2m53.328s
2m53.326s
2m53.326s
2m53.310s
2m53.307s
2m53.262s
2m53.208s
2m53.127s

3.18_0460_150
2m57.280s
2m56.710s
2m56.124s
2m55.860s
2m55.843s
2m55.725s
2m55.646s
2m55.643s
2m55.597s
2m55.582s

3.18_Baseline_150
2m57.100s
2m55.907s
2m55.796s
2m55.788s
2m55.755s
2m55.749s
2m55.740s
2m55.736s
2m55.732s
2m55.726s

3.18_VRQ_150
2m55.449s
2m53.168s
2m52.112s
2m51.898s
2m51.649s
2m51.539s
2m51.527s
2m51.371s
2m51.272s
2m51.270s

#4 100%+50%(IDLE) task/cores ratio workload tests

3.18_CFS_100_50
2m55.069s
2m53.714s
2m53.638s
2m53.586s
2m53.474s
2m53.466s
2m53.313s
2m53.280s
2m53.196s
2m53.187s

3.18_0460_100_50
2m50.730s
2m50.713s
2m50.651s
2m50.620s
2m50.615s
2m50.598s
2m50.560s
2m50.548s
2m50.460s
2m50.457s

3.18_Baseline_100_50
2m50.826s
2m50.691s
2m50.683s
2m50.632s
2m50.601s
2m50.574s
2m50.555s
2m50.549s
2m50.549s
2m50.507s

3.18_VRQ_100_50
2m49.822s
2m49.799s
2m49.766s
2m49.706s
2m49.673s
2m49.631s
2m49.623s
2m49.587s
2m49.583s
2m49.564s

Wednesday, November 26, 2014

-gc updates and VRQ benchmark in 3.17

BFS 0458 has been released for 3.17. I have synced -gc up with it and then rebased the -gc branch onto v3.17.3 and v3.17.4; you can fetch them from my git.

With BFS 0458 released, I kicked off another round of testing to get results for BFS 0458, the Baseline of the VRQ patch and the VRQ patch itself.
The highlights are:
a. For low workload, Baseline and VRQ are ~3 seconds (1%) faster than the original BFS.
b. For heavy workload, VRQ is ~5 seconds (3%) faster than the original BFS and the baseline code.

It has been a busy time. I tried some new ideas based on the VRQ solution, but the results were not that good, so I spent most of my time debugging and haven't updated the blog. :)

Based on the VRQ test results, I believe it is a good direction to go in, especially since it shows less overhead than the original BFS under heavy load. Recently I have been thinking about improving BFS/VRQ by reducing grq usage, but the debug code exposed an existing bug in the current VRQ code, so now I have to rework the whole VRQ code all over again to fix it. In the coming 0.3 release of VRQ, I plan to fix the known issues in the vrq branch and clean up the code, which will help to stabilize VRQ and will also help new ideas that build on it. In the meantime, I have updated the -gc branch and put detailed comments into the bfs commits (where needed); hopefully CK will pick up some of them in the next BFS release.

Stay tuned. :D

#1 Jobs/Cores ratio 50%

3.17_0458_50
5m20.933s
5m20.896s
5m20.882s
5m20.867s
5m20.849s
5m20.743s
5m20.663s
5m20.621s
5m20.340s
5m20.251s

3.17_Baseline_50
5m18.065s
5m18.019s
5m17.956s
5m17.953s
5m17.941s
5m17.896s
5m17.891s
5m17.839s
5m17.826s
5m17.794s

3.17_VRQ_50
5m17.897s
5m17.856s
5m17.852s
5m17.788s
5m17.759s
5m17.748s
5m17.721s
5m17.673s
5m17.642s
5m17.634s

#2 Jobs/Cores ratio 150%

3.17_0458_150
2m56.468s
2m56.379s
2m56.377s
2m56.367s
2m56.364s
2m56.335s
2m56.322s
2m56.321s
2m56.261s
2m56.184s

3.17_Baseline_150
2m56.472s
2m56.407s
2m56.341s
2m56.312s
2m56.303s
2m56.293s
2m56.289s
2m56.288s
2m56.259s
2m56.202s

3.17_VRQ_150
2m51.553s
2m51.516s
2m51.473s
2m51.472s
2m51.448s
2m51.379s
2m51.358s
2m51.339s
2m51.338s
2m51.292s

#3 Jobs/Cores ratio 100%

3.17_0458_100
2m52.031s
2m50.718s
2m50.663s
2m50.650s
2m50.624s
2m50.603s
2m50.559s
2m50.541s
2m50.485s
2m50.468s

3.17_Baseline_100
2m50.945s
2m50.719s
2m50.683s
2m50.639s
2m50.602s
2m50.597s
2m50.583s
2m50.530s
2m50.521s
2m50.486s

3.17_VRQ_100
2m49.896s
2m49.675s
2m49.670s
2m49.658s
2m49.604s
2m49.538s
2m49.501s
2m49.480s
2m49.463s
2m49.457s

#4 Jobs/Cores ratio 100%+50%

3.17_0458_100_50
2m52.026s
2m51.384s
2m51.151s
2m51.039s
2m50.983s
2m50.901s
2m50.863s
2m50.852s
2m50.810s
2m50.699s

3.17_Baseline_100_50
2m52.161s
2m51.007s
2m50.934s
2m50.864s
2m50.861s
2m50.857s
2m50.842s
2m50.817s
2m50.789s
2m50.733s

3.17_VRQ_100_50
2m49.801s
2m49.783s
2m49.740s
2m49.730s
2m49.716s
2m49.703s
2m49.700s
2m49.666s
2m49.662s
2m49.606s

Tuesday, November 4, 2014

What's new for 3.17 gc patch set

In the last kernel release cycle, I played around with some ideas for BFS, such as how to select a task to run, sticky tasks, etc. The results were not as good as expected, but all these attempts helped me understand the BFS code better. So there will be no huge changes in the BFS VRQ solution branch; I may need another round to think it all over again. Instead, some changes that can be applied directly on top of the original bfs grq lock solution have been back-ported to the -gc branch.

3e678e6 bfs: sync with mainline sched_setscheduler() logic.
-- A new parameter type called sched_attr was introduced in 3.14 for sched_setscheduler() and related functions. BFS is not fully in sync with these changes yet.

f3a98f9 bfs: Refactory online_cpus() checking in try_preempt().
9c72372 bfs: Refactory needs_other_cpu().
-- Two changes in try_preempt().

65f06e8 bfs: priodl to speed up try_preempt().
-- Introduces the task priodl, a 64-bit value whose first 8 bits are the task prio and whose last 56 bits are the task deadline (its higher 56 bits), to speed up the try_preempt() calculation.
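
As a rough illustration of the idea (not the code from the commit), the packing can be sketched in Python. The names make_priodl() and should_preempt() are made up for this sketch, and the exact bit layout is an assumption based on the description above: prio in the top 8 bits and the upper 56 bits of a 64-bit deadline below it. With that layout, one integer comparison orders tasks by prio first and by deadline second.

```python
PRIO_BITS = 8
DEADLINE_BITS = 56
U64_MASK = (1 << 64) - 1


def make_priodl(prio, deadline):
    """Pack prio (top 8 bits) and the upper 56 bits of deadline into one value.

    Assumed layout for illustration only; the real commit may differ.
    """
    if not 0 <= prio < (1 << PRIO_BITS):
        raise ValueError("prio must fit in 8 bits")
    deadline &= U64_MASK
    return (prio << DEADLINE_BITS) | (deadline >> PRIO_BITS)


def should_preempt(waking_priodl, running_priodl):
    """A lower priodl means a more urgent prio, or the same prio with an earlier deadline."""
    return waking_priodl < running_priodl


running = make_priodl(prio=5, deadline=0x2000)
waking = make_priodl(prio=5, deadline=0x1000)
print(should_preempt(waking, running))  # same prio, earlier deadline
```

The low 8 bits of the deadline are dropped in this sketch, so two deadlines that differ only in those bits compare as equal.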

1fa3bb5 bfs: Full cpumask based and LLC sensitive cpu selection.
-- Rewrites the cpu selection logic for tasks using full cpumask-based calculation. The benefit of cpumask-based calculation is that the cost does not scale with the number of cpus as long as it stays within a certain range (64 cpus on a 64-bit system and 32 cpus on a 32-bit system). The cost of transferring tasks among cpus which share the same LLC (Last Level Cache) is considered free.
The best_mask_cpu() cpu selection logic now follows the order below:
* Non scaled same cpu as task originally runs on
* Non scaled SMT of the cpu
* Non scaled cores/threads sharing the last level cache
* Scaled same cpu as task originally runs on
* Scaled SMT of the cpu
* Scaled cores/threads sharing the last level cache
* Non scaled cores within the same physical cpu
* Non scaled cpus/cores within the local NODE
* Scaled cores within the same physical cpu
* Scaled cpus/cores within the local NODE
* All cpus available
To implement the full cpumask calculation, non_scaled_cpumask is introduced in the grq structure. The plug-in code in the cpufreq and intel_pstate drivers has also been modified to avoid multiple triggers when scaling down from the max cpu frequency (the intel_pstate change has only passed a compile test, as I have no hardware which runs on the intel_pstate driver).
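
The selection order can be sketched with plain integers used as bitmasks. This is only an illustration under assumptions: build_preference_masks() and first_cpu() are hypothetical helpers, the 8-cpu topology is invented, and the kernel code works on struct cpumask rather than Python integers. The point it shows is that each preference level costs one AND on a machine word, regardless of how many cpus the mask covers.

```python
def first_cpu(mask):
    """Return the index of the lowest set bit in mask, or None if mask is empty."""
    if mask == 0:
        return None
    return (mask & -mask).bit_length() - 1


def build_preference_masks(same, smt, llc, package, node, all_cpus, non_scaled):
    """Order the topology masks as in the list above (hypothetical helper)."""
    return [
        non_scaled & same,
        non_scaled & smt,
        non_scaled & llc,
        same,
        smt,
        llc,
        non_scaled & package,
        non_scaled & node,
        package,
        node,
        all_cpus,
    ]


def best_mask_cpu(candidates, preference_masks):
    """Pick a cpu from candidates, trying each preference level in turn."""
    for level_mask in preference_masks:
        hit = candidates & level_mask
        if hit:
            return first_cpu(hit)
    return None


# Invented 8-cpu machine: the task last ran on cpu 0, cpus 0-1 are SMT
# siblings, cpus 0-3 share the LLC, all 8 cpus form one package and one node.
topology = dict(same=0b00000001, smt=0b00000011, llc=0b00001111,
                package=0b11111111, node=0b11111111, all_cpus=0b11111111)
idle_cpus = 0b11001100  # cpus 2, 3, 6 and 7 are idle

# cpus 3 and 7 run at full frequency (non scaled) in this example
levels = build_preference_masks(non_scaled=0b10001000, **topology)
print(best_mask_cpu(idle_cpus, levels))
```

In this example the non scaled cpu sharing the LLC wins over the scaled cpu 2 in the same LLC and over the non scaled cpu 7 elsewhere in the package.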

Here are the test results of 3.17-gc. Compared to vrq-02-baseline-test-result, these patches give another 3~4 seconds of improvement under low workload.

#1 3.17.2-gc 50% tasks/cores ratio
5m18.630s
5m18.509s
5m18.504s
5m18.494s
5m18.489s
5m18.487s
5m18.481s
5m18.461s
5m18.383s
5m18.339s

#2 3.17.2-gc 150% tasks/cores ratio
2m56.475s
2m56.447s
2m56.418s
2m56.410s
2m56.402s
2m56.367s
2m56.349s
2m56.197s
2m56.173s
2m56.128s

#3 3.17.2-gc 100% tasks/cores ratio
2m52.318s
2m50.698s
2m50.654s
2m50.596s
2m50.578s
2m50.543s
2m50.534s
2m50.508s
2m50.437s
2m50.430s

#4 3.17.2-gc 100%+50% tasks/cores ratio
2m52.115s
2m51.246s
2m50.892s
2m50.880s
2m50.847s
2m50.819s
2m50.805s
2m50.804s
2m50.789s
2m50.645s

If you want to try these new patches, please check my linux-3.17.y-gc git branch.

Monday, September 22, 2014

VRQ 0.2 release

As 3.17 will be released soon, earlier than expected, VRQ development has been cut off and tagged as the 0.2 release.

Some of the changes are bug fixes and the others are improvements. Some are not related to VRQ locking, and I will see if they can be back-ported to the original BFS as baseline improvements in the next release. The detailed changes are:

3ef882c bfs: Rework swap_sticky().
-- An ongoing activity which will be continued in the next release.
e8754f9 bfs: rework resched_xxxx_idle(), basic version.
-- I will write another post to describe it in detail, but in brief, it rewrites resched_xxxx_idle() using the cpumask method.
2ae0fb6 bfs: refactory schedule() for rq&grq lock ctx switching.
cea6ce8 bfs: vrq: rq&grq locking ctx switch v3.
-- It is a bad idea to split the context_switch process into two grq locking sessions, so I turned to this solution, which holds both the rq and grq locks during context_switch.
319cd02 bfs: vrq, refactory wake_up_new_task.
79f5644 bfs: Fix need_other_cpu logic in schedule().
4d511fb bfs: RQ niffy solution.
-- Already described in previous post.
7c519a7 bfs: inlined routines update.

The test results are:

50% Ratio:
5m19.531s
5m19.519s
5m19.509s
5m19.508s
5m19.430s
5m19.376s
5m19.363s
5m19.359s
5m19.333s
5m19.299s

150% Ratio:
2m54.394s
2m53.632s
2m51.960s
2m51.929s
2m51.925s
2m51.801s
2m51.790s
2m51.747s
2m51.641s
2m51.592s

100% Ratio:
2m51.001s
2m50.150s
2m49.881s
2m49.860s
2m49.812s
2m49.770s
2m49.764s
2m49.733s
2m49.699s
2m49.660s

100%+50% Ratio:
2m49.987s
2m49.980s
2m49.916s
2m49.865s
2m49.835s
2m49.828s
2m49.802s
2m49.784s
2m49.744s
2m49.733s

Compared to vrq-02-baseline-test-result, VRQ 0.2 shows visibly better throughput than the baseline under low or heavy workload. Under the optimal workload, VRQ 0.2 is slightly better than the baseline.

If you want to give VRQ 0.2 a try, the code is located at v3.16.2-vrq.

Wednesday, September 17, 2014

VRQ 0.2: RQ niffy solution

One of the changes in VRQ 0.2 is the RQ niffy solution, which replaces grq niffies by putting a niffy into each RQ. For the original design of grq niffies, please read CK's post http://ck-hack.blogspot.com/2010/10/of-jiffies-gjiffies-miffies-niffies-and.html

Functions which need grq.niffies are time_slice_expired() and task_prio().

update_clocks() is called with the grq lock held before every time_slice_expired() call, which means there is no impact if the RQ niffy solution is used instead of grq.niffies in update_clocks() and time_slice_expired().

In task_prio(), grq.niffies can be replaced by the niffy of the current RQ. It may not be the latest niffy among all the RQs, but that is acceptable.

With the RQ niffy solution, the grq lock is no longer required for niffy updates/reads. It is designed to reduce grq lock hot spots.
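
A toy model of the difference, written as a sketch rather than as the kernel implementation: the class names, the update_clock() method and the nanosecond inputs are all invented here, and the jiffy-based clamping that bfs applies to the clock delta is left out. The shared counter needs the global lock for every update, while the per-RQ counter is only written by its owning cpu.

```python
import threading


class GlobalQueue:
    """Original design: one niffies counter guarded by the global lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.niffies = 0

    def update_clock(self, delta_ns):
        with self.lock:  # every cpu contends on this lock
            self.niffies += delta_ns


class RunQueue:
    """RQ niffy idea: each run queue advances its own counter."""
    def __init__(self):
        self.niffies = 0
        self.last_clock = 0

    def update_clock(self, now_ns):
        delta = now_ns - self.last_clock
        if delta > 0:  # ignore a clock that appears to go backwards
            self.niffies += delta
        self.last_clock = now_ns


rq0, rq1 = RunQueue(), RunQueue()
rq0.update_clock(1000)
rq0.update_clock(1500)
rq1.update_clock(1200)
# rq1 lags behind rq0; a reader of rq1.niffies sees a slightly older
# value, which is the trade-off described above.
print(rq0.niffies, rq1.niffies)
```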

For the code change, please check https://bitbucket.org/alfredchen/linux-gc/commits/f6ec6f5303cb88e7462f4321b7a29d6c8ab83e89?at=linux-3.16.y-vrq

Saturday, September 13, 2014

VRQ 0.2 Baseline Test Result

After syncing up with the kernel mainline stable release, the baseline for this VRQ cycle is frozen. Over the following 3 or 4 weeks, until the 3.16.5 or 3.16.6 release, the feature code of VRQ 0.2 will be committed.

Over the weekend, I ran the tests for the baseline and the current VRQ. The results look good. Below are the details.

50% ratio:

3.16_0456_50#
5m34.068s
5m34.061s
5m34.021s
5m33.930s
5m33.927s
5m33.923s
5m33.860s
5m33.855s
5m33.767s
5m33.754s

3.16_Baseline_50#
5m22.297s
5m22.272s
5m22.173s
5m22.159s
5m22.085s
5m22.062s
5m22.024s
5m21.983s
5m21.967s
5m21.884s

3.16_VRQ_50#
5m22.313s
5m22.089s
5m22.071s
5m22.037s
5m22.026s
5m21.964s
5m21.949s
5m21.918s
5m21.900s
5m21.782s

The result shows that commit https://bitbucket.org/alfredchen/linux-gc/commits/ad9dd03db1002717f155c859ee613641620d3ba0?at=linux-3.16.y-gc
really boosts system performance, by about 3%. VRQ is as good as Baseline in this test.

150% ratio:

3.16_0456_150 #
2m56.433s
2m56.412s
2m56.371s
2m56.354s
2m56.348s
2m56.342s
2m56.340s
2m56.327s
2m56.279s
2m56.271s

3.16_Baseline_150 #
2m57.551s
2m57.516s
2m56.365s
2m56.354s
2m56.335s
2m56.295s
2m56.290s
2m56.271s
2m56.258s
2m56.187s

3.16_VRQ_150 #
2m53.562s
2m53.048s
2m51.942s
2m51.855s
2m51.803s
2m51.786s
2m51.777s
2m51.771s
2m51.694s
2m51.585s

Baseline is as good as the original BFS; VRQ shows a performance boost of about 2%.

100% ratio:

3.16_0456_100 #
2m51.594s
2m50.640s
2m50.598s
2m50.592s
2m50.574s
2m50.535s
2m50.511s
2m50.509s
2m50.477s
2m50.434s

3.16_Baseline_100 #
2m50.702s
2m50.633s
2m50.614s
2m50.598s
2m50.579s
2m50.555s
2m50.501s
2m50.498s
2m50.445s
2m50.311s

3.16_VRQ_100 #
2m49.929s
2m49.860s
2m49.853s
2m49.836s
2m49.827s
2m49.800s
2m49.800s
2m49.788s
2m49.770s
2m49.721s

Baseline is as good as the original BFS; VRQ is a little better than Baseline.

100%+50%IdlePrio Ratio:

3.16_0456_100_50 #
2m50.928s
2m50.893s
2m50.828s
2m50.796s
2m50.794s
2m50.784s
2m50.771s
2m50.746s
2m50.675s
2m50.666s

3.16_Baseline_100_50 #
2m50.863s
2m50.861s
2m50.854s
2m50.844s
2m50.804s
2m50.736s
2m50.713s
2m50.707s
2m50.640s
2m50.536s

3.16_VRQ_100_50 #
2m50.114s
2m49.965s
2m49.926s
2m49.910s
2m49.878s
2m49.863s
2m49.817s
2m49.802s
2m49.796s
2m49.705s

The result is almost the same as in the 100% ratio test; it is worth taking a closer look at.

Friday, September 12, 2014

Branches sync-up with 3.16.2

-bfs

This branch is for bfs related development which is considered stable and applies on top of the original bfs code.

Changes:

-- Found a regression commit and reverted it for now.
-- Sync up with mainline 3.16.2

linux-3.16.y-bfs

-vrq

This branch is for development of the bfs vrq solution; it should be considered experimental and used for testing only.

Changes:
-- Sync up with mainline 3.16.2
-- Rebased latest -bfs branch.
-- Added some bug fixes to the vrq code.

linux-3.16.y-vrq

-gc

Changes:

-- sync up with mainline 3.16.2
-- merge -bfs branch instead of tracking all bfs related commits.

linux-3.16.y-gc

Saturday, August 30, 2014

GC update & VRQ dev cycle

-gc

It seems that mainline 3.16.2 will not land this week, so I have updated and tagged 3.16.1-gc.

Notable changes:
-- sync up with v3.16.1
-- sync up with BFS 0456
-- resched_best_idle() bug fix, which gives a 3% performance improvement

v3.16.1-gc

-vrq

The VRQ solution is based on BFS. In order to avoid additional sync-ups with BFS and mainline updates, I have decided to adopt the following strategy, lined up with the stable kernel update cycle.

mainline .0 .1 .2 release ---- sync up with mainline and BFS new release.
mainline .3 .4 .5 .6 release ---- put new VRQ changes in, Baseline frozen. VRQ release frozen.
mainline .7 .8 ... ---- testing and benchmark.

Regression investigation and resched_best_idle issue found

After reverting the commit which I found by bisecting, I get a result close to the Baseline, but the grq lock session is still there to work on.

I played with the grq lock in the idle task path in many ways.
1. Placing grq_lock() before update_clocks(): the result is close to the Baseline.
2. Placing grq_lock() after update_clocks(): the regression comes back.
3. A draft grq-lock-free idle fast path solution: the regression comes back.

It doesn't make sense to me that these code changes contribute to the regression, and I believe the grq lock may be just an illusion hiding another issue.

Looking back at the unusual sched_count and sched_goidle values in schedstat, sched_goidle is easier to trace than sched_count, and there are three code paths leading to sched_goidle:

1. idle = prev, !qnr
2. idle != prev and deactivate, !qnr
3. idle != prev and prev need other cpus, !qnr

So I wrote debug code to find out how these three paths contribute to sched_goidle. Here is the result:

            idle     deactivate   needother
Baseline    9727     56106        0
VRQ         27764    61276        0

It looks like schedule() is called while the idle task is running and no task is actually queued in grq, so the scheduler has to run idle again. This idle->idle code path is inefficient and should be hit as rarely as possible.

One suspicious piece of code that causes schedule() to take the idle->idle code path is the resched_best_idle(p) call. Currently, resched_best_idle(p) is called when prev is not equal to next. I wrote a patch to remove the duplicated resched_best_idle() call in 3.15.x. This time, I decided to take a close look at the conditions under which resched_best_idle() should be called.

#1 prev == idle
#2 !#1 and deactivate
#3 !#1 and !#2 and need_other_cpus
#4 !#1..3 and qnr
#5 !#1..3 and !qnr

                     #1   #2   #3   #4   #5
resched_best_idle    N    N    Y    ?    N
next != prev         ?    Y    Y    ?    N

? means that the result depends on which next task is fetched from grq.

Obviously, the current next != prev condition can't cover #1 and #2, which causes unnecessary schedule() calls. Take the 50% job/core ratio test for example: when a task is deactivated and the next task has not been generated yet, the scheduler will choose idle to run. But resched_best_idle(prev) will cause schedule() to run again on this rq, which hits the idle->idle code path.
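
The case analysis can be written down as a small decision function. This is a sketch of one reading of the table, not the actual one-line patch: classify(), old_condition() and should_resched_best_idle() are invented names, and resolving the "?" in case #4 as "only when a different next task was fetched" is an assumption.

```python
def classify(prev_is_idle, deactivate, needs_other_cpu, qnr):
    """Map the scheduler state to case #1..#5 from the list above."""
    if prev_is_idle:
        return 1
    if deactivate:
        return 2
    if needs_other_cpu:
        return 3
    return 4 if qnr else 5


def old_condition(next_is_prev):
    """Behaviour before the fix: resched whenever next differs from prev."""
    return not next_is_prev


def should_resched_best_idle(case, next_is_prev):
    """Assumed reading of the table: only cases #3 and #4 may need the call."""
    if case == 3:
        return True
    if case == 4:
        return not next_is_prev
    return False  # cases #1, #2 and #5


# A deactivated task with an empty queue: the scheduler picks idle as next.
case = classify(prev_is_idle=False, deactivate=True, needs_other_cpu=False, qnr=0)
print(case, old_condition(next_is_prev=False),
      should_resched_best_idle(case, next_is_prev=False))
```

For this deactivate case the old condition asks for a resched while the refined check does not, which corresponds to the extra idle->idle pass described above.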

That was a lot of talk, but the code change is very simple: for the baseline, just one line is added.

Baseline Fix

Below are the 50% job/core ratio throughput test results for the baseline and vrq with the fix:

               sched_count  sched_goidle  ttwu_count  ttwu_local  real
Baseline+fix   200751       52479         85490       39002       5m22.097s
VRQ+fix        202821       51795         85278       36583       5m23.010s

               idle     deactivate
Baseline+fix   8024     44455
VRQ+fix        11379    40388

The fix is good for both baseline and vrq. Compared to the baseline without the fix, which took about 5m33s, 10~11 seconds are saved, which is about a 3% improvement.

Tuesday, August 26, 2014

VRQ 0.1 schedstat result and regression on 50% ratio test

The first stage VRQ 50% job/core ratio test shows a regression of about 3 seconds compared to the Baseline, so I decided to find out the cause.

First, I looked at the schedstat output which was collected during the test. Below is the comparison between the Baseline code-base and VRQ.

*********************************************************************
sched_count  sched_goidle  ttwu_count  ttwu_local  rq_cpu_time    rq_sched_info.run_delay  rq_sched_info.pcount
251973       61299         83682       39699       181334347343   372048261887             97709
domain0 19012
domain1 24971

288154       77212         85273       43396       174958346976   317271010776             107263
domain0 18564
domain1 23882
*********************************************************************

The most remarkable difference is the larger sched_count and sched_goidle. The stat numbers don't tell what caused this, so I had to bisect to find out.

Finally, it turned out that the commit which divides update_clocks() into update_rq_clock() and update_grq_clock() caused the regression. It was a mistake to separate these two too far from each other, as bfs uses jiffies to adjust ndiff. After reverting (part of) that commit, the result comes close to the Baseline.

*********************************************************************
sched_count sched_goidle    ttwu_count    ttwu_local    rq_cpu_time        rq_sched_info.run_delay
267428            66304     82625         40292         171637450209
*********************************************************************

Tuesday, August 19, 2014

Variable Run Queue Locking(VRQL) for BFS

One of the concerns about BFS is lock contention on the single grq lock. While reading the mainline and bfs code these days, I came up with the idea of a multiple queue lock solution for BFS. In brief: introduce a per-cpu run queue lock, "like" mainline does, to guard rq data access and minimize grq locking sessions, and by this means reduce lock contention. Depending on what kind of data the code needs to access, the rq lock, the grq lock or both need to be held, which is where the name VRQL (variable run queue locking) comes from.

How the current grq lock works
Currently BFS uses the single grq lock for almost everything, including task info access, rq structure access and grq access. It is a simple design, but all operations on task/rq/grq need to acquire the grq lock.

The run queue lock
In VRQL, a raw_spin_lock will be added to the current bfs run queue structure to guard cross-cpu access to run queue data. Meanwhile, local run queue data access is still safe with preemption and interrupts disabled.

Lock strategy for cross cpu data accessing
After the run queue lock is introduced, the "all grq" locking strategy changes as follows:

a. To access task structure info: if the task is currently running, the task's rq lock needs to be held, which prevents the task from being switched out of the cpu/rq; otherwise, the grq lock needs to be held, which prevents the task from being taken from or put into grq.
task_access_lock...()/task_access_unlock...() inlined routines are introduced for this purpose.

b. To access the run queue structure, the rq's lock needs to be held.
For the convenience of accessing BOTH the task's structure and the task's rq structure, task_vrq_lock_...()/task_vrq_unlock...() inlined routines are introduced, which first take the task's rq lock and then acquire the grq lock if the task is not running.

c. To access the grq structure, the grq lock needs to be held.
rq_grq_lock...()/rq_grq_unlock...() inline routines are introduced to lock down both the given rq and grq.
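
Rules a and c can be modelled in user space with ordinary mutexes. The sketch below is only a model under stated assumptions: the kernel routines use raw spinlocks with interrupts disabled, the Python classes and the retry loop are invented for illustration, and taking the rq lock before the grq lock in rq_grq_lock() is an assumption carried over from the order described for task_vrq_lock_...(). It also assumes that a task can only start running while the grq lock is held, which is what makes the second check safe.

```python
import threading
from contextlib import contextmanager


class RunQueue:
    """Per-cpu run queue with its own lock (stand-in for the kernel rq)."""
    def __init__(self, cpu):
        self.cpu = cpu
        self.lock = threading.Lock()
        self.curr = None  # task currently running on this cpu


class Task:
    def __init__(self, name):
        self.name = name
        self.rq = None  # run queue the task is associated with


grq_lock = threading.Lock()  # stand-in for the single global queue lock


@contextmanager
def task_access_lock(task):
    """Rule a: hold the rq lock for a running task, otherwise the grq lock."""
    while True:
        rq = task.rq
        if rq is not None:
            rq.lock.acquire()
            if rq.curr is task:  # still running here: the rq lock pins it
                held = rq.lock
                break
            rq.lock.release()
        grq_lock.acquire()
        rq = task.rq
        if rq is None or rq.curr is not task:  # not running: grq lock pins it
            held = grq_lock
            break
        grq_lock.release()  # the task started running meanwhile, so retry
    try:
        yield held
    finally:
        held.release()


@contextmanager
def rq_grq_lock(rq):
    """Rule c: lock down both the given rq and the grq, rq lock first."""
    with rq.lock:
        with grq_lock:
            yield


rq = RunQueue(cpu=0)
task = Task("worker")
task.rq = rq
rq.curr = task
with task_access_lock(task) as held:
    print("running task guarded by rq lock:", held is rq.lock)
```

Keeping a single acquisition order for the two locks in every helper is what avoids lock-order inversions between cpus.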

Minimize grq locking session
Adding the above locking strategy will not automatically improve performance. Instead, it generates additional locking overhead, which hurts performance. The performance gain comes from identifying and narrowing the grq sessions, both in their number and in the run time spent in them.

To keep things simple and avoid surprises when playing with the grq sessions, a refactoring approach is used. That means changing just the code itself but *NOT* the bfs functionality logic, as far as possible.

First Stage
In this stage, the first priority is to apply the new locking strategy to bfs and make it work. The second priority is refactoring schedule(), as it is the hot spot of the scheduling core code. For the rest of the code, a conservative rule is used.

There are 4 concepts in schedule().
a. the rq/cpu on which schedule() happens.
b. prev task, the task the rq is running (rq->curr) when schedule() begins.
c. next task, the task the rq will run after schedule().
d. grq, which stores the queued tasks; the next task comes from it and the prev task is put back into it when a switch happens.

Changes in schedule()

First of all, the grq lock is replaced by rq->lock when the locking strategy is applied.

In the first part of schedule(), from the need_resched label to the blk_schedule_flush_plug() block, most of the code is safe under rq lock guarding, as only the prev task data structure is accessed.
Only try_to_wake_up_local() needs the grq lock to be held in this part. A debug test shows that it is not a hot code path of schedule(), with fewer than 50 hits during system boot plus a suspend/resume cycle, so try_to_wake_up_local() is simply wrapped in a grq_lock() and grq_unlock() pair.

The second half of schedule() consists of the idle!=prev block, the fetch-next block and the context_switch block. To narrow the grq locking session, the spots which need read/write access to grq were highlighted and then refactored. The code shows what was finally done. In brief, update_clocks() has been refactored into 2 calls, update_rq_clock() and update_grq_clock(), and the latter only needs to be called when idle!=prev and before check_deadline(); update_cpu_clock_switch() has been divided into two versions, with _idle and _nonidle suffixes, each called in its own branch. The grq locking session ends before context_switch().

All this refactoring should *NOT* change the original bfs functionality logic, *EXCEPT* that the prev task needs to be taken care of: prev has been returned to grq before context_switch(), so once the grq lock is released before context_switch(), prev may be fetched by another rq/cpu and run there while it has not actually finished on this rq. There are 2 options to solve this:

1. dequeue prev task from grq and hold it within rq during context_switch(), once context_switch() done, return prev to grq.
2. hold grq lock during context_switch() so prev is still safe in grq.

Currently, option 1 is used; the final choice will be made later based on further testing and optimization.

Scalability&Critical Spot
In VRQL, the run queue lock is a per-cpu design, so it scales as the number of cpus goes up. The grq lock sessions are still the critical spots of the scheduling code.

Testing

a. Four jobs/cores ratio tests are designed to exercise the possible code paths in schedule(): the 50%, 100% and 150% jobs/cores ratio tests, which simply set -j<num> according to the number of cpu cores of the tested machine, and the 100% + 50% test, which runs a 50% ratio mprime test at SCHED_IDLEPRIO priority using schedtool and then runs a 100% ratio kernel compilation.
b. Kernel compilation throughput tests are done on Intel Xeon L5420 (quad core).
c. Two different bfs code-base kernels are used:
1. 3.16_Baseline   3.16 kernel with BFS 0449 + my bfs patch set
2. 3.16_VRQ        3.16 kernel with BFS 0449 + my bfs patch set + VRQ locking patch

Test steps:
1. Boot the target kernel into text mode.
2. Do a first-time compile of the kernel code (the first compile always takes longer; it is used to get the files into the cache).
3. Compile the kernel code with -j<num> and record the real time used; choose <num> based on the jobs/cores ratio under test and the machine in use.
4. Repeat step 3 10 times, exclude the max and min times recorded and calculate the average of the rest.
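
The averaging in step 4 can be reproduced with a few lines of Python. This is a helper sketch, not the script used for the tests; the function names are invented, and it assumes the times are in the `real` format printed by the shell's time command (for example 5m17.350s). The sample list is the 3.18_VRQ_50 run posted in the VRQ 0.3 entry.

```python
import re


def parse_real_time(text):
    """Convert a value such as '5m17.350s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def trimmed_average(samples):
    """Drop the single largest and smallest sample, then average the rest."""
    if len(samples) < 3:
        raise ValueError("need at least three samples")
    values = sorted(parse_real_time(s) for s in samples)
    kept = values[1:-1]
    return sum(kept) / len(kept)


def format_real_time(seconds):
    """Format seconds back into the XmY.YYYs style."""
    minutes, rest = divmod(seconds, 60)
    return f"{int(minutes)}m{rest:.3f}s"


runs = ["5m17.350s", "5m17.339s", "5m17.318s", "5m17.297s", "5m17.284s",
        "5m17.282s", "5m17.248s", "5m17.220s", "5m17.209s", "5m17.145s"]
print(format_real_time(trimmed_average(runs)))
```

Dropping the extremes keeps a single slow run, such as one disturbed by a cold cache, from skewing the average.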

                 50%          100%         150%         100%+50%
3.16_Baseline    5m32.978s    2m50.706s    2m56.419s    2m50.656s
3.16_VRQ         5m35.576s    2m50.432s    2m56.642s    2m50.731s

Conclusion
The first stage tests show that the performance gained is comparable to the overhead caused by the additional locking. Two notable items:
1. In the 50% jobs/cores ratio test, because not all cores are used, the performance gain from refining grq locking is less than the additional locking overhead.

2. Although the VRQ 50% test is worse than the Baseline, the 100% test matches the Baseline, which shows that VRQ gains performance by avoiding some lock contention, as it is designed to.

Hopefully in the coming stage of VRQ coding, the VRQ tests can get better results.

The Code
A new branch has been created for VRQ. The code should be considered highly experimental and should be used for testing only.

Branch: https://bitbucket.org/alfredchen/linux-gc/downloads/0450-enable-polling-check.patch

3.16_Baseline at commit: badf57e bfs: Fix runqueues decarelation.
3.16_VRQ at commit: ebe6b8e bfs: (5/6) Refine schedule() grq locking session, cnt.

VRQ related commits are:
ebe6b8e bfs: (5/6) Refine schedule() grq locking session, cnt.
d59b75c bfs: Make grq cacheline aligned.
e5de0ef bfs: (6/6) Implement task_vrq_lock...() grq_test_lock() logic.
307cc5e bfs: (5/6) Refine schedule() grq locking session.
e2f1be3 bfs: (4/6) use rq&grq double locking.
a61ee81 bfs: (3/6) isolate prev task from grq.
9aad936 bfs: (2/6) other code use grq&rq double locking.
482345f bfs: (1/6) schedule() use grq&rq double locking.

What next
1. It would be nice to have feedback from more testers, especially those who run on multiple cores or cores with SMT, and owners of amd cpus.
2. Sync-up with 3.16 code and new BFS release.
3. Identify what the next stage of code changes should focus on and move on.