Alfred Chen's Blog: 2015

Monday, December 28, 2015

VRQ v4.3_0466_4 test patch

When BFS 0466 came out, I rebased the -gc and -vrq branches upon it and did some sanity tests. In short, 0466 improves throughput when the workload is >=100% for both the bare bfs code and the -gc branch: it's 2m34.6s for 0466 -gc compared to the original 2m36.7s at 300% workload. It's good to see CK improving bfs, as he had stopped adding new features to bfs for a long time. It's known now, though, that bfs 0466 causes interactivity regressions, and CK released bfs 0467 to address this issue by manually setting the schedule option.

When continuing work on the -vrq branch, I found a regression in the performance sanity tests with the bfs 0466 code changes. I used bisect to find out that the commit "bfs/vrq: [3/3] preempt task solution" contributes to the regression. The original design of the task_preemptable_rq() function is to go through all cpus to find the running task with the highest priority/deadline and set its rq as the target rq to be preempted. In this way, only the highest priority/deadline task gets preempted. But going through "all" cpus to find the best one seems to be an overhead which causes the regression. So, alternatively, it is now changed to select the first cpu which is running a task with priority/deadline higher than the given task as the cpu/rq to be preempted. With this code change, the best sanity benchmark result so far is recorded for the -vrq branch.
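The difference between the two selection strategies can be sketched in a few lines. This is only an illustration, not the actual task_preemptable_rq() code: the function names `scan_all` and `first_match` are made up here, each cpu is reduced to the (priority, deadline) pair of its running task, and a larger pair is assumed to mean a less urgent task.

```python
from typing import List, Optional, Tuple

# Hypothetical key: (prio, deadline). A larger key is ASSUMED to mean a
# less urgent task, i.e. a better candidate to be preempted.
Key = Tuple[int, int]


def scan_all(running: List[Key], task: Key) -> Optional[int]:
    """Original idea: visit every cpu and return the one running the task
    with the highest key, provided it is higher than the given task's."""
    best_cpu = None
    best_key = task
    for cpu, key in enumerate(running):
        if key > best_key:
            best_cpu, best_key = cpu, key
    return best_cpu


def first_match(running: List[Key], task: Key) -> Optional[int]:
    """New idea: stop at the first cpu whose running task has a key
    higher than the given task's."""
    for cpu, key in enumerate(running):
        if key > task:
            return cpu
    return None


running = [(120, 50), (120, 90), (120, 70), (120, 95)]
waking = (120, 60)
print(scan_all(running, waking))     # 3
print(first_match(running, waking))  # 1
```

The first-match version may preempt a task that is not the overall best candidate (cpu 1 instead of cpu 3 here), but it can return without visiting every cpu, which is the trade-off described above.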

After removing the performance bottleneck, it's time to handle the interactivity issue. In original bfs, sticky tasks are (a) not allowed to run on a cpu which is scaling, and (b) given cpu affinity by adjusting the deadline. Looking back at the bfs 0466 code changes, they give cpu affinity not only to sticky tasks, but to *ALL* tasks. This improves performance but also impacts interactivity at the same time. When ck released bfs 0467 to address the interactivity issue, it introduced a run-time option to do the switching. Considering that in -vrq the sticky task has been replaced by the cache task mechanism, and that scost and the cached time-out were introduced to control when a task should be cached, I decided to use the existing code in vrq to balance task performance and interactivity.

First of all, mark all tasks switched off the cpu as "cached"; previously only the tasks which still need cpu (in activate schedule) were marked "cached".
Secondly, mark all newly forked tasks "cpu affinity"; based on the tests, this also contributes to the performance improvement.
Thirdly, after the bottleneck was removed, the SCOST design was truly tested. It turns out it is not working as expected (a huge threshold doesn't impact performance), at least for my sanity test pattern (all gcc instances share the binary; only PSS differs among the gcc threads running at the same time). It looks like SCOST may not be a good design, in other words it may be a bad design, because the threshold was tested under one certain pattern, and for other patterns it may impact performance or interactivity. The SCOST code still exists in this patch, but it is not functional at all and will be removed when I clean up the commits.

Now the only control over caching tasks and "cpu affinity" tasks is the cached time-out, and it's a per-task-policy setting. For example, batch/idle tasks have an unlimited cached time-out, as users don't care about their interactivity. In the implementation, the unlimited time-out is set to 1 second. For rt tasks, the time-out is set to the default rr-interval (6ms). For normal tasks, which users most likely run, the time-out depends on the preemption model in the kernel config: when it is configured as CONFIG_PREEMPT_NONE, which means the machine tends to be used as a server and doesn't care about task interactivity, the cached wait time is unlimited; otherwise the time-out is set to the default rr-interval (6ms).
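The policy table above can be written down as a small lookup. This is a sketch of the rules as described, not the kernel code: the function name, the policy strings and the nanosecond unit are all assumptions made for illustration.

```python
MS = 1_000_000  # nanoseconds per millisecond (unit chosen for this sketch)

DEFAULT_RR_INTERVAL_NS = 6 * MS   # default rr-interval, 6ms
UNLIMITED_TIMEOUT_NS = 1000 * MS  # "unlimited" is implemented as 1 second


def cached_timeout_ns(policy: str, preempt_none: bool) -> int:
    """Return the cached time-out for a task policy.

    `policy` is one of the hypothetical labels "batch", "idle", "rt",
    "normal"; `preempt_none` mirrors CONFIG_PREEMPT_NONE being selected.
    """
    if policy in ("batch", "idle"):
        return UNLIMITED_TIMEOUT_NS
    if policy == "rt":
        return DEFAULT_RR_INTERVAL_NS
    if policy == "normal":
        # Server-style kernels trade interactivity for throughput.
        return UNLIMITED_TIMEOUT_NS if preempt_none else DEFAULT_RR_INTERVAL_NS
    raise ValueError(f"unknown policy: {policy}")


for policy in ("batch", "idle", "rt", "normal"):
    print(policy, cached_timeout_ns(policy, preempt_none=False) // MS, "ms")
```

On a kernel that is not built with CONFIG_PREEMPT_NONE, only batch and idle tasks get the long time-out in this sketch; everything else falls back to the 6ms interval.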

The interactivity test done is normal-policy mpv h264 playback, with no frame drops, while a normal-policy nice 19 300% compile workload runs in the background.

Batch policy 300% workload compile benchmark:
Previous vrq -- 2m37.840s
Current vrq -- 2m33.831s

Idle policy 300% workload compile benchmark:
Previous vrq -- 2m35.673s
Current vrq -- 2m35.154s

Normal policy 300% workload compile benchmark:
Previous vrq -- 2m37.005s
Current vrq -- 2m36.633s

The results are ok and the new vrq patch is ready for user testing; the all-in-one patch file for kernel 4.3 is uploaded at the bitbucket download page. It will need more time to clean up the commits before updating the git; I'd like to finish it during the new year holiday.

Happy New Year and have fun with this new -vrq patch; your feedback is welcome.

BR Alfred

Thursday, December 10, 2015

GC and VRQ branch update for v4.3.1 and latency test

Finally the first stable release for 4.3 has come, and the gc and vrq branches are updated with the bug fixes made during these few weeks.

*A non-return error when SMT_NICE is enabled (though SMT_NICE is not recommended for VRQ)
*Go through the threads list with tasklist_lock held when a cpu hotplugs. It's for both the gc and vrq branches.

*Task caching scheduling Part III; as usual I will write another post for it.

The gc branch for v4.3.1 can be found at bitbucket and github.
The vrq branch for v4.3.1 can be found at bitbucket and github.

One more thing: I have wanted to add more tests/benchmarks for scheduling for a long time, and I finally found one yesterday. It is Cyclictest; you can check the details on this wiki (it's a little old but it's a good starting point). Based on my research, it is scheduler independent and uses no scheduler statistics.

Here are my first idle-workload cyclictest results for v4.3 cfs, bfs and vrq. (I'm still playing with it.)

4.3 CFS
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.05 0.04 0.05 1/219 1504

T: 0 ( 1499) P:80 I:10000 C: 10000 Min:   1831 Act:    2245 Avg:    2413 Max:   12687
T: 1 ( 1500) P:80 I:10500 C:   9524 Min:   1917 Act:    2965 Avg:    2560 Max:    7547
T: 2 ( 1501) P:80 I:11000 C:   9091 Min:   1702 Act:    2254 Avg:    2313 Max:   10650
T: 3 ( 1502) P:80 I:11500 C:   8696 Min:   1546 Act:    2297 Avg:    2274 Max:   13723

4.3 BFS
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.15 0.10 0.04 1/234 1540

T: 0 ( 1536) P:80 I:10000 C: 10000 Min:   1437 Act:    2002 Avg:    1893 Max:   10912
T: 1 ( 1537) P:80 I:10500 C:   9524 Min:   1427 Act:    2010 Avg:    1907 Max:    7534
T: 2 ( 1538) P:80 I:11000 C:   9091 Min:   1402 Act:    1755 Avg:    1902 Max:   13059
T: 3 ( 1539) P:80 I:11500 C:   8696 Min:   1408 Act:    1878 Avg:    1866 Max:   12921

4.3 VRQ
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.00 0.01 0.00 0/226 1607

T: 0 ( 1602) P:80 I:10000 C: 10000 Min:   1349 Act:    1785 Avg:    1647 Max:    4934
T: 1 ( 1603) P:80 I:10500 C:   9524 Min:   1355 Act:    1464 Avg:    1642 Max:   12378
T: 2 ( 1604) P:80 I:11000 C:   9091 Min:   1334 Act:    1926 Avg:    1676 Max:   12544
T: 3 ( 1605) P:80 I:11500 C:   8696 Min:   1350 Act:    1801 Avg:    1627 Max:   10989
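To compare runs like the ones above without eyeballing the columns, the per-thread lines can be parsed and summarized. The snippet below is a small helper written for this post's output format only; the names `parse_cyclictest` and `summarize` are made up, and the sample is the 4.3 VRQ run copied from above.

```python
import re

# One cyclictest result row: thread, tid, priority, interval, count,
# then the Min/Act/Avg/Max latency columns.
ROW_RE = re.compile(
    r"T:\s*(\d+)\s*\(\s*(\d+)\)\s*P:\s*(\d+)\s*I:\s*(\d+)\s*C:\s*(\d+)"
    r"\s*Min:\s*(\d+)\s*Act:\s*(\d+)\s*Avg:\s*(\d+)\s*Max:\s*(\d+)"
)
FIELDS = ("thread", "tid", "prio", "interval", "count",
          "min", "act", "avg", "max")


def parse_cyclictest(text):
    """Return one dict per 'T:' row; all other lines are ignored."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.search(line)
        if m:
            rows.append(dict(zip(FIELDS, map(int, m.groups()))))
    return rows


def summarize(rows):
    """Lowest Min, mean of the per-thread Avg, and worst Max."""
    return {
        "min": min(r["min"] for r in rows),
        "mean_avg": sum(r["avg"] for r in rows) / len(rows),
        "max": max(r["max"] for r in rows),
    }


VRQ_SAMPLE = """\
T: 0 ( 1602) P:80 I:10000 C: 10000 Min:   1349 Act:    1785 Avg:    1647 Max:    4934
T: 1 ( 1603) P:80 I:10500 C:   9524 Min:   1355 Act:    1464 Avg:    1642 Max:   12378
T: 2 ( 1604) P:80 I:11000 C:   9091 Min:   1334 Act:    1926 Avg:    1676 Max:   12544
T: 3 ( 1605) P:80 I:11500 C:   8696 Min:   1350 Act:    1801 Avg:    1627 Max:   10989
"""

print(summarize(parse_cyclictest(VRQ_SAMPLE)))
# {'min': 1334, 'mean_avg': 1648.0, 'max': 12544}
```

Feeding the CFS and BFS blocks through the same two functions gives directly comparable numbers for the three schedulers.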

Enjoy gc/vrq on v4.3.1 and try cyclictest if you care about latency and task interactivity.

BR Alfred

Edit:
If you have a failed s2ram/resume issue with this new gc/vrq release, you can try the 2 patches below (one for gc and one for vrq) and see if they help.
4.3_gc3_fix.patch and 4.3_vrq1_fix.patch

Tuesday, November 17, 2015

New vrq patch for v4.3

I'd like to wait for v4.3.1 before officially releasing the new gc and vrq code, but it looks like a test patch would be welcome before that.

Here it comes: please download vrq_v4.3_0465_2.patch from the bitbucket download page; it contains the bfs 0465 rebase and part 3 of the task caching commits. Feel free to give it a try and report back.

PS, I got feedback from a user who reports that wine gaming has a better experience with vrq, like mouse movement etc. It turns out the initial idea of vrq, reducing grq lock sessions, helps.

BR Alfred

Monday, November 9, 2015

GC and VRQ branch update for v4.3

GC branch has been updated at bitbucket and github; the gc version has been bumped to v4.3_0463_1 with a minor v4.3 sync-up update. The all-in-one patch file is at gc_v4.3_0463_1.patch.

VRQ branch has also been updated at bitbucket and github; the version is v4.3_0463_1_vrq0, and new commits for vrq will be added upon it. The all-in-one patch file is at gc_v4.3_0463_1_vrq0.patch.

Have fun with 4.3 kernel.

BR Alfred

Tuesday, November 3, 2015

gc_v4.3_0463_0 patch released

The "zero" version of the -gc patch for the 4.3 kernel has been released at the bitbucket download page. It's a single patch file you can apply upon the vanilla kernel source tree. I'm working on the rest of the commits on the -gc branch and waiting for upstream patch updates for v4.3 before pushing the -gc branch to public git.

Pls report back if you have issues with the patch. I'm still looking at the minor changes and may bump the version to "one" when it's officially released.

BR Alfred

Monday, October 26, 2015

-gc and -vrq update for v4.2.4

There are API updates in v4.2.4, so here comes the -gc and -vrq branch updates.

You can check out the latest code from bitbucket or github for the gc branch. The main sync-up commit is e8d2e33 bfs: [Sync] sync-up with v4.2.4

For the -vrq branch, besides the sync-up commit 2f4a9ac bfs: [Sync] sync-up with v4.2.4, there is a bug fix commit 7de8ad5 bfs/vrq: Fix preemptible code in sys_sched_yield(); this commit addresses the "using smp_processor_id() in preemptible [00000000] code" error in sys_sched_yield(). The vrq branch is updated at bitbucket and github.

BR Alfred

Friday, September 25, 2015

Consider cache in task scheduling Part 2

In this part, let's look at the first factor of task caching: the cpu cache size. Talking about cache, there was a fight between Intel and AMD over the cache size in their cpu designs years ago. Intel tends to have a large cpu cache while AMD uses less. I remember one of AMD's explanations was that gaming software doesn't use a large cache. I kind of agree with that.

IMO, the cpu cache size, especially the LLC (Last Level Cache) size, determines the hardware capacity for how much data can be cached for the cpu. Looking at this another way, given a sequence of task switches, the cpu cache size determines how long the data of a task can be kept in cache. For a system with a large workload, where a large number of tasks are running at the same time, a cpu with a larger cache keeps tasks' data in cache better than a cpu with a smaller cache. For workloads that need a short response time, like gaming, a large cache doesn't help much. So AMD is right about this, but a large cache design is good for common workloads, not just for workloads like gaming.

Task scheduling should take the cpu cache size (llc size) into account. In the latest 4.2 vrq branch, a new commit implements the first version of this code change, which is aware of the cpu's llc size and automatically adjusts cache_scost_threshold (the cache switch cost threshold) by it. (For the concept of the cache switch cost threshold, please refer to Part 1.) Here is a brief of what has been done.

1. Export a scheduler interface to cacheinfo (drivers/base/cacheinfo.c) as a callback once cacheinfo is ready.
2. Implement a simple linear formula to automatically adjust cache_scost_threshold.

For now this formula is based on the intel core2 cpu topology and a 64bit kernel. Every 512KB of LLC size increases the CACHE_SCOST_THRESHOLD value by 3.
For a 32bit kernel, considering that 32bit uses smaller data/instruction sizes, every 256KB of LLC size increases the CACHE_SCOST_THRESHOLD value by 2, but I don't have 32bit systems to prove this yet.
Newer cpu topology like smt is not yet considered in the formula, because there are bounces in the benchmark results when SMT is enabled, so it's hard to compare the results of different CACHE_SCOST_THRESHOLD values.

3. A kernel boot parameter "cache_scost_threshold" is introduced; it can be used to manually assign the CACHE_SCOST_THRESHOLD value to the scheduler if

cacheinfo is not available on your arch (like arm, most?) or
the simple linear formula does not cover your cpu topology and you want to find the best value for it.

It's still the first version; in the next version, I'd like to complete the formula to cover the SMT topology.
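As a rough illustration of the linear rule in item 2 and the manual override in item 3, here is a sketch. The function name, the `base` offset (taken as 0 here), the `override` argument standing in for the boot parameter, and the rounding down to whole steps are assumptions; the post only gives the per-step increments.

```python
from typing import Optional


def scost_threshold(llc_size_kb: int, is_64bit: bool = True,
                    base: int = 0, override: Optional[int] = None) -> int:
    """Linear threshold from LLC size.

    64bit: +3 per 512KB of LLC; 32bit: +2 per 256KB of LLC.
    `override` mimics a manually supplied cache_scost_threshold value and,
    when given, wins over the formula (assumed behaviour).
    """
    if override is not None:
        return override
    step_kb, step_value = (512, 3) if is_64bit else (256, 2)
    # Partial steps are dropped in this sketch (an assumption).
    return base + (llc_size_kb // step_kb) * step_value


print(scost_threshold(3072))                   # 3MB LLC, 64bit -> 18
print(scost_threshold(3072, is_64bit=False))   # 3MB LLC, 32bit -> 24
print(scost_threshold(3072, override=10))      # manual value   -> 10
```

With these increments a 32bit kernel ends up with a higher threshold than a 64bit one for the same LLC size, which matches the reasoning that 32bit code and data occupy less cache per task.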

... to be continued (Part 3)

Thursday, September 24, 2015

VRQ branch for 4.2 release

Just want to post a notice that the vrq branch for 4.2 is released, at both bitbucket and github, and its version is tagged v4.2_0463_2_vrq1.

What's new:
1. Combine a few commits into the VRQ solution and bump its version

3edb4d4 bfs/vrq: VRQ solution v0.6

2. Rename rq->nr_interruptible to rq->nr_uninterruptible, to keep in sync with mainline code

dcdc064 bfs/vrq: Instroduce rq->nr_uninterruptible

3. Task caching scheduler part 2, there will be an individual post for it

8f5e7d2 bfs/vrq: cache scost threshold adjust by llc size

Meanwhile, I'm trying to improve performance for SMT using the task caching property; unfortunately, as usual, not every try turns out a positive result. Anyway, have fun with this new 4.2 vrq release.

BR Alfred

Wednesday, September 16, 2015

gc_v4.2_0463_2 released

As the title says, there are just two new updates in this release

1. Remove resched_closest_idle() as planned in the previous release post. I haven't got the feedback yet, but considering more reports are coming, both calls are removed in this release. The modified commit is

da50716 bfs: Full cpumask based and LLC sensitive cpu selection, v4

2. Fix the update_cpu_load_nohz() redefinition error when CONFIG_NO_HZ_COMMON is undefined; this was missed in the 4.2 sync-up. The modified commit is

75fd9b3 bfs: [Sync] 4.2 sync up v2

The new -gc branch can be found at bitbucket and github. An all-in-one patch which includes all bfs related changes (*NOT* all commits in the -gc branch) is also provided.

Have fun with this new -gc release; the -vrq branch update is incoming.

BR Alfred

Friday, September 11, 2015

Consider cache in task scheduling Part 1

Recently, over the past two releases, I have been working on the idea of taking task caching state into account in the scheduler code, and implemented the first part in later commits on the 4.0 and 4.1 releases. Now it's in 4.2 and I am working on the second part of the code, so it's time to give it a brief introduction.

Till now, the cpu cache is still like a "black hole" piece of hardware from the view of software, yet it impacts computer performance greatly, so it's still worthwhile to emulate task cache usage even if it's not 100% accurate. Here are the factors that should be considered

1. Cache size of the cpu

A larger cache size gives a larger hardware capability to cache a task. Talking about the cache size, we specifically mean the LLC (Last Level Cache) size, which usually is the largest among all levels of cache and is shared by multiple cpu cores/smts.

2. How many tasks have been switched and their caching status and memory footprint after the given task has been switched off the cpu.

The more tasks have been switched, the more likely the given task's cached data has been flushed away.
Switching to a task that is already cached will likely have less impact on other tasks' cached data.
Switching to a task with a larger memory footprint will more likely impact other tasks' cached data than switching to a task with a smaller memory footprint.

3. How long the given task has been switched off the cpu

And these factors restrict each other. The commit

ed20056 bfs: vrq: [2/2] scost for task caching v0.7

focuses on the 2nd factor above and introduces a switch cost value in each run queue to emulate the impact of task switching on the cache data on the cpu. The implementation is in the function update_rq_switch().

When a task switches off a run queue, it records the run queue's current switch cost value into cache_scost in its task structure. This information, combined with the cpu the task ran on and the time it has been switched off (not considered yet), makes up the caching property of the task, and can be used later to estimate whether the task is cache hot or cool.

The algorithm is very simple, as just one factor is taken into account in this commit. Whether the task is cache hot or cool is decided by comparing the delta between the scost value the rq currently has and the one recorded when the task switched off against a given threshold. The threshold is currently a hard-coded value optimized for my test-bed system, and it will be adjusted when other factors are taken into account.

Currently, the task caching properties are just used to replace the sticky task implementation in original BFS, and for sure they will be used in further development. Talking about the sticky task replacement using task caching, the code is simpler now
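The bookkeeping described above fits in a few lines. This is a toy model, not the code in update_rq_switch(): the class and function names are invented, every switch is assumed to cost 1, and "hot" is assumed to mean a delta strictly below the threshold.

```python
from dataclasses import dataclass


@dataclass
class RunQueue:
    scost: int = 0  # accumulated switch cost of this run queue

    def switch(self, cost: int = 1) -> None:
        """Account one task switch (unit cost is an assumption)."""
        self.scost += cost


@dataclass
class Task:
    cache_scost: int = 0  # rq scost recorded when the task was switched off


def switch_off(task: Task, rq: RunQueue) -> None:
    """Remember the rq's switch cost at the moment the task leaves the cpu."""
    task.cache_scost = rq.scost


def is_cache_hot(task: Task, rq: RunQueue, threshold: int) -> bool:
    """The task counts as cache hot while the rq has accumulated less than
    `threshold` switch cost since the task was switched off."""
    return rq.scost - task.cache_scost < threshold


rq, task = RunQueue(), Task()
rq.switch(); rq.switch()
switch_off(task, rq)          # task.cache_scost == 2
for _ in range(3):
    rq.switch()               # delta is now 3
print(is_cache_hot(task, rq, threshold=5))  # True
for _ in range(3):
    rq.switch()               # delta is now 6
print(is_cache_hot(task, rq, threshold=5))  # False
```

Because the state lives in the task and is evaluated on demand, any number of tasks from the same run queue can be "hot" at once, which is what makes the sticky task replacement simpler.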
1. A run queue doesn't need to reference its sticky task, and multiple tasks from a run queue can be "sticky" (caching) at the same time.
2. A run queue doesn't need to track its sticky task's sticky state; now the cached state of the task is "auto" updated during task selection (in the earliest_deadline_task() function).

Pros:

With this commit, and with the cache scost threshold optimized for my test-bed system, the sanity test shows better performance in all kinds of workload.

Cons:
Depending on which threshold is chosen, a task may wait in the grq for the cache hot cpu to pick it up longer than in the original BFS "sticky task" design. This will impact the task's response time, and an improvement is planned in further development.

... to be continued (Part 2)

Thursday, September 3, 2015

4.2 Sync up completed for -gc branch

I'm glad to announce that my 4.2 bfs sync-up work has been completed for the -gc branch. There was lots of sync-up work in 4.2; considering the code diff between -gc and -vrq, some of the sync-up code will just be done in the -vrq branch.

Basically, there is no new code for -gc branch, just pick up

fa0c119

bfs: [Fix] Add additional checking in sched_submit_work()

from my latest post.

There are three reports in total of issues caused in resched_closest_idle(). I am considering removing this function in -gc, as there is a complete replacement implementation in the -vrq branch, but I'm still waiting for feedback to decide whether to remove both calls or maybe just one of them. So there will be an update once it's finalized. Also, with the upstream asm code clean-up, some X86 cpumask apis are no longer supported, but the good news is we are using more generic ones; this happens in

2ad40cf

bfs: Full cpumask based and LLC sensitive cpu selection, v3

At this point, some upstream patches, like BFQ, are not yet updated, so the official -gc branch is not created yet, but there is a linux-4.2.y-bfs branch on bitbucket and github, which contains all bfs related commits in the -gc branch for kernel 4.2. And there is an all-in-one patch file you can easily apply upon the vanilla kernel tree.

Have fun with 4.2 and report back if there is any issue related to this port of bfs.

BR Alfred

Edit:
Missed one api change for SMT; here is the update patch.

Thursday, August 27, 2015

The BFS unplugged io issue

We traced the unplugged_io issue these two weeks; most of the discussion is in the replies of a-big-commit-added-to-41-vrq

At first I thought that

"I guess the sched_submit_work() doesn't work for bfs b/c bfs use grq_lock instead task_lock() in mainline which a combine of task's pi_lock and rq->lock, the checking of tsk_is_pi_blocked(tsk) is not enough for BFS."

After investigation, it turns out that tsk_is_pi_blocked() was introduced in v3.3
3c7d518 sched/rt: Do not submit new work when PI-blocked
And it doesn't indicate that tsk->pi_lock is held, as I used to think it did.

So the question is back again: when sched_submit_work() was introduced in mainline 3.1, it moved the blk_schedule_flush_plug(tsk) call out of schedule(), but relaxed the checks for when not to call it. This code change is ok for mainline CFS, but somehow it's not for BFS.
Adding back those checks is the current solution. The last patch for this issue is unchanged. I'll update the -gc and -vrq branches soon to include it.

BR Alfred

Monday, August 24, 2015

4.1 -gc -vrq sanity test result and look forward

Since there was a toolchain upgrade in my distribution, the system is now using the new gcc 4.9.x etc., which runs a little slower than 4.8.x, so I had to overclock the test-bed system to get an acceptable run time for the sanity test. The result is as expected: compared to previous test results, no regression is introduced in this release. The results are attached at the end of this post.

And the 4.2 official release is delayed one week, which gives me a chance to list the todo items for the next cycle. Here they are
1. Sync up with mainline 4.2; previewing the code changes during 4.2, there are many changes in the scheduler code, over 1200 lines of diff.
2. Start work on a new commit which automatically adjusts the cpu cache size factor of task caching; currently it's hard-coded and optimized for my test-bed machine.
3. Fix known bugs, add comments and try to finalize some of the commits in VRQ.
4. Test and tune SMT.
5. Introduce another benchmark test.

Seems that there are enough things to keep me busy for weeks, :)

BR Alfred

4.1 CFS
>>>>>spining up
>>>>>50% workload
>>>>>round 1
real    4m40.652s
user    8m39.005s
sys     0m35.902s
>>>>>round 2
real    4m40.688s
user    8m39.100s
sys     0m35.892s
>>>>>round 3
real    4m40.879s
user    8m39.041s
sys     0m35.881s
>>>>>100% workload
>>>>>round 1
real    2m30.750s
user    8m56.625s
sys     0m38.958s
>>>>>round 2
real    2m32.314s
user    9m2.696s
sys     0m39.169s
>>>>>round 3
real    2m32.873s
user    9m5.219s
sys     0m39.235s
>>>>>150% workload
>>>>>round 1
real    2m35.384s
user    9m13.719s
sys     0m40.464s
>>>>>round 2
real    2m34.874s
user    9m11.656s
sys     0m40.704s
>>>>>round 3
real    2m34.973s
user    9m10.739s
sys     0m40.397s
>>>>>200% workload
>>>>>round 1
real    2m36.812s
user    9m17.614s
sys     0m40.828s
>>>>>round 2
real    2m36.634s
user    9m18.383s
sys     0m40.701s
>>>>>round 3
real    2m36.992s
user    9m19.108s
sys     0m40.819s
>>>>>250% workload
>>>>>round 1
real    2m37.632s
user    9m21.271s
sys     0m41.163s
>>>>>round 2
real    2m38.446s
user    9m24.224s
sys     0m41.022s
>>>>>round 3
real    2m38.602s
user    9m24.575s
sys     0m41.436s
>>>>>300% workload
>>>>>round 1
real    2m39.867s
user    9m29.286s
sys     0m41.574s
>>>>>round 2
real    2m40.615s
user    9m29.444s
sys     0m41.578s
>>>>>round 3
real    2m40.111s
user    9m29.686s
sys     0m41.852s

4.1 BFS
>>>>>50% workload
>>>>>round 1
real    4m45.965s
user    8m53.304s
sys     0m32.862s
>>>>>round 2
real    4m45.964s
user    8m53.812s
sys     0m32.378s
>>>>>round 3
real    4m45.919s
user    8m53.194s
sys     0m32.927s
>>>>>100% workload
>>>>>round 1
real    2m30.846s
user    9m1.581s
sys     0m33.857s
>>>>>round 2
real    2m31.267s
user    9m2.822s
sys     0m34.096s
>>>>>round 3
real    2m31.666s
user    9m4.665s
sys     0m33.841s
>>>>>150% workload
>>>>>round 1
real    2m34.415s
user    9m16.511s
sys     0m34.483s
>>>>>round 2
real    2m34.530s
user    9m16.214s
sys     0m35.030s
>>>>>round 3
real    2m34.578s
user    9m17.104s
sys     0m34.456s
>>>>>200% workload
>>>>>round 1
real    2m35.951s
user    9m22.398s
sys     0m34.514s
>>>>>round 2
real    2m37.026s
user    9m22.704s
sys     0m34.639s
>>>>>round 3
real    2m36.158s
user    9m22.571s
sys     0m35.061s
>>>>>250% workload
>>>>>round 1
real    2m37.269s
user    9m25.792s
sys     0m35.212s
>>>>>round 2
real    2m37.058s
user    9m25.937s
sys     0m34.739s
>>>>>round 3
real    2m37.132s
user    9m25.538s
sys     0m35.453s
>>>>>300% workload
>>>>>round 1
real    2m37.935s
user    9m24.762s
sys     0m35.681s
>>>>>round 2
real    2m37.039s
user    9m25.452s
sys     0m35.822s
>>>>>round 3
real    2m38.103s
user    9m26.001s
sys     0m35.129s

4.1 GC
>>>>>50% workload
>>>>>round 1
real    4m43.899s
user    8m50.524s
sys     0m32.508s
>>>>>round 2
real    4m43.831s
user    8m50.031s
sys     0m32.868s
>>>>>round 3
real    4m43.810s
user    8m49.999s
sys     0m32.926s
>>>>>100% workload
>>>>>round 1
real    2m30.824s
user    9m1.669s
sys     0m34.747s
>>>>>round 2
real    2m31.382s
user    9m4.495s
sys     0m34.260s
>>>>>round 3
real    2m31.539s
user    9m5.008s
sys     0m34.470s
>>>>>150% workload
>>>>>round 1
real    2m35.457s
user    9m18.970s
sys     0m34.946s
>>>>>round 2
real    2m34.628s
user    9m18.050s
sys     0m34.884s
>>>>>round 3
real    2m34.648s
user    9m18.807s
sys     0m34.446s
>>>>>200% workload
>>>>>round 1
real    2m36.268s
user    9m23.971s
sys     0m35.149s
>>>>>round 2
real    2m36.410s
user    9m24.660s
sys     0m35.172s
>>>>>round 3
real    2m36.670s
user    9m25.137s
sys     0m35.346s
>>>>>250% workload
>>>>>round 1
real    2m37.606s
user    9m29.152s
sys     0m36.025s
>>>>>round 2
real    2m38.546s
user    9m27.398s
sys     0m35.950s
>>>>>round 3
real    2m38.509s
user    9m28.057s
sys     0m35.655s
>>>>>300% workload
>>>>>round 1
real    2m37.824s
user    9m28.526s
sys     0m36.302s
>>>>>round 2
real    2m37.473s
user    9m28.433s
sys     0m35.741s
>>>>>round 3
real    2m37.049s
user    9m27.219s
sys     0m35.622s

4.1 VRQ
>>>>>50% workload
>>>>>round 1
real    4m43.533s
user    8m49.706s
sys     0m32.653s
>>>>>round 2
real    4m43.630s
user    8m49.385s
sys     0m32.904s
>>>>>round 3
real    4m43.468s
user    8m49.845s
sys     0m32.537s
>>>>>100% workload
>>>>>round 1
real    2m30.467s
user    9m1.640s
sys     0m34.555s
>>>>>round 2
real    2m30.812s
user    9m1.790s
sys     0m34.305s
>>>>>round 3
real    2m30.675s
user    9m2.192s
sys     0m34.027s
>>>>>150% workload
>>>>>round 1
real    2m33.289s
user    9m12.513s
sys     0m34.640s
>>>>>round 2
real    2m33.166s
user    9m12.042s
sys     0m34.795s
>>>>>round 3
real    2m33.135s
user    9m12.005s
sys     0m35.120s
>>>>>200% workload
>>>>>round 1
real    2m36.200s
user    9m19.313s
sys     0m35.160s
>>>>>round 2
real    2m35.053s
user    9m18.936s
sys     0m35.322s
>>>>>round 3
real    2m34.917s
user    9m19.771s
sys     0m34.833s
>>>>>250% workload
>>>>>round 1
real    2m37.391s
user    9m23.886s
sys     0m35.097s
>>>>>round 2
real    2m35.889s
user    9m23.426s
sys     0m35.680s
>>>>>round 3
real    2m36.198s
user    9m23.343s
sys     0m35.443s
>>>>>300% workload
>>>>>round 1
real    2m36.724s
user    9m26.019s
sys     0m35.194s
>>>>>round 2
real    2m36.576s
user    9m25.513s
sys     0m35.794s
>>>>>round 3
real    2m36.759s
user    9m25.738s
sys     0m35.238s
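For comparing the runs above, the three rounds per workload can be averaged with a short script. The helper below only targets the log layout shown in this post; the function name is made up, and the sample is the 300% workload block of the 4.1 CFS run copied from above.

```python
import re
from collections import defaultdict
from statistics import mean

WORKLOAD_RE = re.compile(r"^>+(\d+)% workload")
REAL_RE = re.compile(r"^real\s+(\d+)m([\d.]+)s")


def mean_real_seconds(log: str) -> dict:
    """Map each workload percentage to the mean 'real' time in seconds."""
    times = defaultdict(list)
    workload = None
    for line in log.splitlines():
        line = line.strip()
        m = WORKLOAD_RE.match(line)
        if m:
            workload = int(m.group(1))
            continue
        m = REAL_RE.match(line)
        if m and workload is not None:
            times[workload].append(int(m.group(1)) * 60 + float(m.group(2)))
    return {w: mean(v) for w, v in times.items()}


CFS_300 = """\
>>>>>spining up
>>>>>300% workload
>>>>>round 1
real    2m39.867s
user    9m29.286s
sys     0m41.574s
>>>>>round 2
real    2m40.615s
user    9m29.444s
sys     0m41.578s
>>>>>round 3
real    2m40.111s
user    9m29.686s
sys     0m41.852s
"""

for workload, seconds in mean_real_seconds(CFS_300).items():
    print(f"{workload}% workload: {seconds:.3f}s")
```

Running the full block of each scheduler through the same function gives one number per workload level, which is easier to compare than eighteen separate rounds.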

Sunday, August 16, 2015

4.1 VRQ branch rework finished

Here are the new commits added to vrq branch(in reverse order)

2a8eea0 bfs: vrq: grq.lock free schedule for deactivate code path
8e1ae7c bfs: vrq: grq.lock free context switch for prev==idle path
34c262f bfs: vrq: refine task_preemptable_rq().
22ce18c bfs: vrq: [3/3] preempt task solution, v1.2
79265ca bfs: vrq: [2/3] introduce xxxx_choose_task() in __schedule().
fc44466 bfs: vrq: [1/3] RQ on_cpu states v1.1
be4207e bfs: vrq: refine rq->prq/w_prq as rq->try_preempt_tsk
f4aeee0 bfs: vrq: remove unused unsticky_task.
9c53147 bfs: vrq: Fix vrq solution 0.5 UP compile issue

Both bitbucket and github are updated! The most important objective of this release is stability. I got a new HW platform which exposed stability issues that couldn't be found on the old platforms, and I believe the major ones have been fixed.

There are still three key features on the vrq branch, as mentioned in vrq-04-update-for-linux-40y. But the cache count solution has advanced a little bit. Now the responsible commit is
ed20056 bfs: vrq: [2/2] scost for task caching v0.7

which is a replacement for the sticky_task design in original bfs. I'll start another topic for it.

Now all commits are set for the vrq 4.1 branch. Benchmarks will be run this week, since there are many toolchain upgrades in my distribution in this release. Looking forward, 4.2 will be out next week, and hopefully there will be less sync-up work so I can spend more time on new commits.

BR Alfred

Monday, August 10, 2015

A big commit added to 4.1 VRQ

As the title says, this big commit is 117d783 bfs: VRQ solution v0.5

I think most of the instability in the previous vrq release was caused by this, and I believe most known issues (on my machines) have been fixed. It has been running stably for two weeks, so you are encouraged to give it a try.

Known issue:
BUG: using smp_processor_id() in preemptible code, call trace from sys_sched_yield().

There are still a few commits left that I haven't reworked yet. I plan to finish them in two weeks, before the new kernel release and another sync-up cycle begins.

BR Alfred

Friday, August 7, 2015

4.1 vrq branch update -- reworking

The 4.1 vrq branch is updated, but no new commit is added: as there is a new sync to pick up bfs0463 and kernel v4.1.4, new commits have to be postponed to next week.

A fix has been added to the last commit to fix the compile error on UP config.

BR Alfred

Wednesday, August 5, 2015

gc-branch update with CK's BFS 0463

CK finally releases BFS 0463 against kernel 4.1 this week, so here comes the gc branch updates.

What's new:
1. Based on BFS 0463 and kernel v4.1.4
2. Fix/Sync against BFS 0463

3b14908 bfs: [Sync] 4.1 schedule_user().
9f9dc34 bfs: [Fix] 0463 remove unused register_task_migration_notifier().

3. New Sync patches which pick up sync changes from previous mainline changes (most from 3.17 to 3.18 and some fixes upon previous patch)

0145370 bfs: [Sync] TIF_POLLING_NRFLAG for wake_up_if_idle() and resched_curr().
775e28a bfs: [Sync] sched_init_numa().
c6c5894 bfs: [Sync] task_sched_runtime().
4a48abf bfs: [Sync] sched_setscheduler() logic, v3

4. Give a meaningful version name to this patch set, "v4.1_0463_1"

dc4fa45 bfs: -gc BFS enchancement patch set version.

Code has been force-pushed to bitbucket and github. For those who just want an easier way to apply the patches, here is the all-in-one patch which includes all BFS related commits in my gc-branch: bfs_enhancement_v4.1_0463_1.patch

If you are using the gc-branch, I highly suggest you upgrade to this gc release. An updated -vrq branch will be coming soon; no new commits are planned (they have to be delayed to next week because of the sync-up work this week), but there will be some bug fixes for the existing ones.

BR Alfred Chen

Update:
Add one more commit to fix RCU stall issue.

4623b19

bfs: v4.1_0463_1 rcu stall fix.

Saturday, July 25, 2015

UP booting issue in BFS

For those who have the UP boot issue with BFS, I have found that this issue was introduced between 0458 and 0460, during kernel 3.17 to 3.18. It's caused by the cpu becoming idle and entering C-states.

It's not yet known why BFS UP has this issue while the mainline scheduler doesn't; maybe there is some piece of code BFS missed during the 3.17 to 3.18 migration. We still need to look into it.

Here is a workaround if you have this kind of UP booting issue. Add "idle=halt" to the kernel cmdline, which will disable cpu C-states when the cpu is idle. Plz try this and see if it works for you.

BR Alfred

Monday, July 20, 2015

4.1 vrq -- reworking

It takes a lot for me to work out vrq on the 4.1 branch. It's because the -vrq branch used to work well on three core2 based machines, but now it is not quite happy with the chromebook pixel. I'd say that the new platform exposes the issues. It's not so bad: reproducing an issue is 50% of the work done to solve it, :)

In order to get more logs when issues pop up, I have enabled some kernel hacking configs. And in order to see the crash kernel message even when the kernel hangs at an early stage, I enabled earlyprintk and vt console output.

For the vrq commits, I am reworking them by moving the potentially less troublesome ones ahead and testing whether any stability problem exists. Now I am on a stability issue when udev starts up, which has a one-in-ten or lower chance of being reproduced. I met it once, but unluckily I was too slow to pick up the cell phone to get a screenshot, and the incoming kernel messages scrolled it away.

I will update the vrq branch after a commit passes my stability test, and you can check if it works on your hw to help with the development.

First of all, here are the kernel hacking configs I enabled for testing

CONFIG_SCHED_DEBUG
CONFIG_SCHEDSTATS
CONFIG_SCHED_STACK_END_CHECK
CONFIG_TIMER_STATS
CONFIG_PROVE_LOCKING
CONFIG_LOCK_STAT
CONFIG_DEBUG_LOCKDEP

CONFIG_VT_CONSOLE

The kernel option/parameter is "earlyprintk=vga,keep".
Also add the line below to /etc/sysctl.conf:
kernel.printk = 7 7 1 7
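
To double check that a kernel really was built with the options above, something like the following Python sketch could be used. It only parses the text of a kernel .config file. The helper name `missing_options` and the sample text are made up for illustration, and where your config lives (for example /proc/config.gz or the build tree) depends on your setup.

```python
# Options listed in this post; CONFIG_VT_CONSOLE is the one for vt console output.
WANTED = [
    "CONFIG_SCHED_DEBUG",
    "CONFIG_SCHEDSTATS",
    "CONFIG_SCHED_STACK_END_CHECK",
    "CONFIG_TIMER_STATS",
    "CONFIG_PROVE_LOCKING",
    "CONFIG_LOCK_STAT",
    "CONFIG_DEBUG_LOCKDEP",
    "CONFIG_VT_CONSOLE",
]


def missing_options(config_text, wanted=WANTED):
    """Return the wanted options that are not set to 'y' in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options show up as "# CONFIG_FOO is not set"; skip those.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return [opt for opt in wanted if opt not in enabled]


if __name__ == "__main__":
    # Made-up sample, not a real config.
    sample = (
        "CONFIG_SCHED_DEBUG=y\n"
        "# CONFIG_SCHEDSTATS is not set\n"
        "CONFIG_VT_CONSOLE=y\n"
    )
    print(missing_options(sample))
```

An empty list means every option from the list is enabled.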

The current -gc branch should work fine with the above setup. Please give it a try, and I'll push the -vrq branch soon.

BR Alfred

Edit:
The -vrq branch for 4.1 has been pushed to bitbucket and github. It now has four more vrq commits added upon the -gc branch and should be considered as stable as the -gc branch itself. If you have any issue, please enable the above debug methods and report back.

Sunday, June 28, 2015

Time to have fun with kernel 4.1

Just pushed my BFS0462 port and the -gc bfs enhancement patches for kernel 4.1. There are no new features in the -gc branch, only some bug fixes and sync-up changes with the mainline kernel.

Nothing remarkable; everything should be described in the commits, if I remember correctly.

Please check it out from bitbucket or github.

PS: I recently got a google chromebook pixel (2013), so I can do some testing with SMT once I have set up the system on it.

BR Alfred Chen

Edit: We found that UP has been broken in BFS since kernel 3.18. Investigation is going on, but I have given it lower priority than the kernel 4.1 -vrq branch release.

Edit(Jul 17): Updated the -gc branch to rebase on kernel v4.1.2 and fix compile errors when some kernel hacking configs are enabled.

9654667 bfs: [Fix] Fix undeclared sched_domains_mutex.

b6e4eaf bfs: [Fix] Fix wrong rcu_dereference_check() usage.

I have done a force update on the linux-4.1.y-gc branch, so if you have fetched it before, please delete the remote branch in your git tree and fetch it again.

Sunday, May 10, 2015

About hotplug affinity enhancement

This enhancement comes from investigating an issue reported by Brandon BerHent, who back-ported the -gc branch to 3.10 for the android system and builds a customized kernel for the nexus 6. It's a very cool thing and I have to say "Hello Moto", which brings back the memory of my first cell-phone.

The android system, unlike the pc platform, seems to use the cpu hotplug mechanism a lot for power saving. When I looked at the cpu hotplug code, I noticed the behavior below.

p5qe ~ # schedtool -a 0x02 1388
p5qe ~ # schedtool 1388
PID 1388: PRIO   0, POLICY N: SCHED_NORMAL , NICE   0, AFFINITY 0x2
p5qe ~ # cat /sys/devices/system/cpu/cpu1/online
1
p5qe ~ # echo 0 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1388
PID 1388: PRIO   0, POLICY N: SCHED_NORMAL , NICE   0, AFFINITY 0x1
p5qe ~ # cat /sys/devices/system/cpu/cpu1/online
0
p5qe ~ # echo 1 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1388
PID 1388: PRIO   0, POLICY N: SCHED_NORMAL , NICE   0, AFFINITY 0x3

As you can see, after cpu 1 goes offline and then online again, the task's affinity has changed from 0x2 to 0x3. The new mask includes the newly online cpu 1, but it is no longer the set of cpus the task was originally meant to run on. And the most interesting thing is that this is not only the behavior of BFS, it's the same for mainline CFS.
Normally, on the pc platform, it's not a big problem, as there are not many cpu hotplug events apart from suspend/resume. But if just a small enhancement can maintain the task's originally intended affinity, why not? Below is the behavior with the enhancement.

p5qe ~ # schedtool 1375
PID 1375: PRIO   0, POLICY N: SCHED_NORMAL , NICE   0, AFFINITY 0xf
p5qe ~ # schedtool -a 0x2 1375
p5qe ~ # schedtool 1375
PID 1375: PRIO   0, POLICY N: SCHED_NORMAL , NICE   0, AFFINITY 0x2
p5qe ~ # echo 0 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1375
PID 1375: PRIO   0, POLICY N: SCHED_NORMAL , NICE   0, AFFINITY 0x1
p5qe ~ # dmesg | tail
[    9.771522] zram3: detected capacity change from 0 to 268435456
[    9.783513] Adding 262140k swap on /dev/zram0. Priority:10 extents:1 across:262140k SSFS
[    9.785789] Adding 262140k swap on /dev/zram1. Priority:10 extents:1 across:262140k SSFS
[    9.788066] Adding 262140k swap on /dev/zram2. Priority:10 extents:1 across:262140k SSFS
[    9.790311] Adding 262140k swap on /dev/zram3. Priority:10 extents:1 across:262140k SSFS
[   12.103469] sky2 0000:02:00.0 eth1: Link is up at 1000 Mbps, full duplex, flow control both
[   25.360122] random: nonblocking pool is initialized
[ 105.757001] Renew affinity for 198 processes to cpu 1
[ 105.757001] kvm: disabling virtualization on CPU1
[ 105.757140] smpboot: CPU 1 is now offline
p5qe ~ # echo 1 > /sys/devices/system/cpu/cpu1/online
p5qe ~ # schedtool 1375
PID 1375: PRIO   0, POLICY N: SCHED_NORMAL , NICE   0, AFFINITY 0x2
p5qe ~ # dmesg | tail
[    9.790311] Adding 262140k swap on /dev/zram3. Priority:10 extents:1 across:262140k SSFS
[   12.103469] sky2 0000:02:00.0 eth1: Link is up at 1000 Mbps, full duplex, flow control both
[   25.360122] random: nonblocking pool is initialized
[ 105.757001] Renew affinity for 198 processes to cpu 1
[ 105.757001] kvm: disabling virtualization on CPU1
[ 105.757140] smpboot: CPU 1 is now offline
[ 137.348718] x86: Booting SMP configuration:
[ 137.348722] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 137.359727] kvm: enabling virtualization on CPU1
[ 137.363338] Renew affinity for 203 processes to cpu 1

This enhancement changes the default behavior of the kernel/system. I have tested it for a while with different use cases and all looks good, so I mark this change as version 1. If you have any comments/concerns, please let me know and I'll look into them.
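
The difference between the two logs can be replayed with a small toy model. This is only a sketch in Python that reproduces the schedtool output shown above, not the kernel code: the `Task` class, the `intended` and `rebound` fields and the `keep_intent` switch are hypothetical names, and the fallback to cpu 0 is simply what the logs show (it assumes cpu 0 stays online).

```python
class Task:
    """Toy task with a cpu affinity bitmask (bit n = cpu n)."""

    def __init__(self, affinity):
        self.affinity = affinity   # mask currently in effect
        self.intended = affinity   # hypothetical: the mask the user asked for
        self.rebound = False       # True once hotplug had to move the task


def cpu_down(task, cpu):
    """cpu goes offline: a task bound only to that cpu falls back to cpu 0."""
    remaining = task.affinity & ~(1 << cpu)
    if remaining == 0:
        task.affinity = 0x1        # fallback seen in the logs above
        task.rebound = True


def cpu_up(task, cpu, keep_intent):
    """cpu comes back online."""
    if not task.rebound:
        return
    if keep_intent:
        task.affinity = task.intended   # enhancement: restore the request
    else:
        task.affinity |= 1 << cpu       # default: just add the cpu back
    task.rebound = False


def replay(keep_intent):
    """Pin a task to cpu 1, take cpu 1 offline, then bring it back."""
    task = Task(0x2)
    cpu_down(task, 1)
    after_down = task.affinity
    cpu_up(task, 1, keep_intent)
    return after_down, task.affinity


if __name__ == "__main__":
    print([hex(m) for m in replay(keep_intent=False)])  # default behavior
    print([hex(m) for m in replay(keep_intent=True)])   # with the enhancement
```

In this model the default path ends at 0x3 and the enhanced path ends at 0x2, matching the two logs.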

Here is the commit of this enhancement.

BR Alfred

Edit: Just pushed a minor fix for the case when CONFIG_HOTPLUG_CPU is not enabled.

v4.0.2-gc updates

Here comes the sync update for the 4.0 gc branch:

1. Add ck's sync patch upon 0462 for v4.0.2
2. Fix return type error for SMT_NICE, thanks to pf for pointing it out.
3. bfs: hotplug affinity enhancement, v1
This is a bit of a long story and basically it is not for pc, so I'll open another topic for it later.

I have also enabled the github repo, so the code is on both GitHub and BitBucket. Feel free to pick whichever you like. Have fun!

Monday, May 4, 2015

Sanity Test for -gc&-vrq branch for linux 4.0

Here are the sanity test results of BFS, the -gc branch and the -vrq branch. No regression was found on the -gc branch, which is still doing better than the original BFS at 50% and 100% workload.

For the vrq branch, there is no huge improvement over the gc branch. Performance at 50% and 300% workload is almost the same, and there is even a small regression at 100% workload. The only good news is an improvement at 150% workload.

The reasons why vrq doesn't perform as well as I expected are:
1. Some additional rq lock sessions were introduced when implementing the new lock strategy.
2. grq lock contention doesn't seem to be a major problem on systems with few cores, at least not on my test hw platform (4 cores).

I wish I had a chance to get at some 30+ core monsters to prove that all the code in vrq is worth it. But before that, I'll continue with the unfinished features in vrq like the cache_count, and see how much performance can be gained from these opened doors.

BFS0462:
>>>>>50% workload
>>>>>round 1
real    5m21.850s
user    9m55.977s
sys     0m41.537s
>>>>>round 2
real    5m21.653s
user    9m55.750s
sys     0m41.411s
>>>>>round 3
real    5m21.973s
user    9m56.570s
sys     0m41.192s
>>>>>100% workload
>>>>>round 1
real    2m52.203s
user    10m8.151s
sys     0m43.575s
>>>>>round 2
real    2m52.050s
user    10m8.423s
sys     0m43.515s
>>>>>round 3
real    2m50.865s
user    10m8.283s
sys     0m43.700s
>>>>>150% workload
>>>>>round 1
real    2m56.355s
user    10m29.334s
sys     0m44.955s
>>>>>round 2
real    2m56.189s
user    10m29.469s
sys     0m44.782s
>>>>>round 3
real    2m56.264s
user    10m29.485s
sys     0m44.845s
>>>>>300% workload
>>>>>round 1
real    3m0.412s
user    10m42.805s
sys     0m46.352s
>>>>>round 2
real    3m1.408s
user    10m42.618s
sys     0m46.341s
>>>>>round 3
real    3m0.287s
user    10m43.304s
sys     0m46.244s

linux-4.0.y-gc:
>>>>>50% workload
>>>>>round 1
real    5m18.823s
user    9m50.911s
sys     0m41.302s
>>>>>round 2
real    5m19.032s
user    9m51.597s
sys     0m40.984s
>>>>>round 3
real    5m18.960s
user    9m51.490s
sys     0m41.046s
>>>>>100% workload
>>>>>round 1
real    2m51.085s
user    10m8.806s
sys     0m43.699s
>>>>>round 2
real    2m50.870s
user    10m8.108s
sys     0m44.142s
>>>>>round 3
real    2m50.839s
user    10m8.290s
sys     0m43.979s
>>>>>150% workload
>>>>>round 1
real    2m56.285s
user    10m30.045s
sys     0m44.629s
>>>>>round 2
real    2m56.286s
user    10m30.054s
sys     0m44.866s
>>>>>round 3
real    2m56.333s
user    10m30.379s
sys     0m44.425s
>>>>>300% workload
>>>>>round 1
real    3m0.427s
user    10m43.455s
sys     0m46.739s
>>>>>round 2
real    3m0.222s
user    10m43.341s
sys     0m46.519s
>>>>>round 3
real    3m0.244s
user    10m43.029s
sys     0m46.608s

linux-4.0.y-vrq:
>>>>>50% workload
>>>>>round 1
real    5m18.905s
user    9m51.214s
sys     0m40.890s
>>>>>round 2
real    5m18.994s
user    9m51.203s
sys     0m41.029s
>>>>>round 3
real    5m18.818s
user    9m51.266s
sys     0m40.819s
>>>>>100% workload
>>>>>round 1
real    2m51.414s
user    10m7.739s
sys     0m43.785s
>>>>>round 2
real    2m51.146s
user    10m7.449s
sys     0m43.848s
>>>>>round 3
real    2m51.103s
user    10m7.721s
sys     0m43.499s
>>>>>150% workload
>>>>>round 1
real    2m54.407s
user    10m21.732s
sys     0m44.407s
>>>>>round 2
real    2m54.436s
user    10m21.212s
sys     0m44.824s
>>>>>round 3
real    2m55.156s
user    10m21.279s
sys     0m44.796s
>>>>>300% workload
>>>>>round 1
real    3m0.549s
user    10m43.723s
sys     0m46.342s
>>>>>round 2
real    3m0.475s
user    10m44.249s
sys     0m45.982s
>>>>>round 3
real    3m0.393s
user    10m44.088s
sys     0m46.114s
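
To compare the runs without reading every line, the "real" times can be averaged per workload. Below is a small Python sketch of that arithmetic for the 150% workload. The helper names `parse_real` and `mean_real` are made up for this example; the times are copied from the results above.

```python
import re


def parse_real(value):
    """Convert a `time` value such as '2m56.355s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError("unexpected time format: %r" % value)
    return int(match.group(1)) * 60 + float(match.group(2))


def mean_real(values):
    """Average a list of `time` values, in seconds."""
    seconds = [parse_real(v) for v in values]
    return sum(seconds) / len(seconds)


# "real" times of the three 150% workload rounds, copied from above.
REAL_150 = {
    "BFS0462": ["2m56.355s", "2m56.189s", "2m56.264s"],
    "linux-4.0.y-gc": ["2m56.285s", "2m56.286s", "2m56.333s"],
    "linux-4.0.y-vrq": ["2m54.407s", "2m54.436s", "2m55.156s"],
}

if __name__ == "__main__":
    for name, rounds in REAL_150.items():
        print("%-16s %.3fs" % (name, mean_real(rounds)))
```

With these numbers the vrq average comes out roughly 1.6 seconds below both BFS0462 and -gc, a bit under 1%.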

Wednesday, April 29, 2015

VRQ 0.4 update for linux-4.0.y

There are three major features in this VRQ branch:

1. VRQ lock strategy update: replace the grq lock strategy with the task_access lock strategy.
    That is:
* lock on rq->lock when task is on cpu
* lock on grq.lock when task is in queue
* otherwise lock on task's pi_lock

    It's the biggest change, and it impacts almost the whole scheduler code. Based on it, some grq lock session improvements were made for task activation and idle task scheduling.

2. preempt task solution
    This is an enhancement for try_to_wake_up(). Instead of putting the woken task in the grq and rescheduling a cpu/rq to pick it up, the woken task now becomes the preempt task of the rq and is picked immediately in the next schedule run. This saves the effort of putting the task into and getting it from the grq, and avoids other cpus/rqs having to access the grq.

3. cache_count solution
    Introduces cache_count for tasks, which indicates that a task is still cache hot while it is waiting in the queue. This replaces the sticky task solution in BFS.
    The current settings, 14 for activated tasks and 4 for deactivated tasks, are both tested values. In future versions, the algorithm will be based on more meaningful factors.
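
The task_access lock rule from item 1 can be pictured as a small lookup on the task's state. The following is only a Python sketch of that rule, not the kernel code: the class names, the `TaskState` enum and the `task_access_lock` helper are hypothetical, and `threading.Lock` merely stands in for the kernel locks.

```python
import threading
from enum import Enum


class TaskState(Enum):
    ON_CPU = 1   # currently running on a cpu
    QUEUED = 2   # waiting in the global run queue
    OTHER = 3    # anything else, e.g. sleeping


class RunQueue:
    def __init__(self):
        self.lock = threading.Lock()     # stands in for rq->lock


class GlobalRunQueue:
    def __init__(self):
        self.lock = threading.Lock()     # stands in for grq.lock


class Task:
    def __init__(self, grq):
        self.pi_lock = threading.Lock()  # stands in for the task's pi_lock
        self.state = TaskState.OTHER
        self.rq = None                   # rq of the cpu the task runs on, if any
        self.grq = grq


def task_access_lock(task):
    """Pick the lock that protects the task in its current state."""
    if task.state is TaskState.ON_CPU:
        return task.rq.lock
    if task.state is TaskState.QUEUED:
        return task.grq.lock
    return task.pi_lock


if __name__ == "__main__":
    grq = GlobalRunQueue()
    task = Task(grq)
    task.state = TaskState.QUEUED
    with task_access_lock(task):
        print("holding grq.lock:", task_access_lock(task) is grq.lock)
```

A real implementation would presumably have to re-check the task's state after taking the lock, since the state can change in between; the sketch leaves that out.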

IMPORTANT NOTES:
1. SMT_NICE code is kept but is not tested, so don't enable it for VRQ yet.
2. yield_to() locking is unchanged and not tested, so kvm may not work.
3. UP is not tested, and VRQ is not designed for UP, so don't try UP on VRQ.
4. Based on user reports, VRQ may not work with some kernel configs, but it's unknown which config option is causing the issue. Further testing is still needed.
5. Try VRQ if you want to help testing. If it runs well, keep using it; if not, fall back to the -gc branch.

Enjoy and have fun.

BR Alfred

Update:
Found an issue while investigating Manuel's config. A quick workaround is to set NR_CPUS to the exact number of cores in your system. A fix will be in the 0.5 release.

Sunday, April 26, 2015

linux-4.0.y-gc updates

All needed patches are set, here are the changes

BFS:
1. Add a [Sync] patch for 0462, that is bfs: [Sync] __schedule() and io_schedule_timeout()
2. bfs: [Sync] sched_setscheduler() logic, v2
*Refine the code in check_task_change()
3. bfs: [3/3] priodl to speed up try_preempt(), V2
*V2: rq's priodl is stored in the grq.rq_priodls array; on 32-bit systems, grq.priodl_lock needs to be held for r/w.
4. bfs: Full cpumask based and LLC sensitive cpu selection, V2
* V2: Fix issue when the SMP config is enabled, but none of the SMT or MC configs is enabled.
* Add enhancement code for cpu hotplug, for the case where the cpu is idle when it goes offline, which would leave its bit still set in grq.cpu_idle_map.

Others:
Removed the patch for the IBM thinkpad KB, as upstream has fixed the issue.

Link location

Have fun with linux 4.0

BR Alfred

Wednesday, April 15, 2015

BFS0461 linux 4.0 syncup patch

Linux 4.0 was released this week. Some expected scheduler fixes are in it, and here is my sync-up patch for BFS0461. For my -gc branch, I am still waiting for the new bfq release and porting the bfs related patches from 3.19 to 4.0.

Please note that extracting __schedule() and sched_submit_work() from schedule() used to cause some issues when ck released similar changes for BFS in the 3.18 release, and it was then reverted in 0460 for 3.19. Hopefully these issues are fixed by the upstream kernel changes.

https://bitbucket.org/alfredchen/linux-gc/downloads/bfs0461_linux4.0_syncup.patch

Monday, April 13, 2015

BFS fun for Raspberry Pi 2

Recently I bought a Raspberry Pi 2, as it has 4 cores rather than the single one of the first generation, and double the ram size. I would like to check whether it can be a thin-client replacement for me at this moment. And for sure, the Pi is very extendable and a fun toy to play with; I already have several ideas for things I want to build from it.

After trying NOOBS to make sure the hardware works, I installed funtoo onto the Pi. The system runs fine, X works with the FB driver, and FF 31 can be compiled and works, but it runs slower than I expected. I may try the Xorg rpi driver to see if it behaves better. But before that, I tried to customize the pi kernel for my needs and, of course, get BFS running on it.

The Pi now uses a 3.18.x kernel, and the default pi kernel config is a little bit "fat". I am not going to customize the kernel config heavily right now, as it's my first time touching this arch. I changed two kernel configs to what I need, rebuilt the kernel and it works on the Pi. BFS 0460 applies cleanly and builds, and the kernel has been running for 8 hours now. So far, BFS on the Pi 2 goes smoothly. Next I am going to apply the -gc branch onto it, to be continued...

Thursday, April 2, 2015

Pre-release of VRQ 0.4

Here comes the vrq patch upon the latest -gc branch.

https://bitbucket.org/alfredchen/linux-gc/downloads/vrq_upon_3.19-gc.patch

As this is a pre-release, I don't want to add details here. I just want to know whether it works for you. Don't expect an improvement compared to the -gc branch: based on my sanity tests, it can just match the -gc branch at this moment.

I will continue to do some code cleanup and tuning before the formal release.

Monday, March 23, 2015

3.19.y-gc branch updates

Yes, the -gc branch should have been updated earlier, but I personally wanted to wait for the v3.19.2 stable version to be out and then rebase on it, so it is late.
Also, the -new branch that once appeared in git, which I used to sync my git tree between desktop and notebook, caused some trouble for friends who tried it. This reminds me to publish code carefully.

There are no new commits for the -gc branch, just minor bug/comment fixes. I have rewritten some commit titles and tagged them with [Fix]/[Sync], so it is easier to tell what each commit is for.

And I have done one *make no more friends* thing: I have deleted the original linux-3.19.y-gc branch and created a new one with the same name, because I don't want to keep the old merge commit info in the branch. If anyone knows a cleaner way to do this, please let me know.
So you may have trouble pulling this new linux-3.19.y-gc branch, because your local copy is different and no longer exists on the "server" side. Simply delete the old remote linux-3.19.y-gc branch in your git tree and pull again, then you will be fine.

Branch url:
https://bitbucket.org/alfredchen/linux-gc/commits/branch/linux-3.19.y-gc

Monday, February 9, 2015

VRQ: About the issue found in 0.3

Issue
Thanks to Manuel for testing -vrq and reporting an issue: while a background (SCHED_BATCH, nice 19) workload is running, one of the cpus fails to pick up normal/system tasks.
The detail of the issue can be found from http://cchalpha.blogspot.com/2014/12/vrq-03-updates.html

Cause
After asking Manuel about his usage for a few rounds, I was finally able to reproduce the issue, and used bisect to find out that the commit "bfs: vrq: RQ niffy solution." introduced it.
The root cause is that the niffy values of the cpus drift far apart as the system keeps running; I recorded a difference of 6+ seconds after 16+ hours of uptime on my system. As a result, tasks that run on the cpu with the lower niffy get earlier deadlines than the others, so that cpu fails to pick up other normal/system load.
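
To get a feeling for why a 6 second difference is fatal, here is a rough Python sketch of the deadline arithmetic. It is written from memory of how BFS computes virtual deadlines and rests on assumptions, not on the vrq source: the 6 ms rr_interval, the prio ratio table growing by about 10% per nice level from a base of 128, and user_prio = nice + 20. All function names are made up for the example.

```python
NS_PER_MS = 1_000_000
NS_PER_SEC = 1_000_000_000
RR_INTERVAL_MS = 6   # assumed timeslice length


def make_prio_ratios(levels=40):
    """Assumed ratio table: base 128, about 10% more per nice level."""
    ratios = [128]
    for _ in range(levels - 1):
        ratios.append(ratios[-1] * 11 // 10)
    return ratios


PRIO_RATIOS = make_prio_ratios()


def deadline(niffies_ns, nice):
    """Virtual deadline of a task queued at clock value niffies_ns."""
    user_prio = nice + 20
    offset = PRIO_RATIOS[user_prio] * RR_INTERVAL_MS * (NS_PER_MS // 128)
    return niffies_ns + offset


def first_to_run(tasks):
    """Earliest deadline wins; tasks is a list of (name, deadline) pairs."""
    return min(tasks, key=lambda t: t[1])[0]


def pick(drift_ns, uptime_ns=16 * 3600 * NS_PER_SEC):
    """A nice 19 task stamped by a lagging cpu vs a nice 0 task elsewhere."""
    batch = ("batch nice 19", deadline(uptime_ns - drift_ns, 19))
    normal = ("normal nice 0", deadline(uptime_ns, 0))
    return first_to_run([batch, normal])


if __name__ == "__main__":
    print("no drift :", pick(0))
    print("6s drift :", pick(6 * NS_PER_SEC))
```

Under these assumptions even the largest deadline offset (nice 19) is well under one second, so once the clocks are 6 seconds apart the background task stamped by the lagging cpu always looks more urgent than a nice 0 task.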

Solution
Simply reverting the commit would fix the issue, but that goes against the intention of the commit "bfs: vrq: RQ niffy solution.", which is to make update_clocks() grq lock free.
After testing, I found that rq->clock, which is based on sched_clock_cpu(), is stable on my system, and the differences among cpus are small enough for deadline calculation. So I am giving rq->clock a try and removing the sched_clock sanity checks. I also noticed that CK added the sanity check because of a "crazy sched_clock interface", so there may be some unexpected behavior on some hardware, especially old machines. If such unexpected behavior turns out to be common, I will add some kind of sanity check back.

The new solution will be posted with the incoming 3.19 -vrq branch. As this bug was found on -vrq, I would like the -vrq branch to stay on its own for one more release before merging it into the -gc branch.