Monday, December 28, 2015

VRQ v4.3_0466_4 test patch

When BFS 0466 came out, I rebased the -gc and -vrq branches upon it and ran some sanity tests. In short, 0466 improves throughput when the workload is >=100%, for both the bare bfs code and the -gc branch: at 300% workload the -gc branch takes 2m34.6s on 0466 compared to the original 2m36.7s. It's good to see CK improving bfs, as he has not added new features to it for a long time. Though it's now known that bfs 0466 causes interactivity regressions, and CK released bfs 0467 to address the issue via a manually set scheduler option.

While continuing work on the -vrq branch, I found a regression in the performance sanity tests caused by the bfs 0466 code changes. Bisecting pointed to the commit "bfs/vrq: [3/3] preempt task solution" as the one contributing to the regression. The original design of the task_preemptable_rq() function is to go through all cpus, find the running task with the highest priority/deadline, and set its rq as the target to be preempted; this way only the highest priority/deadline task gets preempted. But going through *all* cpus to find the best candidate turns out to be an overhead, which causes the regression. So it has been changed to select the first cpu running a task with a higher priority/deadline than the given task as the cpu/rq to be preempted. With this code change, the best sanity benchmark result so far was recorded for the -vrq branch.
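To make the change concrete, here is a rough sketch of the new selection logic. It is not the actual -vrq code: task_runs_before() stands in for the real priority/deadline comparison, and the rq details are simplified.

/*
 * Sketch only: return the first cpu whose running task loses the
 * priority/deadline comparison against p.  The old code kept scanning
 * all cpus to find the single "best" victim before preempting it.
 */
static int task_preemptable_rq(struct task_struct *p, const struct cpumask *allowed)
{
        int cpu;

        for_each_cpu(cpu, allowed) {
                struct rq *rq = cpu_rq(cpu);

                if (task_runs_before(p, rq->curr))
                        return cpu;     /* preempt the first suitable cpu */
        }

        return -1;      /* nothing to preempt */
}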

After removing the performance bottleneck, it's time to handle the interactivity issue. In the original bfs, sticky tasks are (a) not allowed to run on a cpu which is frequency scaling, and (b) made cpu affinitive by adjusting their deadline. Looking back at the bfs 0466 code changes, not only sticky tasks but *ALL* tasks are made cpu affinitive. This improves performance but impacts interactivity at the same time. When CK released bfs 0467 to address the interactivity issue, it introduced a run-time option to switch the behavior. Considering that in -vrq the sticky task has been replaced by the cached task mechanism, with SCOST and a cached time-out introduced to control when a task should stay cached, I decided to use the existing vrq code to balance performance and interactivity.

First of all, mark all tasks switched out of a cpu as "cached"; previously only the tasks which still need the cpu (those in activate schedule) were marked "cached".
Secondly, mark all newly forked tasks as "cpu affinitive"; based on testing, this also contributes to the performance improvement (a rough sketch of these first two changes follows the third item below).
Thirdly, with the bottleneck removed, the SCOST design can finally be tested properly. It turns out it does not work as expected (a huge threshold doesn't impact performance), at least for my sanity test pattern (all gcc instances share the same binary; only PSS differs among the gcc threads running at the same time). It looks like SCOST may not be a good design, in other words it may be a bad one, because the threshold is tuned under one certain workload pattern and may hurt performance or interactivity under another. The SCOST code still exists in this patch, but it is no longer functional at all and will be removed when I clean up the commits.
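Here is a minimal sketch of what the first two changes amount to; it is not the real -vrq code, and the field names (cached_cpu, cached_at, cpu_affine) are invented for illustration only.

/* Illustration only: a task leaving a cpu remembers that cpu and the
 * time it left, so the cached time-out below can decide how long to
 * prefer it; a newly forked task starts out cpu affinitive. */
static void mark_task_cached(struct task_struct *p, int cpu)
{
        p->cached_cpu = cpu;                    /* prefer this cpu later */
        p->cached_at  = ktime_get_ns();         /* cached time-out starts now */
}

static void mark_fork_cpu_affinity(struct task_struct *p)
{
        p->cpu_affine = true;                   /* all new forks are cpu affinitive */
}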

Now the only control over cached and "cpu affinitive" tasks is the cached time-out, and it is a per-task-policy setting. For example, batch/idle tasks have an unlimited cached time-out, as users don't care about their interactivity; in the implementation, "unlimited" is set to 1 second. For rt tasks, the time-out is the default rr interval (6ms). For normal tasks, which users most likely run, the time-out depends on the preemption model kernel config: when it is CONFIG_PREEMPT_NONE, which means the machine tends to be used as a server and task interactivity doesn't matter, the cached time-out is unlimited; otherwise it is set to the default rr interval (6ms).
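Expressed as code, the policy-to-time-out mapping looks roughly like the sketch below. This is my own illustration using mainline policy names, not the actual patch; only the values (1 second as "unlimited" and the 6ms default rr interval) come from the description above.

/* Illustration of the per-policy cached time-out; the helper itself is
 * invented, only the values follow the text above. */
static u64 cached_timeout_ns(struct task_struct *p)
{
        /* batch/idle tasks: interactivity doesn't matter, "unlimited" = 1s */
        if (p->policy == SCHED_BATCH || p->policy == SCHED_IDLE)
                return NSEC_PER_SEC;

        /* rt tasks: one default rr interval */
        if (rt_task(p))
                return 6 * NSEC_PER_MSEC;

        /* normal tasks: depends on the preemption model */
#ifdef CONFIG_PREEMPT_NONE
        return NSEC_PER_SEC;                    /* server-style, unlimited */
#else
        return 6 * NSEC_PER_MSEC;
#endif
}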

The interactivity test that has been done is normal-policy mpv h264 playback, which shows no frame drops while a normal nice-19 300% compile workload runs in the background.

Batch policy 300% workload compile benchmark:
Previous vrq -- 2m37.840s
Current vrq -- 2m33.831s

Idle policy 300% workload compile benchmark:
Previous vrq -- 2m35.673s
Current vrq -- 2m35.154s

Normal policy 300% workload compile benchmark:
Previous vrq -- 2m37.005s
Current vrq -- 2m36.633s

The results are OK and the new vrq patch is ready for user testing; the all-in-one patch file for kernel 4.3 has been uploaded to the bitbucket download area. Cleaning up the commits before updating the git branches will take more time; I'd like to finish it during the new year holiday.

Happy New Year and have fun with this new -vrq patch; your feedback is welcome.

BR Alfred

Thursday, December 10, 2015

GC and VRQ branch update for v4.3.1 and latency test

The first stable release for 4.3 has finally arrived, and the gc and vrq branches have been updated with the bug fixes from the past few weeks:

*Fix a non-return error when SMT_NICE is enabled (though SMT_NICE is not recommended for VRQ).
*Go through the thread list with tasklist_lock held during cpu hotplug; this applies to both the gc and vrq branches (see the sketch after this list).

*Task caching scheduling Part III; as usual, I will write another post about it.
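For the tasklist_lock fix, the pattern is roughly the following; this is an illustrative sketch rather than the actual gc/vrq hunk, and migrate_task_off_cpu() is a made-up placeholder for the per-task hotplug handling.

/* Sketch: walk every thread with tasklist_lock held so tasks cannot
 * come and go while the hotplug code examines them. */
static void hotplug_scan_threads(int dead_cpu)
{
        struct task_struct *g, *p;

        read_lock(&tasklist_lock);
        for_each_process_thread(g, p) {
                if (cpumask_test_cpu(dead_cpu, tsk_cpus_allowed(p)))
                        migrate_task_off_cpu(p, dead_cpu);      /* placeholder */
        }
        read_unlock(&tasklist_lock);
}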

The gc branch for v4.3.1 can be found at bitbucket and github.
The vrq branch for v4.3.1 can be found at bitbucket and github.

One more thing: I have wanted to add more scheduling tests/benchmarks for a long time, and I finally found one yesterday: cyclictest. You can check the details on this wiki (it's a little old, but it's a good starting point). Based on my research, it is scheduler independent and uses no scheduler statistics.

Here are my first idle-workload cyclictest results for v4.3 CFS, BFS and VRQ (I'm still playing with it).

4.3 CFS
 # /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.05 0.04 0.05 1/219 1504         

T: 0 ( 1499) P:80 I:10000 C:  10000 Min:   1831 Act:    2245 Avg:    2413 Max:   12687
T: 1 ( 1500) P:80 I:10500 C:   9524 Min:   1917 Act:    2965 Avg:    2560 Max:    7547
T: 2 ( 1501) P:80 I:11000 C:   9091 Min:   1702 Act:    2254 Avg:    2313 Max:   10650
T: 3 ( 1502) P:80 I:11500 C:   8696 Min:   1546 Act:    2297 Avg:    2274 Max:   13723

4.3 BFS
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.15 0.10 0.04 1/234 1540         

T: 0 ( 1536) P:80 I:10000 C:  10000 Min:   1437 Act:    2002 Avg:    1893 Max:   10912
T: 1 ( 1537) P:80 I:10500 C:   9524 Min:   1427 Act:    2010 Avg:    1907 Max:    7534
T: 2 ( 1538) P:80 I:11000 C:   9091 Min:   1402 Act:    1755 Avg:    1902 Max:   13059
T: 3 ( 1539) P:80 I:11500 C:   8696 Min:   1408 Act:    1878 Avg:    1866 Max:   12921

4.3 VRQ
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.00 0.01 0.00 0/226 1607         

T: 0 ( 1602) P:80 I:10000 C:  10000 Min:   1349 Act:    1785 Avg:    1647 Max:    4934
T: 1 ( 1603) P:80 I:10500 C:   9524 Min:   1355 Act:    1464 Avg:    1642 Max:   12378
T: 2 ( 1604) P:80 I:11000 C:   9091 Min:   1334 Act:    1926 Avg:    1676 Max:   12544
T: 3 ( 1605) P:80 I:11500 C:   8696 Min:   1350 Act:    1801 Avg:    1627 Max:   10989

Enjoy gc/vrq on v4.3.1 and try cyclictest if you care about latency and task interactivity.

BR Alfred

Edit:
If you hit s2ram/resume failures with this new gc/vrq release, you can try the two patches below (one for gc and one for vrq) and see if they help:
4.3_gc3_fix.patch and 4.3_vrq1_fix.patch