Monday, December 28, 2015

VRQ v4.3_0466_4 test patch

When BFS 0466 came out, I rebased the -gc and -vrq branches upon it and ran some sanity tests. In short, 0466 improves throughput when the workload is >=100%, for both the bare bfs code and the -gc branch: 2m34.6s for 0466 -gc compared to the original 2m36.7s at 300% workload. It's good to see CK improving bfs, as he had stopped adding new features to bfs for a long time. Though it's known by now that bfs 0466 causes interactivity regressions, and CK released bfs 0467 to address this issue via a manually set scheduler option.

While continuing work on the -vrq branch, I found a regression in the performance sanity tests caused by the bfs 0466 code changes. Using bisect, I tracked it down to the commit "bfs/vrq: [3/3] preempt task solution". The original design of the task_preemptable_rq() function was to go through all cpus to find the running task with the highest priority/deadline and set its rq as the target to be preempted; that way, only the least important running task got preempted. But going through *all* cpus to find the best one turned out to be an overhead which caused the regression. So, alternatively, the function now selects the first cpu running a task with a higher priority/deadline than the given task as the cpu/rq to be preempted. With this code change, the best sanity benchmark result so far was recorded for the -vrq branch.
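To illustrate the difference, below is a minimal userspace sketch of the two selection strategies (simplified stand-ins, not the actual -vrq code); it follows the bfs convention that a numerically higher priority/deadline value means a less important, and thus more preemptable, running task:

#include <stdio.h>

#define NR_CPUS 4

/* Old design: scan ALL cpus and pick the one running the most
 * preemptable (highest prio/deadline value) task -- a full O(NR_CPUS)
 * scan on every wakeup. */
static int best_preemptable_cpu(const int rq_prio[], int task_prio)
{
    int cpu, best = -1;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (rq_prio[cpu] > task_prio &&
            (best < 0 || rq_prio[cpu] > rq_prio[best]))
            best = cpu;
    return best;
}

/* New design: return the FIRST cpu whose running task can be
 * preempted -- the scan can stop early, removing the bottleneck. */
static int first_preemptable_cpu(const int rq_prio[], int task_prio)
{
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (rq_prio[cpu] > task_prio)
            return cpu;
    return -1;
}

int main(void)
{
    int rq_prio[NR_CPUS] = { 120, 139, 100, 130 }; /* per-cpu running task prio */
    int waking = 110;                              /* prio of the waking task */

    printf("best : cpu%d\n", best_preemptable_cpu(rq_prio, waking));  /* cpu1 */
    printf("first: cpu%d\n", first_preemptable_cpu(rq_prio, waking)); /* cpu0 */
    return 0;
}

Both versions pick a cpu the waking task may preempt; the new one simply trades target quality for scan cost.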

After removing the performance bottleneck, it was time to handle the interactivity issue. In original bfs, sticky tasks are (a) not allowed to run on a cpu which is scaling, and (b) kept affine to their cpu by adjusting their deadline. Looking back at the bfs 0466 code changes, they make not only sticky tasks cpu affine, but *ALL* tasks. This improves performance but impacts interactivity at the same time. When CK released bfs 0467 to address the interactivity issue, it introduced a run-time option to switch the behaviour. Considering that in -vrq the sticky task mechanism has been replaced by the cached task mechanism, with SCOST and the cached timeout introduced to control when a task should be treated as cached, I decided to use the existing -vrq code to balance task performance and interactivity.

First of all, all tasks switched out of a cpu are now marked "cached"; previously only the tasks which still needed cpu (in activate schedule) were marked "cached".
Secondly, all newly forked tasks are marked "cpu affine"; based on the tests, this also contributes to the performance improvement.
Thirdly, with the bottleneck removed, the SCOST design could truly be tested. It turns out it does not work as expected (a huge threshold doesn't impact performance), at least for my sanity test pattern (all gcc instances share the binary, only the PSS differs among the gcc threads running at the same time). SCOST may not be a good design; in other words, it may be a bad design, because the threshold is tuned under one particular pattern, and for other patterns it may hurt performance or interactivity. The SCOST code still exists in this patch, but it is no longer functional at all, and it will be removed when I clean up the commits.
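To make the caching mechanism concrete, here is a purely illustrative sketch (hypothetical names and fields, not the actual -vrq code): every task switched off a cpu is stamped "cached" there, and other cpus leave it alone until its per-policy timeout, described in the next paragraph, has expired.

#include <stdbool.h>

typedef unsigned long long u64;

struct cache_info {
    int cached_cpu;      /* cpu the task last ran on */
    u64 cached_since;    /* time of the switch-out, in ns */
    u64 cached_waittime; /* per-policy timeout, see below */
};

/* Called whenever a task is switched off @cpu: ALL tasks are now
 * marked cached, not just those still needing cpu time. */
static void mark_task_cached(struct cache_info *ci, int cpu, u64 now)
{
    ci->cached_cpu = cpu;
    ci->cached_since = now;
}

/* May @cpu run this task?  Its cached cpu always may; any other cpu
 * must wait until the cache-hot window has timed out. */
static bool task_runnable_on(const struct cache_info *ci, int cpu, u64 now)
{
    return cpu == ci->cached_cpu ||
           now - ci->cached_since >= ci->cached_waittime;
}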

Now the only control over cached and "cpu affine" tasks is the cached time-out, and it is a per-task-policy setting. For example, batch/idle tasks have an unlimited cached time-out, as users don't care about their interactivity; in the implementation, the unlimited time-out is set to 1 second. For rt tasks, the time-out is set to the default rr interval (6ms). For normal tasks, which users most likely run, the time-out depends on the preemption model kernel config: when it is CONFIG_PREEMPT_NONE, which means the machine tends to be used as a server and task interactivity doesn't matter, the cached wait time is unlimited; otherwise the time-out is set to the default rr interval (6ms).
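In code, the per-policy settings look roughly like this; the NORMAL block is the actual snippet from bfs.c (quoted again in the comments below), while the batch/idle and rt lines use made-up macro names purely for illustration:

/* bfs default rr interval, in ms */
#define DEFAULT_RR_INTERVAL 6

/* "unlimited" caching = 1 second; unit shown as ms for illustration */
#define UNLIMITED_CACHED_WAITTIME 1000

/* batch/idle tasks: interactivity doesn't matter (illustrative name) */
#define BATCH_POLICY_CACHED_WAITTIME UNLIMITED_CACHED_WAITTIME

/* rt tasks: one default rr interval, 6ms (illustrative name) */
#define RT_POLICY_CACHED_WAITTIME DEFAULT_RR_INTERVAL

/* Normal policy task cached wait time, based on Preemption Model Kernel config */
#ifdef CONFIG_PREEMPT_NONE
#define NORMAL_POLICY_CACHED_WAITTIME UNLIMITED_CACHED_WAITTIME
#else
#define NORMAL_POLICY_CACHED_WAITTIME DEFAULT_RR_INTERVAL
#endif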

The interactivity test that has been done is normal policy mpv h264 play-back with no frame drops, while a nice 19 normal policy 300% compile workload runs in the background.

Batch policy 300% workload compile benchmark:
Previous vrq -- 2m37.840s
Current vrq -- 2m33.831s

Idle policy 300% workload compile benchmark:
Previous vrq -- 2m35.673s
Current vrq -- 2m35.154s

Normal policy 300% workload compile benchmark:
Previous vrq -- 2m37.005s
Current vrq -- 2m36.633s

The results are OK and the new vrq patch is ready for user testing; the all-in-one patch file for kernel 4.3 is uploaded to the bitbucket downloads. It will need more time to clean up the commits before updating the git repository; I'd like to finish that during the new year holiday.

Happy New Year and have fun with this new -vrq patch; your feedback is welcome.

BR Alfred

36 comments:

  1. Interesting stuff as always Alfred; I've been running ck's 467 on 4.3.3, and hope to try out vrq this week as well. Thanks! Looking forward to Manuel's feedback... :)

  2. Here are my observations:
    On my usual setup, running firefox, mpv video playback, and worldcommunitygrid clients (as SCHED_BATCH) on my Core2duo, this patch re-introduces an imbalanced cpu load of the SCHED_NORMAL tasks between cpu0 and cpu1, which I observe in gkrellm. Cpu0 gets a much higher NORMAL load than cpu1: approx. 23% vs. 1%, a sum that with all the previous patches was equalized between the cores (each showing roughly ~12%). So far I haven't seen this negatively affect performance or interactivity -- but I somewhat doubt that this imbalance is really intended behaviour.

    BR Manuel Krause

    1. Some more observations:
      Together with both gkrellm + top: the 24% on cpu0 is mostly firefox. At first I was in doubt whether newly added SCHED_NORMAL tasks would be delegated to cpu1 (to ease cpu0's load), and it seems they properly are. So firefox is staying on cpu0 and not bouncing. That's what I understood from CK's explanations about 0466/7, and it's the desired behaviour? Please correct me if I'm wrong!
      I don't need cpu0/1 graphs that show "pretty" equal loads, but a seamlessly operating desktop with low overhead regarding throughput.
      That goal seems to be reached, although gkrellm's cpu0/1 charts now give quite a different view to judge by.

      Regarding interactivity, or the impact on it from high disk i/o and shm + swap i/o, I can't report anything negative so far. Maybe I'm again getting too enthusiastic, but this patch even eases the flickering of video playback when new windows are opened (which I've had with all former patches, but filed under the i915/Xorg drawer).

      Best regards, Alfred -- thank you for your good work,
      and a successful and happy New Year 2016 to all of you,

      Manuel Krause

    2. Opening two additional firefox sub-windows (from the original one) with two different flash videos playing makes cpu0 go over 75% normal load and cpu1 up to 40%, with the latter staying there. Of course the playback then flickers (lacking gfx hardware) and no one needs this test, but somehow there needs to be more equalization towards the other cpu core... IMO

      BR Manuel Krause

    3. Thanks for testing. The imbalanced cpu load should be the current design intention. Cpu affinity is the core idea of 0466/7; this -vrq release uses this idea and balances performance/interactivity via the existing caching mechanism.
      Yes, in the current implementation I believe tasks are too sticky to their original cpu, and in some cases they should move to other cpus. There are a few things I'd like to try, but of course only under the condition of no performance regression. And it's not likely in this release cycle; for the rest of the time in this cycle I just want to do some clean-up work, like what we would do at the end of a year. :)

      BR Alfred

    4. Mmh... maybe this is of interest:
      Manually setting firefox's affinity to 0x2 makes it attach to cpu1 (expected), and after switching back to 0x3, firefox goes back to cpu0, although cpu1 always shows a less significant load than cpu0.
      Also, playing a bit with kernel compiling on top, "make -j1" attaches to cpu0 and almost leaves cpu1 out of the game.
      So, after these observations, there is most probably some misleading algorithm in the code that always gives new(?) processes priority for cpu0, without knowledge of the "most idle" cpu. I hope this wording is understandable.

      Seems like 2016, too, would get exciting with -VRQ. ;-)
      I wish you "Happy Cleaning" :-)
      BR Manuel

    5. @Manuel
      In the current implementation, IMO, at least two pieces of design logic make VRQ tend to use the first available cpu (in general terms), but from the test results, no performance or interactivity regression was found.
      VRQ does have knowledge of the "most idle" cpu; that's the logic in the first part of the task_preemptable_rq() function, and it's unchanged in this release.
      I have done the test below:
      1. schedtool -a 0x01 -e firefox, letting firefox occupy some (but not 100%) of cpu0.
      2. start a "nice -19 make -2" kernel compile.
      The compile workload occupies cpu1, cpu2 and cpu3, with much less cpu time on cpu0, but the total cpu time of cpu0 is not 100%.
      I think that proves new tasks still attach to idle cpus. You can run a similar test on your side.

      Thanks for your feedback, again.

      BR Alfred

    6. Yes, Alfred, of course your test is supposed to show these results, and yes, it shows the same behaviour on my 2 cores. With my test I wanted to show the opposite, the other side of the medal: the imbalanced delegation. BTW, I usually run the kernel compilation without nice -19, and with "make -j2" --- the above -j1 test was, as I said, to show that cpu0 gets preferred for every new(?) SCHED_NORMAL process, and there is little tendency to switch to cpu1.

      BR Manuel Krause

    7. Forgetful me: With my kernel compilation -j1 there were always two worldcommunitygrid clients running in the background as SCHED_BATCH. Don't know if that info is needed.
      Manuel

    8. @Manuel
      In your test case, how much cpu time does the background workload take on each cpu? And does the kernel compile "make -j1" occupy cpu0 while leaving a lot of free cpu time on cpu1?

    9. @Alfred: Can you please give me brief advice on how to gather the data you want most properly (so as to have some usable average values -- I don't want to estimate from watching gkrellm, including the spikes ^^).
      Once knowing, I'd also get these values for some earlier -vrq kernels as well.
      BR Manuel

    10. @Manuel
      I don't gather actual data; I'm just using htop, probably much like you do.
      As for the imbalanced behaviour: it is the design intention of bfs 0466 and current vrq. If a task is affine to one cpu, that keeps that cpu busy and not scaling down, and takes most advantage of the cpu cache, which gives the biggest performance boost.
      I have tried a few new things to dispatch tasks to another cpu/rq rather than the first preemptable one, but so far nothing gives good enough results to replace the current design.
      So I'd like to keep the current design for a while until I find a better replacement. Imbalanced behaviour is normal as long as there is no performance/interactivity regression. But if you find an imbalance that is obviously wrong or causes a regression, please report it to me.

    11. Sorry for not having supplied test data so far.
      Today I tried the patches from your updated -vrq branch and it got even worse. IMO, at first it blocks cpu0 too much. The graphs look worse than with CFS -- spikes on each of cpu0/1. After a while a switch from cpu0 to cpu1, or vice versa, now occurs.
      After a look into the code: you've taken the _non_ "interactive" approach from Con's 0.466 and adjusted some things first, and then changed some of your former own good working ideas for the worse.

      Yes, I'd like to go back to your first 4.3 VRQ without any influence of BFS 0466.

      BR Manuel Krause

    12. @Manuel
      If you want maximum interactivity, please first check that your preemption model kernel config is *NOT* set to CONFIG_PREEMPT_NONE. In that case the normal priority tasks will use the 6ms caching timeout; this setting works for me but may not work on your machine, as it somewhat depends on the cpu performance to complete the tasks in time.
      You can try reducing the caching timeout by editing the line below in bfs.c (change the DEFAULT_RR_INTERVAL to 5, 4, 3... etc.) and see which value works best for you; I also need that info to further tune the current design.

      /*
       * Normal policy task cached wait time, based on Preemption Model Kernel config
       */
      #ifdef CONFIG_PREEMPT_NONE
      #define NORMAL_POLICY_CACHED_WAITTIME UNLIMITED_CACHED_WAITTIME
      #else
      #define NORMAL_POLICY_CACHED_WAITTIME DEFAULT_RR_INTERVAL    /* line 162 in bfs.c */
      #endif

      BR Alfred

    13. I've let the machine run for some time on its own with your new code. The result: firefox bounces between 50% on cpu0 and 50% on cpu1 and back, from time to time, for unclear durations, bouncing without any need, and still preferring cpu0. No equalization.

      Of course I have:
      # CONFIG_PREEMPT_NONE is not set
      # CONFIG_PREEMPT_VOLUNTARY is not set
      CONFIG_PREEMPT=y

      How/why should I set the rr_interval in the kernel code? The code section you posted looks just the same as what I have here.
      Isn't it sufficient to adjust it with e.g. "echo 5 > /proc/sys/kernel/rr_interval"?

      BR Manuel Krause

    14. Changing the rr_interval has never solved problems for me; I've tried it from time to time over many years, to no effect. BR Manuel Krause

    15. @Manuel
      *NO*, I don't mean adjusting the rr_interval; I don't want to touch it at all.
      But in the current code, the caching timeout of NORMAL priority tasks defaults to the same value as the RR interval, and you need to adjust it to see how it works on your machine. There is no run-time interface to modify this; you need to edit the bfs.c file and re-compile the kernel. The line is #162 in bfs.c, and you need to change DEFAULT_RR_INTERVAL on line 162 to 5, then 4, then 3, etc., to see which value fits your system best.

      BR Alfred

    16. To be clearer, here is the patch to adjust the value to 5; then you can try the others.

      diff --git a/kernel/sched/bfs.c b/kernel/sched/bfs.c
      index 12b0a2b..6f0b585 100644
      --- a/kernel/sched/bfs.c
      +++ b/kernel/sched/bfs.c
      @@ -159,7 +159,7 @@ int rr_interval __read_mostly = DEFAULT_RR_INTERVAL;
      #ifdef CONFIG_PREEMPT_NONE
      #define NORMAL_POLICY_CACHED_WAITTIME UNLIMITED_CACHED_WAITTIME
      #else
      -#define NORMAL_POLICY_CACHED_WAITTIME DEFAULT_RR_INTERVAL
      +#define NORMAL_POLICY_CACHED_WAITTIME 5
      #endif

      /*

    17. And one more thing I'd like to point out: this adjustment of the normal priority caching timeout value just helps to make normal priority tasks more "interactive"; it is *NOT* meant to address task cpu affinity behaviour. So please just focus on normal priority tasks' interactivity.

      My current standard test of normal priority interactivity, FYI, is
      mpv h264 play-back without frame drops while a 300% nice -19 kernel compile workload runs in the background.

      BR Alfred

    18. Hi, Alfred!
      Thank you for your explanations. Now I get it; maybe I should have thought a little more before writing. I want to add that I don't see any interactivity issues at all, not even with an un-niced kernel compile + h264 playback. So this is not my problem, and I seem to have no need to change NORMAL_POLICY_CACHED_WAITTIME for now.
      I'm still very much concerned about the imbalanced cpu affinity thing and, by coincidence, made an observation today that may help you fix it. It is triggered when SCHED_BATCH (same with IDLEPRIO) tasks are running, like my worldcommunitygrid clients. After stopping them, both cores look equally loaded (even when adding a kernel make -j1). When running them, cpu0 gets the NORMAL tasks + some BATCH/IDLE %, and cpu1 runs the rest of the BATCH/IDLE tasks at 100%. Adding a kernel make -j2 then also adds to cpu0 only, with BATCH/IDLE on cpu1 at 100%. That's not o.k. IMO. And this reminds me of a behaviour in the very first days of the -gc patchset; maybe you remember.

      I hope this helps, BR Manuel

    19. @Manuel
      My bad, it turns out that I missed one commit, the most important one, in my git stash before pushing to the git repository.
      The commit is https://bitbucket.org/alfredchen/linux-gc/commits/ad923b521f2517d89f2af22c9e648faf6b2942b0?at=linux-4.3.y-vrq ; it has now been pushed to git.
      Please fetch it and test again; this time the default 6ms should be good enough.

      BR Alfred

    20. @Alfred: O.k., now with the fix, -vrq behaves like the all-in-one patch again, meaning the problems with SCHED_BATCH/IDLEPRIO described above are gone.
      As I found scrolling in firefox getting stuck relatively often, I tried lowering NORMAL_POLICY_CACHED_WAITTIME to 5 in a first round, and it now feels better. I don't know whether it's the perfect setting, but it seems to be sufficient for me.

      BR and thank you,
      Manuel

    21. @Alfred:
      After these findings I want to ask two follow-up questions:
      1) Is it somehow possible to turn the NORMAL_POLICY_CACHED_WAITTIME value into a runtime configurable parameter, as an interim solution? I don't know how to do this on my own, and I want to test different values without changing the current bootup+load situation each time.
      2) How can I persuade you to find a better solution for the cpu affinity equalization? On here, it seems like processes suffer from sticking to one cpu, also in terms of performance. E.g. my firefox, sticking to cpu0, shows low flash video playback performance (frame dropping) when I add such a stream (child processes also affected?).

      I'm absolutely not convinced by your integration of CK's 0466 changes. The same applies to CK's current approach; it seems like he only wants to deflect work and/or criticism (with "interactive"). Your previous -vrq releases (with SCOST) ran better. I don't care about a few centiseconds of kernel compile time advantage, and that had never been the target of CK/BFS for years.

      BR Manuel

    22. @Manuel
      Thanks for your continuous testing of -vrq. Please find your best NORMAL_POLICY_CACHED_WAITTIME value and report back; that will help with the calculation. I'd also like to know your normal cpu workload when testing.
      For your questions:
      1) I'd consider making it configurable if no auto-adjustment solution comes out in the next development cycle (a rough sketch of such a knob is below).
      2) For cpu affinity: in some cases the current formula causes a task to keep waiting for its original cpu even when the smt/llc cpus are also busy. I plan to adjust the formula, but things have to be done one at a time. And I believe in most cases the issue is not caused by cpu affinity, as it helps with the performance boost by keeping the cpu out of idle states and using the cache efficiently. For your example: if flash playback requires 80% cpu, then no matter whether it uses 80% of cpu0 or 40% cpu0 + 40% cpu1, both are fine (and 80% cpu0 is preferred IMO). But if it takes 60% of cpu0 and just 10% of cpu1 (while cpu1 has available cpu time), that is the bad case we need to address.

      I am not going to integrate bfs 0466 wholesale; as I have explained, the core idea of 0466 has been adopted in this -vrq release, and interactivity/performance is balanced by existing -vrq functionality. SCOST and the caching timeout were two methods to control task caching in my original design, and they control it in the same way. SCOST turned out not to be a good design, as I explained in this post, and has been wiped out. Believe me, the caching time-out method can do the same job SCOST did; just lower the value (below 1ms), which gives up more performance for interactivity. But now we need to find the point where they balance each other, as I don't want to lose too much performance for everyday NORMAL policy tasks.
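      For reference, here is a rough, untested sketch of what such a run-time knob could look like, mirroring how rr_interval is already exported through kernel/sysctl.c (the knob name is made up):

      /* kernel/sched/bfs.c: turn the macro into a variable */
      int normal_cached_waittime __read_mostly = NORMAL_POLICY_CACHED_WAITTIME;

      /* kernel/sysctl.c: an entry next to the existing "rr_interval" one;
       * zero and one_hundred are bound variables already defined there */
      {
              .procname       = "normal_cached_waittime",
              .data           = &normal_cached_waittime,
              .maxlen         = sizeof(int),
              .mode           = 0644,
              .proc_handler   = proc_dointvec_minmax,
              .extra1         = &zero,
              .extra2         = &one_hundred,
      },

      Then "echo 4 > /proc/sys/kernel/normal_cached_waittime" would apply a new value without editing bfs.c and recompiling.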

      BR Alfred

    23. @Alfred:
      Thank you very much for your comprehensive additional information.

      As until now, for now and during the coming weeks, I won't be able to provide proper performance-related benchmarks like you do.
      The only things I can supply are results of "good"/"bad" working conditions under certain everyday usage circumstances. Plus, maybe, step-wise testing through NORMAL_POLICY_CACHED_WAITTIME values when I find time. A runtime tunable that doesn't depend on recompiling or boot parameters would help save time. ^^
      Currently I'm at 4. Latency doesn't seem to get better/worse than with 5, and there are no remarkable performance issues. Sidenote: {somehow "4" cooperates better with TuxOnIce resuming, but that may again be a "one-boot-wonder". At least it appears to be safer than with 6 and 5, so far.}

      My report of the flash video playing within firefox should not be understood as a benchmark, only as showing the _possible_ actual design misfits. And it's quite difficult for me to estimate the loads of cpu0/ cpu1, NORMAL/ BATCH, even when comparing top and gkrellm simultaneously. Here is an attempt:

      Normal use:
      cpu0: ff: 30% wcg: 60% |
      cpu1: ff: _?_ wcg: 96% |- unclear 14% SCHED_NORMAL

      With flash from streaming:
      cpu0: ff: 85% wcg: 15% |
      cpu1: ff: 25% wcg: 60% |- unclear 15% SCHED_NORMAL
      _
      wcg = two worldcommunitygrid clients as SCHED_BATCH
      ff = Firefox
      unclear = cannot be divided between cpu0/ cpu1

      It may be that this doesn't help you at all, but your answer could improve my testing.

      BR Manuel Krause

    24. @Manuel
      What about the cpu affinity of the two wcg clients: is one allowed to run only on cpu0 and the other only on cpu1? Or are both of them allowed to run on all cpus?

      BR Alfred

    25. @Alfred:
      They're both at 0x3 regarding affinity, +19 nice and SCHED_BATCH. They get started from the boinc-client.service script. From my long-term observations here, each of them attaches to one cpu core (so it's possible to identify them).

      BR Manuel

    26. @Alfred:
      Today, since 21h, I've been testing NORMAL_POLICY_CACHED_WAITTIME (8), to try the other direction too. Indeed, interactivity seems to suffer compared to 6 and especially 4.
      With this setting of 8 I see additional things: under the previously described circumstances, ff + 2*wcg + smplayer-mpv playback without flash video, the firefox task bounces from cpu0 to cpu1 from time to time, not regularly, at intervals of 1-20s, leading to significant spikes as watched in gkrellm (meaning what goes to cpu1 is "missing" equally on cpu0 for some seconds), while the overall loads of the relevant processes stay equal as observed in top. I don't see any reason for this bouncing, and I haven't seen it with 6, 5 and 4.
      Maybe this is interesting for your further adjustments. ;-)

      Atm., the setting 4 is my overall favourite.

      BR Manuel

    27. @Manuel
      Thanks for testing. I'm busy with the 4.4 sync-up this week. Two suggestions for your testing and usage:
      1. Consider running the wcg clients with the IDLE policy, which will give more cpu time to ff and other NORMAL tasks when they need cpu power.
      2. 6ms is the maximum caching timeout value I currently suggest for interactivity, so there is no point in trying values over 6ms.

      BR Alfred

    28. @Alfred:
      1. Changing them to IDLEPRIO doesn't change anything vs. BATCH regarding more cpu for NORMAL tasks.
      2. With the setting of 8 I only wanted to prove that your decision is right. :-)

      I wish you happy syncing ;-)
      BR Manuel

    29. @Alfred:
      Sorry to disturb your syncing... I've made another observation regarding my previous post.
      Atm. I have two different wcg projects running. Sometimes 2 subclients of the same name run == both from one project, and sometimes 2 from the 2 different projects. They all normally get started as BATCH (like in the long-term past).

      If I manually (schedtool) change the two differently-named subclients to IDLEPRIO, no change happens; NORMAL tasks stay sticky on cpu0.
      If I change the same-named subclients, when they're up, to IDLEPRIO, everything looks equalized == balanced, NORMAL vs. "IDLEPRIO" AND on both cores.
      If I change the same-named "IDLEPRIO" wcg tasks back to "BATCH", it looks unbalanced again, like with the first BATCH run.

      I don't know how to understand this. Please, find time to investigate this.

      BR Manuel Krause

    30. Dear Alfred,
      I still can't explain this behaviour to myself. Today the same situation of two same-named wcg subclients happened, but I wasn't able to reproduce the results described above. Namely, the IDLEPRIO/ BATCH changes didn't change things today (no affinity equalization), but I swear I experienced it yesterday.

      I don't know what happened, and haven't had any errors in the logs.

      BR Manuel

    31. Grrr... :-( Forget my last message. Now I've reproduced it. I'm still trying to figure out how this evolves.
      My last idea about the *same-named* subclients should be sent to /dev/null, as it isn't reasonable or logically provable -- it just looked like that by coincidence.

      Whether setting IDLEPRIO via schedtool results in affinity equalization (or not) seems to depend on _when_ I change a BATCH task to IDLEPRIO in the presence of one or more NORMAL tasks, here mainly ff, which gets more cpu0 demand when a flash stream playback is added within, and then gets near to/ above 100% of cpu0 and wants to use cpu1 too. (I can add that when this equalization occurs, playback stuttering in flash streaming decreases and interactivity in the rest of the system increases.) Sidenote: currently using NORMAL_POLICY_CACHED_WAITTIME (4).

      Atm. I can't get the timeline of this phenomenon reproduced correctly.

      BR Manuel Krause

    32. @Manuel
      Your last observation sounds more reasonable. My suggestion is to lower NORMAL_POLICY_CACHED_WAITTIME to 3, 2 and then 1. Previously, when using SCOST, the caching timeout was about 1/16 ms or something like that. I don't have time to do detailed tests on my old notebook (which uses similar hw to your system), but I have done the mpv h264 + 300% nice -19 compile workload test on it, and the result is good, with no frame drops.

      BR Alfred

  3. Should NORMAL_POLICY_CACHED_WAITTIME perhaps be set at boot time, as some function of max CPU speed? Since you said the caching timeout "somewhat depends on the cpu performance to complete the tasks in time".

    1. Yes, and that's the plan I currently have, but I need test data to work out the formula; there will be some input parameters, such as cpu performance, number of cores, workload, etc.

      BR Alfred
