Monday, December 28, 2015

VRQ v4.3_0466_4 test patch

When BFS 0466 came out, I rebased the -gc and -vrq branches upon it and ran some sanity tests. In short, 0466 improves throughput when the workload is >=100%, for both the bare bfs code and the -gc branch: 2m34.6s for 0466 -gc compared to the original 2m36.7s at 300% workload. It's good to see CK improving bfs, as he had stopped adding new features to bfs for a long time. Though it's known by now that bfs 0466 causes interactivity regressions, and CK released bfs 0467 to address this issue via a manually set scheduler option.

While continuing work on the -vrq branch, I found a regression in the performance sanity tests caused by the bfs 0466 code changes. Using bisect, I tracked it down to the commit "bfs/vrq: [3/3] preempt task solution". The original design of the task_preemptable_rq() function was to go through all cpus to find the running task with the highest priority/deadline and set its rq as the target to be preempted; that way, only the least important running task got preempted. But going through *all* cpus to find the best one turned out to be an overhead which caused the regression. So, alternatively, the function now selects the first cpu running a task with a higher priority/deadline than the given task as the cpu/rq to be preempted. With this code change, the best sanity benchmark result so far was recorded for the -vrq branch.
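To illustrate the difference, below is a minimal userspace sketch of the two selection strategies (simplified stand-ins, not the actual -vrq code); it follows the bfs convention that a numerically higher priority/deadline value means a less important, and thus more preemptable, running task:

#include <stdio.h>

#define NR_CPUS 4

/* Old design: scan ALL cpus and pick the one running the most
 * preemptable (highest prio/deadline value) task -- a full O(NR_CPUS)
 * scan on every wakeup. */
static int best_preemptable_cpu(const int rq_prio[], int task_prio)
{
    int cpu, best = -1;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (rq_prio[cpu] > task_prio &&
            (best < 0 || rq_prio[cpu] > rq_prio[best]))
            best = cpu;
    return best;
}

/* New design: return the FIRST cpu whose running task can be
 * preempted -- the scan can stop early, removing the bottleneck. */
static int first_preemptable_cpu(const int rq_prio[], int task_prio)
{
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (rq_prio[cpu] > task_prio)
            return cpu;
    return -1;
}

int main(void)
{
    int rq_prio[NR_CPUS] = { 120, 139, 100, 130 }; /* per-cpu running task prio */
    int waking = 110;                              /* prio of the waking task */

    printf("best : cpu%d\n", best_preemptable_cpu(rq_prio, waking));  /* cpu1 */
    printf("first: cpu%d\n", first_preemptable_cpu(rq_prio, waking)); /* cpu0 */
    return 0;
}

Both versions pick a cpu the waking task may preempt; the new one simply trades target quality for scan cost.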

After removing the performance bottleneck, it was time to handle the interactivity issue. In original bfs, sticky tasks are (a) not allowed to run on a cpu which is scaling, and (b) kept affine to their cpu by adjusting their deadline. Looking back at the bfs 0466 code changes, they make not only sticky tasks cpu affine, but *ALL* tasks. This improves performance but impacts interactivity at the same time. When CK released bfs 0467 to address the interactivity issue, it introduced a run-time option to switch the behaviour. Considering that in -vrq the sticky task mechanism has been replaced by the cached task mechanism, with SCOST and the cached timeout introduced to control when a task should be treated as cached, I decided to use the existing -vrq code to balance task performance and interactivity.

First of all, all tasks switched out of a cpu are now marked "cached"; previously only the tasks which still needed cpu (in activate schedule) were marked "cached".
Secondly, all newly forked tasks are marked "cpu affine"; based on the tests, this also contributes to the performance improvement.
Thirdly, with the bottleneck removed, the SCOST design could truly be tested. It turns out it does not work as expected (a huge threshold doesn't impact performance), at least for my sanity test pattern (all gcc instances share the binary, only the PSS differs among the gcc threads running at the same time). SCOST may not be a good design; in other words, it may be a bad design, because the threshold is tuned under one particular pattern, and for other patterns it may hurt performance or interactivity. The SCOST code still exists in this patch, but it is no longer functional at all, and it will be removed when I clean up the commits.
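To make the caching mechanism concrete, here is a purely illustrative sketch (hypothetical names and fields, not the actual -vrq code): every task switched off a cpu is stamped "cached" there, and other cpus leave it alone until its per-policy timeout, described in the next paragraph, has expired.

#include <stdbool.h>

typedef unsigned long long u64;

struct cache_info {
    int cached_cpu;      /* cpu the task last ran on */
    u64 cached_since;    /* time of the switch-out, in ns */
    u64 cached_waittime; /* per-policy timeout, see below */
};

/* Called whenever a task is switched off @cpu: ALL tasks are now
 * marked cached, not just those still needing cpu time. */
static void mark_task_cached(struct cache_info *ci, int cpu, u64 now)
{
    ci->cached_cpu = cpu;
    ci->cached_since = now;
}

/* May @cpu run this task?  Its cached cpu always may; any other cpu
 * must wait until the cache-hot window has timed out. */
static bool task_runnable_on(const struct cache_info *ci, int cpu, u64 now)
{
    return cpu == ci->cached_cpu ||
           now - ci->cached_since >= ci->cached_waittime;
}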

Now the only control over cached and "cpu affine" tasks is the cached time-out, and it is a per-task-policy setting. For example, batch/idle tasks have an unlimited cached time-out, as users don't care about their interactivity; in the implementation, the unlimited time-out is set to 1 second. For rt tasks, the time-out is set to the default rr interval (6ms). For normal tasks, which users most likely run, the time-out depends on the preemption model kernel config: when it is CONFIG_PREEMPT_NONE, which means the machine tends to be used as a server and task interactivity doesn't matter, the cached wait time is unlimited; otherwise the time-out is set to the default rr interval (6ms).
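In code, the per-policy settings look roughly like this; the NORMAL block is the actual snippet from bfs.c (quoted again in the comments below), while the batch/idle and rt lines use made-up macro names purely for illustration:

/* bfs default rr interval, in ms */
#define DEFAULT_RR_INTERVAL 6

/* "unlimited" caching = 1 second; unit shown as ms for illustration */
#define UNLIMITED_CACHED_WAITTIME 1000

/* batch/idle tasks: interactivity doesn't matter (illustrative name) */
#define BATCH_POLICY_CACHED_WAITTIME UNLIMITED_CACHED_WAITTIME

/* rt tasks: one default rr interval, 6ms (illustrative name) */
#define RT_POLICY_CACHED_WAITTIME DEFAULT_RR_INTERVAL

/* Normal policy task cached wait time, based on Preemption Model Kernel config */
#ifdef CONFIG_PREEMPT_NONE
#define NORMAL_POLICY_CACHED_WAITTIME UNLIMITED_CACHED_WAITTIME
#else
#define NORMAL_POLICY_CACHED_WAITTIME DEFAULT_RR_INTERVAL
#endif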

The interactivity test that has been done is normal policy mpv h264 play-back with no frame drops, while a nice 19 normal policy 300% compile workload runs in the background.

Batch policy 300% workload compile benchmark:
Previous vrq -- 2m37.840s
Current vrq -- 2m33.831s

Idle policy 300% workload compile benchmark:
Previous vrq -- 2m35.673s
Current vrq -- 2m35.154s

Normal policy 300% workload compile benchmark:
Previous vrq -- 2m37.005s
Current vrq -- 2m36.633s

The results are OK and the new vrq patch is ready for user testing; the all-in-one patch file for kernel 4.3 is uploaded to the bitbucket downloads. It will need more time to clean up the commits before updating the git repository; I'd like to finish that during the new year holiday.

Happy New Year and have fun with this new -vrq patch; your feedback is welcome.

BR Alfred

36 comments:

  1. Interesting stuff as always Alfred; I've been running ck's 467 on 4.3.3, and hope to try out vrq this week as well. Thanks! Looking forward to Manuel's feedback... :)

  2. Here are my observations:
    On my usual setup, running firefox, mpv video playback, and worldcommunitygrid clients (as SCHED_BATCH) on my Core2duo, this patch re-introduces an imbalanced cpu load of the SCHED_NORMAL tasks between cpu0 and cpu1, which I observe in gkrellm. Cpu0 gets a much higher NORMAL load than cpu1: approx. 23% vs. 1%, a sum that with all the previous patches was equalized between the cores (each showing roughly ~12%). So far I haven't seen this negatively affect performance or interactivity -- but I somewhat doubt that this imbalance is really intended behaviour.

    BR Manuel Krause

    1. Some more observations:
      Together with both gkrellm + top: the 24% on cpu0 is mostly firefox. At first I was in doubt whether newly added SCHED_NORMAL tasks would be delegated to cpu1 (to ease cpu0's load), and it seems they properly are. So firefox is staying on cpu0 and not bouncing. That's what I understood from CK's explanations about 0466/7, and it's the desired behaviour? Please correct me if I'm wrong!
      I don't need cpu0/1 graphs that show "pretty" equal loads, but a seamlessly operating desktop with low overhead regarding throughput.
      That goal seems to be reached, although gkrellm's cpu0/1 charts now give quite a different view to judge by.

      Regarding interactivity, or the impact on it from high disk i/o and shm + swap i/o, I can't report anything negative so far. Maybe I'm again getting too enthusiastic, but this patch even eases the flickering of video playback when new windows are opened (which I've had with all former patches, but filed under the i915/Xorg drawer).

      Best regards, Alfred -- thank you for your good work,
      and a successful and happy New Year 2016 to all of you,

      Manuel Krause

    2. Opening two additional firefox sub-windows (from the original one) with two different flash videos playing makes cpu0 go over 75% normal load and cpu1 up to 40%, with the latter staying there. Of course the playback then flickers (lacking gfx hardware) and no one needs this test, but somehow there needs to be more equalization towards the other cpu core... IMO

      BR Manuel Krause

    3. Thanks for testing. The imbalanced cpu load should be the current design intention. Cpu affinity is the core idea of 0466/7; this -vrq release uses this idea and balances performance/interactivity via the existing caching mechanism.
      Yes, in the current implementation I believe tasks are too sticky to their original cpu, and in some cases they should move to other cpus. There are a few things I'd like to try, but of course only under the condition of no performance regression. And it's not likely in this release cycle; for the rest of the time in this cycle I just want to do some clean-up work, like what we would do at the end of a year. :)

      BR Alfred

    4. Mmh... maybe this is of interest:
      Manually setting firefox's affinity to 0x2 makes it attach to cpu1 (expected), and after switching back to 0x3, firefox goes back to cpu0, although cpu1 always shows a less significant load than cpu0.
      Also, playing a bit with kernel compiling on top, "make -j1" attaches to cpu0 and almost leaves cpu1 out of the game.
      So, after these observations, there is most probably some misleading algorithm in the code that always gives new(?) processes priority for cpu0, without knowledge of the "most idle" cpu. I hope this wording is understandable.

      Seems like 2016, too, would get exciting with -VRQ. ;-)
      I wish you "Happy Cleaning" :-)
      BR Manuel

    5. @Manuel
      In the current implementation, IMO, at least two pieces of design logic make VRQ tend to use the first available cpu (in general terms), but from the test results, no performance or interactivity regression was found.
      VRQ does have knowledge of the "most idle" cpu; that's the logic in the first part of the task_preemptable_rq() function, and it's unchanged in this release.
      I have done the test below:
      1. schedtool -a 0x01 -e firefox, letting firefox occupy some (but not 100%) of cpu0.
      2. start a "nice -19 make -2" kernel compile.
      The compile workload occupies cpu1, cpu2 and cpu3, with much less cpu time on cpu0, but the total cpu time of cpu0 is not 100%.
      I think that proves new tasks still attach to idle cpus. You can run a similar test on your side.

      Thanks for your feedback, again.

      BR Alfred

    6. Yes, Alfred, of course your test is supposed to show these results, and yes, it shows the same behaviour on my 2 cores. With my test I wanted to show the opposite, the other side of the medal: the imbalanced delegation. BTW, I usually run the kernel compilation without nice -19, and with "make -j2" --- the above -j1 test was, as I said, to show that cpu0 gets preferred for every new(?) SCHED_NORMAL process, and there is little tendency to switch to cpu1.

      BR Manuel Krause

    7. Forgetful me: With my kernel compilation -j1 there were always two worldcommunitygrid clients running in the background as SCHED_BATCH. Don't know if that info is needed.
      Manuel

    8. @Manuel
      In your test case, how much cpu time does the background workload take on each cpu? And does the kernel compile "make -j1" occupy cpu0 while leaving a lot of free cpu time on cpu1?

    9. @Alfred: Can you please give me brief advice on how to gather the data you want most properly (so as to have some usable average values -- I don't want to estimate from watching gkrellm, including the spikes ^^).
      Once knowing, I'd also get these values for some earlier -vrq kernels as well.
      BR Manuel

    10. @Manuel
      I don't gather actual data; I'm just using htop, probably much like you do.
      As for the imbalanced behaviour: it is the design intention of bfs 0466 and current vrq. If a task is affine to one cpu, that keeps that cpu busy and not scaling down, and takes most advantage of the cpu cache, which gives the biggest performance boost.
      I have tried a few new things to dispatch tasks to another cpu/rq rather than the first preemptable one, but so far nothing gives good enough results to replace the current design.
      So I'd like to keep the current design for a while until I find a better replacement. Imbalanced behaviour is normal as long as there is no performance/interactivity regression. But if you find an imbalance that is obviously wrong or causes a regression, please report it to me.

    11. Sorry for not having supplied test data so far.
      Today I tried the patches from your updated -vrq branch and it got even worse. IMO, at first it blocks cpu0 too much. The graphs look worse than with CFS -- spikes on each of cpu0/1. After a while a switch from cpu0 to cpu1, or vice versa, now occurs.
      After a look into the code: you've taken the _non_ "interactive" approach from Con's 0.466 and adjusted some things first, and then changed some of your former own good working ideas for the worse.

      Yes, I'd like to go back to your first 4.3 VRQ without any influence of BFS 0466.

      BR Manuel Krause

    12. @Manuel
      If you want maximum interactivity, please first check that your preemption model kernel config is *NOT* set to CONFIG_PREEMPT_NONE. In that case the normal priority tasks will use the 6ms caching timeout; this setting works for me but may not work on your machine, as it somewhat depends on the cpu performance to complete the tasks in time.
      You can try reducing the caching timeout by editing the line below in bfs.c (change the DEFAULT_RR_INTERVAL to 5, 4, 3... etc.) and see which value works best for you; I also need that info to further tune the current design.

      /*
       * Normal policy task cached wait time, based on Preemption Model Kernel config
       */
      #ifdef CONFIG_PREEMPT_NONE
      #define NORMAL_POLICY_CACHED_WAITTIME UNLIMITED_CACHED_WAITTIME
      #else
      #define NORMAL_POLICY_CACHED_WAITTIME DEFAULT_RR_INTERVAL    /* line 162 in bfs.c */
      #endif

      BR Alfred

    13. I've let the machine run for some time on its own with your new code. The result: firefox bounces between 50% on cpu0 and 50% on cpu1 and back, from time to time, for unclear durations, bouncing without any need, and still preferring cpu0. No equalization.

      Of course I have:
      # CONFIG_PREEMPT_NONE is not set
      # CONFIG_PREEMPT_VOLUNTARY is not set
      CONFIG_PREEMPT=y

      How/why should I set the rr_interval in the kernel code? The code section you posted looks just the same as what I have here.
      Isn't it sufficient to adjust it with e.g. "echo 5 > /proc/sys/kernel/rr_interval"?

      BR Manuel Krause

    14. Changing the rr_interval has never solved problems for me; I've tried it from time to time over many years, to no effect. BR Manuel Krause

    15. @Manuel
      *NO*, I don't mean adjusting the rr_interval; I don't want to touch it at all.
      But in the current code, the caching timeout of NORMAL priority tasks defaults to the same value as the RR interval, and you need to adjust it to see how it works on your machine. There is no run-time interface to modify this; you need to edit the bfs.c file and re-compile the kernel. The line is #162 in bfs.c, and you need to change DEFAULT_RR_INTERVAL on line 162 to 5, then 4, then 3, etc., to see which value fits your system best.

      BR Alfred

    16. To be clearer, here is the patch to adjust the value to 5; then you can try the others.

      diff --git a/kernel/sched/bfs.c b/kernel/sched/bfs.c
      index 12b0a2b..6f0b585 100644
      --- a/kernel/sched/bfs.c
      +++ b/kernel/sched/bfs.c
      @@ -159,7 +159,7 @@ int rr_interval __read_mostly = DEFAULT_RR_INTERVAL;
      #ifdef CONFIG_PREEMPT_NONE
      #define NORMAL_POLICY_CACHED_WAITTIME UNLIMITED_CACHED_WAITTIME
      #else
      -#define NORMAL_POLICY_CACHED_WAITTIME DEFAULT_RR_INTERVAL
      +#define NORMAL_POLICY_CACHED_WAITTIME 5
      #endif

      /*

    17. And one more thing I'd like to point out: this adjustment of the normal priority caching timeout value just helps to make normal priority tasks more "interactive"; it is *NOT* meant to address task cpu affinity behaviour. So please just focus on normal priority tasks' interactivity.

      My current standard test of normal priority interactivity, FYI, is
      mpv h264 play-back without frame drops while a 300% nice -19 kernel compile workload runs in the background.

      BR Alfred

    18. Hi, Alfred!
      Thank you for your explanations. Now I get it; maybe I should have thought a little more before writing. I want to add that I don't see any interactivity issues at all, not even with an un-niced kernel compile + h264 playback. So this is not my problem, and I seem to have no need to change NORMAL_POLICY_CACHED_WAITTIME for now.
      I'm still very much concerned about the imbalanced cpu affinity thing and, by coincidence, made an observation today that may help you fix it. It is triggered when SCHED_BATCH (same with IDLEPRIO) tasks are running, like my worldcommunitygrid clients. After stopping them, both cores look equally loaded (even when adding a kernel make -j1). When running them, cpu0 gets the NORMAL tasks + some BATCH/IDLE %, and cpu1 runs the rest of the BATCH/IDLE tasks at 100%. Adding a kernel make -j2 then also adds to cpu0 only, with BATCH/IDLE on cpu1 at 100%. That's not o.k. IMO. And this reminds me of a behaviour in the very first days of the -gc patchset; maybe you remember.

      I hope this helps, BR Manuel

    19. @Manuel
      My bad, it turns out that I missed one commit, the most important one, in my git stash before pushing to the git repository.
      The commit is https://bitbucket.org/alfredchen/linux-gc/commits/ad923b521f2517d89f2af22c9e648faf6b2942b0?at=linux-4.3.y-vrq ; it has now been pushed to git.
      Please fetch it and test again; this time the default 6ms should be good enough.

      BR Alfred

    20. @Alfred: O.k., now with the fix, -vrq behaves like the all-in-one patch again, meaning the problems with SCHED_BATCH/IDLEPRIO described above are gone.
      As I found scrolling in firefox getting stuck relatively often, I tried lowering NORMAL_POLICY_CACHED_WAITTIME to 5 in a first round, and it now feels better. I don't know whether it's the perfect setting, but it seems to be sufficient for me.

      BR and thank you,
      Manuel

    21. @Alfred:
      After these findings I want to ask two follow-up questions:
      1) Is it somehow possible to turn the NORMAL_POLICY_CACHED_WAITTIME value into a runtime configurable parameter, as an interim solution? I don't know how to do this on my own, and I want to test different values without changing the current bootup+load situation each time.
      2) How can I persuade you to find a better solution for the cpu affinity equalization? On here, it seems like processes suffer from sticking to one cpu, also in terms of performance. E.g. my firefox, sticking to cpu0, shows low flash video playback performance (frame dropping) when I add such a stream (child processes also affected?).

      I'm absolutely not convinced by your integration of CK's 0466 changes. The same applies to CK's current approach; it seems like he only wants to deflect work and/or criticism (with "interactive"). Your previous -vrq releases (with SCOST) ran better. I don't care about a few centiseconds of kernel compile time advantage, and that had never been the target of CK/BFS for years.

      BR Manuel

    22. @Manuel
      Thanks for your continuous testing of -vrq. Please find your best NORMAL_POLICY_CACHED_WAITTIME value and report back; that will help with the calculation. I'd also like to know your normal cpu workload when testing.
      For your questions:
      1) I'd consider making it configurable if no auto-adjustment solution comes out in the next development cycle (a rough sketch of such a knob is below).
      2) For cpu affinity: in some cases the current formula causes a task to keep waiting for its original cpu even when the smt/llc cpus are also busy. I plan to adjust the formula, but things have to be done one at a time. And I believe in most cases the issue is not caused by cpu affinity, as it helps with the performance boost by keeping the cpu out of idle states and using the cache efficiently. For your example: if flash playback requires 80% cpu, then no matter whether it uses 80% of cpu0 or 40% cpu0 + 40% cpu1, both are fine (and 80% cpu0 is preferred IMO). But if it takes 60% of cpu0 and just 10% of cpu1 (while cpu1 has available cpu time), that is the bad case we need to address.

      I am not going to integrate bfs 0466 wholesale; as I have explained, the core idea of 0466 has been adopted in this -vrq release, and interactivity/performance is balanced by existing -vrq functionality. SCOST and the caching timeout were two methods to control task caching in my original design, and they control it in the same way. SCOST turned out not to be a good design, as I explained in this post, and has been wiped out. Believe me, the caching time-out method can do the same job SCOST did; just lower the value (below 1ms), which gives up more performance for interactivity. But now we need to find the point where they balance each other, as I don't want to lose too much performance for everyday NORMAL policy tasks.
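      For reference, here is a rough, untested sketch of what such a run-time knob could look like, mirroring how rr_interval is already exported through kernel/sysctl.c (the knob name is made up):

      /* kernel/sched/bfs.c: turn the macro into a variable */
      int normal_cached_waittime __read_mostly = NORMAL_POLICY_CACHED_WAITTIME;

      /* kernel/sysctl.c: an entry next to the existing "rr_interval" one;
       * zero and one_hundred are bound variables already defined there */
      {
              .procname       = "normal_cached_waittime",
              .data           = &normal_cached_waittime,
              .maxlen         = sizeof(int),
              .mode           = 0644,
              .proc_handler   = proc_dointvec_minmax,
              .extra1         = &zero,
              .extra2         = &one_hundred,
      },

      Then "echo 4 > /proc/sys/kernel/normal_cached_waittime" would apply a new value without editing bfs.c and recompiling.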

      BR Alfred

    23. @Alfred:
      Thank you very much for your comprehensive additional information.

      As until now, for now and during the coming weeks, I won't be able to provide proper performance-related benchmarks like you do.
      The only things I can supply are results of "good"/"bad" working conditions under certain everyday usage circumstances. Plus, maybe, step-wise testing through NORMAL_POLICY_CACHED_WAITTIME values when I find time. A runtime tunable that doesn't depend on recompiling or boot parameters would help save time. ^^
      Currently I'm at 4. Latency doesn't seem to get better/worse than with 5, and there are no remarkable performance issues. Sidenote: {somehow "4" cooperates better with TuxOnIce resuming, but that may again be a "one-boot-wonder". At least it appears to be safer than with 6 and 5, so far.}

      My report of the flash video playing within firefox should not be understood as a benchmark, only as showing the _possible_ actual design misfits. And it's quite difficult for me to estimate the loads of cpu0/ cpu1, NORMAL/ BATCH, even when comparing top and gkrellm simultaneously. Here is an attempt:

      Normal use:
      cpu0: ff: 30% wcg: 60% |
      cpu1: ff: _?_ wcg: 96% |- unclear 14% SCHED_NORMAL

      With flash from streaming:
      cpu0: ff: 85% wcg: 15% |
      cpu1: ff: 25% wcg: 60% |- unclear 15% SCHED_NORMAL
      _
      wcg = two worldcommunitygrid clients as SCHED_BATCH
      ff = Firefox
      unclear = cannot be divided between cpu0/ cpu1

      It may be that this doesn't help you at all, but your answer could improve my testing.

      BR Manuel Krause

    24. @Manuel
      What about the cpu affinity of the two wcg clients: is one allowed to run only on cpu0 and the other only on cpu1? Or are both of them allowed to run on all cpus?

      BR Alfred

    25. @Alfred:
      They're both at 0x3 regarding affinity, +19 nice and SCHED_BATCH. They get started from the boinc-client.service script. From my long-term observations here, each of them attaches to one cpu core (so it's possible to identify them).

      BR Manuel

    26. @Alfred:
      Today, since 21h, I've been testing NORMAL_POLICY_CACHED_WAITTIME (8), to try the other direction too. Indeed, interactivity seems to suffer compared to 6 and especially 4.
      With this setting of 8 I see additional things: under the previously described circumstances, ff + 2*wcg + smplayer-mpv playback without flash video, the firefox task bounces from cpu0 to cpu1 from time to time, not regularly, at intervals of 1-20s, leading to significant spikes as watched in gkrellm (meaning what goes to cpu1 is "missing" equally on cpu0 for some seconds), while the overall loads of the relevant processes stay equal as observed in top. I don't see any reason for this bouncing, and I haven't seen it with 6, 5 and 4.
      Maybe this is interesting for your further adjustments. ;-)

      Atm., the setting 4 is my overall favourite.

      BR Manuel

    27. @Manuel
      Thanks for testing. I'm busy with the 4.4 sync-up this week. Two suggestions for your testing and usage:
      1. Consider running the wcg clients with the IDLE policy, which will give more cpu time to ff and other NORMAL tasks when they need cpu power.
      2. 6ms is the maximum caching timeout value I currently suggest for interactivity, so there is no point in trying values over 6ms.

      BR Alfred

    28. @Alfred:
      1. Changing them to IDLEPRIO doesn't change anything vs. BATCH regarding more cpu for NORMAL tasks.
      2. With the setting of 8 I only wanted to prove that your decision is right. :-)

      I wish you happy syncing ;-)
      BR Manuel

    29. @Alfred:
      Sorry to disturb your syncing... I've made another observation regarding my previous post.
      Atm. I have two different wcg projects running. Sometimes 2 subclients of the same name run == both from one project, and sometimes 2 from the 2 different projects. They all normally get started as BATCH (like in the long-term past).

      If I manually (schedtool) change the two differently-named subclients to IDLEPRIO, no change happens; NORMAL tasks stay sticky on cpu0.
      If I change the same-named subclients, when they're up, to IDLEPRIO, everything looks equalized == balanced, NORMAL vs. "IDLEPRIO" AND on both cores.
      If I change the same-named "IDLEPRIO" wcg tasks back to "BATCH", it looks unbalanced again, like with the first BATCH run.

      I don't know how to understand this. Please, find time to investigate this.

      BR Manuel Krause

    30. Dear Alfred,
      I still can't explain this behaviour to myself. Today the same situation of two same-named wcg subclients happened, but I wasn't able to reproduce the results described above. Namely, the IDLEPRIO/ BATCH changes didn't change things today (no affinity equalization), but I swear I experienced it yesterday.

      I don't know what happened, and haven't had any errors in the logs.

      BR Manuel

    31. Grrr... :-( Forget my last message. Now I've reproduced it. I'm still trying to figure out how this evolves.
      My last idea about the *same-named* subclients should be sent to /dev/null, as it isn't reasonable or logically provable -- it just looked like that by coincidence.

      Whether setting IDLEPRIO via schedtool results in affinity equalization (or not) seems to depend on _when_ I change a BATCH task to IDLEPRIO in the presence of one or more NORMAL tasks, here mainly ff, which gets more cpu0 demand when a flash stream playback is added within, and then gets near to/ above 100% of cpu0 and wants to use cpu1 too. (I can add that when this equalization occurs, playback stuttering in flash streaming decreases and interactivity in the rest of the system increases.) Sidenote: currently using NORMAL_POLICY_CACHED_WAITTIME (4).

      Atm. I can't get the timeline of this phenomenon reproduced correctly.

      BR Manuel Krause

    32. @Manuel
      Your last observation sounds more reasonable. My suggestion is to lower NORMAL_POLICY_CACHED_WAITTIME to 3, 2 and then 1. Previously, when using SCOST, the caching timeout was about 1/16 ms or something like that. I don't have time to do detailed tests on my old notebook (which uses similar hw to your system), but I have done the mpv h264 + 300% nice -19 compile workload test on it, and the result is good, with no frame drops.

      BR Alfred

  3. Should NORMAL_POLICY_CACHED_WAITTIME perhaps be set at boot time, as some function of max CPU speed? Since you said the caching timeout "somewhat depends on the cpu performance to complete the tasks in time".

    1. Yes, and that's the plan I currently have, but I need test data to work out the formula; there will be some input parameters, such as cpu performance, number of cores, workload, etc.

      BR Alfred
