Sunday, October 9, 2016

VRQ patch v4.8.1-vrq0 released

VRQ patch v4.8.1-vrq0 is released, and an all-in-one patch is available. The linux-4.8.y-vrq branch has been pushed to Bitbucket and GitHub.

What's new
1. Sync up with 4.8 mainline scheduler code changes.
2. Introduced a skip list as the queue data structure. For details about the skip list implementation, please refer to my previous posts (#1, #2, #3) and to the original skip list design idea from CK.
3. Based on user feedback, the sticky and caching features are turned off for NORMAL policy tasks, which gives the best interactivity for desktop and gaming usage (commit is here). Further tuning may still be needed, but based on the feedback so far it looks like the best choice out of the 4 debug patches.
4. Workaround fix for a boot-up issue caused by the preempt task solution.
5. A v2 fix for smp_processor_id() preempt code usage in smpboot_thread_fn().
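
To illustrate item 2, here is a minimal sketch of a skip list used as a priority queue. It is not the VRQ implementation (that lives in the kernel, in C); the class name `SkipListQueue`, the `MAX_LEVEL` value and the use of a plain numeric key are assumptions made only for this example.

```python
import random


class _Node:
    __slots__ = ("key", "value", "forward")

    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * level  # one "next" pointer per level


class SkipListQueue:
    """Hypothetical skip-list priority queue: smallest key is served first."""

    MAX_LEVEL = 8  # assumed cap, chosen for the example only

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._head = _Node(None, None, self.MAX_LEVEL)  # sentinel, holds no item
        self._level = 1

    def _random_level(self):
        # Each extra level is taken with probability 1/2.
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < 0.5:
            level += 1
        return level

    def insert(self, key, value):
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        for i in range(self._level - 1, -1, -1):
            # "<=" keeps items with equal keys in insertion (FIFO) order.
            while node.forward[i] is not None and node.forward[i].key <= key:
                node = node.forward[i]
            update[i] = node
        level = self._random_level()
        self._level = max(self._level, level)
        new = _Node(key, value, level)
        for i in range(level):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def pop_min(self):
        # The smallest item is always the first node on the bottom level,
        # and it is also the first node on every level it takes part in.
        first = self._head.forward[0]
        if first is None:
            return None
        for i in range(len(first.forward)):
            self._head.forward[i] = first.forward[i]
        return first.key, first.value
```

The point of the structure is that picking the next item only touches the head of the list, while insertion skips over long runs of nodes on the upper levels instead of walking the whole queue.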


PS:

The policy-based sticky/caching feature is almost done. Unlike the interactivity switcher in the original BFS, this feature provides throughput and interactivity at the same time. "At the same time" does not mean that a single task gets both throughput and interactivity; it means that throughput tasks and interactivity tasks can run side by side. Just assign the right policy to each task.


Enjoy VRQ for the 4.8 kernel :)

BR Alfred

22 comments:

  1. @Alfred,

    please rescue my previous comment, thanks.

    Br, Eduardo

    1. Below is the eaten post ;)

      ***
      Hi!

      I took all-in-one patch for 4.8.1 (VRQ0), the same config as always (but NO BFQ yet) and first results of testing are there (as always): https://docs.google.com/spreadsheets/d/1EayezAsGlJdXjZbS3b9m7YtvtRF-DJ3xrT3hYCvfymQ/edit?usp=sharing

      Please take a look, if interested. It seems that VRQ results are "mixed".
      I have updated the system, so I have marked all results from before update as "set 1" and after update "set 2", sets are not directly comparable to reveal scheduler behaviour.
      The most correct comparison for scheduler behaviour is page "Performance (DRI3), SET 2", BUT please note that VRQ is for 4.8.1 but latest MuQSS I tested is 4.7.7, maybe kernel is making a difference. Dunno yet.
      I'll update the results once I compile MuQSS for 4.8.1 (for an apples-to-apples comparison with VRQ), but I'm not quite sure when, coz there seem to be loads of MuQSS fixes "pending"; I'll wait for them to settle down a little.

      I tested D3 as well and my feeling is that this is the worst stuttering I've had so far :( Dunno what it is related to, but it's bad and happens only with VRQ (double checked against Ubuntu standard / BFS / MuQSS).

      Br, Eduardo
      ***

    2. IMO, the test results show that all v4.7 kernels are close to each other, and there is only one 4.8 sample, so let's give that low priority.
      Let's focus on the D3 issue. Again, there is only one 4.8 sample here, so it's hard to tell whether it was introduced in 4.8-vrq0 or by changes in other components between 4.7 and 4.8.

      As I remember, you tested D3 with the 4.7 sl checkpoint tag kernels and reported that #5 is good and #6 stutters (as expected). Have you tried #6 (4.7_0472_sl_new_vrq_full) + the _s0c0 patch which I recommended? There are very few scheduler code changes between that and this new 4.8_vrq0 patch, so we definitely need to check the result on a 4.7 kernel to rule out other component changes in 4.8.

      BR Alfred

    3. @Alfred,

      Con released mux 110, I'll compile it for 4.8.1 to see how it fares, so we'll have another result for 4.8.1. I'll throw in the mainline version for good measure.
      I did not try 4.7 + #6 + s0c0, I'll compile it, hopefully, some time this week, and will try to run D3.

      Br, Eduardo

    4. @Alfred,

      I added mux 110 results to the gsheets, the difference between both of 4.8.1 results is 0.15% (I sneezed during benchmarking, so it sort of made an impact on the results :D). Performance wise we are good!

      I'll compile Your suggested changes for #6 and let's see how that fares.

      There is one more thing :), which doesn't surprise me much, but is still a good reference for how a scheduler can impact performance.
      I compiled 4.8.1 VRQ and used it on my laptop; compilations are done in a VBox VM which had the default mainline kernel. You may remember that compilation takes about 2hrs for me (~2:15 to be a bit more precise) and You said it's sorta long, which it was. So I compiled Con's mux 110 and decided to install it in the VM. Mux is an evolution of BFS, so in that regard it does not matter whether I installed 4.8 VRQ or MUX there.
      So I started to compile 4.7.7+mux110 for my tests and it finished in ~ 1:15, which is almost 2 TIMES faster! This is one real-world non-scientific non-hard-to-measure improvement.
      Thanks for Your work.

      Br, Eduardo

    5. @Alfred,

      I tested the combination of 4.7.7+v4.7_0472_sl_new_vrq_full.patch+4.7_vrq_test1_debug_s0c0.patch. In D3 it stutters, not as bad as 4.8.1, but quite bad, I would say standard bad :)
      I double-checked that I'm running right patch and debug patch.
      In addition I added my Unigine testing results in google sheets as well.

      Br, Eduardo

    6. @Eduardo
      Thanks for the D3 testing. It finally turns out that a commit between #5 and #6 contributes to the D3 issue, not the sticky/caching feature as I once expected.
      There are 12 commits between #5 and #6, so it takes 4 iterations to find out which one is the bad one. Can you bisect it yourself, or do you want an all-in-one patch for each iteration?

  2. @Alfred,

    It would be easier and more effective for me and most likely for You to give me exactly what You want me to test. As I said I'm not familiar with bisecting really :(
    You can send patch by e-mail and we can write summary and findings here, or You can send me instructions how to bisect by this particular example, so I learn smth (fast and effective) :)

    Br, Eduardo

    1. @Eduardo, Alfred:
      Explaining/learning bisecting isn't that difficult. I can look up the short list of related "git" commands in my docs that helped me, if you want.
      In this particular case, please follow Alfred's advice. This is not meant as a lame excuse: at the moment I can't clearly see which of the 12 mentioned, but as yet unnamed, commits depend on each other.

      Normally bisecting starts with a last known good commit (A) and a currently bad one (Z). You take back all commits/patches down to the middle position (M), then test the result to see whether the error still exists. If yes, (M) becomes your new (Z) and you repeat the procedure within (A) to (M). Otherwise (M) becomes your new (A), and you repeat the test with the last half of the remaining patches removed again. Repeat this until only one commit is left: that is the one causing the issue. It is somewhat of a brute force method, dumb and time consuming, but very effective.
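
The halving procedure described above can be sketched in a few lines of Python. This is only an illustration of the search itself: `bisect_commits` and the `is_bad` callback are hypothetical names, and in practice `is_bad` stands for "build the kernel with commits 0..i applied, play D3, report whether it stutters". With git, the `git bisect start` / `git bisect good` / `git bisect bad` commands drive the same search.

```python
def bisect_commits(commits, is_bad):
    """Find the first bad commit in an ordered list.

    commits: commits applied on top of the last known good state, oldest
             first; the state with all of them applied is known to be bad.
    is_bad:  hypothetical callback; is_bad(i) tests the state with
             commits[0..i] applied and returns True if the issue shows up.
    Returns (index of the first bad commit, number of tests performed).
    """
    lo, hi = 0, len(commits) - 1  # the culprit index is always within [lo, hi]
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(mid):
            hi = mid       # issue already present: culprit is at mid or earlier
        else:
            lo = mid + 1   # still good: culprit comes after mid
    return lo, tests


if __name__ == "__main__":
    # 12 placeholder commits, as in the case discussed in this thread.
    commits = [f"commit-{n}" for n in range(1, 13)]
    culprit = 7  # assumed position, for demonstration only
    index, tests = bisect_commits(commits, lambda i: i >= culprit)
    print(commits[index], "found after", tests, "tests")
```

For 12 commits the loop never needs more than 4 tests, which matches the 4 iterations Alfred mentioned.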

      BR, Manuel Krause

    2. @Eduardo
      I have sent you the first bisect patch. Please check it and enjoy your testing (compile and play D3) ;)

    3. @Eduardo & @Alfred:
      Any news on the bisecting front?
      BR, Manuel Krause

    4. We found which commit introduced the D3 issue, but my first workaround debug patch doesn't seem to work as expected. I'm still looking at it.

      BR Alfred

    5. @Alfred,

      BIG discovery thanks to "anonymous" post in CK forum.
      Tests with VRQ (latest You sent me):
      1. Diablo 3 + "ondemand" - does not stutter, FPS up to 60.
      2. Diablo 3 + "performance" - stutters like hell, FPS up to 15.
      3. Diablo 3 + "cool n quiet" disabled - does NOT stutter, FPS up to 90.

      Point 3 is a BIG surprise, because disabling Cool N Quiet means NO dynamic frequency scaling. So effectively it's about the same as the "performance" governor, except that it does not report to the OS that the CPU can scale frequency (as far as I understand those _PSS BIOS "things").

      Hope this helps.

      Br, Eduardo

  3. Hi Alfred,

    here are some comments:
    1. A question. Are you cheating? With vrq I have a top value in idle (kde, firefox running) between 0,00 and 0,10. With BFS I have 0,5..0,6 and with normal CFS a value of around 0,4. So I am really happy with this. Running on i7 with intel_pstate and powersave.
    2. Using your git tree. Getting following result with dmesg: "BFS enhancement patchset v4.8_0472_vrq0 by Alfred Chen". Is this correct or should it be v4.8.1?
    3. And now a big bug. Tested different CPU freq scaling drivers and governors. With ACPI and schedutil the computer dies instantly. (You can test it with "tee /sys/devices/system/cpu/cpu{0..7}/cpufreq/scaling_governor <<<schedutil", using the zsh here). If the schedutil is the default governor, the computer doesn't start at all.
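
Before switching governors as in point 3, it can help to see what each CPU currently uses and what it offers. The sketch below only reads the standard cpufreq sysfs files and changes nothing; `read_governors` is a name made up for this example, and the `sysfs_root` parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: (current_governor, [available_governors])}.

    Read-only helper. CPUs without a cpufreq directory are skipped, and the
    list of available governors is empty when the driver does not export it.
    """
    result = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(sysfs_root).glob(pattern)):
        cpu = gov_file.parent.parent.name  # e.g. "cpu0"
        avail_file = gov_file.with_name("scaling_available_governors")
        available = avail_file.read_text().split() if avail_file.exists() else []
        result[cpu] = (gov_file.read_text().strip(), available)
    return result


if __name__ == "__main__":
    for cpu, (current, available) in read_governors().items():
        print(f"{cpu}: {current} (available: {', '.join(available) or 'n/a'})")
```

If schedutil does not show up in the available list, the tee command from point 3 has nothing to switch to on that kernel.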

    Regards sysitos.

    1. @Mike
      1. If you have looked at the kernel code for this system load calculation, you may know that from a kernel designer's view these three numbers are totally useless, but they are still implemented for the system operator's benefit. I used to check them to see how many tasks are running, and as long as they are not obviously wrong, that's OK for me. So, no cheating here, but the calculation *may* differ from what BFS or CFS does; I haven't had time to check it. If this makes the numbers look lower than with BFS/CFS, it is definitely not intended.
      2. It's correct. That string is usually only for the whole kernel release; if a minor kernel release requires an update, I just bump the vrq release number: vrq1, vrq2, etc.
      3. It is a known issue. schedutil came out in 4.7(?) and benchmarks showed it is no better than the good old acpi governors. I know CK has cpufreq_trigger changes that may fix this, but at that time I wanted to wait for the code changes to settle down, and then more new bfs/muqss changes came out. So it is marked as low priority and would be in the bottom half of my todo list if I printed it ;)
      IMO, acpi governors are the best fallback.

      BR Alfred

    2. Hi Alfred,

      thanks for the explanation. I wasn't aware that the top values are useless. So are the powertop and i7z utilities, which show the idle times in C7 states, more useful?

      I was testing the different governors, and it seems that intel_pstate is the best one here (it supports the boost frequencies). So no worries about schedutil. But maybe you could disable it during make config?

      But anyway, thanks for your continuous work.

      Regards sysitos

    3. @Mike,

      sorry for jumping in :), here is my experience with this.
      Best for me (Intel i7 mobile quad + HT) is the ondemand governor: it does not keep very high frequencies when not needed, which obviously helps with power consumption, and it has supported boost frequencies for a long time. I play games, etc. with the performance governor.
      Alfred does not provide a config (or did he?), so You can disable it Yourself in the config when preparing the kernel. I have it disabled, coz I really don't need it and don't see a benefit yet.
      I use i7z to measure times, together with top to see which processes are actually doing what (but not the load values). The best easy tools :)

      br, Eduardo

  4. @Eduardo,

    thanks for your information.
    Yes, I know I can disable schedutil in the config, no problem so far. But it was a long way to find out that there is a conflict between schedutil and Alfred's modified BFQ scheduler :(
    I use the ondemand governor as the default in my config too. But with the intel_pstate driver (default for all new Intels?), only the two governors performance and powersave are available; learned this too ;) With the old ACPI driver, there are all these additional governors you can choose from. But it has no Hyperspeed or Boost, so I only get frequencies between 800 MHz and 2.8 GHz, while with intel_pstate it is 800 MHz to 3.50 GHz. Checked with "cpupower frequency-info" and monitored ("watched") by grepping /proc/cpuinfo.
    I used top, i7z, grep and even the powertop program to check the idle values and the power consumption with all the configs, settings and governors. But it doesn't look really deterministic or scientific to me :). Nearly the same results as if I listened to my not so quiet CPU fan :D
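
The "grepping /proc/cpuinfo" check mentioned above can also be done with a small script, which makes it easier to log the numbers over time. This is a sketch that assumes the x86 layout of /proc/cpuinfo, where each logical CPU has a "cpu MHz" line; `parse_cpu_mhz` is a hypothetical helper name.

```python
def parse_cpu_mhz(cpuinfo_text):
    """Extract the per-CPU clock (in MHz) from /proc/cpuinfo-style text.

    Assumes x86-style "cpu MHz : <value>" lines; other architectures may
    not provide this field, in which case the result is an empty list.
    """
    freqs = []
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "cpu MHz":
            freqs.append(float(value))
    return freqs


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        freqs = parse_cpu_mhz(f.read())
    if freqs:
        print(f"min {min(freqs):.0f} MHz, max {max(freqs):.0f} MHz")
    else:
        print("no 'cpu MHz' field found")
```

Note that these values are momentary samples, so a single reading says little; running it in a loop while the system is idle and under load gives a better picture of the range a driver actually uses.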

    But anyway, nice to know, that you are always learning new things about linux. And that you must sometimes forget things, that were true all the time before (e.g. top values).

    Thx sysitos.

  5. @Alfred:
    Do you plan to stay with the old BFS/VRQ, or will you switch to Con's MuQSS, on top of which your improvements would be highly appreciated too?!

    BR, Manuel Krause

    1. @Manuel
      I'd go neither way :)
      In fact, I tried to apply my commits on top of MuQSS but ran into trouble after a few commits. So here is the plan I am working on: I am on a -test branch which forked from an early VRQ commit, and I am adding a "global run queue"-ness feature on top of it.
      I have just finished the lock strategy update, data structure changes and principle rules modification. I am planning a first release soon to check whether it breaks something before building more features on it. You are welcome to join the testing.

      BR Alfred

    2. @Alfred:
      Hey, come on, join the current multiqueue hype ;-)
      and help improving Con's MuQSS! Some hours ago he also described possible fields of improvement and an invitation to people to contribute. And you have proved in uncountable months, to be intelligent and capable to collaborate with Con.

      Single queue is so... ...2015 ;-)

      Please read this with a portion of humour and irony,
      as you are a free man,
      BR Manuel Krause

    3. @all
      Sorry for the late reply. I was busy resolving issues on the test branch I talked about, and I'm still working on it :)
