Wednesday, January 13, 2016

First BFS/VRQ patch for kernel v4.4

Here is the all-in-one VRQ patch for the latest Linux kernel v4.4.

What's new:
1) Sync up with upstream scheduler code changes.
2) Remove the original SMT_NICE code in BFS; something new is incoming.
3) Quick path for best_mask_cpu(), which improves performance when the workload is below 100% (see the sketch after this list).
4) Minor refinements.
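
To give an idea of the quick path in 3), here is a minimal sketch; the helper name __best_mask_cpu() and the exact shape are illustrative, not the literal VRQ code:

===
/*
 * Sketch only, not the literal VRQ code. The quick-path idea: when the
 * hinted CPU is already in the allowed mask (common while the system is
 * below 100% load), return it immediately and skip the locality-ordered
 * scan over the CPU masks.
 */
static inline int best_mask_cpu(const int cpu, cpumask_t *cpumask)
{
        /* Quick path: the preferred CPU is usable, no search needed. */
        if (cpumask_test_cpu(cpu, cpumask))
                return cpu;

        /* Slow path: full locality-ordered search; __best_mask_cpu()
         * stands in for that existing code here. */
        return __best_mask_cpu(cpu, cpumask);
}
===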

I'd like to wait for other patches (BFQ etc.) and do some commit merges before pushing the code to git. Meanwhile, and most importantly, I'd like to hear your feedback about this patch on v4.4 and see if any adjustments are needed.

Have fun with VRQ in this new kernel release and in 2016.

BR Alfred

Edit:
Thanks to pf for testing and reporting back. I have updated the code and changed the link to https://bitbucket.org/alfredchen/linux-gc/downloads/v4.4_vrq_1.patch

Heads-up:
Please be aware that the current VRQ may fail to reschedule in some rare cases, especially during system boot-up/reboot/shutdown and suspend/resume. I am looking back through the code changes to find what introduced the issue.

Updates:
It looks like there are two or three issues in the field that I'm hitting. One is a ~1 second boot-up delay shown in dmesg, and a fix is done. Another is a suspend/resume issue; I have bisected and found the commit, and the issue is not related to BFQ v7r10. The fixing code is ready but needs more time to verify, and then I'll check whether any other commits up to the latest reintroduce the suspend/resume issue. The third issue is being unable to shut down; hopefully the fix for the second issue also helps with this.

Another heads-up:
Remember the "unplugged io" issue in BFS? Mainline code changes also impact the fix code for this issue, so I have removed one condition check in the fix because it can never be true in the current version. Anyway, please re-check the "unplugged io" issue, as I can't reproduce it on my machines to verify it.

35 comments:

  1. BFQ prerelease (r10) is already available for 4.4.

    https://github.com/linusw/linux-bfq/tree/bfq-v7r10

    No sync with original BFSv467?

    ReplyDelete
    Replies
    1. Thanks for the info, @pf.
      I have tried to balance performance/interactivity using the caching timeout mechanism in VRQ, so there's no point in syncing up with the original BFS 0467, which tries to solve the issue in another way. You can check my previous post for 4.3 for detailed information.

      Delete
    2. @pf
      BTW, is the GitHub project you provided the official git of BFQ? I usually check for BFQ updates at http://algo.ing.unimo.it/people/paolo/disk_sched/sources.php

      BR Alfred

      Delete
    3. Not quite. AFAIK, that GitHub project is the playground of Linus Walleij of Linaro. They are interested in mainlining BFQ, so Paolo Valente, the BFQ developer, pushed preparation patches there. I.e., the project is unofficial, but the patches are official, although pre-release.

      I've started building 4.4-pf0 with BFQv7r10 and your BFS port.

      Delete
    4. Just to note:

      ===
      [ 605s] kernel/sched/bfs.c:5457:1: warning: 'rq_set_schedule' defined but not used [-Wunused-function]
      [ 605s] rq_set_schedule(int cpu, int sched)
      [ 605s] ^
      ===

      Delete
    5. And final error:

      ===
      [ 8443s] ERROR: "task_cputime_adjusted" [arch/x86/kvm/kvm.ko] undefined!
      [ 8443s] scripts/Makefile.modpost:91: recipe for target '__modpost' failed
      ===

      Delete
    6. I believe you need to do EXPORT_SYMBOL_GPL for task_cputime_adjusted, because such an export was added somewhere between 4.3 and 4.4.

      Delete
    7. Here is my attempt to fix it: https://github.com/pfactum/pf-kernel/commit/1ae91e2def519a1addedcabf86d1e21cf65b8918
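
      In essence the change just adds the GPL export next to the BFS definition of the function, so modules can link against it. Roughly (sketch only; the commit above is the authoritative version):

      ===
      /* kernel/sched/bfs.c -- sketch: 4.4 gained a modular user of this
       * function (arch/x86/kvm/kvm.ko via the Hyper-V code), so the BFS
       * variant needs the same export mainline carries. */
      void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
      {
              /* ... existing BFS implementation ... */
      }
      EXPORT_SYMBOL_GPL(task_cputime_adjusted);
      ===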

      Delete
    8. rq_set_schedule() is my new API. I'll clean it up because it hasn't been used yet.
      Thanks for the fix for exporting task_cputime_adjusted. I do have the KVM module enabled in my kernel config, but I didn't hit that final error; as the caller seems to be Hyper-V related, I have to double-check my kernel config. Sure, I'll include your fix in the next VRQ release.
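
      Just as an illustration (not necessarily how the cleanup will look), marking it __maybe_unused would keep the warning quiet until real callers land:

      ===
      /* Generic illustration only: annotate a not-yet-used static function
       * so gcc's -Wunused-function stays quiet; the return type and body
       * are placeholders here. */
      static void __maybe_unused rq_set_schedule(int cpu, int sched)
      {
              /* ... */
      }
      ===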

      BR Alfred

      Delete
    9. Booted OK.

      ===
      [~]$ uname -a
      Linux spock 4.4.0-pf0 #1 SMP PREEMPT Thu Jan 14 14:16:28 UTC 2016 x86_64 GNU/Linux
      [~]$ dmesg | grep Alfred
      [ +0.000002] BFS enhancement patchset v4.4_0466_1_vrq by Alfred Chen.
      ===

      64-bit Arch packages are here: https://build.opensuse.org/package/show/home:post-factum/linux-pf-testing

      Delete
    10. @pf
      Thanks for testing. I'd like to hear feedback on interactivity to help decide the default cache time-out value.

      Delete
    11. @Alfred:
      Regarding the BFQ -- you can also take the four v7r8 patches from the tar.gz at the bottom of this post: https://groups.google.com/forum/?fromgroups=#!topic/bfq-iosched/ye-RA9y-uBY

      4.4.0-vrq is up and running fine with BFQ v7r8 and tuxonice.
      I don't see differences in behaviour, which also means _no_ regression, so far.

      Thank you and BR,
      Manuel Krause

      Delete
    12. @Alfred, dunno, no interactivity issues so far. Looks good.

      Delete
  2. @Alfred:
    I'm continuing to test NORMAL_POLICY_CACHED_WAITTIME, now with this kernel: first at (4), to compare my last 4.3.3 against 4.4.0, and now at (3) on 4.4.0. For now I'm undecided whether it gets better or worse in either direction, interactivity or performance.

    I've now also re-tested the issue from my last posting in the previous blog thread. It's really difficult to reproduce. But I see that there is a problem with setting nice values and SCHED_* policies (schedtool). After a certain sequence of commands they show up in the gkrellm display as niced processes, although they aren't (no matter whether SCHED_IDLEPRIO or SCHED_NORMAL). Even as SCHED_NORMAL with schedtool -n -19, there is no performance/interactivity impact vs. -n 19. I consider this a design error.

    What I now want you to do is to thoroughly check your code for these paths. In my experience, the nice & policy paths are not 100% functional.

    BR Manuel Krause

    ReplyDelete
    Replies
    1. @Manuel
      Thanks for the feedback. As far as I remember, there have been no code changes to those code paths recently. Could you please detail the commands/log you used to trigger your issue, so I can reproduce/verify it on my side?

      BR Alfred

      Delete
    2. @Alfred:
      Unfortunately I had to go back to 4.3.3 with the last -vrq, due to unpredictable VT switching issues when rebooting or hibernating, introduced in the 4.4.0 i915 gfx driver. The issue described above is also present in your latest 4.3.3-related -vrq code. I haven't retried with your -vrq from before the major 0466-related changes, so far.

      I'll do my best to describe the commands, but as I don't know whether you have the same programs available, I'll keep it as generic as possible.

      * Start with two similar separate processes as SCHED_BATCH and nice 19 (lowest priority!), each running on its own core {here: 2 wcg subclients}
      * Add a process, maybe a firefox, with SCHED_NORMAL and nice 0, and drive its cpu0 load to at least 80% on one core (better with overlap onto cpu1), at least for the test duration {here: ff with flash live streaming}
      * Checks to be done: continuously observing top & stepwise with: schedtool `pidofproc "batch_test_processes"`
      * Try this row of commands:
      # schedtool -n -19 -D `pidofproc "batch_test_process0"`
      # schedtool -n -19 -D `pidofproc "batch_test_process1"`
      # schedtool -n 0 -D `pidofproc "batch_test_process1"`
      # schedtool -n 0 -D `pidofproc "batch_test_process0"`
      # schedtool -n 0 -N `pidofproc "batch_test_process0"`
      # schedtool -n 0 -N `pidofproc "batch_test_process1"`
      * Recheck with: # schedtool `pidofproc "batch_test_processes"` vs. top

      # schedtool -n -19 -N `pidofproc "batch_test_process1"`
      # schedtool -n -19 -N `pidofproc "batch_test_process0"`
      * Recheck with: # schedtool `pidofproc "batch_test_processes"` vs. top

      I also want to appeal to your own creativity regarding testing.
      It could be useful for reproducing this to conduct the testing on your old 2-core machine, to keep the hw difference as small as possible.

      BR, Manuel

      Delete
    3. @Manuel
      I have done a quick test using your reproduction steps, but everything seems to be normal. So what have you seen in the schedtool output vs. top; was anything mismatched?
      BR Alfred

      Delete
    4. @Alfred:
      The problem I see is that everything is a bit too "normal". ;-) I don't see the changes take effect. Although the processes are reported as NORMAL or IDLEPRIO in schedtool, and the nice value shows up in schedtool and top, there is no change in the effective CPU bandwidth usage of the changed processes vs. the main NORMAL process (ff). (I assume that schedtool and renice are working here, since the changes show up in top.)
      Also, when watching system/normal/idle CPU usage for each core in gkrellm (do you also use something like that?), the changed former BATCH processes don't show up as normal load, no matter their set nice level or scheduling policy. Only sometimes does this happen, and I'm absolutely not sure with which command sequence.
      Since yesterday I've been testing the former 4.3.3-vrq with the SCOST approach, and it shows the same behaviour.
      In my opinion, the BATCH/IDLEPRIO and nice +19 CPU hogs of wcg should yield more CPU bandwidth to NORMAL tasks like ff, so as not to allow frame stuttering during flash video playback within ff, but that's what occurs. And adjusting them to NORMAL and nice -19 should impact playback, which doesn't occur.

      BR Manuel

      Delete
    5. @Manuel
      I guess you are trying to find the edge of hw capacity under a certain usage, but failed to do so. For example, if FF only uses 40% CPU, then no matter what policy/nice level the background workload runs at (below FF's policy and nice level), FF won't go to 60%. The background workload will occupy the rest of the CPU time.

      To verify that task policy works, you can start an mpv playback with the IDLE policy. First, with no other workload, it plays smoothly. Then simply start a 100% compile workload; the compile jobs will take almost all CPU time and mpv will freeze. Stop the compile jobs and mpv will continue.
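
      If you prefer a synthetic check, a tiny test program along these lines (just an illustration, not something shipped with the patch) makes the effect easy to watch:

      ===
      /* idle_meter.c -- illustration only: shows how much CPU an IDLE-policy
       * task gets. Run it, then start a full "make -j" build: the loops/s
       * figure should collapse while the NORMAL-policy jobs run, and recover
       * when they stop. Build with: gcc -O2 -o idle_meter idle_meter.c */
      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>
      #include <time.h>

      int main(void)
      {
              struct sched_param sp = { .sched_priority = 0 };
              unsigned long loops = 0;
              time_t last = time(NULL);

              /* Put ourselves under the IDLE policy (what schedtool -D does). */
              if (sched_setscheduler(0, SCHED_IDLE, &sp))
                      perror("sched_setscheduler");

              for (;;) {   /* stop with Ctrl-C */
                      loops++;
                      if (time(NULL) != last) {
                              printf("%lu loops/s\n", loops);
                              loops = 0;
                              last = time(NULL);
                      }
              }
      }
      ===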

      BR Alfred

      Delete
    6. @Alfred:
      Thank you for your explanation. This, and the suggested test (which reproduced your results), made it very clear that I had misunderstood some things about nice and policy settings before.
      Though using Linux for many many years now, I'm still learning. Hopefully ;-)

      Maybe I'm still confused by how the newer -vrq delegates processes to 2 of 2 cores.
      Btw., I'm still continuing the CACHED_WAITTIME tests, but that's worth another posting.

      BR Manuel Krause

      Delete
  3. I'm experiencing larger input latency with this version when playing games in Wine. Playing games in Wine involves a lot of context switches per frame, about 400 every 4 ms. I have to set 'NORMAL_POLICY_CACHED_WAITTIME (0)' to get acceptable latency in this version; the previous versions worked fine with the default value for NORMAL_POLICY_CACHED_WAITTIME.

    ReplyDelete
    Replies
    1. @Anonymous
      Thanks for testing. Which previous version do you mean? As the default value for NORMAL_POLICY_CACHED_WAITTIME has been changed across the releases (currently 6 ms), I want to find out what really causes your input latency.

      BR Alfred

      Delete
  4. your current linux-4.3.y-vrq branch

    ReplyDelete
    Replies
    1. Interesting; besides the 4.4 sync-up changes, there are just three newly added commits compared to the linux-4.3.y-vrq branch.
      If you want to help find out which one introduces the input latency issue, please send me an email. I'll provide you with some debug patches to track it down.

      BR Alfred

      Delete
  5. @Alfred:
    Regarding NORMAL_POLICY_CACHED_WAITTIME and your sentence in the previous blog thread, "In previously, the cache time out is about 1/16 ms or some like that when using SCOST" -- how should we understand that? Does setting NORMAL_POLICY_CACHED_WAITTIME to 1 mean one ms now? How is it comparable to the 1/16 ms of the SCOST approach? Shouldn't there be finer granularity for WAITTIME below 1, vs. that 1/16 ms?

    I've tested NORMAL_POLICY_CACHED_WAITTIME with 8, 6, 5, 4, 3, 1, 0; only 2 is missing from the suggested row now. What I've seen is that lowering the value is better for interactivity with the desktop + video playback and doesn't harm everyday throughput. Setting 8 really hurts interactivity! I don't know what 1000 ms might lead to. When moving the setting further and further towards (0), I've observed that processes switch more easily to the second of my two cores, tending to equalize cpu0 vs. cpu1. So this setting also affects the CPU affinity of processes.

    Best regards,
    Manuel Krause

    ReplyDelete
    Replies
    1. @Manuel
      Back when SCOST still existed, I built a debug kernel to profile how quickly cached tasks get switched back in. It showed that about 80% of cached tasks were switched in again within 1/16 or 1/8 ms (I don't remember exactly). That's the story you may want to know about.
      Now we are using the caching time-out to balance performance/interactivity, so please *FORGET* SCOST and the data about it; it should not be taken as the reference setting. At the current stage, I want the NORMAL policy cache time-out to be fixed at the ms level. Earlier last week I still had the idea of auto-adjusting the value by a formula, but after careful thought, it may be too complicated to achieve its goal. The ms level is not good enough for desktop apps and RT tasks, but let it be for this stage; it takes time to finalize these settings. In this release there are higher-priority jobs to be done.
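
      Roughly, the idea behind the time-out can be sketched like this (the field and helper names here are made up for illustration, not the real VRQ code):

      ===
      /* Illustration only, not the real VRQ code: a task that was taken off
       * a CPU less than the time-out ago is still treated as cache hot and
       * preferred there when a CPU is picked. The time-out is in ms. */
      #define NORMAL_POLICY_CACHED_WAITTIME 6         /* ms */

      static inline bool task_is_cached(const struct task_struct *p, u64 now_ns)
      {
              /* last_ran_ns would be stamped when the task left the CPU. */
              return now_ns - p->last_ran_ns <
                     (u64)NORMAL_POLICY_CACHED_WAITTIME * NSEC_PER_MSEC;
      }
      ===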

      For your tests, I'd suggest setting your wcg clients to the IDLE policy; this way, the normal tasks responsible for interactivity can get more CPU time when they need it. And try 6, 5, 4, 3, 2, 1 to find out which values are acceptable for you.

      BR Alfred

      Delete
    2. Mmmh. It took me a while to find a way to start the boinc-client and the 2 child wcg processes at SCHED_IDLEPRIO and nice +19, persistently as the new default.

      I must now admit that it makes an effective difference whether they're (a) started with these settings from the very beginning, vs. (b) the old SCHED_BATCH and nice +19 set by the default start script, vs. (c) adjusted to IDLEPRIO later via schedtool manually, once up and running.

      (a) changes things in that NORMAL processes go to cpu1 more easily, plus affinity equalization;
      (b) changes things in that NORMAL processes stick/attach more to cpu0;
      (c) gives results like (b) -- so it doesn't schedule correctly IMO.

      Why is SCHED_BATCH so far away from SCHED_IDLEPRIO? Side note: I don't want to retest all of these NORMAL_POLICY_CACHED_WAITTIME kernels again; only if needed.

      This was recorded with the 4.3.3-vrq patches from your repository + BFQ v7r8 + TuxOnIce. If there are important fixes from 4.4-vrq that are also suitable for 4.3.3, please let me know (email)!

      BR Manuel Krause

      Delete
    3. ...with current NORMAL_POLICY_CACHED_WAITTIME (2)
      BR Manuel Krause

      Delete
    4. @Alfred:
      I've continued testing during the last week regarding NORMAL_POLICY_CACHED_WAITTIME, but my results are only based on subjective impressions, and only obtained on the "older" 4.3.3 & 4.3.4 kernels with the currently committed Bitbucket patches (at the time of writing) + BFQ v7r8 + TuxOnIce.

      As you don't seem to like the setting 0, I'd suggest 1 or at most 2 for my system.
      The lower the value -- the better the interactivity.

      On what may be somewhat throughput-related -- correct me if I'm wrong on this -- running the boinc-client and the wcg clients as SCHED_IDLEPRIO (vs. SCHED_BATCH all the time before) didn't affect their overall rate of processed workunits. As I've already indicated before, this mainly changes the CPU affinity switching behaviour (better with wcg as IDLEPRIO).

      And -- in my experience of everyday use, changing NORMAL_POLICY_CACHED_WAITTIME towards 0 brings much more value for interactivity than it costs in throughput (if anything).

      I'm looking forward to your improved revision,
      please, please, keep up your good work!

      BR Manuel Krause

      Delete
  6. Regarding your "Heads-up" edit of the top message mentioning rare reschedule failures: please also be aware that there are more possible issues with 4.4.0:

    Especially users of the BFQ I/O scheduler v7r10 should have a look at the latest posts on https://groups.google.com/forum/?fromgroups=#!forum/bfq-iosched
    and at the bottom of this here: https://groups.google.com/forum/?fromgroups=#!topic/bfq-iosched/9N1QL9E-KH4

    Best regards,
    Manuel Krause

    ReplyDelete
    Replies
    1. @Manuel
      Thanks for the info. I'll disable bfq to isolate the issue for debugging.

      Delete
    2. @Alfred:
      Your last two "Heads-up" edits really give hope that some random failures (without clear evidence) of BFS/VRQ will get fixed now. Thank you very much for your efforts! Don't forget to put a patch online. ;-)

      Unfortunately I wasn't affected by the "unplugged io issue" and so cannot contribute to this.

      BR Manuel

      Delete
    3. Btw., Paolo Valente released a bug-fixed new version of the BFQ I/O patches, now as v7r11, this afternoon: http://algo.ing.unimo.it/people/paolo/disk_sched/patches/4.4.0-v7r11/

      Announcement here: https://groups.google.com/forum/?fromgroups=#!topic/bfq-iosched/zljl8ulI1k4

      BFQ I/O patches v7r10 are withdrawn due to their issues and should NOT be used!

      BR Manuel Krause

      Delete
  7. Hi Alfred,

    thanks for your continued work on BFS & VRQ!

    There appears to be a scheduler issue with SCHED_ISO tasks. My current "best" test for this is GRID Autosport's built-in benchmark mode.

    When running it with default options, the game experience is buttery-smooth (well, as smooth as it can be with mid/high settings on a GTX 760 @1920x1080 - min 53 fps, max 76 fps, average around 63 fps). This wasn't the case with the 4.3 kernel - so you significantly improved BFS-VRQ from 4.3 to 4.4!

    With schedtool -I -e, the max is about the same at 73 fps, the min however is 39 fps and the average around 56 fps. Now you could say that 39 fps isn't bad, granted - however there are several total "stuttering" occurrences where the whole screen content stands still and then continues. This happens at least 2-3 times during the benchmark, totally out of the blue and randomly - and, as you can imagine, this is pretty deadly during long-term races (especially "Endurance"). I've observed similar behavior when running mpv via schedtool -I -e with vapoursynth scripts and 60 fps movies on the 4.3 kernel.

    Caveat: (especially) chromium or other programs accessing the gpu need to be closed, otherwise the stuttering becomes even more frequent (3+ times during the benchmark and for longer stretches - sometimes even several seconds at once).

    So ... is this more of a contention issue (throughput) or a (re)scheduling issue?

    Anyway - hope this gets you on the right track to track this down and abolish it.

    BFS is really needed, since even with a tweaked CFS scheduler the experience with BFS is simply superior (high i/o, high load, general work with the desktop, etc. etc.).

    Thanks!

    ReplyDelete
    Replies
    1. @kernelOfTruth
      Thanks for providing your test results. I'm currently working on stability issues. As far as I can see, there will be caching time-out setting changes for all task policies, so please re-test when the new release comes out. Meanwhile, if you have time, you could go back to the v4.3.1-vrq tag in my git repository, or even an earlier version like 4.2 etc., and test with the ISO policy; that could be a useful reference when testing the current implementation.

      BR Alfred

      Delete