Sunday, July 9, 2017

VRQ 0.96d release

VRQ 0.96d is released with the following changes

1. SMT-sensitive scheduling improvement, which reduces some migration overhead.
2. Fix a livepatch compilation issue.

This is a bug-fix and SMT-sensitive scheduling improvement release.

Enjoy VRQ 0.96d for the v4.12 kernel, :)

Code is available at
https://bitbucket.org/alfredchen/linux-gc/commits/branch/linux-4.12.y-vrq
and also at
https://github.com/cchalpha/linux-gc/commits/linux-4.12.y-vrq

An all-in-one patch is available too.

BR Alfred 

59 comments:

  1. @Alfred,

    compiled fine, will test soon on i7 @ work.

    Br, Eduardo

    Replies
    1. @Alfred,

      One day of usage on the i7 and 4 hrs of gaming on the Ryzen (Wine and native) work fine.
      I haven't tested performance yet, but my subjective feeling is that it's now better; using "ondemand" works well for all scenarios, even gaming.

      Br, Eduardo

  2. @Alfred
    One day of use on the Ryzen and i7 machines without problems. I tried some tests with memory allocation and the VRQ scheduler actually performs better than CFS, so my previous suspicion seems to be wrong. I will keep looking. Thanks for the updates.

    BR, Dzon.

  3. @all
    Thanks. So far, the feedback is positive. I am still working on reducing the overhead of SMT-sensitive scheduling. I will announce another release when it is done.

  4. Hi, I have one question: what is the best setting for CONFIG_HZ? I was never interested in this setting, but today I read that MuQSS recommends a tick rate of 100 for that patch, and I can't find any info about the recommended setting for VRQ.

    Replies
    1. I don't have a recommended HZ for VRQ. Just follow the general rule: high HZ for interactivity, low HZ where interactivity matters less (for example, servers).

    2. Thx for the quick reply. So, if I understand correctly (I use a laptop), is it better to set 100 than 1000 for better battery life? Or maybe this setting has nothing to do with battery? By the way, I noticed better battery life with VRQ on my i3 Skylake, so I use this patch on kernel 4.12 with pleasure!

    3. It depends on what you want to trade off. 100HZ should give longer battery life than 1000HZ, but you have to test how much longer that is (the CPU is not always the major part of power consumption) and consider whether you want to trade that extra run time for 1000HZ interactivity.

    4. @Krzysztof:
      Depending on your system timer's capabilities, you can also try non-decimal values to benefit from possible micro-optimizations (values being powers of 2). These were described on Con Kolivas' blog some months ago. He discarded them because too many people complained about errors compared with the decimal values allowed in mainline. I've tested many of them, and the lowest on my system without errors was 512HZ. I haven't noticed any negative impact compared with the interactivity-promoted 1000HZ.
      I'll attach a link to my little patch; I hope it works for you (you need to select 512 HZ!), and feel free to edit it downwards in powers of two (256, 128). Just replace the 512 value.
      https://pastebin.com/rdicHvvh

      Best regards,
      Manuel Krause

  5. Works better than MuQSS on a game server, thanks!

  6. Sorry for again posting an off-topic question -- now regarding the BFQ I/O scheduler:
    I'm a little confused about the current in-kernel status: is BFQ-sq (single-queue, the previously known one) also included, or only the newer mq- (multiqueue-) based one? If only the latter is in, does any of you know whether there is an extra patch for the former?

    Thank you in advance for any info on this, best regards,
    Manuel Krause

    Replies
    1. @Manuel,

      This question might be better asked in https://groups.google.com/forum/#!forum/bfq-iosched. Paolo is quite active there.

      Br, Eduardo

    2. @Eduardo:
      You're right, of course, and I'm reading the mentioned list, though without a subscription to write there.
      I just thought I'd ask you here before bothering Paolo himself. You understand? :-?

      BR, Manuel

    3. @Manuel,

      You may want to check this: https://groups.google.com/d/msg/bfq-iosched/2odL08qoPS0/qnqVLRwZAwAJ .
      I'm already building 4.12.2 + those patches :)

      Br, Eduardo

    4. @Eduardo:
      Thanks for the heads-up about these new fixes!
      I've decided to opt for the future, meaning the mainlined bfq-mq (now with 4.12.2), and not to bother begging for the old stack. Currently, I'm giving the new one a try.

      Maybe of benefit for now: thanks to post-factum/Oleksandr and his https://pf.natalenko.name/news/ for a new udev rule explicitly setting bfq as the default I/O scheduler:

      $ cat /etc/udev/rules.d/10-bfq.rules
      ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/scheduler}="bfq"

      For those who have not yet set MQ as the default in their kernel .config, these kernel command-line parameters are needed: "scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1".
      You can check with "cat /sys/block/sd*/queue/scheduler" whether it's set for your hard drives. (You may all know this already, I just added it for completeness.)

      Best regards to all of you on here,
      Manuel Krause

    5. @Manuel,

      Just a note: if one happens to have an NVMe drive and wants to use BFQ for whatever reason, like me, the path is a tad different: /sys/block/nvme0n1/queue/scheduler .

      Br, Eduardo

    6. @Eduardo:
      Thank you for adding this path info, and for encouraging me to test it.
      In my non-benchmarked everyday use I don't see a difference from the 4.11 kernel, which is a good sign IMO, including a well-working VRQ for 4.12, thanks to Alfred's work.

      BR, Manuel

  7. @Manuel,

    Personally, I have currently given up on extensive tests or use of "bfq-sq" or "bfq-mq", due to unexplained crashes. Once in a while I use bfq on my Ryzen system, but when it crashes due to bfq or other fancy stuff I do there (including VRQ), I switch back to deadline. The thing is that I don't even know which software is responsible for a crash, so I'm trying to isolate things :(
    Of course, bfq matters a lot for maintaining an interactive system on rotational hardware like my Ryzen box; I feel that every time I switch away from bfq.
    Once I've convinced myself that the other software on my system is reasonably stable, I'll be back to the bfq testing fun.

    As far as I know, everything is in the so-called "algodev" branch; you probably need to check the commits there to assemble a stable patch. I haven't tried that myself for some time now. In addition, as far as I understand, there is a bfq renaming effort going on, which adds a bit more confusion.

    If you figure everything out, please share your knowledge ;)

    Br, Eduardo

    P.S. But as for VRQ, it's all good now, everything seems to run nicely! Thanks Alfred!

    Replies
    1. I used to use bfq before 4.12; now I have switched to noop, as my two working machines are on SSDs (maybe deadline is a better choice, but I already have benchmark data collected with noop, so to keep things comparable I will try deadline after my work is done), and the mainlined bfq-mq needs too many other kernel options to be enabled. So let me know if you guys figure out a simple way to bring back bfq, :)

  8. Alfred,

    I have a report regarding a failed build on a 32-bit machine.

    Log: https://gist.githubusercontent.com/Pro-pra/6aed3990932d5b6906fc30e71a9ef8ee/raw/75d92c8e136b6e4b92b619c69d2e270a66fe8f65/err.txt

    Config: https://gist.githubusercontent.com/Pro-pra/6aed3990932d5b6906fc30e71a9ef8ee/raw/75d92c8e136b6e4b92b619c69d2e270a66fe8f65/DOTconfig-4.12

    I assume this happens due to a name collision: there are two distinct declarations with the same name, the "raw_spinlock_t sched_cpu_priodls_lock" variable and the function that acquires this spinlock, sched_cpu_priodls_lock().
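
    Roughly, the pattern and one possible way out look like this (a simplified sketch with hypothetical helper names, not the VRQ code and not my actual patch):

    #include <linux/spinlock.h>

    /* A file-scope object and a function cannot share one identifier in the
     * same translation unit; giving the helpers their own names avoids the
     * clash while keeping the lock itself as declared. */
    static DEFINE_RAW_SPINLOCK(sched_cpu_priodls_lock);

    static inline void sched_cpu_priodls_lock_acquire(void)
    {
            raw_spin_lock(&sched_cpu_priodls_lock);
    }

    static inline void sched_cpu_priodls_lock_release(void)
    {
            raw_spin_unlock(&sched_cpu_priodls_lock);
    }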

    Proposed patch from me: http://ix.io/ywT

    Could you please check this?

    Thanks.

    Replies
    1. @pf
      Thanks for reporting this. I must have changed the name some time ago and never verified it with a 32-bit kernel build. My only board running 32-bit is the Raspberry Pi 2, and it is still running a 4.6 kernel. It deserves some love, :)
      I'll include your patch in the next release.

  9. I know that this patch is for kernel 4.12, but I tried installing VRQ on a 4.13 rc kernel and I get only one error:

    Hunk #1 FAILED at 15.
    1 out of 1 hunk FAILED -- saving rejects to file kernel/sched/Makefile.rej

    cat Makefile.rej

    --- kernel/sched/Makefile
    +++ kernel/sched/Makefile
    @@ -15,13 +15,17 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
    CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
    endif

    -obj-y += core.o loadavg.o clock.o cputime.o
    -obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
    -obj-y += wait.o swait.o completion.o idle.o
    -obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o
    +ifdef CONFIG_SCHED_BFS
    +obj-y += bfs.o
    +else
    +obj-y += core.o idle_task.o fair.o rt.o deadline.o stop_task.o
    +obj-$(CONFIG_SMP) += cpudeadline.o topology.o
    obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
    -obj-$(CONFIG_SCHEDSTATS) += stats.o
    obj-$(CONFIG_SCHED_DEBUG) += debug.o
    obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
    +endif
    +obj-y += cputime.o wait.o swait.o completion.o idle.o clock.o loadavg.o
    +obj-$(CONFIG_SMP) += cpupri.o
    +obj-$(CONFIG_SCHEDSTATS) += stats.o
    obj-$(CONFIG_CPU_FREQ) += cpufreq.o
    obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o

    Replies
    1. Don't be in such a hurry for the next kernel release, :). There are always lots of scheduler changes in mainline from release to release, so we always need to pick up those sync-up changes and adapt them in VRQ. I usually start the sync-up work at rc6 or rc7; by that time, the scheduler changes are stable.

  10. Huh, just got this WARN_ON: https://gist.github.com/a4db6d8909f825d0691370eda354e2b1

    From here:

    ===
    1358 WARN_ONCE(rq != task_rq(p), "vrq: cpu[%d] take_task reside on %d.\n",
    1359 cpu, task_cpu(p));
    ===

    Something you know about?

    Replies
    1. (Just to note, after this warning everything got stuck; I recovered this message from netconsole.)

    2. @pf
      That's the deadliest case I wanted to avoid in the very early VRQ development: it means a task is in a run queue it should not belong to. So I put a WARN_ONCE there to check; once it happens, it lets the system keep running for a while, so there is a chance to capture the log, just like you did.
      But I haven't seen this for quite a long time. My first thought is that the new SMT code may break the old rule in some scenario I have missed. I will double-check it.
      I'd also like to ask whether the CPU in your machine has SMT capability and whether you have enabled CONFIG_SMT in your kernel config? Alternatively, "dmesg | grep -i vrq" will give the CPU topology setup info I need.
      And, is it easy to reproduce on your machine?

    3. @pf
      As I double-checked, there is also WARN_ONCE code in dequeue_task() and enqueue_task(). Would you please send me a full dmesg log, so I can grep for the messages I am looking for?
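
      For reference, those checks have the same shape as the take_task() one you quoted; roughly (a sketch from memory, not a verbatim quote of bfs.c):

      /* cpu is the local CPU, as in the take_task() snippet above */
      WARN_ONCE(rq != task_rq(p),
                "vrq: cpu[%d] dequeue_task reside on %d.\n",
                cpu, task_cpu(p));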

    4. ===
      [~]$ dmesg | grep -i vrq
      [ +0.002527] vrq: sched_cpu_affinity_chk_masks[0] smt 0x00000002
      [ +0.000003] vrq: sched_cpu_affinity_chk_masks[0] coregroup 0x0000000c
      [ +0.000008] vrq: sched_cpu_affinity_chk_masks[1] smt 0x00000001
      [ +0.000005] vrq: sched_cpu_affinity_chk_masks[1] coregroup 0x0000000c
      [ +0.000005] vrq: sched_cpu_affinity_chk_masks[2] smt 0x00000008
      [ +0.000006] vrq: sched_cpu_affinity_chk_masks[2] coregroup 0x00000003
      [ +0.000005] vrq: sched_cpu_affinity_chk_masks[3] smt 0x00000004
      [ +0.000005] vrq: sched_cpu_affinity_chk_masks[3] coregroup 0x00000003
      [ +0.016787] BFS enhancement patchset VRQ 0.96d by Alfred Chen.
      ===

      > And, is it easy to reproduce on your machine?

      No. It happened only once: I tried to search for something using the "the_silver_searcher" tool, and it triggered the crash.

      > Would you please send me a full dmesg log, so I can grep the log I am locking for.

      The current dmesg? Because I don't have a full dmesg left from the boot where it crashed (but I also haven't seen another warning; this is the only one).

    5. @pf
      Just an update. I still can't find any scenario that triggers the WARN_ONCE in take_task() without also triggering the WARN_ONCE in dequeue_task()/enqueue_task(), unless the task struct is corrupted while it is in the run queue.
      So please keep an eye on this issue and try to capture a full dmesg log next time.
      I also have a sync-up commit added to the next release which may be useful when investigating this issue.

    6. I keep netconsole running all the time with 4.12, so as of now that is the fullest dmesg I was able to get.

      I'm OK with applying all necessary commits to -pf to re-check. The issue is difficult to trigger, however; it has happened to me only once, unfortunately.

    7. ag utility (the_silver_searcher) is a fscking beast — it can trigger various issues here and there.

      Here is another stacktrace captured, again, only one WARN: https://gist.github.com/ff32664c82e0e5861fb7ed3d7aa18e67

    8. Alfred,

      I'm able to reproduce this crap in a QEMU VM with -smp 4,maxcpus=4,cores=2,threads=2,sockets=1 (just like my laptop). So 1) it is not a HW issue; 2) it is likely related to the SMT code (I wasn't able to reproduce it without SMT); and 3) I'll try to capture a vmcore.

    9. Here are some panics from VM:

      https://gist.github.com/03a2be514cbd813ebf5513922c506a92

      I've noticed that it is much harder to trigger the panic if something is working in the background (like dd if=/dev/zero of=/dev/null), especially if there are 2 to 4 dd's. If there is only one dd, or the VM is idle, the panic happens instantly.

      To trigger the panic I run "cd /etc; while true; do ag post-factum 2>/dev/null; done".

    10. Important note:

      in the stacktraces above I was able to trigger a panic not only at kernel/sched/bfs.c:1359, but also at kernel/workqueue.c:2041 and kernel/sched/bfs.c:3685, and even a double fault.

    11. Well, it is actually triggered without SMT as well, just with -smp 8, for instance. Same backtrace on panic.

    12. OK, I've managed to find out that launching ag with the --no-affinity option does not trigger the panic.

      Setting affinity in ag corresponds to the following code:

      https://github.com/ggreer/the_silver_searcher/blob/master/src/main.c#L155

      Maybe, VRQ has some f*ck-ups with setting affinity properly?

    13. …or it is a problem with locking in the migration code path. In ag, threads are started first, then affinity is set thread by thread, triggering migration, and BAAM, a panic occurs.

    14. OK, here is my reproducer: http://ix.io/yF3

      Compile it like this:

      gcc reproducer.c -o reproducer -lpthread -D_GNU_SOURCE

      Then launch it and let it spin a little bit. It crashes my VM in 2 seconds.
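
      In essence it does something like this (a rough sketch of the idea, not the exact file behind the link): start threads first, then set each thread's affinity afterwards, so the kernel keeps migrating tasks that are waking up.

      /* Sketch only: threads are created first, then pinned one by one, so
       * every pthread_setaffinity_np() call migrates a live, frequently
       * waking task. _GNU_SOURCE comes from the compile line above. */
      #include <pthread.h>
      #include <sched.h>
      #include <unistd.h>

      #define NR_THREADS 8

      static void *worker(void *arg)
      {
              (void)arg;
              for (;;)
                      usleep(1000);   /* sleep and wake over and over */
              return NULL;
      }

      int main(void)
      {
              pthread_t tid[NR_THREADS];
              long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
              unsigned int round = 0;
              int i;

              for (i = 0; i < NR_THREADS; i++)        /* start the threads first */
                      pthread_create(&tid[i], NULL, worker, NULL);

              for (;;) {                              /* then re-pin them, thread by thread */
                      for (i = 0; i < NR_THREADS; i++) {
                              cpu_set_t set;

                              CPU_ZERO(&set);
                              CPU_SET((i + round) % ncpus, &set);
                              pthread_setaffinity_np(tid[i], sizeof(set), &set);
                      }
                      round++;
                      usleep(10000);
              }
              return 0;
      }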

    15. @Oleksandr, @Alfred:
      Thank you for this in-depth testing. I hope Alfred is not hyperventilating atm.
      Since you mentioned kernel/workqueue.c above, I'd add one "WARNING" from here:
      https://pastebin.com/N7sVATwZ -- happening only on the first resume from TOI s2disk.
      As my kernel has the non-official TuxOnIce in it and the trace also references i915, it may be totally unrelated.
      I thought I'd better ask here before getting more grey hair.

      BR, Manuel Krause

    16. @pf
      Thanks for reproducing the issue.
      Based on the current info, the issue is triggered when calling the set_cpus_allowed_ptr() API; it must be some complicated race with other code, and it is not SMT related.
      The good news is that we can now reproduce it. I can try to reproduce it tonight, then try to isolate the possible code paths. Will keep you updated.

    17. Just a quick update: it is very easy to reproduce with @pf's reproducer, so that means no game play tonight, :)

    18. I'm very happy to be reliably crashing your system :D.

    19. If it helps, I've replaced the WARN_ON with BUG_ON and prepared a vmcore. Check it here:

      https://natalenko.name/myfiles/bfs_vrq_crash/

      The README file there describes everything.

    20. Updates:
      1. Tried other tools that use CPU affinity, like mprime, which ran fine for 20+ minutes, and manually checked the set_cpus_allowed_ptr() API using taskset; that works ok.
      2. Looking closer at the reproducer code, the wake-up of CPU-affinity tasks may cause the issue. That means it is related to the ttwu code path.
      3. Isolated other code paths like take_other_rq_task, policy balance and the new SMT scheduling; with those isolated it takes longer to trigger the issue. There may be other code in these isolated paths that raises the trigger rate; I will check later.
      4. The types of oops are random, but now the WARN_ONCE in dequeue/enqueue is triggered, which makes me believe ttwu selects a CPU that these CPU-affinity tasks should not run on.
      5. Added quick workaround code for these CPU-affinity tasks in the ttwu code path (a rough sketch of the idea is below). This helps with the oops, but the reproducer still freezes the system after running 2+ minutes, with no crash log printed.

      IMO, there may be more than one issue here. One is that ttwu selects a CPU the CPU-affinity tasks should not run on; the workaround code fixes that, but the root cause in the original ttwu code path should still be found, and finding it may help with other issues.
      The other issue is that the reproducer still freezes the system with no log. Well, one thing at a time, first come first served.
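
      The workaround is roughly of this shape (a sketch only, with a hypothetical helper name, not the actual patch): before queueing the woken task, make sure the selected CPU is still inside its affinity mask.

      #include <linux/cpumask.h>
      #include <linux/sched.h>

      /* Sketch of the idea (hypothetical helper, not the actual VRQ code):
       * never queue a woken task on a CPU outside its affinity mask; fall
       * back to an allowed CPU instead. */
      static inline int vrq_sanitize_wake_cpu(struct task_struct *p, int cpu)
      {
              if (cpumask_test_cpu(cpu, &p->cpus_allowed))
                      return cpu;
              return cpumask_any(&p->cpus_allowed);
      }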

    21. I can just confirm that indeed sometimes it hangs without any log message, and sometimes it goes crazy unwinding the kernel stack if panic-on-warning is set.

    22. Updates:
      After some testing, I believe there are double faults. Fortunately I think I have fixed one of them, so now the crash log is solid, like "vrq: cpu2 take task[2031] 0x08 reside on 3, curr[19], -1 running". Task 2031 is one of the reproducer threads: it only runs on CPU 3 and its affinity is 0x08, which both look fine, but it is in the run queue of cpu2 and the rq nr_running count is -1, which is terribly wrong.
      As this log only indicates the wrong situation, I'd need more debug builds to find out WTH the affinity task gets into the wrong run queue (and the WARN_ONCE in enqueue_task() is not triggered, crap).

    23. Also an update from me: I've encountered a complete freeze with v4.11 and MuQSS on an ag invocation, and currently suspect MuQSS is affected too, although I'm not able to reproduce it yet with my reproducer (maybe I need to spin it for more time).

    24. OK, I've reproduced the issue with MuQSS as well. Now I'm going to grab a proper stacktrace from it and a vmcore from a debug kernel and compare…

    25. @pf
      I think I may have figured it out. The reproducer has been running for 10+ minutes on my debug kernel; I will let it run overnight and hopefully still see a living system tomorrow. Time for bed.

    26. Thanks, Alfred.

      I've filed a similar bug report with Con. If you are interested, check it here:

      https://natalenko.name/myfiles/muqss_crash/

      and here:

      https://ck-hack.blogspot.cz/2017/07/electric-distraction.html?showComment=1501087358795#c102036580742833059

    27. It has been 6+ hours and the reproducer is still running, so the issue has been fixed in this debug build. Now I have to re-enable the isolated code paths and retest; if everything goes smoothly, there will be a bug-fix release before the weekend.

    28. Retested and verified as passing. Here is the fix for this CPU affinity issue:
      https://bitbucket.org/alfredchen/linux-gc/downloads/affinity_fix.patch
      You can try it before 0.96e is officially released.

  11. Hi all,
    chiming in just to warn you that there is definitely some conflict between VRQ and BFQ. I tried bfq and bfq-mq, and in two cases on both systems (Ryzen and i7) a defrag in a virtual Win10 machine hardlocked the host system and corrupted data. CFS with bfq seems to be working, and VRQ with mq-deadline too (almost 2 weeks on both machines with VRQ).
    I randomly hit one peculiar case (on the i7) where only the guest froze, and when I switched the I/O scheduler from bfq to mq-deadline the guest resumed operation. I was not able to find anything useful in the logs.

    BR,
    Dzon

    Replies
    1. Currently, the BFQ team/community is investigating one or more bugs, so it might be useful to check the recent active threads on https://groups.google.com/forum/?fromgroups=#!forum/bfq-iosched. And maybe it's just the old story that BFS/VRQ is likely to trigger such bugs earlier and more often.
      I've only applied the fixes from the thread Eduardo mentioned some posts above and have not experienced errors so far.

      BR, Manuel Krause

    2. @Manuel,
      thanks for the explanation and reference. It is probably as you suggest: there is just a higher chance of triggering an existing bug in combination with VRQ. I will keep an eye on the BFQ bug threads and, if I can think of a safe test environment, test the patches.

      BR,
      Dzon

  12. Thanks Alfred.
    The usual throughput tests are here:
    https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing

    VRQ throughput is slightly better than with CFS.
    Changing the timer value seems to have a rather low impact, with either VRQ or CFS.
    I've also run interbench, for those who understand its results.

    Pedro

  13. @all
    Thanks, all, for the testing/feedback/benchmarks on these two releases. This scheduler is more stable than the previous release. There will be no new release this week, but here is an update.
    The improvement of SMT-sensitive scheduling is ongoing; more testing is required to finalize a timeout value.
    During the investigation of pf's issue, I was inspired with an idea to cut enqueue/dequeue overhead, and I am now working on it.

    Replies
    1. Good luck, a good hand and a good mind to you for the ongoing steps!

      BR, Manuel Krause
