Tuesday, January 17, 2017

VRQ 0.92 release

VRQ 0.92 is released with the following changes:

1. remove printk in migrate_tasks()
2. vrq: refine normalize_rt_tasks()
3. vrq: Optimize ffb usage in skiplist_random_level()
4. vrq: introduce cputime.c
5. vrq: remove unused sched_domain_level
6. vrq: remove rq->timekeep_clock

Most are code cleanups and small optimizations. The major change is the introduction of mainline cputime.c, which helps bring the VRQ scheduler's main code size under 7k LOC and reduces the effort of syncing with mainline kernel scheduler code from release to release.
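
For readers wondering what "ffb" (find first bit) has to do with a skiplist: one common way to pick a node's level is to take the index of the lowest set bit of a random word, so each level is half as likely as the one below it. The following is only a Python sketch of that general technique, not the C code in VRQ. `MAX_LEVEL`, `first_set_bit` and `random_level` are hypothetical names chosen for illustration.

```python
import random

MAX_LEVEL = 8  # hypothetical number of skiplist levels, not taken from VRQ


def first_set_bit(x):
    """Index of the lowest set bit of a positive integer (0-based)."""
    return (x & -x).bit_length() - 1


def random_level(rand_bits, max_level=MAX_LEVEL):
    """Map random bits to a level in [0, max_level - 1].

    Only the low (max_level - 1) bits are used. A sentinel bit is set at
    position max_level - 1 so the result is capped when all used bits are 0.
    """
    mask = (1 << (max_level - 1)) - 1
    return first_set_bit((rand_bits & mask) | (1 << (max_level - 1)))


if __name__ == "__main__":
    counts = [0] * MAX_LEVEL
    for _ in range(100000):
        counts[random_level(random.getrandbits(32))] += 1
    # Each level should show up roughly half as often as the previous one.
    print(counts)
```

With uniformly random input, level 0 comes up about half the time, level 1 about a quarter, and so on, which is the distribution a skiplist wants without needing a loop of coin flips.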

Enjoy VRQ 0.92, :)

Code is available at
https://bitbucket.org/alfredchen/linux-gc/commits/branch/linux-4.9.y-vrq
and also
https://github.com/cchalpha/linux-gc/commits/linux-4.9.y-vrq

All-in-one patch is available too.

BR Alfred 

38 comments:

  1. @Alfred:
    VRQ 0.92 works as well here as its predecessor. Not worse, but also not better.
    The compile time load balancing is still an issue (make -j2 on dualcore).

    BR, Manuel Krause

  2. Hey Alfred; x64 built fine, i686 UP failed though:
    CC kernel/sched/cputime.o
    kernel/sched/cputime.c: In function ‘read_sum_exec_runtime’:
    kernel/sched/cputime.c:319:18: error: storage size of ‘rf’ isn’t known
    struct rq_flags rf;
    ^~
    kernel/sched/cputime.c:322:7: error: implicit declaration of function ‘task_rq_lock’ [-Werror=implicit-function-declaration]
    rq = task_rq_lock(t, &rf);
    ^~~~~~~~~~~~
    kernel/sched/cputime.c:324:2: error: implicit declaration of function ‘task_rq_unlock’ [-Werror=implicit-function-declaration]
    task_rq_unlock(rq, t, &rf);
    ^~~~~~~~~~~~~~
    kernel/sched/cputime.c:319:18: warning: unused variable ‘rf’ [-Wunused-variable]
    struct rq_flags rf;
    ^~
    cc1: some warnings being treated as errors
    make[2]: *** [scripts/Makefile.build:293: kernel/sched/cputime.o] Error 1
    make[1]: *** [scripts/Makefile.build:544: kernel/sched] Error 2
    make: *** [Makefile:988: kernel] Error 2

    Replies
    1. @jwh7
      Thanks for reporting. Will be fixed in next release.

  3. Good job.
    Less latency than MUQSS.
    Almost half. :)
    Not bad.

    Replies
    1. ^Let me add: Thanks to you Alfred and Manuel for recommending.

    2. @Anonymous:
      Thank you for informing us about your experiences.

      After using the skiplist-VRQ for some weeks now, I'm really convinced of its superiority vs. MuQSS with my system and usage pattern. VRQ's advanced interactivity doesn't lead to performance or throughput drops for me -- that's the best reason to favour it over MuQSS imo.

      @Alfred: Keep up your good development work and thank you!

      BR,
      Manuel Krause

    3. ^ After some testing it appears latency is close to half of MUQSS but MUQSS has way better multi-core utilization/load-balancing across cores.

    4. @Anonymous:
      The load balancing issue is known to Alfred and he is surely working on it atm.
      Let's give him the needed time.

      BR, Manuel Krause

    5. @Manuel and @Anonymous
      Thanks for sharing your experiences.
      One thing I need to point out is that "task policy fairness" is a kind of balancing issue. For example, with 2 idle-priority tasks running in the background on a two-core system, start 2 normal tasks (e.g., a kernel compile). Ideally, the normal tasks should take all (or most) of the CPU time and suppress the idle-priority tasks. But the current VRQ fails to do so in this case.
      I have already worked out a patch to fix it, but I am afraid that it may introduce some overhead (I haven't tested it yet). I will test it further and release a debug patch next week. Then I will work on a second solution after my holiday.

    6. @Alfred, blogspot ate my post, please bring it back :)

      @Anonymous, since You say VRQ has half the latency, can You please confirm that the kernel config is the same for both kernels tested (HZ value, preemption model, ...)?

      Br, Eduardo

    7. @Alfred:
      Let me estimate: The overhead for "task policy fairness" load balancing in upcoming VRQ would eat up either performance or interactivity, so that we'd finally land at mostly equal values versus MuQSS? Wouldn't surprise me at all.
      My wish is, that you keep most weight on interactivity, to make a difference vs. MuQSS.

      BR, Manuel Krause

    8. The eaten post by Eduardo is below:
      Anonymous has left a new comment on your post "VRQ 0.92 release":

      Hi,

      I have been using 092 since it was available. I wanted to have more testing time to report the results.
      For me, at least on a Skylake laptop, the kernel behaves very well and battery life is better than with MuQSS, which is the main reason (on a laptop) why I'm using it.

      All seems to be very good, interactivity is good, no crashes, etc. except... cpufreq on intel seems to be broken (again?) in this version on skylake. It's the same stuck frequency problem for me again, this time it was stuck at 1.2GHz, nothing, except reboot, is helping. Setting performance, or specific HZ or anything I did - no results, had to reboot.
      On the other hand, to my big surprise intel_pstate started to behave, I'm using it + VRQ right now and all seems to be quite good. Will test it more, maybe finally that is fixed.

      On Phenom all is fine, no complaints at all, but I'm not doing performance tests anymore.
      If there are substantial performance improvements, Alfred, let me know and I'll consider testing a couple of kernels with a couple of games again.

      Thanks for Your work and best regards
      Eduardo

    9. @Eduardo
      What's the lowest CPU frequency of the Skylake CPU? It should be 800MHz in previous generations. If it is stuck in the middle of its available frequency range, it is most likely not the scheduler's fault, as VRQ (like the original BFS) just plugs two scheduler export APIs into the cpufreq code to report whether the CPU is scaling or not.

      But in any case, this only happens with this VRQ release, on the same kernel base. You can try to find out which commit introduced the issue by applying the commits one by one on top of VRQ 0.91, or use git bisect to help you.

    10. @Manuel
      It's too early to make any estimate. Besides the current solution, there is still at least one more option on the list. But anyway, I will keep in mind to keep VRQ simple and efficient.

    11. ^^ Yes, same kernel config, VRQ has roughly only 55% of MUQSS latency. Although multi-core utilization suffers in comparison.


    12. ^ Regarding latency you can measure it yourself:

      Get rt-tests: https://www.kernel.org/pub/linux/utils/rt-tests/

      There's a tool: cyclictest.

      I use mostly: cyclictest --smp -p95 -m -N

      http://man.cx/cyclictest(8)
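
To turn two cyclictest runs into a single comparison figure, something like the Python sketch below could be used. It assumes the default per-thread summary lines that cyclictest prints (`T: ... Min: ... Act: ... Avg: ... Max: ...`). The sample text is made up purely to exercise the parser and is not a measurement from any kernel. With `-N` the values would be in nanoseconds.

```python
import re

# Matches the per-thread summary line printed by cyclictest.
LINE_RE = re.compile(
    r"T:\s*(?P<thread>\d+).*?Min:\s*(?P<min>\d+)\s+Act:\s*(?P<act>\d+)"
    r"\s+Avg:\s*(?P<avg>\d+)\s+Max:\s*(?P<max>\d+)"
)


def parse_cyclictest(text):
    """Return one dict of integer fields per thread line found in text."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            rows.append({key: int(val) for key, val in match.groupdict().items()})
    return rows


def mean_avg_latency(rows):
    """Mean of the per-thread 'Avg' column."""
    return sum(row["avg"] for row in rows) / len(rows)


def latency_ratio(text_a, text_b):
    """Average latency of run A divided by that of run B."""
    return mean_avg_latency(parse_cyclictest(text_a)) / mean_avg_latency(
        parse_cyclictest(text_b)
    )


# Synthetic sample output (hypothetical numbers, for illustration only).
SAMPLE_A = """\
T: 0 ( 1001) P:95 I:1000 C:  10000 Min:   1000 Act:   3000 Avg:   3000 Max:   9000
T: 1 ( 1002) P:95 I:1000 C:  10000 Min:   1000 Act:   3500 Avg:   3000 Max:   8000
"""
SAMPLE_B = """\
T: 0 ( 2001) P:95 I:1000 C:  10000 Min:   2000 Act:   4000 Avg:   4000 Max:  15000
T: 1 ( 2002) P:95 I:1000 C:  10000 Min:   2000 Act:   4500 Avg:   4000 Max:  14000
"""

if __name__ == "__main__":
    print(latency_ratio(SAMPLE_A, SAMPLE_B))
```

Both runs should use the same cyclictest options, the same kernel config and a comparable background load, otherwise the ratio says little.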

  4. @Alfred,

    I have i7-6700HQ CPU. The thing is that "stuck frequency" is not visible when I boot the computer up and the problem does not show up right away. It just starts at some point after a boot, not right away, but to "unstuck" the CPU I have to reboot, that's why it's strange enough. Let's see what happens with pstate, so far it's 3 days and all is fine.

    As for the Phenom (running the same kernel as on Skylake), I have discovered a problem. Not a performance problem, I believe, but the total usage values observed in "top" do not correspond to the sum of the individual tasks' CPU usage. In the situation I encountered, the overall system usage showed half of the cores busy (which they were, I believe), but when I checked the tasks, there was only 20% usage on one CPU. It might be related to adopting the mainline task accounting. I will check it further when I encounter it again. Is there anything specific to look for?

    Br, Eduardo

    Replies
    1. @Alfred,

      this is an example of the miscalculation, if this helps:
      top - 23:48:56 up 1:31, 1 user, load average: 1,14, 1,23, 1,34
      Tasks: 266 total, 1 running, 265 sleeping, 0 stopped, 0 zombie
      %Cpu0 : 23,1 us, 5,6 sy, 0,0 ni, 70,4 id, 0,0 wa, 0,9 hi, 0,0 si, 0,0 st
      %Cpu1 : 25,0 us, 6,5 sy, 0,0 ni, 67,6 id, 0,0 wa, 0,9 hi, 0,0 si, 0,0 st
      %Cpu2 : 38,2 us, 1,8 sy, 0,0 ni, 60,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
      %Cpu3 : 34,8 us, 2,7 sy, 0,0 ni, 61,6 id, 0,0 wa, 0,9 hi, 0,0 si, 0,0 st
      KiB Mem : 8174980 total, 3961716 free, 1536248 used, 2677016 buff/cache
      KiB Swap: 10239996 total, 10239996 free, 0 used. 5721908 avail Mem

      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      7716 eduardo 1 0 2197788 553052 125136 S 17,8 6,8 2:44.83 chromium-b+
      1 root 1 0 185448 5980 3872 S 0,0 0,1 0:00.81 systemd
      2 root 1 0 0 0 0 S 0,0 0,0 0:00.00 kthreadd
      3 root 1 0 0 0 0 S 0,0 0,0 0:00.00 ksoftirqd/0
      7 root 1 0 0 0 0 S 0,0 0,0 0:00.01 rcu_preempt

      Br, Eduardo
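
The mismatch in the snapshot above can be put into numbers with a short Python sketch. It sums the non-idle share of each `%Cpu` line and compares it with the `%CPU` column of the listed tasks. It assumes top was showing per-task `%CPU` relative to a single CPU, and it only sees the five tasks that were pasted, so tasks further down the list are not counted.

```python
import re

# Figures copied from the top snapshot above (comma is the decimal separator).
CPU_LINES = """\
%Cpu0 : 23,1 us, 5,6 sy, 0,0 ni, 70,4 id, 0,0 wa, 0,9 hi, 0,0 si, 0,0 st
%Cpu1 : 25,0 us, 6,5 sy, 0,0 ni, 67,6 id, 0,0 wa, 0,9 hi, 0,0 si, 0,0 st
%Cpu2 : 38,2 us, 1,8 sy, 0,0 ni, 60,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu3 : 34,8 us, 2,7 sy, 0,0 ni, 61,6 id, 0,0 wa, 0,9 hi, 0,0 si, 0,0 st
"""
TASK_LINES = """\
7716 eduardo 1 0 2197788 553052 125136 S 17,8 6,8 2:44.83 chromium-b+
1 root 1 0 185448 5980 3872 S 0,0 0,1 0:00.81 systemd
2 root 1 0 0 0 0 S 0,0 0,0 0:00.00 kthreadd
3 root 1 0 0 0 0 S 0,0 0,0 0:00.00 ksoftirqd/0
7 root 1 0 0 0 0 S 0,0 0,0 0:00.01 rcu_preempt
"""


def to_float(text):
    """Parse a number that uses a comma as decimal separator."""
    return float(text.replace(",", "."))


def total_busy(cpu_lines):
    """Sum of (100 - idle) over all %Cpu lines, in percent of one CPU."""
    busy = 0.0
    for line in cpu_lines.splitlines():
        match = re.search(r"(\d+[.,]\d+)\s+id", line)
        if match:
            busy += 100.0 - to_float(match.group(1))
    return busy


def visible_task_cpu(task_lines):
    """Sum of the %CPU column (9th field) over the listed task rows."""
    return sum(to_float(row.split()[8]) for row in task_lines.splitlines() if row.strip())


if __name__ == "__main__":
    busy = round(total_busy(CPU_LINES), 1)
    tasks = round(visible_task_cpu(TASK_LINES), 1)
    print("busy across CPUs:", busy, "| listed tasks:", tasks, "| gap:", round(busy - tasks, 1))
```

For this snapshot the four CPUs add up to about 140% busy while the listed tasks account for under 18%, which is the gap being reported.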

    2. I see the error above: "systemd" ;)

  5. Hi, I've found some time to run my usual throughput benchmarks of VRQ0.92. They are here :
    https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing

    I've put some colors to make the results more readable (hopefully).
    The reference kernel is the one in the first column. Depending on the realtime difference between the tested kernel and the reference kernel, the colors are:
    - blue if difference is within 'realtime of reference kernel +/- maximum standard deviation'
    - green if difference is lower than 'realtime of reference kernel - maximum standard deviation'
    - red if difference is higher than 'realtime of reference kernel + maximum standard deviation'
    The overall best and worst results are also marked, if they are not within +/- the standard deviation.
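
Read literally, the coloring rule above amounts to a three-way comparison. Here is a minimal Python sketch of one reading of it. The function name `classify` and the example numbers are hypothetical and not taken from the spreadsheet.

```python
def classify(tested_realtime, reference_realtime, max_stddev):
    """Color a result by comparing it with the reference +/- max std dev.

    Lower realtime means the benchmark finished faster.
    """
    if tested_realtime < reference_realtime - max_stddev:
        return "green"  # faster than the reference, beyond the noise band
    if tested_realtime > reference_realtime + max_stddev:
        return "red"  # slower than the reference, beyond the noise band
    return "blue"  # within the noise band


if __name__ == "__main__":
    # Hypothetical values in seconds, for illustration only.
    for tested in (95.0, 100.5, 103.0):
        print(tested, classify(tested, reference_realtime=100.0, max_stddev=1.0))
```

Using the maximum standard deviation as the band is the conservative choice: a result only gets a green or red cell when it differs by more than the noisiest run.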

    This time I used intel_pstate, and it seems you fixed the performance issue with this driver. The results are good except for sysbench oltp. I didn't follow the development of VRQ lately so you might be aware of that issue already.


    Pedro

    Replies
    1. The poor results of sysbench oltp are also reproducible with acpi-cpufreq+ondemand.

      Pedro

    2. @Pedro
      Thanks for the wonderfully readable benchmark report.
      I have compared the sysbench results for 4.8 and 4.9 in your report, and it looks like a regression, unless it was introduced by the recent policy-related code changes.
      So I'd like to ask you to run the benchmark with the BATCH policy (using schedtool -B -e xxxx), as in VRQ it's kind of like the int=0 mode in BFS/MuQSS. Normal policy tasks in VRQ are tuned for better interactivity, which may cost throughput.
      But considering the gap between 4.8 VRQ0.89f and 4.9 VRQ0.92, it is probably a real regression for sysbench. I'll find some time to double-check it.

    3. @Alfred
      Thanks for the input.
      I ran the sysbench oltp benchmark with 'schedtool -B -e' and got the same bad results.
      I also ran it on the stock Archlinux 4.8.13 kernel and got bad results again (between 94s for 1 thread and 20s for 40 threads).

      I see that the 'mariadb' package has been updated several times in the Archlinux repo since the last time I ran sysbench oltp on MuQSS and CFS, about a month ago. Sysbench oltp uses it.
      So I tried to upgrade the test database, and even reinstall mariadb, but the results are still bad.

      So I believe it is in fact a regression in mariadb and not in VRQ.
      I'll dig further when I have time.
      I should have checked that in the first place. Sorry for the noise.

      Pedro

    4. @Pedro
      Thanks for your finding.

    5. I've found that the regression in oltp is caused by this change in Archlinux's mariadb:
      https://git.archlinux.org/svntogit/packages.git/commit/trunk?h=packages/mariadb&id=d191e6c8ccde2faf3e7ce6b0beb77d0e0f29afe4

      I've rebuilt the old mariadb package that was previously used (v10.1.19) and uploaded the new results. The results are close to CFS.

      I've also run interbench. VRQ results are not that good, but I found interbench results difficult to understand.

      Pedro

    6. @Pedro
      Thanks for the updated results and the good work. It would be nice if you could add a VRQ Normal policy column and a VRQ BATCH policy column, which are kind of the Int=1 and Int=0 modes in VRQ.
      Another suggestion is to add make -j12 and make -j16 to the ffmpeg results, for the case when the number of tasks exceeds the number of CPUs.
      From the results, I can see VRQ is not good at -j2 and -j4. I can guess the cause, but I'll wait for more evidence to prove it.
      The bench tools I am using are kernel compile, cyclictest and hackbench.

    7. I've added the results for VRQ using Batch policy (using schedtool -B).
      Throughput is closer to CFS than VRQ using default policy, as expected.

      Also, the standard deviation in oltp when 'number of threads' >= 'number of logical cpu' is higher than usual. Maybe VRQ throughput can be improved further.

      I'll add 'make -j12' and '-j16' for the Linux 4.10 release. Thanks for the suggestion.

      On a side note, how do you use cyclictest?
      I tried in the past and couldn't get consistent results between runs.

      Pedro

  6. @Alfred:
    Atm I'm reorganising disks and partitions with gparted. I don't do it often, so this is no reference at all. But it behaves so well and fast with my current setup that I need to let you know.
    Kernel 4.9.5, VRQ 092, BFQv8r7, WBT7(from ck/pf) and my humble TOI port.

    And, apart from other posters' results, the improving 4.9.x kernels do improve reliability here. Read it like: I don't want a kernel failing on every 2nd resume from disk because of speed issues from the bootup-speed people.

    BR, Manuel Krause

  7. [OFFTOPIC]
    I've noticed several people on here that use shared graphics, comparable with mine on a HP laptop. In my case it's a GM45 using the i915 kernel module and the i965 from the xorg intel driver.

    My question:
    Do you know a way to find out what amount of memory/RAM is actually really allocated?
    From 'man intel', it's automatically allocated by needs, "VideoRAM" in xorg.conf thus ignored for my chipset.
    Kernel says: [drm] Memory usable by graphics device = 2048M.
    lspci -v says: Memory at c0000000 (64-bit, prefetchable) [size=256M]
    < Note: The latter is most likely the max. possible AGP aperture mem size and not the gfx'

    And that was all. The Xorg.0.log doesn't number memory sizes at all on here.

    Thank you in advance for helpful hints and links,
    BR, Manuel Krause

    Replies
    1. @Manuel,

      I have researched this to some extent, and as far as I could tell, Your device can use up to 2GB of RAM and has 256MB of dedicated RAM. You cannot do anything about RAM usage, and You cannot determine the used size or set an exact or minimum amount, etc.
      That's it, unfortunately. The only thing You can do, if Your BIOS allows it, is set how much memory is actually shared: not the 2GB maximum, but the real amount.
      No good situation, I suppose, but that's it.
      Plz share Your findings as well :)

      Br, Eduardo

    2. @Eduardo:
      Thank you for your info. Unfortunately, after my first searches, the rest of the internet doesn't provide more wisdom than you. And I doubt that digging into the huge documentation files from 01.org (the hundreds of PRM pages) would give me usable means.
      What I've found by coincidence by looking into the /proc and /sys folders was:
      /sys/module/i915/coresize
      I don't know anything about its relevance regarding this topic but will keep an eye on it when changing system load and over time. After finding it and with ~22h uptime it shows a value of 1113739.
      Sadly my BIOS was made with the KISS strategy (to name it in polite words) and doesn't offer a knob for such a 'critical' ;-) setup variable.

      I'll keep investigating,
      BR, Manuel Krause

    3. I think this is the memory used by the i915 module.

    4. @Anonymous:
      I'm uncertain about the unit. You mean it's only the module's footprint in bytes? Then it was ridiculous of me to mention it as a finding, shame.
      BR, Manuel Krause

    5. ^ yes. On a sidenote...
      Updated kernel gcc compiler flags:

      KBUILD_CFLAGS += -O3 -frename-registers -march=native -mtune=generic -falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1 -mno-mmx -mno-sse -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4.1 -mno-sse4.2 -mno-sse4 -mno-avx -mno-aes -mno-sse4a -mno-3dnow -fno-builtin -pipe

    6. @Alfred:
      Eaten post... Can you please bring it up?!
      Thx, Manuel Krause

  8. Hi,

    I was testing 092 version for good (almost 13 days uptime) measure, now switching to mux to test how pstate behaves there.
    As for VRQ - stable, fast, all good, except that the total CPU usage and the per-process usage are not quite aligned (reported above; is anyone else seeing this?). At least that's what I see.

    Keep up good work and thanks,
    Eduardo

    Replies
    1. Yes, I got multi-core usage/load balancing issues as well.
      Apart from that very fast/low latency.

    2. @all
      I just got back from a long CNY holiday and am picking up topics slowly, :)
      Again, as for the CPU usage not being aligned: as I have explained before, it's kind of a design intention of VRQ. It's not the same as the "task policy fairness" issue reported by Manuel. I will start another post to explain these two clearly once the debug patch (for the task policy fairness issue) is ready.
