Thursday, December 10, 2015

GC and VRQ branch update for v4.3.1 and latency test

The first stable release for 4.3 has finally arrived, and the gc and vrq branches have been updated with the bug fixes made over the past few weeks.

*Fix a non-return error when SMT_NICE is enabled (though SMT_NICE is not recommended for VRQ).
*Go through the thread list with tasklist_lock held when a cpu is hotplugged. This applies to both the gc and vrq branches.

*Task caching scheduling Part III. As usual, I will write another post about it.

The gc branch for v4.3.1 can be found at bitbucket and github.
The vrq branch for v4.3.1 can be found at bitbucket and github.

One more thing: for a long time I have wanted to add more tests/benchmarks for scheduling, and yesterday I finally found one. It is Cyclictest; you can check the details on this wiki (it's a little old, but it's a good starting point). Based on my research, it is scheduler independent and uses no scheduler statistics.
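For readers new to this kind of test, the basic idea can be sketched as below. This is only a minimal illustration, not cyclictest's actual source: a thread sleeps until an absolute point in time and then measures how late it really woke up. The helper name measure_wakeup_latency_ns is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Convert a timespec to nanoseconds. */
static int64_t ts_to_ns(const struct timespec *t)
{
	return (int64_t)t->tv_sec * NSEC_PER_SEC + t->tv_nsec;
}

/*
 * Hypothetical helper: sleep until "now + interval_ns" on CLOCK_MONOTONIC
 * and return how many nanoseconds after that deadline we actually woke up.
 */
int64_t measure_wakeup_latency_ns(long interval_ns)
{
	struct timespec next, now;

	clock_gettime(CLOCK_MONOTONIC, &next);
	next.tv_sec += interval_ns / NSEC_PER_SEC;
	next.tv_nsec += interval_ns % NSEC_PER_SEC;
	if (next.tv_nsec >= NSEC_PER_SEC) {
		next.tv_sec++;
		next.tv_nsec -= NSEC_PER_SEC;
	}

	/* Absolute sleep; retry if a signal interrupts it. */
	while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL) == EINTR)
		;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return ts_to_ns(&now) - ts_to_ns(&next);
}
```

Calling measure_wakeup_latency_ns(10000000) in a loop and tracking the minimum, average and maximum gives numbers of the same kind as cyclictest's Min/Avg/Max columns, though without the real-time priority setup that cyclictest does.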

Here are my first cyclictest results under an idle workload for v4.3 cfs, bfs and vrq; the latencies are in nanoseconds, since I ran cyclictest with the -N option. (I'm still playing with it.)

4.3 CFS
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.05 0.04 0.05 1/219 1504         

T: 0 ( 1499) P:80 I:10000 C:  10000 Min:   1831 Act:    2245 Avg:    2413 Max:   12687
T: 1 ( 1500) P:80 I:10500 C:   9524 Min:   1917 Act:    2965 Avg:    2560 Max:    7547
T: 2 ( 1501) P:80 I:11000 C:   9091 Min:   1702 Act:    2254 Avg:    2313 Max:   10650
T: 3 ( 1502) P:80 I:11500 C:   8696 Min:   1546 Act:    2297 Avg:    2274 Max:   13723

4.3 BFS
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.15 0.10 0.04 1/234 1540         

T: 0 ( 1536) P:80 I:10000 C:  10000 Min:   1437 Act:    2002 Avg:    1893 Max:   10912
T: 1 ( 1537) P:80 I:10500 C:   9524 Min:   1427 Act:    2010 Avg:    1907 Max:    7534
T: 2 ( 1538) P:80 I:11000 C:   9091 Min:   1402 Act:    1755 Avg:    1902 Max:   13059
T: 3 ( 1539) P:80 I:11500 C:   8696 Min:   1408 Act:    1878 Avg:    1866 Max:   12921

4.3 VRQ
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.00 0.01 0.00 0/226 1607         

T: 0 ( 1602) P:80 I:10000 C:  10000 Min:   1349 Act:    1785 Avg:    1647 Max:    4934
T: 1 ( 1603) P:80 I:10500 C:   9524 Min:   1355 Act:    1464 Avg:    1642 Max:   12378
T: 2 ( 1604) P:80 I:11000 C:   9091 Min:   1334 Act:    1926 Avg:    1676 Max:   12544
T: 3 ( 1605) P:80 I:11500 C:   8696 Min:   1350 Act:    1801 Avg:    1627 Max:   10989
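
To compare the three runs it helps to pull the Avg and Max columns out of each "T:" line. Here is a minimal sketch, assuming the line layout shown above; struct ct_line, parse_cyclictest_line and mean_avg are hypothetical helpers and not part of cyclictest.

```c
#include <stddef.h>
#include <stdio.h>

/* One parsed "T:" line from cyclictest output (hypothetical helper type). */
struct ct_line {
	int thread;
	int pid;
	int prio;
	long interval;
	long cycles;
	long min, act, avg, max;
};

/* Returns 1 on success, 0 if the line does not match the expected layout. */
int parse_cyclictest_line(const char *s, struct ct_line *out)
{
	int n = sscanf(s,
		       " T: %d ( %d) P:%d I:%ld C: %ld Min: %ld Act: %ld Avg: %ld Max: %ld",
		       &out->thread, &out->pid, &out->prio, &out->interval,
		       &out->cycles, &out->min, &out->act, &out->avg, &out->max);
	return n == 9;
}

/* Mean of the Avg column over n parsed lines; 0.0 for an empty set. */
double mean_avg(const struct ct_line *lines, size_t n)
{
	double sum = 0.0;
	size_t i;

	if (n == 0)
		return 0.0;
	for (i = 0; i < n; i++)
		sum += (double)lines[i].avg;
	return sum / (double)n;
}
```

Applied to the Avg columns above, the per-run means come out at 2390 for CFS, 1892 for BFS and 1648 for VRQ, in the same unit as the raw output.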

Enjoy gc/vrq on v4.3.1, and try cyclictest if you care about latency and task interaction.

BR Alfred

Edit:
If you hit a failed s2ram/resume issue with this new gc/vrq release, you can try the two patches below (one for gc and one for vrq) and see if they help.
4.3_gc3_fix.patch and 4.3_vrq1_fix.patch

45 comments:

  1. Your cyclictest numbers are very high in all tests.
    How did you call cyclictest?
    Here is what I'm getting on the latest vrq:

    sudo cyclictest --smp -p 80
    # /dev/cpu_dma_latency set to 0us
    policy: fifo: loadavg: 0.00 0.00 0.00 0/207 2623

    T: 0 ( 2620) P:80 I:1000 C: 13685 Min: 4 Act: 7 Avg: 6 Max: 293
    T: 1 ( 2621) P:80 I:1500 C: 9123 Min: 4 Act: 7 Avg: 7 Max: 309
    T: 2 ( 2622) P:80 I:2000 C: 6842 Min: 4 Act: 6 Avg: 6 Max: 278
    T: 3 ( 2623) P:80 I:2500 C: 5473 Min: 4 Act: 7 Avg: 7 Max: 147

    1. Oh, I used the -N option to display in ns.

    2. This immediately leads to a second question.
      Why is my Avg. latency 4 times higher than yours?
      Any idea? Maybe there is a difference in .config? Can you please post yours?

      The huge difference in Max latency can be ignored. I did an ftrace and it's because of my BIOS blocking the cpu. This happens only once a second, so it doesn't influence the Avg values significantly.

    3. These tests were run on my testbed server, on which I set CONFIG_HZ_1000=y in the kernel config. I think that's the main reason for the low avg latency. I haven't had time to check the results on my other machines yet.

    4. Hmm, I'm also using CONFIG_HZ_1000=y. I think a comparison of the configs might help.
      A factor of 4 is a really huge difference!!

    5. I have sent you my kernel config by email. You can check it.

    6. @Alfred: I'm not the author of the previous Anonymous' postings, but thanks for sharing your .config with me.
      You both should find another way to compare your .configs.

      I usually sign my postings with

      BR, Manuel / Manuel Krause

  2. :-) I just wanted to ask in the older thread, if there are changes ahead for your planned 4.3.1 branches' update...
    So there are significant changes?

    When I manually load the patches from bitbucket, will it be sufficient to only fetch the patches in:
    https://bitbucket.org/alfredchen/linux-gc/commits/tag/v4.3.1-vrq
    ?

    It would be very nice, if you could also provide all-in-one patches for this release, complete -gc and -vrq, for us lazy people. ;-)

    The last patch for vrq works flawlessly on 4.3.0 for me.
    During the last weeks I've spent some time testing .config changes, in order to move timing-relevant kernel/module loading/re-initialisation sections out of the resume-from-hibernation process. You remember my problem with many random and unreproducible TuxOnIce resume failures. I seem to have solved it by first (and for the first time) making use of your "Use prefered raid6 gen function" patch, and by compiling btrfs (plus dependencies) into the kernel (which otherwise gave a "random: nonblocking pool initialized" message at resume). But I left my GFX i915 as a module. It has been fine for ~1 week. For now, I can't pin it down to one particular change. I'm just glad about more than 30 failureless resumes.

    Best regards,
    Manuel Krause

    1. As for now I've taken the incremental VRQ commit from post-factum's pf-4.3 tree (Thank you, post-factum!) and trust that all changes went in. [https://github.com/pfactum/pf-kernel/commit/ce19ca43ea9e6e5c68c37da16018821f39901ca6]

      4.3.2 runs very well with it plus TuxOnIce and BFQ I/O v7r8.

      BR, Manuel

    2. Well, 4.3.2 and bfs 0466. It will be enough reason for another update next week. :)

    3. Unholy 4.3.3... :-( The huge BTRFS changes in that kernel revision change the way drivers are initialized again, so the "random: nonblocking pool initialized" message now again interferes with TuxOnIce resumes from disk here. Not nice. I'll throw out BTRFS completely, as I haven't used it at all until now.

      BR, Manuel

  3. Hi, Alfred!
    Another question that arises from Con's updated BFS 466:
    Can the code change (look at http://ck.kolivas.org/patches/4.0/4.3/4.3-ck2/patches/bfs465-466.patch) also be relevant for your current -vrq ?

    And what would the code section look like then?

    Thank you and BR, Manuel

    1. I'm looking at ck's 0466 code changes. Based on my tests, it improves performance for workload>=100%. More tests and code modifications are needed when adapting this to -gc and -vrq, because this code change is very performance sensitive and I want to make sure there is no regression in -gc and -vrq.

    2. You're absolutely right. Adapting this for the current VRQ needs some love and care.
      Yesterday I tried two, admittedly simple, versions of the relevant section, both with negative results: the first one led to bad latencies even on moderate cpu load with video playback (mpv, and very bad with flash in firefox) plus worldcommunitygrid in the background; the second one reintroduced resume failures.
      And one report appeared on Con's blog (with a version backported to BFS 464) about suffering responsiveness. Let's see more testers' comments.

      Your updated VRQ is still running fine. :-)
      BR, Manuel

  4. Hmm, it seems that the latest pf-kernel update breaks resuming from s2ram for me. I have to find out which is guilty: the 4.3.1/4.3.2 update or -vrq.

    1. Do you use -pf or just bare -vrq?

    2. vrq.
      I didn't test vanilla 4.3.x, so this bug might be unrelated to vrq patches.

    3. Sorry, I had no blog/email access during this weekend. Have you figured out which one introduces the s2ram resume failure?
      I have noticed there is a "smp_processor_id() in preemptible" message when a cpu goes offline. But it is a mainline issue, and on my machine suspend/resume still works with it. So I simply ignore it and wait for the fix from mainline.

    4. Hi, pf,
      I just did a quick test and found that there is a "possible circular locking dependency detected" introduced with the new code. Please try removing the two lines below in tasks_cpu_hotplug() and see if this fixes your resume issue.

      diff --git a/kernel/sched/bfs.c b/kernel/sched/bfs.c
      index fe3270ff..9a71274 100644
      --- a/kernel/sched/bfs.c
      +++ b/kernel/sched/bfs.c
      @@ -6376,7 +6376,6 @@ static void tasks_cpu_hotplug(int cpu)
       	if (cpu == 0)
       		return;

      -	read_lock(&tasklist_lock);
       	do_each_thread(t, p) {
       		if (cpumask_test_cpu(cpu, &p->cpus_allowed_master)) {
       			count++;
      @@ -6388,7 +6387,6 @@ static void tasks_cpu_hotplug(int cpu)
       				cpumask_weight(tsk_cpus_allowed(p));
       		}
       	} while_each_thread(t, p);
      -	read_unlock(&tasklist_lock);

       	if (count) {
       		printk(KERN_INFO "Renew affinity for %d processes to cpu %d\n",

    5. Why don't I face the issue(s)? I have neither of the two problems reported by post-factum. Does it depend on the number of cpu cores? Or on something else?
      And would you consider removing these two lines to be generally safer/better?

      BR, Manuel Krause

    6. Simply removing these two lines will show whether they are the cause of the issue for the affected users. I'd like to keep them and find a way to solve the circular deadlock instead. But when I wrote the post last night, I was in no condition to sit down and write any code: I had just come back home after a busy weekend and had some wine. :)

    7. @Alfred, I've already switched to 4.3.3 + BFS 0.467 to test latest code. Probably, anonymous could test your patch?

    8. I have updated this post with two patches to fix the circular locking in -gc and -vrq. If you have a suspend/resume issue, you can apply the patch and see if it works.

      BR Alfred

    9. I'm not the other Anonymous on here and haven't suffered from his or post-factum's issues, but using this patch on -VRQ works well here. For 4.3.3 I've removed BTRFS + dependencies to make my peace with TuxOnIce and the timing problems with my gfx.
      I don't know if I'll really put BTRFS back in soon to re-check with the newest -VRQ fix, as it was so annoying not to get the system resumed.

      BR, Manuel Krause

    10. @Anonymous (the other one :-) ):
      Have you tried the patch(es) that Alfred provided? It would be nice to read some result from a person with a system really affected by this issue.

      Thank you in advance, BR,
      Manuel Krause

    11. Nothing new on here? Disappointing.

      BR, Manuel Krause

  5. I also got the same panic twice on my router:

    http://i.piccy.info/i9/454c5f1d8063e0f707bdc0896d86e565/1449961790/283985/951663/panic.jpg

    Weird :(.

    1. Getting the same panic with v4.3.3 and official BFS v0.467.

    2. I expect it to be a mainline issue. This commit:

      https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=3759824da87b30ce7a35b4873b62b0ba38905ef5

      has changed the function in question between 4.2 and 4.3. I've reverted it, now compiling and testing.

    3. Thank you, post-factum, for your investigations in this case. I really wish you'd get no further panics.

      BR, Manuel Krause

    4. @pf
      Thanks for the update. I insist that the "possible circular locking dependency detected" issue should be addressed too. I'll include the fix in the next -gc/-vrq release.

  6. @Alfred: Have you already formed an opinion about Con's BFS 0467 approach?
    Keeping sched_interactive at 1 can improve low-latency behaviour, and setting it to 0 would lead to the BFS 0466 results (possibly higher latencies but possibly higher throughput), if I understand Con's posts and the code changes correctly? I haven't found time to test his code so far.

    Best regards,
    Manuel Krause

    1. Basically, I am just playing with the 0466 code change and haven't had time to check the 0467 code yet. But from CK's description, sched_interactive sounds like a switch between throughput and interactivity. In that case, IMO, I'd like to make it a compile-time option that depends on the kernel preemption mode setting, that is, server or desktop. That's my first thought about 0467.

      Currently, I am reworking a commit in vrq which introduced a throughput regression with the 0466 code change but which, on the other hand, also contributes a lot to the latency improvement.

    2. Fine! Looking forward to testing the reworked -vrq commit. :-)
      BR,
      Manuel Krause

    3. Regarding the BFS 0466/0467 code change, Con already answered that 0467 only introduces the "interactive" switch, which chooses between, for == 1 (default),
          dl = p->deadline
      and otherwise
          if (tcpu != cpu && task_sticky(p) && scaling_rq(rq))
              continue;
          dl = p->deadline << locality_diff(tcpu, rq);

      So the soft locality affinity path gets completely phased out for "interactive", and the previous versions (pre-0466 and your -GC) show a middle way?

      I also see that Con changed the locality_diff function, your -GC contains the applicable BFS code there and in -VRQ you've changed the general approach for this code section with the scost task caching commits.

      In my opinion you should not take over "interactive" as a compile-time and dependent setting; that would make testing too complicated for now. Maybe a boot-time command line parameter would be useful?!

      BR Manuel Krause

    4. I have fixed the regressed commit in -vrq, and it now provides a better throughput improvement. The benchmark result is 2m32s compared to 2m34s in -gc (at 300% workload), and it shows very little overhead when the workload goes from 100% to 300% (2m30s at 100% workload). But the interactivity of bfs0466 is really bad: mpv starts dropping frames when the background workload is > 100%.

      I'm not going to introduce 0466 or 0467 into -vrq until we have a working solution to take advantage of it.

    5. Nice to read that all. And... Where is the fixed commit? ;-)
      Or do you want to add the circular locking fix together in one round?

      Thank you for your engaged and hard work! BR,
      Manuel Krause

    6. I'm no native speaker, so "engaged" is not the correct word for my thought. I meant something like "dedicated".
      BR, Manuel.

    7. Sorry, I didn't want to impose any pressure on you and your work.

      The current -VRQ on 4.3.3 with the latest fix patch + BFQ v7r8 + TuxOnIce is still running very well since my latest .config change, which excluded the (for me unneeded) BTRFS. That is, I haven't needed any reboot; those were mostly caused by TuxOnIce + i915, which showed no need for one this time: every first resume attempt worked immediately, with no corruption on X refresh, so there are no timing issues with this setup. I've also done a few suspend-to-RAM cycles in between. That makes a total uptime of 7 days now, which I've never reached before.

      BR, and I'm looking forward to your updates,
      Manuel Krause

    8. Forgot to add: For these 7 days of uptime...
      ...the number of failing/succeeded/attempted/total resumes is: 0/20/20/20
      Manuel

    9. Together with some Christmas-related boredom and additional thoughts about Con's intentions with the 466/467 patches, I've managed to change the code to imitate the BFS "interactive" setting's behaviour for the current -VRQ. I say "imitate" because the code change is in fact so trivial, without a switch, and I'm no professional programmer.
      The result has been working very well since yesterday, for ~22h now. The benefits for interactivity are visible, and there is nothing negative regarding throughput so far, but this is only based on subjective observations in everyday use.

      So that finding raises the question: what is the soft locality affinity approach good for at all?! Does it pay off if there are more than 1, 2, 3... cpu cores? I have 2.

      Best regards, thank you for your time, and the very best wishes for you in 2016!

      Manuel Krause

    10. OMG... I've finally uploaded the patch "to imitate CK's 0.467 behaviour with 'interactive' set". It is only meant for Alfred's current 4.3.1-vrq branch (the old one, online at the moment of writing this) -- not for later ones. It has no switch and doesn't clean up the code, but it does eliminate the soft affinity locality stuff, like CK did for the 'interactive' (default) setting in 0.467.
      Link: http://pastebin.com/99DZd0Kc

      I haven't had any issues during many hours of uptime with it, but handle with care, and as always, keep your disk backups fresh.

      @Alfred: Your opinion on this dumb solution? ;-)

      BR Manuel Krause

    11. @Alfred: ...and regarding the topic on the newer thread: The patch posted above, based on your current -vrq branch, does equalize everything between my cpu0/cpu1, almost like mirrored. But you would know that.
      BR Manuel

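
The BFS 0466/0467 difference quoted in comment 6 (reply 3) can be modeled outside the kernel roughly as below. It is a sketch built only from the lines quoted there, so the real BFS code may well differ. struct candidate, effective_deadline and the boolean inputs are hypothetical stand-ins for p, task_sticky(), scaling_rq() and locality_diff().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the scheduler state involved. */
struct candidate {
	uint64_t deadline;	/* p->deadline */
	int locality;		/* result of locality_diff(tcpu, rq), assumed 0..63 */
	bool other_cpu;		/* tcpu != cpu */
	bool sticky;		/* task_sticky(p) */
};

/*
 * Returns true and stores the effective deadline in *dl if the task may be
 * considered; returns false where the quoted code would "continue".
 */
bool effective_deadline(const struct candidate *c, bool interactive,
			bool rq_scaling, uint64_t *dl)
{
	if (interactive) {
		/* "interactive" path: the plain deadline, no locality bias. */
		*dl = c->deadline;
		return true;
	}

	/* Otherwise skip sticky tasks from another cpu while the rq is scaling. */
	if (c->other_cpu && c->sticky && rq_scaling)
		return false;

	/* Soft affinity: each locality step doubles the effective deadline. */
	*dl = c->deadline << c->locality;
	return true;
}
```

With interactive set, a task from a distant cpu competes with its plain deadline. Without it, the shift pushes that task's effective deadline later, so tasks that last ran closer to the picking cpu tend to be preferred.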