Wednesday, December 31, 2014

VRQ 0.3 updates

On the last day of 2014, I would like to announce the 0.3 update of the VRQ solution for BFS.

It's almost a rework, done to address the lock dependency issues caused by the introduction of the rq lock in VRQ. I'd like to do more cleanup during the 3.18 release; if that goes well, it will be moved to the -gc branch.

Here are the test results for CFS, original BFS, Baseline (the version VRQ is based on) and VRQ. They show that after three releases, VRQ can be better than the original BFS in all kinds of workloads and comparable to CFS (again? If I remember correctly, BFS was better than CFS in the compile test some releases ago). Most important of all, VRQ opens up the opportunity for further improvements; I will give these new ideas a try next year.

Happy New Year 2015.

#1 50% task/cores ratio workload test

3.18_CFS_50
5m18.385s
5m17.836s
5m17.783s
5m17.765s
5m17.730s
5m17.663s
5m17.596s
5m17.566s
5m17.487s
5m17.455s

3.18_0460_50
5m20.448s
5m20.384s
5m20.374s
5m20.286s
5m20.245s
5m20.192s
5m20.166s
5m20.155s
5m20.150s
5m20.093s

3.18_Baseline_50
5m17.975s
5m17.552s
5m17.512s
5m17.492s
5m17.485s
5m17.482s
5m17.480s
5m17.475s
5m17.457s
5m17.342s

3.18_VRQ_50
5m17.350s
5m17.339s
5m17.318s
5m17.297s
5m17.284s
5m17.282s
5m17.248s
5m17.220s
5m17.209s
5m17.145s

#2 100% task/cores ratio workload test

3.18_CFS_100
2m50.047s
2m50.037s
2m49.825s
2m49.750s
2m49.749s
2m49.744s
2m49.706s
2m49.703s
2m49.673s
2m49.633s

3.18_0460_100
2m51.933s
2m50.640s
2m50.431s
2m50.424s
2m50.421s
2m50.386s
2m50.362s
2m50.267s
2m50.248s
2m50.129s

3.18_Baseline_100
2m51.862s
2m50.527s
2m50.506s
2m50.400s
2m50.370s
2m50.361s
2m50.283s
2m50.282s
2m50.189s
2m50.146s

3.18_VRQ_100
2m50.944s
2m49.812s
2m49.797s
2m49.700s
2m49.683s
2m49.672s
2m49.653s
2m49.649s
2m49.613s
2m49.506s

#3 150% task/cores ratio workload test

3.18_CFS_150
2m53.382s
2m53.366s
2m53.328s
2m53.326s
2m53.326s
2m53.310s
2m53.307s
2m53.262s
2m53.208s
2m53.127s

3.18_0460_150
2m57.280s
2m56.710s
2m56.124s
2m55.860s
2m55.843s
2m55.725s
2m55.646s
2m55.643s
2m55.597s
2m55.582s

3.18_Baseline_150
2m57.100s
2m55.907s
2m55.796s
2m55.788s
2m55.755s
2m55.749s
2m55.740s
2m55.736s
2m55.732s
2m55.726s

3.18_VRQ_150
2m55.449s
2m53.168s
2m52.112s
2m51.898s
2m51.649s
2m51.539s
2m51.527s
2m51.371s
2m51.272s
2m51.270s

#4 100%+50%(IDLE) task/cores ratio workload tests

3.18_CFS_100_50
2m55.069s
2m53.714s
2m53.638s
2m53.586s
2m53.474s
2m53.466s
2m53.313s
2m53.280s
2m53.196s
2m53.187s

3.18_0460_100_50
2m50.730s
2m50.713s
2m50.651s
2m50.620s
2m50.615s
2m50.598s
2m50.560s
2m50.548s
2m50.460s
2m50.457s

3.18_Baseline_100_50
2m50.826s
2m50.691s
2m50.683s
2m50.632s
2m50.601s
2m50.574s
2m50.555s
2m50.549s
2m50.549s
2m50.507s

3.18_VRQ_100_50
2m49.822s
2m49.799s
2m49.766s
2m49.706s
2m49.673s
2m49.631s
2m49.623s
2m49.587s
2m49.583s
2m49.564s
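
To compare the kernels at a glance, the ten timings in each block above can be reduced to a best, median and mean value. The following is only a minimal sketch and not part of the original test setup: the helper names `parse_elapsed` and `summarize` are made up for illustration, and it assumes every timing uses the `XmY.YYYs` format shown above. The sample data is copied from the 150% test.

```python
import re
from statistics import mean, median

# Matches strings such as "2m51.270s" (minutes, then seconds).
_ELAPSED_RE = re.compile(r"^(\d+)m(\d+(?:\.\d+)?)s$")


def parse_elapsed(text):
    """Convert a timing such as '2m51.270s' into seconds (hypothetical helper)."""
    match = _ELAPSED_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised timing: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def summarize(samples):
    """Return best, median and mean run time in seconds (hypothetical helper)."""
    secs = [parse_elapsed(s) for s in samples]
    return {"best": min(secs), "median": median(secs), "mean": mean(secs)}


# Timings copied from the 150% test above.
RESULTS = {
    "3.18_CFS_150": [
        "2m53.382s", "2m53.366s", "2m53.328s", "2m53.326s", "2m53.326s",
        "2m53.310s", "2m53.307s", "2m53.262s", "2m53.208s", "2m53.127s",
    ],
    "3.18_VRQ_150": [
        "2m55.449s", "2m53.168s", "2m52.112s", "2m51.898s", "2m51.649s",
        "2m51.539s", "2m51.527s", "2m51.371s", "2m51.272s", "2m51.270s",
    ],
}

if __name__ == "__main__":
    for name, runs in RESULTS.items():
        stats = summarize(runs)
        print(f"{name}: best {stats['best']:.3f}s, "
              f"median {stats['median']:.3f}s, mean {stats['mean']:.3f}s")
```

The median is less sensitive than the mean to a single slow run, such as the first entry of 3.18_VRQ_150.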

23 comments:

  1. Can I omit "3b9cc00 bfs: xxxx_schedule() stat debug" or is it needed? Sounds like debugging overhead.

    Thanks for your continued work on BFS+ and, yes, I wish a Happy New Year,

    Manuel

  2. BTW, have you already reviewed the patches in http://ck.kolivas.org/patches/bfs/3.0/3.18/pending/ from 20141231?
    Would they be useful with 3.18-vrq, too? At least from the "bfs460-locked-pluggedio.patch" I see it's not applying correctly.

    Best regards, Manuel

    Replies
    1. "bfs460-locked-pluggedio.patch" is a fix for a known issue; if you don't have that issue, you can safely ignore it for now. I will wait for the next sync with a new BFS release to pick up all these patches.

  3. My first observation with 3.18.1 + BFQ + 23 BFS-VRQ-branch patches (omitted patch 23 of 24 "3b9cc00 bfs: xxxx_schedule() stat debug") is that my two CPU cores aren't used/loaded equally. I see it when observing the gkrellm CPU0/1 charts while doing things.
    At that moment, the main things running are a worldcommunitygrid client in the background, a firefox-esr with 110 open tabs and an smplayer playing an .avi movie.

    One of the two cores shows approx. 50% of the NON-IDLE load of the other. Funny thing: if I quit and restart firefox, the heavier load can end up on either cpu0 or cpu1.
    After quitting the low-priority wcg client, the core that had less NON-IDLE load before then shows 50% more load than the other.

    The normal BFS always tried to balance these loads equally. Maybe there's something going wrong?

    Best regards, Manuel

    Replies
    1. That's interesting; I will try to reproduce it. I would like to ask a few questions first.
      1. What priority is the wcg client running at? Is it single-threaded or multithreaded?
      2. Is smplayer single-threaded or multithreaded? How much CPU% does it use when it plays your avi movie?
      3. When the unbalanced load is triggered, how long does it stay that way?
      4. Would you be able to use htop and check whether you observe the same?

    2. Mmmh. I don't like 3.18.x, as the automatic fan management fails.
      But o.k.: Now the answers:
      1. I run wcg as root and it's a nice 19 process. There are 2 work units running, each of them on one of my two cpu's cores.
      2. smplayer/MPlayer doesn't play a role for the kind of videos I usually play. The load is negligible. About +5% on one core?
      3. The unbalanced load stays forever, right from the beginning. At least it seems so after booting, logging in to KDE, then starting programs and so on.
      4. htop shows the same load imbalance.

      Yesterday I've then made a new discovery: After Suspend-to-Disk and resuming, all load was assigned to CPU0. Nothing balancing to CPU1. That stayed "forever".

      Manuel

    3. By "nothing" in what I've written, I mean: only the wcg client kept running on CPU1.

    4. Yeah, in 3.18, suspend is broken on my workstation and the mpv movie display is garbled in windowed mode.
      OK, back to the topic. I have tried to reproduce your usage by using mprime to simulate the wcg background workload (I don't have wcg installed) and mpv playing an mkv movie to simulate the foreground workload.

  4. In both the 3.17 and 3.18 -vrq branches, I can observe a kind of imbalanced workload among the CPUs, but it doesn't continue forever as you described. I will re-check this on the original BFS kernel tonight.

    As for the imbalanced behaviour, I think I can explain it this way. The CPU usage shown in htop is a statistic. For example, take a single-threaded task that occupies 10% of a CPU: if, during htop's calculation window of, say, 2 seconds, it spends 1 second on cpu0 and 1 second on cpu1, htop reports 5% usage on each CPU, which looks balanced. From the scheduler's point of view, switching a task among CPUs is not a good idea (though, in my opinion, on your hardware, where the 2 cores share the same LLC, switching between these two CPUs is cost-free). So ideally this task would stay on one CPU for the whole 2 seconds, resulting in 0% usage on one CPU and 10% on the other, which looks imbalanced in htop.
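
The averaging effect described above can be reproduced with a few lines of arithmetic. This is a minimal sketch only, not how htop is actually implemented: the function name `per_cpu_usage` and the model of equal-length time slices within one sampling window are assumptions made for illustration.

```python
def per_cpu_usage(placement, task_load, n_cpus=2):
    """Average per-CPU usage over one sampling window (illustrative model).

    placement: CPU index the task ran on during each equal-length time slice.
    task_load: fraction of a CPU the task consumes while running (0.10 = 10%).
    """
    busy = [0.0] * n_cpus
    for cpu in placement:
        busy[cpu] += task_load
    # Average the accumulated load over the number of slices in the window.
    return [total / len(placement) for total in busy]


# A 2-second window split into two 1-second slices.
migrating = per_cpu_usage([0, 1], 0.10)  # 1 s on cpu0, then 1 s on cpu1
sticky = per_cpu_usage([0, 0], 0.10)     # both seconds on cpu0

print("migrating:", [f"{u:.0%}" for u in migrating])
print("sticky:   ", [f"{u:.0%}" for u in sticky])
```

The total work is identical in both cases; only its distribution across the two CPUs within the window differs.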

    As for your new discovery, I can't get suspend-to-disk to work here, so I tested with suspend (to RAM) and resume; both foreground and background load are the same after resume.

    Two additional questions:
    1. How does wcg place its work units? Are they pinned to a certain CPU, or does it not care? mprime, for example, just starts a number of threads and doesn't care which CPU/core they run on.
    2. How does your usage behave in 3.17 and 3.18 with pure BFS?

  5. O.k., now I've found time to test a bit and to answer.
    I've tested your baseline 3.18.y-gc (instead of your proposal of "pure" BFS, as I use your .y-gc on 3.17.x, too).
    Please also note that I permanently observe the two CPU cores via gkrellm, where I see in colours what amount of idle load vs. normal+system load happens ATM. Here the interval is very small. I understand that the "balance" I watch is not a real balance, but a statistical one.
    With your baseline patches, on the kernels 3.17.8 & 3.18.2, the 2 cores show a (virtually) equal load over short and long time. While I can agree with your explanation of the unbalanced behaviour in general, the VRQ-patched 3.18.2 doesn't equalise correctly at all.
    I run wcg with BOINC at default settings. It seems to fetch and queue work units as needed and brings a new one to each CPU core whenever that core has no active work unit. In the BOINC Manager I can temporarily suspend tasks and see that this is true.
    With the VRQ 0.3 patched 3.18, the firefox task settles on CPU0 (first core) completely over time, while CPU1 (second core) then only serves wcg work unit 2. Starting a kernel compile with "make -j1" or "-j2" shows that CPU1 doesn't do more than half of what CPU0 does. After stopping all wcg clients, the main desktop+firefox load switches over to CPU1, and after that "make -j1" and "-j2" show equal load on both cores. After quitting the compilation and starting wcg again, the firefox load goes back to CPU0. And, while doing this row of tests (of course, without rebooting), with each test, the ability of CPU1 vanishes, to take over "normal" or "system" tasks. For kernel compilation: CPU0 at full load, CPU1 only showing some peaks.

    Conclusion: There's something severely imbalancing within your VRQ (only).

    Best regards, Manuel

    Replies
    1. I have tried to simulate your usage following your last description.
      I started 2 mprime threads and used taskset to make them all run on cpu1. After this, I can see the background workload occupying cpu1 100% and the normal system workload on cpu0. The mprime threads are running at nice level 19.
      Then I started a kernel compile with "make -j2": cpu0 is occupied by 90%+ normal workload and cpu1 is occupied by 70%-90% normal workload, some system workload (red), and the rest is taken by the mprime threads. The result is quite as expected. I have also rolled back to the Baseline version kernel and got the same result.

      So let's give it a last shot before I go and install BOINC.
      1. What priority and nice level are the wcg work units running at?
      2. Is your "make -j2" running at default priority and nice level? (I guess so.)

    2. PS, what does "And, while doing this row of tests (of course, without rebooting), with each test, the ability of CPU1 vanishes, to take over "normal" or "system" tasks. For kernel compilation: CPU0 at full load, CPU1 only showing some peaks." mean?

      When you disable/enable wcg repeatedly, cpu1 fails to pick up normal/system tasks? If so, it looks like wcg is running at a higher priority than it seems to be.

      BR Alfred

    3. Mmmh, I don't understand exactly why you need to re-model my system's behaviour with taskset. Here this happens without manual intervention.
      To your questions: There may be one significant difference to your mprime tests:
      1. Querying the two wcg tasks:
      # schedtool `pidofproc wcgrid_faah_7.1`
      PID 15980: PRIO 0, POLICY B: SCHED_BATCH , NICE 19, AFFINITY 0x3
      PID 21063: PRIO 0, POLICY B: SCHED_BATCH , NICE 19, AFFINITY 0x3
      I mean, it's the SCHED_BATCH scheduling policy that's different.
      2. Yes, the "make -j2" commands are executed at default priority, SCHED_NORMAL and NICE 0.
      3. Yes, repeatedly stopping+restarting both wcg tasks and kernel compile, resulted in cpu1 not picking up normal tasks from cpu0 over the repeated steps.
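
The two schedtool lines quoted above carry the details that matter here: the SCHED_BATCH policy, nice level 19, and an affinity mask of 0x3, which allows both CPUs. As a rough sketch, such lines could be turned into structured data as follows. The function name `parse_schedtool_line` is hypothetical, and the pattern is derived only from the two lines quoted above, so other schedtool versions may print a different format.

```python
import re

# Pattern for lines like the schedtool output quoted above, e.g.
# "PID 15980: PRIO 0, POLICY B: SCHED_BATCH , NICE 19, AFFINITY 0x3"
_LINE_RE = re.compile(
    r"PID\s+(?P<pid>\d+):\s+PRIO\s+(?P<prio>-?\d+),\s+"
    r"POLICY\s+\w+:\s+(?P<policy>\w+)\s*,\s+"
    r"NICE\s+(?P<nice>-?\d+),\s+AFFINITY\s+(?P<mask>0x[0-9a-fA-F]+)"
)


def parse_schedtool_line(line):
    """Parse one schedtool status line into a dict (hypothetical helper)."""
    match = _LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised schedtool line: {line!r}")
    mask = int(match.group("mask"), 16)
    return {
        "pid": int(match.group("pid")),
        "prio": int(match.group("prio")),
        "policy": match.group("policy"),
        "nice": int(match.group("nice")),
        # Expand the affinity bit mask into the list of allowed CPU indices.
        "cpus": [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1],
    }


if __name__ == "__main__":
    sample = "PID 15980: PRIO 0, POLICY B: SCHED_BATCH , NICE 19, AFFINITY 0x3"
    print(parse_schedtool_line(sample))
```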

      I hope this info helps to understand a bit better what's going on here with VRQ.

      BR, Manuel

    4. Thanks for taking the time to help with testing and for providing useful info.
      Once the policy is set to SCHED_BATCH, the issue is 100% reproducible, and I can confirm that the baseline version (-gc branch) is clean and -vrq is affected.
      Once the fix is ready, I will update the -vrq branch and let you know.

      BR Alfred

    5. Glad to hear that we've found the culprit. :-) I wish you good luck with working out the fix and hope that it won't hurt the benchmarking results of VRQ, which do look promising for now.

      BR, Manuel

    6. Please, do also provide an incremental patch for me, somewhere, so we all are able to see what you've changed vs. current VRQ.

      Thanks, Manuel

    7. I have to say it was too early to claim "100% reproduced" yesterday. Yes, when the issue is triggered, no matter what background workload (SCHED_BATCH or SCHED_NORMAL) is running, one CPU fails to pick up normal or system tasks. But the way I trigger the issue here, using "schedtool -3 xxx" to set the mprime threads to SCHED_BATCH, does not work reliably on a freshly restarted system. Last night, no matter how I played with it, it just wouldn't trigger the issue, but when I tried it today (the system was not restarted overnight), one-shot bingo!
      Based on the current info, the test cycle is 12h+ for me, and I think I need to re-test the baseline version and then bisect to find the commit. In other words, the fix should not be expected any time soon.

      I will let you know of any updates.

    8. Oh, that doesn't sound good. But, anyway, a quality fix is better than a too early one.
      Please, remember, that my tests with your baseline patches up to 3.18.2 were error-free.
      I'll stay tuned and can also test preliminary test-fix-patches for you.

      Good luck, Manuel

    9. I have bisected and found that the commit "bfs: vrq: RQ niffy solution." introduced this issue. You can skip the last 3 to 4 commits in the -vrq branch and see if that fixes the issue for you. Those commits are "
      bfs: vrq: dedicated xxxx_schedule().
      bfs: vrq, refactory wake_up_new_task.
      bfs: vrq: RQ niffy solution."
      I am still waiting for the debug load info to tell me the detailed cause of this issue.

      BR Alfred

    10. Yes, you're right: Reverting these three -vrq patches (thank you for naming them!) brings back normal behaviour. :-) Also for the case of resuming from suspend to disk everything works well.
      BR Manuel

    11. Hi, Alfred!
      Just wanted to come back and ask now, if you've found a new solution/fix (other than me reverting the patches).

      Best regards,
      Manuel

    12. Just replied in a new post. There will be a new solution in 3.19.

      BR Alfred
