Alfred Chen's Blog: VRQ 0.3 updates

Wednesday, December 31, 2014

VRQ 0.3 updates

On the last day of 2014, I would like to announce the 0.3 update of the VRQ solution for BFS.

It's almost a rework, done to address the lock dependency issues caused by the introduction of the rq lock in VRQ. I would like to do more cleanup during the 3.18 release; if it goes well, it will be moved to the -gc branch.

Here are the test results for CFS, original BFS (labeled 0460 below), Baseline (the code VRQ is based on) and VRQ. They show that after three releases, VRQ can be better than original BFS in all kinds of workloads and comparable to CFS (again? If I remember correctly, BFS was better than CFS in the compile test some releases ago). But most important of all, VRQ opens the opportunity for further improvements; I will give these new ideas a try next year.

Happy New Year 2015.

#1 50% task/cores ratio workload test

3.18_CFS_50
5m18.385s
5m17.836s
5m17.783s
5m17.765s
5m17.730s
5m17.663s
5m17.596s
5m17.566s
5m17.487s
5m17.455s

3.18_0460_50
5m20.448s
5m20.384s
5m20.374s
5m20.286s
5m20.245s
5m20.192s
5m20.166s
5m20.155s
5m20.150s
5m20.093s

3.18_Baseline_50
5m17.975s
5m17.552s
5m17.512s
5m17.492s
5m17.485s
5m17.482s
5m17.480s
5m17.475s
5m17.457s
5m17.342s

3.18_VRQ_50
5m17.350s
5m17.339s
5m17.318s
5m17.297s
5m17.284s
5m17.282s
5m17.248s
5m17.220s
5m17.209s
5m17.145s

#2 100% task/cores ratio workload test

3.18_CFS_100
2m50.047s
2m50.037s
2m49.825s
2m49.750s
2m49.749s
2m49.744s
2m49.706s
2m49.703s
2m49.673s
2m49.633s

3.18_0460_100
2m51.933s
2m50.640s
2m50.431s
2m50.424s
2m50.421s
2m50.386s
2m50.362s
2m50.267s
2m50.248s
2m50.129s

3.18_Baseline_100
2m51.862s
2m50.527s
2m50.506s
2m50.400s
2m50.370s
2m50.361s
2m50.283s
2m50.282s
2m50.189s
2m50.146s

3.18_VRQ_100
2m50.944s
2m49.812s
2m49.797s
2m49.700s
2m49.683s
2m49.672s
2m49.653s
2m49.649s
2m49.613s
2m49.506s

#3 150% task/cores ratio workload test

3.18_CFS_150
2m53.382s
2m53.366s
2m53.328s
2m53.326s
2m53.326s
2m53.310s
2m53.307s
2m53.262s
2m53.208s
2m53.127s

3.18_0460_150
2m57.280s
2m56.710s
2m56.124s
2m55.860s
2m55.843s
2m55.725s
2m55.646s
2m55.643s
2m55.597s
2m55.582s

3.18_Baseline_150
2m57.100s
2m55.907s
2m55.796s
2m55.788s
2m55.755s
2m55.749s
2m55.740s
2m55.736s
2m55.732s
2m55.726s

3.18_VRQ_150
2m55.449s
2m53.168s
2m52.112s
2m51.898s
2m51.649s
2m51.539s
2m51.527s
2m51.371s
2m51.272s
2m51.270s

#4 100%+50%(IDLE) task/cores ratio workload tests

3.18_CFS_100_50
2m55.069s
2m53.714s
2m53.638s
2m53.586s
2m53.474s
2m53.466s
2m53.313s
2m53.280s
2m53.196s
2m53.187s

3.18_0460_100_50
2m50.730s
2m50.713s
2m50.651s
2m50.620s
2m50.615s
2m50.598s
2m50.560s
2m50.548s
2m50.460s
2m50.457s

3.18_Baseline_100_50
2m50.826s
2m50.691s
2m50.683s
2m50.632s
2m50.601s
2m50.574s
2m50.555s
2m50.549s
2m50.549s
2m50.507s

3.18_VRQ_100_50
2m49.822s
2m49.799s
2m49.766s
2m49.706s
2m49.673s
2m49.631s
2m49.623s
2m49.587s
2m49.583s
2m49.564s
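
The run times listed above can be compared more easily once they are converted to seconds. The following is a minimal sketch, not part of the original test setup: `parse_time` and `summarise` are hypothetical helper names, and only two of the 50% series are copied in for illustration.

```python
import re
from statistics import mean, median


def parse_time(text):
    """Convert a time string such as '5m18.385s' to seconds (float)."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def summarise(samples):
    """Return min, median and mean (in seconds) for a list of time strings."""
    secs = [parse_time(s) for s in samples]
    return {"min": min(secs), "median": median(secs), "mean": mean(secs)}


# Two series from test #1, copied verbatim from the listings above.
runs = {
    "3.18_0460_50": [
        "5m20.448s", "5m20.384s", "5m20.374s", "5m20.286s", "5m20.245s",
        "5m20.192s", "5m20.166s", "5m20.155s", "5m20.150s", "5m20.093s",
    ],
    "3.18_VRQ_50": [
        "5m17.350s", "5m17.339s", "5m17.318s", "5m17.297s", "5m17.284s",
        "5m17.282s", "5m17.248s", "5m17.220s", "5m17.209s", "5m17.145s",
    ],
}

for name, samples in runs.items():
    stats = summarise(samples)
    print(
        f"{name}: min={stats['min']:.3f}s "
        f"median={stats['median']:.3f}s mean={stats['mean']:.3f}s"
    )
```

The other series can be added to `runs` in the same way to compare all four kernels per workload.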

23 comments:

Anonymous, January 2, 2015 at 9:58 PM
Can I omit "3b9cc00 bfs: xxxx_schedule() stat debug" or is it needed? Sounds like debugging overhead.

Thanks for your continued work on BFS+ and, yes, I wish you a Happy New Year,

Manuel
Anonymous, January 2, 2015 at 10:05 PM
BTW, have you already reviewed the patches from 20141231 in http://ck.kolivas.org/patches/bfs/3.0/3.18/pending/ ?
Would they be useful with 3.18-vrq, too? At least "bfs460-locked-pluggedio.patch" doesn't apply correctly, as far as I can see.

Best regards, Manuel
Anonymous, January 3, 2015 at 12:40 AM
My first observation with 3.18.1 + BFQ + 23 BFS-VRQ-branch patches (omitting patch 23 of 24, "3b9cc00 bfs: xxxx_schedule() stat debug") is that my two CPU cores aren't loaded equally. I see it when watching the gkrellm CPU0/1 charts while doing things.
At that moment, the main things running are a worldcommunitygrid client in the background, a firefox-esr with 110 open tabs and an smplayer playing an .avi movie.

One of the two cores shows approx. 50% of the NON-IDLE load of the other. Funnily, if I quit and restart firefox, the higher load can end up on either cpu0 or cpu1.
After quitting the low-priority wcg client, the core that had less NON-IDLE load before then shows 50% more load than the other.

The normal BFS always tried to balance these loads equally. Maybe there's something going wrong?

Best regards, Manuel
Alfred Chen, January 5, 2015 at 10:24 PM
In both the 3.17 and 3.18 -vrq branches, I can observe a kind of imbalanced workload among cpus, but it doesn't continue forever as you said. I will re-check this on the original BFS kernel tonight.

As for the imbalanced behaviour, I think I can explain it this way. The cpu usage shown in htop is a statistic. For example, take a single-threaded task that occupies 10% of a cpu. During htop's calculation window, say 2 seconds, if it stays on cpu0 for 1 second and on cpu1 for 1 second, htop reports 5% usage on each cpu, which looks balanced. From the scheduler's point of view, switching a task among cpus is not a good idea (though, in my opinion, on your hardware the 2 cores share the same LLC, so switching between these two cpus is cost free). So in the ideal case this task stays on one cpu for the whole 2 seconds, which results in 0% usage on one cpu and 10% on the other, and that looks imbalanced in htop.
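
The arithmetic in this example can be sketched in a few lines. This is only an illustration of the accounting described above, not how htop is implemented; `per_cpu_usage` and the constants are hypothetical names.

```python
def per_cpu_usage(busy_seconds, window):
    """Map each cpu to its usage in percent over one sampling window.

    busy_seconds: dict of cpu name -> seconds the task ran on that cpu.
    window: length of the sampling window in seconds.
    """
    return {cpu: 100.0 * busy / window for cpu, busy in busy_seconds.items()}


WINDOW = 2.0              # sampling window from the example, in seconds
RUNTIME = 0.10 * WINDOW   # a task needing 10% of a cpu runs 0.2 s per window

# Case 1: the task migrates and spends half of the window on each cpu,
# so its runtime is split evenly between them.
migrating = per_cpu_usage({"cpu0": RUNTIME / 2, "cpu1": RUNTIME / 2}, WINDOW)

# Case 2: the scheduler keeps the task on cpu1 for the whole window.
sticky = per_cpu_usage({"cpu0": 0.0, "cpu1": RUNTIME}, WINDOW)

print("migrating:", migrating)
print("sticky:   ", sticky)
```

The total cpu time consumed is the same in both cases; only its distribution across the per-cpu meters differs.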

As for your new discovery, I don't have suspend-to-disk working, so I tested with suspend (to RAM) and resume; both foreground and background load are the same after resume.

Two additional questions:
1. How does wcg place its work units? Can they be fixed to a certain cpu, or does it not care? mprime, for example, just starts a number of threads and doesn't care which cpu/core they run on.
2. What behaviour do you see for your usage in 3.17 and 3.18 with pure BFS?
Anonymous, January 9, 2015 at 9:51 PM
O.k., now I've found time to test a bit and to answer.
I've tested your baseline 3.18.y-gc (instead of "pure" BFS as you proposed, since I use your .y-gc on 3.17.x, too).
Please also note that I permanently observe the two cpu cores via gkrellm, where I see in colours how much idle load vs. normal+system load occurs at the moment. Here the interval is very small. I understand that the observed "balance" is not a real balance, but a statistical one.
With your baseline patches, on the kernels 3.17.8 & 3.18.2, the 2 cores show a (virtually) equal load over both short and long periods. While I can agree with your explanation of the unbalanced behaviour in general, the VRQ patched 3.18.2 doesn't equalise correctly at all.
I run wcg with BOINC at default settings. It seems to fetch and queue work units as required and hands a new one to a cpu core whenever that core has no active work unit. In the BOINC Manager I can temporarily suspend tasks and see that this is true.
With the VRQ 0.3 patched 3.18, the firefox task settles completely on CPU0 (first core) over time, while CPU1 (second core) then only serves wcg work unit 2. Starting a kernel compile with "make -j1" or "-j2" shows that CPU1 doesn't do more than half of what CPU0 does. After I stopped all wcg clients, the main desktop+firefox load switched over to CPU1, and after that "make -j1" and "-j2" showed equal load on both cores. When I quit the compilation and started wcg again, the firefox load went back to CPU0. And over this series of tests (of course, without rebooting), with each test CPU1 becomes less able to take over "normal" or "system" tasks. For kernel compilation: CPU0 at full load, CPU1 only showing some peaks.

Conclusion: There's something causing a severe imbalance within your VRQ (only).

Best regards, Manuel