Sunday, May 13, 2018

PDS 0.98p release

PDS 0.98p is released with the following changes:

1. Minor code optimization here and there.
2. Balance optimization. This fixes the regression in 0.98n that shows up when workload % < core number.
3. 32ms balance interval.

This is the last PDS release for the 4.16 kernel. Thanks to the benchmark testing by Pedro and Manuel, a regression in the 0.98n release was identified and fixed. In this release the balance interval also increases to 32ms, which should help with throughput. Let's test this setting until there is a new way to do balancing later (maybe in the next kernel release).

Enjoy PDS 0.98p for the v4.16 kernel :)

Code is available at
https://bitbucket.org/alfredchen/linux-gc/commits/branch/linux-4.16.y-pds
and also
https://github.com/cchalpha/linux-gc/commits/linux-4.16.y-pds

An all-in-one patch is available too.


PS: please apply the patch below if you don't have SMT enabled in your kernel config. If you don't know what's in your kernel config, you'd better apply it too. Thanks to Manuel for reporting.

 https://bitbucket.org/alfredchen/linux-gc/downloads/v4.16_pda098p_non_smt_fix.patch

45 comments:

  1. PS, the bitbucket repository size limit warning has popped up again. The bitbucket git repository will have to be dropped eventually. Here is the github repository with the all-in-one PDS patches, in case you don't know it yet.
    https://github.com/cchalpha/PDS-mq

  2. Thanks again already! If anyone is looking, the full patch part is:
    https://github.com/cchalpha/PDS-mq/blob/master/4.16/v4.16_pds098p.patch

  3. Replies
    1. Compiled and booted ok on Ryzen@home...
      Have not done any tests, I'm busy watching the ice hockey championship right now :)

      Br, Eduardo

  4. Thanks
    Here are the new benchmarks:
    https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit#gid=81881499

    Pedro

    Replies
    1. What's the lame x4 test? There is a huge regression I can't ignore.

    2. Oh sorry, it was a wrong copy paste between two sheets.
      Those were the results of lame x8 (see 4.16 sheet).
      I've fixed it.

      Pedro

    3. The result is still interesting. In my sanity test results, the 32ms balance interval change shows an improvement for all kinds of workload, but that doesn't seem to be the case in yours. In your benchmark results, the 098o + balance_optimization patch is the best of all.

      But I have to put all this aside and focus on the 4.17 sync-up; there are a lot of changes from mainline in this cycle. Let's resume this in the next kernel release.

  5. I still get high average deviations at 100%, 150% and 200% load with 098p, even worse than with 0.98o + balance_optimization.patch, which renders the collected data useless for comparison. So removing the systemd/cron related stuff didn't bring any improvement.
    Tonight, I'll check reliability with CFS.
    Any further suggestions are welcome!

    BR, Manuel Krause

    Replies
    1. My current average deviations of sanity testing (with the same unified testing conditions):
      make -j1: PDSo+B_O.patch 0,01%; PDSp 0,01%; CFS 0,00%
      make -j2: PDSo+B_O.patch 4,04%; PDSp 5,06%; CFS 1,94%
      make -j3: PDSo+B_O.patch 1,37%; PDSp 3,86%; CFS 0,58%
      make -j4: PDSo+B_O.patch 0,66%; PDSp 3,70%; CFS 0,05%
      make -j5: PDSo+B_O.patch 0,23%; PDSp 0,08%; CFS 0,03%
      make -j6: PDSo+B_O.patch 0,08%; PDSp 0,09%; CFS 0,02%
      (5 rounds for each make, the sum of real values matches the calculated date diff, with a consistent script overhead of ~3min over total ~520min for each sanity run ever) {compilations on /dev/shm, one round of spin-up make -j2, cron&systemd not affecting sanity}

      I absolutely don't know where to look for the source of the error:
      make? gcc7? Is it about the timers in my system? threadirqs=Y? MNATIVE=Y? Wrong time accounting of the real values (only) on my system?
      This really leaves me clueless -- it cannot be that (only) my system isn't suitable for this benchmark.

      BR, Manuel Krause

    2. @Pedro:
      Maybe I'm allowed to ask, but you don't need to do me the favour: Could you upload the .config of your test-compilation kernel and of your currently used one (if they're different) somewhere? I'm not quite sure that I've kept all necessary things included after my recent .config decluttering. Maybe the diff can give me some clues.

      BR, TIA, Manuel Krause

    3. These are the real values of my previous testings (conditions and deviations described above):
      make -j1: PDSo+B_O.patch 1439,25; PDSp 1439,17; CFS 1430,26
      make -j2: PDSo+B_O.patch 1192,66; PDSp 1054,13; CFS 922,24
      make -j3: PDSo+B_O.patch 919,41; PDSp 958,82; CFS 933,85
      make -j4: PDSo+B_O.patch 921,48; PDSp 940,69; CFS 972,77
      make -j5: PDSo+B_O.patch 916,52; PDSp 918,01; CFS 995,45
      make -j6: PDSo+B_O.patch 919,06; PDSp 918,24; CFS 1014,81

      Last night I simply wasn't able to decide what to test next and skipped it. The options are to further test the series of 4.16 PDS revisions, to test a 4.16.8 MuQSS kernel, or to test a 4.15.15 PDS with cachehot (with my old kernel .config).

      BR, Manuel Krause

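The real values above are written in German decimal notation. A minimal Python sketch, assuming one only wants to turn such strings into numbers and express one kernel's time relative to another, could look like this. The helper names `parse_decimal_comma` and `percent_difference` are hypothetical; the two input values are the `make -j6` entries for PDSp and CFS listed above.

```python
def parse_decimal_comma(text):
    """Convert a number written with a decimal comma (e.g. '918,24') to float."""
    return float(text.replace(",", "."))


def percent_difference(value, reference):
    """Relative difference of value against reference, in percent.

    Negative means 'value' is smaller, i.e. a shorter compile time.
    """
    return 100.0 * (value - reference) / reference


# make -j6 real values from the list above
pds_p = parse_decimal_comma("918,24")
cfs = parse_decimal_comma("1014,81")

print(f"PDSp vs CFS at make -j6: {percent_difference(pds_p, cfs):+.2f}%")
```

With the high deviations reported for some loads, such a percentage only means something when it is clearly larger than the spread of the measurements.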
    4. I've uploaded my kernel config here
      https://pastebin.com/cyN2x0aP
      It's based on Archlinux's one.

      I have no clue about your high deviation.
      Maybe try a single threaded benchmark, like lame, to compare the results.

      Pedro

    5. @Pedro:
      Many thanks for the .config, I'd have a look at it tomorrow.

      Doesn't the make -j1 on my dual-core already imply a single threaded compilation?

      BR, Manuel Krause

    6. And, folks?! Ready for some even more crazy deviations? ;-)

      This time from my old 4.15.15 PDSk + cachehot.patch (unified testing)
      make -j1: real 1443,34 avedev 0,00%
      make -j2: real 1032,92 avedev 10,15%
      make -j3: real 975,33 avedev 10,17%
      make -j4: real 918,88 avedev 0,14%
      make -j5: real 939,02 avedev 2,89%
      make -j6: real 988,17 avedev 6,55%

      I hope it's not too disgusting for you that I've left in the German decimal notation (and that I'm still not ready to present such beautiful charts as Pedro's).

      @Pedro: I had a longer look at your .config this afternoon. We're talking about ~490 entries that differ from my current .config, and yes, it's like looking for a needle in a haystack. I've now picked and changed a few options that at least sound like they could tame some things, and will give it a try tonight.

      Thanks,
      Manuel Krause

    7. Regarding the new settings, I haven't had any luck reducing the deviations. And I promise to stop filling this blog with useless data collections until I've really found an improvement (or to cancel this benchmarking project when I lose my patience).

      @Alfred:
      Do you by coincidence, like Pedro, have CONFIG_IRQ_TIME_ACCOUNTING=y in your current testbed system?

      TIA, Manuel Krause

    8. @Manuel
      I always have CONFIG_IRQ_TIME_ACCOUNTING=y in my system.
      From my experience, background workload, on-demand service requests, background FS activity, thermals and file cache status (that's why I use a "spin-up" round to "cache" files into system memory) all impact the final result.

    9. Maybe I've just had too narrow a focus. So far, I've only minimized the effects that you and Pedro describe(d), which is quite appropriate, and then bothered myself with fiddling with the _running_ kernel's .config.
      It may be that paying attention to the object of investigation, i.e. the make command and what follows from it, also pays off. In a way, Pedro already expressed this in one of his early postings on this topic. Thus, my humble idea for the coming night's sanity test is to compile the "object-kernel" as GENERIC rather than MNATIVE like the ("subject") running kernel to be benchmarked.

      BR, Manuel Krause

    10. What is your hardware?
      Maybe you can try intel pstate+performance or acpi-cpufreq+performance, to rule out the frequency governor.
      Also try lame. It usually has a lower standard deviation.

      On a side note, I've realized that one formula was wrong in my spreadsheet. I had mistaken average deviation for standard deviation.
      I've fixed it. I now use standard deviation for the spread.

      Pedro

    11. @Pedro:
      It always seemed that the P-State driver has no chance on my system. Thus, for some years now, I've been using acpi-cpufreq & the performance governor, and I write this ("performance") into the right places within a little startup script. This laptop is on AC anyway.

      Yes, you had used the avedev formula up to now, but I've had serious trouble translating those terms into German and back, so I let my libreoffice charts calculate the stdev values too, in addition to avedev. It makes quite a difference in % when deviations are high. Anyway, your data is still far better than mine with my deviations.

      Let's see, if I succeed tonight.

      BR, Manuel Krause

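The difference between average deviation and standard deviation discussed in the last two comments can be sketched in Python as follows. This is only an illustration: `average_deviation` and `relative_percent` are hypothetical helper names, the timings are made-up values rather than anyone's measurements, and the sample (n-1) standard deviation is assumed, as the spreadsheet STDEV function computes it.

```python
from statistics import mean, stdev


def average_deviation(samples):
    """Mean absolute deviation from the mean (what spreadsheets call AVEDEV)."""
    m = mean(samples)
    return sum(abs(x - m) for x in samples) / len(samples)


def relative_percent(spread, samples):
    """Express a spread value as a percentage of the sample mean."""
    return 100.0 * spread / mean(samples)


# Hypothetical wall-clock times (seconds) of five rounds of one workload
rounds = [1032.9, 1010.4, 1185.7, 998.2, 1041.3]

avedev_pct = relative_percent(average_deviation(rounds), rounds)
stdev_pct = relative_percent(stdev(rounds), rounds)

print(f"relative avedev: {avedev_pct:.2f}%")
print(f"relative stdev:  {stdev_pct:.2f}%")
```

For a sample that is not constant, the sample standard deviation is always larger than the average deviation, and a single outlier round widens the gap, so percentages from the two formulas should not be compared with each other.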
    12. Short info about the last 2 nights of testing: No improvements achieved, my relative standard deviations do still go up to 11%. (My best tested kernel in this regard was a 4.16.8 and CFS with 2,3% rel. stdev.)
      I really need to avoid my former mistake of changing too many parameters at a time, so my progress is evidently slow.

      @Pedro, @Alfred:
      Does either of you, by coincidence, use the "treadirqs" kernel command line parameter (which I am doing)?

      BR, Manuel Krause

    13. I meant "threadirqs", of course.

    14. BTW, the change from my former CONFIG_MNATIVE=y to CONFIG_MCORE2=y for my Core2Duo on the running "subject" kernel reduced the compilation times of the "object" kernel by 1,3% overall across the complete 517min of testing last night. Nothing else was changed.
      So MNATIVE does not pick up optimizations, at least regarding performance, on this machine.

      BR, Manuel Krause

    15. Mmmh, removing "threadirqs" from the kernel command line reduced compilation time by another 1.3%, but did not reduce the deviations satisfactorily.

      Does any of you have experience with or insight into the difference between /usr/bin/time and the shell's time command? Or do they use the same internals?

      Manuel

    16. This doesn't make any difference. Data quality remains bad with the same loads (100% and 150% load) and no overhead occurred.

      As a last resort I'll now update my system, which I've postponed for several weeks. I don't expect much from simply updating my gcc7 (7.3.1 20180323 [gcc-7-branch revision 258812]) -- would you suggest downgrading or upgrading gcc?

      BR, Manuel

    17. Thank you for nothing...
      gcc-8 is a real mess: No one can read the kernel compilation output, as "objtool: warning: ..." floods the screen. You may even need Con Kolivas' patch http://ck.kolivas.org/patches/4.0/4.16/4.16-ck1/patches/0016-Fix-Werror-build-failure-in-tools.patch
      to make it compile at all with gcc-8.

      But tonight, with this messy kernel, I'll benchmark the PDS p+ again. Beware, @Alfred: Maybe something in the non-SMT balancing is not as balanced or tamed as wished!

      BR, Manuel Krause

    18. I really want to apologize for my impoliteness last night. I was simply losing my religion, which is all the more problematic when one (me) never had a religion. ;-)
      'sanity' with gcc-8 provides the same bad data in the same places, with compilation time increased by 30mins / ~6% overall.

      BR, Manuel

    19. I don't use the "threadirqs" kernel boot parameter.
      I don't think it's the kernel config that causes the high deviations. I've done some tests in the past, and customizing the config has a low impact on throughput (I've not tested every option though).

      Try other benchmarks like compression or audio encoding to rule out gcc.
      For example : time -p xz -k -T 2 /dev/shm/linux-4.4.89.tar

      Pedro

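Repeating a benchmark command such as the xz example above and collecting the wall-clock time of each round could be sketched in Python like this. It is a minimal sketch, not the script anyone in this thread uses: `time_command` is a hypothetical helper, and the demonstration times a trivial Python invocation instead of a real workload.

```python
import subprocess
import sys
import time


def time_command(argv, rounds=5):
    """Run argv 'rounds' times and return the wall-clock seconds of each run.

    stdout is discarded; a non-zero exit status raises CalledProcessError,
    so a failing round cannot silently pollute the timings.
    """
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Trivial stand-in workload: start the Python interpreter and exit.
    print(time_command([sys.executable, "-c", "pass"], rounds=3))
```

For xz, note that `xz -k` leaves the compressed file in place and xz refuses to overwrite an existing output file, so that file has to be removed between rounds. Alternatively, `xz -T 2 -c <file>` writes to stdout, which this sketch discards; that slightly changes the I/O pattern compared with writing to /dev/shm.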
    20. Yes, @Pedro, you're absolutely right with your suggestion. It was rather irrational that I stuck with the 'make' testing for such a long time.
      I'll calibrate and adapt the script for tonight.

      Regarding the "threadirqs" parameter, I've made the subjective experience that it is beneficial for lower latency, although it implies a measurable performance impact (reliable ~1.3% on my machine with the 'make' sanity tests).

      BR, Manuel

    21. This morning surprised me with really pretty data quality. One can almost say it's beautiful: Relative standard deviations between 0.07 and 0.12% and a relative SEM of max. 0.03%. I've kept the tests close to Alfred's sanity/throughput tests, thus varying the -T parameter of xz for the different loads.
      Let's see the coming night's test with a different kernel, to find out whether it's suitable for comparing scheduler changes.

      Many thanks and BR,
      Manuel

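The two quality figures quoted above, relative standard deviation and relative SEM (standard error of the mean), can be computed from the rounds of one workload as in the following Python sketch. The function name `relative_stdev_and_sem` and the sample timings are hypothetical; the sample (n-1) standard deviation is assumed.

```python
from math import sqrt
from statistics import mean, stdev


def relative_stdev_and_sem(samples):
    """Return (relative stdev, relative SEM) of the samples, both in percent.

    SEM is the sample standard deviation divided by sqrt(n); both values
    are expressed relative to the sample mean.
    """
    m = mean(samples)
    s = stdev(samples)
    sem = s / sqrt(len(samples))
    return 100.0 * s / m, 100.0 * sem / m


# Hypothetical xz wall-clock times (seconds) from five rounds
rounds = [100.0, 100.1, 99.9, 100.05, 99.95]
rel_sd, rel_sem = relative_stdev_and_sem(rounds)
print(f"relative stdev: {rel_sd:.3f}%  relative SEM: {rel_sem:.3f}%")
```

Because the SEM shrinks with the square root of the number of rounds, more rounds tighten the estimate of the mean even when the per-round spread stays the same.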
    22. @Pedro:
      I hope you aren't bothered too much by my questions... Mmmh.

      When you colour your spreadsheets to signal the differences vs. max. stddev, am I right with my assumption, that you refer to the max. stdev for each separate population? (Meaning, for 'make -j1' the accordingly calculated one, and for 'xz' the other according one? Or do you use one overall max. deviation for all per kernel tests/ populations? I'm just not sure, what's the most correct way.)

      Another question: Is it possible to upload or import a libreoffice chart to google spreadsheets? From my experience, they at least use the same function names (my system is more or less en_us only except for decimals and dates).

      Just another one: You describe in your spreadsheets "within +- max std dev". This is either confusing or wrong. The stdev itself describes the whole spread around the average, so there is no need to express the "+-" (in my understanding, that means a doubled stdev is taken into account). Please correct me if I'm wrong or if you follow a different theory!

      BR, Manuel


    23. @Manuel

      Yes, max stdev means the maximum of the following two standard deviations: that of the reference, which is the first column (CFS 1000Hz), and that of the other kernel (for example PDS 1000Hz). And it is computed for each test.
      Averaging the stdev between all the tests wouldn't make sense.

      Yes, google sheets handles the conversion of function names, I believe. The charts are often messed up though. It's better to re-create charts in google sheets.

      When I wrote within '+- max stdev', it means for each test :
      "time reference kernel" - maxstdev < "time other kernel" < "time reference kernel" + maxstdev
      (with max stdev in seconds of course).

      It's a poor criterion for taking the uncertainty of the results into account.
      I want to choose a better criterion, but I can't find the time to read about statistics and think about it.

      Pedro

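The "within +- max stdev" criterion described in this comment can be written down as a small Python sketch. The function name `within_max_stdev` and the numbers in the example are hypothetical; all values are assumed to be in seconds.

```python
def within_max_stdev(ref_time, ref_stdev, other_time, other_stdev):
    """True when the other kernel's time lies strictly inside
    ref_time +/- max(ref_stdev, other_stdev), i.e. the two results
    are treated as indistinguishable under this criterion.
    """
    max_sd = max(ref_stdev, other_stdev)
    return ref_time - max_sd < other_time < ref_time + max_sd


# Hypothetical example: reference 100.0s (stdev 2.0s) vs. 101.5s (stdev 1.0s)
print(within_max_stdev(100.0, 2.0, 101.5, 1.0))
```

The check is applied per test, with the larger of the two standard deviations as the tolerance, which matches the statement that averaging stdev values across tests would not make sense.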
    24. @Pedro:
      Thank you very much for your further explanations. Together with what I've read yesterday it's really helpful to get it right.
      In the past nights I continued to test more kernels, and although I was at first enthusiastic about the low stdevs and SEMs with 'xz -T ?', this wasn't justified at all. When I began to compare the kernels, it turned out that the differences are so small that I can't stay with your approach of "+- max. stdev" to evaluate significance. After reading a bit more and doing some calculations with my data collection, I tend to use the 95% z-confidence interval approach, simply because it's commonly used. (A simple practical explanation can be found here: http://www.stat.wmich.edu/s216/book/node81.html)
      I've also noticed that there is a 'CONFIDENCE' formula in libreoffice/google spreadsheet, which makes things a bit easier, but one needs to take care to calculate the 'stdev' variable in there correctly for both sample means with
      "SQRT(stdev_value1^2 + stdev_value2^2)".

      But even with this more accurate evaluation I can only say from my tests, so far, the differences between PDS 098p, o and n are marginal (within confidence interval), although tendencies are visible, while the comparison with CFS is very significant (PDS being far better for loads >= 150% and slightly worse below).

      I'll need more tests and some time to make up a pretty spreadsheet to present.

      Thank you,
      Manuel

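The 95% z-confidence interval approach with the combined stdev SQRT(stdev_value1^2 + stdev_value2^2) mentioned above could be sketched in Python as follows. This is a sketch under stated assumptions: both kernels were measured with the same number of rounds `n`, the half-width is z * stdev / sqrt(n) as the spreadsheet CONFIDENCE function computes it, and the names `confidence_half_width` and `differs_significantly` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def confidence_half_width(stdev_a, stdev_b, n, level=0.95):
    """Half-width of the z-confidence interval for the difference of two
    sample means, each based on n rounds (equal n is assumed)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    combined = sqrt(stdev_a ** 2 + stdev_b ** 2)
    return z * combined / sqrt(n)


def differs_significantly(mean_a, stdev_a, mean_b, stdev_b, n, level=0.95):
    """True when the difference of the means exceeds the interval half-width."""
    return abs(mean_a - mean_b) > confidence_half_width(stdev_a, stdev_b, n, level)


# Hypothetical numbers: two kernels, 25 rounds each
print(confidence_half_width(3.0, 4.0, 25))
print(differs_significantly(100.0, 3.0, 103.0, 4.0, 25))
```

With only a handful of rounds per workload, the normal approximation is optimistic; a t-based interval would be wider, which is worth keeping in mind when differences are described as marginal.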
    25. @Alfred:
      I'm not done with the whole 'xz -T ?' benchmarking I'd like to present. Needs 4..5 more days.

      But I need to mention last night's testing of a 4.15.15+PDS098k+cachehot.patch: The latter is better than the current 4.16.12+PDS098p by 7..8% for all 6 workloads (50-300%)! No typos. Unfortunately, I don't have other suitable 4.15 kernels at hand atm. to cross-check (like CFS at least).

      BR, Manuel Krause

  6. @Alfred:
    I just noticed the following compile time WARNING:

    CC kernel/sched/pds.o
    kernel/sched/pds.c: In function ‘pds_trigger_balance’:
    kernel/sched/pds.c:3045:1: warning: control reaches end of non-void function [-Wreturn-type]
    }
    ^

    I don't have CONFIG_SCHED_SMT enabled.

    Seems like I haven't looked at the output carefully enough, last time I compiled 098p. Sorry for that.

    BR, Manuel Krause

    Replies
    1. Am I right, when only moving up the "#endif /* CONFIG_SCHED_SMT */" by one line, so the result looks like this? (Almost, due to missing whitespaces...)

      if (1 == rq->nr_running) {
      #ifdef CONFIG_SCHED_SMT
      pds_sg_balance(rq);
      #endif /* CONFIG_SCHED_SMT */
      return false;
      } else {
      return pds_load_balance(rq);
      }
      }
      #endif /* CONFIG_SMP */

      Hopefully this won't result in disaster^^ ;-)

      BR, Manuel Krause

    2. Important self-reply: No disaster. All seems to work fine. Although I don't know if I did it right. Manuel

    3. Thanks. I will publish this fix ASAP.

    4. Thanks for taking care of this. It also reassures me to have understood some parts of the underlying changes of your commit 31e1487a.

      This leads me to a follow-up question: My 2 cpu cores are HT capable, but my chipset is not. If I enabled the SMT option(s), how would the scheduler react? Would it find a "fake sibling" as if it were real? And/or would it only result in overhead due to confusion?

      BR, Manuel Krause

    5. Whatever imposes the limitation (bios, chipset, kernel config, kernel cmdline), you can use "dmesg | grep -i pds" to check the final cpu topology set up for the PDS scheduler.

    6. @Alfred:
      Ah, o.k. -- Then I can only add, that the detection algorithm is obviously working properly, as it shows the same result with and without CONFIG_SCHED_SMT=y on my system:
      pds: PDS-mq CPU Scheduler 0.98p by Alfred Chen.
      pds: cpu #0 affinity check mask - coregroup 0x00000002
      pds: cpu #1 affinity check mask - coregroup 0x00000001

      Thanks and BR, Manuel Krause

    7. BTW, even when adding "smt=2" to kernel cmdline for my SMT enabled kernel, nothing changes on my system. I've now compiled it out to reduce eventual/ possible overhead. Quite interesting though, and thanks for having this opportunity to verify it within PDS!

      BR, Manuel Krause

  7. Sorry for the late reply. I was busy testing my new customized board with an 8th gen intel cpu. It has a powerful bios which can be used to set up different performance settings for each core and, most importantly, is fully under my control. It may take 1 ~ 2 months before it becomes productive, but it's still exciting.

  8. I was just getting ready to build 4.16.11-rc1 and Manuel's #endif move, and noticed this in dmesg:
    [ 0.000000] pds: PDS-mq CPU Scheduler 0.98p by Alfred Chen.
    [ 0.210297] pds: cpu #0 affinity check mask - coregroup 0x00000002
    [ 0.210300] pds: cpu #1 affinity check mask - coregroup 0x00000001
    [ 3453.712773] CPU: 0 PID: 3493 Comm: systemd-sleep Not tainted 4.16.9-1-pds #1
    [ 3453.712773] CPU: 0 PID: 3493 Comm: systemd-sleep Tainted: G W 4.16.9-1-pds #1

    And then 117 more of that last message. Is this important?

    Replies
    1. It is just the cpu topology information that PDS is using, nothing more than this.
      The rest are your kernel warnings/errors; it just happens that you put pds into the kernel version information, so they are also printed in the grep output.
