Thursday, March 1, 2018

CacheHot experimental patch for PDS

Overview


In modern computer system design, the CPU cache plays an important role in performance, but it "becomes a resource which is transparently used and administered by the processors.", which makes it something of a "black hole" that is hard to measure, especially per task from the kernel's point of view. Despite this difficulty and limitation, it is worth a shot, based on recent test results from a prototype cache-hot patch for the PDS scheduler.

The Cache-Hot prototype patch tries to determine whether a task is cache hot when it wakes up, and then chooses a different method of selecting a CPU to run on based on that information. If the task is cache hot, it uses the accurate but more complicated method to choose a CPU that has affinity with the cache in which the task's data resides. If the task is cache cool, it simply chooses any CPU it is allowed to run on. In this way, the overhead of selecting a CPU is reduced.
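
The wake-up decision described above can be sketched in a few lines of C. This is only an illustration, not the code in the patch: the `toy_` structure and helpers are hypothetical, and the idea that "cache hot" is judged by counting context switches since the task last ran is an assumption inferred from the name SCHED_CACHE_HOT_SWITCHES_TH.

```c
#include <stdbool.h>

#define SCHED_CACHE_HOT_SWITCHES_TH (8UL)

/* Hypothetical, simplified task state; not the kernel's task_struct. */
struct toy_task {
    int last_cpu;            /* CPU the task last ran on (0..31 here) */
    unsigned long switches;  /* switches on last_cpu since then (assumed metric) */
    unsigned int allowed;    /* bitmask of CPUs the task may run on */
};

/* Assumed test: few switches since the task ran, so its data is likely still cached. */
bool toy_task_cache_hot(const struct toy_task *t)
{
    return t->switches < SCHED_CACHE_HOT_SWITCHES_TH;
}

/* Index of the lowest set bit, or -1 if the mask is empty. */
int toy_first_cpu(unsigned int mask)
{
    for (int cpu = 0; cpu < 32; cpu++)
        if (mask & (1u << cpu))
            return cpu;
    return -1;
}

/* llc_mask: CPUs that share a last-level cache with t->last_cpu. */
int toy_select_cpu(const struct toy_task *t, unsigned int llc_mask)
{
    if (toy_task_cache_hot(t)) {
        /* Costlier path: stay close to the cache holding the task's data. */
        if (t->allowed & (1u << t->last_cpu))
            return t->last_cpu;
        if (t->allowed & llc_mask)
            return toy_first_cpu(t->allowed & llc_mask);
    }
    /* Cache cool, or no cache-affine CPU is allowed: any allowed CPU will do. */
    return toy_first_cpu(t->allowed);
}
```

The point of the split is that the cheap path skips all cache-topology checks, which is where the saving in CPU selection overhead would come from.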

The kernel compilation test results can be downloaded from here. They show that overhead is reduced under heavy workload, and the reduction should also become visible under light workload once the measurement model is improved in the future.

Limitation

1. The preferred setting/formula in this patch may only work on the x86 architecture.
2. The preferred setting/formula in this patch may only work for certain workloads (Linux kernel compilation).

Given these limitations, and because this version is still a prototype, it is better not to merge it into PDS officially but to provide it as an experimental patch to try out.

Try out the patch

The patch can be downloaded from here; apply it on top of pds098k.

Then set SCHED_CACHE_HOT_SWITCHES_TH in pds.c to the reference value given by the formula below, according to your CPU topology:

SCHED_CACHE_HOT_SWITCHES_TH = 8 * (LLC_SIZE / 3) * (1.8 / CPU_SPEED) * (4 / NUM_CPUS)

LLC_SIZE is the last-level cache size in MB
CPU_SPEED is the CPU speed in GHz
NUM_CPUS is the number of logical CPUs (hyper-threads included)
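
As a worked example, the formula can be evaluated with a small C helper. This is a minimal sketch: the function names are made up for illustration, and rounding to the nearest integer is an assumption, since the formula itself does not say how to turn the result into the integer constant that pds.c expects.

```c
#include <assert.h>

/* Reference value from the formula above, before rounding. */
double cache_hot_switches_th(double llc_size_mb, double cpu_speed_ghz,
                             unsigned int num_cpus)
{
    assert(cpu_speed_ghz > 0.0 && num_cpus > 0);
    return 8.0 * (llc_size_mb / 3.0) * (1.8 / cpu_speed_ghz)
               * (4.0 / (double)num_cpus);
}

/* Round to the nearest integer (assumed rounding rule). */
unsigned long cache_hot_switches_th_rounded(double llc_size_mb,
                                            double cpu_speed_ghz,
                                            unsigned int num_cpus)
{
    return (unsigned long)(cache_hot_switches_th(llc_size_mb, cpu_speed_ghz,
                                                 num_cpus) + 0.5);
}
```

For a CPU with a 3 MB LLC, 1.8 GHz and 4 logical CPUs, every factor in parentheses is 1 and the result is 8. For a 3 MB LLC at 2.26 GHz with 2 logical CPUs, the unrounded value is about 12.74, which rounds to 13.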

Feel free to try values other than the reference one, and different kinds of workload. Your feedback is welcome.

What's next

Don't worry too much about SCHED_CACHE_HOT_SWITCHES_TH; it is just a simple measurement in the prototype. Another model is in progress and will be available in the next version.

32 comments:

  1. @all & @Alfred:
    Can someone of you, please, ease me from my pain of confusion?

    My example calculation is hopefully correct for my "Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz" (Penryn) and I used the 3M L2 Cache as LLC:
    * SCHED_CACHE_HOT_SWITCHES_TH = 8 * (LLC_SIZE / 3) * (1.8 / CPU_SPEED) * (4 / NUM_CPUS)
    * = 8 * ( 3 MB / 3) * (1.8 / 2.26 GHz ) * (4 / 2 )
    * = 8 * 1 * 0.7965 * 2 = 12.7434

    How should the line in pds.c:
    #define SCHED_CACHE_HOT_SWITCHES_TH (8UL)
    look like with it?
    I'm really sorry that I'm not more used to programming.

    And, another question comes up: Will it be useful on a dual-core without true HT capabilities? OTOH, maybe it'll be a real benefit?

    Thank you in advance for the little help,
    and best regards,

    Manuel Krause

    Replies
    1. @Manuel
      In your case, just replace the 8UL with 13UL.

    2. @Alfred:
      Many thanks for the clarifications, also those for jwh7.
      I need to observe it for longer uptime. So far, I can only tell that all is working very well for me (13UL). Atm. I don't (and you shouldn't) trust my subjective impression that it has a positive impact on overall interactiveness (KDE, Xorg, window & Firefox tabs switching e.g.) -- can be due to the low uptime. But my first experience is positive.

      Many thanks for your work and BR,

      Manuel Krause

    3. @Alfred:
      Also, after ~30h uptime with one hibernation (TOI) over-night all works well. No aging effect at all for my system. I have no imagination on what knob your patch really turned, but it's working. {No benchmarking from my side, so others' experience may differ.}

      BR, Manuel Krause

  2. @Alfred, should the function use round() or ceiling()? Doesn't matter in my case (8*1/3*1.8/2.6*4/2 = 3.69231 -> 4UL), but it will for some. :) For my netbook though...will this be transparent for UP, or should I remove the patch for it?

    Replies
    1. Also, I presume LLC is the per-core cache? My Athlon X2 is listed with L2: "2 x 1 MB 16-way set associative exclusive caches", with which cpuinfo and cpuid concur.

    2. SCHED_CACHE_HOT_SWITCHES_TH should be an unsigned long long, but anyway, you can try 3 or 4.
      For UP, there is another ttwu function, so it should be fine. Currently, I am not thinking much about UP and 32-bit.
      Yes, as you said, LLC size should be the "per-core" cache size. That's 1MB in your case.

    3. I meant your formula, not code. :) For example if one's calculation came out to 7.4, should they opt to use:
      round(7.4) = 7
      ...or:
      ceiling(7.4) = 8
      ? I suppose your answer would be to test both though. :)

    4. Both should be fine. Just pick the one you like best, and don't spend too much time on this. I am going to drop this threshold in the next version; in the final version, thresholds should be filled in automatically from CPU information populated during kernel boot-up, with no need to set them manually (at least for the x86 arch).

    5. @jwh7
      For my system, with @Alfred's suggestion of (13UL) -- which would be the ceiling() value -- all is working better than before. So, the formula is quite well chosen for a "prototype". Maybe he'd reveal more secrets (with)in his next patch-evolution. ^^

      Have you tried it so far?

      BR, Manuel Krause

    6. @Manuel
      :) No secrets at all. I once wanted to finalize the threshold, but testing took a lot of time and the results were not very stable. Then I realized that the measurement in this patch is not accurate, so I released this prototype patch for testing and moved on to the next version.
      I got the first test results of the v0.2 cachehot patch yesterday, and they look good across all kinds of workload. But I do need more time to test stability and tune the parameters (there are now 6 parameters instead of the single one in the prototype). So it may take some time before I officially release it.

    7. @Manuel, yes I have this running on my Athlon X2 x64 PC and now also my Celeron M (UP,630 MHz overclocked to 981) netbook. In your case, round and ceiling are the same. Ceiling just means 'always round up' while floor is 'always round down'. :)

    8. @jwh7
      O.k., thank you for the info. Dunno if I'd learn more programming by following you folks and asking quite dumb questions. ;-) But at least a little (chance) remains.

  3. Thanks Alfred.

    I did some throughput & interbench tests with PDS 0.98k.
    https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing

    PDS really shines.
    I'm not sure I understand the interbench test though.

    Pedro

    Replies
    1. @Pedro
      Good work. The results really show the improvements made in 098k (compared with the results in 4.14).

  4. Thanks Alfred.


    I used these settings for i5 and i7, excellent work!


    /* One chain, so the macro is defined exactly once per build. */
    #if defined(CONFIG_MBROADWELL)
    #define SCHED_CACHE_HOT_SWITCHES_TH (5UL)
    #elif defined(CONFIG_MWESTMERE)
    #define SCHED_CACHE_HOT_SWITCHES_TH (6UL)
    #else
    #define SCHED_CACHE_HOT_SWITCHES_TH (8UL)
    #endif

  5. Hi I got a small regression with kernel building cpu is a 5960x (20 logical cores)
    make -j20 -s 667.38s user 69.40s system 1453% cpu 50.692 total
    Everything fine if I increase the jobs count:
    make -j30 -s 720.80s user 72.45s system 1698% cpu 46.717 total

    Replies
    1. Yes. There may be some regression under light workload (<=100%) in this prototype. Maybe you can try a heavy workload like 300%. Feedback is welcome.

    2. At the moment I only have access to my quad core so here are the results from that:

      make -j8 577.67s user 40.03s system 336% cpu 3:03.44 total
      make -j16 719.03s user 48.02s system 648% cpu 1:58.31 total
      make -j24 781.63s user 51.74s system 761% cpu 1:49.50 total

  6. ups I got the CPU wrong in the first post it's a 7900X but here are the 300% and 200% results

    make -j60 -s 763.05s user 79.45s system 1737% cpu 48.501 total
    make -j40 -s 748.63s user 78.72s system 1731% cpu 47.777 total

    Replies
    1. Thanks for sharing. But without a comparison, it's not possible to tell whether there is an improvement.

    2. @Alfred:
      And you... do you have efforts with your v0.2(^X) new parameters testing?
      Me, just being curious and hopeful, but feeling no need to rush, as your first suggestion is still working very well. I, for myself, can't even imagine how much work it may mean to make CacheHot fit for most of all cpus.

      Thank you for your great ideas and implementations anyway,
      and BR,
      Manuel Krause

    3. @Manuel
      The most challenging work is finding the best-fit cache-hot measurement model, where "best fit" means the balance between accuracy and calculation cost. I worked on another model yesterday, which is much more accurate, but it turned out to add much more calculation overhead. But, yes, I still need tests to tune the model's parameters and see whether the model is good or not.

  7. So here are kernel benchmarks with and without the patch on my quad core


    with patch
    make -j8 -s 571.11s user 39.90s system 329% cpu 3:05.16 total
    make -j16 -s 732.30s user 49.59s system 684% cpu 1:54.24 total
    make -j24 -s 780.68s user 50.81s system 757% cpu 1:49.72 total

    without patch
    make -j8 -s 581.09s user 38.85s system 295% cpu 3:29.52 total
    make -j16 -s 741.42s user 48.10s system 702% cpu 1:52.34 total
    make -j24 -s 768.42s user 49.50s system 725% cpu 1:52.71 total

    Replies
    1. Thanks for sharing. That really shows the improvement, and that this "cachehot" idea is worth working on. Currently, I am retesting the version 0.2 measurement model and trying to work out a formula for the primary parameters. Once that is done, I will release the new patch.

  8. @Alfred:
    "Long time no read"... I really hope you're well and healthy!
    Any help we can offer, e.g. for preliminary testings of v0.2/3/4?

    Best regards,
    Manuel Krause

    Replies
    1. I have worked on a v0.2, but the test results change from time to time, so there must be something wrong in the model.
      Currently, I am working on the v4.16 sync-up and putting some solid changes on top of it. The new version of the cachehot patch will be based on the v4.16 kernel.

    2. @ Alfred:
      Thank you for notifying us! And it's good to read that you continue your good work on this PDS.
      I really don't want to beg for it (but in fact it's what I'm doing now... ): Would it be possible for you to backport your refined CacheHot solution to 4.15 kernels, once you found it? I'd really appreciate (experimental) testing on the current kernel, thus not mixing with possible 4.16 related differences vs. 4.15.

      Never mind, if my argument appears too weak for you to convince you.

      Anyway best regards, and many thanks for your work!
      BR, Manuel Krause

  9. Those changes from PDS 0.98h (base 4.14.16) to 0.98k with CacheHot patch (base 4.15.9) are impressive !

    Finally I can write text, listen to music, watch videos, etc. while e.g. compiling chromium (= heavy load on the CPU) and that without even having to renice portage to 19 !

    Before that there always had been issues with occasional delays in text input, stutters, lags (e.g. random interruptions of sound output) - but so far I haven't encountered them.

    This needs further observation but I'm pretty optimistic ;)

    Great work !

    It's getting closer to interactive response & latency of rt-kernels without actually having to use a rt-kernel :)

    Replies
    1. A clarification:

      It's probably not only the CacheHot changes and going from 4.14 to 4.15 but I haven't tried to use the system for quite some time while compiling e.g. chromium since that always was mutually exclusive with (multi-tasking) work on the system

    2. @kernelOfTruth:
      Nice to read that you, joining the party a little late ;-), are also experiencing the positive effects you're reporting. On here it's almost the same picture after longer observation, even when utilizing 'dumb' players like flash and/or having high disk i/o (on here with bfq-mq + pf's picked patches for it). I also can't pin it down to a particular step, but I don't need to IMO, as the result in terms of responsiveness and stability is proof enough for me.

      Enjoy it and BR,
      Manuel Krause

    3. @kernelOfTruth
      Good to hear your feedback. I think the overhead cutting in the 098i release may play a major part in your observation.
