Alfred Chen's Blog: CacheHot experimental patch for PDS

Thursday, March 1, 2018

Overview

In modern computer system design, the CPU cache plays an important role in performance, but it "becomes a resource which is transparently used and administered by the processors", which makes it something of a black hole: hard to measure, especially per task and from the kernel's point of view. Despite the difficulty and the limitations, it is worth a shot, given the recent test results of a prototype cache-hot patch for the PDS scheduler.

The CacheHot prototype patch tries to determine whether a task is cache hot when it wakes up, then selects a CPU for it using a different method depending on the answer. If the task is cache hot, the patch uses the accurate but more complicated method, choosing a CPU with affinity to the cache where the task's data resides. If the task is cache cool, it simply picks any CPU the task is allowed to run on. In this way, the overhead of selecting a CPU to run on is reduced.
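The wake-up decision described above can be sketched in plain C. This is not code from the patch: the struct, the function names, and the 64-CPU bitmask are hypothetical. The hotness test (few context switches on the task's last CPU since it went to sleep) is only a guess suggested by the name SCHED_CACHE_HOT_SWITCHES_TH, and returning the last CPU is a stand-in for the accurate, cache-affine search.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; the patch's macro is SCHED_CACHE_HOT_SWITCHES_TH. */
#define CACHE_HOT_SWITCHES_TH 8UL

/* Hypothetical per-task bookkeeping, not the kernel's task_struct. */
struct task_sketch {
	int last_cpu;                    /* CPU the task last ran on (0..63) */
	unsigned long switches_at_sleep; /* that CPU's switch count when the task slept */
	uint64_t allowed_mask;           /* bit n set: task may run on CPU n */
};

/* Assumption: a task is "hot" if its last CPU switched tasks only a few times since. */
static bool task_is_cache_hot(const struct task_sketch *t,
			      unsigned long cpu_switches_now)
{
	return cpu_switches_now - t->switches_at_sleep < CACHE_HOT_SWITCHES_TH;
}

/* Cheap path for cache-cool tasks: the lowest-numbered allowed CPU, or -1. */
static int first_allowed_cpu(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (1ULL << cpu))
			return cpu;
	return -1;
}

/* Hot tasks go back to their last CPU if allowed; everything else takes the cheap path. */
static int select_cpu_on_wakeup(const struct task_sketch *t,
				unsigned long cpu_switches_now)
{
	if (task_is_cache_hot(t, cpu_switches_now) &&
	    (t->allowed_mask & (1ULL << t->last_cpu)))
		return t->last_cpu;
	return first_allowed_cpu(t->allowed_mask);
}
```

The point of the split is that the cheap path skips any topology search, so cache-cool tasks pay almost nothing for CPU selection.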

The kernel compilation test results can be downloaded from here. They show that the overhead is reduced under heavy workload; the reduction should also become visible under light workload if the measurement model is improved in the future.

Limitation

1. The preferred setting/formula in this patch may only work on the x86 architecture.
2. The preferred setting/formula in this patch may only work for certain workloads (Linux kernel compilation).

Considering these limitations, and that this version is still a prototype, it is better not to merge it officially into PDS but to provide it as an experimental patch for people to try out.

Try out the patch

The patch can be downloaded from here; apply it on top of pds098k.

Then adjust SCHED_CACHE_HOT_SWITCHES_TH in pds.c to the reference value given by the formula below, according to your CPU topology:

SCHED_CACHE_HOT_SWITCHES_TH = 8 * (LLC_SIZE / 3) * (1.8 / CPU_SPEED) * (4 / NUM_CPUS)

LLC_SIZE is the last-level cache size in MB
CPU_SPEED is the CPU speed in GHz
NUM_CPUS is the number of logical CPUs (HT threads count)
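
As a worked example, the reference formula can be evaluated with a few lines of C. This is only a sketch: the function names are hypothetical and not part of the patch, and rounding to the nearest integer is an assumption, since the post does not say how a fractional result should be turned into the integer the macro expects.

```c
/*
 * Reference value from the formula above, as a floating-point number.
 * llc_size_mb: last-level cache size in MB
 * cpu_speed_ghz: CPU speed in GHz
 * num_cpus: number of logical CPUs (HT threads count)
 */
static double cache_hot_switches_ref(double llc_size_mb, double cpu_speed_ghz,
				     unsigned int num_cpus)
{
	return 8.0 * (llc_size_mb / 3.0) * (1.8 / cpu_speed_ghz) * (4.0 / num_cpus);
}

/* Assumption: round to the nearest integer to get a value usable in the macro. */
static unsigned long cache_hot_switches_th(double llc_size_mb, double cpu_speed_ghz,
					   unsigned int num_cpus)
{
	double ref = cache_hot_switches_ref(llc_size_mb, cpu_speed_ghz, num_cpus);

	return (unsigned long)(ref + 0.5);
}
```

For a 3 MB LLC at 2.26 GHz with 2 logical CPUs, the formula gives about 12.74, so under the round-to-nearest assumption the line in pds.c would read #define SCHED_CACHE_HOT_SWITCHES_TH (13UL). A 3 MB LLC at 1.8 GHz with 4 logical CPUs reproduces the default of 8.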

Feel free to try values other than the reference one, and on different kinds of workload. Your feedback is welcome.

What's next

Don't worry too much about SCHED_CACHE_HOT_SWITCHES_TH; it's just a simple measurement in the prototype. Another model is in progress and will be available in the next version.

32 comments:

Anonymous, March 2, 2018 at 6:04 AM
@all & @Alfred:
Can someone of you, please, ease me from my pain of confusion?

My example calculation is hopefully correct for my "Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz" (Penryn) and I used the 3M L2 Cache as LLC:
* SCHED_CACHE_HOT_SWITCHES_TH = 8 * (LLC_SIZE / 3) * (1.8 / CPU_SPEED) * (4 / NUM_CPUS)
* = 8 * ( 3 MB / 3) * (1.8 / 2.26 GHz ) * (4 / 2 )
* = 8 * 1 * 0.7965 * 2 = 12.7434

How should the line in pds.c:
#define SCHED_CACHE_HOT_SWITCHES_TH (8UL)
look like with it?
I'm really sorry that I'm not more used to programming.

And, another question comes up: Will it be useful on a dual-core without true HT capabilities? OTOH, maybe it'll be a real benefit?

Thank you in advance for the little help,
and best regards,

Manuel Krause
jwh7, March 3, 2018 at 3:36 AM
@Alfred, should the function use round() or ceiling()? Doesn't matter in my case (8*1/3*1.8/2.6*4/2 = 3.69228 -> 4UL), but it will for some. :) For my netbook though...will this be transparent for UP, or should I remove the patch for it?
Anonymous, March 3, 2018 at 2:25 PM
Thanks Alfred.

I did some throughput & interbench tests with PDS 0.98k.
https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing

PDS really shines.
I'm not sure I understand the interbench test though.

Pedro
Andy Lavr, March 3, 2018 at 9:47 PM
Thanks Alfred.

I used these settings for i5 and i7, excellent work!

#if defined(CONFIG_MBROADWELL)
#define SCHED_CACHE_HOT_SWITCHES_TH (5UL)
#elif defined(CONFIG_MWESTMERE)
#define SCHED_CACHE_HOT_SWITCHES_TH (6UL)
#else
#define SCHED_CACHE_HOT_SWITCHES_TH (8UL)
#endif
Anonymous, March 4, 2018 at 5:59 AM
Hi, I got a small regression with kernel building; the CPU is a 5960X (20 logical cores):
make -j20 -s 667.38s user 69.40s system 1453% cpu 50.692 total
Everything fine if I increase the jobs count:
make -j30 -s 720.80s user 72.45s system 1698% cpu 46.717 total
Unknown, March 5, 2018 at 12:37 PM
Oops, I got the CPU wrong in the first post; it's a 7900X. Here are the 300% and 200% results:

make -j60 -s 763.05s user 79.45s system 1737% cpu 48.501 total
make -j40 -s 748.63s user 78.72s system 1731% cpu 47.777 total
Anonymous, March 7, 2018 at 11:55 AM
So here are kernel benchmarks with and without the patch on my quad core

with patch
make -j8 -s 571.11s user 39.90s system 329% cpu 3:05.16 total
make -j16 -s 732.30s user 49.59s system 684% cpu 1:54.24 total
make -j24 -s 780.68s user 50.81s system 757% cpu 1:49.72 total

without patch
make -j8 -s 581.09s user 38.85s system 295% cpu 3:29.52 total
make -j16 -s 741.42s user 48.10s system 702% cpu 1:52.34 total
make -j24 -s 768.42s user 49.50s system 725% cpu 1:52.71 total
Anonymous, March 17, 2018 at 9:44 AM
@Alfred:
"Long time no read"... I really hope you're well and healthy!
Any help we can offer, e.g. for preliminary testings of v0.2/3/4?

Best regards,
Manuel Krause
kernelOfTruth, March 30, 2018 at 2:03 PM
Those changes from PDS 0.98h (base 4.14.16) to 0.98k with CacheHot patch (base 4.15.9) are impressive !

Finally I can write text, listen to music, watch videos, etc. while e.g. compiling chromium (= heavy load on the CPU) and that without even having to renice portage to 19 !

Before that there always had been issues with occasional delays in text input, stutters, lags (e.g. random interruptions of sound output) - but so far I haven't encountered them.

This needs further observation but I'm pretty optimistic ;)

Great work !

It's getting closer to interactive response & latency of rt-kernels without actually having to use a rt-kernel :)