The https://gitlab.com/alfredchen/projectc/-/blob/master/5.11/prjc_v5.11-r3.patch seems to be an empty file.
Anyway, thank you for your work :)
Thanks for reporting that. I have fixed it and rewrote the release notes.
@Eduardo:
Hi,
I've seen in
https://gitlab.com/alfredchen/projectc/-/issues/21
that you use PDS from PrjC with 5.11.11.
Do you see advantages of PDS vs. BMQ ?
TIA,
Manuel
Yes, I've been using PDS for some time now. The reason, if I remember correctly, was some edge cases: when there is a large load on the system, the electron apps were just slow with BMQ. That includes Atom, Skype, MS Teams and maybe chromium too.
It's been a while now, so I can't say 100% for sure, but I remember very slow scrolling with them under load.
So I tried PDS and I think it's better, as I don't recall any slowdowns, but I can't guarantee that I ran the same load with PDS too.
Since PDS has been revived, I'll probably stick with it. I'll try out BMQ too if there are some adjustments.
BR,
Eduardo
Thank you for your explanations, Eduardo!
This encourages me to try PDS as soon as the current issue-fix testing is done.
(I've also seen some very rare mouse lagging under heavy load, but atm. I use both a PS/2 Trackman + keyboard via one USB adapter, and I sometimes test different performance levels, so maybe it's not directly comparable.)
Am I allowed to ask which adjustments you'd like to see in BMQ?
BR, Manuel
It's hard to say about adjustments. As a user, I would like to see everything running smoothly :)
Alfred is right, all frameworks are probably designed with the standard scheduler in mind and behave the way they do, but I need those apps to work properly anyway :)
About features: these days we have plenty of CPUs, and I have this idea about running an app on dedicated cores. Say I want an app, like a game, to run on 4 dedicated CPUs for maximum speed and no interference from other tasks. I would like every other process to move off those 4 CPUs, leaving only the game with its child processes there, and the game's processes / threads should be properly scheduled within those 4 dedicated CPUs. If a new task spawns while the game is running, it should not get scheduled on those 4 dedicated CPUs.
When the game ends, I would like all 4 CPUs back in action for the rest of the tasks as before.
This doesn't have to be all automatic, I'm just talking about the idea :)
None of the current solutions I'm aware of provide this, and I don't know if it's even feasible or whether it would work well from the scheduler's perspective :)
I have not thought about how much performance can be gained this way. If the gain looks very small and the work to be done is way too much, let's leave this as an idea :)
Maybe Alfred could comment on this theoretical idea and whether something like this even makes sense.
BR,
Eduardo
Good topic here. I'd like to join and reply one by one.
For BMQ vs PDS: BMQ uses the most idealized algorithm, which is based only on the task's priority. It works fine on a small system. But on a large system running applications that were designed for the mainline Linux scheduler, it may run into priority issues.
PDS is based on priority and deadline, which has more overhead than BMQ, but by design it guarantees none of the priority issues BMQ has.
I am using PDS on my system right now, just because it is a newcomer and needs more attention. BMQ should be fine on my small system.
Also, I recently had an idea to replace the skiplist in PDS; I believe it will reduce some of the overhead compared to BMQ. I need to think it through carefully before starting on it.
For Eduardo's idea, the current kernel already provides some mechanism for this purpose. Please check the isolcpus kernel parameter. I need to double-check whether this is supported in the Project C schedulers, but the basic idea is to reserve some CPUs using isolcpus and then use the set_cpus_allowed_ptr() API to assign these CPUs to the tasks dedicated to them.
On my side PDS plays very well and I don't see much CPU utilization in my usual use on the new machine (4 cores, 8 threads). BMQ seems to use a little more. OTOH, I haven't stressed PDS with heavy use cases so far.
Eduardo's idea of dedicating tasks (or tasks + their child processes) to CPUs/threads (or groups of them) sounds very interesting to experiment with.
Unfortunately I apparently have neither enough knowledge of nor practical experience with the CGROUPS subsystem, which I thought up to now would implement such possibilities (e.g. with CPUSETS)?
Or is this just not yet supported by BMQ / PDS?
Enlighten me, please! :-)
Manuel
;-) This overlapped... different people thinking about the same topic at the same time.
Manuel
The file /usr/src/linux/Documentation/admin-guide/cgroup-v1/cpusets.rst provides info and useful examples.
Atm. I'm trying to migrate userspace chromium to a dedicated cpuset.
ALUHEAD:~ # cd /sys/fs/cgroup/cpuset/
ALUHEAD:/sys/fs/cgroup/cpuset # mkdir Charlie
ALUHEAD:/sys/fs/cgroup/cpuset # cd Charlie
ALUHEAD:/sys/fs/cgroup/cpuset/Charlie # /bin/echo 6-7 > cpuset.cpus
We need some memory attached, otherwise it won't work at all and fail with:
/bin/echo: write error: No space left on device
So:
ALUHEAD:/sys/fs/cgroup/cpuset/Charlie # /bin/echo 0 > cpuset.mems
Let's see if it works out of the box.
Thank you for your inspiration!
BR, Manuel
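The manual steps above could be collected into a small shell function. This is only a sketch: the function name make_cpuset and the CPUSET_ROOT variable are made up for illustration, and it assumes the cgroup-v1 cpuset hierarchy is mounted at /sys/fs/cgroup/cpuset as in the session above.

```shell
#!/bin/sh
# Hypothetical helper: create a cgroup-v1 cpuset and assign CPUs and a memory node.
# CPUSET_ROOT is an assumed override so the function can be tried on a scratch directory.
CPUSET_ROOT="${CPUSET_ROOT:-/sys/fs/cgroup/cpuset}"

make_cpuset() {
    name="$1"     # directory name of the new cpuset, e.g. Charlie
    cpus="$2"     # CPU list, e.g. 6-7
    mems="${3:-0}" # memory node list, defaults to node 0
    dir="$CPUSET_ROOT/$name"

    mkdir -p "$dir" || return 1
    # Both files must be filled before tasks can be attached; otherwise
    # attaching fails with "No space left on device".
    echo "$cpus" > "$dir/cpuset.cpus" || return 1
    echo "$mems" > "$dir/cpuset.mems" || return 1
}
```

As root, "make_cpuset Charlie 6-7" would then repeat the session shown above in one call.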
So...
Somehow it seems to work. I've kicked chromium over to CPUs 5-6, and most of the utilization from in-browser video playback now appears there. The multiple child processes also seem to spawn there.
So far I wasn't able to remove previously set up cpusets, and I also noticed that other processes may appear in the cpusets of processes that have quit (e.g. an older chromium session).
Maybe "/bin/echo 1 > cpuset.cpu_exclusive" can be of help.
These may just be newbie troubles with this topic.
BR, Manuel
Followup: Failure is most likely due to my misuse of one "/bin/echo $$ > tasks" in a root konsole for testing one or two directories.
Unfortunately the process is not killable and neither are the cpusets removable.
Manuel
Yeah, cool! The settings for dedicated CPUs also survive hibernation/ suspend without issues.
This is with v5.11.15 + PDS & the recent pending fix patches.
@Eduardo: Your desired functionality already seems to work properly with CGROUPS & CPUSETS. If you don't make my newbie errors...
Now I'd only have to script something for bash to take the "pidofproc" output and echo it into /sys/fs/cgroup/cpuset/XYZ/tasks.
Manuel
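Such a script might look roughly like the sketch below. The function name move_to_cpuset and the CPUSET_ROOT override are invented for this example. Note that the cgroup-v1 tasks file takes one ID per write and moves a single thread; the cgroup.procs file in the same directory moves a whole process with all its threads.

```shell
#!/bin/sh
# Hypothetical helper: attach the given PIDs to an already existing cpuset.
# CPUSET_ROOT is an assumed override so the function can be tried on a scratch directory.
CPUSET_ROOT="${CPUSET_ROOT:-/sys/fs/cgroup/cpuset}"

move_to_cpuset() {
    name="$1"; shift
    tasks="$CPUSET_ROOT/$name/tasks"
    for pid in "$@"; do
        # One ID per write: the kernel does not accept a list in a single write.
        echo "$pid" > "$tasks" || echo "could not move PID $pid" >&2
    done
}
```

With pgrep from procps installed, something like "move_to_cpuset XYZ $(pgrep chromium)" would feed it the PIDs; the process name here is only an example.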
I already tried the isolcpus parameter together with nohz_full quite some time ago, and the game ended up a jittery mess :) It microstuttered a lot.
In addition to that, tasks on isolated CPUs do not get balanced, so several of them could end up on the same CPU if not placed carefully by hand.
I should have been better prepared for this, at least on the mainline kernel :)
I did not even look into cpusets/cgroups as I was under the impression that PDS/BMQ did not support them :) However, Manuel says it's working :)
So yesterday I had a brief moment of time available and I tried cpusets/cgroups on PDS using "cset shield". It sort of worked, but according to htop some tasks were still using the cores I had isolated. I have to test this more, of course.
But before that, I have a question for Alfred: are cpusets/cgroups fully supported in BMQ/PDS, so that we can safely use CPU shielding?
BR,
Eduardo
cpusets/cgroups are not fully supported; most of it is a dummy API implementation.
isolcpus should work out of the box, as I just checked. And indeed, it requires users to place tasks on the CPUs carefully by hand.
Mmmh, then I don't understand why the basic functionality with the commands shown above works for me:
To reassure myself, I just tested limiting an avidemux_qt5 video recoding process to CPU threads 4-7 with the cpuset interface (without external apps and without touching isolcpus). Looking at the gkrellm display, more than 95% of the CPU load went to threads 4-7. I have no other explanation than that it works.
Or, have I misunderstood something?
BR, Manuel
CPU affinity should be basic functionality which is inherited from the parent; I believe that's why it works.
Without isolcpus, other tasks, which inherit their affinity from init (pid 0), will still be able to run on CPUs 4~7.
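One way to look at the inheritance described here is to compare the affinity list of a shell with that of a child it starts. A rough sketch, assuming a Linux /proc; the function name allowed_cpus is made up for this example:

```shell
#!/bin/sh
# Print the CPUs a task may run on, as reported in /proc/PID/status.
# The argument may also be the literal word "self", which then refers to
# the process doing the reading (a freshly started child of the shell).
allowed_cpus() {
    sed -n 's/^Cpus_allowed_list:[[:space:]]*//p' "/proc/$1/status"
}
```

Calling "allowed_cpus $$" for the current shell and "allowed_cpus self" for a new child should print the same list, and the same function can be pointed at any other PID to see whether a task is still allowed on CPUs that were meant to be reserved.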
I was able to run some tests yesterday with cpusets with mainline, mainline + nohz_full, PDS and BMQ.
My findings from yesterday and the day before are in line with what Alfred said. Basically "cset shield" sets the affinity for the processes, they inherit the affinity from their parent, and that helps with the overall situation.
About the tests: I simply ran Unigine Valley with and without cpusets. All kernels were built with a 500Hz timer frequency.
The fastest was BMQ (this is in line with my previous tests). To my actual surprise, BMQ + nohz_full + cpusets gave the best result, which contradicts my previous findings with nohz_full. Maybe I just messed stuff up previously :)
Next fastest was PDS, then mainline and mainline + nohz_full.
Mainline gained about 2% from cpusets; BMQ / PDS gained less, their results were closer together. Even nohz_full performed quite well with BMQ, even Doom Eternal was smooth AF with nohz_full. I'll try using BMQ + nohz_full by default now for testing purposes.
Please note that I tested just one single-threaded benchmark. I would have to run many more benchmarks, and other tasks in parallel, to find out whether cpusets actually influence the results in a meaningful way.
I'm afraid I won't have that much time for that in the foreseeable future.
So I'll be field testing nohz_full + BMQ on day-to-day tasks, compilations and sometimes games with cpusets (I don't play much, but sometimes I do).
BR,
Eduardo
@Eduardo:
Thank you for your work!
Excuse me, but can it be that you've forgotten to mention the cpusets setup for your benchmarking tasks?
If not clarified, your results look like a bunch of appreciated spring flowers. :-D
BR, Manuel
I have a Ryzen 1700, which has 2 CCXs (core complexes with separate L3 caches), so I just isolated the second CCX using "sudo cset shield --cpu 4-7,12-15 --kthread=on".
Then, if needed, change the cpuset ownership to your user.
I ran my tests using "cset shield -e somesupercommand".
When using nohz_full, I passed kernel parameter nohz_full=4-7,12-15 too.
BR,
Eduardo
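Since the same CPU list goes both to "cset shield" and to nohz_full, it can help to spell the list out once to see exactly which logical CPUs it covers. A small sketch; the helper name expand_cpulist is made up here and is not part of cset:

```shell
#!/bin/sh
# Hypothetical helper: expand a kernel-style CPU list such as "4-7,12-15"
# into one CPU number per output line.
expand_cpulist() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        [ -n "$first" ] || continue
        last="${last:-$first}"   # a single CPU like "3" has no range end
        i="$first"
        while [ "$i" -le "$last" ]; do
            echo "$i"
            i=$((i + 1))
        done
    done
}
```

For the list above this gives eight logical CPUs: 4 to 7 and 12 to 15.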
@Eduardo:
The cset command set is a real mess. Who decided to put commands and docs into approx. 570 pieces for only one purpose? :-((
@Eduardo:
That definitely wasn't meant to sound impolite. (I was just overwhelmed by the number of different man pages for cset.)
I, of course, thank you very much for your advice and information!
BR, Manuel
> CPU affinity should be basic functionality which is inherited from the parent; I believe that's why it works.
> Without isolcpus, other tasks, which inherit their affinity from init (pid 0), will still be able to run on CPUs 4~7.
Yes, this is what I can observe here. Setting /sys/fs/cgroup/cpuset//cpuset.cpu_exclusive to 1 isn't sufficient to isolate the CPU -- it only isolates the processes and their children. At this point I can't understand the remark in kernel-parameters.txt regarding isolcpus: "[Deprecated - use cpusets instead]".
I haven't read all relevant(?) web info regarding cpusets yet... But I caught one idea, to put all processes into separate cpusets, like containers, plenty of possibilities (e.g. 0-1 for base processes, 2-3 for browsing, 4-... etc.).
Maybe the most stupid question of the year, but the most important for me: How do I put init (PID 0) into a limited cpuset ?
TIA and BR,
Manuel
> Maybe the most stupid question of the year, but the most important for me: How do I put init (PID 0) into a limited cpuset ?
isolcpus can do this. It kicks in at a very early stage, in sched_init_smp(); the code below sets the CPUs of init (pid 0).
	/* Move init over to a non-isolated CPU */
	if (set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_FLAG_DOMAIN)) < 0)
		BUG();
Another possible way is the "auto group" feature: maybe the limited CPUs can be set at that time, or that group can be controlled later, after system boot-up.
Just to get your concept right: as an example, adding "isolcpus=2-7" to the kernel command line would leave CPUs 0 & 1 open for init and its children? And the further cpusets can then be configured later (either with cpusets directly or with the more convenient cset)?
SCHED_AUTOGROUP 'Depends on: !SCHED_ALT [=y]', and in addition to this, its current concept doesn't sound convincing for Eduardo's and my intentions, IMO.
Let's see how far I get when I'm allowed to reboot again, after the issue/23 testing has finished. :-)
Many, many thanks to you both, @Eduardo and @Alfred, for this discussion. I've learned quite a lot about a topic that I have been interested in for a long time. Nice to have you here!
BR, Manuel
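After rebooting with such a parameter, the kernel's own view of the isolated CPUs can be read back from sysfs. A minimal sketch, assuming the file /sys/devices/system/cpu/isolated is present on the running kernel (not verified here for Project C builds); the function name and the ISOLATED_FILE override are invented for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the CPUs the running kernel reports as isolated.
# ISOLATED_FILE is an assumed override so the function can be tried on a test file.
ISOLATED_FILE="${ISOLATED_FILE:-/sys/devices/system/cpu/isolated}"

isolated_cpus() {
    if [ -r "$ISOLATED_FILE" ]; then
        cat "$ISOLATED_FILE"
    else
        echo "no isolated-CPU file at $ISOLATED_FILE" >&2
        return 1
    fi
}
```

An empty line as output would mean that no CPUs are isolated.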
@Alfred & @Eduardo:
I don't know if these threads (here) about isolating CPUs with Project C get much attention over time.
Refining it over the last weeks, I've got a more or less simple setup working, combining the needed "isolcpus" kernel command line parameter, "cset shield", some "/sys/fs/cgroup/cpuset" manipulations and some short scripts to migrate processes to the desired cpuset.
Should I write a summary of our discussion of this topic into the "Issues" section @gitlab, including examples of scripts/commands that do work?
I still have further questions regarding this topic, and maybe others do as well. E.g. whether "cpuset.cpu_exclusive", "cpuset.sched_relax_domain_level" and "cpuset.sched_load_balance" have any effect with current Project C.
Wouldn't it be good to have this topic over there?
BR,
Manuel