PDS 0.99b is released with the following changes:
1. Sync with mainline 4.19 scheduler changes.
This is the first PDS release for 4.19; it contains just sync-up changes.
Enjoy PDS 0.99b for the v4.19 kernel. :)
Code is available at https://gitlab.com/alfredchen/linux-pds
All-in-one patch is available too.
PS: from now on, the linux-pds git repository will contain PDS-only commits on top of the mainline code, which should make it a little easier to pick out the pure PDS commits.
Compiles and boots OK here, thanks.
Boots fine for me also.
But I have a problem with the nvidia driver.
If I use the X compositor stuff from Plasma, I get soft lockups.
softirq trace (not really helpful): https://pastebin.com/Q2FdmA7J
kernel config: https://pastebin.com/wQLxtXvj
I "ported" muqss which works fine with the same config.
Hi. Can you upload your ported MUQSS somewhere?
Do you have the same problem with PDS on 4.18 kernels?
And how does it work with the mainline CFS scheduler?
Also, I notice it happened on cpu#17. What does your system (CPU) look like? Could you give the output of "dmesg | grep -i pds"? It will help me understand the CPU topology setup. Thanks.
It's an Intel 10-core (7900X).
Here is the PDS output:
https://pastebin.com/tWJwYwHN
Here is the synced ck patch for 4.19:
https://jki.io/ck-4.19.patch.bz2
Thank you very much, sir.
Works very well, thanks again.
I get the bug also with 4.18.16, with only the PDS patch applied, to exclude other sources of bugs.
https://pastebin.com/SNYwe1i1
I can trigger this just by letting mpv run for some time.
I haven't seen a "soft lockup" issue for a very long time. To narrow down the causes, here are some tests you can try:
1. If your CPU is overclocked, please revert the overclock and test again.
2. Try "mpv --hwdec=no --vo=null --ao=null xxxxxx"; this will most likely avoid using your GPU and use the CPU only. Let's see whether it is nvidia related or not.
I enabled some debug options; now I get a better stack trace:
https://pastebin.com/g91aig7w
Seems to be yield related; the nvidia OpenGL implementation is using sched_yield.
I think you can try different yield_type values at /sys/kernel/yield_type and see which one helps.
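For example, something like this minimal sketch flips the value (it uses the /sys/kernel/yield_type path mentioned in this thread, needs root, and the value written is only an example):

#include <stdio.h>

/* Minimal sketch: read the current yield_type and switch it.
 * The path is the one discussed in this thread; the value written
 * here (2) is just an example -- try 0, 1 and 2 as suggested. */
int main(void)
{
    FILE *f = fopen("/sys/kernel/yield_type", "r+");
    int current;

    if (!f) {
        perror("open /sys/kernel/yield_type");
        return 1;
    }
    if (fscanf(f, "%d", &current) == 1)
        printf("current yield_type: %d\n", current);

    rewind(f);
    fprintf(f, "2\n");   /* example: switch to type 2 */
    fclose(f);
    return 0;
}

A plain echo from a root shell does the same job, of course.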
Setting the env variable __GL_YIELD=NOTHING seems to do the trick. No more soft lockups, but I will report back if I hit the issue again; it seems to be somewhat random.
I just hit the issue again, so __GL_YIELD is also a no. I switched back to MuQSS for the time being; maybe I'll get the issue there as well. I'm out of ideas for fixes; it's just hanging somewhere in the nvidia driver.
Haven't you tried different yield_type values in /sys/kernel/yield_type?
Maybe remove the nvidia proprietary driver (if used) and try using nouveau?
I tried yield_type=0, no effect. The problem is I can't get a good backtrace of the bug even with gdb, because nvidia doesn't have debug symbols compiled in. nouveau would be worth a try if the power consumption is not higher. I will experiment a bit to find the root cause of the problem.
nouveau is most likely not an option: no x265 hardware decode acceleration. So I have to track down the issue with PDS or use MuQSS for the time being.
There are 0, 1 and 2 for yield_type. There is a long story behind yield_type; you can search my blog for it.
I know the yield_type settings, but I haven't needed to adjust it for a while. I tried all settings except 3.
On my system I have defaulted it to 2, for reasons I don't remember anymore.
This is from the MuQSS developer:
-----
This determines what type of yield calls to sched_yield will perform.
0: No yield.
1: Yield only to better priority/deadline tasks. (default)
2: Expire timeslice and recalculate deadline.
Previous versions of MuQSS defaulted to type 2 above. If you find behavioural regressions with any of your workloads try switching it back to 2.
-----
Br, Eduardo
2 seems to work for now.
OK, I ran the system for a while and triggered the bug again. The nvidia driver seems to be using sched_yield in a weird way; regardless of yield_type 0, 1 or 2, the bug occurs. At least I have now found a way to trigger it every time, so testing for the bug is easier.
Here is the story about sched_yield in PDS, FYI:
https://cchalpha.blogspot.com/2017/12/pds-098h-release.html
@Jan Killius
When the soft lockup happens, does it have any impact, or is it just a panic log in dmesg?
There is nothing in the panic log besides the soft lockup. But I tracked the issue down: I had X running with the SCHED_ISO scheduling policy, and this seems to trigger the behavior.
Here are my test results for what triggers the bug:
I'm using https://pavelfatin.com/typometer/ with a 1ms delay and alacritty; this seems to trigger the bug fast.
SCHED_ISO breaks every time; X hangs forever.
SCHED_RR gets 1000ms+ hangs with 1.
SCHED_NORMAL works every time.
Changing the nice level or attaching with gdb unhangs X.
Not using SCHED_ISO for X is my workaround for now.
The 1 next to SCHED_RR is the yield_type.
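For reference, the workaround above boils down to putting X back on the normal policy; here is a minimal sketch using sched_setscheduler(2) (the PID is passed on the command line, and in userspace headers SCHED_NORMAL is spelled SCHED_OTHER; schedtool or chrt do the same from a shell):

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Minimal sketch: put a running process (e.g. the X server) back on the
 * normal policy. Pass the target PID as the only argument. */
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    pid_t pid = (pid_t)atoi(argv[1]);
    struct sched_param sp = { .sched_priority = 0 }; /* must be 0 for SCHED_OTHER */

    printf("old policy: %d\n", sched_getscheduler(pid));
    if (sched_setscheduler(pid, SCHED_OTHER, &sp) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("new policy: %d\n", sched_getscheduler(pid));
    return 0;
}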
Is the bug also in MuQSS?
Let's leave aside the question of why an application uses yield() to give away CPU time. When a task uses yield() to give away CPU time, from the scheduler's point of view there is no guarantee the task will be scheduled out, especially on a system with multiple cores. Take JK's 10C20T CPU for example: when the system workload is idle or low, most likely each task can occupy a core of its own. Even if the task calls yield() and schedule() is invoked, from the scheduler's (PDS's) view there is no other task with higher priority (most likely no pending task at all) on any other CPU run queue, so the task calling yield() is still the next task to run. This defeats the expectation behind yield(), and if it keeps happening, the watchdog pops up with a "soft lockup". This also explains why a higher-priority task calling yield() is more likely to hit a soft lockup.
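As a toy illustration of that failure mode (this is not the nvidia code, just a self-contained spin-on-yield loop), consider:

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

/* Toy illustration, NOT the nvidia driver: a thread spins on sched_yield()
 * while waiting for a flag. If no higher-priority task is pending on any
 * run queue, the scheduler just picks this thread again, so the "wait"
 * is really a busy loop -- the situation described above. */
static atomic_int ready;

static void *spinner(void *arg)
{
    unsigned long yields = 0;

    (void)arg;
    while (!atomic_load(&ready)) {
        sched_yield();          /* "give the CPU away" -- only if someone better is waiting */
        yields++;
    }
    printf("spun through %lu yield() calls before the flag was set\n", yields);
    return NULL;
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, spinner, NULL);
    sleep(1);                   /* nothing else runnable: the spinner keeps getting re-picked */
    atomic_store(&ready, 1);
    pthread_join(t, NULL);
    return 0;
}

On an otherwise idle many-core box the spinner is simply re-picked after every sched_yield(), so it burns a core instead of giving it away.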
PDS and MuQSS have many differences. One that matters here: when schedule() is called, PDS will only select a pending task from another CPU's run queue when it has higher priority than the one from the local run queue, while MuQSS selects the highest-priority or lowest-deadline task among all run queues. This difference gives MuQSS a better chance of switching to another task when yield() is called.
For your issue, IMO, lowering the priority of the task which is calling yield() seems to be the solution. I don't want to bring CFS's dirty skip_buddy logic into PDS to guarantee the yield(), nor the MuQSS task selection logic, just to increase the chance of yielding to another task.
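As a self-contained toy model of that difference (this is not the real PDS or MuQSS source; the tasks, priorities and deadlines are made up purely to mirror the description above):

#include <stdio.h>

/* Toy model of the selection difference described above -- NOT the real
 * PDS/MuQSS code. Two tasks with equal priority, different deadlines:
 * the local one is the yielder, the remote one waits on another CPU. */
struct task { const char *name; int prio; long deadline; }; /* higher prio wins; lower deadline wins */

static struct task local_task  = { "yielder", 10, 100 };
static struct task remote_task = { "other",   10,  50 };

/* PDS-like rule: keep the local pick unless a remote task has strictly higher priority. */
static struct task *pick_pds(void)
{
    return (remote_task.prio > local_task.prio) ? &remote_task : &local_task;
}

/* MuQSS-like rule: best priority across all queues, ties broken by earlier deadline. */
static struct task *pick_muqss(void)
{
    if (remote_task.prio != local_task.prio)
        return (remote_task.prio > local_task.prio) ? &remote_task : &local_task;
    return (remote_task.deadline < local_task.deadline) ? &remote_task : &local_task;
}

int main(void)
{
    printf("PDS-like rule picks:   %s\n", pick_pds()->name);    /* yielder runs again */
    printf("MuQSS-like rule picks: %s\n", pick_muqss()->name);  /* switches away from the yielder */
    return 0;
}

With equal priorities, the PDS-like rule re-picks the yielder, while the MuQSS-like rule switches to the task with the earlier deadline.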
Thanks for the explanation. The problem was just hard to track down, but a workaround is enough for me; it's a special case of a high core count plus the proprietary nvidia driver.
Everything else works really great. PDS is a bit faster (~5%) than MuQSS for kernel compiling.
@Jan Killius
Thanks for the effort in reporting and testing this issue. It helps with understanding sched_yield() in a better way.
PS: it's nice to have a high core count CPU. PDS is designed to scale well with high core count CPUs (it will hit the first wall at >64 CPUs), but I never get a chance to test that.
I am wondering if I am struggling with a somewhat similar problem with the 4.19 kernel too.
I use Wine + dxvk (d3d11 -> Vulkan translation), and a recent patch made my test benchmark do some "hiccups". Not using schedtool -I (SCHED_ISO) seemed to help, but I did test some tweaks to both "yield_type" and "rr_interval" without any change.
The DXVK dev can't really figure out why his patch would cause this "1-second freeze" for me (and possibly one more user.. both with nVidia proprietary drivers).
I do not get a "soft lockup" like the above, so it might be totally unrelated.. but seeing as scheduler/yield MIGHT be a culprit in a "high load w/nvidia" situation... perhaps?
Will do some more testing though...
PS: For what it's worth, when running Wine with PDS vs MuQSS, I notice PDS prefers the "real cores", whereas MuQSS loads the hyperthreads a lot more. MuQSS also gives a slightly lower score than PDS. And in my book, it's well worth taxing the "real cores" a lot harder before offloading onto hyperthreads. CFS (it's been a while) I think divided everything more or less equally between "real" cores and "hyperthread" cores. (If that makes sense.)
After doing some more tests, it turns out that for some weird reason using SCHED_ISO causes those hiccups. Using the default I don't experience this.
I lose a wee bit of performance vs. only using "schedtool -n -5", but compared to random 1-second freezes I guess it's a win. Unknown reason why this happens, but it MIGHT be a driver issue..
I see a new nVidia driver came out last night (410.73), but I kinda need the Vulkan stuff from 396.54.09, so I won't test whether that helps anything, though.
@Sveinar
Just some quick comments about your new issue before my bedtime.
SCHED_ISO has higher priority than SCHED_NORMAL, and most tasks (even kernel threads) run as SCHED_NORMAL. Putting too many tasks at ISO will leave less CPU for the rest of the system.
So my best practice is: let a task run as NORMAL unless it has been tested to run as ISO without any side effects.
@Alfred
Not sure what "too many tasks" would be, but the only thing I will run with SCHED_ISO is Wine. I don't have a habit of running "everything" with that, because scheduling everything as realtime kind of defeats the purpose. :)
No, the bug isn't in MuQSS.
OK, thanks.