1. Remove resched_closest_idle() as planned in the previous release post. I haven't received feedback yet, but since more reports are coming in, both calls are removed in this release. The modified commit is
da50716 bfs: Full cpumask based and LLC sensitive cpu selection, v4
2. Fix the update_cpu_load_nohz() redefinition error when CONFIG_NO_HZ_COMMON is not defined. This was missed in the 4.2 sync up. The modified commit is
75fd9b3 bfs: [Sync] 4.2 sync up v2
The new -gc branch can be found at bitbucket and github. An all-in-one patch that includes all bfs related changes (*NOT* all commits in the -gc branch) is also provided.
Have fun with this new -gc release; a -vrq branch update is incoming.
BR Alfred
Booted OK for me, thanks.
I've sent those idle-related patches to the user who reported the panic (remember the bunch of screenshots). Still waiting for a response from him.
@post-factum:
Thank you very much for still tracking it!!!
Greets,
Manuel
Reported as fixed.
@Alfred:
Can you please provide, on short notice, a revert patch for your "resched_closest_idle" version based upon your current -gc?
I've looked into the changed code and see that you've changed much more than you've described, so I'm unable to revert it on my own.
The reason for writing this: TuxOnIce fails. But it did not fail with your former 4.2-bfs patches/branch.
As there may be other things causing misbehaviour on here (my opensuse upgrade and the 'new' gcc-4.9), I would be glad to exclude at least one point from the list.
Thank you in advance,
Manuel
BTW, the -vrq for 4.1 was much more failsafe, regarding TuxOnIce, IMO.
Manuel
Hmm, TOI fails for me as well: it simply does not write anything to disk :(.
In my case it did write on suspend but didn't read from disk at resume time. :-(
This afternoon I've collected the related functions to revert the "resched_closest_idle" removal in current -gc and manually adjusted them into a patch. I've uploaded it to http://pastebin.com/FCe7H6ar
(Some of the hunks' numbers may be inaccurate; I hope that the rest is o.k.)
As Alfred hadn't decided whether to remove the first, the second, or both "resched_closest_idle" calls, this evening I've done tests by commenting out one or the other call with "//". The result on here was: keeping both or either one of the calls made my system resume correctly with TuxOnIce.
After reading post-factum's posting here, I rechecked the kernel version without the patch (== pure -gc), and the wrong behaviour has vanished.
So now I'm absolutely clueless about how to proceed on here. I hope that one of you can review the patch and/or also test the possible cases.
Best regards,
Manuel Krause
I really don't know what this issue depends on:
Just did another reboot + uptime + hibernation with the patch-less -gc kernel (of course always with BFQ and with TOI)... and it failed once again at TOI resume.
To trigger this issue on here, some uptime is needed and some amount of data in the swap partition.
BR, Manuel
@Manuel
Sorry for the late reply. Just back from a trip. I haven't quite caught up with your issue yet, but I uploaded the 4.2 -vrq branch before I left for the trip last week, so you can give it a try.
Did you mean 'next' week? Otherwise I don't understand your message, as I don't see -vrq for 4.2 anywhere.
BR, Manuel
O.k., thank you Alfred, for making the 4.2-vrq repositories accessible.
I'm currently testing it, and the first two TOI hibernations succeeded. No other problems so far.
BR, Manuel
Reverting gc_v4.2_0463_2 didn't help, now trying to revert all BFS-related commits.
Umm, no, the TOI failure is definitely not BFS-related. I've reverted everything related to BFS and didn't get TOI working. Sorry, will debug TOI itself.
I really don't want to discourage you from debugging TOI, but your experience definitely contradicts my positive experience with several successful TOI resumes (none had failed) when using Alfred's previous 4.2 sync up revision gc_v4.2_0463_1 (plus BFQ). But it may also be that I haven't tested it long enough.
So, if you have any idea how I can help with your debugging/testing, please let me know.
Best regards,
Manuel
Manuel, it seems TOI fails reliably with btrfs as I've reported here: http://lists.tuxonice.net/pipermail/tuxonice-devel/2015-September/007542.html
Point 2) is still unfixed, BTW.
blogspot has eaten my previous comment on the last one: weird! Here it comes again:
Thank you post-factum. That's a really nice & professional bug-report. :-)
Two notes:
1) As my system doesn't make use of btrfs, your bug is most likely not related to my issue with TOI.
2) Your issue reminds me of an old bug that I reported to Nigel many months ago, with the same symptom as yours: "no disk write on suspend" when I had loop-mounted another disk image on top of a real partition mounted with ntfs-3g. Of course, not the same use case, but not incomparable.
Let's hope for the best regarding answers and a fix from Nigel.
BR, Manuel Krause
OK, let's summarize. There are two issues we are discussing here:
1. TOI failed to write the image file on btrfs, which turned out not to be a BFS related issue, and @pf is tracing it with TOI.
2. On the -gc branch, after applying the removal of "resched_closest_idle", TOI failed to resume. On -vrq, the first attempt looks good and @Manuel is tracing it.
Correct me if I misunderstand, and thanks for testing so far.
BR Alfred
1. Nope. TOI failed to write to swap, not to the file (I haven't tried the file writer though).
Hmm... interesting. I guessed you were using a swap file setup on btrfs. But it sounds like it's a classical swap partition setup, so it should have nothing to do with which FS you are using, IMO.
Yes, it is a classical swap partition setup, and no, it really has something to do with the FS, as it isn't frozen properly.
2. Regarding the correctly summarized issue 2: although I'd need more time to test -vrq, so far all TOI resumes have succeeded. Also tested with some reboots between multiple attempts. The setup uses a classical swap partition, and mounted partitions include ext4 and ntfs-3g but no btrfs.
As the resumes do not reliably fail ;-) with "resched_closest_idle" removed on -gc, I'd like to test this kernel a bit more to see if I can get some logs. If you have hints on how to improve my testing, please let me know.
Manuel
2. Regarding issue 2: the only things I can offer for a failing TOI resume before reading from swap were gathered with the "no_console_suspend" kernel command line option. Of course, no log is available at this point of the kernel resume:
...
Doing atomic copy/restore <------------------ must come from TOI
serial 00:05: disabled
PM: quiesce of devices complete after 13.x msecs
PM: late quiesce of devices complete after 0.5x msecs
PM: noirq quiesce of devices complete after 5.0x msecs
ACPI : EC: EC stopped
Disabling non-boot CPUs ... <---------- END of available logged messages on screen
In a properly working kernel this would go on with calling tuxoniceui_text from within the booted initrd and the following output (now from a dmesg with the 4.2.0-vrq):
serial 00:05: disabled <--------------- overlapped message
PM: freeze of devices complete after 386.568 msecs <--------------- overlapped message
PM: late freeze of devices complete after 11.979 msecs <--------------- overlapped message
PM: noirq freeze of devices complete after 1.267 msecs <--------------- overlapped message
ACPI: Preparing to enter system sleep state S4 <--------------- on screen omitted message only in dmesg
ACPI : EC: EC stopped <--------------- overlapped message
PM: Saving platform NVS memory <--------------- on screen omitted message only in dmesg
Disabling non-boot CPUs ... <--------------- overlapped message
Renew affinity for 416 processes to cpu 1
smpboot: CPU 1 is now offline
PM: Restoring platform NVS memory
ACPI : EC: EC started
Enabling non-boot CPUs ...
x86: Booting SMP configuration:
smpboot: Booting Node 0 Processor 1 APIC 0x1
Renew affinity for 415 processes to cpu 1
cache: parent cpu1 should not be sleeping
bfs/vrq: ci[1,0] = 1, 32768
bfs/vrq: ci[1,1] = 1, 32768
bfs/vrq: ci[1,2] = 2, 3145728
bfs/vrq: CACHE_SCOST_THRESHOLD(1) = 18
CPU1 is up
ACPI: Waking up from system sleep state S4 <----------------- nothing relevant after this point
The general behaviour reminds me of the first time that Con removed the plugged IO code temporarily. But I don't have enough coding knowledge/historical ambitions to pinpoint something in detail. I also need to admit that I haven't made use of the -ck-only patches at all since that time. The -gc and following -vrq patches were too promising.
BTW, the current -vrq is still running fine and no misbehaviour so far.
BR, Manuel
For issue 1, as it's not related to scheduler code, I'd like to put it aside.
For issue 2, "resched_closest_idle" causes crashes for some users, while its removal causes TOI to fail to resume. Comparing the impact, I'd like to keep it removed, and -vrq seems to be a workaround for TOI usage. I will look into the removed code again and find a proper way in the next release.
@Manuel
Thanks for the log. I can see the latest task cache code works as expected on your system. Your cpu looks like a core2 with 3M L2 cache. :)
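For readers who want to check the "3M" reading against the log above, here is a minimal sketch that parses the `bfs/vrq: ci[...]` lines. The field meanings are an assumption inferred from this thread (cpu, cache index, cache level, size in bytes), not something the patch documents here, and the names `LINE_RE` and `parse_cache_lines` are hypothetical.

```python
import re

# Assumed layout (not confirmed by the patch itself):
#   bfs/vrq: ci[<cpu>,<index>] = <cache level>, <size in bytes>
LINE_RE = re.compile(r"bfs/vrq: ci\[(\d+),(\d+)\] = (\d+), (\d+)")

def parse_cache_lines(log_text):
    """Extract the cache info entries from dmesg-style text."""
    entries = []
    for match in LINE_RE.finditer(log_text):
        cpu, index, level, size = (int(g) for g in match.groups())
        entries.append({
            "cpu": cpu,
            "index": index,
            "level": level,
            "size_kib": size // 1024,  # bytes to KiB
        })
    return entries

# The three lines quoted in Manuel's dmesg excerpt.
log = """\
bfs/vrq: ci[1,0] = 1, 32768
bfs/vrq: ci[1,1] = 1, 32768
bfs/vrq: ci[1,2] = 2, 3145728
"""

for entry in parse_cache_lines(log):
    print(entry)
```

Under that reading, the last entry is a level 2 cache of 3072 KiB, i.e. 3 MiB, which matches the "3M L2 cache" remark.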
@Alfred:
I'm o.k. with your strategy. -vrq worked well with the previous release and it does with the current one, too. So it's no problem for me to use it; quite the contrary, I'm glad with -vrq on here.
Just let me/us know if you have some patch to test/debug the -gc branch with TOI.
And... yes, it's a Core2 with 3M cache. Your patch detects it quite exactly. ;-)
Best regards and many thanks for your successful work,
Manuel
The -vrq branch is not a kind of "holy water". Just encountered a series of failed resumes with TOI. :-(
The new test kernel will exclude BLK_DEV_THROTTLING, BLK_CGROUP, BFQ_GROUP_IOSCHED again, as I had it with the -vrq 4.1 kernels.
Manuel
Mmmh, I know it was not well thought out to blame -vrq or -gc. It's also not good to post test results too early (e.g. after letting a new kernel run for only 2 days).
With the above mentioned features removed, system resume with TOI appears to be stable again. As I understood the BFQ information, these settings, when enabled, seem to utilize new algorithms in 4.2 compared to 4.1, so it makes sense for me to revert to my 4.1 config. I'm also taking into account the most recent bugfix discussions on the BFQ newsgroup (https://groups.google.com/forum/?fromgroups=#!forum/bfq-iosched). As I don't knowingly make use of the advantages of these settings (and also don't know about them; maybe one of you has time to explain?) and don't know whether my system silently uses them, it seems safer for me to leave them disabled.
Now I need to re-test the -gc branch with these new-old settings, too, in the coming days.
BR, Manuel
The -gc hibernation was failing for me due to my new setup with BLK_DEV_THROTTLING, BLK_CGROUP, BFQ_GROUP_IOSCHED enabled.
When I disable them, TOI works very well with the -gc branch, too. Kernel 4.2.1. So, it's a separate BFQ issue.
Thank you for your audience, best regards,
Manuel
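For anyone who wants to check their own kernel configuration for the three options named in this comment, a minimal sketch follows. It only inspects the text of a .config file; the function name `enabled_options` and the sample text are made up for illustration, and where your .config lives depends on your setup.

```python
import re

# The three options named in the comment above.
SUSPECT_OPTIONS = ("BLK_DEV_THROTTLING", "BLK_CGROUP", "BFQ_GROUP_IOSCHED")

def enabled_options(config_text, options=SUSPECT_OPTIONS):
    """Return the options that are set to y or m in a kernel .config text."""
    enabled = []
    for name in options:
        # Disabled options appear as "# CONFIG_FOO is not set" and do not match.
        pattern = r"^CONFIG_%s=(y|m)\s*$" % re.escape(name)
        if re.search(pattern, config_text, flags=re.MULTILINE):
            enabled.append(name)
    return enabled

# Hypothetical sample, not taken from any real system in this thread.
sample = """\
CONFIG_BLK_CGROUP=y
# CONFIG_BLK_DEV_THROTTLING is not set
CONFIG_BFQ_GROUP_IOSCHED=y
"""

print(enabled_options(sample))
```

To use it on a real system, read the .config of the kernel build tree into a string and pass that in place of `sample`. An empty result means none of the three options is enabled.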
Thanks for your testing, @Manuel.
Ooops, I forgot to post a late addendum to this issue:
I can achieve the same positive results when using the BFQ v7r8 for 4.2 patches provided by the BFQ io-scheduler's team.
BR Manuel