What's new:
1. Based on BFS 0463 and kernel v4.1.4
2. Fix/Sync against BFS 0463
- 3b14908 bfs: [Sync] 4.1 schedule_user().
- 9f9dc34 bfs: [Fix] 0463 remove unused register_task_migration_notifier().
- 0145370 bfs: [Sync] TIF_POLLING_NRFLAG for wake_up_if_idle() and resched_curr().
- 775e28a bfs: [Sync] sched_init_numa().
- c6c5894 bfs: [Sync] task_sched_runtime().
- 4a48abf bfs: [Sync] sched_setscheduler() logic, v3
- dc4fa45 bfs: -gc BFS enchancement patch set version.
The code has been force-pushed to bitbucket and github. For those who just want an easier way to apply the patches, here is an all-in-one patch that includes all BFS-related commits in my gc-branch: bfs_enhancement_v4.1_0463_1.patch
If you are using the gc-branch, I highly suggest you upgrade to this gc release. An updated -vrq branch will be coming soon. No new commits are planned for it (they have to be delayed to next week because of the amount of sync-up work this week), but there will be some bug fixes for the existing ones.
BR Alfred Chen
Update:
Added one more commit to fix the RCU stall issue.
bfs: v4.1_0463_1 rcu stall fix.
Unfortunately, this update breaks my system. I've encountered RCU stalls after NetworkManager starts up, with warnings in dmesg and a complete system hang. Mainline 4.1.4 works OK.
P.S. Here is the relevant commit in my git tree; it corresponds to your pure BFS patch:
https://github.com/pfactum/pf-kernel/commit/2fec55d0bd336e3a860dd8ffcc9fe52eb14bdf20
Also, here is my config:
https://github.com/pfactum/pf-kernel/blob/pf-4.1/configs/dell-vostro-3360.config
RCU has been configured as follows:
===
# RCU Subsystem
CONFIG_PREEMPT_RCU=y
CONFIG_SRCU=y
CONFIG_TASKS_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
CONFIG_RCU_FAST_NO_HZ=y
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_BOOST=y
CONFIG_RCU_KTHREAD_PRIO=1
CONFIG_RCU_BOOST_DELAY=500
# CONFIG_RCU_EXPEDITE_BOOT is not set
# RCU Debugging
# CONFIG_PROVE_RCU is not set
# CONFIG_SPARSE_RCU_POINTER is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_CPU_STALL_INFO is not set
# CONFIG_RCU_TRACE is not set
===
Guess it is a new issue. Leave the RCU config aside for now: since it works with 4.12 with bfs 0462 and gc, it should be fine for 0463 and gc.
Is it possible for you to bisect this issue using git from commit
47fd3be sched: 4.1-sched-bfs-463
to
dc4fa45 bfs: -gc BFS enchancement patch set version. Add -gc patch set version in dmesg. (consider as 'bad')
and find out which commit introduced the new issue?
BR Alfred
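For readers who have not used it, the bisection requested above can be sketched roughly as follows. In the real tree the starting point would be `git bisect start dc4fa45 47fd3be` (bad commit first, then good), followed by a build, boot and `git bisect good` or `git bisect bad` at each step. The block below is only a self-contained illustration in a throwaway repository: the seven demo commits, the `state.txt` marker file and the grep-based check are all made up for the demo and stand in for "build, boot, watch dmesg for RCU stalls".

```shell
#!/bin/sh
# Sketch: let `git bisect run` find the first bad commit in a throwaway repo.
# Everything here (commit names, state.txt, the grep check) is hypothetical.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "bisect demo"
git config user.email "demo@example.invalid"

# Seven commits; from commit 5 onwards the "regression" is present.
i=1
while [ "$i" -le 7 ]; do
    if [ "$i" -ge 5 ]; then
        echo "broken $i" > state.txt
    else
        echo "ok $i" > state.txt
    fi
    git add state.txt
    git commit -q -m "commit $i"
    i=$((i + 1))
done

good=$(git rev-list --max-parents=0 HEAD)   # root commit, known good
bad=$(git rev-parse HEAD)                   # tip, known bad

# Same argument order as the real case: bad first, then good.
git bisect start "$bad" "$good" >/dev/null

# The check must exit 0 for a good commit and non-zero for a bad one.
git bisect run sh -c '! grep -q broken state.txt' >/dev/null

# After the run, refs/bisect/bad points at the first bad commit.
first_bad=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1

echo "first bad: $first_bad"
```

On a kernel tree the check cannot usually be automated this way, since each step needs a reboot, so the manual `git bisect good` / `git bisect bad` loop is the more realistic variant.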
Hi, Alfred,
I've got the same issue as post-factum, so I've done some bisecting.
The issue appears when adding commit 4a48abf1816e488bb132fd33ac550cb0dabdc3fc,
"bfs: [Sync] sched_setscheduler() logic, v3".
I hope this helps you find the error quickly and offer a fix.
Best regards,
Manuel Krause
(Maybe I should add that I did a "fake" bisection, not using git but using your separate patches from your bitbucket repo, in the order you committed them. I then imitated the way git bisection does its job.) That shouldn't make a difference for the resulting culprit. *MK
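The "fake" bisection described here is a plain binary search over the ordered patch series. A minimal sketch, assuming the seven commits are applied in the order listed in the post and that every kernel built after the first bad patch is also bad; `first_bad` and `is_bad` are hypothetical names, and the `is_bad` body below only simulates the outcome reported in this thread rather than building anything:

```shell
#!/bin/sh
# Sketch: bisect an ordered patch series by hand, without git.

# is_bad K: should succeed if a kernel with the first K patches applied
# shows the problem. In real use this means "apply K patches, build, boot,
# check dmesg". Here it is a stand-in that reports bad from patch 6 on.
is_bad() {
    [ "$1" -ge 6 ]
}

# first_bad N: print the smallest K in 1..N for which is_bad succeeds,
# assuming 0 patches is good and all N patches is bad.
first_bad() {
    lo=0        # highest patch count known to be good
    hi=$1       # lowest patch count known to be bad
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if is_bad "$mid"; then
            hi=$mid
        else
            lo=$mid
        fi
    done
    echo "$hi"
}

first_bad 7     # prints 6
```

With seven patches this needs three test builds instead of seven, which is the whole point of imitating what git does.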
Thanks Manuel. I'll look into that commit tomorrow. There are more changes in that one than expected, and maybe I have done something wrong.
O.k., good to know that. Thank you for all your work! Highly appreciated!
BTW, the "old" -vrq branch has been running _absolutely_ flawlessly and nicely since the day I compiled it for the first time. (Now, I have it with 4.1.4.)
@pf & @Manuel
Please try this patch on top of the current gc branch and see whether it fixes the rcu stall issue.
Patch file is https://bitbucket.org/alfredchen/linux-gc/downloads/4.1_0463_1_rcu_stall_fix.patch
When merging the code I forgot what I had written in the previous version, crap.
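For anyone unsure how to try such a patch safely: check first that it applies cleanly, then apply it. The sketch below shows those two steps with `git apply` on a throwaway file; `hello.txt` and `fix.patch` are made-up names for the demo. In a kernel tree the same two commands would be run from the top of the source directory against the downloaded 4.1_0463_1_rcu_stall_fix.patch.

```shell
#!/bin/sh
# Sketch: dry-run a patch before applying it. File names are hypothetical.
set -eu

work=$(mktemp -d)
cd "$work"

printf 'old\n' > hello.txt

# A minimal unified diff with the usual a/ and b/ prefixes (strip level 1).
cat > fix.patch <<'EOF'
--- a/hello.txt
+++ b/hello.txt
@@ -1 +1 @@
-old
+new
EOF

# Step 1: check only. Exits non-zero if the patch would not apply,
# and leaves the tree untouched either way.
git apply --check fix.patch

# Step 2: actually apply.
git apply fix.patch

result=$(cat hello.txt)
echo "$result"
```

Outside a git checkout, `patch -p1 --dry-run < file` followed by `patch -p1 < file` plays the same role.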
It seems it did the trick — now I'm able at least to write this comment. Will test more, thanks to both of you.
Yes, I can also confirm that everything is running as desired with the rcu_stall_fix.patch on here.
Thank you for fixing so quickly!
Now I'm really looking forward to your updated -vrq branch. :-)
Manuel
Thank you for the feedback. I am currently dealing with the "unplugged" issue mentioned in CK's blog, which shows up when untarring files to a TF card. So I may need some time before I look at the vrq branch.
BR Alfred
Mmmh. I've also read those postings. Are you able to reproduce the problem -- and if yes -- what are the best ways to reproduce it?
The code in question is also in your -gc and -vrq branch before the 0463 update (meaning, without the fix that Con provided), so I'm wondering why I haven't had problems on my system.
Manuel
I am not quite sure yet. Even with CK's fix patch applied, I still get a similar issue when untarring files to that TF card. Maybe the btrfs on it was doomed and that caused the issue; I'm still looking into it.
Hehe, funny you... In the meantime, while I assumed you were investigating the unplugged-I/O issue, I tested applying the "old" -vrq patches on top of the "new" -gc... and now I see that you've just updated the -vrq... As there was only one minor fuzz, some offsets and no compilation errors, I'd simply reboot and report issues under the appropriate thread.
BTW, in graysky's ck-repo forum thread (https://bbs.archlinux.org/viewtopic.php?id=111715&p=101) there are several success reports regarding the use of Con's fix patch. I'm not sure if that info is a waste of time or not.
Best regards,
Manuel
Just to notify other readers: The approach in the above comment as of August 7, 2015 at 8:16 AM is NOT the preferred way. Alfred has updated his -vrq patches in the meantime. We should use the newer ones instead.
BR, Manuel
Here is the patch accepted by CK:
https://gist.github.com/2917ffd222e8b100ffd3
It cannot be applied to -gc in its current form but may be adapted. Alfred?
You can try this https://bitbucket.org/alfredchen/linux-gc/downloads/4.1_0463_1_revert_unplugged.patch
I am still deciding whether to use this patch or not; no similar issues have been reported when using the gc branch. My last issue with untarring files to the TF card turned out to be a btrfs filesystem corruption.
@pf, any similar reports from your pf-kernel users?
BR Alfred
That's a really good point, Alfred, to ask for pf-kernel users' experience. I'm watching the pf-kernel forum, but can't see any specific failure report for this. Maybe I've overlooked something, especially regarding older kernels, in which Con also had to revert the plugged I/O code. I'm not sure exactly which previous, maybe deprecated, kernel it was.
And now again: Are you both, including you, Oleksandr, able to reproduce the issue? And if yes, how do I get it on here, too? I'm currently fine with the renewed -vrq branch, but have to admit that I'm not making heavy use of transfers to external exchangeable media.
Best regards, and thank you both for your valuable and appreciated work,
Manuel Krause
Just for the record, I've found a related reference in this directory, at least: http://ck.kolivas.org/patches/bfs/3.0/3.18/pending/ The patch is "bfs460-locked-pluggedio.patch". Funny to realize the date... 31-Dec-2014.
From your programmers' point of view, would you say that reverting the newer plugged I/O code would slow down the kernel? I'm a bit astonished that the code is in CFS as well (or based on it), and we haven't seen complaints from those people.
BR, Manuel Krause
Alfred, there are no reports regarding Btrfs as far as I remember. I'm unable to reproduce the issue for now as I do not use Btrfs (if it is Btrfs-related at all).
Anyway, if any report comes in, I'll forward it to you.
As far as I know, kernelOfTruth has a "btrfs scrub" issue, but his "trial fix", which is similar to CK's reverse-unplugged patch, doesn't seem to help. I have sent him an email asking how his issue is going.
As for these unplugged io code changes, I'd like to keep them in sync with the mainline code for now; if it's confirmed that they cause issues, then we will find a solution to fix them.
BR Alfred
Maybe "someone" ;-) will realize in the near future that your code improvements to BFS, Alfred, are so comprehensive and useful that he'd adopt them and make fixing with such an "old" hack solution superfluous. I'm content with the results of your recent heavy work.
Any news from kernelOfTruth in the meantime?
BR Manuel
Too early to claim that, Manuel.
I've just encountered a plug-related kernel panic on my home server with Btrfs.
Screenshot 1: https://drive.google.com/file/d/0BwjMKZtUByALaXBiNWVLQWFqVFU/view?usp=sharing
Screenshot 2: https://drive.google.com/file/d/0BwjMKZtUByALNUNVcUhSeV9XNnc/view?usp=sharing
Will try to adapt CK's fix attempt to the -gc branch.
OK, booted with the following patch:
https://github.com/pfactum/pf-kernel/commit/044be9cd1fe06ffd00d4d6424729a88ebf497104
Will test more.
@pf, would you please share your btrfs setup (mount options etc.)? I'll see if I can reproduce it myself.
And at the same time, let's see if your above patch works. I am also coming up with another patch to address this kind of issue.
BR Alfred
I doubt this is a btrfs issue at all. Anyway:
===
pf@defiant:~ » cat /etc/fstab| grep btrfs
UUID=14140a7f-23bc-4dab-b263-f2f46f5d70aa / btrfs rw,relatime,space_cache 0 0
UUID=14140a7f-23bc-4dab-b263-f2f46f5d70aa /home btrfs rw,relatime,space_cache,subvol=@home 0 0
UUID=14140a7f-23bc-4dab-b263-f2f46f5d70aa /srv btrfs rw,relatime,space_cache,subvol=@srv 0 0
UUID=14140a7f-23bc-4dab-b263-f2f46f5d70aa /mnt/torrents btrfs rw,relatime,space_cache,subvol=@torrents 0 0
UUID=14140a7f-23bc-4dab-b263-f2f46f5d70aa /mnt/syncthing btrfs rw,relatime,space_cache,subvol=@syncthing 0 0
UUID=14140a7f-23bc-4dab-b263-f2f46f5d70aa /mnt/backups btrfs rw,relatime,space_cache,subvol=@backups 0 0
===
===
pf@defiant:~ » lsblk
NAME                    MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda                       8:0    0 119.2G  0 disk
└─sda1                    8:1    0 119.2G  0 part
  └─md127                 9:127  0 119.2G  0 raid10
    ├─base-boot         253:0    0   512M  0 lvm    /boot
    └─base-system       253:1    0 118.7G  0 lvm
      └─system          253:2    0 118.7G  0 crypt
        ├─system-swap   253:3    0     8G  0 lvm    [SWAP]
        └─system-root   253:4    0 110.7G  0 lvm    /
sdb                       8:16   0 119.2G  0 disk
└─sdb1                    8:17   0 119.2G  0 part
  └─md127                 9:127  0 119.2G  0 raid10
    ├─base-boot         253:0    0   512M  0 lvm    /boot
    └─base-system       253:1    0 118.7G  0 lvm
      └─system          253:2    0 118.7G  0 crypt
        ├─system-swap   253:3    0     8G  0 lvm    [SWAP]
        └─system-root   253:4    0 110.7G  0 lvm    /
sdc                       8:32   1  57.7G  0 disk
└─md1                     9:1    0  57.7G  0 raid10 /mnt/music
sdd                       8:48   1  57.7G  0 disk
└─md1                     9:1    0  57.7G  0 raid10 /mnt/music
===
@pf
I'll wait for your test result. Hopefully we can identify whether it is a btrfs issue or a scheduler issue.
Running OK for now.
@pf
Can it be reproduced 100% of the time when you disable your additional patch?
In my opinion, that isn't the point. If the algorithm fails at least once without the additional patch, there's something wrong with the code. So the question would rather be: Does the problem ever happen _with_ the patch?
A side question: What is so "bad" about the patched code?
Best regards,
Manuel Krause
@Alfred, I cannot reproduce it reliably, as it happens under circumstances unknown to me. But it happens without the patch and, so far, doesn't happen with the patch.
Looks like a race condition, but unfortunately I've got no idea what to do.
@pf
Would you please try this patch?
https://bitbucket.org/alfredchen/linux-gc/downloads/sched_submit_work.patch
I'm also testing it on top of -vrq atm, since you've published it. Just so you can be sure. No major issues so far with it in ~4h of uptime. My [i915] threw out two [Warning]s without it, now one [Error], but each without negative effects.
I really hope post-factum has a kind of timeframe in mind for the issue to materialize. ;-)
Best regards to both of you,
Manuel Krause
OK, for now I may assume that the CK fix did the trick, as my machine works well with it.
Now I'm going to discard that patch and apply Alfred's one.
Reverted Con's patch, applied https://bitbucket.org/alfredchen/linux-gc/downloads/sched_submit_work.patch, rebooted OK. Waiting for… something.
@post-factum: I wish you happy testing! ;-) I've had it running for three days with -vrq without any related issues (that I haven't had anyway).
BR Manuel
Alfred, unfortunately, your patch doesn't fix the issue. Got the same panic:
https://drive.google.com/file/d/0BwjMKZtUByALYjJGNHF2T2VGc2M/view?usp=sharing
Only the CK fix seems to work OK now.
Many thanks for your testing time! (Although it's sad news.)
BR Manuel
@pf
Thanks for testing! At least it clears my suspicion of this code path. I'll find some time to look into other code paths.
BR Alfred
I really hope that you are able to find the "final fix" ;-)
Best wishes and always good luck,
Manuel