Wednesday, August 5, 2015

gc-branch update with CK's BFS 0463

CK finally released BFS 0463 for kernel 4.1 this week, so here comes the gc branch update.

What's new:
1. Based on BFS 0463 and kernel v4.1.4
2. Fix/Sync against BFS 0463
  • 3b14908 bfs: [Sync] 4.1 schedule_user().
  • 9f9dc34 bfs: [Fix] 0463 remove unused register_task_migration_notifier().
3. New Sync patches that pick up earlier mainline changes (mostly from 3.17 to 3.18, plus some fixes to previous patches)
  • 0145370 bfs: [Sync] TIF_POLLING_NRFLAG for wake_up_if_idle() and resched_curr().
  • 775e28a bfs: [Sync] sched_init_numa().
  • c6c5894 bfs: [Sync] task_sched_runtime().
  • 4a48abf bfs: [Sync] sched_setscheduler() logic, v3
4. Give this patch set a meaningful version name: "v4.1_0463_1"
  • dc4fa45 bfs: -gc BFS enchancement patch set version.

The code has been force-pushed to Bitbucket and GitHub. For those who just want an easy way to apply the patches, here is an all-in-one patch that includes all BFS-related commits in my gc-branch: bfs_enhancement_v4.1_0463_1.patch

If you are using the gc-branch, I highly suggest you upgrade to this gc release. An updated -vrq branch will be coming soon. No new commits are planned for it (they have to be delayed to next week because of all the sync-up work this week), but there will be some bug fixes for the existing ones.

BR Alfred Chen


Update:
Added one more commit to fix the RCU stall issue.

bfs: v4.1_0463_1 rcu stall fix.

40 comments:

  1. Unfortunately, this update breaks my system. I've encountered RCU stalls after NetworkManager starts up, with warnings in dmesg and a complete system hang. Mainline 4.1.4 works OK.

  2. P.S. Here is the relevant commit in my git tree; it corresponds to your pure BFS patch:

    https://github.com/pfactum/pf-kernel/commit/2fec55d0bd336e3a860dd8ffcc9fe52eb14bdf20

    Also, here is my config:

    https://github.com/pfactum/pf-kernel/blob/pf-4.1/configs/dell-vostro-3360.config

    RCU has been configured as follows:

    ===
    # RCU Subsystem
    CONFIG_PREEMPT_RCU=y
    CONFIG_SRCU=y
    CONFIG_TASKS_RCU=y
    CONFIG_RCU_STALL_COMMON=y
    CONFIG_RCU_FANOUT=64
    CONFIG_RCU_FANOUT_LEAF=16
    # CONFIG_RCU_FANOUT_EXACT is not set
    CONFIG_RCU_FAST_NO_HZ=y
    # CONFIG_TREE_RCU_TRACE is not set
    CONFIG_RCU_BOOST=y
    CONFIG_RCU_KTHREAD_PRIO=1
    CONFIG_RCU_BOOST_DELAY=500
    # CONFIG_RCU_EXPEDITE_BOOT is not set
    # RCU Debugging
    # CONFIG_PROVE_RCU is not set
    # CONFIG_SPARSE_RCU_POINTER is not set
    CONFIG_RCU_CPU_STALL_TIMEOUT=60
    # CONFIG_RCU_CPU_STALL_INFO is not set
    # CONFIG_RCU_TRACE is not set
    ===

    Replies
    1. I guess it is a new issue. Let's leave the RCU configuration aside for now: since it works with 4.12 with BFS 0462 and gc, it should be fine for 0463 and gc.

      Is it possible for you to bisect this issue using git from commit
      47fd3be sched: 4.1-sched-bfs-463
      to
      dc4fa45 bfs: -gc BFS enchancement patch set version. Add -gc patch set version in dmesg. (consider as 'bad')
      and find out which commit introduced the new issue?

      BR Alfred

    2. Hi, Alfred,
      I've got the same issue as post-factum, so I did a bisection.
      The issue appears when adding commit 4a48abf1816e488bb132fd33ac550cb0dabdc3fc,
      "bfs: [Sync] sched_setscheduler() logic, v3".
      I hope this helps you quickly find the error and offer a fix.

      Best regards,
      Manuel Krause

    3. (Maybe I should add that I did a "fake" bisection: not using git, but applying your separate patches from your Bitbucket repo in the order you committed them, imitating the way git bisect does its job.) It shouldn't make a difference for the resulting culprit. *MK
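
A manual bisection like the one described above can be sketched in a few lines of Python. This is only an illustration, not what was actually run: the function name `first_bad`, the simulated test, and the ordering of the commit ids in `series` are assumptions (the post lists the commits but not their exact order). It also assumes that the base (47fd3be) is good, the full series is bad, and that once the problem appears it stays present with every later patch.

```python
from typing import Callable, List, Sequence


def first_bad(patches: Sequence[str], is_bad: Callable[[int], bool]) -> str:
    """Return the first patch in the series that introduces the problem.

    patches: patch names in the order they were committed.
    is_bad(n): build and boot a kernel with patches[0..n] applied on top of
               the known-good base and report whether the problem shows up
               (a manual step in practice).
    """
    lo, hi = 0, len(patches) - 1  # invariant: the culprit is in patches[lo..hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid        # problem already present: culprit is at mid or earlier
        else:
            lo = mid + 1    # still good: culprit comes after mid
    return patches[lo]


# Commit ids taken from the post; their order here is an assumption.
series = ["3b14908", "9f9dc34", "0145370", "775e28a",
          "c6c5894", "4a48abf", "dc4fa45"]

# Stand-in for the manual build-and-boot test: pretend that every kernel
# containing 4a48abf shows the RCU stall.
culprit_index = series.index("4a48abf")
tested: List[int] = []


def simulated_test(n: int) -> bool:
    tested.append(n)
    return n >= culprit_index


result = first_bad(series, simulated_test)
print(result)  # prints 4a48abf
```

With seven patches in the series, this needs three build-and-boot rounds instead of up to seven, which is the whole point of bisecting rather than testing the patches one by one.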

    4. Thanks, Manuel. I'll look into that commit tomorrow. There are more changes in that one than expected, and maybe I did something wrong.

    5. O.k., good to know that. Thank you for all your work! Highly appreciated!

      BTW, the "old" -vrq branch has been running _absolutely_ flawlessly and nicely since the day I first compiled it. (Now I have it with 4.1.4.)

    6. @pf & @Manuel
      Please try this patch on top of the current gc branch and see if it fixes the RCU stall issue.
      The patch file is https://bitbucket.org/alfredchen/linux-gc/downloads/4.1_0463_1_rcu_stall_fix.patch
      When merging the code I forgot what I had written in the previous version. Crap.

    7. It seems to have done the trick: now I'm at least able to write this comment. Will test more, thanks to both of you.

    8. Yes, I can also confirm that everything is running as desired with the rcu_stall_fix.patch on here.
      Thank you for fixing so quickly!

      Now I'm really looking forward to your updated -vrq branch. :-)

      Manuel

    9. Thank you for the feedback. I am currently dealing with the "unplugged" issue mentioned in CK's blog, which I hit when untarring files to a TF card, so I may need some time before I look at the vrq branch.

      BR Alfred

    10. Mmmh. I've also read those postings. Are you able to reproduce the problem -- and if yes -- what are the best ways to reproduce it?
      The code in question is also in your -gc and -vrq branch before the 0463 update (meaning, without the fix that Con provided), so I'm wondering why I haven't had problems on my system.

      Manuel

    11. I am not quite sure yet. Even with CK's fix patch applied, I still get a similar issue when untarring files to that TF card. Maybe the btrfs on it was doomed and that caused the issue; I'm still looking into it.

    12. Hehe, funny you... In the meantime, while I assumed you were investigating the unplugged I/O issue, I tried applying the "old" -vrq patches on top of the "new" -gc... and now I see that you've just updated the -vrq... As there was only one minor fuzz, some offsets and no compilation errors, I'd simply reboot and report issues under the appropriate thread.
      BTW, in graysky's ck-repo forum thread (https://bbs.archlinux.org/viewtopic.php?id=111715&p=101) there are several success reports regarding the use of Con's fix patch. I'm not sure if that info is a waste of time or not.

      Best regards,
      Manuel

    13. Just to notify other readers: The approach in the above comment as of August 7, 2015 at 8:16 AM is NOT the preferred way. Alfred has updated his -vrq patches in the meantime. We had better use the newer ones.

      BR, Manuel

  3. Here is the patch accepted by CK:

    https://gist.github.com/2917ffd222e8b100ffd3

    It cannot be applied to -gc in its current form but may be adapted. Alfred?

    Replies
    1. You can try this https://bitbucket.org/alfredchen/linux-gc/downloads/4.1_0463_1_revert_unplugged.patch

      I am still deciding whether to use this patch or not, as no similar issues have been reported with the gc branch. My last problem untarring files to the TF card turned out to be btrfs filesystem corruption.

      @pf, any similar reports from your pf-kernel users?

      BR Alfred

    2. That's a really good point, Alfred, to ask for pf-kernel users' experience. I'm watching the pf-kernel forum, but can't see any specific failure report for this. Maybe I've overlooked something especially regarding older kernels, in which Con also had to revert the plugged I/O code. I'm not sure, what previous/ maybe deprecated kernel it exactly was.
      And now again: Are you both, including you, Oleksandr, able to reproduce the issue? And if yes, how do I get it on here, too? I'm currently fine with the renewed -vrq branch, but have to admit that I'm not making heavy use of transfers to external exchangeable media.

      Best regards, and thank you both for your valuable and appreciated work,
      Manuel Krause

    3. Just for the records, I've found a related reference in this directory, at least: http://ck.kolivas.org/patches/bfs/3.0/3.18/pending/ The patch is "bfs460-locked-pluggedio.patch". Funny, to realize the date... 31-Dec-2014.

      From your programmers' point of view, would you say the revert of the newer plugged I/O code would slow down the kernel? I'm a bit astonished that the code is in CFS as well (or based on it), and we haven't seen complaints from those people.

      BR, Manuel Krause

    4. Alfred, there are no reports regarding Btrfs as far as I remember. I'm unable to reproduce the issue for now as I do not use Btrfs (if it is Btrfs-related at all).

      Anyway, if any report comes in, I'll forward it to you.

    5. As far as I know, kernelOfTruth has a "btrfs scrub" issue, but his "trial fix", which is similar to CK's revert-unplugged patch, doesn't seem to help. I have emailed him to ask how his issue is going.

      As for these unplugged I/O code changes, I'd like to keep them in sync with mainline for now. If it's confirmed that they cause issues, then we'll find a solution to fix it.

      BR Alfred

    6. Maybe "someone" ;-) will realize in some near future, that your, Alfred's, code improvements to BFS are that comprehensive and useful that he'd adopt them and make fixing with such an "old" hack solution superfluous/ unnecessary. I'm content with the results of your recent heavy work.

      Any news from kernelOfTruth in the meantime?

      BR Manuel

    7. Too early to claim that, Manuel.

      I've just encountered a plug-related kernel panic on my home server with Btrfs.

      Screenshot 1: https://drive.google.com/file/d/0BwjMKZtUByALaXBiNWVLQWFqVFU/view?usp=sharing

      Screenshot 2: https://drive.google.com/file/d/0BwjMKZtUByALNUNVcUhSeV9XNnc/view?usp=sharing

      Will try to adapt CK's fix attempt to the -gc branch.

    8. OK, booted with the following patch:

      https://github.com/pfactum/pf-kernel/commit/044be9cd1fe06ffd00d4d6424729a88ebf497104

      Will test more.

    9. @pf, would you please share your btrfs setup (mount options etc.)? I'll see if I can reproduce it myself.
      At the same time, let's see if your patch above works. I am also coming up with another patch to address this kind of issue.

      BR Alfred

    10. I doubt this is a btrfs issue at all. Anyway:

      ===
      pf@defiant:~ » cat /etc/fstab| grep btrfs
      UUID=14140a7f-23bc-4dab-b263-f2f46f5d70aa / btrfs rw,relatime,space_cache 0 0
      UUID=14140a7f-23bc-4dab-b263-f2f46f5d70aa /home btrfs rw,relatime,space_cache,subvol=@home 0 0
      UUID=14140a7f-23bc-4dab-b263-f2f46f5d70aa /srv btrfs rw,relatime,space_cache,subvol=@srv 0 0
      UUID=14140a7f-23bc-4dab-b263-f2f46f5d70aa /mnt/torrents btrfs rw,relatime,space_cache,subvol=@torrents 0 0
      UUID=14140a7f-23bc-4dab-b263-f2f46f5d70aa /mnt/syncthing btrfs rw,relatime,space_cache,subvol=@syncthing 0 0
      UUID=14140a7f-23bc-4dab-b263-f2f46f5d70aa /mnt/backups btrfs rw,relatime,space_cache,subvol=@backups 0 0
      ===

      ===
      pf@defiant:~ » lsblk
      NAME                  MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
      sda                     8:0    0 119.2G  0 disk
      └─sda1                  8:1    0 119.2G  0 part
        └─md127               9:127  0 119.2G  0 raid10
          ├─base-boot       253:0    0   512M  0 lvm    /boot
          └─base-system     253:1    0 118.7G  0 lvm
            └─system        253:2    0 118.7G  0 crypt
              ├─system-swap 253:3    0     8G  0 lvm    [SWAP]
              └─system-root 253:4    0 110.7G  0 lvm    /
      sdb                     8:16   0 119.2G  0 disk
      └─sdb1                  8:17   0 119.2G  0 part
        └─md127               9:127  0 119.2G  0 raid10
          ├─base-boot       253:0    0   512M  0 lvm    /boot
          └─base-system     253:1    0 118.7G  0 lvm
            └─system        253:2    0 118.7G  0 crypt
              ├─system-swap 253:3    0     8G  0 lvm    [SWAP]
              └─system-root 253:4    0 110.7G  0 lvm    /
      sdc                     8:32   1  57.7G  0 disk
      └─md1                   9:1    0  57.7G  0 raid10 /mnt/music
      sdd                     8:48   1  57.7G  0 disk
      └─md1                   9:1    0  57.7G  0 raid10 /mnt/music
      ===

    11. @pf
      I'll wait for your test result. Hopefully we can identify whether it is a btrfs issue or a scheduler issue.

    12. @pf
      Can it be 100% reproduced when you disable your additional patch?

    13. In my opinion, that isn't the point. If the algorithm fails at least once without the additional patch, there's something wrong with the code. So the question would rather be: Does the problem ever happen _with_ the patch?
      A side question: What is so "bad" with the patched code?

      Best regards,
      Manuel Krause

    14. @Alfred, I cannot reproduce it reliably, as it happens under circumstances unknown to me. But it happens without the patch and, so far, doesn't happen with the patch.

      Looks like a race condition, but unfortunately I've got no idea what to do.

    15. @pf
      Would you please try this patch?
      https://bitbucket.org/alfredchen/linux-gc/downloads/sched_submit_work.patch

    16. I'm also testing it on top of -vrq atm, since you've published it. Just for you to get sure. No major issues so far with it in ~4h of uptime. My [i915] threw out two [Warning]s without it, now one [Error], but each without negative effects.

      I really hope, post-factum has a kind of timeframe in mind for the issue to materialize. ;-)

      Best regards to both of you,
      Manuel Krause

    17. OK, as for now I may assume that CK fix did the trick, as my machine works well with it.

      Now I'm going to discard that patch and apply Alfred's one.

    18. Reverted Con's patch, applied https://bitbucket.org/alfredchen/linux-gc/downloads/sched_submit_work.patch, rebooted OK. Waiting for… something.

    19. @post-factum: I wish you happy testing! ;-) I've had it running for three days with -vrq without any related issues (that I haven't had anyway).

      BR Manuel

    20. Alfred, unfortunately, your patch doesn't fix the issue. Got the same panic:

      https://drive.google.com/file/d/0BwjMKZtUByALYjJGNHF2T2VGc2M/view?usp=sharing

      Only CK fix seems to work OK now.

    21. Many thanks for your testing time! (Although it's sad news.)

      BR Manuel

    22. @pf
      Thanks for testing! At least it clears my suspicion of this code path. I'll find some time to look into the other code paths.

      BR Alfred

    23. I really hope that you are able to find the "final fix" ;-)

      Best wishes and always good luck,

      Manuel
