Wednesday, September 16, 2015

gc_v4.2_0463_2 released

As the title says, there are just two updates in this release:

1. Removed resched_closest_idle() as planned in the previous release post. I haven't received feedback yet, but considering that more reports are coming in, both calls are removed in this release. The modified commit is

da50716 bfs: Full cpumask based and LLC sensitive cpu selection, v4

2. Fixed an update_cpu_load_nohz() redefinition error when CONFIG_NO_HZ_COMMON is not defined; this was missed in the 4.2 sync-up. The modified commit is

75fd9b3 bfs: [Sync] 4.2 sync up v2

The new -gc branch can be found on bitbucket and github. An all-in-one patch that includes all bfs-related changes (*NOT* all commits in the -gc branch) is also provided.

Have fun with this new -gc release; a -vrq branch update is incoming.

BR Alfred

31 comments:

  1. Booted OK for me, thanks.

    I've sent those idle-related patches to the user who reported the panic (remember the bunch of screenshots). Still waiting for a response from him.

  2. @post-factum:
    Thank you very much for still tracking it!!!
    Greets,
    Manuel

  3. @Alfred:
    Can you please provide, in short order, a revert patch for your "resched_closest_idle" removal, based upon your current -gc?
    I've looked into the changed code and see that you've changed much more than you've described, so I'm unable to revert it on my own.

    The reason I'm writing this: TuxOnIce fails. But it did not fail with your former 4.2-bfs patches/branch.
    As there may be other things causing the misbehaviour here (my openSUSE upgrade and the 'new' gcc-4.9), I would be glad to rule out at least one item on the list.

    Thank you in advance,
    Manuel

    Replies
    1. BTW, the -vrq for 4.1 was much more failsafe, regarding TuxOnIce, IMO.
      Manuel

    2. Hmm, TOI fails for me as well — it simply does not write anything to disk :(.

    3. In my case it did write on suspend but didn't read from disk at resume time. :-(

      This afternoon I've collected the related functions to revert the "resched_closest_idle" removal in current -gc and manually adjusted them into a patch. I've uploaded it to http://pastebin.com/FCe7H6ar
      (Some of the hunks' line numbers may be inaccurate; I hope the rest is o.k.)

      As Alfred hadn't decided whether to remove the first, the second, or both "resched_closest_idle" calls, this evening I did some testing by commenting out one or the other call with "//". The result here was: keeping both calls, or either one of them, made my system resume correctly with TuxOnIce.
      After reading post-factum's posting here I rechecked the kernel version without the patch, i.e. pure -gc, and the wrong behaviour had vanished.

      So now I'm absolutely clueless about how to proceed here. I hope that one of you can review the patch and/or test the possible cases.

      Best regards,
      Manuel Krause

    4. I really don't know what this issue depends on:
      I just did another reboot + uptime + hibernation with the patch-less -gc kernel (of course always with BFQ and with TOI)... and it failed once again at TOI resume.

      To trigger this issue here, some uptime is needed and some amount of data in the swap partition.

      BR, Manuel

  4. @Manuel
    Sorry for the late reply; I'm just back from a trip. I haven't quite caught up with your issue yet, but I uploaded the 4.2 -vrq branch before I left for the trip last week, so you can give it a try.

    Replies
    1. Did you mean 'next' week? Otherwise I don't understand your message, as I don't see -vrq for 4.2 anywhere.

      BR, Manuel

    2. O.k., thank you Alfred, for making the 4.2-vrq repositories accessible.
      Now I'm testing it, and the first two TOI hibernations have succeeded. No other problems so far.

      BR, Manuel

  5. Reverting gc_v4.2_0463_2 didn't help; now trying to revert all BFS-related commits.

    Replies
    1. Umm, no, the TOI failure is definitely not BFS-related. I've reverted everything related to BFS and still didn't get TOI working. Sorry, I will debug TOI itself.

    2. I really don't want to discourage you from debugging TOI, but your experience definitely contradicts my positive experience of several successful TOI resumes (none failed) when using Alfred's previous 4.2 sync-up revision gc_v4.2_0463_1 (plus BFQ). But it could also be that I haven't tested it long enough.

      So if you have any idea of how I can help your debugging/testing, please let me know.

      Best regards,
      Manuel

    3. Manuel, it seems TOI fails reliably with btrfs as I've reported here: http://lists.tuxonice.net/pipermail/tuxonice-devel/2015-September/007542.html

    4. Point 2) is still unfixed, BTW.

    5. Blogspot has eaten my previous comment replying to the last one: weird! Here it comes again:

      Thank you post-factum. That's a really nice & professional bug-report. :-)
      Two notes:
      1) As my system doesn't make use of btrfs, your bug is most likely not related to my issue with TOI.
      2) Your issue reminds me of an old BUG that I reported to Nigel many months ago, with the same symptom you have: "no disk write on suspend" when I had loop-mounted another disk image on top of an ntfs-3g-mounted real partition. Of course it's not the same use case, but not incomparable.

      Let's hope the best for answers and a fix from Nigel.

      BR, Manuel Krause

  6. OK, let's summarize. There are two issues we are discussing here:
    1. TOI fails to write the image file on btrfs, which turns out not to be a BFS-related issue; @pf is tracing it with TOI.
    2. On the -gc branch, after applying the removal of "resched_closest_idle", TOI fails to resume. On -vrq the first attempt looks good; @Manuel is tracing it.

    Correct me if I've misunderstood, and thanks for the testing so far.

    BR Alfred

    Replies
    1. This comment has been removed by the author.

    2. 1. Nope. TOI failed to write to swap, not to a file (I haven't tried the file writer, though).

    3. En... interesting. I guessed you were using a swap file setup on btrfs. But it sounds like it's a classical swap partition setup; then it should have nothing to do with what FS you are using, IMO.

    4. Yes, it is a classical swap partition setup, and no, it really does have something to do with the FS, as it isn't frozen properly.

    5. Regarding the correctly summarized issue 2: although I'd need more time to test -vrq, so far all TOI resumes have succeeded. Also tested with some reboots between multiple attempts. The setup uses a classical swap partition, and the mounted partitions include ext4 and ntfs-3g but no btrfs.
      As the resumes do not reliably fail ;-) with "resched_closest_idle" removed on -gc, I'd like to test this kernel a bit more to see if I can get some logs. If you have hints on how to improve my testing, please let me know.

      Manuel

    6. Regarding issue 2: the only things I can offer for a TOI resume failing before reading from swap were gathered with the "no_console_suspend" kernel command line option, of course, as no log is available at this point of kernel resume:
      ...
      Doing atomic copy/restore <------------------ must come from TOI
      serial 00:05: disabled
      PM: quiesce of devices complete after 13.x msecs
      PM: late quiesce of devices complete after 0.5x msecs
      PM: noirq quiesce of devices complete after 5.0x msecs
      ACPI : EC: EC stopped
      Disabling non-boot CPUs ... <---------- END of available logged messages on screen

      In a properly working kernel this would continue by calling tuxoniceui_text from within the booted initrd, with the following output (taken here from a dmesg of the 4.2.0-vrq):
      serial 00:05: disabled <--------------- overlapped message
      PM: freeze of devices complete after 386.568 msecs <--------------- overlapped message
      PM: late freeze of devices complete after 11.979 msecs <--------------- overlapped message
      PM: noirq freeze of devices complete after 1.267 msecs <--------------- overlapped message
      ACPI: Preparing to enter system sleep state S4 <--------------- on screen omitted message only in dmesg
      ACPI : EC: EC stopped <--------------- overlapped message
      PM: Saving platform NVS memory <--------------- on screen omitted message only in dmesg
      Disabling non-boot CPUs ... <--------------- overlapped message
      Renew affinity for 416 processes to cpu 1
      smpboot: CPU 1 is now offline
      PM: Restoring platform NVS memory
      ACPI : EC: EC started
      Enabling non-boot CPUs ...
      x86: Booting SMP configuration:
      smpboot: Booting Node 0 Processor 1 APIC 0x1
      Renew affinity for 415 processes to cpu 1
      cache: parent cpu1 should not be sleeping
      bfs/vrq: ci[1,0] = 1, 32768
      bfs/vrq: ci[1,1] = 1, 32768
      bfs/vrq: ci[1,2] = 2, 3145728
      bfs/vrq: CACHE_SCOST_THRESHOLD(1) = 18
      CPU1 is up
      ACPI: Waking up from system sleep state S4 <----------------- nothing relevant after this point


      The general behaviour reminds me of the time Con first removed the plugged-IO code temporarily. But I don't have enough coding knowledge/historical ambition to pinpoint anything in detail. I also have to admit that I haven't made use of the -ck-only patches at all since those times; the -gc and subsequent -vrq patches were too promising.

      BTW, the current -vrq is still running fine and no misbehaviour so far.

      BR, Manuel

    7. For issue 1, as it's not related to the scheduler code, I'd like to put it aside.
      For issue 2: "resched_closest_idle" causes a crash for some users, while its removal causes TOI to fail to resume. Comparing the impact, I'd like to keep it removed, and -vrq seems to be a workaround for TOI usage. I will look into the removed code again and find a proper solution in the next release.

      @Manuel
      Thanks for the log. I can see the latest task-cache code works as expected on your system. Your CPU looks like a Core 2 with a 3M L2 cache. :)

    8. @Alfred:
      I'm o.k. with your strategy; -vrq worked well with the previous release and so it does with the current one, too. So it's no problem for me to use it; quite the contrary, I'm glad to have -vrq here.

      Just let me/us know if you have some patch to test/debug the -gc branch with TOI.

      And... yes, it's a Core 2 with 3M cache. Your patch detects it quite exactly. ;-)

      Best regards and many thanks for your successful work,
      Manuel

  7. The -vrq branch is not a kind of "holy water" either: I just encountered a row of failing resumes with TOI. :-(
    A new test kernel will exclude BLK_DEV_THROTTLING, BLK_CGROUP, and BFQ_GROUP_IOSCHED again, like I had it with the -vrq 4.1 kernels.

    Manuel

    Replies
    1. Mmmh, I know it was not well thought out to blame -vrq or -gc. It's also not good to post test results too early (e.g. after letting a new kernel run for only 2 days).
      With the above-mentioned features removed, system resume with TOI appears to be stable again. As these settings seem to utilize new algorithms in 4.2 compared to 4.1 when enabled, as I understood the BFQ information, it makes sense for me to revert to my 4.1 config. I'm also taking into account the most recent bugfix discussions on the BFQ newsgroup (https://groups.google.com/forum/?fromgroups=#!forum/bfq-iosched). As I don't deliberately make use of the advantages of these settings (and don't really know about them -- maybe one of you has time to explain?) and don't know whether my system silently uses them, it seems safer for me to leave them disabled.

      Now it seems I need to re-test the -gc branch with these new-old settings, too, in the coming days.

      BR, Manuel

    2. The -gc hibernation was failing for me due to my new setup with BLK_DEV_THROTTLING, BLK_CGROUP, BFQ_GROUP_IOSCHED enabled.

      When I disable them, TOI works very well with the -gc branch, too (kernel 4.2.1). So it's a separate BFQ issue.

      Thank you for your attention, best regards,
      Manuel

    3. Thanks for your testing, @Manuel.

    4. Oops, I forgot to post a late addendum to this issue:
      I can achieve the same positive results when using the BFQ v7r8-for-4.2 patches provided by the BFQ io-scheduler team.

      BR Manuel
