Wednesday, June 7, 2017

VRQ 0.96 release

This release is abandoned due to lock-up issues reported by users. The lock-up is caused by the "Lock strategy update" commit, which worked well on my working machines throughout my work on SMT sensitive scheduling; that made me believe it was good and stable.

A new "Lock strategy update" debug patch will be posted here for testing. Once it is confirmed to work well for other users, the respun 096a will be released.

VRQ 0.96 is released with the following changes

1. Sync up cpufreq util usage.
2. Lock strategy update, which hopefully fixes a potential lock issue during task migration.
3. SMT sensitive scheduling v0.1

The main feature in this release is the first version of SMT sensitive scheduling, which takes 10s off the kernel compile benchmark on my test machine (originally 7m17s) under 50% workload.
You can also easily observe the change in cpu usage: as long as an idle physical core is available, the scheduler will not put a task on the SMT sibling of a busy core. For example, if two tasks are running on a 2-core, 4-thread cpu, one will be on cpu 0 or 1, the other on cpu 2 or 3.
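
The placement rule above can be sketched in a few lines. This is only an illustration and not code from the VRQ patch: the `pick_cpu` helper and the `SIBLINGS` table are assumptions, using the 2-core/4-thread layout from the example, where cpu 0/1 share one core and cpu 2/3 share the other.

```python
# Hypothetical sketch of SMT sensitive placement; not taken from the VRQ patch.
# Assumed topology: 2 cores, 4 threads; cpus 0/1 and cpus 2/3 are SMT siblings.
SIBLINGS = {0: (0, 1), 1: (0, 1), 2: (2, 3), 3: (2, 3)}


def pick_cpu(busy):
    """Return a cpu for a new task, given the set of busy cpus."""
    idle = [cpu for cpu in sorted(SIBLINGS) if cpu not in busy]
    if not idle:
        return None  # nothing idle; a real scheduler would queue the task
    # Prefer an idle cpu whose whole physical core is idle.
    for cpu in idle:
        if all(sibling not in busy for sibling in SIBLINGS[cpu]):
            return cpu
    # Otherwise fall back to an idle SMT sibling of a busy core.
    return idle[0]


first = pick_cpu(set())
second = pick_cpu({first})
print(first, second)  # the two tasks end up on different physical cores
```

Only when every physical core already runs a task does the sketch fall back to an SMT sibling, which is the behaviour you should see in the cpu usage graphs.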

Further improvements to SMT sensitive scheduling will come in the next release. I'll see whether the current design can be improved or simplified.

Enjoy VRQ 0.96 for v4.11 kernel, and unlock your SMT cpu ability with VRQ, :)

Code is available at
https://bitbucket.org/alfredchen/linux-gc/commits/branch/linux-4.11.y-vrq
and also
https://github.com/cchalpha/linux-gc/commits/linux-4.1.y-vrq

All-in-one patch is available too.

BR Alfred 

76 comments:

  1. Hi Alfred,
    I'm testing it on Ryzen and i7. I probably broke something on the i7, because it doesn't boot and halts at "triggering udev events". The ryzen build did fully boot up and I was able to start the test compile job and see the SMT magic in effect by watching core utilization graphs (good job btw), but it gradually crashed and these are the weird logs I was able to catch:

    make[3]: Warning: File 'include/config/auto.conf' has modification time 7023 s in the future
    /bin/sh: fork: retry: Resource temporarily unavailable
    make[3]: fork: Resource temporarily unavailable
    make[3]: warning: Clock skew detected. Your build may be incomplete.

    After that came build errors and "waiting for jobs to finish" messages from the parallel builds, and later only fork errors.
    The ssh session in second terminal crashed too.
    The machine didn't die completely, ping works but ssh doesn't respond anymore.
    I was using gcc 7.0 on previous builds and that was updated to 7.1 quite recently, and I'm not sure if I had a successful vanilla build on that. So that might be the culprit. Maybe the same as on the i7 machine. I will try to compile a vanilla kernel and report if it works (after I get home and reboot the machine).

    Best regards,
    Dzon

    Replies
    1. @Alfred,
      I have confirmed that a freshly compiled vanilla kernel works (it survived another kernel compile and normal use for ~1 hour without problems), while a kernel compiled with the same environment and config, with only the vrq96 patch applied on top, causes problems. It's 4.11.3 on the Ryzen machine and the patch applies cleanly. The problems are random, from a lockup during boot or on the login screen up to a lockup only after some heavy use (a few minutes into a 16-thread kernel compile job). Most of the time the filesystems crash too and the journals are empty, but after a few tries part of the journal managed to get flushed to disk and here it is (only the kernel logs after the first warning, to keep it brief):
      https://pastebin.com/ScCe1cWv

      If you have some idea for debugging approach, please let me know.
      Best regards,
      Dzon

    2. @Dzon
      1) How does the Ryzen cpu work with VRQ095b? If it works well with VRQ095b, there are just three commits added to make it VRQ096, so you can simply apply them one by one and find out which one breaks your system. You can check out the commits at https://bitbucket.org/alfredchen/linux-gc/commits/branch/linux-4.11.y-vrq
      2) How is VRQ096 on your intel i7 system?
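
The "apply them one by one" procedure from point 1) amounts to a linear search for the first bad commit. A minimal sketch, assuming a hypothetical `is_good` callback that stands for whatever manual check is used (boot the kernel, then run a compile job); the commit names are placeholders, not real hashes:

```python
# Hypothetical helper: apply commits one at a time on top of a known-good
# base and report the first commit after which the check fails.
def first_bad_commit(commits, is_good):
    applied = []
    for commit in commits:
        applied.append(commit)
        if not is_good(applied):
            return commit
    return None  # every commit passed the check


# Placeholder commit names; the lambda simulates a manual boot/compile check
# that fails once the third commit is included.
commits = ["commit-1", "commit-2", "commit-3"]
culprit = first_bad_commit(commits, lambda applied: "commit-3" not in applied)
print(culprit)
```

With only three commits a linear walk is enough; for longer series `git bisect` does the same job in fewer steps.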

    3. @Alfred,
      95b didn't lock up on 4.11.0. I will try to apply 95b to 4.11.3 today, and if that works I will try the commits. I didn't test the i7 further, because it wouldn't boot and it's my work computer. I will look into that later if time allows (I can't test it remotely from home).

      BR, Dzon.

    4. @Alfred
      I have run the builds remotely. Started with 95b, that worked. Then I tested commit b13b01c and that worked too.
      Kernel with 95b and commits b13b01c and e970c48 started having problems.
      While watching the cpu utilization graph I noticed that the build job started similarly to 95b, but after a few minutes the cumulative cpu utilization started to drop in steps of 200% (2 cores relative utilization) down to only 2 threads fully utilized. Sadly I didn't manage to grab a screenshot. After that the machine stopped responding. Just before the end I noticed that random processes that shouldn't have (like my ssh daemon) had 100% cpu, and a few cores were at 100% kernel time.

      BR, Dzon.

  2. @Alfred:
    Thank you for the update! It's in use together with a fresh 4.11.4 kernel now. Apart from the kernel changes, which aren't relevant for my HW, only the "vrq: Lock strategy update" would make a difference here, as I'm without SMT (HW not capable and not configured).
    Regarding Dzon's message above: I have no problems with usual compilation (dualcore and make -j2).

    If you have a little time, can you please explain in short your expected effects of the lock strategy update?

    BR, Manuel Krause

    Replies
    1. Thanks for the info, Manuel. There has to be something wrong with my toolchain. I never had such a problem before. Probably some update (both systems compile without errors in the expected time but have problems). I will report back when I get it working.

      Best regards,
      Dzon

    2. @Dzon:
      I've forgotten to add in my post above, that I'm still compiling with an older gcc version:
      "gcc (SUSE Linux) 5.4.1 20170331 [gcc-5-branch revision 246615]"
      Don't know if that matters, just want to let you know about another possible difference vs. your results.

      BR, Manuel Krause

    3. @Manuel thanks. This is gcc (GCC) 7.1.1 20170516, original package for my distro. Before that I used self compiled git version of 7.0.1, where I would expect problems but it worked flawlessly. Before that 6.3 for a long time also without a hitch. Just tried recompiling vanilla kernel on gcc 7.1.1 and it seems to work. I will let it run longer to confirm.

      Best regards,
      Dzon

    4. @Manuel
      The lock strategy update concerns the task_access_xxxx() APIs. A task has an ON_RQ_MIGRATING state, which so far was used only when a cpu goes offline and its tasks are moved to another cpu.
      But with the new SMT sensitive scheduling, tasks will be migrated from one cpu to another much more often; it's a kind of balancing.
      So the lock strategy is updated to be aware of the ON_RQ_MIGRATING state and to wait until the task leaves that state.
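
The "wait until the task leaves the migrating state" idea can be sketched as a retry loop. This is a simplified user-space model under stated assumptions, not the VRQ code: `RunQueue`, `Task`, `lock_task_rq` and the `relax` callback are all hypothetical names, and a Python `threading.Lock` stands in for the run queue spinlock.

```python
import threading

# All names here are hypothetical; this is not the VRQ task access code.
ON_RQ_QUEUED = 1
ON_RQ_MIGRATING = 2


class RunQueue:
    def __init__(self, cpu):
        self.cpu = cpu
        self.lock = threading.Lock()


class Task:
    def __init__(self, cpu):
        self.cpu = cpu
        self.on_rq = ON_RQ_QUEUED


def lock_task_rq(task, rqs, relax=lambda: None):
    """Lock and return the run queue that currently owns `task`.

    If the task is in transit between run queues, drop the lock and retry
    until the migration has finished.
    """
    while True:
        rq = rqs[task.cpu]
        rq.lock.acquire()
        # Re-check under the lock: the task may have moved or be in transit.
        if task.on_rq != ON_RQ_MIGRATING and rqs[task.cpu] is rq:
            return rq
        rq.lock.release()
        relax()  # stand-in for a busy-wait pause; lets the migration progress
```

The caller gets back a locked run queue and is responsible for releasing it. The important property is that nobody operates on a task while it sits between two run queues.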

  3. Hi Alfred, and thank you for this release of VRQ.

    I had a freeze both times I ran my usual 'make -j4 ffmpeg' benchmark. I didn't try a third time. The build freezes, but I'm still able to switch to another tty.
    Here is the error log :
    https://pastebin.com/i41j3zr6

    'make -j1', -j2, -j8 and above are fine though.

    If I find time, I'll try with VRQ 0.95b.

    Pedro

    Replies
    1. Well, I've managed to test VRQ 0.95b earlier than expected.
      All the tests run fine.

      I forgot to post the details of my config:
      GCC7 from Archlinux repo, linux 4.11.4, SMT_NICE disabled.

      Pedro

    2. @Alfred,

      I applied two commits ("smt sensitive scheduling v0.1" and "Sync up cpufreq util usage") to VRQ 0.95 (that is for 4.10) and got the same issue on i7 as Pedro.
      Error is about the same, but just in case it's there: https://pastebin.com/e664MhQD

      Config:
      CONFIG_SCHED_SMT=y
      CONFIG_SMT_NICE=y
      CONFIG_NO_HZ_COMMON=y
      CONFIG_NO_HZ_FULL=y
      CONFIG_NO_HZ_FULL_ALL=y
      CONFIG_NO_HZ=y
      CONFIG_HZ_100=y
      CONFIG_HZ=100

      P.S. Sorry in advance, if posting "backport" related info is inappropriate in this thread

      Br, Eduardo

    3. @Eduardo
      Thanks for testing and trying. But the lock strategy update is a prerequisite commit of "smt sensitive scheduling v0.1", as the latter puts tasks in the "ON_RQ_MIGRATING" state a lot.
      So we need to get the "lock strategy update" right first, then get things moving. :)

  4. This release is abandoned due to lock-up issues reported by users. The lock-up is caused by the "Lock strategy update" commit, which worked well on my working machines throughout my work on SMT sensitive scheduling; that made me believe it was good and stable.


    A new "Lock strategy update" debug patch will be posted here for testing. Once it is confirmed to work well for other users, the respun 096a will be released.

    Replies
    1. O.k., but it's still running fine on here (without the base features of SMT due to the lack of HW capabilities, and also not enabled in .config).
      I'm not completely sure, but your approach seems to change some response times for the better, even with my setup. Subjective impression. I don't know if it's a false positive (maybe I'm not able to hit the regression/progression with my setup; you could judge that better).

      Keep up good work and BR,
      Manuel Krause

    2. @Manuel
      Based on @Eduardo's testing, there must be something wrong with the "lock strategy update" commit, even if it's not triggered on your system and mine.
      I may have overlooked something in the implementation. That needs to be tracked down step by step.

  5. @Alfred,

    I'll try to check whether I can backport those 3 (and the debug patch) to 4.10. Or do you already know that this idea won't work, because the patches for 4.10 and 4.11 differ too much and it's not really doable by simply merging code by hand?

    Thanks and br,
    Eduardo

  6. Hi, all,
    I still can't figure out what's wrong with the lock strategy update commit in VRQ096, so I think I have to do this the hard way: change the code step by step and see which step goes wrong. Luckily, it is not a huge commit.

    So here is the #1 lock strategy debug patch, to be applied on top of the VRQ095b patch. Please try it out and give your feedback, then I'll prepare #2.

    https://bitbucket.org/alfredchen/linux-gc/downloads/lock_strategy_00.patch

    Thanks for testing, :)

    BR Alfred

    Replies
    1. @Alfred,

      since we are talking about task migration, yesterday I finally decided to overclock my Ryzen using custom p-states, so I had to check how it actually behaves frequency wise, so I got to my old habits for quick testing using stress utility.
      I tried VRQ 095b (I have no problems w/ 4.11 on Ryzen) and it behaved nicely: lowest frequency on all CPUs when idle, and frequency raised to max only on the CPUs that actually execute stress tasks. As it was very late at night, I did not run more tests or benchmarks; that was not my concern anyhow, I was concerned about frequency behavior when the system was overclocked.
      Additionally I observed that VRQ plays rather nicely with task migration. Since I was able to run a precisely controlled number of hungry tasks on an idle system, I ran one or two, and they did not really migrate to other CPUs; they stayed on the same one. Which I'd say is very good! Again, I do not know how that will translate to real life and how it impacts interactivity, but I'll keep an eye on it from now on.

      Will try to compile patches and test it, but before I have a question - are You interested in results if I apply patch to 4.10 and run it on i7 @ work?

      Thanks and regards,
      Eduardo

    2. @Eduardo
      What you have observed is correct. In the current design, VRQ is very lazy about migration, which reduces the migration overhead. Migration happens in VRQ when #1 another, higher policy task is queued on other cpus, or #2 for SMT cpu scheduling reasons.

      As for your question, I'd suggest you not risk your production machine for testing. We know which commit causes the issue, so let's fix it first and then move on with the 4.11 release. And have you tested your i7 @work with VRQ095b on a 4.11 kernel?

    3. @Alfred,

      I can not really test i7 + 4.11 as it breaks my resolution handling. The issue is that I have 4K native notebook display and FHD external display, due to linux not really ready for 4K + other resolution at the same time, I run my internal display at FHD to mitigate the non-existing scaling. With 4.10 this setup works fine in Unity, but with 4.11 whenever I connect external display it switches internal display to 4K and I'm not able to change it. I have to plug in and out cable several times to apply FHD on internal display as most of the time I can't change resolution, hitting apply button just does nothing.

      Therefore I'm afraid I'm stuck on 4.10 for i7 unless it's fixed. I tested 4.12 mainline on i7, the same problem with resolution.
      I just checked the diff for 095 and 095b, too many changes, I'm afraid I won't be able to backport it to 4.10.

      For Ryzen that's another story, I can test whatever kernel there, no 4K or additional display :) When I get home and will have some free time, I'll compile 095b + debug patch and see whether it works.

      Br, Eduardo

    4. @Alfred,
      My test succeeded. Boot without warnings and oopses and one full 16 threaded compilation of kernel on Ryzen machine without problem. I have monitored the process with ksysguard and made comparison to better show my previous problem with efficiency:

      http://imgur.com/a/WC4lT
      Top: full kernel build on vrq95b (with debug patch).
      Bottom: the same on a vanilla kernel. The horizontal timescale is the same for both graphs.

      BR, Dzon.

    5. @Alfred
      vrq95b + lock_strategy_00.patch does boot on i7 without problems, too.

      BR, Dzon.

    6. 'make -j4 ffmpeg' ran successfully 3 times in a row with linux 4.11.4 and VRQ 0.95b+lock_strategy_00.patch.

      Pedro

    7. @Alfred:
      I've also tested the VRQ 095b + debug patch with otherwise unchanged setup on my system, and I'd say from first regular use testing, that it works well. Also kernel compilation test was fine.
      (4.11.5+VRQ+BFQ-without WBT)

      BTW, it looks like that the previously discussed i915 improvements will find their way into 4.11.6 as they are in the queue together with other supporting ones. Nice.

      BR, Manuel Krause

    8. @Dzon
      Thanks for sharing your graphic top comparison.
      Do other Ryzen users here (@Eduardo) have the same problem with VRQ095b?

      From your graphic top pics, IMO, that huge difference is likely due to a different setup. What's your setup for the comparison? Do both share the same root file system, use the same copy of the kernel code, and mount it the same way?

    9. @Alfred,
      yes, it is the same machine and root filesystem/partition. The only difference is the kernel. I build the vrq kernel, install it over the currently running one and reboot. Everything else is the same (hardware, OS, environment variables, toolchain, commands to run the build and measure time).
      Interesting is that little part near the end. It seems to be the module compilation part, there it seems to run with perfect efficiency.
      I probably didn't mention it before, but old bfs and early versions of vrq (i think including versions with implemented skiplists) were running better than cfs. Better utilisation, less kernel time and less time taken on most of multithreaded tasks even on i7 (SMT) machine. I was already thinking about finding where the problem started, but was discouraged with the time it would take. With 2 times faster Ryzen I'm going to reconsider.

      BR, Dzon.

  7. @all
    Thank you all for the quick test of lock_strategy_00.patch; it looks like the first step was a good move.

    Here comes the #2 debug patch. It changes just a little bit more and is applied on top of VRQ095b.
    https://bitbucket.org/alfredchen/linux-gc/downloads/lock_strategy_01.patch

    After this, two more debug patches are still planned.

    BR Alfred

    Replies
    1. @Alfred
      Tested on Ryzen machine. Boot ok, one kernel compile job completed without problems. Time 21:33
      I won't be able to test until Tuesday.

      BR, Dzon

    2. 'make -j4 ffmpeg' ran successfully 3 times in a row with linux 4.11.4 and VRQ 0.95b+lock_strategy_01.patch.
      No errors in the logs.

      Pedro

    3. @Alfred,

      I can test them :)
      I tested VRQ 0.95b + lock_strategy_01.patch on Ubuntu gcc 6.3, compiled kernel once, no crash.

      Br, Eduardo

    4. @ Alfred:
      Also on my machine now with 4.11.6 (and the previously mentioned i915 commits are in it) everything works fine.

      BTW, is the current "lock_strategy" testing only about proper functioning or also about performance? (I can only deliver the first.)

      BR, Manuel Krause

  8. @Alfred,

    I have used VRQ 095b + lock strategy 01 for a couple of days, no problems so far. Compilations, everyday usage and games, both native and wine, work fine.

    Br, Eduardo

  9. @all
    Sorry that I was a little busy last weekend.
    Here comes the #3 debug patch, https://bitbucket.org/alfredchen/linux-gc/downloads/lock_strategy_02.patch

    @Manuel
    These debug patches are only meant to track down the lock-up issue in 096.

    BR Alfred

  10. 'make -j4 ffmpeg' ran successfully 6 times in a row with linux 4.11.4 and VRQ 0.95b+lock_strategy_02.patch.
    No errors in the logs.

    Pedro

  11. Boot without warnings. One kernel build completed without errors on kernel 4.11.5 with vrq95b and lock_strategy_02.patch (Ryzen machine).

    BR, Dzon.

  12. Also no issues or anomalies with 4.11.6 & lock_strategy_02.patch with BFQ (core2duo).
    BR, Manuel Krause

  13. @all
    Thank you all for testing. Here comes the final debug patch to find out what causes the lock-up.
    #4 https://bitbucket.org/alfredchen/linux-gc/downloads/lock_strategy_03.patch
    (apply upon 095b)
    Depending on the test results, I may ask for your help to double check the original lock strategy commit on VRQ096. Let's see how this final debug patch goes first.

    BR Alfred

    Replies
    1. For me the test on 4.11.5 vrq95b with lock_strategy_03.patch passed (clean boot and one full kernel build). So I immediately tried compiling 4.11.5 with vrq96, and a few seconds into the kernel build job I got the same:
      Oops: 0003 [#1] PREEMPT SMP

      So lock_strategy_03.patch seems to work, but vrq96 doesn't.
      BR, Dzon.

    2. 'make -j4 ffmpeg' ran successfully 3 times in a row with linux 4.11.4 and VRQ 0.95b+lock_strategy_03.patch.
      No errors in the logs.

      So I tried linux 4.11.4 with, on top of VRQ 095b, the patches from your git repo :
      - Sync up cpufreq util usage + Lock strategy update : ffmpeg builds fine,
      the system froze one time on shutdown, but I was not able to reproduce the issue
      - VRQ 0.96 : ffmpeg build doesn't complete. Same error as above posts

      So I think the combo 'Lock strategy update' + 'smt sensitive scheduling v0.1' might be the problem.
      Let's see what others say.

      Pedro

    3. Still no issues with 4.11.6+ VRQ095b+ lock_strategy_03.patch (core2duo). As I wasn't affected by the BUG, I omit testing the VRQ096 again.
      IMO, Pedro's approach is promising, so that there should be some "cross-testing" with the commit later than the "vrq: Lock strategy update." based on the debug patches.
      But maybe Alfred already prepares new test patches for the "vrq: smt sensitive scheduling v0.1".

      BR, and thank you for your time,
      Manuel Krause

  14. @all
    Here is the double check action.
    Please apply the 096 lock strategy update commit *ON TOP OF* VRQ095b. If you don't want to fetch it from git, I have uploaded it to https://bitbucket.org/alfredchen/linux-gc/downloads/lock_strategy_096.patch

    Test it harder if possible. :)

    BR Alfred

    Replies
    1. Used the bitbucket lock_strategy_096.patch over vrq95b and no problems occurred. Tried multiple reboots and 2 full kernel build jobs. Nothing else has changed from yesterday's tests.

      BR, Dzon.

    2. @Dzon
      Very interesting. Based on your previous test
      >@Alfred
      >I have run the builds remotely. Started with 95b, that worked. Then I tested commit b13b01c and that worked too.
      >Kernel with 95b and commits b13b01c and e970c48 started having problems.
      This indicated that e970c48 is the commit which introduces the lock-up issue, and the bitbucket lock_strategy_096.patch is generated from e970c48.

      Would you please give it more runs to double check whether e970c48 causes the issue or not? Many thanks.

      BR Alfred

    3. I will try, but with vrq96 the build job locked up in 100% of cases. With e970c48
      it successfully ran 2 times. Is it possible the problem is in 4a41e41?

      BR, Dzon.

    4. Well with linux 4.11.4 and VRQ 0.95b + b13b01c + e970c48,
      I ran my whole set of tests without problems (18 ffmpeg builds, bz2 and xz compression, lame and x264).
      No errors in the log and the system shutdown fine.

      Pedro

    5. 3 more kernel builds and some gigs of backup decompression with checksum checks, all without problems.

      BR, Dzon.

    6. @Dzon,

      Since my Ryzen doesn't lock up, at least in kernel compile and web browsing, let's compare systems :)
      Which Ryzen/mobo/ram speed do You have? I have 1700, Asrock x370 Gaming K4, Corsair LPX 3200 CL16 Hynix ram running @2800. Ryzen is p-state overclocked to 3.6GHz (P0), the rest of p-states are auto. No SSD.

      Br, Eduardo

    7. @Eduardo
      1800X, Asrock x370 Taichi with 2.40 bios (agesa 1.0.0.6), 32GB (2*16) Kingston 3000/CL15. After the bios update I am running RAM at 2993/CL15 through the XMP profile (before that I painfully achieved a stable 2400/CL12, with both timings tested with memtest). CPU is not overclocked nor over/undervolted (yet), with AIO liquid cooling. System on an NVMe drive. There is also an LSI HW raid with one big array. All drives use the mq-deadline scheduler since 4.11. AMD Fury X GPU, until recently also an Nvidia 660Ti for GPU passthrough. The latest nouveau was misbehaving for me, but it was blacklisted through all of the tests.

      BR, Dzon.

  15. I can not test this until evening today (GMT+2), but as I discovered earlier I got the same error as everyone else affected with 4a41e41 and b13b01c applied to 4.10.17, it seemed to me that e970c48 is not the cause of the problem, as I specifically left this out because of reported failures.
    I will try to apply freq + lock strategy, leaving out smt sched patch and based on result just apply smt as the last one.
    I haven't tried 096 on Ryzen yet, so probably I'll start with that to confirm whether I have the problem at all :)

    Br, Eduardo

    Replies
    1. @Eduardo
      095b + lock strategy patch would be enough. Let's wait for the test results before coming to a conclusion.

    2. @Alfred,

      I'm back from the midsummer party, so I tried both 096 and 095b + lock strategy 096, compiled the kernel twice and did some other day-to-day tasks, no lockup. It seems that my Ryzen system does not show the problem at all.
      Will try those 2 kernels on my work i7 machine and will leave 096 on Ryzen, lets see what happens...

      Br, Eduardo

    3. @Alfred,

      I did some quick tests on work i7, with 096 I could not even get to the desktop, it froze up while loading desktop after login, with 095b + lock strategy 096, did not lock up yet, couple of hours in use.
      Config: NOHZ_FULL + 300Hz

      One thing to note: while trying to test 096 on the Ryzen system, I discovered that BFQ (together with VRQ) sometimes locks up the system completely. I tried loading some wine games and could lock up the system quite reliably: within at most 5 attempts to load LOL it locks up. No such problems using elevator=deadline. As far as I know some lockups are fixed in the 4.12 kernel, so meanwhile I'm using deadline.
      I write this here to warn You that it might be a good idea to switch off BFQ while testing Alfred's patches to be sure about results.

      Br, Eduardo

  16. @all
    Based on the latest testing results, the "Lock strategy update" commit doesn't introduce the lock-up issue. That matches my own review: I have double checked all possible call paths of the lock APIs and don't see any unexpected scenario that could happen.
    Since @Dzon has the lock-up issue with commit "4a41e41" (vrq: smt sensitive scheduling v0.1), I plan an improvement update for this commit and will release a debug patch ASAP this week.

    BR Alfred

  17. @all
    Please try this patch on top of VRQ096; it uses strict locking when doing smt balancing.
    https://bitbucket.org/alfredchen/linux-gc/downloads/v4.11_vrq096_096a.txt

    BR Alfred

    Replies
    1. @Alfred
      still got
      [ 342.338775] Oops: 0003 [#1] PREEMPT SMP
      It seems to me the machine locked up sooner.
      In the meantime I have found this:
      https://community.amd.com/thread/215773
      There is a chance that your code might be triggering a CPU bug. But still, vanilla works and kernel build on vanilla/vrq95b finishes without error.

      BR, Dzon.

    2. @Dzon
      Could you please send me a copy of your dmesg with locked-up Oops if possible?
      If the compile works well on vanilla, on VRQ095b, and on VRQ095b + 096lock_strategy, the lock-up must be introduced by the new smt scheduling commit. I'd like to take a close look at your dmesg output.

    3. @Alfred
      sorry I only grabbed the part where oops occurred, not the whole dmesg and closed the remote session.
      https://pastebin.com/tVYEhkWt
      As you can see the machine ran a few seconds after that. Previous occurrences halted gradually over a longer period of time.

      BR, Dzon.

    4. @Dzon
      Thanks for the quick reply. Now there are two Oops samples (this one and your previous one). 0003 means a write protection fault; I'll see which scenario could cause the Oops.
      Meanwhile, let's see if this new debug patch improves stability for other users.
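
For readers wondering how "0003" maps to a write protection fault: on x86 the number after "Oops:" is the page-fault error code in hex, and each bit has a fixed meaning. A small decoder sketch follows; the `decode_oops` helper and its wording are just for illustration, only the bit layout comes from the x86 architecture.

```python
# x86 page-fault error code bits, as printed in "Oops: 0003".
# Each entry: (bit mask, meaning if set, meaning if clear or None).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_oops(code_hex):
    """Turn the hex error code of an x86 Oops line into a list of flags."""
    code = int(code_hex, 16)
    parts = []
    for bit, if_set, if_clear in PF_BITS:
        text = if_set if code & bit else if_clear
        if text:
            parts.append(text)
    return parts


print(decode_oops("0003"))
```

For 0003 both the protection bit and the write bit are set and the user bit is clear, so the fault was a kernel-mode write to a page that was present but not writable.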

    5. @Alfred,

      make -j16 on kernel, I got this in my logs (4.11.7+096+096a).
      Bad news, I suppose: https://pastebin.com/PSCqPRfE
      NOHZ_FULL + 300Hz

      Br, Eduardo

    6. @Alfred,

      Sorry to double-post, but usually I use 100Hz kernels, this is 300Hz, could this be that higher Hz actually makes it crash as there is bigger possibility for race conditions...
      Still 15 minutes after that oops, system still did not crash.
      My crash is bit different, maybe due to NOHZ_FULL.

      Anyhow, I can test today and tomorrow, after that I'm gone for about 8 days, on vacation with no computers :)

      Br, Eduardo

    7. @Alfred,

      another one from me while playing a game via wine (this looks similar to Dzon's oops): https://pastebin.com/bqikiZgU

      Br, Eduardo

    8. @all
      Thanks for testing. Last night I was able to reproduce a similar error in the kernel log when I increased NR_CPUS above the number of cpu cores in the kernel config. I am investigating and working on a fix.

    9. 'make ffmpeg -j4' failed to build on linux 4.11.4 at 1000Hz + VRQ0.96 + 096_096a.
      But I don't get any error in the log.

      Pedro

  18. @all
    Here comes the second respun debug patch on top of VRQ096. It fixes an issue when NR_CPUS > real cpu cores, which could lead to a task being scheduled to a non-existent cpu and then to a write protection fault.
    https://bitbucket.org/alfredchen/linux-gc/downloads/v4.11_vrq096_096b.txt

    Hopefully this fixes the issue for most of you.
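
To illustrate how NR_CPUS > real cpu cores can end in picking a non-existent cpu, here is a bitmask sketch. It shows one plausible way such a bug can arise and is not the actual VRQ code or fix: the `NR_CPUS` and `ONLINE_CPUS` values and the `pick_idle_cpu` helper are assumptions made for the example.

```python
# Hypothetical illustration; names and values are assumptions, not VRQ code.
NR_CPUS = 8       # compile-time maximum from the kernel config
ONLINE_CPUS = 4   # cpus that actually exist on the machine

online_mask = (1 << ONLINE_CPUS) - 1  # 0b00001111


def first_cpu(mask):
    """Index of the lowest set bit, or None for an empty mask."""
    return (mask & -mask).bit_length() - 1 if mask else None


def pick_idle_cpu(busy_mask, clamp):
    # Complementing a mask sets every bit up to NR_CPUS, including the
    # bits that belong to cpus which do not exist.
    idle = ~busy_mask & ((1 << NR_CPUS) - 1)
    if clamp:
        idle &= online_mask  # the guard: only consider real cpus
    return first_cpu(idle)


all_busy = online_mask
print(pick_idle_cpu(all_busy, clamp=False))  # a cpu number >= ONLINE_CPUS
print(pick_idle_cpu(all_busy, clamp=True))   # no idle cpu at all
```

When NR_CPUS equals the real cpu count the complement has no stray bits, which would explain why such a problem never shows up on a machine whose config matches its hardware.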

    Replies
    1. @Alfred,
      this seems to work correctly. No warnings or oopses or high kernel cpu time. I think you may have found the culprit. A few boots and one full kernel build without problems. The SMT behaviour looks promising, the time for the full build seems shorter (still higher than cfs, probably caused by the remaining full load problem). I will let it build once more and test some virtual machines. Thanks for the effort.

      BR, Dzon.

    2. @Alfred
      and bad news again. The problem seems to be a different one. While testing the virtual machines, the host crashed after a while. I got a "CPU stuck for.." message and after that fork failures. Partial dmesg:

      https://pastebin.com/FgGtJdqK

      The machine still responds to ping, but no other network service works.

      BR, Dzon.

    3. @Alfred:
      ATM I'm seriously in doubt that I've understood the circumstances for the above mentioned failures correctly -- so I want to ask some questions:

      * Does this only affect SMT enabled (in .config) systems or all, where NR_CPUS > real cpu cores ? {For some reason that I don't recall, I had been running with NR_CPUS=4 all the time, although it's a core2duo without SMT capability.}

      * Is it possible, that when NR_CPUS > real cpu cores --unpatched-- the issue also affects userspace processes, meaning, may lead to protection faults there? {Without your last two patches, I was still suffering from protection faults in random processes after ~8th resume-from-disk. Not earlier and not in kernel processes so far. Maybe quite a different issue, but maybe connected, so I ask. Memtest found my RAM failure-free, BTW.} Nonetheless I'm now testing your latest patch anyways and hope for the best.

      BR, Manuel Krause

    4. @Dzon
      Your new issue is a different one. I can see cgroupxxxx in the trace log, and neither BFS (original) nor VRQ supports cgroups, so I encourage you to disable all CGROUP settings in your kernel config when using VRQ.

    5. @Manuel
      * I can confirm that the issue occurs with SMT enabled and NR_CPUS > real cpus.
      * I can't confirm your second scenario, but in this new 096b debug patch I have added prevention code for other possible unexpected cpumask operations when NR_CPUS > real cpu cores.

    6. I let the vrq96b kernel run since yesterday's reboot and did some light use like browsing and music. I also ran virtual machines a few times (libvirt/qemu). The problem didn't repeat. It might be related to cgroups, but those are dynamically managed by libvirt and I haven't had problems with them before. I hadn't tested virtualisation with vrq96 before, since until vrq96b it locked up every time during the build test and randomly otherwise, so the cgroups problem could not manifest. At least one cause of my previous lockups is definitely resolved.
      Also, I think that later during yesterday's build tests I had a segfault problem with gcc. One build failed and the log had already scrolled out of the terminal buffer. Nothing in the kernel logs, and the subsequent build succeeded. This might be related to the unresolved Ryzen bug mentioned earlier. This machine is memtested, is not overclocked and the CPU temp doesn't go above 50°C at 100% load.

      BR, Dzon.

    7. Update on the last lockup. It didn't occur after multiple boot/shutdown cycles of multiple types of guests and hours of the synthetic y-cruncher benchmark inside a guest. I'm blaming it on instability caused by the gcc bug and not on a problem with vrq (I didn't reboot between the gcc failure and the virtualisation tests where the lockup occurred).

      BR, Dzon.

    8. @Alfred:
      Thank you for your answers to my specific questions, in particular the 2nd one.
      It's now been some months that the segfaults/protection faults have been bugging me; IIRC it started after the migration to openSUSE 42.2. But there's no one to blame ATM.
      Regarding the kernel part, in the meantime until now, I had opted-in and -out so many different possible fixes and .config changes and tried combinations of them, that here it's highly unlikely to have a culprit.
      I also keep the distro's software updated regularly, so I'm hopefully not affected by older bugs.
      My findings so far, besides the above: * It has nothing to do with the count of sequential hibernations (it can occur after the 1st one or the 8th). * It has nothing to do with memory load (much to swap or not with hibernation), including a highly filled /dev/shm or not. * It has nothing to do with Firefox being loaded with too many tabs (21 or 104 make no difference).

      ATM I'm following one idea of mine and continue testing and will report back on either success or failure. BTW, your latest patches do work well on here.

      BR, Manuel Krause

    9. To not bother in the new thread, I'd report my recent efforts on this ugly but more-or-less off-topic issue right here:

      * removed "kmozillahelper": Apparently this package/library led to fault injections into the KDE base, whenever Firefox contents including flash plugin crashed. In the following a cascade of other processes could have page/protection faults (e.g. plasma-shell) leading systemd-coredump to take action and fill my / to the max., making the system stall for ~2mins, and in the worst case making the KDE unusable at all.

      For now 4 days I'm not experiencing page/ protection faults any more and thus no more nasty coredumps, although Firefox still crashed too often. In my observation it was most likely to crash after resume from s2disk and/or a running time of ~1.5 days, but unpredictably.
      To overcome the FF unreliability I first reset its config, then removed and re-added the addons, unfortunately without any stability improvement.

      Today's last step was to replace the FF 52 ESR with the most recent FF 54. Still testing...

      BR, Manuel Krause

  19. @all
    Thank you for testing. It turns out the latest debug patch fixes the issue in VRQ096 for most users, so VRQ096b is officially released. Feel free to discuss in the new post there.
