Wednesday, June 7, 2017

VRQ 0.96 release

This release is abandoned due to lock-up issues reported by users. The lock-up is caused by the "Lock strategy update" commit, which worked well on my working machines and kept working throughout my work on SMT sensitive scheduling, which made me believe it was good and stable.

A new "Lock strategy update" debug patch will be posted here for testing. Once it is confirmed to work well for other users, the refined 096a will be released.

VRQ 0.96 is released with the following changes:

1. Sync up cpufreq util usage.
2. Lock strategy update, which hopefully fixes a potential lock issue during task migration.
3. SMT sensitive scheduling v0.1

The main feature in this release is the first version of SMT sensitive scheduling, which cuts about 10s off the kernel compile benchmark on my test machine (originally 7m17s) under 50% workload.
You can also easily observe the change in cpu usage: while any physical core is available, the scheduler will not put a task on the SMT sibling of a busy core. For example, if two tasks are running on a 2-core, 4-thread cpu, one will be on cpu 0 or 1 and the other on cpu 2 or 3.
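
The placement rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not VRQ code: the sibling map (cpu 0/1 on one core, cpu 2/3 on the other) and the names `SIBLINGS`, `pick_cpu` and `place_tasks` are made up for the example.

```python
from typing import Dict, List, Optional, Set

# Hypothetical topology for a 2-core, 4-thread CPU: cpu 0/1 share one
# physical core and cpu 2/3 share the other.
SIBLINGS: Dict[int, Set[int]] = {0: {1}, 1: {0}, 2: {3}, 3: {2}}


def pick_cpu(busy: Set[int],
             siblings: Dict[int, Set[int]] = SIBLINGS) -> Optional[int]:
    """Pick a CPU for a new task, preferring fully idle physical cores."""
    idle = [cpu for cpu in sorted(siblings) if cpu not in busy]
    # First choice: an idle CPU whose SMT siblings are all idle too.
    for cpu in idle:
        if not (siblings[cpu] & busy):
            return cpu
    # Fallback: any idle CPU, even if it shares a core with a busy sibling.
    return idle[0] if idle else None


def place_tasks(n_tasks: int) -> List[int]:
    """Place n_tasks one after another and return the chosen CPUs."""
    busy: Set[int] = set()
    placement: List[int] = []
    for _ in range(n_tasks):
        cpu = pick_cpu(busy)
        if cpu is None:
            break  # every CPU is already busy
        busy.add(cpu)
        placement.append(cpu)
    return placement
```

With two tasks, `place_tasks(2)` puts one task on each physical core; only from the third task on does a task land next to a busy sibling.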

Further improvements to SMT sensitive scheduling will come in the next release. I'll see whether the current design can be improved or simplified.

Enjoy VRQ 0.96 for the v4.11 kernel, and unlock your SMT cpu ability with VRQ, :)

Code is available at
https://bitbucket.org/alfredchen/linux-gc/commits/branch/linux-4.11.y-vrq
and also
https://github.com/cchalpha/linux-gc/commits/linux-4.1.y-vrq

All-in-one patch is available too.

BR Alfred 

63 comments:

  1. Hi Alfred,
    I'm testing it on Ryzen and i7. I probably broke something on the i7, because it doesn't boot and halts at "triggering udev events". The ryzen build did fully boot up and I was able to start the test compile job and see the SMT magic in effect by watching core utilization graphs (good job btw), but it gradually crashed and these are the weird logs I was able to catch:

    make[3]: Warning: File 'include/config/auto.conf' has modification time 7023 s in the future
    /bin/sh: fork: retry: Zdroj je dočasne neprístupný (translation: Resource temporarily unavailable)
    make[3]: fork: Zdroj je dočasne neprístupný
    make[3]: warning: Clock skew detected. Your build may be incomplete.

    After that there were build errors, parallel builds waiting for jobs to finish, and later only fork errors.
    The ssh session in the second terminal crashed too.
    The machine didn't die completely: ping works but ssh doesn't respond anymore.
    I was using gcc 7.0 on previous builds and that was updated to 7.1 quite recently, and I'm not sure if I had a successful vanilla build on that. So that might be the culprit. Maybe the same as on the i7 machine. I will try to compile a vanilla kernel and report if it works (after I get home and reboot the machine).

    Best regards,
    Dzon

    Replies
    1. @Alfred,
      I have confirmed that a freshly compiled vanilla kernel works (it survived another kernel compile and normal use for ~1 hour without problems), while a kernel compiled with the same environment and config with only the vrq96 patch applied on top causes problems. It's 4.11.3 on the Ryzen machine and the patch applies cleanly. These problems are random, from a lockup during boot or on the login screen to a lockup only after some heavy use (a few minutes into a 16-thread kernel compile job). Most of the time filesystems crash too and journals are empty, but after a few tries part of the journal managed to get flushed to disk, and here it is attached (only kernel logs after the first warning, to keep it brief):
      https://pastebin.com/ScCe1cWv

      If you have some idea for debugging approach, please let me know.
      Best regards,
      Dzon

    2. @Dzon
      1) On the Ryzen cpu, how does it work with VRQ095b? If it works well with VRQ095b, there are just three commits added to make it VRQ096, so you can simply apply them one by one and find out which one breaks your system. You can check out the commits at https://bitbucket.org/alfredchen/linux-gc/commits/branch/linux-4.11.y-vrq
      2) How is VRQ096 on your intel i7 system?
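
The "apply them one by one" procedure can be sketched as a small loop. This is a hedged illustration: `first_bad_commit` and the `is_good` callback are hypothetical names, and `is_good` stands in for the manual steps of building, booting and running the workload on a kernel with the given commits applied.

```python
from typing import Callable, List, Optional


def first_bad_commit(commits: List[str],
                     is_good: Callable[[List[str]], bool]) -> Optional[str]:
    """Apply commits cumulatively, in order, and return the first one
    after which the system stops working (None if all of them pass)."""
    applied: List[str] = []
    for commit in commits:
        applied.append(commit)
        # is_good is an assumption: it represents "build, boot, run the test".
        if not is_good(list(applied)):
            return commit
    return None


# Usage with placeholder commit names (not the real hashes):
culprit = first_bad_commit(["A", "B", "C"], lambda applied: "C" not in applied)
print(culprit)
```

Because the lock-ups reported here are random, a real `is_good` check would probably need to repeat the workload several times before calling a commit good.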

    3. @Alfred,
      95b didn't lock up on 4.11.0. I will try to apply 95b to 4.11.3 today, and if that works I will try the commits. I didn't test the i7 further, because it wouldn't boot and it's my work computer. I will look into that later if time allows (I can't test it remotely from home).

      BR, Dzon.

    4. @Alfred
      I have run the builds remotely. I started with 95b, and that worked. Then I tested commit b13b01c and that worked too.
      The kernel with 95b and commits b13b01c and e970c48 started having problems.
      While watching the cpu utilization graph I noticed that the build job started similarly to 95b, but after a few minutes the cumulative cpu utilization started to drop in steps of 200% (2 cores relative utilization) down to only 2 threads fully utilized. Sadly I didn't manage to grab a screenshot. After that the machine stopped responding. Just before the end I noticed that random processes that shouldn't have (like my ssh daemon) had 100% cpu, and a few cores were at 100% kernel time.

      BR, Dzon.

  2. @Alfred:
    Thank you for the update! It's in use together with a fresh 4.11.4 kernel now. Apart from the kernel changes, which aren't relevant for my HW, only the "vrq: Lock strategy update" would make a difference here, as I'm without SMT (HW not capable and not configured).
    Regarding Dzon's message above: I have no problems with usual compilation (dualcore and make -j2).

    If you have a little time, can you please explain in short your expected effects of the lock strategy update?

    BR, Manuel Krause

    Replies
    1. Thanks for the info, Manuel. There has to be something wrong with my toolchain. I never had such a problem before. Probably some update (both systems compile without errors in the expected time but have problems). I will report back when I get it working.

      Best regards,
      Dzon

    2. @Dzon:
      I've forgotten to add in my post above, that I'm still compiling with an older gcc version:
      "gcc (SUSE Linux) 5.4.1 20170331 [gcc-5-branch revision 246615]"
      Don't know if that matters, just want to let you know about another possible difference vs. your results.

      BR, Manuel Krause

    3. @Manuel thanks. This is gcc (GCC) 7.1.1 20170516, original package for my distro. Before that I used self compiled git version of 7.0.1, where I would expect problems but it worked flawlessly. Before that 6.3 for a long time also without a hitch. Just tried recompiling vanilla kernel on gcc 7.1.1 and it seems to work. I will let it run longer to confirm.

      Best regards,
      Dzon

    4. @Manuel
      The lock strategy update concerns the task_access_xxxx() APIs. A task has an ON_RQ_MIGRATING state, which until now was used only when a cpu goes offline and its tasks are moved to another cpu.
      But with the new SMT sensitive scheduling, tasks will much more likely be migrated from one cpu to another, as a kind of balancing.
      So the lock strategy is updated to be aware of the ON_RQ_MIGRATING state and to wait until the task exits that state.
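
The "wait until the task exits ON_RQ_MIGRATING" idea can be modelled in user space as a retry loop. This is a toy sketch under assumptions, not the VRQ implementation: `RunQueue`, `Task`, `task_access_lock`, `migrate` and `demo` are hypothetical names, Python threads stand in for cpus, and a 50 ms sleep stands in for the migration window.

```python
import threading
import time

ON_RQ_QUEUED = 1
ON_RQ_MIGRATING = 2


class RunQueue:
    def __init__(self, cpu):
        self.cpu = cpu
        self.lock = threading.Lock()


class Task:
    def __init__(self, rq):
        self.rq = rq
        self.on_rq = ON_RQ_QUEUED


def task_access_lock(task):
    """Lock the run queue the task is on, waiting out any migration.

    Returns the locked RunQueue; the caller must release rq.lock.
    """
    while True:
        rq = task.rq
        rq.lock.acquire()
        # Re-check under the lock: the task may have moved or be in flight.
        if task.rq is rq and task.on_rq != ON_RQ_MIGRATING:
            return rq
        rq.lock.release()
        # Wait until the migration finishes, then retry with the new rq.
        while task.on_rq == ON_RQ_MIGRATING:
            time.sleep(0.001)


def migrate(task, dst):
    """Move the task to dst, marking it ON_RQ_MIGRATING in between."""
    with task.rq.lock:
        task.on_rq = ON_RQ_MIGRATING
    time.sleep(0.05)  # window in which the task is on no run queue
    with dst.lock:
        task.rq = dst
        task.on_rq = ON_RQ_QUEUED


def demo():
    rq0, rq1 = RunQueue(0), RunQueue(1)
    task = Task(rq0)
    mover = threading.Thread(target=migrate, args=(task, rq1))
    mover.start()
    # Wait until the migration has started (or has already finished).
    while task.on_rq != ON_RQ_MIGRATING and mover.is_alive():
        time.sleep(0.001)
    rq = task_access_lock(task)
    cpu = rq.cpu
    rq.lock.release()
    mover.join()
    return cpu
```

In `demo()` the locker only ever gets the destination run queue, because it refuses to proceed while the task is in flight.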

  3. Hi Alfred, and thank you for this release of VRQ.

    I had a freeze both times I ran my usual 'make -j4 ffmpeg' benchmark. I didn't try a third time. The build freezes, but I'm still able to switch to another tty.
    Here is the error log :
    https://pastebin.com/i41j3zr6

    'make -j1', -j2, -j8 and above are fine though.

    If I find time, I'll try with VRQ 0.95b.

    Pedro

    Replies
    1. Well, I've managed to test VRQ 0.95b earlier than expected.
      All the tests run fine.

      I forgot to post the details of my config:
      GCC7 from Archlinux repo, linux 4.11.4, SMT_NICE disabled.

      Pedro

    2. @Alfred,

      I applied two commits ("smt sensitive scheduling v0.1" and "Sync up cpufreq util usage") to VRQ 0.95 (that is for 4.10) and got the same issue on i7 as Pedro.
      Error is about the same, but just in case it's there: https://pastebin.com/e664MhQD

      Config:
      CONFIG_SCHED_SMT=y
      CONFIG_SMT_NICE=y
      CONFIG_NO_HZ_COMMON=y
      CONFIG_NO_HZ_FULL=y
      CONFIG_NO_HZ_FULL_ALL=y
      CONFIG_NO_HZ=y
      CONFIG_HZ_100=y
      CONFIG_HZ=100

      P.S. Sorry in advance, if posting "backport" related info is inappropriate in this thread

      Br, Eduardo

    3. @Eduardo
      Thanks for testing and trying. But the lock strategy update is a prerequisite of "smt sensitive scheduling v0.1", as the latter puts tasks in the "ON_RQ_MIGRATING" state a lot.
      So, we need to get the "lock strategy update" right, then get things moving. :)

  4. This release is abandoned due to lock-up issues reported by users. The lock-up is caused by the "Lock strategy update" commit, which worked well on my working machines and kept working throughout my work on SMT sensitive scheduling, which made me believe it was good and stable.

    A new "Lock strategy update" debug patch will be posted here for testing. Once it is confirmed to work well for other users, the refined 096a will be released.

    Replies
    1. O.k., but it's still running fine on here (without the base features of SMT due to the lack of HW capabilities, and also not enabled in .config).
      I'm not completely sure, but your approach seems to change some response times for the better, even with my setup. Subjective impression. I don't know if it's a false positive (maybe I'm not able to hit the regression/progression with my setup; you could read it better).

      Keep up good work and BR,
      Manuel Krause

    2. @Manuel
      Based on @Eduardo's testing, there must be something wrong with the "lock strategy update" commit, even if it's not triggered on your system or mine.
      I may have overlooked something in the implementation. That needs to be tracked down step by step.

  5. @Alfred,

    I'll try to check whether I can backport those 3 (and the debug patch) to 4.10. Or do you already know that this idea won't work, because the patches for 4.10 and 4.11 differ too much and it's not really doable by simple hand merging?

    Thanks and br,
    Eduards

  6. Hi, all,
    I still can't figure out what's wrong with the lock strategy update commit in VRQ096, so I think I have to do this the hard way: change the code step by step and see which step goes wrong. Luckily, it is not a huge commit.

    So here is the #1 lock strategy debug patch, to be applied on top of the VRQ095b patch. Please try it out and give your feedback, then I'll prepare #2.

    https://bitbucket.org/alfredchen/linux-gc/downloads/lock_strategy_00.patch

    Thanks for testing, :)

    BR Alfred

    Replies
    1. @Alfred,

      since we are talking about task migration, yesterday I finally decided to overclock my Ryzen using custom p-states, so I had to check how it actually behaves frequency wise, so I got to my old habits for quick testing using stress utility.
      I tried VRQ 095b (I have no problems w/ 4.11 on Ryzen) and it was fine and nice when idle (lowest freq on all CPUs) and raised the frequency to max only on the CPUs which actually execute stress tasks. As it was very late in the night, I did not run more tests or the benchmarks; anyhow, that was not my concern, I was concerned about frequency behavior when the system was overclocked.
      Additionally I observed that VRQ plays rather nice with task migration: since I was able to run a precisely controlled number of hungry tasks on an idle system, I ran one or two, and they did not really migrate to other CPUs, they stayed on the same one. Which I'd say is very good! Again, I do not know how that will translate to real life and how it impacts interactivity, but I'll keep an eye on it from now on.

      Will try to compile patches and test it, but before I have a question - are You interested in results if I apply patch to 4.10 and run it on i7 @ work?

      Thanks and regards,
      Eduardo

    2. @Eduardo
      What you have observed is correct. In the current design, VRQ is very lazy about migration, which reduces the migration overhead. Migration happens in VRQ #1 when other, higher policy tasks are queued on other cpus, and #2 for SMT scheduling reasons.

      As for your question, I'd suggest you not risk your production machine for testing, as we know which commit causes the issue. Let's fix it first, so we can move on with the 4.11 release. And have you tested your i7 @work with VRQ095b on the 4.11 kernel?

    3. @Alfred,

      I can not really test i7 + 4.11 as it breaks my resolution handling. The issue is that I have 4K native notebook display and FHD external display, due to linux not really ready for 4K + other resolution at the same time, I run my internal display at FHD to mitigate the non-existing scaling. With 4.10 this setup works fine in Unity, but with 4.11 whenever I connect external display it switches internal display to 4K and I'm not able to change it. I have to plug in and out cable several times to apply FHD on internal display as most of the time I can't change resolution, hitting apply button just does nothing.

      Therefore I'm afraid I'm stuck on 4.10 for i7 unless it's fixed. I tested 4.12 mainline on i7, the same problem with resolution.
      I just checked the diff for 095 and 095b, too many changes, I'm afraid I won't be able to backport it to 4.10.

      For Ryzen that's another story, I can test whatever kernel there, no 4K or additional display :) When I get home and will have some free time, I'll compile 095b + debug patch and see whether it works.

      Br, Eduardo

    4. @Alfred,
      My test succeeded. Boot without warnings and oopses and one full 16 threaded compilation of kernel on Ryzen machine without problem. I have monitored the process with ksysguard and made comparison to better show my previous problem with efficiency:

      http://imgur.com/a/WC4lT
      Top: full kernel build on vrq95b (with debug patch).
      Bottom: the same on the vanilla kernel. The horizontal timescale is the same for both graphs.

      BR, Dzon.

    5. @Alfred
      vrq95b + lock_strategy_00.patch does boot on i7 without problems, too.

      BR, Dzon.

    6. 'make -j4 ffmpeg' ran successfully 3 times in a row with linux 4.11.4 and VRQ 0.95b+lock_strategy_00.patch.

      Pedro

    7. @Alfred:
      I've also tested the VRQ 095b + debug patch with otherwise unchanged setup on my system, and I'd say from first regular use testing, that it works well. Also kernel compilation test was fine.
      (4.11.5+VRQ+BFQ-without WBT)

      BTW, it looks like that the previously discussed i915 improvements will find their way into 4.11.6 as they are in the queue together with other supporting ones. Nice.

      BR, Manuel Krause

    8. @Dzon
      Thanks for sharing your graphic top comparison.
      Does the other Ryzen user here (@Eduardo) have the same problem with VRQ095b?

      From your graphic top pics, IMO, that huge difference is likely due to a different setup. What's your setup for the comparison? Do both share the same root file system, use the same copy of the kernel code, and mount it the same way?

    9. @Alfred,
      yes, it is the same machine and root filesystem/partition. The only difference is the kernel. I build the vrq kernel, reinstall it over the currently running one and reboot. Everything else is the same (hardware, OS, environment variables, toolchain, commands to run the build and measure time).
      Interesting is that little part near the end. It seems to be the module compilation part, there it seems to run with perfect efficiency.
      I probably didn't mention it before, but old bfs and early versions of vrq (i think including versions with implemented skiplists) were running better than cfs. Better utilisation, less kernel time and less time taken on most of multithreaded tasks even on i7 (SMT) machine. I was already thinking about finding where the problem started, but was discouraged with the time it would take. With 2 times faster Ryzen I'm going to reconsider.

      BR, Dzon.

  7. @all
    Thank you all for the quick test of lock_strategy_00.patch. It looks like the first step was a good move.

    Here comes the #2 debug patch, which changes just a little bit and is applied on top of VRQ095b.
    https://bitbucket.org/alfredchen/linux-gc/downloads/lock_strategy_01.patch

    After this, two more debug patches are planned.

    BR Alfred

    Replies
    1. @Alfred
      Tested on Ryzen machine. Boot ok, one kernel compile job completed without problems. Time 21:33
      I won't be able to test until Tuesday.

      BR, Dzon

    2. 'make -j4 ffmpeg' ran successfully 3 times in a row with linux 4.11.4 and VRQ 0.95b+lock_strategy_01.patch.
      No errors in the logs.

      Pedro

    3. @Alfred,

      I can test them :)
      I tested VRQ 0.95b + lock_strategy_01.patch on Ubuntu gcc 6.3, compiled kernel once, no crash.

      Br, Eduardo

    4. @ Alfred:
      Also on my machine now with 4.11.6 (and the previously mentioned i915 commits are in it) everything works fine.

      BTW, is the current "lock_strategy" testing only about proper functioning or also about performance? (I can only deliver the first.)

      BR, Manuel Krause

  8. @Alfred,

    I have used VRQ 095b + lock strategy 01 for couple of days, no problems so far. Compilations, everyday usage and games, both native and wine, work fine.

    Br, Eduardo

  9. @all
    Sorry that I was a little busy last weekend.
    Here comes the #3 debug patch, https://bitbucket.org/alfredchen/linux-gc/downloads/lock_strategy_02.patch

    @Manuel
    These debug patches are just to find out the lock-up issue in 096.

    BR Alfred

  10. 'make -j4 ffmpeg' ran successfully 6 times in a row with linux 4.11.4 and VRQ 0.95b+lock_strategy_02.patch.
    No errors in the logs.

    Pedro

  11. Boot without warnings. One kernel build completed without errors on kernel 4.11.5 with vrq95b and lock_strategy_02.patch (Ryzen machine).

    BR, Dzon.

  12. Also no issues or anomalies with 4.11.6 & lock_strategy_02.patch with BFQ (core2duo).
    BR, Manuel Krause

  13. @all
    Thank you all for testing. Here comes the final debug patch to find out what causes the lock-up.
    #4 https://bitbucket.org/alfredchen/linux-gc/downloads/lock_strategy_03.patch
    (apply on top of 095b)
    Depending on the test results, I may ask for your help to double check the original lock strategy commit in VRQ096. Let's see how this final debug patch goes first.

    BR Alfred

    Replies
    1. For me the test on 4.11.5 vrq95b with lock_strategy_03.patch passed (clean boot and one full kernel build). So I immediately tried compiling 4.11.5 with vrq96 and a few seconds into kernel build job i got the same:
      Oops: 0003 [#1] PREEMPT SMP

      So lock_strategy_03.patch seems to work, but vrq96 doesn't.
      BR, Dzon.

    2. 'make -j4 ffmpeg' ran successfully 3 times in a row with linux 4.11.4 and VRQ 0.95b+lock_strategy_03.patch.
      No errors in the logs.

      So I tried linux 4.11.4 with, on top of VRQ 095b, the patches from your git repo :
      - Sync up cpufreq util usage + Lock strategy update : ffmpeg builds fine,
      the system froze one time on shutdown, but I was not able to reproduce the issue
      - VRQ 0.96 : ffmpeg build doesn't complete. Same error as above posts

      So I think the combo 'Lock strategy update' + 'smt sensitive scheduling v0.1' might be the problem.
      Let's see what others say.

      Pedro

    3. Still no issues with 4.11.6+ VRQ095b+ lock_strategy_03.patch (core2duo). As I wasn't affected by the BUG, I omit testing the VRQ096 again.
      IMO, Pedro's approach is promising, so that there should be some "cross-testing" with the commit later than the "vrq: Lock strategy update." based on the debug patches.
      But maybe Alfred already prepares new test patches for the "vrq: smt sensitive scheduling v0.1".

      BR, and thank you for your time,
      Manuel Krause

  14. @all
    Here is the double check action.
    Please apply the 096 lock strategy update commit *ON TOP OF* VRQ095b. If you don't want to fetch it from git, I have uploaded it to https://bitbucket.org/alfredchen/linux-gc/downloads/lock_strategy_096.patch

    Test it harder if possible. :)

    BR Alfred

    Replies
    1. Used the bitbucket lock_strategy_096.patch over vrq95b and no problems occurred. Tried multiple reboots and 2 full kernel build jobs. Nothing else has changed from yesterday's tests.

      BR, Dzon.

    2. @Dzon
      Very interesting. Based on your previous test
      >@Alfred
      >I have ran the builds remotely. Started with 95b, that worked. Then I tested commit b13b01c and that worked too.
      >Kernel with 95b and commits b13b01c and e970c48 started having problems.
      This indicated that e970c48 is the commit which introduced the lock-up issue, and the bitbucket lock_strategy_096.patch is generated from e970c48.

      Would you please give it more runs to confirm whether e970c48 causes the issue or not? Many thanks.

      BR Alfred

    3. I will try, but with vrq96 the build job locked up in 100% of cases. With e970c48 it successfully ran 2 times. Is it possible the problem is in 4a41e41?

      BR, Dzon.

    4. Well with linux 4.11.4 and VRQ 0.95b + b13b01c + e970c48,
      I ran my whole set of tests without problems (18 ffmpeg builds, bz2 and xz compression, lame and x264).
      No errors in the log and the system shutdown fine.

      Pedro

    5. 3 more kernel builds and some gigs of backup decompression with checksum verification, without problems.

      BR, Dzon.

    6. @Dzon,

      Since my Ryzen doesn't lock up, at least in kernel compile and web browsing, let's compare systems :)
      Which Ryzen/mobo/ram speed do You have? I have 1700, Asrock x370 Gaming K4, Corsair LPX 3200 CL16 Hynix ram running @2800. Ryzen is p-state overclocked to 3.6GHz (P0), the rest of p-states are auto. No SSD.

      Br, Eduardo

    7. @Eduardo
      1800X, Asrock x370 Taichi with 2.40 bios (agesa 1.0.0.6), 32GB (2*16) Kingston 3000/CL15. After the bios update I am running RAM at 2993/CL15 through the XMP profile (before that I painfully achieved a stable 2400/CL12, with both timings tested with memtest). CPU is not overclocked nor over/undervolted (yet), with AIO liquid cooling. System on NVMe drive. There is also an LSI HW RAID with one big array. All drives use the mq-deadline scheduler since 4.11. AMD Fury X GPU, until recently also an Nvidia 660Ti for GPU passthrough. The latest nouveau was misbehaving for me, but it was blacklisted through all of the tests.

      BR, Dzon.

  15. I can not test this until evening today (GMT+2), but as I discovered earlier I got the same error as everyone else affected with 4a41e41 and b13b01c applied to 4.10.17, it seemed to me that e970c48 is not the cause of the problem, as I specifically left this out because of reported failures.
    I will try to apply freq + lock strategy, leaving out smt sched patch and based on result just apply smt as the last one.
    I haven't tried 096 on Ryzen yet, so probably I'll start with that to confirm whether I have the problem at all :)

    Br, Eduardo

    Replies
    1. @Eduardo
      095B + lock strategy patch would be enough. Let's wait for the test results before coming to a conclusion.

    2. @Alfred,

      I'm back from the midsummer party, so I tried both 096 and 095b + lock strategy 096, compiled the kernel twice and did some other day to day tasks, no lockup. It seems that my Ryzen system does not show the problem at all.
      Will try those 2 kernels on my work i7 machine and will leave 096 on Ryzen, lets see what happens...

      Br, Eduardo

    3. @Alfred,

      I did some quick tests on work i7, with 096 I could not even get to the desktop, it froze up while loading desktop after login, with 095b + lock strategy 096, did not lock up yet, couple of hours in use.
      Config: NOHZ_FULL + 300Hz

      One thing to note: while trying to test 096 on the Ryzen system, I discovered that BFQ (together with VRQ) sometimes locks up the system completely. I tried loading some wine games and could lock up the system quite reliably: within at most 5 attempts to load LOL it locks up. No such problems using elevator=deadline. As far as I know some lockups are fixed in the 4.12 kernel, so meanwhile I'm using deadline.
      I write this here to warn You that it might be a good idea to switch off BFQ while testing Alfred's patches to be sure about results.

      Br, Eduardo
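
For testers who want to confirm which I/O scheduler is actually active before a run, here is a small hedged sketch. It relies on the sysfs file `/sys/block/<device>/queue/scheduler`, which lists the available schedulers with the active one in square brackets; the function names and the sample strings are made up for illustration.

```python
import re
from pathlib import Path
from typing import Optional


def active_scheduler(sysfs_line: str) -> Optional[str]:
    """Return the scheduler shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_line)
    return match.group(1) if match else None


def read_active_scheduler(device: str) -> Optional[str]:
    """Read the active scheduler of a block device such as 'sda' from sysfs."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


# Sample line as it might appear in sysfs (illustrative, not from this thread):
print(active_scheduler("noop [deadline] cfq"))
```

Calling `read_active_scheduler` with a device name (for example `sda`, as an assumption about the tester's disk) before each test run makes it easier to rule BFQ in or out when comparing results.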

  16. @all
    Based on the latest testing results, the "Lock strategy update" commit doesn't introduce the lock-up issue. I have also double checked all possible call paths of the lock APIs and don't see any unexpected scenario that could happen.
    Since @Dzon has the lock-up issue with commit "4a41e41" (vrq: smt sensitive scheduling v0.1), I plan an improvement update for this commit and will release a debug patch ASAP this week.

    BR Alfred

  17. @all
    Please try this patch on top of VRQ096, which uses strict locking when doing smt balancing.
    https://bitbucket.org/alfredchen/linux-gc/downloads/v4.11_vrq096_096a.txt

    BR Alfred

    Replies
    1. @Alfred
      still got
      [ 342.338775] Oops: 0003 [#1] PREEMPT SMP
      It seems to me the machine locked up sooner.
      In the meantime I have found this:
      https://community.amd.com/thread/215773
      There is a chance that your code might be triggering a CPU bug. But still, vanilla works and kernel build on vanilla/vrq95b finishes without error.

      BR, Dzon.

    2. @Dzon
      Could you please send me a copy of your dmesg with locked-up Oops if possible?
      If the compile works well on vanilla, on VRQ095B, and on VRQ095b + 096lock_strategy, the lock-up should have been introduced by the new smt scheduling commit. I'd like to take a close look at your dmesg output.

    3. @Alfred
      sorry I only grabbed the part where oops occurred, not the whole dmesg and closed the remote session.
      https://pastebin.com/tVYEhkWt
      As you can see the machine ran a few seconds after that. Previous occurrences halted gradually over a longer period of time.

      BR, Dzon.

    4. @Dzon
      Thanks for the quick reply. Now there are two Oops samples (this one and your previous one). 0003 means a write protection fault; I'll see whether any scenario could cause this Oops.
      Meanwhile, let's see if this new debug patch improves stability for other users.
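
For readers wondering how "0003" maps to a write protection fault: on x86, the number after "Oops:" for a page fault is the hexadecimal page-fault error code, where bit 0 means the page was present (protection violation), bit 1 means a write, and bit 2 means user mode. A minimal decoding sketch follows; the function name and the wording of the labels are made up for the example, and it only applies to page-fault Oopses.

```python
# (mask, meaning if the bit is set, meaning if it is clear)
X86_PF_BITS = [
    (0x01, "protection violation (page was present)", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_code(code_str):
    """Decode the hex error code printed after 'Oops:' for an x86 page fault."""
    code = int(code_str, 16)
    parts = []
    for mask, if_set, if_clear in X86_PF_BITS:
        if code & mask:
            parts.append(if_set)
        elif if_clear is not None:
            parts.append(if_clear)
    return parts


for line in decode_oops_code("0003"):
    print(line)
```

For "0003" bits 0 and 1 are set and bit 2 is clear, i.e. a kernel-mode write to a page that is present but not writable.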

    5. @Alfred,

      make -j16 on kernel, I got this in my logs (4.11.7+096+096a).
      Bad news, I suppose: https://pastebin.com/PSCqPRfE
      NOHZ_FULL + 300Hz

      Br, Eduardo

    6. @Alfred,

      Sorry to double-post, but usually I use 100Hz kernels, this is 300Hz, could this be that higher Hz actually makes it crash as there is bigger possibility for race conditions...
      Still 15 minutes after that oops, system still did not crash.
      My crash is bit different, maybe due to NOHZ_FULL.

      Anyhow, I can test today and tomorrow, after that I'm gone for about 8 days, on vacation with no computers :)

      Br, Eduardo

    7. @Alfred,

      another one from me while playing a game via wine (this looks similar to Dzon's oops): https://pastebin.com/bqikiZgU

      Br, Eduardo
