Tuesday, November 17, 2015

New vrq patch for v4.3

I'd like to wait for v4.3.1 before officially releasing the new gc and vrq code, but it looks like a test patch would be welcome before that.

Here it comes: please download vrq_v4.3_0465_2.patch from the bitbucket download page; it contains the bfs 0465 rebase and part 3 of the task caching commit. Feel free to give it a try and report back.

PS, I got feedback from a user who reports that vrq gives a better experience with wine gaming, e.g. smoother mouse movement. It turns out that vrq's initial idea of reducing grq lock sessions helps.

BR Alfred

21 comments:

  1. ... yes, very welcome indeed! :-)))
    Thank you, Alfred! And I really like your quick reactions to users' wishes.

    It's up and running very well so far and now in my "reliability testing" ;-) No anomalies, except for... maybe... it's even snappier than before. Great work!

    BR Manuel

  2. Thx, Alfred, runs fine

  3. I know... not enough uptime so far, but the first testing hours show that something has greatly improved with this VRQ patch. :-)))
    Atm I want to especially point out that the reliability of hibernation with TuxOnIce has increased to 100%. By that I mean, in practice, that each resume from hibernation (1.) succeeded and (2.) succeeded at the first attempt. I never had such a row of 5 (up to now) working hibernations since kernel ~4.2.3+vrq (if at all). I've kept my original .config for this one, changed the memory/shm/swap load for each hibernation, and need to add the remark that my testing of plain BFS 0465+BFQ hasn't shown that high reliability.

    You can't estimate how thankful I am, Alfred!
    This patch minimizes the risk of headaches, hair loss ;-) and, of course, wasted lifetime.

    BR Manuel

    Replies
    1. As I already assumed, the report above was written too early.
      Also this revision of VRQ can fail on TuxOnIce's resumes from hibernation -->on my system<--. And yes, the failures needn't be caused by the BFS/VRQ scheduler. What I can say from my humble tally-sheet statistics is only that this VRQ version greatly reduces resume failures vs. previous VRQs and vs. plain BFS for me.
      For two days now I've been planning to test more internal settings for the i915 module and xorg.conf parameters and some of the TuxOnIce sysconfig options -- to reduce the remaining risk of resume failures.

      What I don't understand, @ Alfred: How can the success rate increase with your new VRQ patch vs. the old -- if it's NOT scheduler related?

      BR Manuel Krause

    2. The delta between 0463 and 0465 may contribute to the improved rate, as you report that bfs 0465 works better than the -gc branch. And the vrq code seems to work better than gc, right? So that contributes as well.
      I don't know the real cause of this trouble, but I will try to make the code work in the correct way, and hope that will help to solve the issues in a common way.

    3. Your summary is exact and thoughtful.
      This issue itself drives me mad. (Not your fault.) Yesterday afternoon I thought I had found some settings that raised the rate to 10/10 (ten of ten) consecutive successful resumes without any retries(!) -- each time only the content of /dev/shm and the browser tabs and playing videos changed -- up to and including this morning's resume. So I thought I had configured my system in the right way.
      Later on, at noon, hibernating again, the system decided it to be the "odd" day... Rows and rows of needed resume retries and very, very rare successes. No settings changed.

      Of course, you can only cover your code. And I know that you're (and have always been) doing your very best to make it work correctly.

      Maybe there's another function in all the related resume code including TuxOnIce & i915 that wants correct return values...
      Another, maybe crude, idea: is there a way to speed up CPU1 (my second core here) coming online within the resume process? Either by parameters or in the code? What leads me to this idea: my sticking resumes always fail at a stage where the GFX should be (re)set and/or the second of the two cores (CPU1) should come up, in order to let TuxOnIce begin the planned read-in of caches, and both can fail.
      Only a suggestion for more thoughts from your side.

      My very kind regards and thanks for your work.
      Manuel

    4. Tonight I've even tried the same setup + test scenario, but with a ->CFS<- compiled kernel (apart from that, the same .config). It also fails at the same resume stage, and I stopped testing in the first round after 25 failed resume attempts (no successes). Just to be sure.
      BR Manuel

    5. If it fails with CFS, then most likely it is not a scheduler related issue, IMO.

    6. Yes, thank you, Alfred. This is also my conclusion for the moment. Proving that was the reason I re-checked with CFS. Most probably I won't bother you with that issue again ;-)

      Unfortunately I haven't found many bug reports that pinpoint this issue to the i915 code. I've read one or two in the bugzillas that show the same failure behaviour (without TuxOnIce), which would slightly indicate a timing issue in the i915 resume code (not scheduler related, not TuxOnIce caused => gfx restore related). IMO.

      Best regards, and thanks for your assistance,
      Manuel

  4. Sadly this one is crashing for me. If I get the system under load and play a video, it crashes; it seems sound related -- video without sound works, with sound the system crashes.
    It's a USB sound card. I will set up netconsole to provide a stack trace.

  5. So here we go, crash log:
    http://pastebin.com/XUW5iuBY
    .config:
    http://pastebin.com/k87NgRCk
    The older vrq0 patch works without that problem.

    Replies
    1. I have done a quick check of your config file and found that CONFIG_SMT_NICE is enabled, which I have never tried myself and do not suggest for gc and vrq. And unfortunately there is a bug in the current vrq related to CONFIG_SMT_NICE, which may cause unexpected results if it is enabled. So, simply disable CONFIG_SMT_NICE and see if this fixes the issue.

      BR Alfred

    2. From the crash log and the assembly code of bfs.c, I am pretty sure it's caused by the bare "return;" in task_preemptable_rq() when CONFIG_SMT_NICE is enabled. The fixed code should look like the snippet below; you can give it a try, but I can't guarantee that SMT_NICE works well.

      task_preemptable_rq(struct task_struct *p, int only_preempt_idle)
      {
      ...
      #ifdef CONFIG_SMT_NICE
              if (!smt_should_schedule(p, target_cpu))
                      return NULL;
      #endif

      PS, sorry that I have merged this fix into a previous commit, so I can't provide you with a simple patch file.

      BR Alfred

    3. The related code, for which you've shown a fix, is also in the "old" vrq0 patch. @Anonymous doesn't have problems with the "old" patch.
      I hope that I didn't miss some #ifdef ... #endif lines.

      BR Manuel Krause

    4. Using a bare "return" instead of returning a value in a function that requires a return value means the caller does get a value, but it's unpredictable. It may happen to be zero, and maybe it's always zero, but once it's not, it's a mess.

      PS, I didn't intend to return no value: I changed the return type of this function, but since I never tried CONFIG_SMT_NICE, there was no compile warning to catch my attention that I missed this one.
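
      Just to illustrate the effect (a minimal userspace sketch only, not the actual bfs.c code; pick_rq() and this struct rq are made up for the example), the caller simply reads whatever happens to be left in the return register:

      #include <stdio.h>

      struct rq { int cpu; };

      /* sketch of the bug: a pointer-returning function with a bare "return;" */
      static struct rq *pick_rq(int smt_blocked)
      {
              if (smt_blocked)
                      return;      /* BUG: no value -- caller reads leftover register contents */
              return NULL;         /* fix: explicit NULL, caller reliably sees "nothing found" */
      }

      int main(void)
      {
              /* may print (nil) by luck, or a garbage address that a real
                 caller would then dereference and crash on */
              printf("%p\n", (void *)pick_rq(1));
              return 0;
      }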

    5. Just for information: this fixed the crash.

    6. That's good. Thanks for testing. And does SMT_NICE work as expected on vrq?

    7. Mmm. @Alfred: Sometimes your English is unreadable:
      Did you mean:
      Using simply "return" instead of "return VALUE" in a function that gives back a VALUE confuses the caller regarding the VALUE?
      Anyway: has it been only a coincidence that @Anonymous didn't face this issue with the =same= earlier vrq0 code?

      BR Manuel

    8. Yes. He must have been very lucky that the caller always got a zero in the earlier vrq kernel builds.

    9. CONFIG_SMT_NICE was not set in the earlier vrq patch -- it wasn't available there.

    10. @Anonymous:
      Thank you for coming back to clarify this!
      BR Manuel
