Monday, February 9, 2015

VRQ: About Issue found in 0.3

Issue
Thanks to Manuel for testing -vrq and reporting an issue: while a background workload (SCHED_BATCH, nice 19) is running, one of the CPUs fails to pick up normal/system tasks.
The details of the issue can be found at http://cchalpha.blogspot.com/2014/12/vrq-03-updates.html

Cause
After a few rounds of asking Manuel about his usage, I was finally able to reproduce the issue, and a bisect showed that the commit "bfs: vrq: RQ niffy solution." introduced it.
The root cause is that the difference between the niffies of the CPUs grows very large as the system keeps running; I recorded a difference of 6+ seconds after 16+ hours of uptime on my system. As a result, tasks running on the CPU with the lower niffy get earlier deadlines than tasks on the other CPUs, so that CPU fails to pick up other normal/system load.
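
To make the effect concrete, here is a minimal sketch of the deadline arithmetic. It is a simplified model, not the actual -vrq code: it assumes a deadline is the per-CPU niffy plus a priority-dependent offset, with a 6 ms rr_interval and an offset that grows by roughly 10% per nice level. The names prio_offset_ns(), task_deadline() and batch_picked_first() are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL
#define NSEC_PER_SEC  1000000000ULL

/* Hypothetical model: offset added to the clock to form a deadline.
 * user_prio 0 is nice -20, user_prio 39 is nice 19. */
static uint64_t prio_offset_ns(int user_prio, unsigned int rr_interval_ms)
{
    uint64_t ratio = 128;               /* ratio for the highest normal priority */

    for (int i = 0; i < user_prio; i++)
        ratio = ratio * 11 / 10;        /* roughly 10% more per nice level */

    return ratio * rr_interval_ms * (NSEC_PER_MSEC / 128);
}

/* Deadline computed from the niffy of the CPU the task runs on. */
static uint64_t task_deadline(uint64_t cpu_niffies, int user_prio)
{
    return cpu_niffies + prio_offset_ns(user_prio, 6 /* assumed rr_interval */);
}

/* Returns 1 if a nice 19 task on a lagging CPU gets an earlier deadline
 * than a nice 0 task stamped with the clock of an up-to-date CPU. */
static int batch_picked_first(uint64_t skew_ns)
{
    uint64_t fast_niffies = 16ULL * 3600 * NSEC_PER_SEC;  /* 16 hours of uptime */
    uint64_t slow_niffies = fast_niffies - skew_ns;       /* lagging CPU */

    uint64_t batch_deadline  = task_deadline(slow_niffies, 39); /* nice 19 */
    uint64_t normal_deadline = task_deadline(fast_niffies, 20); /* nice 0 */

    return batch_deadline < normal_deadline;
}
```

In this model the offset for nice 19 is only about a quarter of a second, so batch_picked_first(0) is 0 while batch_picked_first(6 * NSEC_PER_SEC) is 1: a skew of several seconds swamps the priority offset, and the background task on the lagging CPU always looks more urgent.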

Solution
Simply reverting the commit fixes the issue, but that goes against the intention of the commit "bfs: vrq: RQ niffy solution.", which is to make update_clocks() grq lock free.
After testing, I found that rq->clock, which is based on sched_clock_cpu(), is stable on my system and that the differences among CPUs are small enough for deadline calculation. So I am giving rq->clock a try and removing the sched_clock sanity checks. I also noticed that CK added the sanity check for the "crazy sched_clock interface", so there may be unexpected behavior on some hardware, especially old machines. If such behavior turns out to still be common, I will add some kind of sanity check back.
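
For reference, a sanity check of this kind could look roughly like the sketch below. This is not CK's code and not what -vrq will ship; it only illustrates the idea of clamping a raw clock delta to what the elapsed tick count makes plausible. The function name sane_clock_delta(), the NS_PER_TICK constant and the 1000 Hz tick are assumptions made for the example.

```c
#include <stdint.h>

#define NS_PER_TICK 1000000LL   /* assumes a 1000 Hz tick; illustration only */

/* Clamp a raw clock delta (ns) to the range implied by the number of
 * ticks that elapsed since the last update. */
static int64_t sane_clock_delta(int64_t raw_delta_ns, int64_t ticks_elapsed)
{
    /* At least (ticks - 1) full ticks must have passed, never less than 0. */
    int64_t min_ns = ticks_elapsed > 0 ? (ticks_elapsed - 1) * NS_PER_TICK : 0;
    /* At most (ticks + 1) full ticks can have passed. */
    int64_t max_ns = (ticks_elapsed + 1) * NS_PER_TICK;

    if (raw_delta_ns < min_ns)
        return min_ns;          /* clock went backwards or stalled */
    if (raw_delta_ns > max_ns)
        return max_ns;          /* clock jumped forward implausibly */
    return raw_delta_ns;
}
```

A check like this keeps a misbehaving clock source from moving the deadline base by more than about one tick away from what the tick counter reports.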

The new solution will be posted with the upcoming 3.19 -vrq branch. As this bug was found on -vrq, I would like the -vrq branch to stay on its own for one more release before merging it into the -gc branch.

9 comments:

  1. Hi Alfred, I'm definitely looking forward to the new -VRQ release with the "new solution".

    Best regards,
    Manuel

    1. Just a side question/ asking: Would it be possible for you to provide the fixed versions of the omitted patches (bfs: vrq: dedicated xxxx_schedule(),
      bfs: vrq, refactory wake_up_new_task,
      bfs: vrq: RQ niffy solution) for the 3.18.x kernel series, e.g. in your -vrq repository? This would be really nice :-)

      Manuel

  2. Hi Alfred!
    By coincidence I found your updated repository 3.19.y-new before you had even announced it. I've tested the 43 patches and am not happy with this solution: when the CPU is 100% loaded (e.g. with worldcommunitygrid, you remember), every desktop interaction slows down extremely. Video playback is unusable, with lost frames and audio stuttering. This happens no matter what scheduling policy the wcg client is running at (BATCH, IDLE or NORMAL).
    Sidenote: For my test I've omitted No. 25 "[PATCH] bfs: xxxx_schedule() stat debug." {linux-gc-9cb48030eb5503ec72bbd01c73665debbcd6a8c6.diff} and carefully removed the related references in the following patches. Don't you experience a similar behaviour on your system? Do you have any idea what may be going on here? I'm going to rebuild the kernel with your original patches to exclude any errors that I may have introduced and report back.

    Best regards,
    Manuel

    1. No, sorry, just the same wrong behaviour with all 43 original patches. 100% CPU load leads to massive stuttering everywhere on the system/desktop.

      Manuel

  3. Mmmh, I've now tried the latest two add-on patches on top, but they don't fix the behaviour. It may be a bit better now, but there is still stuttering with video playback, KDE menus, etc. To put it humorously: ship it with the tag "Improves snappiness. But don't ever move your mouse!".

    I've now reverted the patches:
    0d2b828 bfs: preempt task, v1.1.
    6c9722c bfs: cache task solution, fix.
    48016a2 bfs: preempt task, v1
    c98993f bfs: cache task solution, v1

    => and the resulting kernel is running very well. Maybe snappier than original BFS/CK.

    So the issue is introduced in those mentioned commits.

    Best regards, and keep up your good work,
    Manuel Krause

    1. Sorry for the late reply, I was out of town again last week. The -new branch is used to sync the git tree between my desktop and notebook when I code away from home.

      I didn't mean to publish the -new branch. Yes, as your testing shows, the last four commits introduced something new that is not well tested. The first two of them were finished before I left last week, and basic tests showed improvements for both 50% and 100% workloads.

      While I was working on my notebook with this kernel, I did notice lag when compiling a kernel. As I am using a very lightweight DE, the desktop was still usable.

      The commit most likely to have introduced the lag is the cache task solution, which is a replacement for the original sticky task solution in BFS but is designed to be LLC sensitive. It seems to me that the deadline adjustment is too aggressive.

      Because my notebook lacks the environment and tools for testing, I have to wait until I am home to tune the design.

      Thanks again for testing these experimental commits; hopefully it didn't waste much of your time.

  4. BTW, a question regarding the introduced statistics and debug code: Does it cause much CPU overhead?

    Manuel

    1. It doesn't cause much overhead. I used this information to understand the call model.

  5. Thank you for your answers that make me understand things better.

    If you'd say it's helpful for you that I test your code then it doesn't mean a waste of time for me. :-)

    BR, Manuel
