Thursday, August 27, 2015

The BFS unplugged IO issue

We have been tracing the unplugged IO issue for the last two weeks; most of the discussion is in the replies to a-big-commit-added-to-41-vrq

At first I thought that

"I guess sched_submit_work() doesn't work for BFS because BFS uses grq_lock instead of the task_lock() used in mainline, which is a combination of the task's pi_lock and rq->lock; the tsk_is_pi_blocked(tsk) check alone is not enough for BFS."

After investigation, it turns out that tsk_is_pi_blocked() was introduced in v3.3:
3c7d518 sched/rt: Do not submit new work when PI-blocked
and it does not indicate that tsk->pi_lock is held, as I used to think it did.

So the question is back again: when sched_submit_work() was introduced in mainline 3.1, it moved the blk_schedule_flush_plug(tsk) call out of schedule(), but relaxed the checks that guard the call. This code change is fine for mainline CFS but somehow not for BFS.
Adding those checks back is the current solution. The last patch for this issue is unchanged. I'll update the -gc and -vrq branches soon to include it.

BR Alfred


  1. Hi Alfred Chen,

    I have some trouble understanding your branches.
    There is a gc branch and a vrq branch; what's the difference?
    How am I supposed to try your BFS modifications? I can't find a patch like the one ck provides.
    I could clone your repo, but then it's somewhat difficult to see all your modifications across the kernel sources.

    1. In short, the -gc branch holds stable commits; the -vrq branch is more like a feature or experimental branch.
      For the 4.1 release I have provided an all-in-one patch file; you can check my post.
      But the patches that came out after it are not included; you need to apply them manually.

    2. I see,

      the current gc branch is highly unstable for me.
      - oops on boot
      - I was able to boot once; some tests I did before it froze suggest a larger latency than BFS

    3. Do you pick up
      It's not in the all-in-one patch file yet.

    4. I'm speaking of the current gc branch (linux-4.1.y-gc) in your repo; it has this patch included

    5. Do you have the oops trace log or a screenshot I could have a look into?

    6. I don't have time at the moment; I'll try to send you some logs later

    7. I tried your vrq branch; it doesn't crash for me, but unfortunately throughput is very bad on it.

      time echo "scale=4000; a(1)*4" | bc -l
      real 0m28.884s
      user 0m28.826s
      sys 0m0.001s

      taskset -c 0 time echo "scale=4000; a(1)*4" | bc -l
      real 0m15.502s
      user 0m15.488s
      sys 0m0.001s

      I took a quick look into your idle-CPU selection. Are you really trying to place a task on an SMT sibling first, before trying a real core?

    8. More precisely:
      an idle SMT sibling on the same CPU before a completely idle CPU (= one with no threads running on it)?

    9. @Anonymous
      The good news is that this is the first time I've heard -vrq is more stable than -gc. :)
      If you still get a chance to try -gc, please help catch the oops log; it will be helpful for finding undiscovered issues in the code.

      Your findings are valuable. The story is, at the time I wrote the code, it was purely from the cache-distance point of view, and I didn't have an SMT machine to test and improve the code. Now that I have such an SMT machine, I'll see how the code behaves, and I have put it on the todo list for the next release, as you may have seen in my previous post.

      I have done a quick test following the command lines you provided, but the results on my Ivy Bridge CPU are quite close to each other.
      time echo "scale=4000; a(1)*4" | bc -l
      real 0m10.969s
      user 0m10.952s
      sys 0m0.002s

      time taskset -c 0 echo "scale=4000; a(1)*4" | bc -l
      real 0m10.950s
      user 0m10.935s
      sys 0m0.001s

      So, what's your system workload when running this test, and have you enabled SMT_NICE?

      BR Alfred

    10. My bachelor thesis was about massively parallel sorting algorithms, for which I did some SMT benchmarks. I could only get speedups of about 10%-15% by using SMT, even though my algorithm scaled nearly perfectly.

      Concerning my throughput benchmarks:
      You're right, something is wrong on my system.
      I don't use a frequency scaler, so the cores should do the clocking without OS support, but that works only on core 0; if I bind the benchmark to any core other than zero, the core reclocks in a strange way. Then I tested with the performance governor: same results. Then ondemand, and it started to reclock in the right way. It's very strange that if I first enable ondemand and then the performance governor, it also reclocks in the right way, even when running on the performance governor. This happens when using CFS as well. So maybe there is a bug in Linux, or my BIOS is doing strange things. After this, I'm getting the same results as you, so it's not a scheduler problem.

    11. Here we go

    12. Would you please try

      This is the 3rd time this kind of issue has been reported. The patch removes the resched_closest_idle() calls (two in total). If it works for you, would you please try removing just one of them, to see which one needs to go, or whether both must be removed to make it work?

      BR Alfred