Monday, February 9, 2015

VRQ: About Issue found in 0.3

Issue
Thanks to Manuel for testing -vrq and reporting an issue: while a background (SCHED_BATCH, nice 19) workload is running, one of the cpus fails to pick up normal/system tasks.
The details of the issue can be found at http://cchalpha.blogspot.com/2014/12/vrq-03-updates.html

Cause
After a few rounds of asking Manuel about his usage, I was finally able to reproduce the issue, and used bisect to find that the commit "bfs: vrq: RQ niffy solution." introduced it.
The root cause is that the difference in niffy between cpus grows very large as the system keeps running; I recorded a difference of 6+ seconds after 16+ hours of uptime on my system. This causes tasks which run on the cpu with the lower niffy to get earlier deadlines than others, so that cpu fails to pick up other normal/system load.
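
To illustrate why a per-cpu niffy skew of this size matters, here is a minimal sketch of the deadline comparison. It is a simplified model, not kernel code: it assumes a BFS-style deadline of "niffy plus a priority-dependent offset", with the offset growing about 10% per nice level and a 6 ms rr_interval. The function names and constants are hypothetical and chosen only for the example.

```python
# Simplified model of per-cpu niffy skew; NOT the kernel implementation.
RR_INTERVAL_NS = 6_000_000  # assumed 6 ms round-robin interval


def deadline_offset_ns(nice):
    """Assumed offset: grows about 10% per nice level, starting at nice -20."""
    return int(RR_INTERVAL_NS * 1.1 ** (nice + 20))


def deadline_ns(cpu_niffy_ns, nice):
    """Virtual deadline stamped from the niffy of the cpu the task runs on."""
    return cpu_niffy_ns + deadline_offset_ns(nice)


SKEW_NS = 6_000_000_000            # the 6-second difference reported above
niffy_slow = 16 * 3600 * 10**9     # cpu whose niffy lags behind
niffy_fast = niffy_slow + SKEW_NS  # cpu whose niffy runs ahead

batch_on_slow = deadline_ns(niffy_slow, 19)  # background task, nice 19
normal_on_fast = deadline_ns(niffy_fast, 0)  # normal task, nice 0

# With the skew, the nice 19 task has the earlier deadline.
batch_wins = batch_on_slow < normal_on_fast
print(batch_wins)
```

In this model the nice 19 offset is a few hundred milliseconds, so once the skew reaches seconds it dominates the comparison and the background task always looks more urgent than a normal task stamped on another cpu.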

Solution
Simply reverting the commit fixes the issue, but that goes against the intention of the commit "bfs: vrq: RQ niffy solution.", which is to make update_clocks() grq lock free.
After testing, I found that rq->clock, which is based on sched_clock_cpu(), is stable on my system, and the differences among cpus are small enough for deadline calculation. So I gave rq->clock a try and removed the sched_clock sanity checks. I also noticed that CK added the sanity check for a "crazy sched_clock interface", so there may be some unexpected behavior on some hardware, especially old machines. If this unexpected behavior is still common, I will add some kind of sanity check back.
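
For readers unfamiliar with this kind of sanity check, the sketch below shows one possible shape: a clock delta is accepted only if it is plausible for the number of ticks that have elapsed, and is otherwise replaced by a conservative value. This is an illustration under assumptions, not the code that was removed; the function name, the HZ value and the exact bounds are hypothetical.

```python
# Illustrative clock sanity check; NOT the actual BFS/-vrq code.
HZ = 1000                    # assumed tick rate
NS_PER_JIFFY = 10**9 // HZ   # nanoseconds per tick


def sane_clock_delta_ns(clock_delta_ns, jiffies_delta):
    """Clamp a sched_clock delta to what the elapsed ticks allow."""
    # Lower bound: one tick less than observed, but never below 1 ns.
    if jiffies_delta > 1:
        min_ns = (jiffies_delta - 1) * NS_PER_JIFFY
    else:
        min_ns = 1
    # Upper bound: one tick more than observed.
    max_ns = (jiffies_delta + 1) * NS_PER_JIFFY
    # An implausible delta falls back to the lower bound.
    if clock_delta_ns < min_ns or clock_delta_ns > max_ns:
        return min_ns
    return clock_delta_ns
```

A check like this needs a shared tick reference to compare against, which is the trade-off when the goal is to keep update_clocks() free of the grq lock.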

The new solution will be posted with the upcoming 3.19 -vrq branch. As this bug was found on -vrq, I would like the -vrq branch to stay on its own for one more release before merging it into the -gc branch.