Wednesday, November 26, 2014

-gc updates and VRQ benchmark in 3.17

BFS 0458 has been released for 3.17. I have synced -gc up with it, then rebased the -gc branch onto v3.17.3 and then v3.17.4; you can fetch them from my git.

With BFS 0458 released, I kicked off another round of testing to compare BFS 0458, the baseline of the VRQ patch, and the VRQ patch itself.
The highlighted results are:
a. Under low workload, Baseline and VRQ are ~3 seconds (1%) faster than original BFS.
b. Under heavy workload, VRQ is ~5 seconds (3%) faster than original BFS and the baseline code.
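For reference, here is the arithmetic behind those percentages, taking the median of the ten runs in each table below (read off by eye):

#1 (50% ratio): 5m20.8s vs 5m17.8s for VRQ, i.e. 320.8s - 317.8s = ~3.0s, and 3.0 / 320.8 = ~0.9%, so ~1%.
#2 (150% ratio): 2m56.35s vs 2m51.41s for VRQ, i.e. 176.35s - 171.41s = ~4.9s, and 4.9 / 176.35 = ~2.8%, so ~3%.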

It has been kind of busy these days. I have tried some new ideas based on the VRQ solution, but the results aren't that good, so I have spent most of my time debugging and haven't updated the blog. :)

Based on the VRQ test results, I believe it is a good direction to go, especially as it shows less overhead under heavy load than original BFS. Recently I have been thinking about improving BFS/VRQ by reducing grq usage, but the debug code proved there is an existing bug in the current VRQ code, so now I have to rework the whole VRQ code all over again to debug it. In the coming 0.3 release of VRQ, I therefore plan to fix the known issues in the vrq branch and clean up the code, which will help stabilize the VRQ code and will also be helpful for new ideas based on the VRQ solution. In the meantime, I have updated the -gc branch and put detailed comments into the bfs commits (where needed); hopefully CK will pick some of them up in the next BFS release.

Stay tuned. :D

#1 Jobs/Cores ratio 50%

3.17_0458_50
5m20.933s
5m20.896s
5m20.882s
5m20.867s
5m20.849s
5m20.743s
5m20.663s
5m20.621s
5m20.340s
5m20.251s

3.17_Baseline_50
5m18.065s
5m18.019s
5m17.956s
5m17.953s
5m17.941s
5m17.896s
5m17.891s
5m17.839s
5m17.826s
5m17.794s

3.17_VRQ_50
5m17.897s
5m17.856s
5m17.852s
5m17.788s
5m17.759s
5m17.748s
5m17.721s
5m17.673s
5m17.642s
5m17.634s


#2 Jobs/Cores ratio 150%

3.17_0458_150
2m56.468s
2m56.379s
2m56.377s
2m56.367s
2m56.364s
2m56.335s
2m56.322s
2m56.321s
2m56.261s
2m56.184s

3.17_Baseline_150
2m56.472s
2m56.407s
2m56.341s
2m56.312s
2m56.303s
2m56.293s
2m56.289s
2m56.288s
2m56.259s
2m56.202s

3.17_VRQ_150
2m51.553s
2m51.516s
2m51.473s
2m51.472s
2m51.448s
2m51.379s
2m51.358s
2m51.339s
2m51.338s
2m51.292s

#3 Jobs/Cores ratio 100%

3.17_0458_100
2m52.031s
2m50.718s
2m50.663s
2m50.650s
2m50.624s
2m50.603s
2m50.559s
2m50.541s
2m50.485s
2m50.468s

3.17_Baseline_100
2m50.945s
2m50.719s
2m50.683s
2m50.639s
2m50.602s
2m50.597s
2m50.583s
2m50.530s
2m50.521s
2m50.486s

3.17_VRQ_100
2m49.896s
2m49.675s
2m49.670s
2m49.658s
2m49.604s
2m49.538s
2m49.501s
2m49.480s
2m49.463s
2m49.457s

#4 Jobs/Cores ratio 100%+50%

3.17_0458_100_50
2m52.026s
2m51.384s
2m51.151s
2m51.039s
2m50.983s
2m50.901s
2m50.863s
2m50.852s
2m50.810s
2m50.699s

3.17_Baseline_100_50
2m52.161s
2m51.007s
2m50.934s
2m50.864s
2m50.861s
2m50.857s
2m50.842s
2m50.817s
2m50.789s
2m50.733s

3.17_VRQ_100_50
2m49.801s
2m49.783s
2m49.740s
2m49.730s
2m49.716s
2m49.703s
2m49.700s
2m49.666s
2m49.662s
2m49.606s

Tuesday, November 4, 2014

What's new in the 3.17 -gc patch set

In the last kernel release cycle, I played around with some ideas for BFS, such as how to select a task to run, sticky tasks, etc. The results were not as good as expected, but all this experimenting got me to know the BFS code better. So there will be no huge changes in the BFS VRQ solution branch; I may need another round to think it all over again. Instead, some changes that can apply on top of the original bfs grq-lock solution have been back-ported to the -gc branch.

3e678e6 bfs: sync with mainline sched_setscheduler() logic.
-- A new parameter structure, sched_attr, was introduced in 3.14 for sched_setscheduler() and related functions; BFS had not fully synced with these changes yet.
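For reference, here is roughly what that mainline interface looks like from userspace: scheduling parameters travel in a struct sched_attr and are applied with the sched_setattr() syscall (which has no glibc wrapper, so it goes through syscall()). A minimal sketch; the struct is declared locally since older libc headers do not ship it, and SYS_sched_setattr needs 3.14+ kernel headers:

```
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>        /* SCHED_NORMAL and friends */

/* struct sched_attr per the 3.14 ABI; declared here because older
 * libcs do not provide it (very new headers may already define it). */
struct sched_attr {
	unsigned int size;             /* size of this structure */
	unsigned int sched_policy;     /* SCHED_NORMAL, SCHED_FIFO, ... */
	unsigned long long sched_flags;
	int sched_nice;                /* for SCHED_NORMAL/SCHED_BATCH */
	unsigned int sched_priority;   /* for SCHED_FIFO/SCHED_RR */
	/* the remaining fields are only used by SCHED_DEADLINE */
	unsigned long long sched_runtime;
	unsigned long long sched_deadline;
	unsigned long long sched_period;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_NORMAL;
	attr.sched_nice = 10;          /* renice the current task */

	/* pid 0 means the calling task; the last argument is flags */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");
	return 0;
}
```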

f3a98f9 bfs: Refactory online_cpus() checking in try_preempt().
9c72372 bfs: Refactory needs_other_cpu().
-- Two changes in try_preempt().

65f06e8 bfs: priodl to speed up try_preempt().
-- Introduced task priodl to speed up the try_preempt() calculation: its first 8 bits are the task prio, and its remaining 56 bits are the higher 56 bits of the task deadline.
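Conceptually, the packing looks like the minimal sketch below (names and exact shifts are illustrative, not necessarily the actual -gc code). Because prio sits in the top bits, one unsigned 64-bit compare gives "better prio first, earlier deadline as tie-breaker":

```
#include <stdint.h>

/* Pack prio into the top 8 bits and the high 56 bits of the
 * deadline into the low 56 bits of one 64-bit value. */
static inline uint64_t make_priodl(int prio, uint64_t deadline)
{
	return ((uint64_t)prio << 56) | (deadline >> 8);
}

/* One compare instead of two: a lower prio value dominates, and an
 * earlier deadline wins when the prios are equal. */
static inline int preempts(uint64_t a_priodl, uint64_t b_priodl)
{
	return a_priodl < b_priodl;
}
```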

1fa3bb5 bfs: Full cpumask based and LLC sensitive cpu selection.
-- Rewrote the cpu selection logic for tasks to be fully cpumask-based. The benefit of cpumask-based calculation is that the cost does not scale with the number of cpus, as long as the count stays within one machine word (64 cpus on 64-bit systems, 32 cpus on 32-bit systems). The cost of transferring tasks among cpus which share the same LLC (Last Level Cache) should be considered free.
The best_mask_cpu() selection logic now follows the order below (a rough code sketch follows below):
* Non-scaled same cpu the task originally ran on
* Non-scaled SMT siblings of that cpu
* Non-scaled cores/threads sharing the last level cache
* Scaled same cpu the task originally ran on
* Scaled SMT siblings of that cpu
* Scaled cores/threads sharing the last level cache
* Non-scaled cores within the same physical cpu
* Non-scaled cpus/cores within the local NODE
* Scaled cores within the same physical cpu
* Scaled cpus/cores within the local NODE
* All available cpus
To implement the full cpumask calculation, a non_scaled_cpumask is introduced into the grq structure. The plug-in code in the cpufreq and intel_pstate drivers is also modified to avoid multiple triggers when scaling down from max cpu frequency (the intel_pstate changes only pass compile testing, as I have no hardware that runs on the intel_pstate driver).
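To illustrate the ordered-mask idea, here is a rough sketch, assuming each cpu gets a precomputed per-level array of candidate masks built from the topology at boot (select_masks and NR_SELECT_LEVELS are hypothetical names for this example, not the actual -gc identifiers):

```
#include <linux/cpumask.h>

#define NR_SELECT_LEVELS 11	/* one level per bullet in the list above */

/* Hypothetical per-cpu table: level 0 is "non-scaled same cpu",
 * level 1 "non-scaled SMT siblings", ... level 10 "all cpus".
 * Assumed to be precomputed from the topology at boot. */
extern struct cpumask select_masks[NR_CPUS][NR_SELECT_LEVELS];

/* Walk the levels in preference order and return the first allowed
 * cpu. Each step is a fixed-cost mask AND, so the walk stays cheap
 * as long as the cpu count fits in one machine word. */
static int best_mask_cpu_sketch(int task_cpu, const struct cpumask *allowed)
{
	int level;
	unsigned int cpu;

	for (level = 0; level < NR_SELECT_LEVELS; level++) {
		cpu = cpumask_any_and(&select_masks[task_cpu][level], allowed);
		if (cpu < nr_cpu_ids)
			return cpu;
	}
	return task_cpu;	/* fallback: stay where we are */
}
```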

Here are the test results of 3.17-gc. Compared to the vrq-02-baseline test results, in low workload these patches give another 3~4 seconds of improvement.

#1  3.17.2-gc 50% tasks/cores ratio
5m18.630s
5m18.509s
5m18.504s
5m18.494s
5m18.489s
5m18.487s
5m18.481s
5m18.461s
5m18.383s
5m18.339s

#2 3.17.2-gc 150% tasks/cores ratio
2m56.475s
2m56.447s
2m56.418s
2m56.410s
2m56.402s
2m56.367s
2m56.349s
2m56.197s
2m56.173s
2m56.128s

#3 3.17.2-gc 100% tasks/cores ratio
2m52.318s
2m50.698s
2m50.654s
2m50.596s
2m50.578s
2m50.543s
2m50.534s
2m50.508s
2m50.437s
2m50.430s

#4 3.17.2-gc 100%+50% tasks/cores ratio
2m52.115s
2m51.246s
2m50.892s
2m50.880s
2m50.847s
2m50.819s
2m50.805s
2m50.804s
2m50.789s
2m50.645s

If you want to try these new patches, please check my linux-3.17.y-gc git branch.