Based on the new implementation of skip list, I am adding -VRQ commits upon it. All commits will be pushed in linux-4.7.y-sl branch. Unlike previous -vrq release, I am releasing additional checkpoint tags, so users can use these tags to check their issue and narrow code changes which introducing the issue.
4. 4.7_0472_sl_new_gc and all in one patch.
This patch include all former -gc commits, most important one is the "
Full cpumask based and LLC sensitive cpu selection" commit, which help with performance under low workload.
5. 4.7_0472_sl_new_vrq and all in one patch.
This is not the same as VRQ3 patch, but only include two major feature "VRQ solution" and "preempt task solution". I want tag a checkpoint here before the latest stick/cache code changes as it is not yet finalized.
First one is an embedded rewritten version of CK's original skip list in 0480 bfs. It is similar to the code changes what CK has made in 0497, but my implementation are different in
1. Just keep one data structure, that's skiplist_node.
2. Remove kmalloc/kfree completely.
3. Remove value in skiplist_node and use container_of() to obtain the pointer of structure which skiplist_node is embedded in.
A customized search function should be defined using DEFINE_SKIPLIST_INSERT macro and be used for skip list insert operation.
Random Level should be customized implemented and set to node->level then pass to the customized skiplist_insert function.
Levels start at zero and go up to (NUM_SKIPLIST_LEVEL -1).
NUM_SKIPLIST_LEVEL in this implementation is 8 instead of origin 16, considering that there will be 256 entries to enable the top level when using random level p=0.5, and that number is more than enough for a run queue in a scheduler usage. And it also help to reduce the memory usage of the embedded skip list node in task_struct to about 50%.
Based on testing, the first 8 bits in microseconds of niffies are suitable for random level population. find_first_bit() is used to satisfy p = 0.5 between each levels, and there should be platform hardware supported instruction(known as ctz/clz) to speed up this function.
The skiplist level for a task is populated when task is created and doesn't change in task's life time. When task is being inserted into run queue, this skiplist level is set to task's sl_node->level, the skiplist insert function may change it based on current level of the skip lsit.
And there is a lot enhancement in insert/delete function.
The hackbench and kernel compilation sanity both show improvement comparing the baseline version.
It's great that CK release huge changes for BFS in 0480 release, the changes not only include cpufreq trigger and cpufreq governor change in recent kernel release but also the most fun part, use skip list to replace the bitmap mask linked queue in global run queue data structure. May be there are many changes at a time, so BFS 0480 is not settle down yet, most issue are related to cpufreq trigger and cpufreq governor code. I was planed to rework the about 60 -vrq commits upon 0480, but now I'd had to wait till it is calmed down. Meanwhile, I have done some tests for the skiplist design in BFS. That means only three patches in 0480 is used, they are
bfs472-fix_set_task_cpu.patch skiplists.patch bfs472-skiplist.patch For my kernel compilation sanity tests, sl is almost as same as bfs0472, only different is sl is a little bit better than 0472 at 300% workload. For hackbench(from https://lwn.net/Articles/351058/), 20 groups, 0472 finished at about 4sec, sl finished at 3.6sec. Consider -vrq(0472 based) finished at 3.8sec and cfs at 3.2sec, I can't wait to see how it goes when sl combine with -vrq. The skiplist design for BFS, IMO, as an initial release, it looks pretty good and very potential. Here are my comments for the skiplist in 0480 #1 In earliest_deadline_task(), RT tasks is now impacted by interactive setting. IMO, it should keep the same behaviors as it is at 0472. #2 When interactivity is enabled and consider both normal policy tasks and idle policy tasks are in the queue, for example tasks in queue are in this order (N1, N2, D1, D2), in current implement, it's possible that earliest_deadline_task() go through all 4 tasks to pick up one among them. But normal tasks should always has higher priority than idle ones, so just go through the tasks with normal policy(that's the N1, N2 in this example) will be enough. #3 kmalloc is used for current skip list implementation. It is not a good design for critical code like a scheduler. I have rewriten the skip list implementation to embedded the data structure into the task_struct etc. (PS, CK address this in 0497, meanwhile I have rewritten an embedded implementation too) #4 Currently, there are same probability that randomLevel() returns 0~15. But "where an element in layer i appears in layer i+1 with some fixed probability p (two commonly used values for p are 1/2 or 1/4)" -- from https://en.wikipedia.org/wiki/Skip_list , so there should be some improvement to adjust the random level.
I have finished #1 and #2 propose enhancement patch and take it as a baseline for my rest SL related embedded/improvement skip list implementation patches. Once the code has been cleaned up, I will push the rest patches to the linux-4.7.y-sl branch in my git repositories.