How the grq lock currently works
Currently, BFS uses a single grq (global run queue) lock for almost everything, including task info access, rq structure access and grq access. It is a simple design, but every operation on a task/rq/grq needs to acquire the grq lock.
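Conceptually, the current design looks like the sketch below (simplified, not the actual BFS source; the real structure has many more fields):

    #include <linux/spinlock.h>

    /*
     * Simplified sketch of the current BFS design: one global run queue
     * whose single lock serializes task, rq and grq operations alike.
     */
    struct global_rq {
            raw_spinlock_t lock;    /* THE grq lock */
            /* ... queued tasks, global clocks, accounting, etc. ... */
    };

    static struct global_rq grq;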
The run queue lock
In VRQ, a raw_spinlock will be added to the current BFS run queue structure, to guard run queue data accesses across cpus. Meanwhile, local run queue data access is still safe with preemption and interrupts disabled.
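A minimal sketch of that change (field placement is illustrative; the actual layout is in the patches):

    #include <linux/sched.h>
    #include <linux/spinlock.h>

    struct rq {
            raw_spinlock_t lock;            /* new in VRQ: guards this rq's data
                                             * when it is touched from other cpus */
            struct task_struct *curr;       /* task currently running on this cpu */
            /* ... the rest of the existing per-cpu BFS rq fields ... */
    };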
Lock strategy for cross-cpu data access
After the run queue lock is introduced, the "all grq" locking strategy is changed as below:
a. To access task structure info: if the task is currently running, the task's rq lock needs to be held, which prevents the task from switching out of its cpu/rq; otherwise, the grq lock needs to be held, which prevents the task from being taken from or put into grq.
task_access_lock...()/task_access_unlock...() inlined routines are introduced for this purpose (see the sketch after this list).
b. To access the run queue structure, the rq's lock needs to be held.
For the convenience of accessing BOTH the task's structure and the task's rq structure, task_vrq_lock...()/task_vrq_unlock...() inlined routines are introduced.
c. To access the grq structure, the grq lock needs to be held.
rq_grq_lock...()/rq_grq_unlock...() inlined routines are introduced for this purpose.
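A hedged sketch of what these helpers might look like; the real routines are in the VRQ commits listed at the end of this post, and the retry loop, lock ordering and exact signatures here are assumptions (callers are assumed to already have interrupts disabled):

    #include <linux/sched.h>
    #include <linux/spinlock.h>

    /* a. Lock whatever currently "owns" task p: its rq if running, else grq. */
    static inline void task_access_lock(struct task_struct *p)
    {
            for (;;) {
                    if (task_running(p)) {
                            struct rq *rq = task_rq(p);

                            raw_spin_lock(&rq->lock);
                            /* recheck: p may have switched away meanwhile */
                            if (task_running(p) && task_rq(p) == rq)
                                    return;
                            raw_spin_unlock(&rq->lock);
                    } else {
                            grq_lock();     /* BFS's existing grq lock helper */
                            if (!task_running(p))
                                    return;
                            grq_unlock();
                    }
            }
    }

    /* c. Take both this rq's lock and the grq lock; the ordering is assumed. */
    static inline void rq_grq_lock(struct rq *rq)
    {
            raw_spin_lock(&rq->lock);
            raw_spin_lock(&grq.lock);
    }

    static inline void rq_grq_unlock(struct rq *rq)
    {
            raw_spin_unlock(&grq.lock);
            raw_spin_unlock(&rq->lock);
    }

In the same spirit, task_vrq_lock...()/task_vrq_unlock...() would take the task's rq lock together with whatever lock item a. requires.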
Minimize grq locking sessions
Applying the above locking strategy does not automatically improve performance. Instead, it adds locking overhead, which hurts performance. The gain comes from identifying and narrowing the grq sessions, both in the number of sessions and in the run time spent inside them.
To keep things simple and free of surprises when reworking the grq sessions, a refactoring approach is used. That means changing the code itself but *NOT* the BFS functionality logic, as far as possible.
First Stage
In this stage, the first priority is to apply the new locking strategy to BFS and make it work. The second priority is refactoring schedule(), as it is the hot spot of the scheduling core code. For the rest of the code, a conservative rule is used.
There are 4 concepts in schedule():
a. the rq/cpu which schedule() happens on.
b. prev task, the task the rq is running (rq->curr) at the moment schedule() starts.
c. next task, the task the rq will run after schedule().
d. grq, which stores the queued tasks; next comes out of it, and prev is put back into it when a switch happens.
Changes in schedule()
First of all, the grq lock is replaced by rq->lock wherever the new locking strategy applies.
In the first part of schedule(), from the need_resched label to the blk_schedule_flush_plug() block, most of the code is safe under rq lock guarding, since only the prev task's data structure is accessed.
Only try_to_wake_up_local() needs the grq lock to be held in this part. A debug test shows it is not a hot code path of schedule(): fewer than 50 hits across a system boot plus a suspend/resume cycle. So it is simply surrounded by a grq_lock()/grq_unlock() pair.
The second half of schedule() consists of the idle!=prev block, the fetch-next block and the context_switch block. To narrow the grq locking session, the spots that need read/write access to grq were identified, and then a way was found to refactor them; the code tells what was finally done. In brief, update_clocks() has been refactored into 2 calls, update_rq_clock() and update_grq_clock(); the latter only needs to be called when idle!=prev, and before check_deadline(). update_cpu_clock_switch() has been split into two versions, with _idle and _nonidle suffixes, each called in its own branch. The grq locking session ends before context_switch().
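Put together, the reworked second half is shaped roughly like the sketch below. This is a reading of the description above, not the patch itself; fetch_next_task() is a placeholder name and the signatures are guesses:

    /* Sketch of the narrowed grq session in the second half of schedule(). */
    static void sketch_schedule_second_half(struct rq *rq)
    {
            struct task_struct *prev = rq->curr, *next;

            /* rq->lock is already held at this point */
            update_rq_clock(rq);            /* per-rq half of old update_clocks() */

            grq_lock();                     /* grq session begins */
            if (prev != rq->idle) {
                    update_grq_clock();     /* only when idle != prev ... */
                    check_deadline(prev);   /* ... and before check_deadline() */
                    update_cpu_clock_switch_nonidle(rq, prev);
            } else {
                    update_cpu_clock_switch_idle(rq, prev);
            }
            next = fetch_next_task(rq);     /* placeholder: pick next out of grq */
            grq_unlock();                   /* session ends BEFORE the switch */

            context_switch(rq, prev, next); /* runs under rq->lock only */
    }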
All this refactoring should *NOT* change the original BFS functionality logic, *EXCEPT* that the prev task needs special care. prev used to be returned to grq before context_switch(), so once the grq lock is released before context_switch(), prev may be fetched by another rq/cpu and run there, while this rq may not actually be done with it yet. There are 2 options to solve this (a sketch of option 1 follows the list):
1. Dequeue the prev task from grq and hold it within the rq during context_switch(); once context_switch() is done, return prev to grq.
2. Hold the grq lock during context_switch() so prev stays safely in grq.
Currently, option 1 is used; the final choice will be made later based on further testing and optimization.
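A sketch of option 1, matching the "isolate prev task from grq" commit listed below. The rq field name is hypothetical, and exactly where prev is re-enqueued after the switch (here vs. the schedule tail) is an assumption:

    /* In the grq session: prev is NOT returned to grq yet. */
    rq->isolated_prev = prev;       /* hypothetical field: park prev in its rq */
    next = fetch_next_task(rq);     /* placeholder, as in the sketch above */
    grq_unlock();

    context_switch(rq, prev, next);

    /*
     * The switch is done, so prev can now safely run elsewhere.
     * Return it to grq (the real code may do this in the schedule
     * tail rather than literally here).
     */
    grq_lock();
    enqueue_task(rq->isolated_prev);
    rq->isolated_prev = NULL;
    grq_unlock();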
Scalability & Critical Spots
In VRQ, the run queue lock is a per-cpu design, so it scales as the cpu count goes up. The grq lock sessions are still the critical spots of the scheduling code.
Testing
a. Four jobs/cores ratio tests are designed to exercise the possible code paths in schedule(): 50%, 100% and 150% jobs/cores ratio tests, which simply adjust -j<num> according to the cpu core count of the tested machine; and a 100% + 50% test, which runs a 50% ratio mprime workload at SCHED_IDLEPRIO priority using schedtool, then runs a 100% ratio kernel compilation.
b. Kernel compilation throughput tests are done on an Intel Xeon L5420 (quad core).
c. Two kernels with different BFS code bases are used:
1. 3.16_Baseline: 3.16 kernel with BFS 0449 + my bfs patch set
2. 3.16_VRQ: 3.16 kernel with BFS 0449 + my bfs patch set + VRQ locking patch
Test steps:
1. Boot the target kernel into text mode.
2. Do a first-time compile of the kernel code (the first compile always takes longer; it is used to warm the file cache).
3. Compile the kernel code with -j<num> and record the real time used; change <num> according to the jobs/cores ratio under test and the machine in use.
4. Repeat step 3 ten times, exclude the max and min times recorded, and average the rest.
                50%          100%         150%         100%+50%
3.16_Baseline   5m32.978s    2m50.706s    2m56.419s    2m50.656s
3.16_VRQ        5m35.576s    2m50.432s    2m56.642s    2m50.731s
Conclusion
The first stage tests show that the performance gained is about equal to the overhead added by the extra locking. Two items are worth noting:
1. In the 50% jobs/cores ratio test, because not all cores are used, the performance gain from the refined grq locking is less than the additional locking overhead.
2. Even though the VRQ 50% test is worse than the Baseline, the 100% test matches the Baseline, which shows that VRQ gains performance by avoiding some lock contention, as it is designed to do.
Hopefully, in the coming stages of VRQ coding, the VRQ tests will show better results.
The Code
A new branch has been created for VRQ. The code should be considered highly experimental and should only be used for testing.
Branch: https://bitbucket.org/alfredchen/linux-gc/downloads/0450-enable-polling-check.patch
3.16_Baseline at commit: badf57e bfs: Fix runqueues decarelation.
3.16_VRQ at commit: ebe6b8e bfs: (5/6) Refine schedule() grq locking session, cnt.
VRQ related commits are:
ebe6b8e bfs: (5/6) Refine schedule() grq locking session, cnt.
d59b75c bfs: Make grq cacheline aligned.
e5de0ef bfs: (6/6) Implement task_vrq_lock...() grq_test_lock() logic.
307cc5e bfs: (5/6) Refine schedule() grq locking session.
e2f1be3 bfs: (4/6) use rq&grq double locking.
a61ee81 bfs: (3/6) isolate prev task from grq.
9aad936 bfs: (2/6) other code use grq&rq double locking.
482345f bfs: (1/6) schedule() use grq&rq double locking.
What next
1. It would be nice to have feedback from more testers, especially those running machines with multiple cores or cores with SMT, and owners of AMD cpus.
2. Sync up with 3.16 code and the new BFS release.
3. Identify what the next stage of code changes should focus on, and move on.