Friday, September 25, 2015

Consider cache in task scheduling Part 2

In this part, let's look at the first factor of the task caching -- the cpu cache size. Talking about cache, there is a fighting between Intel and AMD about the cache size in their cpu design years ago. Intel tends to have large cpu cache size while AMD uses less. I remembered one of AMD's explain is software for gaming doesn't use large cache size. I'm kind of agree that.

IMO, cpu cache size, especially the LLC(Last Level Cache) size determined the hardware capacity of how many data can be cached for cpu. And look at this in another way, giving a sequence of tasks switching, the cpu cache size determined how long the data of a task can be kept in cache. For system with large workload, large number of tasks are running at the same time, cpu with larger cache size will help for keeping task's data in cache than cpu with less cache size. For system workload for short response time, like gaming, large cache size doesn't help much. So, AMD is right for this, but large cache size design is good for common workload usage, not just for the workload like gaming.

Task scheduling should take cpu cache size(llc size) into account. In the latest 4.2 vrq branch, there is a new commit implements the first version of code change, which aware of llc cache size of cpu and auto adjust the cache_scost_threshold(cache switch cost threshold) by it. (For the concept of cache switch cost threshold, please reference to Part 1). Here is the brief of what it has been done.

1. export a scheduler interface to cacheinfo(drivers/base/cacheinfo.c) as a call back once cacheinfo is ready.
2. implement a simple linear formula to auto cache_scost_threshold adjustment.
  • Now this formula is considered based on intel core2 cpu topology and 64bit kernel. Every 512KB LLC size increase CACHE_SCOST_THRESHOLD value by 3.
  • For 32bit kernel, consider 32bit use less data/instruction size, every 256KB LLC size increase CACHE_SCOST_THRESHOLD value by2, but I don't have 32bit systems to prove this yet.
  • The new cpu topology like smt is not yet be considered in the formula, because there are bounces in the benchmark result when SMT enable, so it's hard to compare the result of different CACHE_SCOST_THRESHOLD value.
3. A kernel boot parameter "cache_scost_threshold" is introduced, it can be used for manually assigning the CACHE_SCOST_THRESHOLD value to scheduler if
  • cacheinfo is not available in your arch(like arm, most?) or
  • the simple linear formula is not covering your cpu topology and you want to find out the best value of it.
It's still at its first version, in next version, I'd like to complete the formula to cover the SMT topology.

... to be continued (Part 3)

Thursday, September 24, 2015

VRQ branch for 4.2 release

Just want to post a notice that vrq branch for 4.2 is released, at both bitbucket and github, and it's version is tagged v4.2_0463_2_vrq1.

What's new:
1. Combine a few commits into VRQ solution and bump it's version
3edb4d4 bfs/vrq: VRQ solution v0.6
2. Rename rq->nr_interruptible to rq->nr_uninterruptible, to keep sync with mainline code
dcdc064 bfs/vrq: Instroduce rq->nr_uninterruptible 
3. Task caching scheduler part 2, there will be an individual post for it
8f5e7d2 bfs/vrq: cache scost threshold adjust by llc size 
 
Meanwhile, I'm trying to improve performance for SMT/using task caching property,  unfortunately not every try turns out a positive result, as usual. Anyway, have fun with this new 4.2 vrq release.

BR Alfred

Wednesday, September 16, 2015

gc_v4.2_0463_2 released

As title, just two new updates in this release

1. Remove resched_closest_idle() as planned in previous release post. I haven't got the feedback yet, but considering more reports are coming, both calls are removed at this release. The modified commit is

da50716 bfs: Full cpumask based and LLC sensitive cpu selection, v4

2. Fix update_cpu_load_nohz() redefine error when undefine CONFIG_NO_HZ_COMMON, this is a missing for the 4.2 sync up, the modified commit is

75fd9b3 bfs: [Sync] 4.2 sync up v2

New -gc branch can be found at bitbucket and github. An all-in-one patch includes all bfs related changes(*NOT* all commits in -gc branch) is also provided. 

Have fun with this new -gc release, and -vrq branch update is in-coming.

BR Alfred

Friday, September 11, 2015

Consider cache in task scheduling Part 1

Recently, in the past two release, I am working on the idea taking task caching states into account in scheduler code and implement the first part in later commits on 4.0 and  4.1 release. Now it's in 4.2 and I am working on the second part of code, so it's time to give it a brief introduction.

Till now, the cpu cache is still like a "back-hole" of hardware piece from the view of software it impacts computer  performance greatly, so it's still worthy to emulate task caching usage even it's not 100% accurate. Here is the factors that should be considered
1. Cache size of the cpu

  • Larger cache size gives larger hardware capability to cache a task. Talking about the cache size, we specially means the LLC(Last Level Cache) size, which is usual has the largest amount of cache size among all levels of cache and shared with multi cpu cores/smts.

2. How many tasks have been switched and their caching status and memory footprint after the given task has been switched off the cpu. 

  • More tasks have been switched, more likely the given task's caching data will be flushed away.
  • Switch to a task already been cached, will likely less impact to other task's cache data.
  • Switch to a task with more memory footprint will likely impact other task's cache data than the task with less memory footprint.
3. How long the given task has been switched off the cpu

And these factors are restrict each others. The commit
ed20056 bfs: vrq: [2/2] scost for task caching v0.7
focuses on the 2nd factor above and introduce switch cost value in each run queue to emulate the impact of task switching to the cache data on the cpu. The implement is in function  update_rq_switch().

And when a task switches off a run queue, it records run queue's current switch cost value into it's cache_scost of the task structure. This information combines the cpu which the task run on and the time being switched off(not consider yet) are the caching property of the task, and can be used to emulate the task is cache hot or cool late after.

The algorithm is very simple as just one factor is taken into account in this commit. Whether the task is cache hot or cool is compare the delta of scost value current the rq has and the one when task switched off to a given threshold. The given threshold is now a hard-coded value which optimized for my test-bed system. And the threshold will be adjusted when other factors are taken into account.

Currently, the task caching proprieties are just used to replace the sticky task implementation in original BFS. And for sure it will be used in further development. Talking about the sticky task replacement implementation using task caching, the codes are simpler now
1. run queue doesn't need to reference to it's sticky task, and multiple tasks from a run queue can be "sticky"(caching) at the same time.
2. run queue doesn't need to trace sticky task's sticky state, now the cached state of the task is "auto" updated during task selection(in the earliest_deadline_task() function).


Pros:

With this commit and optimize the cache scost threshold for my test-bed system, the sanity test shows better performance in all kind of workload.

Cons:
Depends on what the threshold is chosen, the task may be waiting for the cache hot cpu to pick it up in grq longer than original BFS "sticky task" design. This will impact task's response time and an improvement is planned in further development.

... to be continue (Part 2)

Thursday, September 3, 2015

4.2 Sync up completed for -gc branch

It's good to announce that my 4.2 bfs sync up work has been completed for -gc branch. Lots of sync up work in 4.2, considering the code diff between the -gc and -vrq, some of the sync-up code would be just done in -vrq branch.

Basically, there is no new code for -gc branch, just pick up

bfs: [Fix] Add additional checking in sched_submit_work()
from my latest post.

There are total three reports that there are issue caused in resched_closest_idle(). I am considering remove this function in -gc as there is a total replacement implement in -vrq branch, but I'm still waiting for the feedback to decide to remove both two calls or may be just one of them. So there will be an update once it's finalized. And as upstream asm code clean-up, some X86 cpumask api is no longer supported, but good news is we using more generic ones, this happens in 

bfs: Full cpumask based and LLC sensitive cpu selection, v3

At this point of time, some upstream patch, like BFQ is not yet updated, so the official -gc branch is not created yet, but there is a linux-4.2.y-bfs branch on bitbucket and github, which contains all bfs related commits in -gc branch for kernel 4.2. And there is an all-in-one patch file you can apply upon vanilla kernel tree easily.

Have fun with 4.2 and reports back if any related issue with this porting of bfs.

BR Alfred

Edit:
Missed one api changes for SMT, here is the update patch.