Monday, March 11, 2019

BMQ Scheduler call out for testing

BMQ (BitMap Queue) Scheduler is a brand new CPU scheduler, developed from PDS and inspired by the scheduler in Google's Zircon project. It has been in development for months and it's now time for open testing (the current version is 0.89, on Linux kernel 5.0).

For more design details of BMQ, please refer to Documentation/scheduler/sched-BMQ.txt in the repository. The documentation is not yet complete, because the scheduler is still under development and major features are not finalized.

Here is a list of the major user-visible differences between BMQ and PDS.

1. SCHED_ISO is *NOT* supported; please use "nice --20" instead.
2. There is *NO* rr_interval, but a compile-time kernel config, CONFIG_SCHED_TIMESLICE (default 4ms), is available for similar usage. However, it is *strongly NOT recommended* to change it.
3. "yield_type" is still supported, but only values 0 and 1 (default) are available; 2 is still accepted by the interface, but it behaves the same as 1. (This will change when the yield implementation is finalized.)
4. BATCH and IDLE tasks are treated as the same policy. They compete for CPU with NORMAL policy tasks, but they simply don't boost. To control the priority of NORMAL/BATCH/IDLE tasks, simply use nice levels.
5. BMQ automatically adjusts (boosts/deboosts) task priority within a +/- MAX_PRIORITY_ADJ (default 4) range. For example, in top/htop and other CPU monitoring programs, a task at nice level 0 may be seen running with a non-zero nice value in CPU time accounting.
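
As a rough sketch of the boost clamping in item 5 (illustrative Python only, not the actual kernel code; MAX_PRIORITY_ADJ mirrors the default named above, while the helper function is made up for illustration):

```python
# Illustrative sketch only; BMQ itself is kernel C code.
# MAX_PRIORITY_ADJ mirrors the default named in the post;
# adjusted_priority() is a hypothetical helper, not a real BMQ function.

MAX_PRIORITY_ADJ = 4

def adjusted_priority(nice: int, boost: int) -> int:
    """Clamp the dynamic boost to +/- MAX_PRIORITY_ADJ and apply it."""
    boost = max(-MAX_PRIORITY_ADJ, min(MAX_PRIORITY_ADJ, boost))
    return nice + boost

# A nice-0 task can therefore surface anywhere in the -4..+4 range
# in top/htop's CPU time accounting:
print(adjusted_priority(0, -7))  # clamped to -4
print(adjusted_priority(0, 2))   # stays at 2
```

This is only meant to show why monitors may display an unexpected nice value; the real accounting lives in the scheduler code itself.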

BMQ has been running smoothly on 3 machines (a NUC desktop, a NAS file server and a 24/7 Raspberry Pi) for about a month. Suspend/resume has been tested on the NUC desktop and the NAS file server. BMQ shows promising results in desktop activity and kernel compilation sanity tests compared to PDS. More benchmarking is ongoing.

BMQ is simpler in design compared to PDS, which results in a patch ~20KB smaller and a compressed kernel binary ~4KB smaller.

Full kernel tree repository can be found at https://gitlab.com/alfredchen/linux-bmq
An all-in-one patch can also be found on GitLab.

Thanks for testing; your feedback is welcome.

28 comments:

  1. Sounds nice. I will install it promptly and report back if I find any issues.

  2. @Alfred, very good news, will try it out as soon as I can.
    Thanks for new scheduler!
    BR, Eduardo

  3. Oh, I was trying to choose between PDS and MuQSS, now there's BMQ too.... I really need some benchmarks for daily desktop use!

    Replies
    1. Not sure if the plan is "...PDS too", because I imagine BMQ will take over from PDS, and the latest PDS (save for some bug fixes) is 0.99.

      Correct me if i'm wrong tho.

      PS. MuQSS is pretty stable and has (IMO) good performance on 5.0, but if anyone cares to do some comparison benchmarks between the three (PDS, BMQ, MuQSS), I would be very happy.

    2. @Sveinar
      Correct. BMQ will take over from PDS if no major issue is reported in the coming weeks.
      I am considering putting sanity test results along with the all-in-one patch files, so they can be traced for future reference.

  4. What is not clear to me is what makes this like the Zircon scheduler. The Zircon kernel is inherently so different from the Linux kernel. Zircon is preemptible and also supports preempting other cores. Linux by default does neither.

    Plus, Zircon is just so different from Linux in so many other ways that it seems strange a scheduler on one would be that valuable on the other. Linux I/O executes on the same core that requested it, for example, whereas every I/O on Zircon involves an IPC; you use shared memory and can therefore do a type of pipelining.

    I am super curious about the title and hoping it is not some clickbait thing. I would not have spent time on this if "Zircon" were not in the title. I am super excited about Zircon.

    Replies
    1. Hey bartturner,

      A copy paste doesn't make sense here.

    2. @Anonymous,
      Alfred wrote it rather clearly: "...inspired by the scheduler in zircon project(google)." Inspired, which is the keyword here, does NOT mean based on or closely following it.

      I'm sure Alfred can explain it nice and clear, but to me, words like "clickbait" (hey, Zircon is not even in the title!) are not exactly appropriate just for the source of ideas.
      If you are referring to the Phoronix title, please go and complain to Michael ;)

      And, to my knowledge btw, we are talking about CPU scheduler here, nothing more, no need to involve the rest of the kernel.
      BR, Eduardo

    3. @Eduardo is right. It's all about the CPU scheduler code here.
      The "BitMap queue" data structure used in the Zircon CPU scheduler code is very similar to the data structure that has been used in BFS. The sparkle in the Zircon CPU scheduler is the boost priority adjustment.
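
For readers curious what a bitmap queue is in general: an array of per-priority run queues plus a bitmap whose set bits mark the non-empty queues, so picking the next task reduces to a find-first-set-bit operation. A minimal generic sketch (not the actual BMQ, BFS, or Zircon code; all names here are made up):

```python
# Generic bitmap-queue sketch (illustrative; not actual scheduler code).
from collections import deque

NUM_PRIOS = 8  # hypothetical number of priority levels

class BitmapQueue:
    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIOS)]
        self.bitmap = 0  # bit i set => queues[i] is non-empty

    def enqueue(self, prio, task):
        self.queues[prio].append(task)
        self.bitmap |= 1 << prio

    def dequeue(self):
        """Pop a task from the highest-priority (lowest-index) non-empty queue."""
        if self.bitmap == 0:
            return None
        # Isolate the lowest set bit, then convert it to a queue index.
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        task = self.queues[prio].popleft()
        if not self.queues[prio]:
            self.bitmap &= ~(1 << prio)
        return task

q = BitmapQueue()
q.enqueue(3, "batch job")
q.enqueue(0, "interactive task")
print(q.dequeue())  # "interactive task" (priority 0 wins)
```

In a kernel, the find-first-set step would be a single hardware instruction, which is what makes task selection O(1) regardless of queue population.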

    4. Thanks Alfred! That helps a ton. I now get it. I do think the bitmap queue might make more sense with how Zircon is architected than it would with Linux.

      I have spent a decent amount of time learning the Zircon kernel and really like what I see. I do hope, with all the chip hiring going on at Google, they will do a SoC optimized for Zircon.

      There are pretty obvious design decisions you would make for Zircon that are very different from those you would make for Linux.

      Hopefully that is what Google will do and will be able to optimize all the way up the stack from silicon to Flutter.

  5. Builds/boots fine here, thanks. I'll deliver this to a wider audience via -pf.

    Replies
    1. @pf
      Thanks for that. BMQ is based on the PDS code base; though there is still tuning to be done, it is stable for open testing. :)

  6. Running on 3 hosts so far, no problems. Will try the laptop when the i915 problems are fixed (black screen on the laptop display and no X, but that's 5.0-related, not BMQ).

  7. I've been running PDS for a few weeks and now I've switched to BMQ, and it seems to behave no worse than PDS. I didn't do any benchmarks, but interactivity seems as fine as it was. I've been using a yield_type of 0, but maybe I'll switch it to 1 to see if it changes anything.

  8. @Alfred:
    I was really curious about your new CPU scheduler approach.
    For ~24h now, BMQ has been running very well on my old dual-core notebook (without HT) in my daily use (web browser, CAD, LibreOffice, video playback). I haven't seen any drawback with 5.0.1 BMQ vs. my last PDS on kernel 4.20.12. IMHO, though not benchmarked, it lowers CPU usage by a little bit.
    So your initial tuning values seem to be wisely chosen.

    Many thanks for your great work!

    BR, Manuel

  9. Compile fails for x86-UP (modified to show arrow pointing to "sched_rq_pending_mask"):
    ================================
    kernel/sched/bmq.c: In function ‘dequeue_task’:
    kernel/sched/bmq.c:609:34: error: ‘sched_rq_pending_mask’ undeclared (first use in this function); did you mean ‘sched_rq_watermark’?
    cpumask_clear_cpu(cpu_of(rq), &sched_rq_pending_mask);
    ~~~~~~~~~~~~~~~~~~~~^
    sched_rq_watermark
    kernel/sched/bmq.c:609:34: note: each undeclared identifier is reported only once for each function it appears in
    kernel/sched/bmq.c: In function ‘enqueue_task’:
    kernel/sched/bmq.c:634:32: error: ‘sched_rq_pending_mask’ undeclared (first use in this function); did you mean ‘sched_rq_watermark’?
    cpumask_set_cpu(cpu_of(rq), &sched_rq_pending_mask);
    ~~~~~~~~~~~~~~~~~~~^
    sched_rq_watermark
    make[2]: *** [scripts/Makefile.build:277: kernel/sched/bmq.o] Error 1
    make[1]: *** [scripts/Makefile.build:492: kernel/sched] Error 2

    Replies
    1. @jwh7
      Thanks for reporting. Please also report it at https://gitlab.com/alfredchen/bmq in case I forget. Currently, most of my time is reserved for regression investigation during the tests.

    2. Done; looks like I'm #1 :-P
      https://gitlab.com/alfredchen/bmq/issues/1
      Thanks Alfred!

  10. Had a strange bug today. It happens when I try to emerge anything: inet hangs, emerge hangs.
    [ 92.095994] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 50s!
    [ 92.096000] Showing busy workqueues and worker pools:
    [ 92.096000] workqueue events: flags=0x0
    [ 92.096001] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
    [ 92.096003] pending: destroy_super_work, gen6_pm_rps_work
    [ 92.096011] workqueue mm_percpu_wq: flags=0x8
    [ 92.096012] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
    [ 92.096013] pending: vmstat_update
    [ 92.096023] workqueue netns: flags=0xe000a
    [ 92.096023] pwq 16: cpus=0-7 flags=0x4 nice=0 active=1/1
    [ 92.096025] in-flight: 122:cleanup_net
    [ 92.096037] pool 16: cpus=0-7 flags=0x4 nice=0 hung=0s workers=9 idle: 121 123 120 60 7 107 125 124

    Replies
    1. [ 119.616051] Not tainted 5.0.0-pf3_1+ #6
      [ 119.616052] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 119.616052] uksmd D 0 127 2 0x80000000
      [ 119.616053] Call Trace:
      [ 119.616055] ? __schedule+0x4b9/0xf70
      [ 119.616056] ? schedule+0x2a/0xa0
      [ 119.616057] ? schedule_timeout+0x18e/0x270
      [ 119.616059] ? wait_for_common+0x132/0x160
      [ 119.616060] ? wake_up_process+0x10/0x10
      [ 119.616061] ? __flush_work+0xf8/0x190
      [ 119.616063] ? flush_workqueue_prep_pwqs+0x130/0x130
      [ 119.616065] ? lru_add_drain+0x30/0x30
      [ 119.616066] ? lru_add_drain_all+0x112/0x150
      [ 119.616069] ? uksm_do_scan+0x1d59/0x2b50
      [ 119.616071] ? uksm_do_scan+0x2b50/0x2b50
      [ 119.616072] ? uksm_scan_thread+0x113/0x150
      [ 119.616074] ? __kthread_parkme+0x47/0x60
      [ 119.616075] ? kthread+0x107/0x120
      [ 119.616076] ? kthread_create_on_node+0x40/0x40
      [ 119.616078] ? ret_from_fork+0x1f/0x30
      [ 119.616084] INFO: task sandbox:1825 blocked for more than 5 seconds.
      [ 119.616085] Not tainted 5.0.0-pf3_1+ #6
      [ 119.616085] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 119.616085] sandbox D 0 1825 1813 0x80000002
      [ 119.616087] Call Trace:
      [ 119.616088] ? __schedule+0x4b9/0xf70
      [ 119.616089] ? schedule+0x2a/0xa0
      [ 119.616090] ? schedule_timeout+0x18e/0x270
      [ 119.616092] ? release_pages+0x28b/0x2c0
      [ 119.616093] ? wait_for_common+0x132/0x160
      [ 119.616094] ? wake_up_process+0x10/0x10
      [ 119.616095] ? __wait_rcu_gp+0xfb/0x130
      [ 119.616097] ? synchronize_rcu+0x4d/0x60
      [ 119.616098] ? kfree_call_rcu+0x10/0x10
      [ 119.616099] ? rcu_panic+0x10/0x10
      [ 119.616101] ? kern_unmount+0x22/0x50
      [ 119.616103] ? put_ipc_ns+0x32/0x70
      [ 119.616104] ? free_nsproxy+0x34/0xa0
      [ 119.616107] ? do_exit+0x2cc/0xa80
      [ 119.616108] ? do_group_exit+0x2e/0xa0
      [ 119.616110] ? __x64_sys_exit_group+0xf/0x10
      [ 119.616111] ? do_syscall_64+0x39/0xe0
      [ 119.616113] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

    2. [ 849.869200] rcu: INFO: rcu_preempt self-detected stall on CPU
      [ 849.869204] rcu: 4-....: (17999 ticks this GP) idle=e7e/1/0x4000000000000002 softirq=13908/13908 fqs=5955
      [ 849.869204] rcu: (t=18001 jiffies g=28961 q=57021)
      [ 849.869206] NMI backtrace for cpu 4
      [ 849.869208] CPU: 4 PID: 5242 Comm: emerge Not tainted 5.0.0-pf3_1+ #6
      [ 849.869209] Hardware name: LENOVO qqqq /qqqq , BIOS 8BET56WW (1.36 ) 01/19/2012
      [ 849.869209] Call Trace:
      [ 849.869211]
      [ 849.869215] ? dump_stack+0x46/0x60
      [ 849.869216] ? nmi_cpu_backtrace.cold.0+0x13/0x50
      [ 849.869219] ? lapic_can_unplug_cpu.cold.5+0x42/0x42
      [ 849.869220] ? nmi_trigger_cpumask_backtrace+0xa8/0xb1
      [ 849.869222] ? rcu_dump_cpu_stacks+0x80/0xac
      [ 849.869223] ? rcu_check_callbacks.cold.44+0x199/0x448
      [ 849.869225] ? update_process_times+0x23/0x60
      [ 849.869227] ? tick_sched_timer+0x36/0x70
      [ 849.869228] ? tick_sched_handle.isra.6+0x50/0x50
      [ 849.869236] ? __hrtimer_run_queues+0xee/0x190
      [ 849.869237] ? hrtimer_interrupt+0xef/0x200
      [ 849.869239] ? smp_apic_timer_interrupt+0x48/0x80
      [ 849.869241] ? apic_timer_interrupt+0xf/0x20
      [ 849.869241]

    3. Maybe you want to re-test with current vanilla kernel + BMQ?
      I also needed some time to fix -pf2 for kernel 5.0.1, regarding BFQ hassles.
      But I haven't seen any of those messages with my mentioned combination.

      BR, Manuel

    4. @Anonymous
      Please report the bug at https://gitlab.com/alfredchen/bmq
      The blog is not a good place for log display and focused issue discussion. :)

  11. Thanks Alfred !
    I've done throughput benchmarks of PDS and BMQ.
    You can find them here:
    https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit#gid=1309629120

    BMQ is very promising and already on par with PDS.

    BMQ and PDS are configured with NO_HZ_FULL and HZ=1000.
    What is the recommended configuration for BMQ?

    Pedro

    Replies
    1. @Pedro:
      Many thanks for your benchmarks!!!
      Can it be that in the posted sheet some coloring went wrong?
      I mean, e.g., regarding "make j16"? From the values BMQ performs better there than PDS. Can you please have a look?

      BR,
      Manuel

    2. @Manuel
      I've seen this one. I believe it's the first time it happens, but it's not an error.
      It's because of the way the means are compared.

      I’ll explain here how I apply the colors, because I’m not sure I’ve got the statistic thing right.
      If someone sees an error, please let me know.

      I use a Welch t-test to assess whether two results are significantly different.
      The Welch t-test gives the probability (p-value) that we are wrong when we assume that two means are the same. I chose to compare this p-value against a 0.05 threshold (the type I error rate in statistical terms).
      So in the t-test table of the sheet, when the p-value of a benchmark result is less than 0.05, it means we have less than a 5% chance of being wrong when we assume the means are the same, and green or red colors are applied. If the p-value is more than 0.05, we have more than a 5% chance of being wrong, so I consider the results not significantly different, and blue is applied.
      I've written that blue means within a 95% confidence interval, but I'm unsure about this. I'll read some more on the subject.

      In the make j16 test, PDS's result is assumed equal to CFS's, whereas BMQ's is considered different, and as it's the worst of the 4 results assumed different, it's in red.
      I admit it's surprising. I think it's because the BMQ stdev is quite small on make j16, and this "biases" the t-test between results that are close (0.4% and 0.2% of CFS).
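
The test Pedro describes can be sketched as follows (illustrative Python with made-up sample times, not the actual benchmark data; only the t statistic and Welch-Satterthwaite degrees of freedom are computed here, since turning them into a p-value needs the t distribution's CDF, which the standard library lacks):

```python
# Welch t-test sketch (illustrative; sample numbers below are invented,
# not taken from the actual benchmark spreadsheet).
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n-1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the difference
    t = (mean(a) - mean(b)) / se2 ** 0.5
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical kernel-compile times (seconds) under two schedulers:
cfs = [100.1, 99.8, 100.3, 100.0, 99.9]
bmq = [99.5, 99.4, 99.6, 99.5, 99.4]
t, df = welch_t(cfs, bmq)
print(t, df)
# Feed |t| and df into the t distribution to get the p-value; per the
# coloring rule, p < 0.05 gets green/red, otherwise blue (not significant).
```

A small sample stdev shrinks the standard error, which inflates |t|; that matches Pedro's remark about why BMQ's tight make j16 results come out "significant" even when the means are close.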

      Pedro

    3. @Manuel
      Thanks for the tests :)

      What you COULD also do is run some arbitrary benchmarks WHILE compiling.

      Set up some compile job in a loop (or a huge source tree), so that you can benchmark something while doing, e.g., a -j2, -j4 and so on compile in the background, and see the comparison.

      The differences between schedulers might be a lot more visible when doing something that really requires a scheduler. As I have posted elsewhere, I don't know the situation with the current 5.0 kernel (with CFS and cgroups++), but older kernels made things pretty much unusable while doing a -j12 (for my i7) WHILE gaming or doing other stuff. This is where PDS/MuQSS/BMQ would shine, I think.

  12. Alfred, can you suggest the optimal settings for:
    - Hz: periodic, no_hz, no_hz_full
    - Preemption: server, desktop, preempt
    - Tick frequency: 100, 250, 300, 1000
    Thank you!
