Comments on Alfred Chen's Blog: "A big commit added to 4.1 VRQ"

---

Anonymous — 2015-09-01 12:32:
Aaaahhh. O.k. Forget my posting. I've just read the switch "Load more" at the very bottom of the page.

---

Manuel (Anonymous) — 2015-09-01 12:28:
Mmmh. I've written a comment to the long thread above last night, but can't see it. And the comment count increased by 1 then (and by 2 by now), yet I still can't see the reply. Strange interface.

BR Manuel

---

kernelOfTruth — 2015-09-01 11:48:
@Manuel:

Not sure where your post went. Yes, that change "fixed" it for me.

@Alfred:

To calm your mind: the lockup I experienced during the ZFS send (twice) is not scheduler related. Well, it appears to be to some extent, but the focus lies on other system parts (RCU, IRQs, hardware, drivers, etc.).

So it's not caused by BFS or your BFS changes :)

Thanks!

---

Manuel Krause (Anonymous) — 2015-08-31 13:01:
@kernelOfTruth:
Although Alfred has already called this thread off-topic... some new off-topic comment ;-)

I'm also using the threadirqs kernel command line option and have not seen direct(!) negative effects. This refers to my postings, especially regarding my tests from August 24th onward. Those involved a USB 2.0 stick(?) drive (FAT-formatted for compatibility reasons; friends^^).

Have you been able to finish the transfer successfully without the "threadirqs" option?
(The lkml thread is somewhat old. Do you think it's still relevant for the issue? Honest question.)

BTW, I'm still searching for "something" (a driver, setting, or patch, e.g.) responsible for TuxOnIce being unreliable sometimes. What I've seen is that reliability got much better with a) kernel 4.1 up to 4.1.6, b) Alfred's -gc enhancements, and equal or better with c) the -vrq patches' addons. The -vrq patched kernel fails really rarely, but when it did fail (once in about one week with ~21 hibernations), the TuxOnIce image was gone.

Best regards,
Manuel Krause

---

kernelOfTruth — 2015-08-29 12:25:
@Alfred:

Most likely related to threadirqs (as I expected): I got another hardlock during an attempt to transfer ZFS snapshots (around 400 GiB out of 2 TiB, so I have to start over again XD).

This time without threadirqs.

Related thread: https://lkml.org/lkml/2013/12/31/144 ["3.13 <= rc6. Using USB 2.0 devices is breaking the system when using 'threadirqs' kernel option"]

---

Alfred Chen — 2015-08-27 19:28:
I've just written a new post about the issue. In short, no new patches for testing; the last one seems good.
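---

[Editor's note] Since the threadirqs option keeps coming up in this thread, here is a small sketch (not from the original comments) of how one can check whether forced IRQ threading is actually in effect on a running Linux system. It only reads standard procfs paths; with threadirqs active, hardirq handlers run in kernel threads named irq/<nr>-<name>, like the irq/23-ehci_hcd thread that appears in the stall trace quoted in this thread.

```shell
# Sketch: check whether forced IRQ threading ("threadirqs") is in effect.
# Reads /proc/cmdline and lists irq/* kernel threads; purely informational.
if grep -qw threadirqs /proc/cmdline 2>/dev/null; then
    echo "threadirqs: enabled on the kernel command line"
else
    echo "threadirqs: not set"
fi
# List any threaded-IRQ kernel threads (the list may be empty when
# threadirqs is off and no driver requests a threaded handler).
ps -eo comm= 2>/dev/null | grep '^irq/' || true
```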
---

Oleksandr Natalenko — 2015-08-27 01:23:
@Manuel, take it easy :).

@Alfred, are more patches to test coming?

---

Alfred Chen — 2015-08-26 19:35:
I'm still investigating the unplugged_io patch and trying to improve it. As for the kernel's new ZFS trace: I believe the rcu preempt checking most likely happens at schedule time, so it's hard to tell whether it's a scheduler issue.

For the next test patch, I currently think preempt should be disabled for the additional checking, but that may impact performance, so I need a benchmark to see how it goes. I'll start a new post once it is done. This one is growing long and off-topic :)

---

Manuel (Anonymous) — 2015-08-26 12:41:
Maybe I also misused the word "bad". I just see the other side of the coin, too: even "bad" news, the kind regarding failures, is "good" news, as it leads to fixes, sooner or later, for our beloved Linux operating system.

Best regards,
Manuel

---

Manuel (Anonymous) — 2015-08-26 12:27:
@post-factum:
Sorry, you've definitely got me wrong. I meant: as long as we don't get lockup messages from your side, everything seems good for the testing you've done so far. Longer, but more precise.

I didn't intend to say that you only bring bad news. I really appreciate your work and testing time and would never want to be impolite to you.

Best regards,
Manuel

---

Oleksandr Natalenko — 2015-08-26 11:41:
> no bad news from post-factum is good news? Isn't it?

Oh, jerk off with that :/. As if I bring bad news only.

Anyway, the second patch still works OK for me.

---

Oleksandr Natalenko — 2015-08-26 11:40:
This comment has been removed by the author.

---

Manuel (Anonymous) — 2015-08-26 10:52:
I think no bad news from post-factum is good news? Isn't it?

What about the new patch you mentioned on August 25, 2015 at 8:20 AM? Or are you still investigating whether kernelOfTruth's traces may be scheduler related or not?

BR Manuel

---

Alfred Chen — 2015-08-25 19:19:
@kernelOfTruth
Most likely not. But I'm sure it's not the unplugged_io issue we are tracing.
---

kernelOfTruth — 2015-08-25 15:38:
Just had a hardlock during a ZFS snapshot send:

Aug 26 00:29:13 morpheus kernel: [69082.418467] INFO: rcu_preempt detected stalls on CPUs/tasks:
Aug 26 00:29:13 morpheus kernel: [69082.418477]  4: (0 ticks this GP) idle=9f9/140000000000000/0 softirq=3923228/3923228 fqs=12328 last_accelerate: f53f/85c8, nonlazy_posted: 0, L.
Aug 26 00:29:13 morpheus kernel: [69082.418481]  5: (1 GPs behind) idle=8c7/140000000000001/0 softirq=2298621/2298622 fqs=12328 last_accelerate: f53f/85c8, nonlazy_posted: 0, L.
Aug 26 00:29:13 morpheus kernel: [69082.418482]  (detected by 3, t=37002 jiffies, g=1688364, c=1688363, q=13497)
Aug 26 00:29:13 morpheus kernel: [69082.418485] Task dump for CPU 4:
Aug 26 00:29:13 morpheus kernel: [69082.418486] irq/23-ehci_hcd R running task 0 353 2 0x00000008
Aug 26 00:29:13 morpheus kernel: [69082.418488]  ffffffff81e796ae ffffffff81e7b192 0000000000000003 ffff8807f9850000
Aug 26 00:29:13 morpheus kernel: [69082.418490]  ffff8800cf1a0000 ffff8800cf19fd68 ffff8807f4b2cf00 ffff8807f4e40800
Aug 26 00:29:13 morpheus kernel: [69082.418492]  ffff8807f4e40800 ffff8800cf1a0000 ffffffff8114d640 ffff8800cf19fd88
Aug 26 00:29:13 morpheus kernel: [69082.418494] Call Trace:
Aug 26 00:29:13 morpheus kernel: [69082.418508]  [] ? __schedule+0x11ae/0x2c60
Aug 26 00:29:13 morpheus kernel: [69082.418510]  [] ? schedule+0x32/0xc0
Aug 26 00:29:13 morpheus kernel: [69082.418513]  [] ? irq_thread_fn+0x40/0x40
Aug 26 00:29:13 morpheus kernel: [69082.418516]  [] ? usb_hcd_irq+0x21/0x40
Aug 26 00:29:13 morpheus kernel: [69082.418517]  [] ? irq_forced_thread_fn+0x2e/0x70
Aug 26 00:29:13 morpheus kernel: [69082.418519]  [] ? irq_thread+0x13f/0x170
Aug 26 00:29:13 morpheus kernel: [69082.418520]  [] ? wake_threads_waitq+0x30/0x30
Aug 26 00:29:13 morpheus kernel: [69082.418521]  [] ? irq_thread_dtor+0xb0/0xb0
Aug 26 00:29:13 morpheus kernel: [69082.418524]  [] ? kthread+0xf2/0x110
Aug 26 00:29:13 morpheus kernel: [69082.418528]  [] ? sched_clock+0x9/0x10
Aug 26 00:29:13 morpheus kernel: [69082.418530]  [] ? kthread_create_on_node+0x2f0/0x2f0
Aug 26 00:29:13 morpheus kernel: [69082.418532]  [] ? ret_from_fork+0x42/0x70
Aug 26 00:29:13 morpheus kernel: [69082.418533]  [] ? kthread_create_on_node+0x2f0/0x2f0
Aug 26 00:29:13 morpheus kernel: [69082.418534] Task dump for CPU 5:
Aug 26 00:29:13 morpheus kernel: [69082.418535] irq/33-xhci_hcd R running task 0 840 2 0x00000008
Aug 26 00:29:13 morpheus kernel: [69082.418537]  0000000000000003 ffff88066ef1eb80 ffff8800be358000 00000000f9852300
Aug 26 00:29:13 morpheus kernel: [69082.418539]  00000000296b0ad0 ffff8807f5593d68 ffff8807f550d100 ffff8807f51c5a00
Aug 26 00:29:13 morpheus kernel: [69082.418541]  ffff8807f51c5a00 ffff8807f50d4600 ffffffff8114d640 ffff8807f5593d88
Aug 26 00:29:13 morpheus kernel: [69082.418543] Call Trace:
Aug 26 00:29:13 morpheus kernel: [69082.418544]  [] ? irq_thread_fn+0x40/0x40
Aug 26 00:29:13 morpheus kernel: [69082.418557]  [] ? xhci_msi_irq+0xc/0x10 [xhci_hcd]
Aug 26 00:29:13 morpheus kernel: [69082.418558]  [] ? irq_forced_thread_fn+0x2e/0x70
Aug 26 00:29:13 morpheus kernel: [69082.418559]  [] ? irq_thread+0x13f/0x170
Aug 26 00:29:13 morpheus kernel: [69082.418561]  [] ? wake_threads_waitq+0x30/0x30
Aug 26 00:29:13 morpheus kernel: [69082.418562]  [] ? irq_thread_dtor+0xb0/0xb0
Aug 26 00:29:13 morpheus kernel: [69082.418563]  [] ? kthread+0xf2/0x110
Aug 26 00:29:13 morpheus kernel: [69082.418565]  [] ? sched_clock+0x9/0x10
Aug 26 00:29:13 morpheus kernel: [69082.418567]  [] ? kthread_create_on_node+0x2f0/0x2f0
Aug 26 00:29:13 morpheus kernel: [69082.418568]  [] ? ret_from_fork+0x42/0x70
Aug 26 00:29:13 morpheus kernel: [69082.418570]  [] ? kthread_create_on_node+0x2f0/0x2f0
Aug 26 00:32:17 morpheus kernel: [    0.000000] Initializing cgroup subsys cpuset

Looks like it's most likely not related to the scheduler, no?

---

Alfred Chen — 2015-08-25 08:20:
Thanks to all of you for testing. While waiting for pf's final confirmation, I'd like to prepare another patch for testing.

BR Alfred
---

Oleksandr Natalenko — 2015-08-25 06:58:
Also:

===
pf@defiant:~ » uptime
 16:57:31 up 5:43, 1 user, load average: 3.51, 1.92, 1.17
pf@defiant:~ » sudo btrfs scrub status /
scrub status for 14140a7f-23bc-4dab-b263-f2f46f5d70aa
        scrub started at Tue Aug 25 16:55:10 2015 and finished after 00:02:15
        total bytes scrubbed: 76.83GiB with 0 errors
===

Still works OK, but the uptime is too small; I need more time.

---

kernelOfTruth — 2015-08-25 05:59:
Stupid blogger interface? Where did my post go?

@Alfred:

Great news!

It survived the first 2 minutes and finished without hardlocks (5-6 hours).

Once there are enough changes to the system, I'll attempt another stage4 backup and see whether that hardlocks the system, but I doubt it will :)

Awesome work!

---

Oleksandr Natalenko — 2015-08-25 01:05:
Compiling and testing sched_submit_work_02.patch, stay tuned.

---

Manuel (Anonymous) — 2015-08-24 15:48:
I don't see/feel any negative subjective effects with -vrq and the new patch. Uptime ~9h.

BR Manuel
---

kernelOfTruth — 2015-08-24 15:34:
Will test, perhaps at the weekend or earlier.

The lockups would mostly occur with Btrfs. I haven't used ext4 for a long time, so I'm not sure if there are still quirks with it.

Fingers crossed that this fixes it =)

---

Manuel (Anonymous) — 2015-08-24 15:25:
@kernelOfTruth & @post-factum:

Now it seems to be up to you to prove that the new
https://bitbucket.org/alfredchen/linux-gc/downloads/sched_submit_work_02.patch
works for you even on a btrfs scrub. I'm running it on the -vrq branch, btw.

Thank you all for your participation,

Manuel

---

Manuel (Anonymous) — 2015-08-24 15:19:
No one could ever count on crossposting. But especially on here? ;-)

You've seen that I've done some compressing of ext4 partitions' content without issues. It was only about 1.2 GiB.

Thank you for your added info.

Manuel

---

Manuel Krause (Anonymous) — 2015-08-24 15:10:
Need to add: all involved partitions are ext4. ^^ *MK

---

kernelOfTruth — 2015-08-24 15:07:
Yes, it's another possible trigger scenario.

Not concurrently, but yes, separately; there were also certain rsync jobs running, but that doesn't seem to apply here.

Sure, here is the exclude list:

/mnt/*
/boot/*
/tmp/*
/proc/*
/home/*
/sys/*
/usb/*
/var/cache/edb/dep/*
/var/cache/squid/*
/var/tmp/*
/media/*
/usr/portage/*
/usr/gentoo/*

There were issues with the restored system when including /dev/* in that list, so I deliberately left it out.

Also, I have a separate backup command for /boot, but that doesn't matter for this purpose; the point is simply to cause a high i/o, CPU and scheduler load.

Yes, mmt equals the number of cores. AFAIK it should do it automatically (?), but I remember having had issues in the past without it (less throughput).

It's rooted in Gentoo's stages and backup procedures:

http://badpenguins.com/gentoo-build-test/
http://www.gentoo-wiki.info/HOWTO_Custom_Stage4
https://wiki.gentoo.org/wiki/Handbook:AMD64/Installation/Media#What_are_stages_then.3F

A stage4 in this case would be a fully installed and configured system :)
Stage3 is where you usually start when following the Gentoo handbook.