LiS-2.18.0 : hang on close and i_unlink

e***@netscape.net

2005-11-16 00:29:45 UTC

Hi,

Has anyone seen hangs with a stack trace like this:

#0 [f5cb0c5c] schedule at c02cf361
#1 [f5cb0cbc] __down_interruptible at c02cea62
#2 [f5cb0cf4] __down_failed_interruptible at c02ceace
#3 [f5cb0d00] .text.lock.KBUILD_BASENAME (via lis_down_fcn) at f956be61
#4 [f5cb0d20] lis_down_nosig_fcn at f956bcb3
#5 [f5cb0d48] lis_await_qsched at f9544dcb
#6 [f5cb0e1c] lis_qdetach at f954c33f
#7 [f5cb0eac] lis_dismantle at f954d074
#8 [f5cb0ec0] lis_doclose at f9554fb1
#9 [f5cb0f68] lis_strclose at f955920a
#10 [f5cb0f9c] __fput at c015a8fc
#11 [f5cb0fb0] filp_close at c0159540
#12 [f5cb0fc0] system_call at c02d10c8

or like this:

#0 [f465dbd4] schedule at c02cf361
#1 [f465dc34] __down_interruptible at c02cea62
#2 [f465dc6c] __down_failed_interruptible at c02ceace
#3 [f465dc78] .text.lock.KBUILD_BASENAME (via lis_down_fcn) at f956be61
#4 [f465dc98] lis_down_nosig_fcn at f956bcb3
#5 [f465dcc0] lis_await_qsched at f9544dcb
#6 [f465dd94] lis_i_unlink at f9556049
#7 [f465de14] lis_strioctl at f9556fcb
#8 [f465dfa8] sys_write at c0159eef
#9 [f465dfc0] system_call at c02d10c8

The piece of code in 2.16.18 that evolved into the lis_await_qsched()
function in 2.18.0 used to make lis_spin_lock_irqsave(&lis_qhead_lock, &psw) calls.

The lis_await_qsched() in 2.18.0 no longer makes these lis_spin_lock_irqsave(&lis_qhead_lock, &psw)
calls. I suspect that may be the cause of the hangs shown above.
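
To make the suspicion concrete, here is a minimal user-space sketch of why a missing lock around a "check, then sleep" sequence can hang. It is not the LiS code: struct fake_queue, await_unscheduled() and service_done() are hypothetical names, and a pthread mutex stands in for lis_qhead_lock. In this version the check and the waiter registration happen under one lock, so the wakeup cannot be missed. If the lock were dropped, the scheduler side could clear the flag and find no waiter just before the closer registers and sleeps, and the closer would then sleep forever.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>

/* Hypothetical stand-in for a queue; this is not LiS's queue_t. */
struct fake_queue {
    pthread_mutex_t lock;  /* plays the role of lis_qhead_lock */
    sem_t wakeup;          /* the closing thread sleeps on this */
    bool scheduled;        /* queue is still waiting to be serviced */
    bool waiting;          /* a closer has asked to be woken */
};

static int fake_queue_init(struct fake_queue *q)
{
    if (pthread_mutex_init(&q->lock, NULL) != 0)
        return -1;
    if (sem_init(&q->wakeup, 0, 0) != 0)
        return -1;
    q->scheduled = true;
    q->waiting = false;
    return 0;
}

/* Closer side: block until the queue is no longer scheduled. */
static void await_unscheduled(struct fake_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->scheduled) {
        q->waiting = true;              /* registered under the lock */
        pthread_mutex_unlock(&q->lock);
        sem_wait(&q->wakeup);           /* sleep until service_done() posts */
        pthread_mutex_lock(&q->lock);
    }
    pthread_mutex_unlock(&q->lock);
}

/* Scheduler side: the service procedure has finished with the queue. */
static void service_done(struct fake_queue *q)
{
    bool wake;

    pthread_mutex_lock(&q->lock);
    q->scheduled = false;
    wake = q->waiting;                  /* sees any waiter registered so far */
    q->waiting = false;
    pthread_mutex_unlock(&q->lock);
    if (wake)
        sem_post(&q->wakeup);
}

static void *runner(void *arg)
{
    service_done(arg);
    return NULL;
}

/* Returns 0 if the closer was released, -1 on setup failure. */
static int demo_close_wait(void)
{
    struct fake_queue q;
    pthread_t t;

    if (fake_queue_init(&q) != 0)
        return -1;
    if (pthread_create(&t, NULL, runner, &q) != 0)
        return -1;
    await_unscheduled(&q);
    pthread_join(t, NULL);
    return 0;
}
```

Calling demo_close_wait() from a main() should return 0 whichever thread wins the race. Whether the 2.18.0 hang really follows this pattern would need checking against the actual lis_await_qsched() source.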

Any comments?
--
Eugene

Brian F. G. Bidulock

2005-11-16 01:52:01 UTC

Eugene,

I don't know if this will help, but I have run into a similar hang with
LiS when doing performance tests against Linux Fast-STREAMS. Open a
stream (or a pipe) and push "pipemod" on the stream 50 or so times, and
somewhere along the way it will hang (push_mod() also calls
lis_await_qsched()). In other cases, with 20 or so modules pushed, I can
transfer data on the pipe, but when the pipe is closed, LiS hangs while
popping the modules (pop_mod() also calls lis_await_qsched()). I run
into the same problem on UP kernels as well as SMP kernels running on
UP. Same problem for both 2.4 and 2.6 kernels. The faster the kernel
(later 2.6 kernels), the fewer modules need to be pushed to cause the
problem.

If the module pushing problem is the same you might be able to narrow it
down. BTW, Linux Fast-STREAMS does not have this problem. I am
wrapping a public release of Linux Fast-STREAMS this week: you might
consider giving it a whirl.

--brian

--
Brian F. G. Bidulock
***@openss7.org
http://www.openss7.org/

The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all progress
depends on the unreasonable man. -- George Bernard Shaw

e***@netscape.net

2005-11-18 01:05:24 UTC

Hello,

Another unusual thing I observed while testing 2.18.0 on a 2.6.9 kernel
was the following messages in the syslog:

..kernel: lis_down(sem=f4529108 head/head.c 3350) uninitialized semaphore

This one comes from lis_strrsrv(q), the stream head read service procedure:

err = lis_lockq(q) ;

..kernel: lis_down(sem=f5a36508 head/stream.c 263) uninitialized semaphore

This one is in queuerun():

if (lis_lockq(q) < 0)

These are usually seen during application startup, when it opens a lot of stream devices.

It looks like the service functions for a particular queue may kick in before the queue semaphore
is completely initialized.
This is intermittent, i.e. it does not happen every time, but it happens pretty often.

Has anybody else observed the same messages in syslog?
Is it a known problem?
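
As a rough illustration of the ordering problem suspected here, the user-space sketch below wraps a POSIX semaphore with an "initialized" marker, similar in spirit to the check that produces the syslog message. All names (struct checked_sem, checked_down(), queue_open(), QSEM_MAGIC) are hypothetical and not taken from LiS. The point is only that a queue must have its semaphore initialized before anything that can run a service procedure is allowed to see it.

```c
#include <semaphore.h>
#include <stdio.h>
#include <string.h>

#define QSEM_MAGIC 0x51534d31u   /* arbitrary marker value, hypothetical */

struct checked_sem {
    unsigned magic;              /* QSEM_MAGIC once checked_sem_init() ran */
    sem_t sem;
};

/* Hypothetical stand-in for a queue; this is not LiS's queue_t. */
struct fake_queue {
    struct checked_sem lock;
    int enabled;                 /* nonzero once the scheduler may run it */
};

static int checked_sem_init(struct checked_sem *s)
{
    if (sem_init(&s->sem, 0, 1) != 0)
        return -1;
    s->magic = QSEM_MAGIC;       /* mark as usable only after sem_init() */
    return 0;
}

/* Refuse to sleep on a semaphore that was never initialized. */
static int checked_down(struct checked_sem *s, const char *file, int line)
{
    if (s->magic != QSEM_MAGIC) {
        fprintf(stderr, "checked_down(sem=%p %s %d) uninitialized semaphore\n",
                (void *)s, file, line);
        return -1;
    }
    return sem_wait(&s->sem);
}

static void checked_up(struct checked_sem *s)
{
    sem_post(&s->sem);
}

/* Analogous to a service procedure that locks its queue before working. */
static int service_queue(struct fake_queue *q)
{
    if (checked_down(&q->lock, __FILE__, __LINE__) < 0)
        return -1;
    /* ... message processing would go here ... */
    checked_up(&q->lock);
    return 0;
}

/* Safe ordering: initialize the semaphore first, enable the queue last. */
static int queue_open(struct fake_queue *q)
{
    memset(q, 0, sizeof *q);
    if (checked_sem_init(&q->lock) < 0)
        return -1;
    q->enabled = 1;
    return 0;
}
```

With a zeroed struct fake_queue, service_queue() fails and logs the warning; after queue_open() it succeeds. In a real multi-threaded or kernel setting the "enable" step would also need a proper barrier or lock so that other CPUs see the initialized semaphore first, which this single-threaded sketch does not show.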

[Brian]: I am wrapping a public release of Linux Fast-STREAMS this week: you might
consider giving it a whirl.

Brian, please keep us posted on that one.
Time permitting I'd certainly give it a try.

--
Eugene

Brian F. G. Bidulock

2005-11-18 03:38:50 UTC

eugenelisstreams,

Yah, I saw that one too. However, if I clean the build directory,
recompile from scratch, and reinstall, it seems to go away. You could try
that. If that doesn't cure the problem, try:

http://www.openss7.org/LiS-2.18.1.tar.gz

and see if it goes away. If you're running on EL4, there are a lot of
quirks that I got rid of with that release. RedHat always heavily
customizes their kernels, so making decisions based on KERNEL_VERSION is
a bad idea with their kernels. They are always applying patches from
some number of versions ahead. Their 2.6.9 kernels seem to be
closer to kernel.org 2.6.11 than kernel.org 2.6.9, plus some custom
things that haven't made it into a kernel.org kernel yet.

Many things in the OpenSS7 release above, and everything in Linux
Fast-STREAMS, use autoconf macros to interrogate the actual kernel
sources to determine how a given feature is actually implemented in the
kernel, regardless of the supposed KERNEL_VERSION. As a result, the
OpenSS7 packaging adapts far better to a wider range of custom
production kernels.
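
A small sketch of the difference, compilable in user space: KERNEL_VERSION below uses the encoding from the 2.6-era <linux/version.h>, LINUX_VERSION_CODE is set by hand to mimic a vendor kernel that reports 2.6.9, and HAVE_HYPOTHETICAL_BACKPORT is an invented name standing in for whatever symbol a configure-time probe of the kernel sources would define. None of these names are taken from the OpenSS7 build system.

```c
#include <stdio.h>

/* Encoding used by the 2.6-era <linux/version.h>. */
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Mimics a vendor kernel that reports 2.6.9 but carries later patches. */
#define LINUX_VERSION_CODE KERNEL_VERSION(2, 6, 9)

/* Hypothetical result of probing the kernel sources at configure time. */
#define HAVE_HYPOTHETICAL_BACKPORT 1

/* Version-based decision: assumes the feature arrived in 2.6.11. */
static int feature_by_version(void)
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 11)
    return 1;
#else
    return 0;   /* wrong for a 2.6.9 kernel with the feature backported */
#endif
}

/* Probe-based decision: trusts what was actually found in the sources. */
static int feature_by_probe(void)
{
#ifdef HAVE_HYPOTHETICAL_BACKPORT
    return 1;
#else
    return 0;
#endif
}

static void report(void)
{
    printf("by version: %d, by probe: %d\n",
           feature_by_version(), feature_by_probe());
}
```

On this pretend kernel the version test reports the feature as absent while the probe reports it as present, which is the kind of mismatch described above for customized vendor kernels.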

Post by e***@netscape.net

[Brian]: I am wrapping a public release of Linux Fast-STREAMS
this week: you might consider giving it a whirl.

Brian, please keep us posted on that one.
Time permitting I'd certainly give it a try.

I'll post an announcement to this list when it is released.

--brian