Discussion:
LiS 2-16 : OOPS in lis_safe_SAMESTR()
phil
2005-11-23 09:51:55 UTC
Permalink
Hello Gentlemen,

I analyzed a oops that occured on an "old" version of LiS : 2-16 in a
process which is closing a streams pipe. I've attached the oops and the
corresponding section of code hereafter.
The process is multithreaded and we've perheaps reached a race
condition where two threads are dealing with the closing of the same
pipe ...

I found similar errors reported on Linux archive and I understand that
such race conditions have been fixed with upper release of LiS. (i.e
2-18).
In particular, in the below mail there's a reference to a patch for
problem 3) "synchronization problem between pipes." And I think that it
could be the root cause of my problem.
http://www.mail-archive.com/linux-***@gsyc.escet.urjc.es/msg01855.html

My question is : beyond the fact that LiS 2-18 will certainly fixed
this issue, I was wondering if someone could point me to the specific
modification that has fixed this specific issue (perheaps the patch
mentioned in the email from Jeff)

Best regards,
-Philippe

-----------------------------------------------------------------------------------------------------------------------------

== Section of code where the oops occured
-----------------------------------------------------------------------------------------------------------------------------


int lis_safe_SAMESTR(queue_t *q, char *f, int l)
{
if ( lis_check_q_magic(q,f,l)
&& q->q_next != NULL
&& lis_check_q_magic(q->q_next,f,l)
)
return ((q->q_flag&QREADR) == (*q->q_next*->q_flag&QREADR));
^^^^^^^^^
/ksymoops
showed that q->q_next is null
in spite of
the condition test in if statement. /

return 0;
}


-----------------------------------------------------------------------------------------------------------------------------

== The OOPS :
-----------------------------------------------------------------------------------------------------------------------------

Unable to handle kernel NULL pointer dereference at virtual address
0000001c
printing eip:
f8b332b9
*pde = 32ba1001
*pte = 58e29067
Oops: 0000
netconsole nfs lockd sunrpc streams swrmm parport_pc lp parport autofs4
audit e1000 tg3 ipv6 floppy sg microcode keybdev mousedev hid input
usb-ohci usbcore e
CPU: 3
EIP: 0060:[<f8b332b9>] Tainted: P
EFLAGS: 00010002

EIP is at lis_safe_SAMESTR [streams] 0x41 (2.4.21-20.ELsmp/i686)
eax: 00000000 ebx: cb2852ac ecx: f8bd4ad4 edx: 00000020
esi: 0000068a edi: f8bd4ad4 ebp: d6f61ed8 esp: d6f61ebc
ds: 0068 es: 0068 ss: 0068
Process oracle (pid: 1131, stackpage=d6f61000)
Stack: cb2852ac 00000000 cb2852ac f8b32c80 cb2852ac f8bd4ad4 0000068a
00000202
d6f61f00 f4f272b4 00003a98 f4f2753c f8b291d1 cb2852ac f4f272b4
f8c0c960
00000286 00000282 f4f2753c d6f61f30 00000000 f4f272b4 f8b29639
f4f272b4
Call Trace: [<f8b32c80>] lis_qcountstrm [streams] 0x50 (0xd6f61ec8)
[<f8bd4ad4>] .rodata.str1.4 [streams] 0x4584 (0xd6f61ed0)
[<f8b291d1>] close_action [streams] 0x1c9 (0xd6f61eec)
[<f8c0c960>] lis_stdata_sem [streams] 0x0 (0xd6f61ef8)
[<f8b29639>] lis_doclose [streams] 0x2c9 (0xd6f61f14)
[<f8b29850>] lis_strclose [streams] 0x178 (0xd6f61f44)
[<c016472a>] __fput [kernel] 0xea (0xd6f61f78)
[<c01628ae>] filp_close [kernel] 0x8e (0xd6f61f94)
[<c0162956>] sys_close [kernel] 0x66 (0xd6f61fb0)

Code: 8b 40 1c 83 e0 10 83 e2 10 39 c2 0f 94 c0 0f b6 c0 eb d3 55
-----------------------------------------------------------------------------------------------------------------------------
Brian F. G. Bidulock
2005-11-23 10:43:04 UTC
Permalink
Phil,

Well, no it was not just an SMP problem. The problem exists on UP under
LiS-2.18.0 as well. There I put two lines in LiS-2.18.1 that made the
problem go away on UP at least. Just

if (q == 0)
return (0);

at the top of lis_await_qsched() in LiS/head/head.c This correction was
made to LiS-2.18.1 in July 2005.

The problem is a lack of proper plumbing locks in LiS. The situation is
aggravated if there are different processes at either end of the pipe (the
normal case) closing their ends at about the same time. It seems on 2.18
that queue service procedure run after the queue pairs have been severed.
For LiS-2.16 it is also a race.

For LiS-2.16 and before this fix to 2.18, I had to put this in all module
and driver put procedure and service procedures that used pipes: (I use
pipes extensively for testing modules and drivers, so that meant about all
of them.)

int
xxx_put(queue_t *q, mblk_t *mp)
{
#if defined LIS
if (q->q_next == NULL || OTHERQ(q)->q_next == NULL) {
cmn_err(CE_WARN, "%s: %s: LiS pipe bug: called with NULL q->q_next pointer.",
MOD_NAME, __FUNCTION__);
freemsg(mp);
return (0);
}
#endif
...
}

int
xxx_srv(queue_t *q)
{
#if defined LIS
if (q->q_next == NULL || OTHERQ(q)->q_next == NULL) {
cmn_err(CE_WARN, "%s: %s: LiS pipe bug: called with NULL q->q_next pointer.",
MOD_NAME, __FUNCTION__);
return (0);
}
#endif
...
}

And in close procedures that try to pass messages.

LiS is full of races. Be amazed if it runs on SMP at all.

Linux Fast-STREAMS does not have this problem because it has proper
plumbing locks and no put or service procedure can run while or after
it has severed the queue pair. (Thus the #ifdef's above.)

Any particular reason why you're still using 2.16?

I'm just wrapping documentation for release:
Would you like to try using Linux Fast-STREAMS?

--brian
Post by phil
Hello Gentlemen,
I analyzed a oops that occured on an "old" version of LiS : 2-16 in a
process which is closing a streams pipe. I've attached the oops and
the corresponding section of code hereafter.
The process is multithreaded and we've perheaps reached a race
condition where two threads are dealing with the closing of the same
pipe ...
I found similar errors reported on Linux archive and I understand that
such race conditions have been fixed with upper release of LiS. (i.e
2-18).
In particular, in the below mail there's a reference to a patch for
problem 3) "synchronization problem between pipes." And I think that
it could be the root cause of my problem.
55.html
My question is : beyond the fact that LiS 2-18 will certainly fixed
this issue, I was wondering if someone could point me to the specific
modification that has fixed this specific issue (perheaps the patch
mentioned in the email from Jeff)
Best regards,
-Philippe
----------------------------------------------------------------------
-------------------------------------------------------
== Section of code where the oops occured
----------------------------------------------------------------------
-------------------------------------------------------
int lis_safe_SAMESTR(queue_t *q, char *f, int l)
{
if ( lis_check_q_magic(q,f,l)
&& q->q_next != NULL
&& lis_check_q_magic(q->q_next,f,l)
)
return ((q->q_flag&QREADR) == (q->q_next->q_flag&QREADR));
^^^^^^^^^
ksymoops
showed that q->q_next is null
in spite
of the condition test in if statement.
return 0;
}
----------------------------------------------------------------------
-------------------------------------------------------
----------------------------------------------------------------------
-------------------------------------------------------
Unable to handle kernel NULL pointer dereference at virtual address
0000001c
f8b332b9
*pde = 32ba1001
*pte = 58e29067
Oops: 0000
netconsole nfs lockd sunrpc streams swrmm parport_pc lp parport
autofs4 audit e1000 tg3 ipv6 floppy sg microcode keybdev mousedev hid
input usb-ohci usbcore e
CPU: 3
EIP: 0060:[<f8b332b9>] Tainted: P
EFLAGS: 00010002
EIP is at lis_safe_SAMESTR [streams] 0x41 (2.4.21-20.ELsmp/i686)
eax: 00000000 ebx: cb2852ac ecx: f8bd4ad4 edx: 00000020
esi: 0000068a edi: f8bd4ad4 ebp: d6f61ed8 esp: d6f61ebc
ds: 0068 es: 0068 ss: 0068
Process oracle (pid: 1131, stackpage=d6f61000)
Stack: cb2852ac 00000000 cb2852ac f8b32c80 cb2852ac f8bd4ad4 0000068a
00000202
d6f61f00 f4f272b4 00003a98 f4f2753c f8b291d1 cb2852ac f4f272b4
f8c0c960
00000286 00000282 f4f2753c d6f61f30 00000000 f4f272b4 f8b29639
f4f272b4
Call Trace: [<f8b32c80>] lis_qcountstrm [streams] 0x50 (0xd6f61ec8)
[<f8bd4ad4>] .rodata.str1.4 [streams] 0x4584 (0xd6f61ed0)
[<f8b291d1>] close_action [streams] 0x1c9 (0xd6f61eec)
[<f8c0c960>] lis_stdata_sem [streams] 0x0 (0xd6f61ef8)
[<f8b29639>] lis_doclose [streams] 0x2c9 (0xd6f61f14)
[<f8b29850>] lis_strclose [streams] 0x178 (0xd6f61f44)
[<c016472a>] __fput [kernel] 0xea (0xd6f61f78)
[<c01628ae>] filp_close [kernel] 0x8e (0xd6f61f94)
[<c0162956>] sys_close [kernel] 0x66 (0xd6f61fb0)
Code: 8b 40 1c 83 e0 10 83 e2 10 39 c2 0f 94 c0 0f b6 c0 eb d3 55
----------------------------------------------------------------------
-------------------------------------------------------
References
--
Brian F. G. Bidulock ¦ The reasonable man adapts himself to the ¦
***@openss7.org ¦ world; the unreasonable one persists in ¦
http://www.openss7.org/ ¦ trying to adapt the world to himself. ¦
¦ Therefore all progress depends on the ¦
¦ unreasonable man. -- George Bernard Shaw ¦
phil
2005-11-28 11:43:09 UTC
Permalink
Hello Brian,

Thanks a lot for your explanations.

Concerning your questions :
We used LiS 2-16 for historical reason, in fact for an old project based
on this version of LiS.
I don't know Linux Fast-STREAMS,and I'm afraid that, within the scope
of this project, we can't move from LiS 2-16, however for future
project, I would be very interested if you have a link to the
documentation for Linux Fast Streams. (I couldn't find it on openss7.org).

Best regards,
-Philippe
Post by Brian F. G. Bidulock
Phil,
Well, no it was not just an SMP problem. The problem exists on UP under
LiS-2.18.0 as well. There I put two lines in LiS-2.18.1 that made the
problem go away on UP at least. Just
if (q == 0)
return (0);
at the top of lis_await_qsched() in LiS/head/head.c This correction was
made to LiS-2.18.1 in July 2005.
The problem is a lack of proper plumbing locks in LiS. The situation is
aggravated if there are different processes at either end of the pipe (the
normal case) closing their ends at about the same time. It seems on 2.18
that queue service procedure run after the queue pairs have been severed.
For LiS-2.16 it is also a race.
For LiS-2.16 and before this fix to 2.18, I had to put this in all module
and driver put procedure and service procedures that used pipes: (I use
pipes extensively for testing modules and drivers, so that meant about all
of them.)
int
xxx_put(queue_t *q, mblk_t *mp)
{
#if defined LIS
if (q->q_next == NULL || OTHERQ(q)->q_next == NULL) {
cmn_err(CE_WARN, "%s: %s: LiS pipe bug: called with NULL q->q_next pointer.",
MOD_NAME, __FUNCTION__);
freemsg(mp);
return (0);
}
#endif
...
}
int
xxx_srv(queue_t *q)
{
#if defined LIS
if (q->q_next == NULL || OTHERQ(q)->q_next == NULL) {
cmn_err(CE_WARN, "%s: %s: LiS pipe bug: called with NULL q->q_next pointer.",
MOD_NAME, __FUNCTION__);
return (0);
}
#endif
...
}
And in close procedures that try to pass messages.
LiS is full of races. Be amazed if it runs on SMP at all.
Linux Fast-STREAMS does not have this problem because it has proper
plumbing locks and no put or service procedure can run while or after
it has severed the queue pair. (Thus the #ifdef's above.)
Any particular reason why you're still using 2.16?
Would you like to try using Linux Fast-STREAMS?
--brian
Post by phil
Hello Gentlemen,
I analyzed a oops that occured on an "old" version of LiS : 2-16 in a
process which is closing a streams pipe. I've attached the oops and
the corresponding section of code hereafter.
The process is multithreaded and we've perheaps reached a race
condition where two threads are dealing with the closing of the same
pipe ...
I found similar errors reported on Linux archive and I understand that
such race conditions have been fixed with upper release of LiS. (i.e
2-18).
In particular, in the below mail there's a reference to a patch for
problem 3) "synchronization problem between pipes." And I think that
it could be the root cause of my problem.
55.html
My question is : beyond the fact that LiS 2-18 will certainly fixed
this issue, I was wondering if someone could point me to the specific
modification that has fixed this specific issue (perheaps the patch
mentioned in the email from Jeff)
Best regards,
-Philippe
----------------------------------------------------------------------
-------------------------------------------------------
== Section of code where the oops occured
----------------------------------------------------------------------
-------------------------------------------------------
int lis_safe_SAMESTR(queue_t *q, char *f, int l)
{
if ( lis_check_q_magic(q,f,l)
&& q->q_next != NULL
&& lis_check_q_magic(q->q_next,f,l)
)
return ((q->q_flag&QREADR) == (q->q_next->q_flag&QREADR));
^^^^^^^^^
ksymoops
showed that q->q_next is null
in spite
of the condition test in if statement.
return 0;
}
----------------------------------------------------------------------
-------------------------------------------------------
----------------------------------------------------------------------
-------------------------------------------------------
Unable to handle kernel NULL pointer dereference at virtual address
0000001c
f8b332b9
*pde = 32ba1001
*pte = 58e29067
Oops: 0000
netconsole nfs lockd sunrpc streams swrmm parport_pc lp parport
autofs4 audit e1000 tg3 ipv6 floppy sg microcode keybdev mousedev hid
input usb-ohci usbcore e
CPU: 3
EIP: 0060:[<f8b332b9>] Tainted: P
EFLAGS: 00010002
EIP is at lis_safe_SAMESTR [streams] 0x41 (2.4.21-20.ELsmp/i686)
eax: 00000000 ebx: cb2852ac ecx: f8bd4ad4 edx: 00000020
esi: 0000068a edi: f8bd4ad4 ebp: d6f61ed8 esp: d6f61ebc
ds: 0068 es: 0068 ss: 0068
Process oracle (pid: 1131, stackpage=d6f61000)
Stack: cb2852ac 00000000 cb2852ac f8b32c80 cb2852ac f8bd4ad4 0000068a
00000202
d6f61f00 f4f272b4 00003a98 f4f2753c f8b291d1 cb2852ac f4f272b4
f8c0c960
00000286 00000282 f4f2753c d6f61f30 00000000 f4f272b4 f8b29639
f4f272b4
Call Trace: [<f8b32c80>] lis_qcountstrm [streams] 0x50 (0xd6f61ec8)
[<f8bd4ad4>] .rodata.str1.4 [streams] 0x4584 (0xd6f61ed0)
[<f8b291d1>] close_action [streams] 0x1c9 (0xd6f61eec)
[<f8c0c960>] lis_stdata_sem [streams] 0x0 (0xd6f61ef8)
[<f8b29639>] lis_doclose [streams] 0x2c9 (0xd6f61f14)
[<f8b29850>] lis_strclose [streams] 0x178 (0xd6f61f44)
[<c016472a>] __fput [kernel] 0xea (0xd6f61f78)
[<c01628ae>] filp_close [kernel] 0x8e (0xd6f61f94)
[<c0162956>] sys_close [kernel] 0x66 (0xd6f61fb0)
Code: 8b 40 1c 83 e0 10 83 e2 10 39 c2 0f 94 c0 0f b6 c0 eb d3 55
----------------------------------------------------------------------
-------------------------------------------------------
References
Loading...