Hi,
When downloading headers form multiple groups Agent often locks up.
The program does the following:
-(Main). main thread creates a sub thread:
| 0009:Call kernel32.CreateThread(00000000,00002000,006648e0,41ac6cb0,00000004,41ac6cb0) ret=006649c9 | 0009: *fd* 23 <- 41 | 0009: new_thread( suspend=1, inherit=0, request_fd=23 ) | 0009: new_thread() = 0 { tid=0021, handle=0xcc } | 0021: *fd* 27 <- 42 | 0021: *fd* 29 <- 45 | 0021: init_thread( unix_pid=4966, unix_tid=-1, teb=0x40319000, entry=0x404fa110, reply_fd=27, wait_fd=29 ) | 0021: init_thread() = 0 { pid=0008, tid=0021, boot=0, version=126 } | 0021: select( flags=8, cookie=0x4031ac78, timeout=1068729666.690492 (-0.000050), handles={} ) | 0021: select() = PENDING | 0009:Ret kernel32.CreateThread() retval=000000cc ret=006649c9
-(SUB). Sub thread updates some files and finishes with sending a WM_NULL message to the main thrad:
| 0021:Call user32.PostMessageA(00010026,00000000,00000000,00000000) ret=0058d0d6 | 0021:Call kernel32.GetTickCount() ret=407924ca | 0021:Ret kernel32.GetTickCount() retval=000ab0db ret=407924ca | 0021: send_message( id=0009, type=6, flags=0, win=0x10026, msg=00000000, wparam=00000000, lparam=00000000, x=0, y=0, time=000ab0db, info=00000000, timeout=-1, callback=(nil), data={} ) | 0021: send_message() = 0 | 0021:Ret user32.PostMessageA() retval=00000001 ret=0058d0d6
-(SUB). Close the thread handle:
| 0021:Call kernel32.CloseHandle(000000cc) ret=006648d0 | 0021: close_handle( handle=0xcc ) | 0021: close_handle() = 0 { fd=-1 } | 0021:Ret kernel32.CloseHandle() retval=00000001 ret=006648d0
-(SUB). terminates:
| 0021:Call kernel32.ExitThread(00000000) ret=006648df | 0021: terminate_thread( handle=0xfffffffe, exit_code=0 ) | 0021: terminate_thread() = 0 { self=1, last=0 } [snip] | 0021:Ret PE DLL (proc=0x4079ad40,module=0x406f0000 L"user32.dll",reason=THREAD_DETACH,res=(nil)) retval=1 | 0021:Call PE DLL (proc=0x40cef0a0,module=0x40cb0000 L"x11drv.dll",reason=THREAD_DETACH,res=(nil)) | 0021:Ret PE DLL (proc=0x40cef0a0,module=0x40cb0000 L"x11drv.dll",reason=THREAD_DETACH,res=(nil)) retval=1 | 0021:Call PE DLL (proc=0x404c8980,module=0x40480000 L"kernel32.dll",reason=THREAD_DETACH,res=(nil)) | 0021:Ret PE DLL (proc=0x404c8980,module=0x40480000 L"kernel32.dll",reason=THREAD_DETACH,res=(nil)) retval=1 | 0021: *killed* exit_code=0
-(Main). In the meantime the main thread: - gets data from the server (winsok send and rcv) - writes to the file - processes queued messages until it is empty - then is wait on the handle of the subthread:
| 0009:Call kernel32.WaitForSingleObject(000000cc,ffffffff) ret=004ed693
-(Main). which ends immedeately:
| 0009: select( flags=4, cookie=0x406cf4e0, timeout=0, handles={0xcc} ) | 0009: select() = INVALID_HANDLE | 0009:Ret kernel32.WaitForSingleObject() retval=ffffffff ret=004ed693
-(Main). close the subthread handle.
| 0009:Call kernel32.CloseHandle(000000cc) ret=004ed69c | 0009: close_handle( handle=0xcc ) | 0009: close_handle() = INVALID_HANDLE { fd=0 } | 0009:Ret kernel32.CloseHandle() retval=00000000 ret=004ed69c
This may repeat a number of times but then without warning:
-(SUB). terminates as usual:
| 0023:Ret PE DLL (proc=0x4079ad40,module=0x406f0000 L"user32.dll",reason=THREAD_DETACH,res=(nil)) retval=1 | 0023:Call PE DLL (proc=0x40cef0a0,module=0x40cb0000 L"x11drv.dll",reason=THREAD_DETACH,res=(nil)) | 0023:Ret PE DLL (proc=0x40cef0a0,module=0x40cb0000 L"x11drv.dll",reason=THREAD_DETACH,res=(nil)) retval=1 | 0023:Call PE DLL (proc=0x404c8980,module=0x40480000 L"kernel32.dll",reason=THREAD_DETACH,res=(nil)) | 0023:Ret PE DLL (proc=0x404c8980,module=0x40480000 L"kernel32.dll",reason=THREAD_DETACH,res=(nil)) retval=1 | 0023: *killed* exit_code=0
-(Main). Finishes its message processing and the waits:
| 0009:Call kernel32.WaitForSingleObject(000000cc,ffffffff) ret=004ed693 | 0009: select( flags=4, cookie=0x406cf4e0, timeout=0, handles={0xcc} ) | 0009: select() = PENDING
And this wait never returns until explicitly killed:
| 0009: *killed* exit_code=0
I looked at many of these sequences. In all of them the subthread finished well before the main thread executed its Wait. I was looking for timing differences between the processing of the two threads, checking whether and when the WM_NULL message gets processed. Nothing seems to stand out.
Any suggestions how to debug this?
A large cut from the +relay,+server trace is on http://www.xs4all.nl/~rklazes/temp/agserver.log.bz2
Rein.
Rein Klazes rklazes@xs4all.nl writes:
I looked at many of these sequences. In all of them the subthread finished well before the main thread executed its Wait. I was looking for timing differences between the processing of the two threads, checking whether and when the WM_NULL message gets processed. Nothing seems to stand out.
Any suggestions how to debug this?
It's a bug in Agent, it is waiting on a handle that it has already closed. It works in most cases because the handle is simply invalid, but from time to time the handle has already been reused and the wait blocks. I guess if it never happens under Windows then it's probably because there's somewhere that we allocate a handle that Windows doesn't. You should look where the 0xcc handle got reallocated the last time.
On Sat, 15 Nov 2003 16:41:44 -0800, you wrote:
Rein Klazes rklazes@xs4all.nl writes:
I looked at many of these sequences. In all of them the subthread finished well before the main thread executed its Wait. I was looking for timing differences between the processing of the two threads, checking whether and when the WM_NULL message gets processed. Nothing seems to stand out.
Any suggestions how to debug this?
It's a bug in Agent, it is waiting on a handle that it has already closed. It works in most cases because the handle is simply invalid, but from time to time the handle has already been reused and the wait blocks.
Bugs in Agent are quite possible of course. I will send the makers bug reports but I start looking at these problems assuming at first they are Wine problems.
The maker seem to think that the handle must be closed both by sub-thread and main thread before it is removed. That is how I read the traces as well. So that is not true?
I guess if it never happens under Windows then it's probably because there's somewhere that we allocate a handle that Windows doesn't. You should look where the 0xcc handle got reallocated the last time.
It must be here :
| 0023:Call ntdll.RtlAllocateHeap(40340000,00000000,00000080) ret=4076cd9c | 0009:Call ntdll.RtlFreeHeap(40340000,00000000,403f0ca0) ret=40772536 | 0009: create_semaphore( initial=00000000, max=00000001, inherit=0, name=L"" ) | 0009: create_semaphore() = 0 { handle=0xcc } | trace:ntdll:NTDLL_wait_for_multiple_objects count 1 timeout 0x406cf928 | 0009: select( flags=12, cookie=0x406cf82c, timeout=1068729674.153190 (+4.999948), handles={0xcc} ) | 0009: select() = PENDING | 0023: release_semaphore( handle=0xcc, count=00000001 ) | 0009: *wakeup* signaled=0 cookie=0x406cf82c | 0023: release_semaphore() = 0 { prev_count=00000000 } | 0023:Ret ntdll.RtlAllocateHeap() retval=40400e78 ret=4076cd9c | 0023: get_window_children( parent=0x1002c, atom=0000, tid=0000 ) | 0023: get_window_children() = 0 { count=0, children={} } | 0023:Call ntdll.RtlFreeHeap(40340000,00000000,40400e78) ret=4076ce3b | trace:ntdll:NTDLL_wait_for_multiple_objects count 1 timeout 0x422ff0d0 | 0023: select( flags=12, cookie=0x422fefd4, timeout=1068729674.153927 (+4.999945), handles={0xcc} ) | 0023: select() = PENDING | 0009: release_semaphore( handle=0xcc, count=00000001 ) | 0023: *wakeup* signaled=0 cookie=0x422fefd4 | 0009: release_semaphore() = 0 { prev_count=00000000 } | 0009:Ret ntdll.RtlFreeHeap() retval=00000001 ret=40772536 [snip] | 0023:Ret ntdll.RtlFreeHeap() retval=00000001 ret=4076ce3b
Rein.
Rein Klazes rklazes@xs4all.nl writes:
The maker seem to think that the handle must be closed both by sub-thread and main thread before it is removed. That is how I read the traces as well. So that is not true?
No, handles are process wide, so closing a handle once closes it for the whole process.
It must be here :
| 0023:Call ntdll.RtlAllocateHeap(40340000,00000000,00000080) ret=4076cd9c | 0009:Call ntdll.RtlFreeHeap(40340000,00000000,403f0ca0) ret=40772536 | 0009: create_semaphore( initial=00000000, max=00000001, inherit=0, name=L"" ) | 0009: create_semaphore() = 0 { handle=0xcc }
OK that makes sense, it's because of the lazy init of the critical sections, which Windows doesn't do. I guess we'll need to remove that optimization then, unless we can convince the Agent folks to fix the bug.
On Sun, 16 Nov 2003 11:58:42 -0800, you wrote:
I guess we'll need to remove that optimization then, unless we can convince the Agent folks to fix the bug.
No problem, they are very convinced. It is fixed in the latest beta.
Rein.