Thor master does not end after job abort

Description

0004850D 2019-09-11 05:03:08.670 881164 881164 "ThorMaster terminated OK"
0004850F 2019-09-11 05:03:09.671 881164 881164 "Stopping watchdog"
00048510 2019-09-11 05:03:09.671 881164 881164 "Stopped watchdog"
00048513 2019-09-11 05:03:29.685 881164 881164 "Thor closing down 5"
00048514 2019-09-11 05:03:29.685 881164 881164 "Thor closing down 4"
00048515 2019-09-11 05:03:29.685 881164 881164 "Thor closing down 3"
00048516 2019-09-11 05:03:29.687 881164 881164 "Thor closing down 2"
... stays there forever ...

TID 881164:
#0 0x00007f0280526adb do_futex_wait.constprop.1
#1 0x00007f0280526b6f __new_sem_wait_slow.constprop.0
#2 0x00007f0280526c0b sem_wait@@GLIBC_2.2.5
#3 0x00007f028366bb0d Semaphore::wait(unsigned int)
#4 0x00007f02836999c6 Thread::join(unsigned int)
#5 0x00007f02858b424f closeSubscriptionManager()
#6 0x00007f0285811348 closedownClientProcess()

TID 881756:
#0 0x00007f02805274ed __lll_lock_wait
#1 0x00007f0280522de6 _L_lock_941
#2 0x00007f0280522cdf pthread_mutex_lock
#3 0x00007f0285ffb609 CJobQueue::csubs::notify(long long, char const*, SDSNotifyFlags, unsigned int, void const*)
#4 0x00007f02858231ed CSDSSubscriberProxy::notify(MemoryBuffer&)
#5 0x00007f02858b4916 CDaliPublisherClient::processMessage(CMessageBuffer&)
#6 0x00007f028580f8c4 CMessageHandler<CDaliPublisherClient>::main(CMessageBuffer&)

Conclusion

None

Activity

Mark Kelly September 21, 2019 at 2:08 PM
Edited

CDaliPublisherClient::run() is ending, which suggests stopped is set, but it appears this happens after a non-cancel message has already started down the handleMessage() -> processMessage() -> notify() -> crit.enter() path.
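
To illustrate the suspected window, here is a minimal stand-alone sketch with hypothetical names (std::mutex standing in for the jlib critical section, not the real HPCC classes): the shutdown path observes the stop flag and joins the pool while a pooled handler is still blocked trying to enter the critical section.

#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

std::mutex crit;                    // stands in for CJobQueueBase::crit
std::atomic<bool> stopped{false};   // stands in for the publisher's stop flag

void pooledHandler()                // role of CMessageHandler::main -> notify()
{
    std::lock_guard<std::mutex> block(crit);  // blocks: crit is held elsewhere
    // ... notify() body would run here ...
}

int main()
{
    std::unique_lock<std::mutex> holder(crit); // some abort-path code holds crit
    std::thread worker(pooledHandler);         // non-cancel message dispatched
    stopped = true;                            // run() observes stopped and returns
    // In the real hang, crit is never released, so joining the pooled thread
    // blocks forever (the Thread::join frame in TID 881164/881754 above).
    // Release it after a delay here only so the demo terminates.
    std::this_thread::sleep_for(std::chrono::seconds(1));
    holder.unlock();
    worker.join();
}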

Mark Kelly September 20, 2019 at 8:06 PM

The lock that CJobQueue::csubs::notify() is trying to get is also used by:
CJobQueue::cancelWaitStatsChange()
CJobQueue::cancelAcceptConversation()
CJobQueue::Cconnlockblock() - and this is called from many member functions (see the sketch below).
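
A hypothetical stand-in (not HPCC code) for that sharing pattern: one mutex guards all of these members plus the subscription callback, so whichever of them holds it during abort transitively blocks the pooled notify thread, and with it the pool join above.

#include <mutex>

class JobQueueLike                  // hypothetical stand-in for CJobQueue
{
    std::mutex crit;                // one lock shared by everything below
public:
    void cancelWaitStatsChange()    { std::lock_guard<std::mutex> b(crit); /* ... */ }
    void cancelAcceptConversation() { std::lock_guard<std::mutex> b(crit); /* ... */ }
    void connLockedOperation()      { std::lock_guard<std::mutex> b(crit); /* ... */ }
    void onNotify()                 // role of csubs::notify(), TID 881756's frame
    {
        std::lock_guard<std::mutex> b(crit);  // waits on whoever holds crit above
        /* ... */
    }
};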

Mark Kelly September 20, 2019 at 1:11 PM

I think we can check for stopped while waiting for the lock in processMessage() -> notify(). If a non-cancel message arrives at just the wrong time, it can deadlock like this. I'll look for where CJobQueueBase::crit is held. A rough sketch of the idea is below.
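
A minimal sketch of that idea, using std::timed_mutex as a stand-in for the jlib critical section (hypothetical names, not the actual HPCC API): poll the lock in slices and bail out once stopped is set, so a late non-cancel message can't pin the thread pool during shutdown.

#include <atomic>
#include <chrono>
#include <mutex>

std::timed_mutex crit;              // stands in for CJobQueueBase::crit
std::atomic<bool> stopped{false};

bool notifyWithStopCheck()
{
    using namespace std::chrono_literals;
    while (!crit.try_lock_for(100ms))   // wait in slices instead of forever
    {
        if (stopped)                    // shutdown began while we waited
            return false;               // drop the notification; let the pool exit
    }
    std::lock_guard<std::timed_mutex> block(crit, std::adopt_lock);
    // ... original notify() body ...
    return true;
}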

PID 881164 - process
TID 881164:
#0 0x00007f0280526adb do_futex_wait.constprop.1
#1 0x00007f0280526b6f __new_sem_wait_slow.constprop.0
#2 0x00007f0280526c0b sem_wait@@GLIBC_2.2.5
#3 0x00007f028366bb0d Semaphore::wait(unsigned int)
#4 0x00007f02836999c6 Thread::join(unsigned int)
#5 0x00007f02858b424f closeSubscriptionManager()
#6 0x00007f0285811348 closedownClientProcess()
#7 0x000000000040db9a main
#8 0x00007f028016e495 __libc_start_main
#9 0x000000000040e3f0 _start
TID 881754:
#0 0x00007f0280526adb do_futex_wait.constprop.1
#1 0x00007f0280526b6f __new_sem_wait_slow.constprop.0
#2 0x00007f0280526c0b sem_wait@@GLIBC_2.2.5
#3 0x00007f028366bb0d Semaphore::wait(unsigned int)
#4 0x00007f02836a17d6 CThreadPool::~CThreadPool()
#5 0x00007f0283613a3c CInterface::Release() const
#6 0x00007f02858b688c CDaliPublisherClient::run()
#7 0x00007f028369989c Thread::begin()
#8 0x00007f028369b27e Thread::_threadmain(void*)
#9 0x00007f0280520dd5 start_thread
#10 0x00007f028024a02d __clone
TID 881756:
#0 0x00007f02805274ed __lll_lock_wait
#1 0x00007f0280522de6 _L_lock_941
#2 0x00007f0280522cdf pthread_mutex_lock
#3 0x00007f0285ffb609 CJobQueue::csubs::notify(long long, char const*, SDSNotifyFlags, unsigned int, void const*)
#4 0x00007f02858231ed CSDSSubscriberProxy::notify(MemoryBuffer&)
#5 0x00007f02858b4916 CDaliPublisherClient::processMessage(CMessageBuffer&)
#6 0x00007f028580f8c4 CMessageHandler<CDaliPublisherClient>::main(CMessageBuffer&)
#7 0x00007f02836a06a8 CPooledThreadWrapper::run()
#8 0x00007f028369989c Thread::begin()
#9 0x00007f028369b27e Thread::_threadmain(void*)
#10 0x00007f0280520dd5 start_thread
#11 0x00007f028024a02d __clone

Jacob Cobbett-Smith September 19, 2019 at 10:15 AM

Do you have the other thread stacks?

Fixed

Details

Created September 17, 2019 at 2:38 PM
Updated October 1, 2019 at 1:01 PM
Resolved October 1, 2019 at 1:01 PM