Checking dafilesrv on 10.144.88.98 I saw: 0001E3D1 2015-10-08 09:04:10.078 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool" 0001E3D2 2015-10-08 09:04:10.078 11865 11867 "CRemoteFileServerPool: 6601079 threads started in last 258845 seconds, total delay = 60007.96 milliseconds, average delay = 0.01 milliseconds, currently running = 100" 0001E3D3 2015-10-08 09:05:10.079 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool" 0001E3D4 2015-10-08 09:05:10.079 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.94 milliseconds, average delay = 60007.94 milliseconds, currently running = 101" 0001E3D5 2015-10-08 09:06:10.079 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool" 0001E3D6 2015-10-08 09:06:10.079 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.93 milliseconds, average delay = 60007.93 milliseconds, currently running = 102" 0001E3D7 2015-10-08 09:07:10.080 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool" 0001E3D8 2015-10-08 09:07:10.080 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.93 milliseconds, average delay = 60007.93 milliseconds, currently running = 103" 0001E3D9 2015-10-08 09:08:10.080 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool" 0001E3DA 2015-10-08 09:08:10.080 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.93 milliseconds, average delay = 60007.93 milliseconds, currently running = 104" 0001E3DB 2015-10-08 09:09:10.081 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool" 0001E3DC 2015-10-08 09:09:10.081 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.94 milliseconds, average delay = 60007.94 milliseconds, currently running = 105" 0001E3DD 2015-10-08 09:10:10.081 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool" 0001E3DE 2015-10-08 09:10:10.081 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.94 milliseconds, average delay = 60007.94 milliseconds, currently running = 106"
I then notified Jake who responded with: Captured some stacks and that dafilesrv doesn't look healthy.
Thread 108 (Thread 0x7fbceffff700 (LWP 12139)): #0 0x00007fbcf6bb405e in __lll_lock_wait_private () from /lib64/libc.so.6 #1 0x00007fbcf6b3916b in _L_lock_9503 () from /lib64/libc.so.6 #2 0x00007fbcf6b366a6 in malloc () from /lib64/libc.so.6 #3 0x00007fbcf8cbd9ce in StringBuffer::_realloc(unsigned int) () from /opt/HPCCSystems/lib/libjlib.so #4 0x00007fbcf8cbdb58 in StringBuffer::limited_valist_appendf(unsigned int, char const*, __va_list_tag*) () from /opt/HPCCSystems/lib/libjlib.so #5 0x00007fbcf8cbdfa1 in StringBuffer::appendf(char const*, ...) () from /opt/HPCCSystems/lib/libjlib.so #6 0x00007fbcf8c1c9c2 in excsighandler(int, siginfo*, void*) () from /opt/HPCCSystems/lib/libjlib.so #7 <signal handler called> #8 0x00007fbcf6b31f85 in malloc_consolidate () from /lib64/libc.so.6 #9 0x00007fbcf6b35125 in _int_malloc () from /lib64/libc.so.6 #10 0x00007fbcf6b366b1 in malloc () from /lib64/libc.so.6 #11 0x00007fbcf8cbd9ce in StringBuffer::_realloc(unsigned int) () from /opt/HPCCSystems/lib/libjlib.so #12 0x00007fbcf8cbdb58 in StringBuffer::limited_valist_appendf(unsigned int, char const*, __va_list_tag*) () from /opt/HPCCSystems/lib/libjlib.so #13 0x00007fbcf8c4c656 in CLogMsgManager::report_va(LogMsgCategory const&, LogMsgJobInfo const&, char const*, __va_list_tag*) () from /opt/HPCCSystems/lib/libjlib.so #14 0x0000000000403800 in PROGLOG(char const*, ...) () #15 0x00007fbcf8922ecb in CRemoteFileServer::processCommand(unsigned char, MemoryBuffer&, MemoryBuffer&, CRemoteFileServer::CRemoteClientHandler*, CRemoteFileServer::CT hrottler*) () from /opt/HPCCSystems/lib/libremote.so #16 0x00007fbcf8925019 in CRemoteFileServer::CThrottler::addCommand(unsigned char, MemoryBuffer&, CRemoteFileServer::CRemoteClientHandler*) () from /opt/HPCCSystems/lib /libremote.so #17 0x00007fbcf8925830 in CRemoteFileServer::CRemoteClientHandler::throttleCommand(MemoryBuffer&) () from /opt/HPCCSystems/lib/libremote.so #18 0x00007fbcf8925b63 in CRemoteFileServer::cCommandProcessor::main() () from /opt/HPCCSystems/lib/libremote.so #19 0x00007fbcf8cc635a in CPooledThreadWrapper::run() () from /opt/HPCCSystems/lib/libjlib.so #20 0x00007fbcf8cc4faf in Thread::begin() () from /opt/HPCCSystems/lib/libjlib.so #21 0x00007fbcf8cc3dfc in Thread::_threadmain(void*) () from /opt/HPCCSystems/lib/libjlib.so #22 0x00007fbcf6e579d1 in start_thread () from /lib64/libpthread.so.0 #23 0x00007fbcf6ba48fd in clone () from /lib64/libc.so.6
NB: malloc_consolidate and excsighandler above in the stack
Then there are a 100 or so threads stuck like this:
Thread 80 (Thread 0x7fbcc79f8700 (LWP 10425)): #0 0x00007fbcf6bb405e in __lll_lock_wait_private () from /lib64/libc.so.6 #1 0x00007fbcf6b38cbc in _L_lock_3517 () from /lib64/libc.so.6 #2 0x00007fbcf6b3396d in arena_get2 () from /lib64/libc.so.6 #3 0x00007fbcf6b36794 in malloc () from /lib64/libc.so.6 #4 0x00007fbcf8c0bce0 in checked_realloc(void*, unsigned long, unsigned long, int) () from /opt/HPCCSystems/lib/libjlib.so
I've no idea cause though, I suspect it's going to need restarting.
Can you open a JIRA re. this please?
Conclusion
None
Activity
Show:
Jacob Cobbett-Smith October 15, 2015 at 2:04 PM
Found this in one of the logs on the same cluster, which might be related:
And in different dafilesrv core:
Fixed
Pinned fields
Click on the next to a field label to start pinning.
Eclagent WU's were complaining about
0000000C 2015-10-08 09:38:29.446 2272 2314 "WARNING: Remote connection 10.144.88.98: timeout expired
0000000D 2015-10-08 09:38:29.446 2272 2316 "WARNING: Remote connection 10.144.88.98: timeout expired
Checking dafilesrv on 10.144.88.98 I saw:
0001E3D1 2015-10-08 09:04:10.078 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3D2 2015-10-08 09:04:10.078 11865 11867 "CRemoteFileServerPool: 6601079 threads started in last 258845 seconds, total delay = 60007.96 milliseconds, average delay = 0.01 milliseconds, currently running = 100"
0001E3D3 2015-10-08 09:05:10.079 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3D4 2015-10-08 09:05:10.079 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.94 milliseconds, average delay = 60007.94 milliseconds, currently running = 101"
0001E3D5 2015-10-08 09:06:10.079 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3D6 2015-10-08 09:06:10.079 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.93 milliseconds, average delay = 60007.93 milliseconds, currently running = 102"
0001E3D7 2015-10-08 09:07:10.080 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3D8 2015-10-08 09:07:10.080 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.93 milliseconds, average delay = 60007.93 milliseconds, currently running = 103"
0001E3D9 2015-10-08 09:08:10.080 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3DA 2015-10-08 09:08:10.080 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.93 milliseconds, average delay = 60007.93 milliseconds, currently running = 104"
0001E3DB 2015-10-08 09:09:10.081 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3DC 2015-10-08 09:09:10.081 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.94 milliseconds, average delay = 60007.94 milliseconds, currently running = 105"
0001E3DD 2015-10-08 09:10:10.081 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3DE 2015-10-08 09:10:10.081 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.94 milliseconds, average delay = 60007.94 milliseconds, currently running = 106"
I then notified Jake who responded with:
Captured some stacks and that dafilesrv doesn't look healthy.
Thread 108 (Thread 0x7fbceffff700 (LWP 12139)):
#0 0x00007fbcf6bb405e in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007fbcf6b3916b in _L_lock_9503 () from /lib64/libc.so.6
#2 0x00007fbcf6b366a6 in malloc () from /lib64/libc.so.6
#3 0x00007fbcf8cbd9ce in StringBuffer::_realloc(unsigned int) () from /opt/HPCCSystems/lib/libjlib.so
#4 0x00007fbcf8cbdb58 in StringBuffer::limited_valist_appendf(unsigned int, char const*, __va_list_tag*) () from /opt/HPCCSystems/lib/libjlib.so
#5 0x00007fbcf8cbdfa1 in StringBuffer::appendf(char const*, ...) () from /opt/HPCCSystems/lib/libjlib.so
#6 0x00007fbcf8c1c9c2 in excsighandler(int, siginfo*, void*) () from /opt/HPCCSystems/lib/libjlib.so
#7 <signal handler called>
#8 0x00007fbcf6b31f85 in malloc_consolidate () from /lib64/libc.so.6
#9 0x00007fbcf6b35125 in _int_malloc () from /lib64/libc.so.6
#10 0x00007fbcf6b366b1 in malloc () from /lib64/libc.so.6
#11 0x00007fbcf8cbd9ce in StringBuffer::_realloc(unsigned int) () from /opt/HPCCSystems/lib/libjlib.so
#12 0x00007fbcf8cbdb58 in StringBuffer::limited_valist_appendf(unsigned int, char const*, __va_list_tag*) () from /opt/HPCCSystems/lib/libjlib.so
#13 0x00007fbcf8c4c656 in CLogMsgManager::report_va(LogMsgCategory const&, LogMsgJobInfo const&, char const*, __va_list_tag*) () from /opt/HPCCSystems/lib/libjlib.so
#14 0x0000000000403800 in PROGLOG(char const*, ...) ()
#15 0x00007fbcf8922ecb in CRemoteFileServer::processCommand(unsigned char, MemoryBuffer&, MemoryBuffer&, CRemoteFileServer::CRemoteClientHandler*, CRemoteFileServer::CT
hrottler*) () from /opt/HPCCSystems/lib/libremote.so
#16 0x00007fbcf8925019 in CRemoteFileServer::CThrottler::addCommand(unsigned char, MemoryBuffer&, CRemoteFileServer::CRemoteClientHandler*) () from /opt/HPCCSystems/lib
/libremote.so
#17 0x00007fbcf8925830 in CRemoteFileServer::CRemoteClientHandler::throttleCommand(MemoryBuffer&) () from /opt/HPCCSystems/lib/libremote.so
#18 0x00007fbcf8925b63 in CRemoteFileServer::cCommandProcessor::main() () from /opt/HPCCSystems/lib/libremote.so
#19 0x00007fbcf8cc635a in CPooledThreadWrapper::run() () from /opt/HPCCSystems/lib/libjlib.so
#20 0x00007fbcf8cc4faf in Thread::begin() () from /opt/HPCCSystems/lib/libjlib.so
#21 0x00007fbcf8cc3dfc in Thread::_threadmain(void*) () from /opt/HPCCSystems/lib/libjlib.so
#22 0x00007fbcf6e579d1 in start_thread () from /lib64/libpthread.so.0
#23 0x00007fbcf6ba48fd in clone () from /lib64/libc.so.6
NB: malloc_consolidate and excsighandler above in the stack
Then there are a 100 or so threads stuck like this:
Thread 80 (Thread 0x7fbcc79f8700 (LWP 10425)):
#0 0x00007fbcf6bb405e in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007fbcf6b38cbc in _L_lock_3517 () from /lib64/libc.so.6
#2 0x00007fbcf6b3396d in arena_get2 () from /lib64/libc.so.6
#3 0x00007fbcf6b36794 in malloc () from /lib64/libc.so.6
#4 0x00007fbcf8c0bce0 in checked_realloc(void*, unsigned long, unsigned long, int) () from /opt/HPCCSystems/lib/libjlib.so
I've no idea cause though, I suspect it's going to need restarting.
Can you open a JIRA re. this please?