dafilesrv Pool limit exceeded for CRemoteFileServerPool

Environment

http://10.144.140.19:8010/

Description

Eclagent WU's were complaining about
0000000C 2015-10-08 09:38:29.446 2272 2314 "WARNING: Remote connection 10.144.88.98: timeout expired
0000000D 2015-10-08 09:38:29.446 2272 2316 "WARNING: Remote connection 10.144.88.98: timeout expired

Checking dafilesrv on 10.144.88.98 I saw:
0001E3D1 2015-10-08 09:04:10.078 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3D2 2015-10-08 09:04:10.078 11865 11867 "CRemoteFileServerPool: 6601079 threads started in last 258845 seconds, total delay = 60007.96 milliseconds, average delay = 0.01 milliseconds, currently running = 100"
0001E3D3 2015-10-08 09:05:10.079 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3D4 2015-10-08 09:05:10.079 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.94 milliseconds, average delay = 60007.94 milliseconds, currently running = 101"
0001E3D5 2015-10-08 09:06:10.079 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3D6 2015-10-08 09:06:10.079 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.93 milliseconds, average delay = 60007.93 milliseconds, currently running = 102"
0001E3D7 2015-10-08 09:07:10.080 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3D8 2015-10-08 09:07:10.080 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.93 milliseconds, average delay = 60007.93 milliseconds, currently running = 103"
0001E3D9 2015-10-08 09:08:10.080 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3DA 2015-10-08 09:08:10.080 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.93 milliseconds, average delay = 60007.93 milliseconds, currently running = 104"
0001E3DB 2015-10-08 09:09:10.081 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3DC 2015-10-08 09:09:10.081 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.94 milliseconds, average delay = 60007.94 milliseconds, currently running = 105"
0001E3DD 2015-10-08 09:10:10.081 11865 11867 "WARNING: Pool limit exceeded for CRemoteFileServerPool"
0001E3DE 2015-10-08 09:10:10.081 11865 11867 "CRemoteFileServerPool: 1 threads started in last 60 seconds, total delay = 60007.94 milliseconds, average delay = 60007.94 milliseconds, currently running = 106"

I then notified Jake who responded with:
Captured some stacks and that dafilesrv doesn't look healthy.

Thread 108 (Thread 0x7fbceffff700 (LWP 12139)):
#0 0x00007fbcf6bb405e in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007fbcf6b3916b in _L_lock_9503 () from /lib64/libc.so.6
#2 0x00007fbcf6b366a6 in malloc () from /lib64/libc.so.6
#3 0x00007fbcf8cbd9ce in StringBuffer::_realloc(unsigned int) () from /opt/HPCCSystems/lib/libjlib.so
#4 0x00007fbcf8cbdb58 in StringBuffer::limited_valist_appendf(unsigned int, char const*, __va_list_tag*) () from /opt/HPCCSystems/lib/libjlib.so
#5 0x00007fbcf8cbdfa1 in StringBuffer::appendf(char const*, ...) () from /opt/HPCCSystems/lib/libjlib.so
#6 0x00007fbcf8c1c9c2 in excsighandler(int, siginfo*, void*) () from /opt/HPCCSystems/lib/libjlib.so
#7 <signal handler called>
#8 0x00007fbcf6b31f85 in malloc_consolidate () from /lib64/libc.so.6
#9 0x00007fbcf6b35125 in _int_malloc () from /lib64/libc.so.6
#10 0x00007fbcf6b366b1 in malloc () from /lib64/libc.so.6
#11 0x00007fbcf8cbd9ce in StringBuffer::_realloc(unsigned int) () from /opt/HPCCSystems/lib/libjlib.so
#12 0x00007fbcf8cbdb58 in StringBuffer::limited_valist_appendf(unsigned int, char const*, __va_list_tag*) () from /opt/HPCCSystems/lib/libjlib.so
#13 0x00007fbcf8c4c656 in CLogMsgManager::report_va(LogMsgCategory const&, LogMsgJobInfo const&, char const*, __va_list_tag*) () from /opt/HPCCSystems/lib/libjlib.so
#14 0x0000000000403800 in PROGLOG(char const*, ...) ()
#15 0x00007fbcf8922ecb in CRemoteFileServer::processCommand(unsigned char, MemoryBuffer&, MemoryBuffer&, CRemoteFileServer::CRemoteClientHandler*, CRemoteFileServer::CT
hrottler*) () from /opt/HPCCSystems/lib/libremote.so
#16 0x00007fbcf8925019 in CRemoteFileServer::CThrottler::addCommand(unsigned char, MemoryBuffer&, CRemoteFileServer::CRemoteClientHandler*) () from /opt/HPCCSystems/lib
/libremote.so
#17 0x00007fbcf8925830 in CRemoteFileServer::CRemoteClientHandler::throttleCommand(MemoryBuffer&) () from /opt/HPCCSystems/lib/libremote.so
#18 0x00007fbcf8925b63 in CRemoteFileServer::cCommandProcessor::main() () from /opt/HPCCSystems/lib/libremote.so
#19 0x00007fbcf8cc635a in CPooledThreadWrapper::run() () from /opt/HPCCSystems/lib/libjlib.so
#20 0x00007fbcf8cc4faf in Thread::begin() () from /opt/HPCCSystems/lib/libjlib.so
#21 0x00007fbcf8cc3dfc in Thread::_threadmain(void*) () from /opt/HPCCSystems/lib/libjlib.so
#22 0x00007fbcf6e579d1 in start_thread () from /lib64/libpthread.so.0
#23 0x00007fbcf6ba48fd in clone () from /lib64/libc.so.6

NB: malloc_consolidate and excsighandler above in the stack

Then there are a 100 or so threads stuck like this:

Thread 80 (Thread 0x7fbcc79f8700 (LWP 10425)):
#0 0x00007fbcf6bb405e in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007fbcf6b38cbc in _L_lock_3517 () from /lib64/libc.so.6
#2 0x00007fbcf6b3396d in arena_get2 () from /lib64/libc.so.6
#3 0x00007fbcf6b36794 in malloc () from /lib64/libc.so.6
#4 0x00007fbcf8c0bce0 in checked_realloc(void*, unsigned long, unsigned long, int) () from /opt/HPCCSystems/lib/libjlib.so

I've no idea cause though, I suspect it's going to need restarting.

Can you open a JIRA re. this please?

Conclusion

None

Activity

Show:

Jacob Cobbett-Smith October 15, 2015 at 2:04 PM

Found this in one of the logs on the same cluster, which might be related:

And in different dafilesrv core:

Fixed
Pinned fields
Click on the next to a field label to start pinning.

Details

Components

Assignee

Reporter

Priority

Fix versions

Pull Request URL

Affects versions

Created October 8, 2015 at 2:23 PM
Updated October 20, 2015 at 8:47 AM
Resolved October 20, 2015 at 8:47 AM