I just did an upgrade (on a test, non-production cluster) to ES 6.8.2 and RoR PRO 1.18.4. Now, all three of this cluster’s ES nodes won’t stay up for more than a couple of hours; the service crashes and the last thing in the log file is this:
[2019-08-06T11:47:44,231][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [qim-elastic1-d7] fatal error in thread [Health Check Thread for LDAPConnectionPool(serverSet=RoundRobinServerSet(servers={qim-dc-05.qim.com:636, qim-dc-06.qim.com:636, ny4-dc-02.qim.com:636}, includesAuthentication=false, includesPostConnectProcessing=false), maxConnections=10)], exiting
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) ~[?:1.8.0_222]
at java.lang.Thread.start(Thread.java:717) [?:1.8.0_222]
at com.unboundid.ldap.sdk.LDAPConnectionInternals.<init>(LDAPConnectionInternals.java:160) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnection.connect(LDAPConnection.java:860) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnection.connect(LDAPConnection.java:760) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnection.connect(LDAPConnection.java:710) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnection.<init>(LDAPConnection.java:534) ~[?:?]
at com.unboundid.ldap.sdk.RoundRobinServerSet.getConnection(RoundRobinServerSet.java:391) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPool.createConnection(LDAPConnectionPool.java:1285) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPool.createConnection(LDAPConnectionPool.java:1258) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPool.handleDefunctConnection(LDAPConnectionPool.java:2171) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPool.invokeHealthCheck(LDAPConnectionPool.java:2893) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPool.invokeHealthCheck(LDAPConnectionPool.java:2806) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPool.doHealthCheck(LDAPConnectionPool.java:2776) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPoolHealthCheckThread.run(LDAPConnectionPoolHealthCheckThread.java:94) ~[?:?]
So it seems that the routine health check that checks each LDAP server in my Round Robin group for liveness is unable to get a new Thread from Java. It’s possible that the association with the LDAP health check is just coincidence, because on the other two servers that crashed most recently, the error is more generic…
[2019-08-06T11:47:45,515][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [qim-elastic1-d8] fatal error in thread [elasticsearch[qim-elastic1-d8][generic][T#7]], exiting
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) ~[?:1.8.0_222]
at java.lang.Thread.start(Thread.java:717) ~[?:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) ~[?:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1025) ~[?:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) ~[?:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_222]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
So it is still “unable to create new native thread”, but this time it is not the LDAP checker that is trying to do so; it is a generic thread pool executor. It may or may not be that the LDAP checker consumed all the available threads, leaving some other component unable to create the next one.
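One way I could test that hypothesis is to take a thread dump of the ES process (for example with jstack) some time before a crash and count the live threads by name, to see which group keeps growing. Below is a minimal sketch of that counting step, not a finished tool. It assumes the dump has been saved as text and that each thread's header line begins with the thread name in double quotes, as HotSpot thread dumps do; the function name and the digit-collapsing rule are just my own illustration.

```python
import re
from collections import Counter

# Assumption: a thread's header line starts with its name in double quotes,
# followed by whitespace and the rest of the thread details.
THREAD_HEADER = re.compile(r'^"(?P<name>.+?)"\s')


def count_threads_by_group(dump_text):
    """Count threads in a text thread dump, grouped by normalized name.

    Runs of digits in the name are replaced with "N" so that numbered
    threads from the same pool (e.g. "...[T#7]" and "...[T#12]") are
    counted together.
    """
    counts = Counter()
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match is None:
            continue  # stack frame or blank line, not a thread header
        group = re.sub(r"\d+", "N", match.group("name"))
        counts[group] += 1
    return counts


def print_largest_groups(dump_text, limit=10):
    """Print the thread groups with the most members, largest first."""
    for group, n in count_threads_by_group(dump_text).most_common(limit):
        print(f"{n:6d}  {group}")
```

Running print_largest_groups on two dumps taken an hour apart and comparing the output should show whether the growth is in LDAP-related threads or somewhere else.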
Anyway, this symptom never happened on any previous version, so I expect it is a regression. To start troubleshooting, I could temporarily change my LDAP to Failover instead of Round Robin mode and see if it stops doing this. Or if you have any other things you’d like me to try, I will be glad to. Thank you.
– Jeff Saxe
Quantitative Investment Management
Charlottesville, VA, US