Possible memory or thread leak in LDAP Round Robin Health Check?

JeffSaxe · August 6, 2019, 7:23pm

I just did an upgrade (on a test, non-production cluster) to ES 6.8.2 and RoR PRO 1.18.4. Now, all three of this cluster’s ES nodes won’t stay up for more than a couple of hours; the service crashes and the last thing in the log file is this:

[2019-08-06T11:47:44,231][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [qim-elastic1-d7] fatal error in thread [Health Check Thread for LDAPConnectionPool(serverSet=RoundRobinServerSet(servers={qim-dc-05.qim.com:636, qim-dc-06.qim.com:636, ny4-dc-02.qim.com:636}, includesAuthentication=false, includesPostConnectProcessing=false), maxConnections=10)], exiting
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) ~[?:1.8.0_222]
at java.lang.Thread.start(Thread.java:717) [?:1.8.0_222]
at com.unboundid.ldap.sdk.LDAPConnectionInternals.(LDAPConnectionInternals.java:160) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnection.connect(LDAPConnection.java:860) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnection.connect(LDAPConnection.java:760) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnection.connect(LDAPConnection.java:710) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnection.(LDAPConnection.java:534) ~[?:?]
at com.unboundid.ldap.sdk.RoundRobinServerSet.getConnection(RoundRobinServerSet.java:391) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPool.createConnection(LDAPConnectionPool.java:1285) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPool.createConnection(LDAPConnectionPool.java:1258) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPool.handleDefunctConnection(LDAPConnectionPool.java:2171) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPool.invokeHealthCheck(LDAPConnectionPool.java:2893) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPool.invokeHealthCheck(LDAPConnectionPool.java:2806) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPool.doHealthCheck(LDAPConnectionPool.java:2776) ~[?:?]
at com.unboundid.ldap.sdk.LDAPConnectionPoolHealthCheckThread.run(LDAPConnectionPoolHealthCheckThread.java:94) ~[?:?]

So it seems that the routine health check that checks each LDAP server in my Round Robin group for liveness is unable to get a new Thread from Java. It’s possible that the association with the LDAP health check is just coincidence, because on the other two servers that crashed most recently, the error is more generic…

[2019-08-06T11:47:45,515][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [qim-elastic1-d8] fatal error in thread [elasticsearch[qim-elastic1-d8][generic][T#7]], exiting
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) ~[?:1.8.0_222]
at java.lang.Thread.start(Thread.java:717) ~[?:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) ~[?:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1025) ~[?:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) ~[?:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_222]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]

So still “unable to create new native thread”, but it’s not the LDAP checker that is trying to do so this time, it’s more of a generic thread executor function. It might or might not be be that the LDAP checker is the guy that consumed all the available threads and some other thing is then unable to use the next one.

Anyway, this symptom never happened on any previous version, so I expect it is a regression. To start troubleshooting, I could temporarily change my LDAP to Failover instead of Round Robin mode and see if it stops doing this. Or if you have any other things you’d like me to try, I will be glad to. Thank you.

– Jeff Saxe
Quantitative Investment Management
Charlottesville, VA, US

JeffSaxe · August 6, 2019, 8:05pm

Hmm, OK, I have done that change to FAILOVER mode and I will see what goes on, but in the process of changing this setting, I maybe have stumbled on the real cause! When I upgraded to 1.18.4, I suddenly came to face a newly-enforced config restriction, that each ACL stanza can contain either “actions” or “kibana_access”, but not both. I was trying to wrap my head around the implications of this for my end users and groups config, but just in order to get the nodes to start up at all, I commented out the “actions” line in a couple of places in readonlyread.yml (the file-based config).

However, this cluster still contained a .readonlyrest index-based config, and THAT config was now invalid as well! But I think the underlying polling logic retrieves the single document in there, tests to see if it parses properly, and if it does, loads it and starts using it… but if it isn’t valid, it just “coasts” on the config it already had and then polls again in a few seconds. So that may have been the real issue… I believe I had a valid file but invalid index-based config, and left it that way, and the RoR config-reloader polled over and over and used up all the threads.

I will try to figure out if this is the true issue. If the nodes stay up for a few hours now (after I have corrected the index-based config), then I will put some invalid content back into the index document and then reload one of the nodes, and see if it crashes after a couple of hours again.

– JeffS

sscarduzio · August 6, 2019, 8:56pm

Hi @JeffSaxe, nice to hear from you again

Just finished reading through your nice analysis. Your conclusions make a lot of sense. Of course please let us know if correcting the in-index settings solves the OoM error!

About mixing kibana_access and actions, we were in the situation of repeating the same thing to a lot of customers: those two can’t live in the same ACL block without interfering and creating unclear behaviours.

This is because kibana_access is a meta-rule that already includes a filter on some actions. More on this in the docs.

Of course, the OoM was not planned

JeffSaxe · August 7, 2019, 3:01pm

OK, last night I put the FAILOVER back to ROUND_ROBIN and restarted all three nodes for good measure, and it stayed up just fine overnight. So now I’m intentionally misspelling the attribute value (so that it will fail to parse), saving the settings, and restarting all three again, so they will load from file but then be unable to pull correctly from the document. I expect that they will crash within a couple of hours, and I will report back. If that ends up being the issue, then you will need to look at the polling or timer code to make sure you either start just one thread, and let it run forever, or start a thread and then finish / clean up that thread, over and over, but not start unlimited threads and never finish them. A low-priority bug, not likely to impact many customers.

By the way, I think there is a GUI regression or something… I believe it used to do a quick parse-check of the contents of the big text box in the RoR settings editor, and only if it was successful would it save into the index…?? Is this my imagination? Obviously if there is a problem with the content of that box, it’s my own fault, but I believe there used to be a last-minute safety net that would catch errors and report them to me in red text, and then I would be able to fix it and hit Save again. As it is right now, I don’t have any feedback as to whether the content I am saving is valid or not. What I rely on is doing “tail -F” on the Elasticsearch nodes’ logs, and if it is successful, then they spit out the ADDING BLOCK lines, but if it doesn’t parse, they don’t say anything. If there is supposed to be a syntax check on the Save button, consider bringing it back; if there never was, consider adding one, or adding a “Check Syntax” button that performs only the check without Saving, and gives a nice green “Parses successfully” message when it comes back clean. Thank you.

Oh, and thanks for the link to the documentation about the kibana_access. I will figure out what combination of rules I need to give my users a good experience while keeping sensitive information private.

JeffSaxe · August 7, 2019, 8:19pm

Confirmed, I got the crash again, after about 95 minutes of operation. The only difference was having the stale, now-officially-invalid bit of config in there…

actions: ["indices:data/read/*","cluster:monitor/*"]
kibana_access: rw
kibana_hide_apps: ["readonlyrest_kbn"]

So yes, please add this as a low-priority bug. As I said, it would only happen if the RoR admin leaves in content that doesn’t parse, so if the Save button goes back to the behavior of refusing to save when it doesn’t parse, then it’s not likely to impact any users (except in an unusual upgrade scenario like mine). But if you make no changes, and the user is allowed to save invalid content, this has the potential to be really bad… multiple ES nodes would all drop like flies, all around the same time as each other, leading rapidly to Yellow or Red or completely shut down clusters. So it should be addressed somehow.

I’m glad it wasn’t anything more serious, so now I can continue my upgrade plans!

sscarduzio · August 8, 2019, 5:34pm

Hi @JeffSaxe, very impressed with your analysis, thank you! Will file a bug immediately.

coutoPL · September 15, 2019, 9:03am

Hi @JeffSaxe I think we were able to fix the problem you’ve reported. Could you please check our pre build and let us know if the issue is not there any more?

https://readonlyrest-data.s3-eu-west-1.amazonaws.com/build/1.18.6-pre6/readonlyrest-1.18.6-pre6_es6.8.2.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA5SJIWBO54AGBERLX/20190915/eu-west-1/s3/aws4_request&X-Amz-Date=20190915T090053Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=0b560e992e531b48cb3c94edf9f2bfe7fdd2c3503ef40f58d2d0ade781ead8b3