Losing connections to LDAP servers

Running four clusters of Elasticsearch and Kibana 7.5.0 with RoR 1.18.9 and Kibana Pro 1.18.9.

LDAP config in the ROR config:

    - name: ldap1
      hosts: ["freeipa1-0...", "freeipa1-0..."]
      ha: "FAILOVER"
      port: 636
      bind_dn: "...,dc=localdomain"
      bind_password: "derpderpderpderp"
      ssl_enabled: true
      ssl_trust_all_certs: true
      search_user_base_DN: "...,dc=localdomain"
      search_groups_base_DN: "...,dc=localdomain"
      user_id_attribute: "uid"
      unique_member_attribute: "member"
      connection_pool_size: 10
      connection_timeout_in_sec: 30
      request_timeout_in_sec: 30
      cache_ttl_in_sec: 60
      group_search_filter: "(objectclass=top)"
      group_name_attribute: "cn"

When a cluster is used often, login works consistently. However, when a cluster is not used frequently, logins tend to fail with “could not login”. Looking in the ES logs shows the issue:

[2020-03-26T12:42:14,080][ERROR][t.b.r.a.b.Block          ] [es2-3...] ldap-admin:
  ldap_auth rule matching got an error LDAP returned code: timeout [85],
  cause: result code='85 (timeout)' diagnostic message='The asynchronous operation
  encountered a client-side timeout after waiting 30001 milliseconds for a response to arrive.' ...

LDAP login to these servers works on other services consistently.

What I’ve found to be an effective fix is to log in with any non-LDAP account (such as admin or the kibana user), go to the ReadonlyREST tab, change anything - I typically change connection_timeout_in_sec - and save it. This forces ROR to reconnect to the LDAP servers, after which LDAP auth works again. Once I’m logged in, I can run my update-ror script to push the contents of readonlyrest.yml into the index. I’ve tried doing this first, but unless I change a value it reports that the contents are already up to date, and that doesn’t trigger the LDAP reconnect.

Suggestions on how to avoid having to go through this each time I want to access a less-frequently accessed cluster? Some sort of LDAP keep-alive?
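In the absence of a built-in keep-alive, one option is a scheduled job that periodically authenticates against the cluster with an LDAP account so the plugin's LDAP pool sees regular traffic. A minimal sketch, with everything here being my assumption rather than anything from ROR: the endpoint, credentials, and interval are placeholders, and any endpoint guarded by an ldap_auth block would do.

```python
# Hypothetical keep-alive sketch (not part of ROR): periodically hit an
# LDAP-guarded endpoint so the plugin's LDAP connections stay exercised.
# The URL, endpoint, user, and password below are placeholders.
import base64
import time
import urllib.request

def build_ping_request(base_url, user, password):
    """Build a Basic-auth request to a cheap endpoint behind an ldap_auth block."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/_cluster/health",  # placeholder: any LDAP-guarded endpoint
        headers={"Authorization": f"Basic {token}"},
    )

def keep_alive(ping, interval_sec=300, iterations=None, sleep=time.sleep):
    """Call ping() every interval_sec seconds; iterations=None loops forever."""
    n = 0
    while iterations is None or n < iterations:
        try:
            ping()  # e.g. urllib.request.urlopen(build_ping_request(...))
        except Exception as exc:  # a failed ping is logged, not fatal
            print(f"keep-alive ping failed: {exc}")
        n += 1
        if iterations is None or n < iterations:
            sleep(interval_sec)
```

Run from cron or systemd, this would only paper over the underlying pool problem, but it avoids the manual config-save dance for rarely used clusters.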

This is very likely the effects of a leak we fixed in 1.19.3. @clutroth can confirm?

Ok, it just so happens that one of the clusters that gets used infrequently is also the one I can roll new versions out to most easily. People keep asking “where are the logs for X?” and I keep telling them, but man oh man I must be using expiring words or something because they keep bloody asking.

I’ll give that a whirl and post results.

No dice. It worked fine after the upgrade to 1.19.3 yesterday, but that’s to be expected. Tried today:

[2020-03-27T18:49:44,539][INFO ][t.b.r.a.l.AccessControlLoggingDecorator] [es2-2...] FORBIDDEN by default req={  ID:1923741058-1213703257#3202865,  TYP:RRAdminRequest,  CGR:N/A,  USR:cheerschap (attempted)...

Instead of displaying the usual “Could not login”, Firefox now shows an error that the site couldn’t serve the page. I didn’t grab the exact error message, sorry about that, but I will if I can get it to happen again.

Stranger yet, I grabbed the kibana creds and logged in with those, only to see:

[Screenshot 2020-03-27 14.57.00]

My LDAP cheerschap login had failed - and this login was with the kibana user, which is locally defined in the ROR config. Logging out and back in with the same credentials set the login username properly to kibana.

Subsequently logging out and logging back in as my LDAP user succeeded, but that’s to be expected since I had changed and saved the ROR config.

Confirmed running 1.19.3:

    [email protected]:cheerschap# plugin list -v
    Plugins directory: /usr/share/elasticsearch/plugins
    readonlyrest
    - Plugin information:
    Name: readonlyrest
    Description: Safely expose Elasticsearch REST API
    Version: 1.19.3
    Elasticsearch Version: 7.5.0
    Java Version: 1.8
    Native Controller: false
    Extended Plugins: []
     * Classname: tech.beshu.ror.es.ReadonlyRestPlugin

Yes, it is.


Here’s the previously mentioned error received after LDAP login fails - it appears after the configured 30s timeout:

Reloading that page results in this:

[Screenshot 2020-03-27 17.09.26]

I confirmed that the cluster is still up and 100% functional. Updating the readonlyrest config as described above and Kibana starts working again.

@cmh, a couple of questions:

I’ve just read the first post and I think I now understand what is going on. It seems that when LDAP error 85 occurs, the connection is broken and cannot be used any more - but it is returned to the pool anyway, after which it cannot handle any further requests.

Still, it’d be nice to see logs. We will try to reproduce it in our integration tests.
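The suspected failure mode can be sketched with a toy pool (the naming here is mine, not ROR's internals): a connection that dies after a timeout is returned to the pool and poisons every later request that draws it, whereas validating on checkout recovers.

```python
# Toy illustration of the suspected bug: a dead connection returned to the
# pool keeps failing requests until it is validated and replaced on checkout.
from collections import deque

class Conn:
    """Stand-in for an LDAP connection that can go dead after a timeout."""
    def __init__(self):
        self.alive = True

    def query(self):
        if not self.alive:
            raise ConnectionError("dead connection drawn from pool")
        return "ok"

class Pool:
    def __init__(self, size, validate=False):
        self._conns = deque(Conn() for _ in range(size))
        self._validate = validate

    def acquire(self):
        conn = self._conns.popleft()
        if self._validate and not conn.alive:
            conn = Conn()  # discard the dead one, open a fresh connection
        return conn

    def release(self, conn):
        self._conns.append(conn)  # buggy variant returns dead conns unchanged
```

With `validate=False`, a connection killed by a client-side timeout fails every subsequent request that happens to draw it, which matches the symptom of LDAP auth staying broken until the pool is rebuilt by a config save.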

Yep, these LDAP servers handle login for other services on the network, many of which are used infrequently, and those work consistently.

In terms of logs, here are the related entries from this morning:

[2020-03-30T15:06:00,163][ERROR][t.b.r.a.b.d.l.i.UnboundidLdapAuthenticationService] [bz1es2-3.bb.internal.maas360.com] LDAP getting user CN returned error: [code=85 (timeout), cause=result code='85 (timeout)' diagnostic message='The asynchronous operation encountered a client-side timeout after waiting 29999 milliseconds for a response to arrive.']
[2020-03-30T15:06:00,163][ERROR][t.b.r.a.b.d.l.i.UnboundidLdapAuthenticationService] [bz1es2-3.bb.internal.maas360.com] LDAP getting user operation failed.
[2020-03-30T15:06:00,163][ERROR][t.b.r.a.b.Block          ] [bz1es2-3.bb.internal.maas360.com] ldap-admin: ldap_auth rule matching got an error LDAP returned code: timeout [85], cause: result code='85 (timeout)' diagnostic message='The asynchronous operation encountered a client-side timeout after waiting 29999 milliseconds for a response to arrive.'
tech.beshu.ror.accesscontrol.blocks.definitions.ldap.implementations.LdapUnexpectedResult: LDAP returned code: timeout [85], cause: result code='85 (timeout)' diagnostic message='The asynchronous operation encountered a client-side timeout after waiting 29999 milliseconds for a response to arrive.'
	at tech.beshu.ror.accesscontrol.blocks.definitions.ldap.implementations.BaseUnboundidLdapService.$anonfun$ldapUserBy$2(UnboundidLdapService.scala:240) ~[?:?]
	at monix.eval.internal.TaskRunLoop$.startFull(TaskRunLoop.scala:170) ~[?:?]
	at monix.eval.Task$.unsafeStartNow(Task.scala:4582) ~[?:?]
	at monix.eval.internal.TaskBracket$BaseStart$$anon$1.onSuccess(TaskBracket.scala:187) ~[?:?]
	at monix.eval.internal.TaskRunLoop$.startFull(TaskRunLoop.scala:165) ~[?:?]
	at monix.eval.Task$.unsafeStartNow(Task.scala:4582) ~[?:?]
	at monix.eval.internal.TaskBracket$BaseStart.apply(TaskBracket.scala:174) ~[?:?]
	at monix.eval.internal.TaskBracket$BaseStart.apply(TaskBracket.scala:164) ~[?:?]
	at monix.eval.internal.TaskRestartCallback.run(TaskRestartCallback.scala:65) ~[?:?]
	at monix.execution.internal.Trampoline.monix$execution$internal$Trampoline$$immediateLoop(Trampoline.scala:66) ~[?:?]
	at monix.execution.internal.Trampoline.startLoop(Trampoline.scala:32) ~[?:?]
	at monix.execution.schedulers.TrampolineExecutionContext$JVMOptimalTrampoline.startLoop(TrampolineExecutionContext.scala:143) ~[?:?]
	at monix.execution.internal.Trampoline.execute(Trampoline.scala:40) ~[?:?]
	at monix.execution.schedulers.TrampolineExecutionContext.execute(TrampolineExecutionContext.scala:64) ~[?:?]
	at monix.execution.schedulers.BatchingScheduler.execute(BatchingScheduler.scala:50) ~[?:?]
	at monix.execution.schedulers.BatchingScheduler.execute$(BatchingScheduler.scala:47) ~[?:?]
	at monix.execution.schedulers.AsyncScheduler.execute(AsyncScheduler.scala:31) ~[?:?]
	at monix.eval.internal.TaskRestartCallback.start(TaskRestartCallback.scala:56) ~[?:?]
	at monix.eval.internal.TaskRunLoop$.executeAsyncTask(TaskRunLoop.scala:592) ~[?:?]
	at monix.eval.internal.TaskRunLoop$.startFull(TaskRunLoop.scala:120) ~[?:?]
	at monix.eval.internal.TaskRunLoop$.$anonfun$restartAsync$1(TaskRunLoop.scala:222) ~[?:?]
	at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) [?:1.8.0_242]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:1.8.0_242]
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:1.8.0_242]
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:1.8.0_242]
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) [?:1.8.0_242]
[2020-03-30T15:06:30,679][ERROR][t.b.r.a.b.d.l.i.UnboundidLdapAuthorizationService] [bz1es2-3.bb.internal.maas360.com] LDAP getting user CN returned error: [code=85 (timeout), cause=result code='85 (timeout)' diagnostic message='The asynchronous operation encountered a client-side timeout after waiting 29999 milliseconds for a response to arrive.']
[2020-03-30T15:06:30,680][ERROR][t.b.r.a.b.d.l.i.UnboundidLdapAuthorizationService] [bz1es2-3.bb.internal.maas360.com] LDAP getting user operation failed.
[2020-03-30T15:06:30,680][ERROR][t.b.r.a.b.Block          ] [bz1es2-3.bb.internal.maas360.com] ldap-all: ldap_auth rule matching got an error LDAP returned code: timeout [85], cause: result code='85 (timeout)' diagnostic message='The asynchronous operation encountered a client-side timeout after waiting 29999 milliseconds for a response to arrive.'
tech.beshu.ror.accesscontrol.blocks.definitions.ldap.implementations.LdapUnexpectedResult: LDAP returned code: timeout [85], cause: result code='85 (timeout)' diagnostic message='The asynchronous operation encountered a client-side timeout after waiting 29999 milliseconds for a response to arrive.'
	at tech.beshu.ror.accesscontrol.blocks.definitions.ldap.implementations.BaseUnboundidLdapService.$anonfun$ldapUserBy$2(UnboundidLdapService.scala:240) ~[core-1.19.3.jar:?]
	at monix.eval.internal.TaskRunLoop$.startFull(TaskRunLoop.scala:170) ~[monix-eval_2.12-3.0.0.jar:3.0.0]
	at monix.eval.Task$.unsafeStartNow(Task.scala:4582) ~[monix-eval_2.12-3.0.0.jar:3.0.0]
	at monix.eval.internal.TaskBracket$BaseStart$$anon$1.onSuccess(TaskBracket.scala:187) ~[monix-eval_2.12-3.0.0.jar:3.0.0]
	at monix.eval.internal.TaskRunLoop$.startFull(TaskRunLoop.scala:165) ~[monix-eval_2.12-3.0.0.jar:3.0.0]
	at monix.eval.Task$.unsafeStartNow(Task.scala:4582) ~[monix-eval_2.12-3.0.0.jar:3.0.0]
	at monix.eval.internal.TaskBracket$BaseStart.apply(TaskBracket.scala:174) ~[monix-eval_2.12-3.0.0.jar:3.0.0]
	at monix.eval.internal.TaskBracket$BaseStart.apply(TaskBracket.scala:164) ~[monix-eval_2.12-3.0.0.jar:3.0.0]
	at monix.eval.internal.TaskRestartCallback.run(TaskRestartCallback.scala:65) ~[monix-eval_2.12-3.0.0.jar:3.0.0]
	at monix.execution.internal.Trampoline.monix$execution$internal$Trampoline$$immediateLoop(Trampoline.scala:66) [monix-execution_2.12-3.0.0.jar:3.0.0]
	at monix.execution.internal.Trampoline.startLoop(Trampoline.scala:32) [monix-execution_2.12-3.0.0.jar:3.0.0]
	at monix.execution.schedulers.TrampolineExecutionContext$JVMOptimalTrampoline.startLoop(TrampolineExecutionContext.scala:143) [monix-execution_2.12-3.0.0.jar:3.0.0]
	at monix.execution.internal.Trampoline.execute(Trampoline.scala:40) [monix-execution_2.12-3.0.0.jar:3.0.0]
	at monix.execution.schedulers.TrampolineExecutionContext.execute(TrampolineExecutionContext.scala:64) [monix-execution_2.12-3.0.0.jar:3.0.0]
	at monix.execution.schedulers.BatchingScheduler.execute(BatchingScheduler.scala:50) [monix-execution_2.12-3.0.0.jar:3.0.0]
	at monix.execution.schedulers.BatchingScheduler.execute$(BatchingScheduler.scala:47) [monix-execution_2.12-3.0.0.jar:3.0.0]
	at monix.execution.schedulers.AsyncScheduler.execute(AsyncScheduler.scala:31) [monix-execution_2.12-3.0.0.jar:3.0.0]
	at monix.eval.internal.TaskRestartCallback.start(TaskRestartCallback.scala:56) [monix-eval_2.12-3.0.0.jar:3.0.0]
	at monix.eval.internal.TaskRunLoop$.executeAsyncTask(TaskRunLoop.scala:592) [monix-eval_2.12-3.0.0.jar:3.0.0]
	at monix.eval.internal.TaskRunLoop$.startFull(TaskRunLoop.scala:120) [monix-eval_2.12-3.0.0.jar:3.0.0]
	at monix.eval.internal.TaskRestartCallback.syncOnSuccess(TaskRestartCallback.scala:101) [monix-eval_2.12-3.0.0.jar:3.0.0]
	at monix.eval.internal.TaskRestartCallback.onSuccess(TaskRestartCallback.scala:74) [monix-eval_2.12-3.0.0.jar:3.0.0]
	at monix.eval.internal.TaskShift$Register$$anon$1.run(TaskShift.scala:65) [monix-eval_2.12-3.0.0.jar:3.0.0]
	at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) [?:1.8.0_242]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:1.8.0_242]
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:1.8.0_242]
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:1.8.0_242]
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) [?:1.8.0_242]
[2020-03-30T15:06:30,684][INFO ][t.b.r.a.l.AccessControlLoggingDecorator] [bz1es2-3.bb.internal.maas360.com] FORBIDDEN by default req={  ID:663354327-1515145325#19136717,  TYP:RRAdminRequest,  CGR:N/A,  USR:cheerschap (attempted),  BRS:false,  KDX:null,  ACT:cluster:admin/rradmin/refreshsettings,  OA:10.2.111.52/32,  XFF:null,  DA:10.2.111.23/32,  IDX:<N/A>,  MET:GET,  PTH:/_readonlyrest/metadata/current_user,  CNT:<N/A>,  HDR:Authorization=<OMITTED>, Connection=keep-alive, Content-Length=0, Host=bz1es2-3.bb.internal.maas360.com:9200,  HIS:[local-admin-> RULES:[auth_key_unix->false]], [local-logstash-> RULES:[auth_key_unix->false]], [local-kibana-> RULES:[auth_key_unix->false]], [local-puppet-> RULES:[auth_key_unix->false]], [local-jenkins-> RULES:[auth_key_unix->false]], [elastalert-> RULES:[auth_key_unix->false]], [fluentbit-> RULES:[auth_key_unix->false]], [ldap-admin-> RULES:[ldap_auth->false]], [ldap-all-> RULES:[ldap_auth->false]], [localhost-> RULES:[hosts->false]]  }

Ok, cool. This is what I’ve described. I’ll notify you when it’s fixed! Thanks


Hi,

I’ve got exactly the same problem. I can see that logs are provided above, so I won’t paste duplicates, but I can confirm that this problem exists with plugin version 1.19.3 and ES 6.8.7. Please also notify me if you fix it. It’s a little bit critical for my company right now :slight_smile:

Hi @Xsi, this issue is currently queued in our backlog. It will still take a few weeks.


Have you seen my workaround? If you have a local user with appropriate permissions, you can log in as that user and make some minor change to the ROR config; when you save it, that will cause a reconnect.

It could also be done from the command line, but that’s a little more involved - I’ve got a script I could share if that would be helpful.
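For reference, a command-line version could look roughly like the sketch below. It assumes the ROR admin API exposes a POST endpoint at `/_readonlyrest/admin/config` taking a `{"settings": "<yaml>"}` JSON body (verify this against the docs for your version), and it appends a timestamped comment so the posted settings always differ from the index contents and the "already up to date" short-circuit does not skip the reload.

```python
# Hedged sketch of forcing a ROR settings reload from the command line.
# ADMIN_CONFIG_PATH and the payload shape are assumptions about the ROR
# admin API; check them against the documentation for your version.
import json
import time
import urllib.request

ADMIN_CONFIG_PATH = "/_readonlyrest/admin/config"  # assumption, see lead-in

def build_reload_payload(yaml_text, now=time.time):
    """JSON body with a changing nonce comment so the settings always differ."""
    nonced = yaml_text.rstrip("\n") + f"\n# reloaded {int(now())}\n"
    return json.dumps({"settings": nonced}).encode()

def build_reload_request(base_url, yaml_text):
    """POST request to push readonlyrest.yml contents into the settings index."""
    return urllib.request.Request(
        base_url + ADMIN_CONFIG_PATH,
        data=build_reload_payload(yaml_text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

You would send the request with the credentials of a local (non-LDAP) admin user, e.g. via a Basic auth header, since LDAP auth is exactly what is broken at that point.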

Thanks for the info. We’ve moved to basic_auth for now. I will wait for the appropriate fix.

Understood - once the fix gets released, I’ll test it and let you know.

Any update on where this stands? I had to create a new cluster due to load issues on the other one, and folks are not logging into this one regularly, so I’m having to run my workaround each time I want to log in, and I’m getting pinged regularly by someone who is trying to. It seems to take plenty of steady activity to keep the LDAP connection from dropping, and it’s getting frustrating to keep working around this. I wish I had Xsi’s option of moving to basic auth.

Moved it up a position in priority!


@cmh @Xsi

I think I found a problem with LDAP request timeout handling. Please test this version:

https://readonlyrest-data.s3-eu-west-1.amazonaws.com/build/1.19.5-pre6/readonlyrest-1.19.5-pre6_es7.5.0.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA5SJIWBO54AGBERLX/20200420/eu-west-1/s3/aws4_request&X-Amz-Date=20200420T054421Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=f4f4a6dce91ab3910d1e48cbed17af7c30b7f0f569b1e8f51c594ed08c01d717

https://readonlyrest-data.s3-eu-west-1.amazonaws.com/build/1.19.5-pre6/readonlyrest-1.19.5-pre6_es6.8.7.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA5SJIWBO54AGBERLX/20200420/eu-west-1/s3/aws4_request&X-Amz-Date=20200420T054535Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=741d13204410dddfdaaa98ab2c2468479e076ccb8d230074e323b23a3f68e210


Thanks. I’ve deployed this version to the cluster and so far it looks good. I’ll test it for a longer period and report back if I notice anything suspicious.
