LDAP getting user operation failed and LDAP authenticate operation failed error messages

alonzo · July 8, 2019, 7:34am

Hi ,

we have ROR for ES version 6.1.1 .
during the last few days we see some error messages in the ES logs .
those errors are once in two - three days .
during that time error messages appears in the application’s log.
I’ve done some checks and saw nothing special like high CPU or heap usage while those errors appears.
the error messages and the configuration are attached .
readonlyrest ldap errors.zip (2.6 KB) can you please advise ?

Thanks.

sscarduzio · July 8, 2019, 8:44am

Copying the stack traces here for search engine users:


[2019-06-26T01:10:22,614][ERROR][t.b.r.a.b.d.l.i.UnboundidLdapAuthorizationService] [
ode_name] LDAP getting user operation failed.
tech.beshu.ror.acl.blocks.definitions.ldap.implementations.LdapUnexpectedResult: null
        at tech.beshu.ror.acl.blocks.definitions.ldap.implementations.BaseUnboundidLdapService.$anonfun$ldapUserBy$2(UnboundidLdapService.scala:232) ~[?:?]
        at monix.eval.internal.TaskRunLoop$.startFull(TaskRunLoop.scala:147) ~[?:?]
        at monix.eval.internal.TaskRestartCallback.syncOnSuccess(TaskRestartCallback.scala:108) ~[?:?]
        ...

--
2019-06-26T01:10:22,638][ERROR][t.b.r.a.b.Block          ] [
ode_name] allow all for GROUP_A: ldap_auth rule matching got an error null
tech.beshu.ror.acl.blocks.definitions.ldap.implementations.LdapUnexpectedResult: null
        at tech.beshu.ror.acl.blocks.definitions.ldap.implementations.BaseUnboundidLdapService.$anonfun$ldapUserBy$2(UnboundidLdapService.scala:232) ~[?:?]
        at monix.eval.internal.TaskRunLoop$.startFull(TaskRunLoop.scala:147) ~[?:?]
        at monix.eval.internal.TaskRestartCallback.syncOnSuccess(TaskRestartCallback.scala:108) ~[?:?]
        at monix.eval.internal.TaskRestartCallback$$anon$1.run(TaskRestartCallback.scala:128) ~[?:?]
        at monix.execution.internal.Trampoline.monix$execution$internal$Trampoline$$immediateLoop(Trampoline.scala:66) ~[?:?]
        at monix.execution.internal.Trampoline.startLoop(Trampoline.scala:32) ~[?:?]

This is in the works already (Jira ID RORDEV-66) more comment from @coutoPL?

alonzo · July 8, 2019, 8:57am

thank you Simone .
do you know if it’s a matter of days or weeks until you fix it ?

sscarduzio · July 8, 2019, 9:35am

Will wait on @coutoPL to confirm, but in a different post, that stack trace was symptomatic of a (temporary?) connection failure to the LDAP server. And the Jira issue is just about making that log less cryptic for the user.

We are about to release a new version, not sure if this fix is going to be part of it, but we can create a pre build for you as soon as we get it fixed.

alonzo · July 21, 2019, 10:09am

Hi Simone ,

I’ve checked with the system administrator and seems to be what you wrote .
the errors we see are while he had some issues with one of the DC servers .
though the ROR knows how to switch between DCs the developers still complains .
I tried to find some more details on google but saw nothing , can you please elaborate on those parameters :
connection_pool_size
connection_timeout_in_sec
request_timeout_in_sec
cache_ttl_in_sec

you have our configuration settings in the attachment .
do you recommend specific values for those settings ?
another question please , can we set a preferred DC and put it first on the list ?

thanks .

coutoPL · July 22, 2019, 4:20pm

Hi @alonzo,

@sscarduzio asked me to help you, but I think I don’t fully understand your problem (maybe I don’t have full context). I’ve seen the log above and in new version we have better raw ldap error logging. Could you try it? Are you able to provide logs?

According to parameters above - did you check ROR docs (readonlyrest-docs/elasticsearch.md at master · beshu-tech/readonlyrest-docs · GitHub)? Recommended settings are just defaults. If you have configured HA, hosts order doesn’t matter

alonzo · July 23, 2019, 11:50am

thanks , I’ve checked that link but could not find some data on each parameter.
can you just explain what is the difference between connection_timeout_in_sec and request_timeout_in_sec ?
regarding the default values , in one of the SR we opened the recommendation was to set a value other than 0 (which is the default) for cache_ttl_in_sec .
suppose we set cache_ttl_in_sec : 30 , does it mean that if I connect to ES cluster with my credential than in the next 30 seconds they will be stored in a cache and the request will not go through the LDAP server ?
as for your suggestion, today I installed the latest version 18.2 and we’ll continue checking .

Thank you.

coutoPL · July 23, 2019, 4:51pm

@alonzo
connection_timeout_in_sec - timeout for establishing a connection with LDAP
request_timeout_in_sec - timeout to receive data from the connection

accoding to cache_ttl_in_sec - we’re calling LDAP service and then cache the response for given amount of time (time to live for cache entry). When you make the same request once again, LDAP won’t be hit again, but the response will be returned from cache. The cache entry is invalidated after TTL expires.

sscarduzio · July 24, 2019, 8:40am

@alonzo we identified that HA LDAP hosts ordering should matter in failover mode, as users should be able to identify servers as primary, secondary, etc.
A PR is already out for this: we will preserve the same settings syntax (a simple YAML array).

This enhancement will be part of ROR 1.18.3.

alonzo · July 30, 2019, 6:10am

thank you both .
I’ll set those parameters accordingly .

alonzo · August 7, 2019, 1:18pm

hello , regarding the “cache_ttl_in_sec” parameter .
suppose I set this parameter to 60 seconds .
at 10:00:00 I ran a query against the AD .
if I do the same at 10:00:30 then I hit the cache .
will it stores the data for 60 seconds starting the first time or the last time I put my credentials ?

coutoPL · August 7, 2019, 4:38pm

@alonzo starting from the first time

alonzo · August 14, 2019, 7:24am

same issue, just to make sure ,

if the cache_ttl_in_sec is set to 60 , and a query is sent to the cluster at 7:00 , then ROR send request to AD but the AD is not responsive for 10 seconds.

when a situation like that occurs I see one error message in the cluster_name.log file exactly at 7:00 when the request arrived . the application then retry to connect to the cluster every few minutes .
but even though the AD was back after 10 seconds we see the same error message in the application’s log for 60 sec . that means that we keep in the ROR cache the result that came back from the AD for 60 sec and that’s why the application can not connect even when the AD responds again after 10 - 15 seconds ?