[SUPPORT|kbn_pro] Forbidden error after the LDAP password change

Support request

Hey, we’re using LDAP for authenticating to Kibana. We started experiencing weird behaviour after migrating our 6.x ELK stack to 7.x: after an LDAP password change, trying to open Kibana gives you a forbidden error. Clearing the cookies & cache solves the issue and redirects you properly to the login page. The issue persists after upgrading everything to 8.x:

Dec 01 13:23:32 laaskb002d1vteo kibana[1855]: [13:23:32:443] [error][plugins][ReadonlyREST][esClient] ES Authorization error: 403 Error: ES Authorization error: 403
Dec 01 13:23:32 laaskb002d1vteo kibana[1855]: at l.e (/usr/share/kibana/plugins/readonlyrestkbn/proxy/core/esClient.js:1:17932)
Dec 01 13:23:32 laaskb002d1vteo kibana[1855]: at l.e (/usr/share/kibana/plugins/readonlyrestkbn/proxy/core/esClient.js:1:5483)
Dec 01 13:23:32 laaskb002d1vteo kibana[1855]: at tryCatch (/usr/share/kibana/plugins/readonlyrestkbn/node_modules/regenerator-runtime/runtime.js:45:40)
Dec 01 13:23:32 laaskb002d1vteo kibana[1855]: at Generator.invoke [as _invoke] (/usr/share/kibana/plugins/readonlyrestkbn/node_modules/regenerator-runtime/runtime.js:274:22)
Dec 01 13:23:32 laaskb002d1vteo kibana[1855]: at Generator.prototype.<computed> [as next] (/usr/share/kibana/plugins/readonlyrestkbn/node_modules/regenerator-runtime/runtime.js:97:21)
Dec 01 13:23:32 laaskb002d1vteo kibana[1855]: at asyncGeneratorStep (/usr/share/kibana/plugins/readonlyrestkbn/node_modules/@babel/runtime/helpers/asyncToGenerator.js:3:24)
Dec 01 13:23:32 laaskb002d1vteo kibana[1855]: at _next (/usr/share/kibana/plugins/readonlyrestkbn/node_modules/@babel/runtime/helpers/asyncToGenerator.js:25:9)
Dec 01 13:23:32 laaskb002d1vteo kibana[1855]: at processTicksAndRejections (node:internal/process/task_queues:95:5)
Dec 01 13:23:32 laaskb002d1vteo kibana[1855]: [13:23:32:445] [info][plugins][ReadonlyREST][authorizationHeadersValidation] Could not revalidate the session against ES: + WRONG_CREDENTIALS

RoR LDAP config:

  ldaps:
    - name: adform
      ssl_trust_all_certs: true
      bind_dn: ...
      bind_password: ...
      search_user_base_DN: ...
      search_groups_base_DN: ...
      user_id_attribute: ...
      unique_member_attribute: '...'
      group_search_filter: (objectClass=group)
      connection_pool_size: 10
      connection_timeout_in_sec: 15
      request_timeout_in_sec: 15
      cache_ttl_in_sec: 60
      servers:
        - "ldaps://....com:636"

Is this expected?

ROR Version: 1.54.0

Kibana Version: 8.10.4

Elasticsearch Version: 8.10.4


{"customer_id": "a2d8a38b-1070-4845-aa8e-6f38fb585857", "subscription_id": "c6f3569d-3d8e-46ce-ac53-92f19301b69e"}

Could you please set the ROR Kibana log level to trace, reproduce the issue, and show us the Kibana logs?

Hi,
It seems we are running into the same (or at least a very similar-sounding) issue.

Was anything done on this?
Can we hijack this topic, or do you want me to create a new one?

Not really. We are waiting for some hints regarding this issue. Could you please describe your case?

Steps to reproduce:
Have user A, LDAP-authenticated, password 123.
User A is logged in on Kibana, just doing his thing.
User A goes to LDAP and changes his password to 456.
User A tries to do something in Kibana.
Stuff breaks :slight_smile:

What we see happening, also looking at behaviour discussed in:

The user gets a white screen.
But if we look at the logs of Elasticsearch, we see strange behaviour.

It tries to make a whole bunch of login attempts.
Each of those triggers an LDAP bind.
Look at the timing: all within the same millisecond.
All those connections trigger the circuit breaker for LDAP at some point.
And because the default connection pool size is 30, the LDAP connector tries 30 binds with the old password within a few milliseconds:

It seems that, due to the way LDAP session handling is set up, it uses the old credentials for this (user A hasn’t entered any new credentials yet).
This immediately locks the account on the LDAP side.
It goes on for a couple of thousand bind attempts, all hitting the circuit breaker or failing because the account is now locked on the LDAP side.

We applied a workaround by enabling the LDAP cache and restricting the connection pool size to 1.
This makes for more predictable and manageable behaviour.
Now the user gets a white screen and should log off and log on again.

But having a connection pool size of 1 is of course not ideal :slight_smile:

Some version information:
ROR 1.59.0
ES/Kibana: 8.15.1
For completeness, but I don’t think it matters, we are on ROR Enterprise

Piece of config for LDAP:
Before:

  ldaps:
    - name: xxx
      hosts: 
        - "ldaps://xxx:636"
        - "ldaps://xxx:636"
      ha: "ROUND_ROBIN"
      ssl_trust_all_certs: true
      ignore_ldap_connectivity_problems: false
      bind_dn: "uid=xxx"
      bind_password: "xxxx"
      search_user_base_DN: "xxx"
      user_id_attribute: "uid"
      connection_timeout_in_sec: 20
      request_timeout_in_sec: 15

Adding a cache TTL and a connection pool limit.
After:

  ldaps:
    - name: xxx
      hosts: 
        - "ldaps://xxx:636"
        - "ldaps://xxx:636"
      ha: "ROUND_ROBIN"
      ssl_trust_all_certs: true
      ignore_ldap_connectivity_problems: false
      bind_dn: "uid=xxx"
      bind_password: "xxxx"
      search_user_base_DN: "xxx"
      user_id_attribute: "uid"
      connection_timeout_in_sec: 20
      request_timeout_in_sec: 15
      cache_ttl_in_sec: 60
      connection_pool_size: 1

Ideally you can set something up in a test environment on your side and check with the default connection pool and no cache.
It would be great if you could recreate the issue.

The above logs were made with the Elasticsearch log level set to debug.
I needed to screenshot them and take sections out, simply due to the enormous volume of login entries (one request really triggers thousands of log lines regarding logins and LDAP).

I hope the above story makes some sense and gives some guidance on what is happening.

Let me know if you need any specific details.

Could you please describe what exactly happens in this step?

Stuff breaks :slight_smile:

I mean what you see.


The many bind requests with the old password are expected behaviour. If you are logged in as user1:test1 and someone (e.g. the LDAP admin) changes your password to test2, ROR doesn’t know that the password has changed and still uses the old one (this is obvious, obviously ;)).

The interesting part is what happens behind the scenes when the first request from Kibana to ES comes after the password is changed on the LDAP side:

ROR goes through the ACL and tries to match some block. The previously matched block won't be matched this time because the LDAP auth request won't be authorized. How many requests to LDAP will ROR make? It depends on your ACL. E.g. you can observe one request to LDAP per block (assuming that each block has an ldap rule).
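
For illustration, a minimal ACL sketch (block names and groups are made up) where every block uses the same LDAP connector; with no cache, a single unauthorized request can trigger one bind per evaluated block:

  access_control_rules:
    - name: "Admins"              # hypothetical block
      ldap_auth:
        name: "ldap1"
        groups: ["admins"]

    - name: "Developers"          # hypothetical block
      ldap_auth:
        name: "ldap1"
        groups: ["devs"]

    - name: "Readers"             # hypothetical block
      ldap_auth:
        name: "ldap1"
        groups: ["readers"]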

That’s why we advise configuring caching at the LDAP connector level (see the caching details docs).

E.g. when you configure "cache_ttl_in_sec: 60" (at the LDAP connector level), the same scenario should look a little bit different:

ROR goes through the ACL and tries to match some block. The previously matched block won't be matched this time because the LDAP auth request won't be authorized. How many requests to LDAP will ROR make? One! Because there is a cache. ROR expects LDAP to return error code 49 (invalid credentials). This response is cached. The cache will be invalidated after 60 seconds, and then you should observe another LDAP request (if Kibana is still calling ES with the same credentials).

So, let’s have this cache configured.
IMO connection_pool_size can be increased (as you noticed, 1 is not a good value, and I don’t think we need to restrict it anyway).
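
As a rough sketch of what I mean (hosts and other values are placeholders; keep the rest of your connector definition as it is):

  ldaps:
    - name: xxx
      hosts:
        - "ldaps://xxx:636"
      cache_ttl_in_sec: 60       # cache LDAP responses (including the invalid-credentials error) for 60s
      connection_pool_size: 30   # back to a default-sized pool instead of 1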

Ok, what should happen on the Kibana side? ROR ES will send Kibana a 403 response and the user should experience a logout. The user should then be able to enter new credentials and log in.

This is what is expected. I understand it doesn’t work like this (that’s why I asked at the beginning about details of what users see). I would like you to confirm that you experience the same problem when you have the cache configured.

If there is some bug somewhere, I guess that it’s in ROR Kibana and the logout handling (and session cleaning). I will try to reproduce it on my side, but it’d help me a lot if you could confirm that the missing cache configuration is not the only problem and the issue is still there.

Thanks in advance

Apologies for the late reply; I didn’t get an email that there was a reply, so I only just checked and saw your answer.

Again a long story, so I will put a summary first:
I am OK with the current behaviour.
If you have many rules using an LDAP connector, use the cache.
If you have the cache, you get an error after changing your LDAP password (exact steps are in the longer story).
You can log out and log in again and everything works.
I made recommendations to either change the default cache TTL setting or update the documentation to explain this behaviour more clearly on the main page.

Longer story:

Stuff breaks:
What we saw without the cache and with the default connection limit is that ROR tries to go to LDAP a whole lot of times, thereby locking the account.
The user would get a white screen.
The user would then clear the session and cookies (I don’t think we tried logging out).
If the user then tried to log in again, he would fail, because his account is locked.

I just tried what you suggested, so now I have:
the default connection pool limit.
a cache TTL of 60 seconds.

This (as you expected) indeed works.
In the same scenario the flow now works and there is indeed only one LDAP bind, which fails.
The user then gets an error page in Kibana.
The user should then log out and log in with his new password.

With regards to the auto logout, I don’t think that is the way it currently works?
Also if we look at the discussion here:

The user gets thrown an error and should log out himself.
The error we got looked like this:

This is:
User A is logged in, in Discover.
User A changes his password.
Wait for the TTL to expire (60 seconds).
Try to browse to the dashboard overview.
Get an error.
Now the user can log out and then log in again with the new password.

For me that is OK enough; we can sort this out with an instruction.
May I recommend one of two things:
Option A: Change the default cache TTL from 0 to 60 seconds,
as the default behaviour seems to be disruptive and unnecessary (from my point of view; there might be valid use cases).
Option B: More clearly explain the importance of caching in the LDAP section of the documentation.
Right now it is only mentioned in the "addendum", like this:

ReadonlyREST doesn’t enable caching by default. We leave that decision up to you. But in most cases, caching should be enabled to reduce external service calls. It’s very important when you look at it from the perspective of how the ROR’s ACL handles the request. And the rule of thumb is that caching at the service client definition level would be sufficient.

But on the main page it isn’t clear that this behaviour exists; there is only this line:

Too many calls made by ROR to our LDAP service can sometimes be problematic (eg. when one LDAP connector is used in many rules)

But that doesn’t explain that by default an LDAP bind is attempted for every rule. We could rewrite that sentence like:

By default ROR doesn’t cache LDAP responses. This can cause problems when an LDAP connector is used in many rules, because by default ROR does an LDAP bind for every rule. To prevent excessive calls to external services, we recommend enabling the cache.

And then in the addendum we could add, near here:

When someone calls GET /my_index/_search, ReadonlyREST handles it by checking block by block if a block is allowed. Processing stops when the first block matches, or when no block matches. In the worst case, all blocks can be checked. As we can see, each block has the `ldap_auth` rule. This means that each block generates a call to LDAP which can be described as `Check using LDAP connector "ldap1" if user XYZ exists and can be authenticated`. It's worth noting that every call looks exactly the same.

A section that explains account locking when a password is changed.
Something like:

A combination of a disabled cache and many rules using an LDAP connector can cause problems if a user changes his LDAP password while being logged in to Kibana. If the user tries to continue working in Kibana, ROR will attempt a bind for every rule that uses the LDAP connector. This is done with the old credentials and could cause the LDAP account to be locked (if account locking is enabled on the LDAP server).

Thanks for reading.
Like I said, I am ok with the current behaviour.
If incorrect LDAP credentials can somehow be handled more elegantly in Kibana, as in: the user is logged out or gets a clear warning that his credentials are invalid, that would be even better. But from the other topic and my understanding of ROR, this isn’t always ideal/possible (you don’t want to be logged out just because you try to access something you are not authorized for, for example).

If you want to dig further into the error/handling of Kibana and need any logs or input from my side please let me know. But for me it is ok enough as it currently is.

Thank you for the many ideas on how to improve our documentation. We really appreciate it. I will add a task to our backlog to take your suggestions into consideration and change the places you pointed out!

The default cache setting in the LDAP connector section has been the subject of discussion many times among our team members. We do not agree on whether this is a technical setting or rather one related to business cases, at least in the current implementation, where the cache is not distributed among ES nodes. But we have a Jira ticket related to the cache in our backlog, so we will come back to it for sure. At the moment, docs improvements look like something we can (and probably should) do.

If incorrect LDAP credentials can somehow be handled more elegantly in Kibana, as in: the user is logged out or gets a clear warning that his credentials are invalid, that would be even better. But from the other topic and my understanding of ROR, this isn’t always ideal/possible

Yes, exactly. Some ES calls that return 403 will end up with a logout, but AFAIS most of them will be handled the way you saw. ATM we are not able to distinguish between the case when the external service credentials change and the case when the user simply doesn’t have permission to use the service. I’m not sure if we can do much about it.

Thanks again for your contribution on this topic!