Losing connections to LDAP servers

Chris, can you look in the ES logs and try to find a log line with “FORBIDDEN” in it corresponding to when you had those 403 errors in Logstash?
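
For reference, something like this should surface those lines (the log path below is an assumption; adjust it for your install):

  grep -i "FORBIDDEN" /var/log/elasticsearch/*.log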

@cmh could you please test this one:

https://readonlyrest-data.s3-eu-west-1.amazonaws.com/build/1.19.5-pre8/readonlyrest-1.19.5-pre8_es7.5.0.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA5SJIWBO54AGBERLX/20200422/eu-west-1/s3/aws4_request&X-Amz-Date=20200422T152808Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=8e7f79fb6805e418866d6b164e6e0e8a7847f3bb7d055795812cdeaf2b524a47

I’ve done one more improvement related to the thread pool for LDAP, and maybe this is the issue. If not, we’ll need more info and some foothold to start from.

@coutoPL looks like an ACL issue if it’s responding 403, though. No?

There were a lot of changes between 1.18.9 and the current pre version, so right - this issue doesn’t have to be strictly correlated with the LDAP issue from the beginning of this thread.

That was the thing that made me take longer to find the issue - there was nothing in the logs to make me think it was ROR causing the 403, and since I had verified via curl with no issues, I thought that ruled it out. It wasn’t until I noticed that the logs stopped precipitously right around when I had upgraded to pre6 that I made the connection.

I’ll try the pre8 on the one cluster that mostly nobody (but one particularly vociferous dev) looks at and will report back.

Starting with the latest (pre) versions of ROR, our core is much more secure. Previously the strategy was rather to let a request pass even if the resolved indices could not be altered. Now, at the end of handling, while altering the indices, ROR can forbid a request (for security reasons).

This should not happen on a daily basis, but rather indicates some bug in ROR. So maybe that’s what we’re talking about at the moment. It’d be nice if you could provide your config (or at least the block which should match) and a curl of the request. I will try to reproduce it in a test.
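
For example, a failing request could be captured as something like this (host, credentials, and index name are placeholders, not taken from your setup):

  curl -k -u logstash:redacted -XPOST "https://localhost:9200/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary $'{"index":{"_index":"syslog-2020.04.23"}}\n{"message":"test"}\n'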

Obviously there should be some error logs telling us that something went wrong. If there aren’t any, I’ll add one after I find what is wrong in this case.

I’ve installed pre8 on the cluster where I discovered the issue and have been reloading diligently for the past 30 minutes, and so far it still seems to be logging, so it looks like whatever broke in pre6 might be fixed in pre8 - do you still want me to post configs?

I’ve already let folks know we’re running pre8 on that cluster - the others have been reverted - and won’t change any of the others until this one has behaved through at least tomorrow morning.

No luck. Carefully monitored after making the change and was hopeful. Notified folks of the upgrade and signed off. Checked this morning and discovered it stopped logging at midnight GMT which is when new indices would be created, so at least now we know what is causing the failure - new index creation.
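
One way to isolate that step (host, credentials, and index name are placeholders) would be to create the next day’s index by hand with the logstash account and see whether that also gets a 403:

  curl -k -u logstash:redacted -XPUT "https://localhost:9200/syslog-2020.04.24"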

Here’s our current ROR config on this cluster:

---
# yamllint disable rule:line-length
# THIS FILE IS PROVISIONED BY PUPPET
# However, once it gets loaded into the .readonlyrest index,
#  you might need to use an admin account to log into Kibana
#  and choose "Load default" from the "ReadonlyREST" tab.
# Alternately, you can use the "update-ror" script in ~cheerschap/bin/
readonlyrest:
  enable: true
  prompt_for_basic_auth: false
  response_if_req_forbidden: Forbidden by ReadonlyREST plugin
  ssl:
    enable: true
    keystore_file: "elasticsearch.jks"
    keystore_pass: "redacted"
    key_pass: "redacted"
  access_control_rules:
    # LOCAL: Kibana admin account
    - name: "local-admin"
      auth_key_unix: "admin:redacted"
      kibana_access: admin
    # LOCAL: Logstash servers inbound access
    - name: "local-logstash"
      auth_key_unix: "logstash:redacted"
      # Local accounts for routine access should have less verbosity
      #  to keep the amount of logfile noise down
      verbosity: error
    # LOCAL: Kibana server
    - name: "local-kibana"
      auth_key_unix: "kibana:redacted"
      verbosity: error
    # LOCAL: Puppet communication
    - name: "local-puppet"
      auth_key_unix: "puppet:redacted"
      verbosity: error
    # LOCAL: Jenkins communication
    - name: "local-jenkins"
      auth_key_unix: "jenkins:redacted"
      verbosity: error
    # LOCAL: Elastalert
    - name: "elastalert"
      auth_key_unix: "elastalert:redacted"
      verbosity: error
    # LOCAL: fluentbit
    - name: "fluentbit"
      auth_key_unix: "fluentbit:redacted"
      verbosity: error
    # LDAP: Linux admins and extra kibana-admin group
    - name: "ldap-admin"
      kibana_access: admin
      ldap_auth:
        name: "ldap1"
        groups: ["prod_admins","kibana-admin"]
      type: allow
    # LDAP for everyone else
    - name: "ldap-all"
      # possibly include: "kibana:dev_tools",
      kibana_hide_apps: ["readonlyrest_kbn", "timelion", "kibana:management", "apm", "infra:home", "infra:logs"]
      ldap_auth:
        name: "ldap1"
        groups: ["kibana-admin", "admins", "prod-admins", "devqa", "development", "ipausers"]
      type: allow
    # Allow localhost
    - name: "localhost"
      hosts: [127.0.0.1]
      verbosity: error
  # Define the LDAP connection
  ldaps:
    - name: ldap1
      hosts: ["ldap1", "ldap2"]
      ha: "FAILOVER"
      port: 636
      bind_dn: "uid=xx,cn=xx,cn=xx,dc=xx,dc=xx,dc=xx"
      bind_password: "redacted"
      ssl_enabled: true
      ssl_trust_all_certs: true
      search_user_base_DN: "cn=users,cn=accounts,dc=m0,dc=sysint,dc=local"
      search_groups_base_DN: "cn=groups,cn=accounts,dc=m0,dc=sysint,dc=local"
      user_id_attribute: "uid"
      unique_member_attribute: "member"
      connection_pool_size: 10
      connection_timeout_in_sec: 30
      request_timeout_in_sec: 30
      cache_ttl_in_sec: 60
      group_search_filter: "(objectclass=top)"
      group_name_attribute: "cn"

The logstash systems use (appropriately enough) the “logstash” user for auth and as you can see I have verbosity set to “error” because otherwise the logs get VERY noisy with the constant auth. Elasticsearch logs are hard enough to parse as it is.

Let me know what else I can provide.

Your settings look ok to me, and I’m surprised you don’t have any anomalies in the ES logs (coming from ROR or not). @coutoPL WDYT?

@cmh you know, I don’t fully understand your infrastructure, so I don’t feel that I understand what kind of failure we’re talking about. Does ROR not work at all? Is there maybe a crash? Are you able to access your cluster through ROR, e.g. using curl?

Quick overview of the infrastructure:

  • Filebeat running on the systems, mostly 1.3.1 on prod systems but migrating to 7.6.2 on newer systems. These systems send logs to logstash.
  • A pair of Logstash servers in each instance running 7.6.2. They hand off to the elasticsearch cluster using credentials and SSL configured in RoR
  • Elasticsearch cluster locked to 7.5.0 at the moment, at least 4 nodes but bigger clusters have more. Indices are grouped by app and date coded, for example syslog-2020.04.23
  • Kibana for the UI running RoR pro kibana plugin, also locked to 7.5.0

RoR 1.18.9 works as expected, but we have the LDAP timeout issue where, after LDAP auth isn’t used for a period of time, the connection gets dropped and won’t come back until the RoR config is changed, which forces it to reconnect.

Using these 1.19.5_pre versions, the LDAP auth is certainly fixed, and logging/auth seems to work until we cross midnight GMT, when new indices would be created, at which point logstash starts seeing the 403 errors mentioned above. Reverting to 1.18.9 fixes that issue. What is interesting is that while the bulk updates are getting the 403 error, I’m able to auth and run basic API commands using the logstash local account.
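
For example, a simple check like this (host and credentials are placeholders) still succeeds with that same account while the _bulk requests are being rejected:

  curl -k -u logstash:redacted "https://localhost:9200/_cat/health?v"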

Please let me know if I can provide extra information or clarification.

@cmh well explained. So I expect (I said that before, and now I’m pretty sure this is it) that this issue is not strictly connected to the previous LDAP issue and the 1.19.5-pre6 version, but to one of our changes between 1.18.9 and this one.

I suspect there is something wrong with _bulk request handling. I’ll write a proper integration test for it today and let you know later if I find something.

Thanks!

Oh, yeah, completely agreed. The 1.19.5_pre releases fix the LDAP issue, which is awesome - but not so great when you get logged in and nothing has been logged to the new indices! :smiley: Thanks for checking, I look forward to finding out what you discover. LMK if I can do anything on my end.

@cmh here is a new build for you to test:

https://readonlyrest-data.s3-eu-west-1.amazonaws.com/build/1.19.5-pre9/readonlyrest-1.19.5-pre9_es7.5.0.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA5SJIWBO54AGBERLX/20200428/eu-west-1/s3/aws4_request&X-Amz-Date=20200428T160613Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=171b8a8098ac1bedd63aa250abc656ca04768e494280c138967468e18a3a8b45

Indeed, there was an issue with _bulk request handling. It should be ok now.

Thank you!

Applied on one cluster yesterday when I saw your message. Last night I logged in at 8:15 our time (just after midnight GMT) and I saw indices had been created for today. Checking again this morning, I see the logs are still coming in. I also logged in with my LDAP account - so at this point I would say that both of the issues I’ve seen are fixed. I will roll pre9 out to one or two other clusters this morning and continue to monitor.
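
For the record, the same check can be done from the command line (host, credentials, and index pattern are placeholders):

  curl -k -u logstash:redacted "https://localhost:9200/_cat/indices/*-2020.04.*?v"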

Thanks!

Update - I’ve got 1.19.5_pre9 on several clusters and they’re working as expected. Thanks for the fix! Will hold off on putting it on prod clusters until 1.19.5 is released.

Glad to hear that, Chris. Yes, the next release is going to be quite rich in fixes and enhancements on both the ES and Kibana sides.

Speaking of the next release: I’ve never received notification of a release. I think I’m supposed to be on a mailing list, but I haven’t heard anything. I see there is an announcements category on the forums, but the most recent thing I see there is “ReadonlyREST 1.13.2 is here!” - so I guess that’s not regularly updated. Would it be possible to create a release announcements thread there that could be followed so I get updates? Or some other way that I’ll actually receive the release notifications?

Yeah, you are right. We’ve experimented a bit and settled on the most difficult way to track: updating the download page with changelogs.

I created this poll as a result of this question. Please vote!

Hi everyone, we have a two-node cluster in our test environment: ES 7.17, using the Kibana plugin. I just received an email with a similar message to that of Chris H., but our cluster is up and running just fine. Not sure why the error. Any idea on what caused it and/or what to do to avoid any potential issues? Thanks.
This is the related entry from the ES logs:
[2022-05-31T02:43:19,005][ERROR][t.b.r.a.b.d.l.i.UnboundidLdapAuthenticationService] [ol-tees-01] LDAP getting user operation failed.
[2022-05-31T02:43:19,001][ERROR][t.b.r.a.b.Block ] [ol-tees-01] KibanaAdmin: ldap_authentication rule matching got an error Task monix.execution.internal.InterceptRunnable@1d8fc822 rejected from java.util.concurrent.ThreadPoolExecutor@2b8202d0[Running, pool size = 50, active threads = 40, queued tasks = 0, completed tasks = 242893786]
java.util.concurrent.RejectedExecutionException: Task monix.execution.internal.InterceptRunnable@1d8fc822 rejected from java.util.concurrent.ThreadPoolExecutor@2b8202d0[Running, pool size = 50, active threads = 40, queued tasks = 0, completed tasks = 242893786]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2065) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:833) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1365) ~[?:?]

This is the message:

[2022-05-31T02:43:18,224][ERROR][t.b.r.a.b.d.l.i.UnboundidLdapAuthenticationService] [ol-tees-01] LDAP getting user operation failed.
[2022-05-31T02:43:18,243][ERROR][t.b.r.a.b.Block ] [ol-tees-01] KibanaAdmin: ldap_authentication rule matching got an error Task monix.execution.internal.InterceptRunnable@414921c3 rejected from java.util.concurrent.ThreadPoolExecutor@2b8202d0[Running, pool size = 50, active threads = 33, queued tasks = 0, completed tasks = 242893370]
[2022-05-31T02:43:18,243][ERROR][t.b.r.a.b.Block ] [ol-tees-01] ROUsers: ldap_authentication rule matching got an error Task monix.execution.internal.InterceptRunnable@1fb6b893 rejected from