New elasticsearch.AuthorizationError after upgrading to ROR 1.60

We are using Elasticsearch & Kibana 8.15.3 and ROR 1.60.

We are seeing new, intermittent AuthorizationErrors on the close_indices API call.

We close the indices one by one (a rough sketch of the call pattern follows the policy below).
Here is the access policy we are currently using. Is this a known issue, or is there a way to work around it? We did not see this prior to the upgrade (we were on ROR 1.54.0 and ES 8.12.0).

  - name: "Allow group snapshot restore"
    groups: ["snapshot_restore"]
    actions:
    - cluster:admin/repository/*
    - cluster:admin/snapshot/*
    - cluster:admin/slm/*
    - cluster:manage_slm/*
    - cluster:monitor/*
    - indices:monitor/*
    - indices:admin/settings/*
    - indices:admin/open
    - indices:admin/close
    - indices:admin/get
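
For reference, the close calls look roughly like the sketch below (a minimal sketch assuming the elasticsearch-py 8.x client; the endpoint, credentials, and index names are placeholders, not our real values):

# Minimal sketch of our close-indices loop, assuming the elasticsearch-py 8.x client.
# The endpoint, credentials, and index names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://es.example.internal:9200",        # placeholder endpoint
    basic_auth=("snapshot_user", "REDACTED"),  # placeholder credentials
)

for index in ["index-001", "index-002", "index-003"]:  # placeholder index names
    # One indices:admin/close action per request; this is the call that
    # intermittently fails with an AuthorizationError.
    es.indices.close(index=index)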

What do you mean by "intermittent AuthorizationErrors"? Do you see any exception in the ES log?
Did you test ROR 1.60.0 with the old ES version you used before? Maybe you didn't add the proper action for the close indices API? Did you check that?

What do you mean by "intermittent AuthorizationErrors"?

If I close 10 indices one by one, most of them close, but one unexpectedly fails. If I try again, I can close all of them, so the failure is not consistent.

Do you see any exception in the ES log?

I don't see anything specific related to this request in the ES logs.

Did you test ROR 1.60.0 with the old ES version you used before?

Yes, we saw this issue with ES 8.12.0 + ROR 1.60 as well.

Maybe you didn't add the proper action for the close indices API? Did you check that?

Yes, I did; as you can see in my original post, the close indices action (indices:admin/close) is listed in the policy.

I tried to reproduce the problem on my side, and everything seems to work.
I used the ROR Sandbox for testing. Please take a look: [RORDEV-1321] reproduction attempt by coutoPL · Pull Request #56 · beshu-tech/ror-sandbox · GitHub

The PR contains a description of how to run the stack and how to call the ES API using curl. Do you see any differences (ROR settings, the curl call, or anything else) compared to your environment and tests?

I am seeing similar occasional AuthorizationErrors when calling msearch, with the same ES and ROR versions.

The access policy I’m using looks like this:

  - name: "Allow read access"
    groups: ["my_group"]
    indices: ["index_a*", "indexb_*", "index_c"]
    actions:
    - indices:data/read/*
  - name: "Allow write access"
    groups: ["my_group"]
    indices:  ["index_c"]
    actions:
    - indices:data/write/bulk
    - indices:data/write/index
  - name: "Allow update to certain indices"
    groups: ["my_group"]
    indices: ["indexb_*"]
    actions:
    - indices:data/write/update 

I have a workflow that calls msearch in a loop. Each run sends 5-10 queries as a group against index_a_all. The queries are identical, basic queries with a must clause containing two terms clauses; only the values passed to the terms differ between queries. 90% of the time msearch is allowed and the queries return results, but occasionally we will see errors like this:

elasticsearch.AuthenticationException: AuthenticationException(401, 'forbidden_response', 'forbidden_response: forbidden')

Logging more details in the error response shows:

{'error': {'root_cause': [{'type': 'forbidden_response', 'reason': 'forbidden', 'due_to': 'OPERATION_NOT_ALLOWED', 'header': {'WWW-Authenticate': 'Basic'}}], 'type': 'forbidden_response', 'reason': 'forbidden', 'due_to': 'OPERATION_NOT_ALLOWED', 'header': {'WWW-Authenticate': 'Basic'}}, 'status': 401}

Rerunning the same msearch query typically works, so the error is inconsistent and only happens occasionally.
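
For context, the msearch batch looks roughly like the sketch below (a minimal sketch assuming the elasticsearch-py 8.x client; the endpoint, credentials, field names, and term values are placeholders, not our real ones):

# Minimal sketch of the msearch batch, assuming the elasticsearch-py 8.x client.
# Endpoint, credentials, field names, and values are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://es.example.internal:9200",      # placeholder endpoint
    basic_auth=("my_username", "REDACTED"),  # placeholder credentials
)

def build_query(ids, tags):
    # Basic query: a must clause with two terms clauses.
    return {
        "query": {
            "bool": {
                "must": [
                    {"terms": {"doc_id": ids}},  # placeholder field
                    {"terms": {"tag": tags}},    # placeholder field
                ]
            }
        }
    }

# 5-10 (ids, tags) pairs per run; only the values differ between queries.
batches = [
    (["id-1", "id-2"], ["tag-a"]),
    (["id-3"], ["tag-b"]),
]

# msearch body: one header entry plus one query body entry per search.
searches = []
for ids, tags in batches:
    searches.append({"index": "index_a_all"})
    searches.append(build_query(ids, tags))

resp = es.msearch(searches=searches)  # indices:data/read/msearch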

The code has been updated to work around this issue by calling each query one by one as a single search with the same query body. We have not seen auth errors on the individual searches.
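
Continuing the sketch above, the workaround looks roughly like this (same placeholder names):

# Workaround sketch: run each query individually as a plain search with the
# same body, reusing es, build_query, and batches from the sketch above.
for ids, tags in batches:
    resp = es.search(index="index_a_all", query=build_query(ids, tags)["query"])
    # indices:data/read/search; we have not seen auth errors on these calls.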

OK, I need to know two more things:

  1. please show me the "users" section of your readonlyrest.yml file
  2. please find a FORBIDDEN entry in the ES logs and paste it here

Thanks

Interestingly, I only see ALLOWED messages in the logs. I see an ALLOWED entry for the msearch call for this user, and the subsequent individual search requests (which we only issue when msearch hits an auth error) are also ALLOWED. Here is an example for msearch (anonymized):

{"@timestamp":"2024-11-08T17:48:37.508Z", "log.level": "INFO", "message":"\u001B[36mALLOWED by { name: 'Allow read access', policy: ALLOW, rules: [groups_or,actions,indices] req={ ID:15fbd770-110b-4074-b423-3117b1786468-1134655352#13245089, TYP:MultiSearchRequest, CGR:my_group, USR:my_username, BRS:true, KDX:null, ACT:indices:data/read/msearch, OA:omitted/32, XFF:null, DA:omitted32, IDX:index_a1, MET:POST, PTH:/index_a1/_msearch, CNT:<OMITTED, LENGTH=2200.0 B> , HDR:Accept-Encoding=identity, Authorization=<OMITTED>, Content-Length=2200, Host=omitted:9200, accept=application/vnd.elasticsearch+json; compatible-with=8, connection=keep-alive, content-type=application/vnd.elasticsearch+x-ndjson; compatible-with=8, user-agent=elasticsearch-py/8.12.1 (Python/3.9.20; elastic-transport/8.12.0), x-elastic-client-meta=es=8.12.1,py=3.9.20,t=8.12.0,ur=1.26.18, HIS:[OMITTED],}\u001B[0m", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"scala-execution-context-global-1102","log.logger":"tech.beshu.ror.accesscontrol.logging.AccessControlLoggingDecorator","elasticsearch.cluster.uuid":"wBzfElLHTyK6VfR4SL3wvg","elasticsearch.node.id":"08aLNijJQ72_CcYbR-UFuA","elasticsearch.node.name":"data-18","elasticsearch.cluster.name":"my_cluster"}

Here is what the settings look like:

readonlyrest:
  audit_collector: false
  prompt_for_basic_auth: true
  access_control_rules:
 - name: "Allow read access"
    groups: ["my_group"]
    indices: ["index_a*", "indexb_*", "index_c"]
    actions:
    - indices:data/read/*
  - name: "Allow write access"
    groups: ["my_group"]
    indices:  ["index_c"]
    actions:
    - indices:data/write/bulk
    - indices:data/write/index
  - name: "Allow update to certain indices"
    groups: ["my_group"]
    indices: ["indexb_*"]
    actions:
    - indices:data/write/update 

  users:
  - username: my_username
    auth_key_unix: my_username:$6$rounds=656000$OMITTED
    groups: ["my_group"]

Wow, this is interesting. But it also gives me a hint. I will check it again and let you know.


We are using ECK version 2.14.0 in case that is a factor (and we saw this in version 2.13.0 as well).

We are also seeing this intermittently on the update mapping API, and our client team has reported more auth errors on search queries, so I think this upgrade has had a broader impact than just the close indices API.

We found this error in our Elasticsearch logs today:

{"@timestamp":"2024-11-15T14:13:40.961Z", "log.level": "INFO", "message":"[my_index][55] global checkpoint sync failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"scala-execution-context-global-1244","log.logger":"org.elasticsearch.indices.cluster.IndicesClusterStateService","elasticsearch.cluster.uuid":"uuid-here","elasticsearch.node.id":"node-id-here","elasticsearch.node.name":"node-name-here","elasticsearch.cluster.name":"my-cluster","error.type":"tech.beshu.ror.es.handler.response.ForbiddenResponse","error.message":"forbidden","error.stack_trace":"tech.beshu.ror.es.handler.response.ForbiddenResponse: forbidden\n\tat tech.beshu.ror.es.handler.response.ForbiddenResponse$.create(ForbiddenResponse.scala:75)\n\tat tech.beshu.ror.es.handler.RegularRequestHandler.onForbidden(RegularRequestHandler.scala:111)\n\tat tech.beshu.ror.es.handler.RegularRequestHandler.commitResult$$anonfun$1(RegularRequestHandler.scala:71)\n\tat tech.beshu.ror.es.handler.RegularRequestHandler.$anonfun$adapted$1(RegularRequestHandler.scala:83)\n\tat scala.util.Try$.apply(Try.scala:210)\n\tat tech.beshu.ror.es.handler.RegularRequestHandler.commitResult(RegularRequestHandler.scala:83)\n\tat tech.beshu.ror.es.handler.RegularRequestHandler.handle$$anonfun$1$$anonfun$1(RegularRequestHandler.scala:57)\n\tat scala.runtime.function.JProcedure1.apply(JProcedure1.java:15)\n\tat scala.runtime.function.JProcedure1.apply(JProcedure1.java:10)\n\tat tech.beshu.ror.utils.ScalaOps$AutoCloseableOps$.bracket$extension(ScalaOps.scala:199)\n\tat tech.beshu.ror.es.handler.RegularRequestHandler.handle$$anonfun$1(RegularRequestHandler.scala:57)\n\tat scala.runtime.function.JProcedure1.apply(JProcedure1.java:15)\n\tat scala.runtime.function.JProcedure1.apply(JProcedure1.java:10)\n\tat monix.eval.Task$Map.apply(Task.scala:4727)\n\tat monix.eval.Task$Map.apply(Task.scala:4726)\n\tat monix.eval.internal.TaskRunLoop$.startFull(TaskRunLoop.scala:189)\n\tat monix.eval.internal.TaskRunLoop$.restartAsync$$anonfun$1(TaskRunLoop.scala:241)\n\tat java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.compute(ForkJoinTask.java:1726)\n\tat java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.compute(ForkJoinTask.java:1717)\n\tat java.base/java.util.concurrent.ForkJoinTask$InterruptibleTask.exec(ForkJoinTask.java:1641)\n\tat java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)\n\tat java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1491)\n\tat java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:2073)\n\tat java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2035)\n\tat java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)\n"}

I'm concerned this is putting cluster stability at risk. Should we downgrade to an older version of ROR? Our cluster is on ES 8.15.3, which I believe is only supported by ROR 1.60, but I am not sure how much changed between the 8.15.x versions, or how much had to change in ROR 1.60 to support 8.15.3.

Sorry, what ES & ROR versions are you currently on?

See the original post: ES & Kibana 8.15.3 with ROR 1.60.

We are thinking of downgrading to ROR 1.54 while staying on ES 8.15.3; would this be an issue?

Sorry, I didn't realize you are both from the same company.
It's not possible to downgrade to 1.54.x and keep support for ES 8.15.
The best I can do is prepare a ROR 1.59.0 build for ES 8.15.3.

Based on this log, I can't say this is caused by ROR behaviour. The error is surfaced by ROR, but it originates in ES internals (IndicesClusterStateService). That would explain why ROR authorization fails from time to time: the failures are triggered by this error. But I'm not convinced it's ROR's fault.

To verify this, I could try to prepare ROR 1.54.0 for ES 8.15.3, but we certainly won't maintain that branch. I would need 1-2 days for that.

Yes, we are both from the same organization.

Yes, we are willing to try 1.59.0 or a 1.54 ROR version that is compatible with 8.15.3 as a troubleshooting step. Thank you.

Here it is:

ROR 1.59.0 for ES 8.15.3 (plugin)
ROR 1.59.0 for ES 8.15.3 (sha512)

Thank you, we will try this out and let you know if we see fewer auth errors.

OK, it was easier than I thought. I prepared ROR 1.54.0 with ES 8.15.3 support. Remember, it's for testing purposes only.

ROR 1.54.0 for ES 8.15.3 (experimental version)

When you deploy it:
a) if you still experience the issue, you should check your cluster, because it's probably not ROR's fault;
b) if the issue is gone, it's probably something in ROR and we will have to investigate.

Please test it and let us know the result.

Thank you for supplying these patched versions.

We did some log analysis and identified that the dd_agent (Datadog agent) user was the source of many FORBIDDEN requests. When we mitigated those by granting additional permissions, the intermittent auth errors on msearch disappeared. For that reason, we are going to hold off on trying the patched versions.

It is not clear to me why fixing the dd_agent auth errors would mitigate auth errors for other users; perhaps ES does some rate limiting or throttling on auth errors. If you have expertise in this area, we would appreciate your insights.