Writing a log to a remote cluster

Hi all,

**ROR Version**: Enterprise 1.66.1_es8.18.3

**Kibana Version**: 8.18.3

**Elasticsearch Version**: 8.18.3

Steps to reproduce the issue:
Cluster: 3 nodes, each with roles dmr (data, master, remote_cluster_client).
All 3 nodes are running Kibana.

ROR config:

readonlyrest:
    prompt_for_basic_auth: false
    response_if_req_forbidden: Wrong password or try clearing your browser cache
    audit:
      enabled: true
      outputs: 
      - type: index
        cluster: ["http://1.1.1.1:9100", "http://2.2.2.2:9100", "http://3.3.3.3:9100"]
        index_template: "'xcs-readonlyrest'-yyyy-MM-dd"
        serializer: tech.beshu.ror.requestcontext.QueryAuditLogSerializer
      - type: index # local cluster index
        index_template: "'.readonlyrest-audit'-yyyy-MM-dd"
        serializer: tech.beshu.ror.requestcontext.QueryAuditLogSerializer

Expected result:

Kibana works stably, the heap does not overflow.

Actual result:
I now see this error:

[2025-11-10T13:06:13,037][INFO ][t.b.r.a.l.AccessControlListLoggingDecorator] [hostname]ALLOWED by {I removed the list of rules}
[2025-11-10T13:06:13,042][ERROR][t.b.r.e.s.RestClientAuditSinkService] [hostname]Cannot submit audit event [index: xcs-readonlyrest-2025-11-10, doc: dec3c93a-eadd-44bb-96f0-1677553b643a-697320470#104187123]
org.elasticsearch.client.ResponseException: method [PUT], host [http://1.1.1.1:9100], URI [/xcs-readonlyrest-2025-11-10/_doc/dec3c93a-eadd-44bb-96f0-1677553b643a-697320470?op_type=create#104187123], status line [HTTP/1.1 409 Conflict]
{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[dec3c93a-eadd-44bb-96f0-1677553b643a-697320470]: version conflict, document already exists (current version [1])","index_uuid":"KjjUXeQVQfK3tbP8Y2Uqhw","shard":"0","index":"xcs-readonlyrest-2025-11-10"}],"type":"version_conflict_engine_exception","reason":"[dec3c93a-eadd-44bb-96f0-1677553b643a-697320470]: version conflict, document already exists (current version [1])","index_uuid":"KjjUXeQVQfK3tbP8Y2Uqhw","shard":"0","index":"xcs-readonlyrest-2025-11-10"},"status":409}
	at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:351) ~[?:?]
	at org.elasticsearch.client.RestClient.access$1900(RestClient.java:109) ~[?:?]
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:401) ~[?:?]
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:397) ~[?:?]
	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122) ~[?:?]
	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:182) ~[?:?]
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:448) ~[?:?]
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:338) ~[?:?]
	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265) ~[?:?]
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:87) ~[?:?]
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:40) ~[?:?]
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) ~[?:?]
	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) ~[?:?]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) ~[?:?]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) ~[?:?]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) ~[?:?]
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[?:?]
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
[2025-11-10T13:06:13,042][ERROR][t.b.r.e.s.RestClientAuditSinkService] [hostname]Cannot submit audit event [index: xcs-readonlyrest-2025-11-10, doc: dec3c93a-eadd-44bb-96f0-1677553b643a-697320470#104187123]
org.elasticsearch.client.ResponseException: method [PUT], host [http://1.1.1.1:9100], URI [/xcs-readonlyrest-2025-11-10/_doc/dec3c93a-eadd-44bb-96f0-1677553b643a-697320470?op_type=create#104187123], status line [HTTP/1.1 409 Conflict]
{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[dec3c93a-eadd-44bb-96f0-1677553b643a-697320470]: version conflict, document already exists (current version [1])","index_uuid":"KjjUXeQVQfK3tbP8Y2Uqhw","shard":"0","index":"xcs-readonlyrest-2025-11-10"}],"type":"version_conflict_engine_exception","reason":"[dec3c93a-eadd-44bb-96f0-1677553b643a-697320470]: version conflict, document already exists (current version [1])","index_uuid":"KjjUXeQVQfK3tbP8Y2Uqhw","shard":"0","index":"xcs-readonlyrest-2025-11-10"},"status":409}
	at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:351) ~[?:?]
	at org.elasticsearch.client.RestClient.access$1900(RestClient.java:109) ~[?:?]
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:401) ~[?:?]
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:397) ~[?:?]
	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122) ~[?:?]
	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:182) ~[?:?]
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:448) ~[?:?]
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:338) ~[?:?]
	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265) ~[?:?]
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:87) ~[?:?]
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:40) ~[?:?]
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) ~[?:?]
	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) ~[?:?]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) ~[?:?]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) ~[?:?]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) ~[?:?]
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[?:?]
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]

As a result, the heap fills up and Elasticsearch needs to be restarted.
This behavior was not observed in version 7.

{"customer_id": "6c4a385b-2ae8-4f02-a9cd-ef24addfb5b3", "subscription_id": "32d4073f-dc2f-4056-a868-842727c637cd"}

Hi @driveirk ,
There should be no version conflict. Audit document IDs are generated based on UUIDs, so conflicts with previously saved documents are highly unlikely. It seems the conflict occurs when the same document is being saved to multiple nodes. Please verify that the URLs specified in the configuration (cluster field) point to distinct data nodes within the cluster.
We will attempt to reproduce the issue and address the memory usage caused by the conflict during document saving – this should not cause a heap overflow.
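For reference, this is standard Elasticsearch behaviour for op_type=create, which the audit sink uses (visible in the logged URI): writing the same document ID twice yields exactly this 409. A minimal sketch, with host, index name and document ID as placeholders:

# First write with op_type=create succeeds (201 Created).
curl -X PUT "http://localhost:9200/audit-test/_doc/demo-id?op_type=create" \
  -H 'Content-Type: application/json' -d '{"msg":"first write"}'
# Second write of the same ID returns 409 version_conflict_engine_exception,
# matching the "document already exists" error in the log above.
curl -X PUT "http://localhost:9200/audit-test/_doc/demo-id?op_type=create" \
  -H 'Content-Type: application/json' -d '{"msg":"second write"}'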

If I use

cluster: ["http://1.1.1.1:9100"]

There is no problem

If I use

cluster: ["http://1.1.1.1:9100", "http://2.2.2.2:9100"]

The problem has been reproduced.

The remote cluster, output of GET host/_cat/nodes (columns: ip, heap.percent, ram.percent, cpu, load_1m/5m/15m, node.role, master, name):
1.1.1.1  5 88 0 0.01 0.03 0.13 hisw - host1
2.2.2.2 38 99 1 0.00 0.01 0.00 hisw - host2
3.3.3.3 61 98 0 0.01 0.09 0.20 hisw - host3

I think the node is trying to send the document to all listed nodes.

Would it be possible to also write these errors to an index?

It's inconvenient to go node by node and search log files to see where a request was made and what happened.

Perhaps some kind of request identifier could be implemented?

That way we could see that a specific request was made and received an error.

Or forward the session from Kibana, so we can filter by session and tell whether a request was made from Kibana or via the API.

That would let us trace user actions.

I think the node is trying to send the document to all listed nodes.

And that’s the cause of the error mentioned above. When we added the cluster option for audit, we actually intended it for sending audit events to different remote clusters. Each address should therefore point to a separate remote cluster.
However, your use case showed us that this naming might be misleading, and we can definitely improve how failovers are handled when saving audit events.

For your case, please specify only one node address.
In the meantime, we’ll improve the documentation and consider adding a proper failover strategy for scenarios where multiple nodes from the same Elasticsearch cluster are provided.
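For illustration, using only the fields from your config above, the intended shape is one output per remote cluster, each with a single node address (the host names below are placeholders):

      outputs:
      - type: index                                 # audit events for remote cluster A
        cluster: ["http://cluster-a-node:9200"]     # single node address, cluster A
        index_template: "'xcs-readonlyrest'-yyyy-MM-dd"
        serializer: tech.beshu.ror.requestcontext.QueryAuditLogSerializer
      - type: index                                 # audit events for remote cluster B
        cluster: ["http://cluster-b-node:9200"]     # single node address, cluster B
        index_template: "'xcs-readonlyrest'-yyyy-MM-dd"
        serializer: tech.beshu.ror.requestcontext.QueryAuditLogSerializer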

We have something like that. You can add an "x-ror-correlation-id" header to your request, and its value will become part of the "id" field in the ROR audit documents.

Is it something that solves your problem?
If not, why not?

How will high availability be maintained in this case?
What happens if one node fails? Will logs be lost?

P.S.
If I want to send logs to different clusters, I’ll just make different outputs.

For now, though, this doesn't solve the issue of surfacing these errors in an index and linking them to the request log.
If you look at the error in the first message, it's impossible to determine which request caused it.

Where should the header be added? To the load balancer? To nginx? To Kibana?
How will tracking work if a user switches to another Kibana instance in the cluster?

Our client-facing setup:
haproxy (balancing across the 3 Kibanas) => nginx (local) => Kibana (a cluster of 3 nodes, each running Kibana)

For now, though, this doesn't solve the issue of surfacing these errors in an index and linking them to the request log.
If you look at the error in the first message, it's impossible to determine which request caused it.

Ok, got it now. We will add the correlation ID value to this log. Currently, it’s missing.

Where should the header be added? To the load balancer? To nginx? To Kibana?

ROR KBN automatically adds it to all requests it sends to ES.
If you use the ES API directly, you can add it yourself (for your own internal tracing); otherwise, a correlation ID will be generated when the header is missing.
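For example, when calling ES directly (host, credentials and index here are just placeholders; only the header name comes from above):

# The supplied header value becomes part of the "id" field of the matching audit documents.
curl -u user:password \
  -H "x-ror-correlation-id: my-trace-0001" \
  "http://localhost:9200/my-index/_search?q=*"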

@Mateusz Do you have any updates?

Not yet, but we've started working on the improvement.

Is it a blocker for you?
Can you not work around it for now like this:

cluster: ["http://1.1.1.1:9100"]

leaving only one cluster node?
Another workaround would be a proxy doing round robin across these nodes.

Now I’d like to know your plans for this. Will it be possible to specify multiple cluster nodes or not?

In its current implementation, this feature is useless in terms of delivery reliability.
It would be easier to install Filebeat alongside it and have it ship the logs to the cluster.

I need this information to understand whether I have to further update my infrastructure or add Filebeat to my deployment.

The plan is as follows:

  1. We treat the current behaviour as a bug. When you declare cluster: ["http://1.1.1.1:9100", "http://2.2.2.2:9100", "http://3.3.3.3:9100"], it should pick a node in a round-robin fashion and use it to send the audit entry. Currently, it sends to all nodes (as you proved).
  2. We want to investigate the leak and fix it.
  3. We want to improve this functionality and add modes for picking the node from the list: round robin (default) and failover (pick the first and fall back to the next when it fails).
  4. We want to improve documentation because we found it misleading in this part.

The timeline for those is as follows:
1 & 2: Should be ready to test at the beginning of next week. We will send you a pre-build.
3 & 4: Will be done as part of the current sprint and released with ROR 1.68.0.

Is that OK for you?

This is great news, thank you very much.

Hi @driveirk

As promised, here is the pre-build with fixes 1 & 2: ROR 1.68.0-pre9 for ES 8.18.3.
Please test it and let us know the results.

@coutoPL Installed on a test cluster. Data collection will take 1-2 days.

Elasticsearch hasn’t crashed in two days.
On the previous version, Elasticsearch crashed every day.
I consider this fix valid.
Thank you for the work done.

Great to hear that.
Thanks for the report and the tests.