Frequent notifications from our Elasticsearch cluster

AlbertAtRor · April 11, 2022, 3:04pm

Lately we have been receiving this email notification from our Elasticsearch cluster:

[ph-prodr-02] [msgstoreadt-202204][0] unexpected failure while failing shard [shard id [[msgstoreadt-202204][0]],
allocation id [9Cj26sLdRp2sRuc0YW1Hhw],
primary term [1],
message [failed to perform indices:data/write/bulk[s] on replica [msgstoreadt-202204][0],
node[E3qewJBuSXKCfqIfrra-UA],
[R], s[STARTED],
a[id=9Cj26sLdRpeaRuc0YW1Hhw]],
failure [RemoteTransportException[[ph-prodr-01][10.10.10.90:9300][indices:data/write/bulk[s][r]]];
nested: IllegalStateException[[msgstoreadt-202204][0] operation primary term [1] is too old

Any idea why this happens, and what measures we could take to resolve it? Thanks for your help.

sscarduzio · April 11, 2022, 9:08pm

Wait, how is this different from Unexpected error while indexing monitoring document - #5 by AlbertAtRor?

AlbertAtRor · April 11, 2022, 9:39pm

Hi Simone, sorry, it looks like I posted the wrong error. I have just updated the post. Thanks for reviewing.

I keep getting all these different email notifications periodically, though the cluster stays up and running and all the indices look fine. I'm not sure what is going on.

sscarduzio · April 11, 2022, 9:47pm

Have you seen this?

AlbertAtRor · April 11, 2022, 10:45pm

I just found a log entry similar to the one in the link you shared, though I'm not sure it is the same issue. Our currently installed Elasticsearch version is 7.11.2, and we are about to upgrade to 7.17.0.
Here is the log:
{"type": "server", "timestamp": "2022-04-11T04:36:46,502-07:00",
"level": "WARN",
"component": "o.e.m.j.JvmGcMonitorService",
"cluster.name": "ASSide-Datastore",
"node.name": "dr-prodr-01",
"message": "[gc][1448110] overhead, spent [511ms] collecting in the last [1s]",
"cluster.uuid": "Lt5E-i_eSRCyWahWdAV6RA",
"node.id": "OkALb75JT8K3CDQQq4zGLw" }

sscarduzio · April 15, 2022, 12:52pm

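You should investigate the garbage collection side: it's possible that a node pauses for a very long time while trying to free up memory, and as a result gets shunned by the cluster. The log you posted shows 511ms out of the last 1s spent collecting, which points in that direction. Maybe the memory on your ES nodes is underprovisioned?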
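
As a starting point for that investigation, here is a minimal Python sketch, not a definitive tool. It does two things: it turns a JvmGcMonitorService "overhead" message like the one above into a fraction of time spent in GC, and it lists nodes with high heap usage from the parsed JSON body of GET _nodes/stats/jvm. The function names, the set of time units handled (ms, s, m), and the 85% threshold are assumptions made for illustration; fetching the stats from the cluster is left out.

```python
import re

# Matches the JvmGcMonitorService overhead message, e.g.
# "[gc][1448110] overhead, spent [511ms] collecting in the last [1s]"
# Assumption: durations only use the units ms, s, or m.
GC_OVERHEAD = re.compile(
    r"spent \[(?P<spent>[\d.]+)(?P<spent_unit>ms|s|m)\] collecting "
    r"in the last \[(?P<window>[\d.]+)(?P<window_unit>ms|s|m)\]"
)

UNIT_TO_MS = {"ms": 1.0, "s": 1000.0, "m": 60000.0}


def gc_overhead_fraction(message):
    """Return the share of the reporting window spent in GC, or None if no match."""
    match = GC_OVERHEAD.search(message)
    if match is None:
        return None
    spent_ms = float(match["spent"]) * UNIT_TO_MS[match["spent_unit"]]
    window_ms = float(match["window"]) * UNIT_TO_MS[match["window_unit"]]
    return spent_ms / window_ms


def nodes_over_heap_threshold(nodes_stats, threshold_percent=85):
    """List (node name, heap_used_percent) for nodes at or above the threshold.

    `nodes_stats` is the parsed JSON body of GET _nodes/stats/jvm.
    The default threshold of 85 is an arbitrary value chosen for this sketch.
    """
    hot = []
    for node in nodes_stats.get("nodes", {}).values():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used >= threshold_percent:
            hot.append((node["name"], used))
    # Highest heap usage first.
    return sorted(hot, key=lambda item: item[1], reverse=True)
```

For the message posted above, gc_overhead_fraction reports roughly half of the window spent collecting. If nodes_over_heap_threshold also keeps returning the same nodes, that would be consistent with the heap being too small for the workload.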