Frequent notification from our Elasticsearch cluster

Lately we have been receiving this email notification from our Elasticsearch cluster:

[ph-prodr-02] [msgstoreadt-202204][0] unexpected failure while failing shard [shard id [[msgstoreadt-202204][0]],
allocation id [9Cj26sLdRp2sRuc0YW1Hhw],
primary term [1],
message [failed to perform indices:data/write/bulk[s] on replica [msgstoreadt-202204][0],
node[E3qewJBuSXKCfqIfrra-UA],
[R], s[STARTED],
a[id=9Cj26sLdRpeaRuc0YW1Hhw]],
failure [RemoteTransportException[[ph-prodr-01][10.10.10.90:9300][indices:data/write/bulk[s][r]]];
nested: IllegalStateException[[msgstoreadt-202204][0] operation primary term [1] is too old

Does anyone have an idea why this is happening and what measures we could take to resolve it? Thanks for your help.

Wait, how is this different from Unexpected error while indexing monitoring document - #5 by AlbertAtRor?

Hi Simone, sorry, it looks like I posted the wrong error. I have just updated it. Thanks for reviewing.

I keep getting these different email notifications periodically, though, but the cluster stays up and running and all the indices look fine. I'm not sure what is going on.

Have you seen this?

I just found a log similar to the one in the link you shared. I'm not sure. Our currently installed Elasticsearch version is 7.11.2, and we are about to upgrade to 7.17.0.
Here is the log:

{"type": "server", "timestamp": "2022-04-11T04:36:46,502-07:00",
"level": "WARN",
"component": "o.e.m.j.JvmGcMonitorService",
"cluster.name": "ASSide-Datastore",
"node.name": "dr-prodr-01",
"message": "[gc][1448110] overhead, spent [511ms] collecting in the last [1s]",
"cluster.uuid": "Lt5E-i_eSRCyWahWdAV6RA",
"node.id": "OkALb75JT8K3CDQQq4zGLw" }

You should investigate the garbage collection side: it's possible that a node is pausing for very long periods while trying to free up memory, and as a result it gets dropped by the cluster. Maybe your ES nodes' memory is underprovisioned?
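As a quick first check, here is a minimal sketch that polls per-node JVM heap usage through the standard _nodes/stats/jvm API, so you can see whether any node is running close to its heap limit. The URL is an assumption (an unsecured cluster on localhost:9200); adjust it and add authentication if your cluster is secured.

import requests

# Assumed cluster address; replace with your own host and credentials if needed.
ES_URL = "http://localhost:9200"

# _nodes/stats/jvm reports JVM memory statistics for every node in the cluster.
resp = requests.get(f"{ES_URL}/_nodes/stats/jvm", timeout=10)
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    mem = node["jvm"]["mem"]
    print(f'{node["name"]}: heap {mem["heap_used_percent"]}% '
          f'({mem["heap_used_in_bytes"] // 1024**2} MB of '
          f'{mem["heap_max_in_bytes"] // 1024**2} MB)')

If a node consistently sits near the top of its heap, the usual next steps would be increasing the JVM heap (the -Xms/-Xmx settings in jvm.options), adding memory, or adding nodes.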