Cluster startup hangs

I am getting

{"@version":1,"source_host":"s-ror-es-1","message":"[CLUSTERWIDE SETTINGS] Cluster not ready...","thread_name":"pool-2-thread-1","@timestamp":"2018-12-18T20:53:09.762+00:00","level":"INFO","logger_name":"tech.beshu.ror.commons.settings.SettingsPoller"}

This is after I had a working cluster with ror running then

  • modified the ACLs in Kibana and got locked out
  • stopped the cluster
  • removed ror plugin
  • started the cluster (cluster became green)
  • removed the readonlyrest indices
  • installed ror plugin
  • reverted readonlyrest.yml to before kibana mods
  • started cluster

Kibana actually accepts my credentials (an ALLOW appears in the log), but then of course Kibana hangs waiting for the ES cluster.

The above INFO message repeats…

If I again remove the ROR plugin, the cluster comes up clean.

Reinstalling the plugin and restarting brings back the same looping "Cluster not ready" message.

readonlyrest.yml

readonlyrest:
  access_control_rules:
  - name: "::CONSUL-SRV::"
    auth_key: elastic:elastic
  - name: "::KIBANA-SRV::"
    auth_key: kibana:kibana
  - name: "::ADMIN::"
    auth_key: admin:admin
    kibana_access: admin
  audit_collector: true
  audit_serializer: tech.beshu.ror.requestcontext.DefaultAuditLogSerializer
  prompt_for_basic_auth: false

Just got some errors in the log

{"@version":1,"source_host":"s-ror-es-1","message":"Some failures flushing the BulkProcessor: ","thread_name":"elasticsearch[s-ror-es-1][generic][T#4]","@timestamp":"2018-12-18T21:05:18.059+00:00","level":"ERROR","logger_name":"tech.beshu.ror.es.AuditSinkImpl"}
{"@version":1,"source_host":"s-ror-es-1","message":"1x: UnavailableShardsException[[readonlyrest_audit-2018-12-18][3] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[readonlyrest_audit-2018-12-18][3]] containing [index {[readonlyrest_audit-2018-12-18][ror_audit_evt][506164944-1130557732#10], source[{\"error_message\":null,\"headers\":[\"Authorization\",\"Connection\",\"Content-Length\",\"Host\"],\"acl_history\":\"[::CONSUL-SRV::->[auth_key->false]], [::KIBANA-SRV::->[auth_key->true]]\",\"origin\":\"10.11.136.187\",\"match\":true,\"final_state\":\"ALLOWED\",\"destination\":\"10.11.136.187\",\"task_id\":10,\"type\":\"MainRequest\",\"req_method\":\"HEAD\",\"path\":\"/\",\"indices\":[],\"@timestamp\":\"2018-12-18T21:03:46Z\",\"content_len_kb\":0,\"error_type\":null,\"processingMillis\":7,\"action\":\"cluster:monitor/main\",\"block\":\"{ name: '::KIBANA-SRV::', policy: ALLOW, rules: [auth_key]}\",\"id\":\"506164944-1130557732#10\",\"content_len\":0,\"user\":\"kibana\"}]}]]]","thread_name":"elasticsearch[s-ror-es-1][generic][T#4]","@timestamp":"2018-12-18T21:05:18.062+00:00","level":"ERROR","logger_name":"tech.beshu.ror.es.AuditSinkImpl"}
{"@version":1,"source_host":"s-ror-es-1","message":"2x: UnavailableShardsException[[readonlyrest_audit-2018-12-18][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[readonlyrest_audit-2018-12-18][1]] containing [2] requests]]","thread_name":"elasticsearch[s-ror-es-1][generic][T#4]","@timestamp":"2018-12-18T21:05:18.063+00:00","level":"ERROR","logger_name":"tech.beshu.ror.es.AuditSinkImpl"}
{"@version":1,"source_host":"s-ror-es-1","message":"1x: UnavailableShardsException[[readonlyrest_audit-2018-12-18][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[readonlyrest_audit-2018-12-18][0]] containing [index {[readonlyrest_audit-2018-12-18][ror_audit_evt][632345688-1336153666#8], source[{\"error_message\":null,\"headers\":[\"Authorization\",\"Connection\",\"Content-Length\",\"Host\"],\"acl_history\":\"[::CONSUL-SRV::->[auth_key->false]], [::KIBANA-SRV::->[auth_key->true]]\",\"origin\":\"10.11.136.187\",\"match\":true,\"final_state\":\"ALLOWED\",\"destination\":\"10.11.136.187\",\"task_id\":8,\"type\":\"MainRequest\",\"req_method\":\"HEAD\",\"path\":\"/\",\"indices\":[],\"@timestamp\":\"2018-12-18T21:03:46Z\",\"content_len_kb\":0,\"error_type\":null,\"processingMillis\":45,\"action\":\"cluster:monitor/main\",\"block\":\"{ name: '::KIBANA-SRV::', policy: ALLOW, rules: [auth_key]}\",\"id\":\"632345688-1336153666#8\",\"content_len\":0,\"user\":\"kibana\"}]}]]]","thread_name":"elasticsearch[s-ror-es-1][generic][T#4]","@timestamp":"2018-12-18T21:05:18.063+00:00","level":"ERROR","logger_name":"tech.beshu.ror.es.AuditSinkImpl"}

FYI: When I deleted the .readonlyrest index to get back in, I also deleted all the audit indices.

So it looks like the problem is that ROR is creating the audit index with the default template of 5 primaries and 5 replicas on a 1-node cluster. This didn't happen when ROR was first installed.

On deleting the audit index, it gets recreated in a bit.
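The UnavailableShardsException above can be confirmed from the cat and allocation APIs. A quick sketch, assuming ES is reachable on localhost:9200 and the default audit index names:

```shell
# List the audit index's shards; on a 1-node cluster the unassigned
# primaries/replicas show up with state UNASSIGNED.
curl -s 'localhost:9200/_cat/shards/readonlyrest_audit-*?v'

# Ask the cluster why allocation is stuck (ES 5.0+).
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'
```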

Sure, I can restart from a fresh cluster. But I need to be able to recover from a bad set of ACLs in case this happens in a production cluster.


One other tidbit: while the cluster was in this state, all HTTP requests to the cluster were very slow, so ROR must have been spinning heavily.

I went through the reset process again, but this time I disabled the audit and the cluster is green.

I’ll hold off on the audit until I get some direction…
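For reference, disabling the audit collector is a one-flag change to the readonlyrest.yml shown earlier:

```yaml
readonlyrest:
  # ... access_control_rules unchanged ...
  audit_collector: false
```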

Interesting finding: after getting back to my hello-world rules, I applied via readonlyrest.yml (not Kibana) the same rules that had locked me out when applied via Kibana, and they worked as expected. This makes me wonder whether I pasted some non-ASCII from IDEA. It didn't look like it when I queried the contents of .readonlyrest, but I wasn't looking for that.


Another note: I suppose after I learn the rules better and don't make stupid mistakes, changing the values via the app will be faster (for ad hoc testing, that is). But for now it's far faster to just run the Ansible playbook to update the rules and bounce the cluster. At least then I can revert back very quickly without the extra restart/delete-index step.

Change the default #replicas in ES, or use templates. Those numbers come by default with ES.
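A sketch of that suggestion, assuming ES 6.x, the default audit index naming (readonlyrest_audit-*), and a template name of my own choosing (ror_audit): a legacy index template that gives the audit indices a single primary and no replicas, so they can go green on a 1-node cluster.

```shell
# Hypothetical template name "ror_audit"; index_patterns is the
# ES 6.x field name (older versions used "template" instead).
curl -XPUT 'localhost:9200/_template/ror_audit' \
  -H 'Content-Type: application/json' -d '
{
  "index_patterns": ["readonlyrest_audit-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'
```

The template only affects indices created after it is installed, so the current audit index would still need to be deleted (or have its replica count lowered via the index settings API) before the cluster can go green.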

Yes, I will do that. What confuses me is why it worked initially but then later failed after I deleted all the *readonlyrest* indices. I take it that you don't define a template for the ROR indices? I did not see one…

Yeah this is going to come with the advanced audit logging package we plan to create in Q2 '19