Cluster startup hangs


(Barry Kaplan) #1

I am getting:

{"@version":1,"source_host":"s-ror-es-1","message":"[CLUSTERWIDE SETTINGS] Cluster not ready...","thread_name":"pool-2-thread-1","@timestamp":"2018-12-18T20:53:09.762+00:00","level":"INFO","logger_name":"tech.beshu.ror.commons.settings.SettingsPoller"}

This is after I had a working cluster with ROR running, and then:

  • modified the ACLs in Kibana and got locked out
  • stopped the cluster
  • removed ror plugin
  • started the cluster (cluster became green)
  • removed the readonlyrest indices
  • installed ror plugin
  • reverted readonlyrest.yml to the version from before the Kibana mods
  • started cluster

Kibana actually accepts my creds, an ALLOWED entry appears in the log, but then of course Kibana hangs waiting for the ES cluster.

The above INFO message repeats…
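
A quick way to check whether shard allocation is what is blocking things (host/port are assumed to be the local node; the creds are the ones from the readonlyrest.yml further down):

# overall cluster health (assumed: node listening on localhost:9200)
curl -u elastic:elastic 'http://localhost:9200/_cluster/health?pretty'
# list every shard plus the reason any of them are unassigned
curl -u elastic:elastic 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'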


(Barry Kaplan) #2

If I remove the ROR plugin again, the cluster comes up clean.


(Barry Kaplan) #3

Reinstalled the plugin, restarted: same looping “Cluster not ready” message.

readonlyrest.yml

readonlyrest:
  access_control_rules:
  - name: "::CONSUL-SRV::"
    auth_key: elastic:elastic
  - name: "::KIBANA-SRV::"
    auth_key: kibana:kibana
  - name: "::ADMIN::"
    auth_key: admin:admin
    kibana_access: admin
  audit_collector: true
  audit_serializer: tech.beshu.ror.requestcontext.DefaultAuditLogSerializer
  prompt_for_basic_auth: false

(Barry Kaplan) #4

Just got some errors in the log:

{"@version":1,"source_host":"s-ror-es-1","message":"Some failures flushing the BulkProcessor: ","thread_name":"elasticsearch[s-ror-es-1][generic][T#4]","@timestamp":"2018-12-18T21:05:18.059+00:00","level":"ERROR","logger_name":"tech.beshu.ror.es.AuditSinkImpl"}
{"@version":1,"source_host":"s-ror-es-1","message":"1x: UnavailableShardsException[[readonlyrest_audit-2018-12-18][3] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[readonlyrest_audit-2018-12-18][3]] containing [index {[readonlyrest_audit-2018-12-18][ror_audit_evt][506164944-1130557732#10], source[{\"error_message\":null,\"headers\":[\"Authorization\",\"Connection\",\"Content-Length\",\"Host\"],\"acl_history\":\"[::CONSUL-SRV::->[auth_key->false]], [::KIBANA-SRV::->[auth_key->true]]\",\"origin\":\"10.11.136.187\",\"match\":true,\"final_state\":\"ALLOWED\",\"destination\":\"10.11.136.187\",\"task_id\":10,\"type\":\"MainRequest\",\"req_method\":\"HEAD\",\"path\":\"/\",\"indices\":[],\"@timestamp\":\"2018-12-18T21:03:46Z\",\"content_len_kb\":0,\"error_type\":null,\"processingMillis\":7,\"action\":\"cluster:monitor/main\",\"block\":\"{ name: '::KIBANA-SRV::', policy: ALLOW, rules: [auth_key]}\",\"id\":\"506164944-1130557732#10\",\"content_len\":0,\"user\":\"kibana\"}]}]]]","thread_name":"elasticsearch[s-ror-es-1][generic][T#4]","@timestamp":"2018-12-18T21:05:18.062+00:00","level":"ERROR","logger_name":"tech.beshu.ror.es.AuditSinkImpl"}
{"@version":1,"source_host":"s-ror-es-1","message":"2x: UnavailableShardsException[[readonlyrest_audit-2018-12-18][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[readonlyrest_audit-2018-12-18][1]] containing [2] requests]]","thread_name":"elasticsearch[s-ror-es-1][generic][T#4]","@timestamp":"2018-12-18T21:05:18.063+00:00","level":"ERROR","logger_name":"tech.beshu.ror.es.AuditSinkImpl"}
{"@version":1,"source_host":"s-ror-es-1","message":"1x: UnavailableShardsException[[readonlyrest_audit-2018-12-18][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[readonlyrest_audit-2018-12-18][0]] containing [index {[readonlyrest_audit-2018-12-18][ror_audit_evt][632345688-1336153666#8], source[{\"error_message\":null,\"headers\":[\"Authorization\",\"Connection\",\"Content-Length\",\"Host\"],\"acl_history\":\"[::CONSUL-SRV::->[auth_key->false]], [::KIBANA-SRV::->[auth_key->true]]\",\"origin\":\"10.11.136.187\",\"match\":true,\"final_state\":\"ALLOWED\",\"destination\":\"10.11.136.187\",\"task_id\":8,\"type\":\"MainRequest\",\"req_method\":\"HEAD\",\"path\":\"/\",\"indices\":[],\"@timestamp\":\"2018-12-18T21:03:46Z\",\"content_len_kb\":0,\"error_type\":null,\"processingMillis\":45,\"action\":\"cluster:monitor/main\",\"block\":\"{ name: '::KIBANA-SRV::', policy: ALLOW, rules: [auth_key]}\",\"id\":\"632345688-1336153666#8\",\"content_len\":0,\"user\":\"kibana\"}]}]]]","thread_name":"elasticsearch[s-ror-es-1][generic][T#4]","@timestamp":"2018-12-18T21:05:18.063+00:00","level":"ERROR","logger_name":"tech.beshu.ror.es.AuditSinkImpl"}```

(Barry Kaplan) #5

FYI: When I deleted the .readonlyrest index to get back in, I also deleted all the audit indices.
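
(The deletes were plain index deletions, something along these lines; host/port are assumed, and since ROR was uninstalled at that point no auth was needed:)

# assumed: node listening on localhost:9200, ROR plugin removed
curl -XDELETE 'http://localhost:9200/.readonlyrest'
curl -XDELETE 'http://localhost:9200/readonlyrest_audit-*'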


(Barry Kaplan) #6

So it looks like the problem is that ROR is creating the audit index with the default template of 5 primaries and 5 replicas on a one-node cluster. This didn’t happen when ROR was first installed.

On deleting the audit index, it gets recreated in a bit.

Sure, I can restart from a fresh cluster. But I need to be able to recover from a bad set of ACLs in case this happens in a production cluster.
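
As a sanity check, something like this shows how the recreated audit index was sized, and a stop-gap drops replicas on the existing audit indices so a single node can hold everything (host assumed; creds from the readonlyrest.yml above):

# inspect the settings the audit index was created with
curl -u admin:admin 'http://localhost:9200/readonlyrest_audit-*/_settings?pretty'
# stop-gap: set replicas to 0 on the existing audit indices
curl -u admin:admin -XPUT 'http://localhost:9200/readonlyrest_audit-*/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'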


(Barry Kaplan) #7

One other tidbit: while the cluster was in this state, all HTTP requests to the cluster were very, very slow. So ROR must have been spinning heavily.


(Barry Kaplan) #8

Went through the reset process again, but this time I disabled the audit, and the cluster is green.

I’ll hold off on the audit until I get some direction…


(Barry Kaplan) #9

Interesting finding. After getting back to my hello-world rules, I applied the same rules via readonlyrest.yml (not Kibana) that had locked me out when applied via Kibana, and they worked as expected. This makes me wonder whether I pasted some non-ASCII characters from IDEA. It didn’t look like it when I queried the contents of .readonlyrest, but I wasn’t looking for that specifically.
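
(One way to eyeball the stored settings for stray characters, with the host assumed to be the local node; cat -v makes non-printing bytes visible:)

curl -s -u admin:admin 'http://localhost:9200/.readonlyrest/_search?pretty' | cat -v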


Another note: I suppose after I learn the rules better and don’t make stupid mistakes, changing the values via the app will be faster (for ad hoc testing, that is). But for now it’s way, way faster to just run the Ansible playbook to update the rules and bounce the cluster. At least then I can revert very quickly without the extra restart/delete-index step.


(Simone Scarduzio) #10

Change the default number of replicas in ES, or use templates. Those numbers are the ES defaults.
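
A minimal sketch of the template approach (host and creds are assumptions; index_patterns is the 6.x-style field):

# template that sizes the ROR audit indices for a small cluster
curl -u admin:admin -XPUT 'http://localhost:9200/_template/readonlyrest_audit' \
  -H 'Content-Type: application/json' \
  -d '{
    "index_patterns": ["readonlyrest_audit-*"],
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  }'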


(Barry Kaplan) #11

Yes, I will do that. What confuses me is why it worked initially but then failed later, after I deleted all the *readonlyrest* indices. I take it you don’t make any attempt to define a template for the ROR indices? I did not see one…


(Simone Scarduzio) #12

Yeah, this is going to come with the advanced audit logging package we plan to create in Q2 '19.