Cluster startup hangs


(Barry Kaplan) #1

I am getting:

{"@version":1,"source_host":"s-ror-es-1","message":"[CLUSTERWIDE SETTINGS] Cluster not ready...","thread_name":"pool-2-thread-1","@timestamp":"2018-12-18T20:53:09.762+00:00","level":"INFO","logger_name":"tech.beshu.ror.commons.settings.SettingsPoller"}

This is after I had a working cluster with ROR running, and then:

  • modified the ACLs in Kibana and got locked out
  • stopped the cluster
  • removed ror plugin
  • started the cluster (cluster became green)
  • removed the readonlyrest indices
  • installed ror plugin
  • reverted readonlyrest.yml to the version from before the Kibana mods
  • started cluster

Kibana actually accepts my creds, an ALLOWED entry appears in the log, but then of course Kibana hangs waiting for the ES cluster.

The above INFO message repeats…
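
A quick way to check whether shard allocation is what is blocking things (host/port are assumed to be the local node; the creds are the ones from the readonlyrest.yml further down):

# overall cluster health (assumed: node listening on localhost:9200)
curl -u elastic:elastic 'http://localhost:9200/_cluster/health?pretty'
# list every shard plus the reason any of them are unassigned
curl -u elastic:elastic 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'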


(Barry Kaplan) #2

If I remove the ROR plugin again, the cluster comes up clean.


(Barry Kaplan) #3

Reinstalled the plugin, restarted: same looping “Cluster not ready” message.

readonlyrest.yml

readonlyrest:
  access_control_rules:
  - name: "::CONSUL-SRV::"
    auth_key: elastic:elastic
  - name: "::KIBANA-SRV::"
    auth_key: kibana:kibana
  - name: "::ADMIN::"
    auth_key: admin:admin
    kibana_access: admin
  audit_collector: true
  audit_serializer: tech.beshu.ror.requestcontext.DefaultAuditLogSerializer
  prompt_for_basic_auth: false

(Barry Kaplan) #4

Just got some errors in the log:

{"@version":1,"source_host":"s-ror-es-1","message":"Some failures flushing the BulkProcessor: ","thread_name":"elasticsearch[s-ror-es-1][generic][T#4]","@timestamp":"2018-12-18T21:05:18.059+00:00","level":"ERROR","logger_name":"tech.beshu.ror.es.AuditSinkImpl"}
{"@version":1,"source_host":"s-ror-es-1","message":"1x: UnavailableShardsException[[readonlyrest_audit-2018-12-18][3] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[readonlyrest_audit-2018-12-18][3]] containing [index {[readonlyrest_audit-2018-12-18][ror_audit_evt][506164944-1130557732#10], source[{\"error_message\":null,\"headers\":[\"Authorization\",\"Connection\",\"Content-Length\",\"Host\"],\"acl_history\":\"[::CONSUL-SRV::->[auth_key->false]], [::KIBANA-SRV::->[auth_key->true]]\",\"origin\":\"10.11.136.187\",\"match\":true,\"final_state\":\"ALLOWED\",\"destination\":\"10.11.136.187\",\"task_id\":10,\"type\":\"MainRequest\",\"req_method\":\"HEAD\",\"path\":\"/\",\"indices\":[],\"@timestamp\":\"2018-12-18T21:03:46Z\",\"content_len_kb\":0,\"error_type\":null,\"processingMillis\":7,\"action\":\"cluster:monitor/main\",\"block\":\"{ name: '::KIBANA-SRV::', policy: ALLOW, rules: [auth_key]}\",\"id\":\"506164944-1130557732#10\",\"content_len\":0,\"user\":\"kibana\"}]}]]]","thread_name":"elasticsearch[s-ror-es-1][generic][T#4]","@timestamp":"2018-12-18T21:05:18.062+00:00","level":"ERROR","logger_name":"tech.beshu.ror.es.AuditSinkImpl"}
{"@version":1,"source_host":"s-ror-es-1","message":"2x: UnavailableShardsException[[readonlyrest_audit-2018-12-18][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[readonlyrest_audit-2018-12-18][1]] containing [2] requests]]","thread_name":"elasticsearch[s-ror-es-1][generic][T#4]","@timestamp":"2018-12-18T21:05:18.063+00:00","level":"ERROR","logger_name":"tech.beshu.ror.es.AuditSinkImpl"}
{"@version":1,"source_host":"s-ror-es-1","message":"1x: UnavailableShardsException[[readonlyrest_audit-2018-12-18][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[readonlyrest_audit-2018-12-18][0]] containing [index {[readonlyrest_audit-2018-12-18][ror_audit_evt][632345688-1336153666#8], source[{\"error_message\":null,\"headers\":[\"Authorization\",\"Connection\",\"Content-Length\",\"Host\"],\"acl_history\":\"[::CONSUL-SRV::->[auth_key->false]], [::KIBANA-SRV::->[auth_key->true]]\",\"origin\":\"10.11.136.187\",\"match\":true,\"final_state\":\"ALLOWED\",\"destination\":\"10.11.136.187\",\"task_id\":8,\"type\":\"MainRequest\",\"req_method\":\"HEAD\",\"path\":\"/\",\"indices\":[],\"@timestamp\":\"2018-12-18T21:03:46Z\",\"content_len_kb\":0,\"error_type\":null,\"processingMillis\":45,\"action\":\"cluster:monitor/main\",\"block\":\"{ name: '::KIBANA-SRV::', policy: ALLOW, rules: [auth_key]}\",\"id\":\"632345688-1336153666#8\",\"content_len\":0,\"user\":\"kibana\"}]}]]]","thread_name":"elasticsearch[s-ror-es-1][generic][T#4]","@timestamp":"2018-12-18T21:05:18.063+00:00","level":"ERROR","logger_name":"tech.beshu.ror.es.AuditSinkImpl"}```

(Barry Kaplan) #5

FYI: When I deleted the .readonlyrest index to get back in, I also deleted all the audit indices.
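
(The deletes were plain index deletions, something along these lines; host/port are assumed, and since ROR was uninstalled at that point no auth was needed:)

# assumed: node listening on localhost:9200, ROR plugin removed
curl -XDELETE 'http://localhost:9200/.readonlyrest'
curl -XDELETE 'http://localhost:9200/readonlyrest_audit-*'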


(Barry Kaplan) #6

So it looks like the problem is that ROR is creating the audit index with the default template of 5 primaries and 5 replicas on a one-node cluster. This didn’t happen when ROR was first installed.

On deleting the audit index, it gets recreated in a bit.

Sure, I can restart from a fresh cluster. But I need to be able to recover from a bad set of ACLs in case this happens in a production cluster.
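
As a sanity check, something like this shows how the recreated audit index was sized, and a stop-gap drops replicas on the existing audit indices so a single node can hold everything (host assumed; creds from the readonlyrest.yml above):

# inspect the settings the audit index was created with
curl -u admin:admin 'http://localhost:9200/readonlyrest_audit-*/_settings?pretty'
# stop-gap: set replicas to 0 on the existing audit indices
curl -u admin:admin -XPUT 'http://localhost:9200/readonlyrest_audit-*/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'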


(Barry Kaplan) #7

One other tidbit: while the cluster was in this state, all HTTP requests to the cluster were very, very slow. So ROR must have been spinning heavily.


(Barry Kaplan) #8

Went through the reset process again, but this time I disabled the audit, and the cluster is green.

I’ll hold off on the audit until I get some direction…


(Barry Kaplan) #9

Interesting finding. After getting back to my hello-world rules, I applied the same rules via readonlyrest.yml (not Kibana) that had locked me out when applied via Kibana, and they worked as expected. This makes me wonder whether I pasted some non-ASCII characters from IDEA. It didn’t look like it when I queried the contents of .readonlyrest, but I wasn’t looking for that specifically.
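
(One way to eyeball the stored settings for stray characters, with the host assumed to be the local node; cat -v makes non-printing bytes visible:)

curl -s -u admin:admin 'http://localhost:9200/.readonlyrest/_search?pretty' | cat -v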


Another note: I suppose after I learn the rules better and don’t make stupid mistakes, changing the values via the app will be faster (for ad hoc testing, that is). But for now it’s way, way faster to just run the Ansible playbook to update the rules and bounce the cluster. At least then I can revert very quickly without the extra restart/delete-index step.


(Simone Scarduzio) #10

Change the default number of replicas in ES, or use templates. Those numbers are the ES defaults.
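
A minimal sketch of the template approach (host and creds are assumptions; index_patterns is the 6.x-style field):

# template that sizes the ROR audit indices for a small cluster
curl -u admin:admin -XPUT 'http://localhost:9200/_template/readonlyrest_audit' \
  -H 'Content-Type: application/json' \
  -d '{
    "index_patterns": ["readonlyrest_audit-*"],
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  }'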


(Barry Kaplan) #11

Yes, I will do that. What confuses me is why it worked initially but then failed later, after I deleted all the *readonlyrest* indices. I take it you don’t make any attempt to define a template for the ROR indices? I did not see one…


(Simone Scarduzio) #12

Yeah, this is going to come with the advanced audit logging package we plan to create in Q2 '19.