We’ve stumbled several times now into an issue when we add (or remove) entry points to a cluster. We are running ES behind HTTP proxies, which also run ES with ROR enabled. These sit behind aliases, and we pass the alias name on to ROR in order to separate different users (who come in via different aliases) from each other. The aliases are DNS load balanced and also health-check the hosts behind them: for example, if a node does not run ES, it is removed from the alias.
Naturally, when we add new aliases/entry points for new users, they do not yet resolve. Since we started using ROR 1.16.1 this crashes ES (meaning that the node will never be added to the alias), with:
[2017-07-06T14:58:42,603][ERROR][o.e.b.Bootstrap ] Guice Exception: org.elasticsearch.plugin.readonlyrest.settings.SettingsMalformedException: Cannot create IP address from string: new-alias
…
[2017-07-06T14:58:42,986][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [hostname-ites_alias] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: org.elasticsearch.common.inject.CreationException: Guice creation errors:
Error injecting constructor, org.elasticsearch.plugin.readonlyrest.settings.SettingsMalformedException: Cannot create IP address from string: new-alias
at org.elasticsearch.plugin.readonlyrest.es.ReloadableSettingsImpl.&lt;init&gt;(Unknown Source)
while locating org.elasticsearch.plugin.readonlyrest.es.ReloadableSettingsImpl
for parameter 1 at org.elasticsearch.plugin.readonlyrest.es.IndexLevelActionFilter.&lt;init&gt;(Unknown Source)
while locating org.elasticsearch.plugin.readonlyrest.es.IndexLevelActionFilter
while locating org.elasticsearch.action.support.ActionFilter annotated with @org.elasticsearch.common.inject.multibindings.Element(setName=,uniqueId=1)
at unknown
while locating java.util.Set<org.elasticsearch.action.support.ActionFilter>
for parameter 0 at org.elasticsearch.action.support.ActionFilters.&lt;init&gt;(Unknown Source)
while locating org.elasticsearch.action.support.ActionFilters
for parameter 4 at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.&lt;init&gt;(Unknown Source)
while locating org.elasticsearch.gateway.TransportNodesListGatewayStartedShards
for parameter 1 at org.elasticsearch.gateway.GatewayAllocator.&lt;init&gt;(Unknown Source)
while locating org.elasticsearch.gateway.GatewayAllocator
which brings down the entire cluster (including all valid endpoints).
How about throwing an error and ignoring the corresponding rule rather than crashing ES?
Generally, the intention is that the rules are sanity-checked to the best of our capabilities, as early as possible. Fail fast, fewer surprises at runtime. That also describes the current behaviour.
I guess your case is an exception to that rule. What we could do to support it is defer the DNS resolution to rule evaluation time, handle the name resolution exception by returning a NO_MATCH, and try again the next time a request comes in.
Doing so, we could say we support “temporary DNS lookup failures”.
On a side note, keep in mind that by default the JVM caches successful address resolutions permanently (once a name resolves to anything, it stays in a JVM-level DNS lookup table forever) for security reasons (i.e. to mitigate DNS poisoning). This is relevant because the fresh alias has to resolve to nothing in order for the resolution to count as “failed” and be postponed to the next request.
The setting is here (and can of course be changed to a number of seconds).
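For reference, a minimal sketch of the JDK-level knobs involved (these are standard java.security properties, not ROR settings; normally you would edit the JVM’s java.security file rather than set them from code):

```java
import java.security.Security;

public class DnsCacheTtl {
    public static void main(String[] args) {
        // Must run before the first name lookup happens in this JVM.
        // TTL in seconds for successful lookups; -1 means "cache forever",
        // which is the default when a security manager is installed (as in ES).
        Security.setProperty("networkaddress.cache.ttl", "60");
        // TTL in seconds for failed lookups (the default is 10 seconds).
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }
}
```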
Thanks a lot for the answer! We’re indeed using the default here (which I guess means it is cached indefinitely). I’m afraid, though, that this will not help us much: we automatically restart Elasticsearch on the search nodes whenever the configuration changes - which is the case when we add or remove entry points - and at that point the cache is gone anyway.
What would help, though, would be an option to simply not apply a rule if hosts is given and does not resolve (or resolves to something empty, which is typically our case when we add entry points), rather than crashing ES. Issuing an error message instead would of course be just fine, telling why a rule has not been applied.
I think we are saying the same thing; this is what I proposed to do:
Algorithm
Boot time:
set up the rule, don’t try to resolve names
Rule check time (when a request arrives):
if the name was resolved before, check the request using that.
if we never resolved it before, try to resolve it now. If the name resolves OK, check the request; else return NO_MATCH.
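A minimal sketch of that lazy-resolution idea, using hypothetical class and method names rather than ROR’s actual internals (successful lookups are cached, matching the “resolved forever” point below):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration only: a hosts-style rule that defers DNS resolution
// from boot time to the first request that needs it.
class LazyHostsRule {

    private final String configuredHost;
    // Once a name resolves successfully, keep the result.
    private final Map<String, InetAddress> resolved = new ConcurrentHashMap<>();

    LazyHostsRule(String configuredHost) {
        // Boot time: just remember the name, don't resolve it yet.
        this.configuredHost = configuredHost;
    }

    /** Rule check time: true means MATCH, false means NO_MATCH. */
    boolean matches(InetAddress requestOrigin) {
        InetAddress address = resolved.get(configuredHost);
        if (address == null) {
            try {
                // Never resolved before (or resolution kept failing): try now.
                address = InetAddress.getByName(configuredHost);
                resolved.put(configuredHost, address);
            } catch (UnknownHostException e) {
                // The name does not resolve (yet): NO_MATCH, retry on the next request.
                return false;
            }
        }
        return address.equals(requestOrigin);
    }
}
```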
Expected new behaviour
ROR won’t crash
No DNS checks are done during the boot-time settings sanity check => no crash.
The entire rules block gets skipped
When a rule in a block returns NO_MATCH, in an ACL like ROR’s, it’s effectively like not having the whole block at all. I assume this is the wanted behaviour.
Whenever resolution succeeds, it’s resolved forever
Or… did you mean you would like the rule to be skipped rather than the whole block ignored? That would be odd, because it would mean the block allows requests from anywhere to match until we have a name resolution.
Discard the unresolvable hosts and operate normally with the remaining configured hosts, if at least one is resolvable (i.e. accept requests from localhost).
Or should we skip the block if just a part of the hosts is unresolvable?
My nose says the first, but wanted to check with you.
Very good question. My initial response would have been to skip the block, as in this case you don’t know which of the hosts the request actually comes from. At least if it is a block which allows access to something.
I wonder if such a rule should not be flagged as bad practice.
So my preference would be to deny access whenever something does not look the way it is supposed to be.
Today/tomorrow maximum, as soon as people manage to take it for a spin.
1.16.7 was really unlucky: I found something like 4 bugs in the 2 days after release, plus 5.5.x came out in the meantime. And I need a fresh build to support 5.5.0 too.
So yeah, 1.16.8 is around the corner. BTW, if somebody wants to try the pre-release, it’s here:
I’ve done basic functionality testing and it seems to work on our test cluster. I have not yet tested the new features requested in this specific post, though.