ES crash when "hosts:" cannot be resolved

(Meee!) #1

Hi, all,

we’ve stumbled several times now over an issue when we add (or remove) entry points to a cluster. We run ES behind HTTP proxies which themselves also run ES with ROR enabled. These sit behind aliases, and we pass the alias name on to ROR in order to separate different users (who come in via different aliases) from each other. The aliases are DNS load balanced and also health-check the hosts behind them: for example, if a node does not run ES, it is removed from the alias.

Naturally, when we add new aliases/entry points for new users, they do not yet resolve. Since we started using ROR 1.16.1, this crashes ES (meaning that the node will never be added to the alias), with:

[2017-07-06T14:58:42,603][ERROR][o.e.b.Bootstrap ] Guice Exception: org.elasticsearch.plugin.readonlyrest.settings.SettingsMalformedException: Cannot create IP address from string: new-alias

[2017-07-06T14:58:42,986][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [hostname-ites_alias] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: org.elasticsearch.common.inject.CreationException: Guice creation errors:

  1. Error injecting constructor, org.elasticsearch.plugin.readonlyrest.settings.SettingsMalformedException: Cannot create IP address from string: new-alias
    at Source)
    while locating
    for parameter 1 at Source)
    while locating
    while locating annotated with @org.elasticsearch.common.inject.multibindings.Element(setName=,uniqueId=1)
    at unknown
    while locating java.util.Set<>
    for parameter 0 at Source)
    while locating
    for parameter 4 at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.(Unknown Source)
    while locating org.elasticsearch.gateway.TransportNodesListGatewayStartedShards
    for parameter 1 at org.elasticsearch.gateway.GatewayAllocator.(Unknown Source)
    while locating org.elasticsearch.gateway.GatewayAllocator

which brings down the entire cluster (including all valid endpoints).

How about logging an error and ignoring the corresponding rule rather than crashing ES?

(Simone Scarduzio) #2

Generally, the intention is that the rules are sanity-checked as early and as thoroughly as possible. Fail fast, fewer surprises at runtime. That also describes the current behaviour.

I guess your case is an exception to the rule. What we could do to support it is delegate the DNS resolution to rule evaluation time, handle the name-resolution exception by returning a NO_MATCH, and try again the next time a request comes in.

Doing so, we could say we’d have supported “temporary DNS lookup failures”.

On a side note, keep in mind that the JVM caches successful address resolutions permanently by default (once a name resolves to anything, it stays in a JVM-level DNS lookup table forever) for security reasons (i.e. to mitigate DNS poisoning). This is relevant because the fresh alias has to resolve to nothing in order for the resolution to be “failed” and postponed to the next request.

The setting is here (and can of course be changed to a number of seconds):

$ grep 'networkaddress.cache.ttl' $JAVA_HOME/jre/lib/security/java.security
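For completeness, the same TTLs can also be set programmatically, as long as it happens before the first name lookup in the JVM. A minimal sketch (the values 60 and 10 are just illustrative examples, not recommendations):

```java
import java.security.Security;

public class DnsCacheTtl {
    public static void main(String[] args) {
        // TTL in seconds for successful lookups; "-1" means cache forever
        // (the default when a security manager is installed).
        Security.setProperty("networkaddress.cache.ttl", "60");
        // Failed (negative) lookups are cached under a separate property.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

Note that these must be set before any `InetAddress` lookup; afterwards the JVM’s resolver cache is already in effect.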

How does this mechanism work for you?

(Meee!) #3

Hi, Simone,

thanks a lot for the answer! We’re indeed using the default here (which I guess means that names are cached indefinitely). I’m afraid, though, that this will not help us much. We automatically restart Elasticsearch on the search nodes when the configuration has changed (which is the case when we add or remove entry points), at which point the cache will be gone.

What would help, though, is an option to simply not apply a rule if hosts is given and does not resolve (or resolves to something empty, which is typically our case when we add entry points), rather than crashing ES. Issuing a proper error message explaining why a rule has not been applied would of course be just fine.

Does that make sense?

(Simone Scarduzio) #4

I think we are saying the same thing. This is what I proposed to do:


Boot time:

  • setup the rule, don’t try to resolve names

Rule check time (when a request arrives):

  • if the name was resolved before, check the request using that.
  • if we never resolved it before, try to resolve it now. IF name resolves OK, check request; else return NO_MATCH.
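The steps above could be sketched roughly like this. This is an illustrative sketch only; `HostsRule` and `RuleResult` are hypothetical names, not ROR’s actual API:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical sketch of the proposed lazy resolution.
class HostsRule {
    enum RuleResult { MATCH, NO_MATCH }

    private final String configuredHost;
    private volatile InetAddress resolved; // cached once resolution succeeds

    HostsRule(String configuredHost) {
        // Boot time: store the name, make no resolution attempt => no crash.
        this.configuredHost = configuredHost;
    }

    RuleResult match(InetAddress requestOrigin) {
        if (resolved == null) {
            try {
                // Rule check time: first successful resolution is kept forever.
                resolved = InetAddress.getByName(configuredHost);
            } catch (UnknownHostException e) {
                // Still unresolvable: skip this block, retry on the next request.
                return RuleResult.NO_MATCH;
            }
        }
        return resolved.equals(requestOrigin) ? RuleResult.MATCH : RuleResult.NO_MATCH;
    }
}
```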

Expected new behaviour

:white_check_mark: ROR won’t crash
No checks are done at the boot-time settings sanity check => no crash.

:white_check_mark: The entire rules block gets skipped
When a rule in a block returns NO_MATCH, in an ACL like ROR has, it’s effectively like not having the whole block at all. I assume this is the wanted behaviour.

:white_check_mark: Whenever resolution succeeds, the result is cached forever

Or… did you mean you would like the rule to be skipped instead of the whole block ignored? That would be weird, because it would mean the block allows requests from anywhere to match until the name resolves.

(Meee!) #5

Hi, Simone,

I’m perfectly happy with your proposal, very nice summary. Of course the block has to be ignored.

(Simone Scarduzio) #6

OK let’s throw this in for 1.16.8 :slight_smile:

(Simone Scarduzio) #7

@schwicke help me decide what should happen when you have configured an array of hosts, one of which is not resolving.

    - name: Multihost Rule
      hosts: ["unresol.vab.le", "localhost"]

When a request comes, should we:

  • Discard the unresolvable hosts and operate normally with the remaining configured hosts, if at least one is resolvable (i.e. accept requests from localhost).
  • Or should we skip the block if just a part of the hosts is unresolvable?

My nose says the first, but wanted to check with you.

(Meee!) #8

Very good question. My initial response would be to skip the block, as in this case you don’t know from which of the hosts the request actually comes. At least if it is a block which allows access to something.

I wonder if such a rule should not be flagged as bad practice :slight_smile:

So my preference would be to deny access whenever there is something which looks not to be as it is supposed to be.
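The agreed behaviour could be sketched like this (illustrative only; `MultiHostCheck` is a hypothetical name, not ROR’s actual code): if any configured host fails to resolve, the whole block is treated as unsafe and skipped.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;

class MultiHostCheck {
    // Returns true only if every configured host name resolves.
    static boolean allResolvable(List<String> hosts) {
        for (String host : hosts) {
            try {
                InetAddress.getByName(host);
            } catch (UnknownHostException e) {
                // One unresolvable name => skip (deny) the whole block.
                return false;
            }
        }
        return true;
    }
}
```

With the example settings above, `["unresol.vab.le", "localhost"]` would fail the check, so the block would be skipped even though localhost resolves.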

(Simone Scarduzio) #9

OK, makes sense. Then 1.16.8-pre3 is ready to test. You’ve got the build-from-tag in place, so I won’t even link you the build :slight_smile:

The only missing thing is a warning log line when we skip the block due to an unresolvable host. It will be there in the official release, np.

(Meee!) #10

Great! I managed to package it and put it on our test cluster. I will test in more detail next week. What are the release plans for 1.16.8?

(Simone Scarduzio) #11

Today/tomorrow at the latest, as soon as people manage to take it for a spin.
1.16.7 was really unlucky: I found something like 4 bugs in the 2 days after release, plus 5.5.x came out in the meanwhile, and I need a fresh build to support 5.5.0 too.

So yeah 1.16.8 is around the corner. BTW if somebody wants to try the pre-release, it’s here:

(Meee!) #12

I’ve done basic functionality testing, and it seems to work on our test cluster. I have not yet tested the new features requested in this specific post, though :slight_smile: