Has Kibana PRO "cookiePass" stopped working in 1.18.5 or .6 or .7?

Good day. We have been using the Kibana PRO plug-in on a set of 3 servers side-by-side, with no load balancer (session-sticky or otherwise) between the user's browser and the Kibana service: just a plain 3-answer "A" record in DNS, relying on the excellent retry / failover behavior of the typical Chrome browser and the underlying TCP sockets. This has worked beautifully for months, because the plug-in has the "readonlyrest_kbn.cookiePass" parameter: each of the three servers can construct a cryptographically protected session cookie that the other servers instantly recognize as valid, with no coordination necessary between the Kibana instances as long as they share the same cookiePass value.
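Conceptually, the shared-secret property I'm relying on looks something like the sketch below. (This is just my mental model in TypeScript form, not the plugin's actual implementation; the names and the exact signing/encryption details here are invented.)

```typescript
import { createHmac, timingSafeEqual } from "crypto";

// Hypothetical illustration: any host holding the same cookiePass value
// can verify a session cookie minted by any sibling host, with no
// inter-server coordination at all.
const cookiePass = "same-long-random-string-on-all-3-hosts"; // from kibana.yml

function signCookie(identityPayload: string): string {
  const mac = createHmac("sha256", cookiePass).update(identityPayload).digest("hex");
  return `${identityPayload}.${mac}`;
}

function verifyCookie(cookie: string): string | null {
  const dot = cookie.lastIndexOf(".");
  if (dot < 0) return null;
  const payload = cookie.slice(0, dot);
  const mac = Buffer.from(cookie.slice(dot + 1), "hex");
  const expected = createHmac("sha256", cookiePass).update(payload).digest();
  // timingSafeEqual throws on length mismatch, so check length first
  if (mac.length !== expected.length || !timingSafeEqual(mac, expected)) {
    return null;
  }
  return payload; // accepted, regardless of which host minted it
}
```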

This is great, but it seems to have stopped working in some recent version. Our production cluster has version 1.18.0 for ES 6.7.1, and it's still working fine. But a separate "advance testing" cluster has the latest 1.18.7 for ES 7.3.2, and the common cookie seems to be broken there. My recollection is that it broke somewhere around 1.18.5 or .6, but the problem only hit me sporadically, so I wasn't paying enough attention to diagnose it with precision until now.

I can demonstrate on my machine that if I restrict myself to connecting to only one Kibana host, the problem doesn't occur; I can log in, get a cookie, and then execute Dev Tools queries over and over without issue. But if my browser is allowed to talk to two or more hosts under the same DNS name, things work only until a request happens to hit a second host; as soon as that happens, I get "403 FORBIDDEN" errors back, visible both in the Kibana logs and to me in Dev Tools. If I happen to be using other Kibana functions (not Dev Tools, but Dashboard or Monitoring, etc.), then the Kibana client's request to session_probe.txt will figure out that I don't have a valid cookie and force me back to a green login screen.

And it doesn't matter which method I use to restrict the multi-answer DNS name down to a single host; the result is the same. In every case I'm connecting to

http://qim-elastic666-kibana.qim.com:5601

and that name is a DNS CNAME to an A record with 3 answers.

```
dig qim-elastic666-kibana.qim.com

;; ANSWER SECTION:
qim-elastic666-kibana.qim.com. 3600 IN CNAME qim-elastic666-query.qim.com.
qim-elastic666-query.qim.com. 3600 IN A 192.168.48.237
qim-elastic666-query.qim.com. 3600 IN A 192.168.48.236
qim-elastic666-query.qim.com. 3600 IN A 192.168.48.235
```

So under normal circumstances, my Chrome browser will resolve the name to 3 addresses, and then, by its own random-but-helpful behavior, it could connect to any of the three with any combination of keepalive HTTP sockets at any time. If I leave it this way and keep running queries while tailing the Kibana logs on all three hosts, then as soon as I see my browser send a query to more than one host, the invalid-cookie problem happens. But if I change nothing on my workstation and simply stop the Kibana service on two of the three hosts, then Chrome is only ever able to connect to one host, and the problem doesn't happen. Similarly, if I leave the Kibana service technically running but install an "iptables" filter (shown just below) on two of the three hosts to drop incoming packets toward :5601, then Chrome pauses for a moment while trying to reach them, but eventually it settles on the only accessible host of the three, and again the problem never happens. The symptom is the same whether I set server.ssl.enabled to true or false, i.e., whether the browser-to-Kibana session is HTTPS or not; it was simply harder for me to troubleshoot when I couldn't see the content in plaintext. :slight_smile:
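(For completeness, the iptables filter was nothing exotic; on each host to be excluded it was a single drop rule along these lines:)

```
iptables -A INPUT -p tcp --dport 5601 -j DROP
```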

So it appears that I do have a workaround if I want to upgrade my clusters… I can just leave two of the three hosts with Kibana stopped, and if I notice a problem with the only one, I can stop that one and start another. So this is not a work-stopping problem for me, just seemingly a regression from previous desirable behavior. Let me know if there are any other tests I should perform.

Thanks!

– JeffS

Hi Jeff, great analysis as usual!
You are absolutely right: the HA behaviour without sticky sessions appears to be broken, and this time it's due to a different cause. When users belonged to a lot of groups, and/or the group names were long, the identity object we saved in the encrypted cookie exceeded the maximum cookie size supported by browsers.

To solve that, we introduced a basic server-side session cache. So the encrypted auth cookie now only contains a session ID, and the rest of the identity info (groups, username, kibana index, access level) is looked up in the cache.

The issue is that at the moment the server-side cache is an in-memory cache, so its scope is the single Kibana instance. We currently have an engineer working on moving the cache to an API in ES backed by an actual index, so all the HA instances of Kibana will be able to look up sessions from there.
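Roughly, the current scheme amounts to something like this (a simplified sketch for illustration, not the actual plugin source):

```typescript
import { randomBytes } from "crypto";

// What the server-side cache stores per session
interface Identity {
  username: string;
  groups: string[];
  kibanaIndex: string;
  accessLevel: string;
}

// In-memory cache: its scope is the single Kibana instance,
// which is exactly why HA without sticky sessions broke.
const sessionCache = new Map<string, Identity>();

function createSession(identity: Identity): string {
  const sessionId = randomBytes(32).toString("hex");
  sessionCache.set(sessionId, identity);
  return sessionId; // this short ID is all the encrypted cookie now carries
}

function resolveSession(sessionId: string): Identity | undefined {
  // A sibling instance that didn't mint this ID finds nothing here
  return sessionCache.get(sessionId);
}
```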

Ah, got it. So you need some way to cross-share the valid sessions, and since ES is right there, it makes sense to just use it as the key-value store. And if you PUT and GET by a specific individual key (_id), then ES guarantees the result isn't stale… you wouldn't want to use the Search API, because newly indexed documents only become searchable after a refresh (one second by default), and it's possible the user could log into one Kibana node, get a freshly generated cookie, and then instantly on the next browser request round-robin over to a different Kibana node, and that second one would have to recognize a cookie that had JUST been created milliseconds before.
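In terms of the official JavaScript ES client, I'm picturing something like this (entirely my own sketch; the index name ".readonlyrest_sessions" and the shapes are invented):

```typescript
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });

// GET by _id is realtime in ES: a document PUT by one Kibana instance is
// immediately visible to a GET from another, with no refresh involved.
async function lookupSession(sessionId: string) {
  try {
    const resp = await es.get({ index: ".readonlyrest_sessions", id: sessionId });
    return resp.body._source;
  } catch (e: any) {
    if (e.meta?.statusCode === 404) return undefined; // unknown session ID
    throw e; // cluster trouble: let the caller decide what to do
  }
}
// A search, by contrast, only sees the document after the next refresh
// (refresh_interval defaults to 1s), so a just-minted session could be
// invisible to the very next round-robined request.
```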

And I guess you would want this new Kibana code to catch any failures in the GET / PUT calls to ES and harmlessly fall back to recognizing only its own in-memory cache. I.e., if the Elasticsearch cluster is in a bad state and the admin is trying very hard to log into Kibana to go look at all the offline nodes and red indices, but in the process of logging in he keeps losing his encrypted cookie and popping back to a login screen, it's going to be doubly frustrating! I guess this is already a hazard… the good Elasticsearch admin always has to be prepared to fall back to CLI curl commands to troubleshoot an unhealthy cluster anyway.
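Something shaped like this, reusing lookupSession and the sessionCache / Identity names from the sketches above (again purely hypothetical):

```typescript
// Prefer the shared ES store, but degrade gracefully to the local
// in-memory cache if the cluster is unhealthy, so the admin can still
// log in to diagnose the offline nodes and red indices.
async function resolveSessionSafely(sessionId: string): Promise<Identity | undefined> {
  try {
    const remote = (await lookupSession(sessionId)) as Identity | undefined;
    if (remote) return remote;
  } catch {
    // ES unreachable or red: fall through instead of bouncing the
    // user back to the login screen
  }
  return sessionCache.get(sessionId); // local knowledge only
}
```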

OK, I will watch the release notes for the cookiePass feature being returned to us. If the new code to cache this in ES starts taking a long time to get right, and the customer pressure is high to put it back to a working state, then you could consider a new config option

```
readonlyrest_kbn.cookieStateStorage: browser | internal
```

… with the default = "internal". If it was set to "browser", then you would do the same as you have done before: store the long encrypted cookie in the browser and warn the administrator that they'd better do a constrained LDAP query and not return excessive group names. If it was configured to "internal", then it would do what you're describing here: send short cookies to the browser and store the full versions internally in ES. But I don't want this to be a distraction; if the store-in-ES code is straightforward and will be done soon, then you don't have to offer two alternate schemes.
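To make the suggestion concrete (all names here are my invention, reusing signCookie / createSession from the earlier sketches):

```typescript
type CookieStateStorage = "browser" | "internal";

function buildAuthCookie(identity: Identity, mode: CookieStateStorage): string {
  if (mode === "browser") {
    // pre-1.18.5 behavior: the whole identity rides inside the cookie,
    // so the admin must keep LDAP group lists small enough to fit
    return signCookie(JSON.stringify(identity));
  }
  // "internal": short opaque session ID; the full identity lives in ES
  return createSession(identity);
}
```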


loads of good points, @JeffSaxe!