Has Kibana PRO "cookiePass" stopped working in 1.18.5 or .6 or .7?

Good day. We have been using the Kibana PRO plug-in on a set of 3 servers side-by-side, with no load balancer (session-sticky or otherwise) between the user's browser and the Kibana service: just a plain 3-answer "A" record in DNS, relying on the excellent retry / failover behavior of the typical Chrome browser and the underlying TCP sockets. This has worked beautifully for months, because the plug-in has the "readonlyrest_kbn.cookiePass" parameter: each of the three servers can construct a cryptographically protected session cookie that the other servers instantly recognize as valid, with no coordination necessary between the Kibana instances as long as they share the same cookiePass value.
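Conceptually, the shared-secret property I'm relying on looks something like the sketch below. (This is just my mental model in TypeScript form, not the plugin's actual implementation; the names and the exact signing/encryption details here are invented.)

```typescript
import { createHmac, timingSafeEqual } from "crypto";

// Hypothetical illustration: any host holding the same cookiePass value
// can verify a session cookie minted by any sibling host, with no
// inter-server coordination at all.
const cookiePass = "same-long-random-string-on-all-3-hosts"; // from kibana.yml

function signCookie(identityPayload: string): string {
  const mac = createHmac("sha256", cookiePass).update(identityPayload).digest("hex");
  return `${identityPayload}.${mac}`;
}

function verifyCookie(cookie: string): string | null {
  const dot = cookie.lastIndexOf(".");
  if (dot < 0) return null;
  const payload = cookie.slice(0, dot);
  const mac = Buffer.from(cookie.slice(dot + 1), "hex");
  const expected = createHmac("sha256", cookiePass).update(payload).digest();
  // timingSafeEqual throws on length mismatch, so check length first
  if (mac.length !== expected.length || !timingSafeEqual(mac, expected)) {
    return null;
  }
  return payload; // accepted, regardless of which host minted it
}
```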

This is great, but it seems to have stopped working in some recent version. Our production cluster has version 1.18.0 for ES 6.7.1, and it's still working fine. But a separate "advance testing" cluster has the latest 1.18.7 for ES 7.3.2, and the common cookie seems to be broken there. My recollection is that it broke somewhere around 1.18.5 or .6, but the problem only hit me sporadically, so I wasn't paying enough attention to diagnose it with precision until now.

I can demonstrate on my machine that if I restrict myself to connecting to only one Kibana host, the problem doesn't occur; I can log in, get a cookie, and then execute Dev Tools queries over and over without issue. But if my browser is allowed to talk to two or more hosts under the same DNS name, things work only until a request happens to hit a second host; as soon as that happens, I get "403 FORBIDDEN" errors back, visible both in the Kibana logs and to me in Dev Tools. If I happen to be using other Kibana functions (not Dev Tools, but Dashboard or Monitoring, etc.), then the Kibana client's request to session_probe.txt will figure out that I don't have a valid cookie and force me back to a green login screen.

And it doesn't matter which method I use to restrict the multi-answer DNS name down to a single host; the result is the same. In every case I'm connecting to

http://qim-elastic666-kibana.qim.com:5601

and that name is a DNS CNAME to an A record with 3 answers.

```
dig qim-elastic666-kibana.qim.com

;; ANSWER SECTION:
qim-elastic666-kibana.qim.com. 3600 IN CNAME qim-elastic666-query.qim.com.
qim-elastic666-query.qim.com. 3600 IN A 192.168.48.237
qim-elastic666-query.qim.com. 3600 IN A 192.168.48.236
qim-elastic666-query.qim.com. 3600 IN A 192.168.48.235
```

So under normal circumstances, my Chrome browser will resolve the name to 3 addresses, and then, by its own random-but-helpful behavior, it could connect to any of the three with any combination of keepalive HTTP sockets at any time. If I leave it this way and keep running queries while tailing the Kibana logs on all three hosts, then as soon as I see my browser send a query to more than one host, the invalid-cookie problem happens. But if I change nothing on my workstation and simply stop the Kibana service on two of the three hosts, then Chrome is only ever able to connect to one host, and the problem doesn't happen. Similarly, if I leave the Kibana service technically running but install an "iptables" filter (shown just below) on two of the three hosts to drop incoming packets toward :5601, then Chrome pauses for a moment while trying to reach them, but eventually it settles on the only accessible host of the three, and again the problem never happens. The symptom is the same whether I set server.ssl.enabled to true or false, i.e., whether the browser-to-Kibana session is HTTPS or not; it was simply harder for me to troubleshoot when I couldn't see the content in plaintext. :slight_smile:
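(For completeness, the iptables filter was nothing exotic; on each host to be excluded it was a single drop rule along these lines:)

```
iptables -A INPUT -p tcp --dport 5601 -j DROP
```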

So it appears that I do have a workaround if I want to upgrade my clusters… I can just leave two of the three hosts with Kibana stopped, and if I notice a problem with the only one, I can stop that one and start another. So this is not a work-stopping problem for me, just seemingly a regression from previous desirable behavior. Let me know if there are any other tests I should perform.

Thanks!

– JeffS

Hi Jeff, great analysis as usual!
You are absolutely right: the HA behaviour without sticky sessions appears to be broken, and this time it's due to a different cause. When users belonged to a lot of groups, and/or the group names were long, the identity object we saved in the encrypted cookie exceeded the maximum cookie size supported by browsers.

To solve that, we introduced a basic server-side session cache. So the encrypted auth cookie now only contains a session ID, and the rest of the identity info (groups, username, kibana index, access level) is looked up in the cache.

The issue is that at the moment the server-side cache is an in-memory cache, so its scope is the single Kibana instance. We currently have an engineer working on moving the cache to an API in ES backed by an actual index, so all the HA instances of Kibana will be able to look up sessions from there.
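Roughly, the current scheme amounts to something like this (a simplified sketch for illustration, not the actual plugin source):

```typescript
import { randomBytes } from "crypto";

// What the server-side cache stores per session
interface Identity {
  username: string;
  groups: string[];
  kibanaIndex: string;
  accessLevel: string;
}

// In-memory cache: its scope is the single Kibana instance,
// which is exactly why HA without sticky sessions broke.
const sessionCache = new Map<string, Identity>();

function createSession(identity: Identity): string {
  const sessionId = randomBytes(32).toString("hex");
  sessionCache.set(sessionId, identity);
  return sessionId; // this short ID is all the encrypted cookie now carries
}

function resolveSession(sessionId: string): Identity | undefined {
  // A sibling instance that didn't mint this ID finds nothing here
  return sessionCache.get(sessionId);
}
```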

Ah, got it. So you need some way to cross-share the valid sessions, and since ES is right there, it makes sense to just use it as the key-value store. And if you PUT and GET by a specific individual key (_id), then ES guarantees the result isn't stale… you wouldn't want to use the Search API, because newly indexed documents only become searchable after a refresh (one second by default), and it's possible the user could log into one Kibana node, get a freshly generated cookie, and then instantly on the next browser request round-robin over to a different Kibana node, and that second one would have to recognize a cookie that had JUST been created milliseconds before.
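In terms of the official JavaScript ES client, I'm picturing something like this (entirely my own sketch; the index name ".readonlyrest_sessions" and the shapes are invented):

```typescript
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });

// GET by _id is realtime in ES: a document PUT by one Kibana instance is
// immediately visible to a GET from another, with no refresh involved.
async function lookupSession(sessionId: string) {
  try {
    const resp = await es.get({ index: ".readonlyrest_sessions", id: sessionId });
    return resp.body._source;
  } catch (e: any) {
    if (e.meta?.statusCode === 404) return undefined; // unknown session ID
    throw e; // cluster trouble: let the caller decide what to do
  }
}
// A search, by contrast, only sees the document after the next refresh
// (refresh_interval defaults to 1s), so a just-minted session could be
// invisible to the very next round-robined request.
```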

And I guess you would want this new Kibana code to catch any failures in the GET / PUT calls to ES and harmlessly fall back to recognizing only its own in-memory cache. I.e., if the Elasticsearch cluster is in a bad state and the admin is trying very hard to log into Kibana to go look at all the offline nodes and red indices, but in the process of logging in he keeps losing his encrypted cookie and popping back to a login screen, it's going to be doubly frustrating! I guess this is already a hazard… the good Elasticsearch admin always has to be prepared to fall back to CLI curl commands to troubleshoot an unhealthy cluster anyway.
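Something shaped like this, reusing lookupSession and the sessionCache / Identity names from the sketches above (again purely hypothetical):

```typescript
// Prefer the shared ES store, but degrade gracefully to the local
// in-memory cache if the cluster is unhealthy, so the admin can still
// log in to diagnose the offline nodes and red indices.
async function resolveSessionSafely(sessionId: string): Promise<Identity | undefined> {
  try {
    const remote = (await lookupSession(sessionId)) as Identity | undefined;
    if (remote) return remote;
  } catch {
    // ES unreachable or red: fall through instead of bouncing the
    // user back to the login screen
  }
  return sessionCache.get(sessionId); // local knowledge only
}
```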

OK, I will watch the release notes for the cookiePass feature being returned to us. If the new code to cache this in ES starts taking a long time to get right, and the customer pressure is high to put it back to a working state, then you could consider a new config option

```
readonlyrest_kbn.cookieStateStorage: browser | internal
```

… with the default = "internal". If it was set to "browser", then you would do the same as you have done before: store the long encrypted cookie in the browser and warn the administrator that they'd better do a constrained LDAP query and not return excessive group names. If it was configured to "internal", then it would do what you're describing here: send short cookies to the browser and store the full versions internally in ES. But I don't want this to be a distraction; if the store-in-ES code is straightforward and will be done soon, then you don't have to offer two alternate schemes.
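To make the suggestion concrete (all names here are my invention, reusing signCookie / createSession from the earlier sketches):

```typescript
type CookieStateStorage = "browser" | "internal";

function buildAuthCookie(identity: Identity, mode: CookieStateStorage): string {
  if (mode === "browser") {
    // pre-1.18.5 behavior: the whole identity rides inside the cookie,
    // so the admin must keep LDAP group lists small enough to fit
    return signCookie(JSON.stringify(identity));
  }
  // "internal": short opaque session ID; the full identity lives in ES
  return createSession(identity);
}
```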


loads of good points, @JeffSaxe!