POST token UMA ticket scalability

Hello,

I’m working on migrating our authorization over from a custom solution to Keycloak UMA. However, locally after I have migrated ~120k resources over with ~475k policies. When I now try to exchange an access token+permission ticket for an RPT token using POST /auth/realms/${keycloak_realm}/protocol/openid-connect/token with grant-type uma-ticket, the 1st call is taking ~15-30 seconds to complete. Has anyone else seen this behavior?

This is with no load on the keycloak server and it’s the only request happening. I’m running Keycloak 4.8.3.Final and MySQL 5.6.45 Community Edition w/INNODB. My local machine is a late model i7 hex core w/32GiB of RAM and m.2 SSD.

Watching the docker stats while the request is running shows that the MySQL container is pegging the CPU. When I run mysql performance and profiling, it’s showing a query that is doing a full table scan and dominating the performance of MySQL. I’m guessing the query is created via an ORM because it looks to contain some pretty gnarly predicates and joins. When I run the explain on the query nothing is jumping out. However, the performance numbers are showing a tong of rows being scanned. I tried manually creating a couple of indexes but that did not seem to help.

I’ve attached the visual explain and the query below. Any help would be appreciated.

 select 
    policyenti0_.ID as ID1_65_0_, 
    resourceen2_.ID as ID1_66_1_, 
    scopeentit4_.ID as ID1_67_2_, 
    policyenti6_.ID as ID1_65_3_, 
    policyenti0_.DECISION_STRATEGY as DECISION2_65_0_, 
    policyenti0_.DESCRIPTION as DESCRIPT3_65_0_,
    policyenti0_.LOGIC as LOGIC4_65_0_, 
    policyenti0_.NAME as NAME5_65_0_, 
    policyenti0_.OWNER as OWNER6_65_0_, 
    policyenti0_.RESOURCE_SERVER_ID as RESOURCE8_65_0_, 
    policyenti0_.TYPE as TYPE7_65_0_, 
    resourceen2_.DISPLAY_NAME as DISPLAY_2_66_1_, 
    resourceen2_.ICON_URI as ICON_URI3_66_1_, 
    resourceen2_.NAME as NAME4_66_1_, 
    resourceen2_.OWNER as OWNER5_66_1_, 
    resourceen2_.OWNER_MANAGED_ACCESS as OWNER_MA6_66_1_, 
    resourceen2_.RESOURCE_SERVER_ID as RESOURCE8_66_1_, 
    resourceen2_.TYPE as TYPE7_66_1_, 
    resources1_.POLICY_ID as POLICY_I1_61_0__, 
    resources1_.RESOURCE_ID as RESOURCE2_61_0__, 
    scopeentit4_.DISPLAY_NAME as DISPLAY_2_67_2_, 
    scopeentit4_.ICON_URI as ICON_URI3_67_2_, 
    scopeentit4_.NAME as NAME4_67_2_, 
    scopeentit4_.RESOURCE_SERVER_ID as RESOURCE5_67_2_, 
    scopes3_.POLICY_ID as POLICY_I1_71_1__, 
    scopes3_.SCOPE_ID as SCOPE_ID2_71_1__, 
    policyenti6_.DECISION_STRATEGY as DECISION2_65_3_, 
    policyenti6_.DESCRIPTION as DESCRIPT3_65_3_, 
    policyenti6_.LOGIC as LOGIC4_65_3_, 
    policyenti6_.NAME as NAME5_65_3_, 
    policyenti6_.OWNER as OWNER6_65_3_, 
    policyenti6_.RESOURCE_SERVER_ID as RESOURCE8_65_3_, 
    policyenti6_.TYPE as TYPE7_65_3_, 
    associated5_.POLICY_ID as POLICY_I1_1_2__, 
    associated5_.ASSOCIATED_POLICY_ID as ASSOCIAT2_1_2__ 
from 
	RESOURCE_SERVER_POLICY policyenti0_ left outer join 
    RESOURCE_POLICY resources1_ on policyenti0_.ID=resources1_.POLICY_ID left outer join 
    RESOURCE_SERVER_RESOURCE resourceen2_ on resources1_.RESOURCE_ID=resourceen2_.ID inner join 
    SCOPE_POLICY scopes3_ on policyenti0_.ID=scopes3_.POLICY_ID inner join 
    RESOURCE_SERVER_SCOPE scopeentit4_ on scopes3_.SCOPE_ID=scopeentit4_.ID inner join 
    ASSOCIATED_POLICY associated5_ on policyenti0_.ID=associated5_.POLICY_ID inner join 
    RESOURCE_SERVER_POLICY policyenti6_ on associated5_.ASSOCIATED_POLICY_ID=policyenti6_.ID 
where 
	policyenti0_.RESOURCE_SERVER_ID='f75754d9-4c13-4710-b358-7aea027c50c4' and 
    (exists (select policyenti9_.ID from RESOURCE_SERVER_SCOPE scopeentit7_ inner join SCOPE_POLICY policies8_ on scopeentit7_.ID=policies8_.SCOPE_ID inner join RESOURCE_SERVER_POLICY policyenti9_ on policies8_.POLICY_ID=policyenti9_.ID where scopeentit7_.RESOURCE_SERVER_ID='f75754d9-4c13-4710-b358-7aea027c50c4' and policyenti9_.RESOURCE_SERVER_ID='f75754d9-4c13-4710-b358-7aea027c50c4' and policyenti9_.ID=policyenti0_.ID and policyenti9_.TYPE='scope' and (scopeentit7_.ID in ('1fd80023-ee52-48bb-a18c-79762c308026')))) and 
    not (exists (select resourceen11_.ID from RESOURCE_POLICY resources10_, RESOURCE_SERVER_RESOURCE resourceen11_ where policyenti0_.ID=resources10_.POLICY_ID and resources10_.RESOURCE_ID=resourceen11_.ID));

I tried updating the keycloak docker container to 10.0.2 and the endpoint to exchange ticket+access token for an RPT is still showing the same behavior.

I was able to get ~3x speedup with the query by forcing the use of IDX_SCOPE_POLICY_POLICY index on the SCOPE_POLICY table. I’m now looking into finding a way to have keycloak generate SQL that includes the use of this index or configuring MySQL in a way to get the query optimizer to use a plan that uses this index.

I believe I’ve traced down the query to the PolicyEntity’s findPolicyIdByNullResourceScope NamedQuery in hibernate. Unfortunately, I don’t think hibernate allows for setting which index to use or force in the NamedQuery syntax. You have to break out into raw SQL.

What’s really strange is that I have ~35 rows in RESOURCE_SERVER_SCOPE and 655k rows in RESOURCE_SERVER_RESOURCE so I don’t understand why MySQL’s query optimizer would ever choose to use the index for 35 rows over 655k rows. So, I went nuclear on the SCOPE_POLICY table and rebuilt the primary key with SCOPE_ID and POLICY_ID switched around so POLICY_ID is the 1st column in the primary key index. Then I deleted the foreign key constraint for SCOPE_ID because MySQL requires an index for all foreign keys. After firing up jmeter again, the response time for exchanging permission tickets+access tokens for RPT tokens went down to ~7 seconds for 5 concurrent users. It’s still not a great response time but it’s at least workable.

Anyone have any thoughts here. Is anyone doing UMA at scale yet in keycloak?

Hi @Zambonilli,

Can you share the HTTP request you are sending ? Is the ticket obtained for a single resource and you are obtaining permissions for this single resource ? Or are you obtaining all permissions for all resources a user has ?

But yeah, that query is heavy and we need to find a way to make it better. I also did some tests using indexes like you did in the past, and they seem to help but still not solve the problem.

What we really need is someway to optimize how we discover which policies should be evaluated for a given permission (resource + scope). When the cache is not yet hot, it takes time during this discovery process.

Thanks @pedroigor,

Here’s the HTTP request we’re sending. Embedded in the permission ticket is a single resource and 9 scopes.

http://localhost:8181/auth/realms/ph/protocol/openid-connect/token
Bearer $ACCESS_TOKEN
Content-Type application/x-www-form-urlencoded

grant_type=urn:ietf:params:oauth:grant-type:uma-ticket&ticket=$PERMISSION_TICKET

Would you be able to run a test using some changes that I’m preparing?

Basically, for ticket-granted resources I’m skipping the evaluation completly, which should help with the cases where the resources explicitly granted by the owner are automatically granted.

Sure, no problem. I’m running the jboss/keycloak:10.0.2 docker container and can dive into what it takes to modify jboss or keycloak from the building from source readme.md.

Otherwise, if it’s easier I can export my MySQL DB but I think those are raw SQL files.

Maybe I can push a docker image ?

Although building from source should be easier … and faster for both of us.

Sorry, I should have been more clear. I’m running a custom docker container that is derived from jboss/keycloak:10.0.2 that has some custom SPI implementations unrelated to UMA. I can build from source and update my custom container.

Here it is https://github.com/pedroigor/keycloak/tree/authz-perf. There is more than just UMA, but I would like to see how the UMA part works for you.

I’m still doing changes, I should send more once I finish some more tests on my side.

Thanks.

1 Like

Hmm, I’m seeing worse performance now. The same query is still dictating the poor performance. Maybe I was wrong and it’s not the PolicyEntity namedQuery findPolicyIdByNullResourceScope. I was just eyeballing the code and trying to derive where the query was coming from in the endpoint.

Interesting, because ticket-granted resources should be fetched from database and don’t imply a complete policy evaluation.

I’m going to look at your set-up and see if I can re-create it here.

I’ve sent this PR https://github.com/keycloak/keycloak/pull/7227. I saw improvements in other areas too.

Now I realized that from the description in that JIRA, you’re not really granting access to resources based on tickets. But manually creating permissions and policies for those resources. Is that correct?

That change things a bit :slight_smile:

I’m going to try to create your set-up.

Yeah, I’m creating resources and assigning 2 user policies and 1 group policy to them for ~120k resources. From there, our clients are going through the UMA flow to receive a RPT token by exchanging a permission ticket+access_token for the RPT using grant_type urn:ietf:params:oauth:grant-type:uma-ticket.

Here’s the flow between client, resource server and authorization server.

  1. client calls a resource server endpoint that is protected by UMA
  2. resource server calls keycloak for a permission ticket for the resource w/8 scopes
  3. resource server returns a 401 www-authenticate header w/permission ticket
  4. client logs in the user with grant_type password with keycloak and gets an oidc access_token
  5. client calls keycloak to exchange oidc access_token and permission ticket for an RPT

Step 5 is where we are seeing the poor performance from keycloak which looks to be caused by the MySQL query. It is working, it’s just taking a long time.

Let me know if that makes sense, I can add more details if it helps.

It makes sense, however, how many users and groups per policies do you have ?

I’m trying to figure out what can be causing this because usually you get a good throughput when asking permissions for a small set of resources (as opposed to obtain all 120k once).

I’m preparing a set-up now using 150k resources and 150k permissions, as a start.

The group policies are always 1 group to 1 resource. One of the optimizations here is for me to see if 1 policy with 1 group and a ton of resources works better. The user policies have an average of 34 users to 1 resource but a max of 761 users and a standard deviation of 125. I’m unable to get the median easily but looking at the code that generates the synthetic data most of the user policies should have 1-2 users and 1 resource.

I see.

Could you try it out again https://github.com/keycloak/keycloak/pull/7227 ?

A few more changes there. I did not manage to set-up a policy model like yours yet, but I’ll.

What I did was to create 150k resources and 150k permissions, where these permissions are assigned to 3-5 role policies.

Numbers are much better when querying a single or all permissions from the server. Of course, querying everything is still expensive, but once the cache is hot I’m able to obtain 100k permissions in 3-4s. Not so bad as a start.

For single permissions, I’m getting 4-9ms.

Please, let me know how it goes for you.

Btw, one thing to keep in mind when tunning the server is the number of cache entries for the authorization cache.

It helps a lot when you have too many data:

<local-cache name="authorization">
    <object-memory size="500000"/>
</local-cache>

For clustering, the cache name is the same.

Btw, can you add to the JIRA a JSON that I can import where you have a single set of resource/permissions and etc ? I want to get the idea and replicate here in large scale.