Immuta Performance Optimization on CDH Clusters

Audience: System Administrators

Content Summary: This page describes strategies for improving performance of Immuta's NameNode plugin on CDH clusters.

Overview

The Immuta NameNode plugin runs inside a locked operation in the NameNode when it grants or denies permissions based on Immuta policies. Because a slow permission check made under that lock holds up other requests, this section covers configuration and deployment strategies that help prevent RPC queue latency, waiting threads, and related problems during cluster-wide file permission checks.

Deployment Architecture

Isolated HDFS Namespace

Best Practice: NameNode Plugin Configuration

Immuta recommends configuring the NameNode Plugin to check permissions only on the NameNode(s) that oversee the data you want to protect.

For example, suppose you have a federated HDFS NameNode architecture with three nameservices: nameservice1, nameservice2, and nameservice3. The HDFS namespace in this example is distributed across these nameservices as described below.

nameservice1: /data, /tmp, /user
nameservice2: /data2
nameservice3: /data3

Suppose you know that all the sensitive data that you want to protect with Immuta is located under /data3. For the best performance in this case, add the Immuta NameNode-only configuration (hdfs-site.xml) to the role config group for nameservice3 only, and leave it out of the role config groups for nameservice1 and nameservice2. The public / client Immuta configuration (core-site.xml) should still be configured cluster-wide. See Immuta CDH Integration Installation for more details about these configuration groupings.

One caveat is that Immuta's Vulcan service requires the Immuta NameNode Plugin to oversee user credentials, which are stored in /user/<username> by default. Vulcan also stores some configuration under /user/immuta by default. This is a problem because /user resides under nameservice1, and the goal is to operate the Immuta NameNode Plugin only on nameservice3.

A simple solution to this problem is to create a new directory for these credentials, /data3/immuta_creds for example, and configure the NameNode Plugin and the Vulcan service to use this directory instead of /user. Changing this requires the configuration modifications listed below.

HDFS - Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml
- Set immuta.generated.api.key.dir and immuta.credentials.dir to /data3/immuta_creds.
Immuta - Immuta Spark 2 Vulcan Server Advanced Configuration Snippet (Safety Valve) for session/generator.xml
- Set immuta.meta.store.token.dir to /data3/immuta_creds/immuta/tokens.
- Set immuta.meta.store.remote.token.dir to /data3/immuta_creds/immuta/remotetokens.
- Set immuta.configuration.id.file.config to hdfs://nameservice3/data3/immuta_creds/immuta/config_id.
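As a rough illustration, the settings listed above might be entered as the following property fragments. This is a sketch only: it assumes both safety valves accept standard Hadoop property XML, and it uses the example /data3/immuta_creds directory and nameservice3 from this page, so adjust the values to your own layout.

```xml
<!-- HDFS: Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml -->
<property>
  <name>immuta.generated.api.key.dir</name>
  <value>/data3/immuta_creds</value>
</property>
<property>
  <name>immuta.credentials.dir</name>
  <value>/data3/immuta_creds</value>
</property>

<!-- Immuta: Immuta Spark 2 Vulcan Server Advanced Configuration Snippet (Safety Valve) for session/generator.xml -->
<property>
  <name>immuta.meta.store.token.dir</name>
  <value>/data3/immuta_creds/immuta/tokens</value>
</property>
<property>
  <name>immuta.meta.store.remote.token.dir</name>
  <value>/data3/immuta_creds/immuta/remotetokens</value>
</property>
<property>
  <name>immuta.configuration.id.file.config</name>
  <value>hdfs://nameservice3/data3/immuta_creds/immuta/config_id</value>
</property>
```

Note that only immuta.configuration.id.file.config uses a fully qualified hdfs:// URI here; the other values are plain paths, exactly as listed above.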

Note that you will need to manually create the /data3/immuta_creds/immuta directory and set its permissions so that only the immuta user can read and write in that directory. The /data3/immuta_creds directory should also be world-writable so that each user's directory can be created the first time that user interacts with Immuta on the cluster.

Configuration

Essential Performance Tuning Settings

immuta.permission.paths.to.enforce
- Description: A comma-delimited list of paths to enforce when checking permissions on HDFS files. API calls to the Immuta web service are made only when permissions are checked on the paths listed here. As a consequence, you can only create data sources against data that lives under these paths, and the Immuta Workspace must also be under one of them. Alternatively, set immuta.permission.paths.to.ignore to a list of paths that you know do not contain Immuta data; API calls will then never be made against those paths. Setting both immuta.permission.paths.to.ignore and immuta.permission.paths.to.enforce at the same time is unsupported.
immuta.permission.groups.to.enforce
- Description: A comma-delimited list of groups that must go through Immuta when checking permissions on HDFS files. If this configuration item is set, fallback authorizations apply to every user who is not in a group on this list. If a user is on both the enforce list and the ignore list, their permissions are checked with Immuta (i.e., the enforce configuration item takes precedence). This may improve NameNode performance by limiting permission check API calls to the subset of users who fall under Immuta enforcement.
immuta.permission.source.cache.enabled
- Description: Denotes whether a background thread should be started to periodically cache paths from Immuta that represent Immuta-protected paths in HDFS. Enabling this increases NameNode performance because it prevents the NameNode plugin from calling the Immuta web service for paths that do not back HDFS data sources. For performance optimization, it is best to enable this cache to act as a "backup" to immuta.permission.paths.to.enforce.
immuta.permission.source.cache.timeout.seconds
- Description: The time between calls to sync/cache all paths that back Immuta data sources in HDFS. You can increase this value to further reduce the number of API calls made from the NameNode.
immuta.permission.workspace.base.path.override
- Description: This configuration item can be set so that the NameNode does not have to retrieve the Immuta HDFS workspace base path periodically from the Immuta API.
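Continuing the /data3 example, the settings above might be combined as in the fragment below. This is a minimal sketch, not a recommended configuration: it assumes these properties belong in the NameNode-only configuration (hdfs-site.xml) for nameservice3, and the group name, the timeout value, and the workspace path are hypothetical placeholders chosen for illustration.

```xml
<!-- Assumed location: NameNode-only configuration (hdfs-site.xml) for nameservice3 -->

<!-- Only call the Immuta web service for paths under /data3 -->
<property>
  <name>immuta.permission.paths.to.enforce</name>
  <value>/data3</value>
</property>

<!-- Optional. "immuta_enforced" is a hypothetical group name -->
<property>
  <name>immuta.permission.groups.to.enforce</name>
  <value>immuta_enforced</value>
</property>

<!-- Cache the paths that back Immuta data sources in HDFS -->
<property>
  <name>immuta.permission.source.cache.enabled</name>
  <value>true</value>
</property>

<!-- Hypothetical value: a longer interval means fewer sync calls from the NameNode -->
<property>
  <name>immuta.permission.source.cache.timeout.seconds</name>
  <value>600</value>
</property>

<!-- Hypothetical path. The workspace must live under an enforced path -->
<property>
  <name>immuta.permission.workspace.base.path.override</name>
  <value>/data3/immuta_workspace</value>
</property>
```

Because immuta.permission.paths.to.enforce is set here, immuta.permission.paths.to.ignore is deliberately left out; setting both is unsupported.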

Advanced Cache and Network Settings

A wide variety of cache and network settings can also be used to fine-tune performance. Refer to the Configuration Guide for details on each of the items below.

immuta.permission.source.cache.timeout.seconds
immuta.permission.source.cache.retries
immuta.permission.request.initial.delay.milliseconds
immuta.permission.request.socket.timeout
immuta.no.data.source.cache.timeout.seconds
immuta.hive.impala.cache.timeout.seconds
immuta.canisee.cache.timeout.seconds
immuta.data.source.cache.timeout.seconds
immuta.canisee.metastore.cache.timeout.seconds
immuta.canisee.non.user.cache.timeout.seconds
immuta.canisee.num.retries
immuta.project.user.cache.timeout.seconds
immuta.project.cache.timeout.seconds
immuta.project.forbidden.cache.timeout.seconds
immuta.permission.system.details.retries

Debugging Suspected Performance Issues

See Immuta Log Analysis Tool for CDH Deployments for instructions on how to identify performance issues in the Immuta NameNode Plugin.