Immuta Performance Optimization on CDH Clusters
Audience: System Administrators
Content Summary: This page describes strategies for improving performance of Immuta's NameNode plugin on CDH clusters.
Overview
Immuta operates within a locked operation in the NameNode when granting / denying permissions based on Immuta policies. This section contains configuration and strategies to prevent RPC queue latency, threads waiting, or other issues on cluster-wide file permission checks.
Deployment Architecture
Isolated HDFS Namespace
Best Practice: NameNode Plugin Configuration
Immuta recommends configuring the NameNode Plugin to check permissions only on the NameNode(s) that oversee the data that you want to protect.
For example, say that you currently have a federated HDFS NameNode architecture with three nameservices: nameservice1, nameservice2, and nameservice3. The HDFS federation in this example is distributed across these nameservices as described below.
- nameservice1: /data, /tmp, /user
- nameservice2: /data2
- nameservice3: /data3
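As a rough illustration of the layout above, the following sketch shows how such a federation could be exposed to clients through a ViewFS mount table. This is an assumption: the original example does not say whether ViewFS is used, and the mount table name "default" is chosen only for illustration.

```xml
<!-- Sketch only: assumes the federation is presented to clients via ViewFS.
     The mount table name "default" is an illustrative choice. -->
<property>
  <name>fs.viewfs.mounttable.default.link./data</name>
  <value>hdfs://nameservice1/data</value>
</property>
<property>
  <name>fs.viewfs.mounttable.default.link./tmp</name>
  <value>hdfs://nameservice1/tmp</value>
</property>
<property>
  <name>fs.viewfs.mounttable.default.link./user</name>
  <value>hdfs://nameservice1/user</value>
</property>
<property>
  <name>fs.viewfs.mounttable.default.link./data2</name>
  <value>hdfs://nameservice2/data2</value>
</property>
<property>
  <name>fs.viewfs.mounttable.default.link./data3</name>
  <value>hdfs://nameservice3/data3</value>
</property>
```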
Suppose you know that all the sensitive data that you want to protect with Immuta is located under /data3. To achieve optimum performance in this case, add the Immuta NameNode-only configuration (hdfs-site.xml) to the role config group for nameservice3, and leave it out of nameservice1 and nameservice2. The public / client Immuta configuration (core-site.xml) should still be configured cluster-wide. See Immuta CDH Integration Installation for more details about these configuration groupings.
One caveat is that Immuta's Vulcan service requires the Immuta NameNode Plugin to oversee user credentials that are stored in /user/<username> by default. Vulcan also stores some configuration under /user/immuta by default. This is a problem because /user resides under nameservice1, and the goal is to operate the Immuta NameNode Plugin only on nameservice3.
A simple solution to this problem is to create a new directory for these credentials, /data3/immuta_creds for example, and configure the NameNode Plugin and the Vulcan service to use this directory instead of /user. Changing this requires the configuration modifications listed below.
- HDFS - Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml
  - Set immuta.generated.api.key.dir and immuta.credentials.dir to /data3/immuta_creds.
- Immuta - Immuta Spark 2 Vulcan Server Advanced Configuration Snippet (Safety Valve) for session/generator.xml
  - Set immuta.meta.store.token.dir to /data3/immuta_creds/immuta/tokens.
  - Set immuta.meta.store.remote.token.dir to /data3/immuta_creds/immuta/remotetokens.
  - Set immuta.configuration.id.file.config to hdfs://nameservice3/data3/immuta_creds/immuta/config_id.
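The modifications above might look like the following when entered as safety valve snippets. This is a sketch that assumes both safety valves accept the standard Hadoop property XML format and that /data3/immuta_creds is the directory chosen in this example.

For the cluster-wide core-site.xml safety valve:

```xml
<!-- Sketch: credentials directory moved under nameservice3 (example path). -->
<property>
  <name>immuta.generated.api.key.dir</name>
  <value>/data3/immuta_creds</value>
</property>
<property>
  <name>immuta.credentials.dir</name>
  <value>/data3/immuta_creds</value>
</property>
```

For the Vulcan Server session/generator.xml safety valve:

```xml
<!-- Sketch: Vulcan token and configuration locations under the same example directory. -->
<property>
  <name>immuta.meta.store.token.dir</name>
  <value>/data3/immuta_creds/immuta/tokens</value>
</property>
<property>
  <name>immuta.meta.store.remote.token.dir</name>
  <value>/data3/immuta_creds/immuta/remotetokens</value>
</property>
<property>
  <name>immuta.configuration.id.file.config</name>
  <value>hdfs://nameservice3/data3/immuta_creds/immuta/config_id</value>
</property>
```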
Note that you will need to manually create the /data3/immuta_creds/immuta directory and set its permissions such that only the immuta user can read / write in that directory. The /data3/immuta_creds directory should also be world writable so that each user's directory can be created the first time that user interacts with Immuta on the cluster.
Configuration
Essential Performance Tuning Settings
- immuta.permission.paths.to.enforce
  - Description: A comma delimited list of paths to enforce when checking permissions on HDFS files. This ensures that API calls to the Immuta web service are only made when permissions are being checked on the paths that you specify in this configuration. This also means that you can only create data sources against data that lives under these paths, and the Immuta Workspace must be under one of these paths as well. Alternatively, immuta.permission.paths.to.ignore can be set to a list of paths that you know do not contain Immuta data; API calls will then never be made against those paths. Setting both immuta.permission.paths.to.ignore and immuta.permission.paths.to.enforce at the same time is unsupported.
- immuta.permission.groups.to.enforce
  - Description: A comma delimited list of groups that must go through Immuta when checking permissions on HDFS files. If this configuration item is set, then fallback authorizations will apply to everyone by default, unless they are in a group on this list. If a user is on both the enforce list and the ignore list, then their permissions will be checked with Immuta (i.e., the enforce configuration item takes precedence). This may improve NameNode performance by only making permission check API calls for the subset of users who fall under Immuta enforcement.
- immuta.permission.source.cache.enabled
  - Description: Denotes whether a background thread should be started to periodically cache paths from Immuta that represent Immuta-protected paths in HDFS. Enabling this increases NameNode performance because it prevents the NameNode plugin from calling the Immuta web service for paths that do not back HDFS data sources. For performance optimization, it is best to enable this cache to act as a "backup" to immuta.permission.paths.to.enforce.
- immuta.permission.source.cache.timeout.seconds
  - Description: The time between calls to sync/cache all paths that back Immuta data sources in HDFS. You can increase this value to further reduce the number of API calls made from the NameNode.
- immuta.permission.workspace.base.path.override
  - Description: This configuration item can be set so that the NameNode does not have to retrieve the Immuta HDFS workspace base path periodically from the Immuta API.
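Taken together, the settings above might be combined as in the sketch below for the /data3 example. Every value here is an illustrative assumption rather than a recommendation: the group names and workspace path are hypothetical, the cache interval is arbitrary, and the snippet assumes these properties belong in the NameNode-only configuration and use the usual Hadoop "true"/"false" boolean format.

```xml
<!-- Sketch only: all values are hypothetical and should be adapted to your cluster. -->
<property>
  <!-- Only check Immuta permissions for paths under /data3. -->
  <name>immuta.permission.paths.to.enforce</name>
  <value>/data3</value>
</property>
<property>
  <!-- Hypothetical group names; only members go through Immuta checks. -->
  <name>immuta.permission.groups.to.enforce</name>
  <value>analysts,data_science</value>
</property>
<property>
  <!-- Start the background thread that caches Immuta-protected paths. -->
  <name>immuta.permission.source.cache.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Arbitrary example interval, in seconds, between cache syncs. -->
  <name>immuta.permission.source.cache.timeout.seconds</name>
  <value>600</value>
</property>
<property>
  <!-- Hypothetical workspace path; kept under an enforced path. -->
  <name>immuta.permission.workspace.base.path.override</name>
  <value>/data3/immuta_workspace</value>
</property>
```

Note that immuta.permission.paths.to.ignore is deliberately left out, since setting it together with immuta.permission.paths.to.enforce is unsupported.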
Advanced Cache and Network Settings
There are also a wide variety of cache and network settings that can be used to fine-tune performance. You can refer to the Configuration Guide for details on each of these items.
immuta.permission.source.cache.timeout.seconds
immuta.permission.source.cache.retries
immuta.permission.request.initial.delay.milliseconds
immuta.permission.request.socket.timeout
immuta.no.data.source.cache.timeout.seconds
immuta.hive.impala.cache.timeout.seconds
immuta.canisee.cache.timeout.seconds
immuta.data.source.cache.timeout.seconds
immuta.canisee.metastore.cache.timeout.seconds
immuta.canisee.non.user.cache.timeout.seconds
immuta.canisee.num.retries
immuta.project.user.cache.timeout.seconds
immuta.project.cache.timeout.seconds
immuta.project.forbidden.cache.timeout.seconds
immuta.permission.system.details.retries
Debugging Suspected Performance Issues
See Immuta Log Analysis Tool for CDH Deployments for instructions on how to identify performance issues in the Immuta NameNode Plugin.