Privacera Documentation

Configure Audits for Databricks Unity Catalog on PrivaceraCloud

This section describes how to configure audits for Databricks Unity Catalog on PrivaceraCloud.

Setting up Databricks audits

To set up audits, follow the instructions provided by Databricks:

Audit Unity Catalog events

This sends activity from each workspace to Amazon S3.

Next, create an external table so that PolicySync can read the audit data. First, follow the instructions provided by Databricks to create an external location from the Amazon S3 bucket:

Manage external locations and storage credentials

After the external location is set up, run the following command to create the external table from that location.

Note

In the following command:

  • Replace <catalog-name>, <schema-name>, and <table-name> with the catalog, schema, and table name where you want to create the table.

  • Replace <bucket-name> and <delivery-path-prefix> with the values used while enabling the audits in Databricks.

CREATE TABLE <catalog-name>.<schema-name>.<table-name> (
  accountId STRING,
  actionName STRING,
  auditLevel STRING,
  requestId STRING,
  requestParams STRING,
  response STRING,
  serviceName STRING,
  sessionId STRING,
  sourceIPAddress STRING,
  timestamp BIGINT,
  userAgent STRING,
  userIdentity STRING,
  version STRING,
  workspaceId STRING,
  date DATE
)
USING json
PARTITIONED BY (workspaceId, date)
LOCATION 's3://<bucket-name>/<delivery-path-prefix>';
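As an illustration only, the placeholder substitution described in the note can be sketched in Python. The helper name build_create_table and the identifier check are assumptions made for this example and are not part of any Privacera or Databricks tooling; the column list is copied from the command above.

```python
import re

# Hypothetical helper for illustration; not a Privacera or Databricks API.
# Assumption: plain identifiers only (letters, digits, underscores).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# Column names and types copied from the CREATE TABLE command in this section.
AUDIT_COLUMNS = [
    ("accountId", "STRING"), ("actionName", "STRING"), ("auditLevel", "STRING"),
    ("requestId", "STRING"), ("requestParams", "STRING"), ("response", "STRING"),
    ("serviceName", "STRING"), ("sessionId", "STRING"),
    ("sourceIPAddress", "STRING"), ("timestamp", "BIGINT"),
    ("userAgent", "STRING"), ("userIdentity", "STRING"), ("version", "STRING"),
    ("workspaceId", "STRING"), ("date", "DATE"),
]


def build_create_table(catalog, schema, table, bucket, prefix):
    """Return the CREATE TABLE statement with all placeholders filled in."""
    for name in (catalog, schema, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    columns = ", ".join(f"{col} {typ}" for col, typ in AUDIT_COLUMNS)
    # Surrounding slashes are stripped so the path has exactly one separator.
    location = f"s3://{bucket}/{prefix.strip('/')}"
    return (
        f"CREATE TABLE {catalog}.{schema}.{table} ({columns}) "
        f"USING json PARTITIONED BY (workspaceId, date) "
        f"LOCATION '{location}'"
    )


print(build_create_table("main", "audit", "uc_logs", "my-bucket", "audit-logs"))
```

The catalog, schema, table, bucket, and prefix values in the last line are made-up examples; use the values from your own environment.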

Enabling audits in the PolicySync connector

Perform the following steps to enable audits in the PolicySync connector:

  1. In the PrivaceraCloud portal, go to Settings > Applications.

  2. In the Databricks Unity Catalog application, go to Access management > BASIC.

  3. Click the toggle button to turn on Enable access audits.

  4. Under Audit table Path, enter the path of the table you created, in the format catalog_name.schema_name.table_name. See Setting up Databricks audits.
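The three-part format expected in step 4 can be checked before it is entered. The following is a minimal sketch; split_audit_table_path is a hypothetical helper written for this example, and PolicySync may apply different or additional validation.

```python
def split_audit_table_path(path):
    """Split 'catalog.schema.table' into its three parts or raise ValueError."""
    parts = [part.strip() for part in path.split(".")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected catalog_name.schema_name.table_name, got {path!r}"
        )
    return parts[0], parts[1], parts[2]


catalog, schema, table = split_audit_table_path("main.audit.uc_logs")
print(catalog, schema, table)
```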

Note

Under the ADVANCED tab, go to Users to exclude when fetching access audits. Use this property to list any users whose audits should not be shown. It is recommended to set this to the user whose access token is used by PolicySync. Otherwise, audits are shown for PolicySync activity, which can produce a lot of clutter.

Notes for delays in loading audits

The following notes explain the delays in loading audits:

Note

After PolicySync loads audits from Databricks, it waits four hours before loading the next set of audits. This wait time can be changed, but it is set this way to prevent PolicySync from using too much warehouse uptime. Refer to the section Properties summary for more details.

Note

When PolicySync loads audits, it loads audits up to one hour before the present time. This makes sure that each workspace has time to send its events to Amazon S3. This value can also be changed, but changing it is not recommended. Refer to the section Properties summary for more details.
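The timing described in these two notes can be illustrated with a little date arithmetic. This is a sketch under stated assumptions: the function names are invented for the example, and it assumes the four-hour wait is counted from the previous load, which this section does not specify.

```python
from datetime import datetime, timedelta, timezone

LOAD_INTERVAL = timedelta(hours=4)  # wait between loads, per the first note
LOAD_DELAY = timedelta(hours=1)     # newest events are skipped, per the second note


def audit_cutoff(now, delay=LOAD_DELAY):
    """Latest event time that a load running at `now` would include."""
    return now - delay


def next_load_time(last_load, interval=LOAD_INTERVAL):
    """Earliest time of the following load (assumed to count from the last load)."""
    return last_load + interval


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print("audits loaded up to:", audit_cutoff(now).isoformat())
print("next load not before:", next_load_time(now).isoformat())
```

With these values, an event that occurs just after a load's cutoff only becomes visible at the following load, so it can take roughly five hours to appear.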

Understanding the audits loaded by PolicySync

Whenever a table is created or deleted, you can see an audit with the access type set to createTable or deleteTable. These correspond to the Unity Catalog events documented by Databricks. This means you can also see actions such as deleteSchema and updateExternalLocation, as these actions are also listed in that documentation.

PolicySync sends all of the Unity Catalog events listed in the documentation, except for those that start with get or list, for example getTables or listSchemas. These events are not sent because they only read metadata and do not represent actual data access. PolicySync also does not send the metadataSnapshot and metadataAndPermissionsSnapshot events listed in the documentation.

Whenever a table is accessed, there is an audit with access type generateTemporaryTableCredential. Whenever an external location is accessed, there is an audit with access type generateTemporaryPathCredential. In both cases, the audit does not specify the command run on the resource, but it does specify who accessed the resource, from which workspace, and whether it was a read or a write access.

To summarize, the following are the two categories of audits:

  • For data reads and data writes, look for access types generateTemporaryTableCredential and generateTemporaryPathCredential.

  • For resources being created or deleted, or for metadata updates, look for an access type that corresponds to a Unity Catalog event, such as createTable.
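The rules above can be summarized as a small classifier. This is only a sketch of the logic described in this section: the function and category names are invented for the example, the third category is a catch-all that is not checked against the Databricks event list, and the sketch does not model the verbose-mode access types.

```python
# Access types that indicate data reads or writes, as named in this section.
DATA_ACCESS_TYPES = {
    "generateTemporaryTableCredential",
    "generateTemporaryPathCredential",
}

# Events this section says are not sent, in addition to get*/list* events.
SNAPSHOT_EVENTS = {"metadataSnapshot", "metadataAndPermissionsSnapshot"}


def classify_access_type(action_name):
    """Return 'data_access', 'not_sent', or 'resource_change' (hypothetical labels)."""
    if action_name in DATA_ACCESS_TYPES:
        return "data_access"
    if action_name in SNAPSHOT_EVENTS or action_name.startswith(("get", "list")):
        return "not_sent"
    # Everything else is treated as a create, delete, or metadata update.
    return "resource_change"


for name in ("createTable", "getTables", "generateTemporaryTableCredential"):
    print(name, "->", classify_access_type(name))
```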

Enable verbose logging (Optional)

The audits mentioned in the preceding section do not contain the queries that users run. However, it is possible to configure PolicySync to load this information. This captures only queries run on warehouses and commands run in notebooks. It does NOT show what was done from jobs.

Follow the instructions provided by Databricks to enable verbose logging on the workspace:

Enable verbose audit logs

Repeat these instructions for each workspace from which you want to get verbose logs.

Next, when configuring PolicySync, in the ADVANCED properties, under Audit mode, set the value to verbose.

Now, when a query is run on a warehouse, there should be an audit with access type commandSubmit that shows the query that was run. When a command is run in a notebook, there should be an audit with access type runCommand that shows the command that was run.

Caution

This section shows actions performed with warehouses and notebooks, but NOT jobs. Therefore, the actions produced by verbose logging should NOT be considered a completely accurate summary of data access. To get a completely accurate view of who is accessing what data, check the events generateTemporaryTableCredential and generateTemporaryPathCredential.

If verbose logging is used, the one-hour delay mentioned in the section Notes for delays in loading audits increases to two hours, as these types of actions can take longer to be sent to Amazon S3.
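The verbose-mode behavior described above can be captured in a short lookup. This is an illustrative sketch: the names are invented for the example, and only the simple and verbose modes are covered because this section does not state a delay for any other mode.

```python
# Where each verbose access type comes from, per the description above.
VERBOSE_ACCESS_TYPES = {
    "commandSubmit": "query run on a warehouse",
    "runCommand": "command run in a notebook",
}


def load_delay_hours(audit_mode):
    """Hours of recent events skipped at load time, per this section."""
    if audit_mode == "simple":
        return 1
    if audit_mode == "verbose":
        return 2
    raise ValueError(f"no delay documented here for audit mode {audit_mode!r}")


print(load_delay_hours("verbose"))
print(VERBOSE_ACCESS_TYPES["commandSubmit"])
```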

Configure the workspaces to receive audits from (Optional)

If you want to receive audits from only some workspaces, you can configure this. In the ADVANCED properties, under Workspaces to get audits from, enter the workspace IDs to get audits from.

This should be a comma-separated list of IDs. For example, 1023707303840399,1164438425884509. If the property is not set, PolicySync loads audits from all workspaces.
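The semantics of this property (a comma-separated list, where an empty value means all workspaces) can be sketched as follows. The function names are hypothetical, and the tolerance for spaces around commas is an assumption of this example, not documented PolicySync behavior.

```python
def parse_workspace_ids(value):
    """Parse a comma-separated list of numeric workspace IDs into a list of strings."""
    ids = [item.strip() for item in value.split(",") if item.strip()]
    for item in ids:
        if not item.isdigit():
            raise ValueError(f"workspace ID must be numeric: {item!r}")
    return ids


def should_load(workspace_id, configured_ids):
    """An empty configuration means audits are loaded from every workspace."""
    return not configured_ids or workspace_id in configured_ids


configured = parse_workspace_ids("1023707303840399,1164438425884509")
print(should_load("1023707303840399", configured))
print(should_load("42", configured))
```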

Properties summary

The following properties can be set under the ADVANCED tab:

  • Audit mode: By default, it is set to simple. It can be set to verbose if verbose logging is enabled.

  • Users to exclude when fetching audits: A comma-separated list of users that should be excluded from the audits. For example, user1@gmail.com,user2@gmail.com. It is recommended to set this to the user whose access token is used by PolicySync, to prevent PolicySync activity from showing up in the audits.

  • Workspaces to get audits from: Set this property if you want to get audits from some but not all workspaces. This is a comma-separated list of workspace IDs to get audits from. For example, 1023707303840399,1164438425884509. Leave this property blank if you want audits from all workspaces.

The following properties can be changed in the custom properties:

  • ranger.policysync.connector.0.audit.interval.sec=14400:

    As mentioned in the section Notes for delays in loading audits, after loading audits, PolicySync will wait four hours before loading audits again. This property changes the wait time to the specified number of seconds.

    Caution

    If the wait time is reduced, audits are loaded more frequently and the Databricks warehouse starts up more often, which increases costs. It is NOT RECOMMENDED to lower this property unless the additional costs of keeping the warehouse up are acceptable.

  • ranger.policysync.connector.0.audit.delay.seconds=7200:

    As mentioned in the section Notes for delays in loading audits, PolicySync only loads audits up to the present time minus one hour. This property changes that delay to the specified number of seconds. The example value of 7200 corresponds to two hours.

    Caution

    It is NOT RECOMMENDED to lower this property, as audits can be missed if the logs take more than one hour to reach Amazon S3. The logs do not normally take that long, but the time for audits to reach Amazon S3 can be variable and can depend on the workspace.
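To get a feel for the cost trade-off described in the first caution, the number of loads per day can be derived from the interval. This is a rough sketch with invented function names; it assumes each load starts the warehouse at most once, which is an assumption of this example and not a documented guarantee.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def loads_per_day(interval_sec):
    """Number of audit loads per day for a given audit.interval.sec value."""
    if interval_sec <= 0:
        raise ValueError("interval must be a positive number of seconds")
    return SECONDS_PER_DAY / interval_sec


def seconds_to_hours(seconds):
    """Convert a property value in seconds to hours for easier reading."""
    return seconds / 3600


print(loads_per_day(14400))    # interval value shown above (four hours)
print(loads_per_day(3600))     # a lowered interval of one hour
print(seconds_to_hours(7200))  # delay value shown above
```

Lowering the interval from 14400 to 3600 seconds quadruples the number of loads per day, which is why the caution above ties this property to warehouse cost.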

Enable audits using the query history API (deprecated)

Before the centralized audit feature was available, the query history API was the way to get audit logs. For details, see:

Databricks SQL Query History

You can still get the audit logs using the query history API method. To do so, set the Audit mode property to workspace_api.

However, enabling audits using the query history API method is NOT RECOMMENDED, as it has the following limitations:

  • It gets query history from only one workspace, which is the workspace that PolicySync connects to. Queries from other workspaces are not loaded from this API.

  • It gets only queries run on warehouses. It does not get activity done through notebooks, jobs, or the web UI.

Enabling audits using the query history API method does not require any setup. Because the setup for the centralized audits method is time consuming, the query history API method can be useful for getting some audit data without any setup.

However, the audit logs received from the query history API method are incomplete, so in most cases they are not useful. This feature is therefore considered deprecated, and Privacera DOES NOT RECOMMEND using it.