
PrivaceraCloud Documentation


Azure Data Factory Integration with Privacera Enabled Databricks Cluster


This topic describes how to integrate Azure Data Factory with a Privacera-enabled Databricks cluster.

Prerequisites

Create an Azure Data Factory instance. For more information, see Create a data factory.

Create a pipeline with new Databricks cluster

  1. Open Azure Data Factory studio.

  2. Open Author > Factory Resources > Pipelines and click the ellipsis to create a new pipeline. For more information, see Create a new pipeline.

    The Properties section is displayed.

  3. Enter the Name and Description for the Pipeline.

  4. On the Activities section, select Databricks > Notebook and then drag the Notebook to the right panel to configure the cluster.

  5. Select the Notebook and navigate to the Azure Databricks tab.

  6. Click +New to create a Databricks linked service.

    The New linked service section is displayed.

  7. Enter or choose appropriate values in the New linked service section:

    Table 6. New linked service settings (new job cluster)

    Name: Auto-populated, or enter a new name.
    Description: Enter a brief description.
    Connect via integration runtime: Auto-populated, or select a value from the dropdown.
    Account selection method: Select a value from the options.
    Azure subscription: Select your Azure subscription.
    Databricks Workspace URL: Enter the Databricks workspace URL.
    Authentication Type: Select Managed service identity.
    Workspace resource ID: Auto-populated based on the Databricks workspace URL.
    Select cluster: Select New job cluster.
    Cluster version: Select the appropriate cluster version.
    Cluster node type: Select the appropriate cluster node type.
    Python Version: Select the appropriate Python version.
    Worker options: Select Fixed.
    Workers: Enter the number of workers.
    Additional cluster settings: Enter the following Spark properties. For more information, see Spark Properties.

    • spark.hadoop.privacera.fgac.use.cluster.owner true

      Default value: false

    • spark.hadoop.privacera.fgac.use.cluster.ownertag owner_email

      Default value: Owner

    Databricks init scripts: Add the Databricks init scripts. For more information, see Obtain Init Script for Databricks FGAC.
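Behind the UI, Data Factory stores these linked-service settings as a JSON definition. The following is a minimal sketch for the new-job-cluster case, assuming Managed service identity authentication; the workspace URL, resource ID, cluster version, node type, worker count, and init-script path are placeholders to replace with your own values:

```json
{
  "name": "PrivaceraDatabricksLinkedService",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://<your-workspace>.azuredatabricks.net",
      "authentication": "MSI",
      "workspaceResourceId": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Databricks/workspaces/<workspace-name>",
      "newClusterVersion": "<cluster-version>",
      "newClusterNodeType": "<node-type>",
      "newClusterNumOfWorker": "<number-of-workers>",
      "newClusterSparkConf": {
        "spark.hadoop.privacera.fgac.use.cluster.owner": "true",
        "spark.hadoop.privacera.fgac.use.cluster.ownertag": "owner_email"
      },
      "newClusterInitScripts": [
        "dbfs:/<path-to-privacera-init-script>"
      ]
    }
  }
}
```

Because the linked service defines a new job cluster, the Privacera Spark properties and init script travel with the linked service itself and are applied to every cluster the pipeline spins up.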



Create a pipeline with an existing Databricks cluster

  1. Open Azure Data Factory studio.

  2. Open Author > Factory Resources > Pipelines and click the ellipsis to create a new pipeline. For more information, see Create a new pipeline.

    The Properties section is displayed.

  3. Enter the Name and Description for the Pipeline.

  4. On the Activities section, select Databricks > Notebook and then drag the Notebook to the right panel to configure the cluster.

  5. Select the Notebook and navigate to the Azure Databricks tab.

  6. Click +New to create a Databricks linked service.

    The New linked service section is displayed.

  7. Enter or choose appropriate values in the New linked service section:

    Table 7. New linked service settings (existing interactive cluster)

    Name: Auto-populated, or enter a new name.
    Description: Enter a brief description.
    Connect via integration runtime: Auto-populated, or select a value from the dropdown.
    Account selection method: Select a value from the options.
    Azure subscription: Select your Azure subscription.
    Databricks workspace: Select the appropriate Databricks workspace.
    Select cluster: Select Existing interactive cluster.
    Databricks workspace URL: Enter the Databricks workspace URL.
    Authentication Type: Select Managed service identity.
    Workspace resource ID: Auto-populated based on the Databricks workspace URL.
    Existing cluster ID: Select the ID of your existing cluster. Update the existing cluster with the following additional Spark properties. For more information, see Spark Properties.

    • spark.hadoop.privacera.fgac.use.cluster.owner true

      Default value: false

    • spark.hadoop.privacera.fgac.use.cluster.ownertag owner_email

      Default value: Owner

    Databricks init scripts: Existing clusters already contain Databricks init scripts. Update the cluster with the Privacera plugin. For more information, see Obtain Init Script for Databricks FGAC.



  8. Click Test Connection. Once the connection succeeds, click Create.
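For the existing-cluster variant, the linked-service JSON references the cluster by ID instead of defining a new job cluster, so the Privacera Spark properties and init script must be configured on the cluster itself in the Databricks workspace, not in the linked service. A minimal sketch with placeholder values:

```json
{
  "name": "PrivaceraDatabricksLinkedService",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://<your-workspace>.azuredatabricks.net",
      "authentication": "MSI",
      "workspaceResourceId": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Databricks/workspaces/<workspace-name>",
      "existingClusterId": "<existing-cluster-id>"
    }
  }
}
```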

Validate and Debug pipeline

To validate the pipeline, click Validate. To run it, click Debug or Trigger. Once the run succeeds, you can check the audit logs in the Privacera portal. For more information, see Validate and Debug Pipelines.