Prerequisites for Discovery on GCP
Note
The prerequisites for Privacera Discovery are the same for both Self-Managed and PrivaceraCloud Data Plane deployments.
The Privacera Discovery module leverages GCP services such as Bigtable, Pub/Sub, GCS, and Google Logging sinks for data scanning. To enable this, you need to create the necessary GCP resources and configure Workload Identity Federation to allow the Discovery and Portal pods to access these resources. Privacera Manager can create the Pub/Sub and GCS resources for you, or you can create them manually.
Here are the prerequisites for setting up Privacera Discovery on GCP:
Prerequisites | Description |
---|---|
GCS bucket | The GCS bucket where the configurations and temporary files for Discovery are stored. |
Google Bigtable | Used to store metadata and tags. |
Google PubSub | Used for Discovery inter-process communication. |
Workload Identity Federation | Used to assign IAM roles to the GKE service accounts for the Discovery and Portal pods. |
Google Logging Sink | Used when real-time scanning is enabled. The change events for GCS / GBQ are retrieved from the Pub/Sub queue. |
GCS bucket
A GCS bucket is required to store the configuration for Privacera Discovery. It is recommended to create a dedicated bucket specifically for Privacera Discovery. Ensure that the IAM roles associated with the Discovery and Portal pods have the necessary read/write access to this bucket.
You will need to provide the bucket name to Privacera Manager during the installation configuration. The bucket can be created manually, or Privacera Manager can create it for you.
Example:
DISCOVERY_BUCKET_NAME: "acme-prod"
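If you prefer to create the bucket manually, a gcloud command along the following lines can be used. This is a sketch only: the bucket name matches the example above, while the location and bucket settings are assumed values to adjust for your environment.

```shell
#!/bin/sh
# Sketch: manually creating a dedicated Discovery bucket.
# The bucket name matches the DISCOVERY_BUCKET_NAME example above;
# LOCATION and bucket settings are assumed values -- adjust as needed.
DISCOVERY_BUCKET_NAME="acme-prod"
LOCATION="us-central1"   # assumed region

CMD="gcloud storage buckets create gs://${DISCOVERY_BUCKET_NAME} --location=${LOCATION} --uniform-bucket-level-access"
echo "$CMD"   # review the command, then run it in an authenticated gcloud session
```

Whichever way the bucket is created, the Discovery and Portal pods still need read/write access to it, as described above.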
Google Bigtable
Google Bigtable is required to store the metadata for Privacera Discovery. The recommended naming convention for these tables is privacera_*_DEPLOYMENT_ENV_NAME. You can create these tables manually, or Privacera Manager can create them for you.
Table Naming Convention
The table names are recommended to be suffixed with the DEPLOYMENT_ENV_NAME (e.g., privacera-prod) to avoid conflicts with other deployments. Assuming your DEPLOYMENT_ENV_NAME is privacera-prod, the table names are suffixed with privacera-prod, as shown in the table below. If you are creating the tables manually, follow the naming convention and schema provided, replacing DEPLOYMENT_ENV_NAME with your actual deployment environment name.
The table names and their corresponding column families:
Table Name | Column Family |
---|---|
privacera_alert_privacera-prod | d |
privacera_audit_summary_privacera-prod | i |
privacera_lineage_privacera-prod | d |
privacera_mlresource_v2_privacera-prod | i |
privacera_resource_v2_privacera-prod | i |
privacera_scan_requests_privacera-prod | i |
privacera_scan_status_privacera-prod | i |
privacera_state_privacera-prod | i |
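If you create the tables manually, the cbt CLI can generate them from the list above. The following sketch only prints the commands for review; the project and Bigtable instance IDs are placeholders, while the table-name/column-family pairs come from the table above.

```shell
#!/bin/sh
# Sketch: print the cbt commands that create the Discovery tables manually.
# PROJECT_ID and BIGTABLE_INSTANCE_ID are placeholder values.
PROJECT_ID="my-gcp-project"
BIGTABLE_INSTANCE_ID="privacera-bt"
DEPLOYMENT_ENV_NAME="privacera-prod"

# table-prefix:column-family pairs, as listed in the table above
TABLES="privacera_alert:d privacera_audit_summary:i privacera_lineage:d \
privacera_mlresource_v2:i privacera_resource_v2:i privacera_scan_requests:i \
privacera_scan_status:i privacera_state:i"

for pair in $TABLES; do
  table="${pair%:*}_${DEPLOYMENT_ENV_NAME}"   # append the env-name suffix
  family="${pair#*:}"
  # Review the printed commands, then run them with the cbt CLI installed.
  echo "cbt -project ${PROJECT_ID} -instance ${BIGTABLE_INSTANCE_ID} createtable ${table} families=${family}"
done
```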
Google PubSub
Discovery uses the following Pub/Sub topics for offline and real-time scanning. Privacera Manager can create these topics for you, or you can create them manually. The recommended naming convention for the topics is privacera_*_DEPLOYMENT_ENV_NAME. Assuming your DEPLOYMENT_ENV_NAME is privacera-prod, the topic names are:
- privacera_alerts_privacera-prod
- privacera_audits_privacera-prod
- privacera_classification_info_privacera-prod
- privacera_delay_queue_privacera-prod
- privacera_results_privacera-prod
- privacera_scan_worker_gcs_privacera-prod
- privacera_offline_scan_privacera-prod
- privacera_spark_events_privacera-prod
- privacera_scan_resources_info_privacera-prod
- privacera_right_to_privacy_privacera-prod
- privacera_apply_scheme_privacera-prod
- privacera_lineage_privacera-prod
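If you create the topics manually, a short loop can derive each topic name from DEPLOYMENT_ENV_NAME and print the corresponding gcloud command. This is a sketch; the project ID is a placeholder, and the topic prefixes come from the list above.

```shell
#!/bin/sh
# Sketch: print the gcloud commands that create the Discovery topics manually.
PROJECT_ID="my-gcp-project"   # placeholder
DEPLOYMENT_ENV_NAME="privacera-prod"

# topic prefixes, in the same order as the list above
TOPIC_PREFIXES="privacera_alerts privacera_audits privacera_classification_info \
privacera_delay_queue privacera_results privacera_scan_worker_gcs \
privacera_offline_scan privacera_spark_events privacera_scan_resources_info \
privacera_right_to_privacy privacera_apply_scheme privacera_lineage"

for prefix in $TOPIC_PREFIXES; do
  topic="${prefix}_${DEPLOYMENT_ENV_NAME}"   # append the env-name suffix
  # Review the printed commands, then run them in an authenticated session.
  echo "gcloud pubsub topics create ${topic} --project=${PROJECT_ID}"
done
```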
Workload Identity Federation
Two sets of IAM policies are required for Privacera Discovery, as outlined below:
- For Privacera Manager: Permissions that allow Privacera Manager to create the GCP resources required for Privacera Discovery. This step is optional: you can either allow Privacera Manager to create the resources during installation or create them manually and provide their details during the installation of the Discovery module. These IAM policies should be attached to the Compute Engine instance where Privacera Manager is running.
- For Discovery Services: IAM roles for the Discovery and Portal pods to access the GCP resources. This is mandatory for scanning GCP services such as GCS and GBQ. These permissions must be manually assigned to the privacera-sa GKE service account. You can limit access to only the required resources that will be scanned by Discovery.
Step 1: IAM for Privacera Manager
You can skip this step if you do not want Privacera Manager to create these resources.
Follow these steps to configure API access scopes and service account permissions for a Google Cloud Platform (GCP) VM instance, enabling the creation of GCP resources for Discovery, such as GCS buckets and Pub/Sub topics.
- Configure the Service Account
  - In the GCP Console, navigate to IAM & Admin > IAM.
  - Create a new service account or use an existing one that will be used by the VM instance to create GCP resources.
  - Ensure that the service account has the following roles:
    - Pub/Sub Admin
    - Storage Admin
- Modify the VM Instance Configuration
  - In the GCP Console, navigate to Compute Engine > VM Instances.
  - Locate your VM instance and stop it to allow configuration changes.
  - Click the VM instance name to open its details page.
  - Scroll down to the API and Identity Management section.
  - Under Cloud API access scopes, select the following scopes:
    - Storage – Read/Write
    - Pub/Sub – Enabled
  - Under Service account, select the service account that you configured in the previous step.
  - Save the changes.
- Restart the Instance
  - After making these changes, restart the instance to apply the updated access scopes and permissions.
Step 2: Configuring Service Account Permissions for Discovery Services
Follow these steps to configure service account permissions for a Google Kubernetes Engine (GKE) service account, enabling Discovery to access GCP resources.
- Run gcloud commands to assign the necessary permissions to the GKE service account, allowing Discovery to access the Google Pub/Sub, Bigtable, GCS, and (optionally) BigQuery resources. In the commands, replace the following placeholders:
  - <PROJECT_ID>: Your GCP project ID.
  - <PROJECT_NUMBER>: Your GCP project number.
  - <NAMESPACE>: The namespace where the Discovery pods are deployed.
  - <GKE_SERVICE_ACCOUNT>: The GKE service account used by the Discovery pods.
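Because the exact commands depend on your Workload Identity Federation setup, the following is only an illustrative sketch. It uses the Workload Identity Federation for GKE principal identifier format; the roles shown (Pub/Sub Editor, Bigtable User, Storage Object Admin, BigQuery Data Viewer) are assumed examples rather than Privacera's documented list, so scope them to your least-privilege policy and to the resources Discovery will scan.

```shell
#!/bin/sh
# Illustrative sketch only -- roles and member format are assumptions.
# The placeholder variables mirror <PROJECT_ID>, <PROJECT_NUMBER>,
# <NAMESPACE>, and <GKE_SERVICE_ACCOUNT> from the list above.
PROJECT_ID="my-gcp-project"          # placeholder
PROJECT_NUMBER="123456789012"        # placeholder
NAMESPACE="privacera"                # placeholder
GKE_SERVICE_ACCOUNT="privacera-sa"   # GKE service account from the prerequisites

# Workload Identity Federation for GKE principal identifier
MEMBER="principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${GKE_SERVICE_ACCOUNT}"

# Assumed example roles -- adjust per your least-privilege policy.
for role in roles/pubsub.editor roles/bigtable.user \
            roles/storage.objectAdmin roles/bigquery.dataViewer; do
  # Review the printed commands, then run them in an authenticated session.
  echo "gcloud projects add-iam-policy-binding ${PROJECT_ID} --member=${MEMBER} --role=${role}"
done
```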
Google Logging Sink
This topic explains how to use a sink-based approach to read audit logs in real time through Pkafka for Discovery's real-time scanning. The following are the key advantages of the sink-based approach:
- All the logs will be synchronized to a Sink.
- Sinks are exported to a destination Pub/Sub topic.
- Pkafka subscribes to the Pub/Sub topic, reads the audit data, and forwards it to the Privacera topic, triggering a real-time scan.
You need to create the following resources in the Google Cloud Console:
- Pub/Sub topic
- Logging Sink
Create Pub/Sub topic
- Log in to the Google Cloud Console and navigate to the Pub/Sub Topics page.
- Click + CREATE TOPIC.
- In the Create a topic dialog, enter the following details:
- Enter a unique topic name in the Topic ID field (e.g., DiscoverySinkTopic).
- Select the Add a default subscription checkbox.
- Click CREATE TOPIC.
Note
If required, you can create a subscription in a later stage, after creating the topic, by navigating to Topic > Create Subscription > Create a simple subscription. Note down the subscription name as it will be used inside a property in Discovery.
- If you created a default subscription or a new subscription, update the following subscription properties:
  - Acknowledgement deadline: Set to 600 seconds.
  - Retry policy: Select "Retry after exponential backoff delay" and enter the following values:
    - Minimum backoff (seconds): 10
    - Maximum backoff (seconds): 600
  - Click UPDATE.
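Equivalently, the same subscription properties can be set from the gcloud CLI. This is a sketch; the subscription name is an assumed placeholder for the default subscription created with the topic.

```shell
#!/bin/sh
# Sketch: set the ack deadline and retry-policy backoff via gcloud.
SUBSCRIPTION="DiscoverySinkTopic-sub"   # assumed default-subscription name

CMD="gcloud pubsub subscriptions update ${SUBSCRIPTION} --ack-deadline=600 --min-retry-delay=10s --max-retry-delay=600s"
echo "$CMD"   # review the command, then run it in an authenticated session
```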
Create a Sink
- Log in to the Google Cloud Console and navigate to the Logs Router page. Alternatively, you can perform this action using the Logs Explorer page by navigating to Actions > Create Sink.
- Click CREATE SINK.
- Enter Sink details:
- Sink name (Required): Enter an identifier for the Sink.
- Sink description (Optional): Describe the purpose or use case for the Sink.
- Click NEXT.
- Enter the Sink destination:
  - Under Select Sink service, select Pub/Sub from the list of services.
- Choose which logs to include in the Sink:
  - Build an inclusion filter: Enter a filter to select the logs that you want to be routed to the Sink's destination. For example:
    (resource.type="gcs_bucket" AND resource.labels.bucket_name="bucket-to-be-scanned" AND (protoPayload.methodName="storage.objects.create" OR protoPayload.methodName="storage.objects.delete" OR protoPayload.methodName="storage.objects.get")) OR resource.type="bigquery_resource"
  - In the filter, replace bucket-to-be-scanned with the name of a bucket that you want to scan and that is registered as a resource in Discovery.
  - In the case of multiple buckets, specify them using OR conditions, for example:
    (resource.type="gcs_bucket" AND (resource.labels.bucket_name="bucket_1" OR resource.labels.bucket_name="bucket_2" OR resource.labels.bucket_name="bucket_3"))
  - In the above example, three buckets are identified to be scanned: bucket_1, bucket_2, and bucket_3.
- Click DONE.
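For reference, a sink with the same inclusion filter could also be created from the gcloud CLI. This is a sketch only; the project, sink, and topic names are placeholders to replace with your own.

```shell
#!/bin/sh
# Sketch: create the logging sink via gcloud instead of the console.
PROJECT_ID="my-gcp-project"   # placeholder
SINK_NAME="discovery-sink"    # placeholder
TOPIC="DiscoverySinkTopic"    # topic created earlier

# Inclusion filter matching the console example above
FILTER='(resource.type="gcs_bucket" AND resource.labels.bucket_name="bucket-to-be-scanned" AND (protoPayload.methodName="storage.objects.create" OR protoPayload.methodName="storage.objects.delete" OR protoPayload.methodName="storage.objects.get")) OR resource.type="bigquery_resource"'

DEST="pubsub.googleapis.com/projects/${PROJECT_ID}/topics/${TOPIC}"
echo "gcloud logging sinks create ${SINK_NAME} ${DEST} --log-filter='${FILTER}'"
# Review the printed command, then run it in an authenticated session.
```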
Cross-project scanning
- For cross-project scanning of GCS and GBQ resources, you need to create a Sink in another project and set the destination as a Pub/Sub topic from Project One.
- Follow the same steps mentioned above to create the Sink in the destination project. Navigate to Destination > Select Other project, and enter the Pub/Sub topic name in the following format:
  'pubsub.googleapis.com/projects/google_sample_project/topics/sink_new'
- To access the Sink created in another project, you need to add the Sink's writer identity service account to the IAM page of the project where the Pub/Sub topic and VM instance are located.
- To get the Sink writer identity, perform the following steps:
  - Go to the Logs Router page > select the Sink > select the dots icon > select Edit Sink Details > in the Writer Identity section, copy the service account.
  - Go to the IAM Administration page of the project where you have the Pub/Sub topic and the VM instance > select Add member > add the service account of the writer identity of the Sink created above.
  - Choose the roles Owner and Editor.
  - Click Save. Verify that the service account you added is present as a member on the IAM Administration page.
Final Checklist
Ensure that all the prerequisites are met before proceeding with the installation of Privacera Discovery.
- Create IAM policies for Privacera Manager to create GCP resources required for Privacera Discovery.
- Assign the IAM roles to the GKE service accounts for the Discovery and Portal pods.
- Create a GCS bucket to store configurations and temporary files for Privacera Discovery, or allow Privacera Manager to create it for you.
- Create tables in Bigtable to store metadata and tags, or allow the Discovery service to create them for you.
- Create Pub/Sub topics for offline and real-time scanning, or let Privacera Manager create them for you.
- Create a Google Logging Sink for real-time scanning (optional).
You should have the values for the following placeholders:
- DISCOVERY_BUCKET_NAME: GCS bucket name for storing config files.
- BIGTABLE_INSTANCE_ID: Bigtable instance ID used to store tables.
- Bigtable table names: for storing metadata and tags (only if you have created the tables manually).
- SCAN_REQUEST_TABLE: Table name for storing scan requests
- RESOURCE_TABLE: Table name for storing resource metadata
- ALERT_TABLE: Table name for storing alerts
- AUDIT_SUMMARY_TABLE: Table name for storing audit summary
- ACTIVE_SCANS_TABLE: Table name for storing active scans
- STATE_TABLE: Table name for storing state
- LINEAGE_TABLE: Table name for storing lineage
- MLRESOURCE_TABLE: Table name for storing ML resource
- PubSub Topic Name: Google PubSub Topic name for offline / real-time scanning.
- CLASSIFICATION_TOPIC: PubSub Topic name for classification
- ALERT_TOPIC: PubSub Topic name for alerts
- SPARK_EVENT_TOPIC: PubSub Topic name for spark events
- RESULT_TOPIC: PubSub Topic name for scan results
- OFFLINE_SCAN_TOPIC: PubSub Topic name for offline scan
- AUDITS_TOPIC: PubSub Topic name for audits
- SCAN_RESOURCE_INFO_TOPIC: PubSub Topic name for scan resource info
- RIGHT_TO_PRIVACY_TOPIC: PubSub Topic name for right to privacy
- DELAY_QUEUE_TOPIC: PubSub Topic name for delay queue
- APPLY_SCHEME_TOPIC: PubSub Topic name for apply scheme
- ML_CLASSIFY_TAG_TOPIC: PubSub Topic name for ML classify tag
- LINEAGE_TOPIC: PubSub Topic name for lineage
- PRIVACERA_PORTAL_TOPIC_DYNAMIC_PREFIX: PubSub Topic name for scan worker
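As an illustration only, the placeholders above might come together as follows. The mapping between placeholder names and the example table/topic names is inferred from the naming lists earlier in this page, and the actual property keys and file layout depend on your Privacera Manager version, so verify each value before use.

```yaml
# Illustrative values only -- verify the property keys and the
# name mappings against your Privacera Manager configuration.
DISCOVERY_BUCKET_NAME: "acme-prod"
BIGTABLE_INSTANCE_ID: "privacera-bt"              # assumed instance ID
SCAN_REQUEST_TABLE: "privacera_scan_requests_privacera-prod"
RESOURCE_TABLE: "privacera_resource_v2_privacera-prod"
ALERT_TABLE: "privacera_alert_privacera-prod"
AUDIT_SUMMARY_TABLE: "privacera_audit_summary_privacera-prod"
ACTIVE_SCANS_TABLE: "privacera_scan_status_privacera-prod"   # inferred mapping
STATE_TABLE: "privacera_state_privacera-prod"
LINEAGE_TABLE: "privacera_lineage_privacera-prod"
MLRESOURCE_TABLE: "privacera_mlresource_v2_privacera-prod"
CLASSIFICATION_TOPIC: "privacera_classification_info_privacera-prod"
ALERT_TOPIC: "privacera_alerts_privacera-prod"
SPARK_EVENT_TOPIC: "privacera_spark_events_privacera-prod"
RESULT_TOPIC: "privacera_results_privacera-prod"
OFFLINE_SCAN_TOPIC: "privacera_offline_scan_privacera-prod"
AUDITS_TOPIC: "privacera_audits_privacera-prod"
SCAN_RESOURCE_INFO_TOPIC: "privacera_scan_resources_info_privacera-prod"
RIGHT_TO_PRIVACY_TOPIC: "privacera_right_to_privacy_privacera-prod"
DELAY_QUEUE_TOPIC: "privacera_delay_queue_privacera-prod"
APPLY_SCHEME_TOPIC: "privacera_apply_scheme_privacera-prod"
LINEAGE_TOPIC: "privacera_lineage_privacera-prod"
# ML_CLASSIFY_TAG_TOPIC and PRIVACERA_PORTAL_TOPIC_DYNAMIC_PREFIX have no
# obvious counterpart in the topic list above; set them per your deployment.
```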