Disaster Recovery for Data Plane without Discovery
This document outlines the disaster recovery (DR) design for the Data Plane portion of a Privacera deployment that is integrated with PrivaceraCloud. It applies specifically to hybrid deployments where the control plane is hosted by PrivaceraCloud and the Data Plane is deployed within the customer environment.
This section is intended for environments where Privacera Discovery is not enabled. It focuses solely on ensuring continuity of the Data Plane, including the PolicySync, UserSync, and Data Server components, in case of a regional outage.
Disaster Recovery Design
The disaster recovery design for the Data Plane is based on a cold standby approach, where a secondary region is deployed with identical infrastructure to the primary region. This secondary region remains inactive until a failover occurs. The design ensures that all components of the Data Plane are replicated in the standby region, allowing for quick recovery in the event of a failure in the primary region.
Here are the steps involved in the DR design:
- Setup
    - Deploy infrastructure in a secondary region (Region 2) that is identical to the primary region (Region 1).
    - Ensure that all components in Region 2 are configured and ready to be activated. It is recommended to clone the `~/privacera/privacera-manager` directory from Region 1 to Region 2.
    - Scale down all the services in Region 2 to ensure they are not active.
- Failover
    - In the event of a failure in Region 1, update the DNS records to point to Region 2.
    - Delete all the Persistent Volumes (PVs) in Region 2.
    - Start the services in Region 2 in the correct sequence.
    - If you are using SCIM, reset SCIM to re-push Users and Groups.
    - Validate that all components are functioning correctly in Region 2.
- Revert
    - Once Region 1 is restored, update the DNS records to point back to Region 1.
    - Stop all services in Region 2.
    - Delete all the Persistent Volumes (PVs) in Region 1.
    - Start the services in Region 1 in the correct sequence.
    - If you are using SCIM, reset SCIM to re-push Users and Groups.
    - Validate that all components are functioning correctly in Region 1.
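The failover and revert sequences above mirror each other, so they can be captured as an ordered checklist. The following Python sketch is illustrative only: the function name `build_runbook`, the region labels, and the step wording are assumptions made for this example and are not part of Privacera Manager.

```python
from typing import List

REGIONS = ("Region 1", "Region 2")


def build_runbook(action: str, uses_scim: bool = False) -> List[str]:
    """Return the ordered DR steps for a 'failover' or a 'revert'.

    Hypothetical helper: it only produces a checklist and does not
    touch DNS, Kubernetes, or any Privacera service.
    """
    if action == "failover":
        source, target = REGIONS   # Region 1 failed, Region 2 takes over
    elif action == "revert":
        target, source = REGIONS   # Region 1 is restored and takes over again
    else:
        raise ValueError(f"unknown action: {action!r}")

    steps = [f"Update the DNS records to point to {target}"]
    if action == "revert":
        # During a revert the standby region is still running and must be stopped.
        steps.append(f"Stop all services in {source}")
    steps.append(f"Delete all Persistent Volumes (PVs) in {target}")
    steps.append(f"Start the services in {target} in the correct sequence")
    if uses_scim:
        steps.append("Reset SCIM to re-push Users and Groups")
    steps.append(f"Validate that all components are functioning correctly in {target}")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(build_runbook("failover", uses_scim=True), start=1):
        print(f"{number}. {step}")
```

Generating the list from one function keeps the two directions consistent: the only difference between them is the extra step that stops the standby services during a revert.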
Upgrading Privacera
When upgrading the Data Plane, ensure that both regions are upgraded to the same version before performing any failover or revert operation. This ensures compatibility and minimizes potential issues during the process.
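One way to enforce this rule is a simple gate that refuses a failover or revert while the two regions report different versions. The sketch below is a minimal illustration: how the version strings are obtained is deliberately left out, and the names `versions_match` and `assert_safe_to_switch` are hypothetical.

```python
def versions_match(region1_version: str, region2_version: str) -> bool:
    """Return True when both regions report the same Data Plane version.

    Hypothetical check: compares the strings after trimming whitespace.
    """
    return region1_version.strip() == region2_version.strip()


def assert_safe_to_switch(region1_version: str, region2_version: str) -> None:
    """Raise before a failover or revert if the two versions differ."""
    if not versions_match(region1_version, region2_version):
        raise RuntimeError(
            "Version mismatch: Region 1 runs "
            f"{region1_version!r}, Region 2 runs {region2_version!r}; "
            "upgrade both regions to the same version first."
        )
```

Calling `assert_safe_to_switch` as the first step of a failover or revert script would stop the procedure before any DNS change is made.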
Architecture Overview
The architecture leverages an active-cold standby approach using two geographically separate regions.
Region 1 (Active)
- Primary region where production workloads are executed
- Includes:
    - Load Balancer (routing to Data Servers)
    - Data Servers
    - Privacera UserSync
    - Multiple PolicySync connectors
Region 2 (Cold Standby)
- Backup region with identical infrastructure deployed but inactive
- Includes the same components as Region 1, preconfigured and ready for activation
DNS-Based Failover
- DNS is used as the failover control mechanism
- When Region 1 becomes unavailable, DNS entries are updated to route traffic to Region 2
The Load Balancer is used by the Data Server, Privacera Diagnostic Tools, and Privacera Health Monitoring.
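Deciding when Region 1 counts as unavailable is commonly based on several consecutive failed health checks rather than a single failed probe. The following sketch illustrates that idea only. The threshold, the region labels, and the function `select_dns_target` are assumptions for this example, and the actual DNS update depends on your DNS provider.

```python
from typing import Sequence


def select_dns_target(
    region1_checks: Sequence[bool],
    failure_threshold: int = 3,
) -> str:
    """Pick the region that DNS should point to.

    region1_checks holds recent health-check results for Region 1,
    oldest first (True means healthy). Region 2 is chosen only when
    the most recent `failure_threshold` checks all failed.
    """
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    recent = list(region1_checks)[-failure_threshold:]
    region1_down = len(recent) == failure_threshold and not any(recent)
    return "Region 2" if region1_down else "Region 1"
```

Requiring several consecutive failures avoids switching regions because of one transient probe error, at the cost of a slightly longer detection time.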
Important Considerations
- PolicySync State: PolicySync keeps no state that must be replicated between regions, but it does maintain local state in Persistent Volumes (PVs). During a failover to the standby region, or a revert to the primary region, delete the PVs before restarting the services. The PolicySync connectors then automatically re-sync from the data platform and PrivaceraCloud and rebuild the local state.
- UserSync State: UserSync can also be rebuilt in the standby region without replicating state. As with PolicySync, delete the PVs before restarting the services. If you are using SCIM, reset SCIM so that Users and Groups are pushed to the target system again.
- Data Server: There are two main approaches for Data Server failover:
a. DNS-based failover: Update DNS entries to point to Region 2's load balancer when Region 1 is unavailable. This approach is simpler but may have longer failover times due to DNS propagation delays.
b. Load Balancer failover: Configure a global/cross-region load balancer to automatically route traffic to Region 2 when Region 1 is unavailable. This approach provides faster failover but may be more complex to set up.
Choose the approach that best fits your Recovery Time Objective (RTO) requirements. Consult your cloud provider's documentation for specific implementation details of either approach.
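To compare the two approaches against an RTO, it can help to add up the delays each one involves. The sketch below is a rough, conservative model and not a measurement. The breakdown into detection time, DNS update time, DNS TTL, and service start-up time is an assumption, the steps are assumed to run one after another, and all function names and numbers are hypothetical values to replace with figures from your own environment.

```python
def dns_failover_seconds(
    detection_s: float,
    dns_update_s: float,
    dns_ttl_s: float,
    startup_s: float,
) -> float:
    """Upper-bound estimate for DNS-based failover.

    Clients may keep the old record cached for up to one TTL after
    the DNS update, so the full TTL is counted.
    """
    return detection_s + dns_update_s + dns_ttl_s + startup_s


def load_balancer_failover_seconds(
    check_interval_s: float,
    unhealthy_threshold: int,
    startup_s: float,
) -> float:
    """Upper-bound estimate for failover through a cross-region load balancer.

    Detection is modelled as the health-check interval multiplied by the
    number of failed checks needed to mark Region 1 unhealthy.
    """
    return check_interval_s * unhealthy_threshold + startup_s


def meets_rto(estimate_s: float, rto_s: float) -> bool:
    """Return True when the estimated recovery time fits within the RTO."""
    return estimate_s <= rto_s


if __name__ == "__main__":
    # Placeholder inputs, in seconds; not recommendations.
    dns = dns_failover_seconds(detection_s=120, dns_update_s=60, dns_ttl_s=300, startup_s=600)
    lb = load_balancer_failover_seconds(check_interval_s=30, unhealthy_threshold=3, startup_s=600)
    print("DNS-based estimate (s):", dns, "meets 15 min RTO:", meets_rto(dns, 900))
    print("Load balancer estimate (s):", lb, "meets 15 min RTO:", meets_rto(lb, 900))
```

Because Region 2 is a cold standby, the service start-up time appears in both estimates. The approaches differ mainly in how quickly the outage is detected and how long clients keep using the old endpoint.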