IllumiDesk Handbook listing of DR for Databases

Database: Disaster Recovery

This page gives an overview of the disaster recovery strategy we have in place for the PostgreSQL database. In this context, a disaster means losing the main database cluster or parts of it (for example, a DROP DATABASE-type incident).

This overview is not yet complete and will be extended.

We base our strategy on PostgreSQL's Point-in-Time Recovery (PITR) feature.

In practice, we ship daily base snapshots and the transaction log (WAL) to external storage (the archive). Given a snapshot, we can replay the WAL up to a chosen point in time, for example right before the disaster occurred.

Currently, AWS S3 serves as a storage backend for the PITR archive.
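
As a rough illustration of what PITR looks like operationally, the sketch below prepares a PostgreSQL 12+ data directory (restored from a base snapshot) to replay archived WAL up to a chosen target time. The data directory path, the target timestamp, and the fetch-wal-from-s3 command are illustrative placeholders, not our production configuration.

```python
# Minimal PITR sketch for a PostgreSQL 12+ instance restored from a base snapshot.
# PGDATA, TARGET, and the fetch-wal-from-s3 command are illustrative placeholders.
from pathlib import Path

PGDATA = Path("/var/lib/postgresql/data")   # assumed data directory of the restored snapshot
TARGET = "2024-05-01 03:55:00 UTC"          # a moment just before the incident

# restore_command tells PostgreSQL how to fetch archived WAL segments;
# recovery_target_time stops replay right before the disaster.
settings = "\n".join([
    "restore_command = 'fetch-wal-from-s3 %f %p'",
    f"recovery_target_time = '{TARGET}'",
    "recovery_target_action = 'promote'",
    "",
])

with open(PGDATA / "postgresql.auto.conf", "a") as conf:
    conf.write(settings)

# recovery.signal makes the server perform targeted recovery (PITR) on next start.
(PGDATA / "recovery.signal").touch()
print("PITR configured; start PostgreSQL to replay WAL up to", TARGET)
```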

Restore testing

A backup is only valuable if it can be restored successfully and within a predictable amount of time. To monitor the state of our backups and measure the expected recovery time (DB-DR-TTR), we run a daily process that tests them.

This process is implemented as a CI pipeline (see README.md for details). On a daily schedule, a fresh database GCE instance is created, restored from the latest backup, and configured as an archive replica that recovers from the WAL archive (essentially performing PITR). Once recovery completes, the restored database is verified.

Monitoring is in place to detect problems with the restore pipeline (currently via deadmanssnitch.com). We plan to additionally track the time recovery takes and other metrics.
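
As a rough sketch of the verification step, assuming psycopg2 and an illustrative connection string and spot-check table (the actual pipeline lives in the CI configuration referenced above):

```python
# Minimal sketch of post-restore verification; the DSN and the spot-check
# query are illustrative, not the actual pipeline code.
import time
import psycopg2

RESTORE_DSN = "host=restore-test-instance dbname=illumidesk user=verifier"

def verify_restore(dsn: str) -> None:
    started = time.monotonic()
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # After PITR the instance should have been promoted out of recovery.
        cur.execute("SELECT pg_is_in_recovery()")
        assert cur.fetchone()[0] is False, "instance is still in recovery"

        # Spot-check that recent data made it into the restore.
        cur.execute("SELECT max(created_at) FROM audit_log")  # illustrative table
        print("most recent restored row:", cur.fetchone()[0])
    print(f"verification finished in {time.monotonic() - started:.1f}s")

if __name__ == "__main__":
    verify_restore(RESTORE_DSN)
```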

Disaster Recovery Replicas

The backup strategy above produces a cold backup. To restore from it, we need to retrieve the full base backup from cold storage (over the network) and perform PITR on top of it. This can take considerable time given the amount of data that has to be transferred.

Retrieving a base backup from AWS S3 currently runs at about 380 GB per hour (net size). With a database size of roughly 2.1 TB, retrieving the base backup alone already takes more than 5 hours (2.1 TB ÷ 380 GB/h ≈ 5.5 h). The PITR phase that follows is generally even slower.
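
A quick back-of-the-envelope check of those numbers:

```python
# Back-of-the-envelope estimate of the base backup retrieval time,
# using the throughput and database size quoted above.
DB_SIZE_GB = 2100           # ~2.1 TB database
S3_FETCH_GB_PER_HOUR = 380  # observed retrieval throughput (net size)

base_backup_hours = DB_SIZE_GB / S3_FETCH_GB_PER_HOUR
print(f"base backup retrieval alone: ~{base_backup_hours:.1f} hours")  # ~5.5 hours
# WAL replay (the PITR phase) comes on top of this and is generally slower.
```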

We currently aim for a DB-DR-TTR of 8 hours to recover from a backup. We are not there yet, so as an interim measure we maintain disaster recovery replicas.

Delayed Replica

Another option is to keep a replica that always lags a few hours behind the production cluster. We call this a delayed replica: a normal streaming replica, but with replay delayed by a few hours. If disaster strikes, it can be used to quickly perform PITR from the WAL archive. This is much faster than a full restore, because we do not have to retrieve a full base backup from S3 first. In addition, with daily snapshots the latest snapshot is in the worst case 24 hours old (plus the time it took to capture it), whereas a delayed replica is kept at a constant offset from the production cluster and therefore never has to replay more than a few hours' worth of WAL. A minimal sketch follows the host details below.

  • Production host: postgres-dr-delayed-01-db-gprd.c.IllumiDesk-production.internal

  • Chef role: gprd-base-db-postgres-delayed
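
A delayed standby of this kind is typically built with PostgreSQL's recovery_min_apply_delay setting. The sketch below (assuming psycopg2; the incident timestamp is hypothetical) checks whether the delayed replica is still behind a given disaster time and can therefore be fast-forwarded with PITR instead of doing a full cold restore:

```python
# Minimal sketch: decide whether the delayed replica is usable for PITR.
# Credentials and the incident timestamp are illustrative.
from datetime import datetime, timezone
import psycopg2

DELAYED_DSN = ("host=postgres-dr-delayed-01-db-gprd.c.IllumiDesk-production.internal "
               "dbname=postgres user=monitor")
DISASTER_AT = datetime(2024, 5, 1, 4, 5, tzinfo=timezone.utc)  # hypothetical incident time

with psycopg2.connect(DELAYED_DSN) as conn, conn.cursor() as cur:
    # Timestamp of the last transaction the delayed replica has replayed.
    cur.execute("SELECT pg_last_xact_replay_timestamp()")
    replayed_up_to = cur.fetchone()[0]

if replayed_up_to is not None and replayed_up_to < DISASTER_AT:
    # The bad transaction has not been replayed yet: stop replay here and
    # recover from the WAL archive up to just before DISASTER_AT.
    print("delayed replica usable; set recovery_target_time just before", DISASTER_AT)
else:
    print("delayed replica already past the incident; fall back to a cold restore")
```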

Archive Replica

Another type of replica is the archive replica. Its sole purpose is to continuously recover from the WAL archive and thereby test the WAL archive. This is necessary because PITR relies on a continuous sequence of WAL that can be applied to a snapshot of the database (the base backup). If that sequence is broken for whatever reason, PITR can only recover up to that point and no further. We monitor the replication lag of the archive replica: if it falls too far behind, there is likely a problem with the WAL archive. A minimal version of this check is sketched after the host details below.

The restore testing pipeline also performs PITR from the WAL archive and would therefore be able to detect (some) problems with the archive. However, an archive replica that stays close to the production cluster detects problems with the archive much faster than a daily backup test does. In addition, the archive replica has to consume all WAL from the archive, whereas a backup restore typically reads only the portion of the archive needed to recover to a certain point in time.

In that sense, the archive replica, the delayed replica, and the restore testing overlap in functionality. Together they give us high confidence in our cold backup and PITR recovery strategy.

  • Production host: postgres-dr-archive-01-db-gprd.c.IllumiDesk-production.internal

  • Chef role: gprd-base-db-postgres-archive
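
A minimal version of the lag check described above, assuming psycopg2; the two-hour threshold is an illustrative value, not the actual alerting rule:

```python
# Minimal sketch: alert if the archive replica falls too far behind, which
# would indicate a gap or other problem in the WAL archive.
from datetime import timedelta
import psycopg2

ARCHIVE_DSN = ("host=postgres-dr-archive-01-db-gprd.c.IllumiDesk-production.internal "
               "dbname=postgres user=monitor")
MAX_LAG = timedelta(hours=2)  # hypothetical alert threshold

with psycopg2.connect(ARCHIVE_DSN) as conn, conn.cursor() as cur:
    # Approximate replication lag: wall-clock time since the last replayed transaction.
    cur.execute("SELECT now() - pg_last_xact_replay_timestamp()")
    lag = cur.fetchone()[0]

if lag is None or lag > MAX_LAG:
    print(f"ALERT: archive replica lag is {lag}; the WAL archive may be broken")
else:
    print(f"archive replica lag OK: {lag}")
```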
