Governance is the control plane that sits above every table in the warehouse, and every decision you make in the DDL is a governance decision whether you know it or not. Who can read the column, who can join it, who gets masked, who gets raw, how long it lives, where the audit event goes, which classification tag propagates downstream through dbt lineage, and which upstream producer can rewrite its schema without paging the security team. This page maps those controls to where they live in the architecture, starting at the role hierarchy and ending at the audit log retention policy, because that's how governance actually shows up in the job and on the system design whiteboard.
Governance is the set of processes and technical controls that ensure data is secure, compliant, discoverable, and trustworthy. For data engineers, this breaks down into five areas of responsibility. Each one has concrete implementation patterns.
- **Access control:** Who can read, write, and administer each dataset? This includes table-level permissions, column-level security, row-level filtering, and service account management.
- **Data classification:** What type of data does each column contain? PII, financial, health-related, internal, or public? Classification drives access control decisions and determines which masking rules apply.
- **PII handling:** How do you protect personal data throughout the pipeline? From ingestion through transformation to the analytics layer, PII must be masked, hashed, or encrypted based on who is consuming it.
- **Audit trails:** Can you answer the question "who accessed this data and when" at any point? Audit logging captures reads, writes, permission changes, and data exports.
- **Data retention:** How long do you keep data, and how do you delete it when required? GDPR right-to-deletion, regulatory retention periods, and storage cost management all drive retention policies.
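To make the retention item concrete: a right-to-erasure request ultimately fans out deletes across every table that references the data subject. A minimal sketch, where `execute_sql` stands in for a real warehouse client and the table list is hypothetical:

```python
def process_erasure_request(user_id, tables, execute_sql):
    """Issue a DELETE for one data subject in every table that holds their PII.

    `tables` is a list of (table, key_column) pairs from a vetted config --
    identifiers cannot be bound as query parameters, so they must come from
    trusted configuration, while the user_id is passed as a bound parameter.
    """
    statements = []
    for table, key_column in tables:
        stmt = f"DELETE FROM {table} WHERE {key_column} = %(uid)s"
        execute_sql(stmt, {"uid": user_id})
        statements.append(stmt)
    return statements

# Illustrative usage with a fake client that just records the calls.
issued = []
process_erasure_request(
    "u_123",
    [("analytics.customers", "user_id"), ("analytics.orders", "user_id")],
    lambda sql, params: issued.append((sql, params)),
)
```

In a real pipeline this step would also delete from backups-eligible partitions, record the request in the audit log, and confirm completion within the regulatory deadline.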
Access control in a data warehouse operates at multiple levels. The goal is least privilege: every user and service account gets exactly the access it needs and nothing more. Here are the layers from coarsest to finest.
Create roles that map to job functions (analyst, engineer, finance_viewer). Grant table and schema permissions to roles, not individual users. When someone changes teams, revoke the old role and grant the new one. Every modern warehouse (Snowflake, BigQuery, Redshift, Databricks) supports RBAC.
```sql
-- Snowflake RBAC example
CREATE ROLE analyst_role;
GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE analyst_role;
GRANT USAGE ON DATABASE analytics TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.public
  TO ROLE analyst_role;

-- Assign the role to a user
GRANT ROLE analyst_role TO USER jane;
```

Some columns in a table are sensitive while others are not. Column-level security restricts which roles can see specific columns. Snowflake uses masking policies. BigQuery uses column-level access control with policy tags. Redshift uses column-level GRANT.
```sql
-- Snowflake dynamic masking policy
CREATE MASKING POLICY email_mask AS (val STRING)
  RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_ADMIN', 'SUPPORT')
      THEN val
    ELSE '***@***.***'
  END;

ALTER TABLE customers
  MODIFY COLUMN email
  SET MASKING POLICY email_mask;
```

Row-level security filters rows based on the querying user's role or attributes. A regional manager sees only their region's data. A customer support agent sees only their assigned accounts. This is implemented via security policies (PostgreSQL, SQL Server) or row access policies (Snowflake, BigQuery).
```sql
-- PostgreSQL row-level security
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY region_access ON orders
  USING (region = current_setting('app.user_region'));
```

Classification is the primary key of governance. Every other control (RBAC, masking, retention, audit) reads from this tag and branches on it. If the column isn't labeled, nothing downstream can enforce anything, and the whole architecture collapses back to trust-based access. Start here, propagate the tag through dbt lineage, and let the rest of the stack react to it.
| Level | Examples | Controls |
|---|---|---|
| Public | Product names, published prices | Standard access, no masking |
| Internal | Revenue figures, user counts | Employee-only access |
| Confidential | Salary, contract terms | Role-restricted, audit logged |
| Restricted / PII | SSN, email, phone, health records | Masked by default, encryption at rest, audit logged, retention enforced |
Tooling for classification includes data catalogs (Alation, Atlan, DataHub, OpenMetadata), cloud-native tagging (BigQuery policy tags, Snowflake object tagging), and automated scanning tools that detect PII patterns in column names and sample data.
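Under the hood, propagating a classification tag through lineage is just a graph traversal: tag the source column once, then let every downstream column inherit it. A minimal sketch in Python, where the `LINEAGE` map and column names are illustrative stand-ins for what a catalog or dbt manifest would provide:

```python
from collections import deque

# Hypothetical column-level lineage: upstream column -> downstream columns.
LINEAGE = {
    "raw.users.email": ["staging.users.email"],
    "staging.users.email": ["marts.dim_customer.email"],
}

def propagate_tag(source, tag, lineage):
    """BFS from a tagged source column; every reachable downstream
    column inherits the same classification tag."""
    tags = {source: tag}
    queue = deque([source])
    while queue:
        col = queue.popleft()
        for child in lineage.get(col, []):
            if child not in tags:
                tags[child] = tag
                queue.append(child)
    return tags

tags = propagate_tag("raw.users.email", "pii", LINEAGE)
# All three columns in the chain now carry the "pii" tag.
```

The point of automating this is that a human labels the raw column once at ingestion, and every derived mart column gets the tag (and the controls that key off it) for free.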
Personally identifiable information (PII) requires special treatment at every stage of the pipeline. The approach depends on whether downstream consumers need the original value, a pseudonymized version, or no access at all.
Apply SHA-256 (with a salt) to PII columns. The hash preserves the ability to join records across tables without exposing the raw value. Use a consistent salt so the same email hashes to the same value everywhere. Store the salt in a secrets manager, not in the pipeline code.
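A sketch of the salted-hash approach in Python (the helper name is ours, and the hard-coded salt is a placeholder; in production the salt comes from a secrets manager):

```python
import hashlib

def hash_pii(value, salt):
    """Salted SHA-256: same input + same salt -> same digest,
    so hashed columns still join across tables."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

# Placeholder salt for illustration only -- never hard-code the real one.
salt = b"example-salt"
a = hash_pii("jane@example.com", salt)
b = hash_pii("jane@example.com", salt)
assert a == b  # consistent salt preserves joinability
```

The warehouse equivalent is the same idea applied in SQL at load time; the key property is determinism with a secret salt, which keeps joins working while preventing rainbow-table lookups on the raw values.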
The raw PII exists in the table, but a masking policy hides it from unauthorized roles. Privileged users (support agents, data admins) see the real value. Everyone else sees a masked version. This is zero-copy: you do not maintain separate tables for different access levels.
Replace PII with a random token and store the mapping in a separate, tightly controlled service. Authorized applications can detokenize when needed. This is the strongest protection because the warehouse never contains the raw PII, and the tokenization service has its own access controls and audit logging.
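A toy illustration of the pattern (a production vault is a separate, network-isolated service with its own storage, authorization, and audit log; everything here, including the class and token format, is illustrative):

```python
import secrets

class TokenVault:
    """Hypothetical in-memory token vault for illustration only."""

    def __init__(self):
        self._forward = {}   # raw value -> token
        self._reverse = {}   # token -> raw value

    def tokenize(self, value):
        # Repeat values get a stable token, so joins on the token still work.
        if value not in self._forward:
            token = "tok_" + secrets.token_hex(16)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token):
        # In a real service this call is authorized and audit-logged.
        return self._reverse[token]

vault = TokenVault()
t = vault.tokenize("555-12-3456")
assert vault.detokenize(t) == "555-12-3456"
assert vault.tokenize("555-12-3456") == t  # stable token for repeat values
```

The token is random rather than derived from the value, which is what makes this stronger than hashing: there is nothing to brute-force, and revoking access to the vault revokes access to every raw value at once.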
An audit trail answers two questions: who accessed the data, and what did they do with it? Regulations like GDPR, HIPAA, and SOX require demonstrable audit capability. Even without regulatory pressure, audit trails help you investigate incidents and prove that access controls are working.
- **Query logging:** Capture every query executed against sensitive tables. Snowflake provides QUERY_HISTORY. BigQuery provides INFORMATION_SCHEMA.JOBS. Redshift provides STL_QUERY. Route these logs to a tamper-proof store (S3 with object lock, or a dedicated audit database).
- **Access logging:** Track GRANT, REVOKE, and role assignment changes. These logs prove that access control changes were authorized and timely.
- **Data change logging:** For tables that require it, maintain a change history. Use slowly changing dimensions (SCD Type 2), event sourcing, or database-level change data capture (CDC) to preserve the before-and-after state of every record modification.
- **Retention of audit logs:** Audit logs themselves need a retention policy. Most regulations require 1 to 7 years. Store them in a cost-effective tier (cold storage, compressed Parquet on S3) and ensure they are queryable when needed.
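The retention check itself is simple arithmetic over partition dates. A sketch, assuming daily-partitioned audit logs (the function name and partition representation are illustrative):

```python
from datetime import date, timedelta

def partitions_to_expire(partitions, retention_days, today):
    """Daily partitions older than the retention window are eligible
    for deletion by the cleanup job."""
    cutoff = today - timedelta(days=retention_days)
    return [p for p in partitions if p < cutoff]

parts = [date(2017, 1, 1), date(2023, 6, 1), date(2024, 1, 1)]
# With a 7-year window as of mid-2024, only the 2017 partition is expired.
expired = partitions_to_expire(parts, retention_days=7 * 365, today=date(2024, 6, 1))
```

In practice this runs as a scheduled job that drops expired partitions and records each deletion in the audit log itself, so the retention policy is demonstrable.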
Governance questions appear in system design and behavioral rounds. Interviewers are not looking for you to recite GDPR articles. They want to know whether you can build systems that protect data and comply with regulations.
Strong answer structure: classify which columns are PII at ingestion. Hash PII columns for analytics. Apply dynamic masking policies so raw values are visible only to authorized roles. Log all access to tables containing PII. Implement a deletion pipeline for GDPR right-to-erasure requests. Mention specific tools (Snowflake masking policies, BigQuery policy tags) if you have used them.
Describe a specific project. What regulation or policy drove it? What technical controls did you implement? What was the outcome? Good examples: building a GDPR deletion pipeline, implementing column-level masking for a new dataset, setting up audit logging for SOX compliance, or migrating PII to a tokenized format.
Staff-level system design rounds grade you on the controls around the pipeline, not the pipeline itself. Practice the architecture that wins that second hour of the onsite.
- Six dimensions of data quality, testing patterns, and monitoring for data pipelines
- What a data catalog does, how it supports governance, and tools like Alation, Atlan, and DataHub
- How to approach data engineering system design interviews with frameworks and examples