Data Engineering

Data Governance for Data Engineers

Governance is the control plane that sits above every table in the warehouse, and every decision you make in the DDL is a governance decision whether you realize it or not: who can read a column, who can join on it, who gets masked values and who gets raw, how long data lives, where the audit event goes, which classification tag propagates downstream through dbt lineage, and which upstream producer can rewrite its schema without paging the security team. This page maps those controls to where they live in the architecture, starting at the role hierarchy and ending at the audit log retention policy, because that's how governance actually shows up on the job and on the system design whiteboard.

5 — Control Layers
17% — Staff Design Rounds
632 — Senior Rounds Tracked
7y — Typical Audit Retention

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

What Governance Means for Data Engineers

Governance is the set of processes and technical controls that ensure data is secure, compliant, discoverable, and trustworthy. For data engineers, this breaks down into five areas of responsibility. Each one has concrete implementation patterns.

Access control: Who can read, write, and administer each dataset? This includes table-level permissions, column-level security, row-level filtering, and service account management.

Data classification: What type of data does each column contain? PII, financial, health-related, internal, or public? Classification drives access control decisions and determines which masking rules apply.

PII handling: How do you protect personal data throughout the pipeline? From ingestion through transformation to the analytics layer, PII must be masked, hashed, or encrypted based on who is consuming it.

Audit trails: Can you answer the question “who accessed this data and when” at any point? Audit logging captures reads, writes, permission changes, and data exports.

Data retention: How long do you keep data, and how do you delete it when required? GDPR right-to-deletion, regulatory retention periods, and storage cost management all drive retention policies.
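The deletion side of retention can be a scheduled sweep rather than a manual process. A minimal sketch in warehouse SQL, assuming a hypothetical governance.deletion_requests queue table that the privacy team populates with approved right-to-erasure requests:

-- Hypothetical right-to-erasure sweep; run both statements in one transaction
DELETE FROM analytics.public.customers
WHERE customer_id IN (
  SELECT customer_id
  FROM governance.deletion_requests
  WHERE status = 'approved'
);

-- Mark the requests as fulfilled so the sweep is idempotent
UPDATE governance.deletion_requests
SET status = 'completed'
WHERE status = 'approved';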

Access Control Patterns

Access control in a data warehouse operates at multiple levels. The goal is least privilege: every user and service account gets exactly the access it needs and nothing more. Here are the layers from coarsest to finest.

Role-Based Access Control (RBAC)

Create roles that map to job functions (analyst, engineer, finance_viewer). Grant table and schema permissions to roles, not individual users. When someone changes teams, revoke the old role and grant the new one. Every modern warehouse (Snowflake, BigQuery, Redshift, Databricks) supports RBAC.

-- Snowflake RBAC example
CREATE ROLE analyst_role;
GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE analyst_role;
GRANT USAGE ON DATABASE analytics TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.public
  TO ROLE analyst_role;

-- Assign to user
GRANT ROLE analyst_role TO USER jane;

Column-Level Security

Some columns in a table are sensitive while others are not. Column-level security restricts which roles can see specific columns. Snowflake uses masking policies. BigQuery uses column-level access control with policy tags. Redshift uses column-level GRANT.

-- Snowflake dynamic masking policy
CREATE MASKING POLICY email_mask AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_ADMIN', 'SUPPORT')
      THEN val
    ELSE '***@***.***'
  END;

ALTER TABLE customers
  MODIFY COLUMN email
  SET MASKING POLICY email_mask;
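In Redshift the same idea is expressed directly in the GRANT statement rather than as a policy object. A sketch, assuming the customers table and an analyst_role role already exist (column names are illustrative):

-- Redshift column-level GRANT: analysts see only non-sensitive columns
GRANT SELECT (customer_id, signup_date, plan)
  ON customers TO ROLE analyst_role;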

Row-Level Security

Row-level security filters rows based on the querying user's role or attributes. A regional manager sees only their region's data. A customer support agent sees only their assigned accounts. This is implemented via security policies (PostgreSQL, SQL Server) or row access policies (Snowflake, BigQuery).

-- PostgreSQL row-level security
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY region_access ON orders
  USING (region = current_setting('app.user_region'));
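The Snowflake equivalent is a row access policy. A sketch, assuming a hypothetical security.region_mapping table that maps role names to the regions they may see:

-- Snowflake row access policy backed by a mapping table
CREATE ROW ACCESS POLICY region_policy AS (region_col STRING)
RETURNS BOOLEAN ->
  EXISTS (
    SELECT 1 FROM security.region_mapping m
    WHERE m.role_name = CURRENT_ROLE()
      AND m.region = region_col
  );

ALTER TABLE orders
  ADD ROW ACCESS POLICY region_policy ON (region);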

Data Classification

Classification is the primary key of governance. Every other control (RBAC, masking, retention, audit) reads from this tag and branches on it. If the column isn't labeled, nothing downstream can enforce anything, and the whole architecture collapses back to trust-based access. Start here, propagate the tag through dbt lineage, and let the rest of the stack react to it.

Level            | Examples                           | Controls
Public           | Product names, published prices    | Standard access, no masking
Internal         | Revenue figures, user counts       | Employee-only access
Confidential     | Salary, contract terms             | Role-restricted, audit logged
Restricted / PII | SSN, email, phone, health records  | Masked by default, encryption at rest, audit logged, retention enforced

Tooling for classification includes data catalogs (Alation, Atlan, DataHub, OpenMetadata), cloud-native tagging (BigQuery policy tags, Snowflake object tagging), and automated scanning tools that detect PII patterns in column names and sample data.
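Whichever catalog you use, the tag ultimately has to land on the column so downstream controls can branch on it. A sketch using Snowflake object tagging, assuming a governance schema exists to hold tag objects:

-- Create a classification tag and attach it to a sensitive column
CREATE TAG governance.pii_level
  ALLOWED_VALUES 'public', 'internal', 'confidential', 'restricted';

ALTER TABLE customers
  MODIFY COLUMN email
  SET TAG governance.pii_level = 'restricted';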

PII Handling Patterns

Personally Identifiable Information requires special treatment at every stage of the pipeline. The approach depends on whether downstream consumers need the original value, a pseudonymized version, or no access at all.

Hashing

Apply SHA-256 (with a salt) to PII columns. The hash preserves the ability to join records across tables without exposing the raw value. Use a consistent salt so the same email hashes to the same value everywhere. Store the salt in a secrets manager, not in the pipeline code.
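In warehouse SQL this is usually a one-line expression in the staging model. A sketch in Snowflake-style SQL, assuming the salt is injected at run time as a session variable set from the secrets manager (never hard-coded):

-- Salted SHA-256: the same email + salt yields the same join key everywhere
SELECT
  SHA2(CONCAT($pii_salt, email), 256) AS email_hash,
  order_total,
  created_at
FROM raw.orders;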

Dynamic Masking

The raw PII exists in the table, but a masking policy hides it from unauthorized roles. Privileged users (support agents, data admins) see the real value. Everyone else sees a masked version. This is zero-copy: you do not maintain separate tables for different access levels.

Tokenization

Replace PII with a random token and store the mapping in a separate, tightly controlled service. Authorized applications can detokenize when needed. This is the strongest protection because the warehouse never contains the raw PII, and the tokenization service has its own access controls and audit logging.
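A minimal sketch of the vault side, assuming a separate token_vault store that only the tokenization service's role can read; the warehouse itself stores only the token column:

-- Lives outside the warehouse, behind its own role and audit log
CREATE TABLE token_vault.mappings (
  token       STRING PRIMARY KEY,  -- random value, e.g. a UUID
  raw_value   STRING,              -- the actual PII, encrypted at rest
  created_at  TIMESTAMP
);

-- Detokenization is a privileged lookup, not a join the warehouse can run
SELECT raw_value
FROM token_vault.mappings
WHERE token = :token;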

Audit Trails

An audit trail answers two questions: who accessed the data, and what did they do with it? Regulations like GDPR, HIPAA, and SOX require demonstrable audit capability. Even without regulatory pressure, audit trails help you investigate incidents and prove that access controls are working.

Query logging: Capture every query executed against sensitive tables. Snowflake provides QUERY_HISTORY. BigQuery provides INFORMATION_SCHEMA.JOBS. Redshift provides STL_QUERY. Route these logs to a tamper-proof store (S3 with object lock, or a dedicated audit database).
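For example, in Snowflake you can pull recent reads against a sensitive table from the account usage views (note these views lag real time by up to a few hours):

-- Who queried the customers table in the last 7 days?
SELECT user_name, query_text, start_time
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%customers%'
  AND start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;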

Access logging: Track GRANT, REVOKE, and role assignment changes. These logs prove that access control changes were authorized and timely.
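A sketch of the same Snowflake view filtered to permission changes, since GRANT and REVOKE appear as query types in their own right:

-- Recent access-control changes and who made them
SELECT user_name, query_text, start_time
FROM snowflake.account_usage.query_history
WHERE query_type IN ('GRANT', 'REVOKE')
ORDER BY start_time DESC;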

Data change logging: For tables that require it, maintain a change history. Use slowly changing dimensions (SCD Type 2), event sourcing, or database-level change data capture (CDC) to preserve the before-and-after state of every record modification.
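An SCD Type 2 history table makes the before-and-after state a query rather than a forensic exercise. A minimal sketch, with hypothetical column names:

-- Each change closes the current row and inserts a new one
CREATE TABLE customers_history (
  customer_id  NUMBER,
  plan         STRING,
  valid_from   TIMESTAMP,
  valid_to     TIMESTAMP,   -- NULL while the row is current
  is_current   BOOLEAN
);

-- State of a record as of a given point in time
SELECT * FROM customers_history
WHERE customer_id = 42
  AND '2024-06-01' >= valid_from
  AND ('2024-06-01' < valid_to OR valid_to IS NULL);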

Retention of audit logs: Audit logs themselves need a retention policy. Most regulations require 1 to 7 years. Store them in a cost-effective tier (cold storage, compressed Parquet on S3) and ensure they are queryable when needed.

Data Governance in Interviews

Governance questions appear in system design and behavioral rounds. Interviewers are not looking for you to recite GDPR articles. They want to know whether you can build systems that protect data and comply with regulations.

System Design: “How would you handle PII in this pipeline?”

Strong answer structure: classify which columns are PII at ingestion. Hash PII columns for analytics. Apply dynamic masking policies for authorized roles. Log all access to tables containing PII. Implement a deletion pipeline for GDPR right-to-erasure requests. Mention specific tools (Snowflake masking policies, BigQuery policy tags) if you have used them.

Behavioral: “Tell me about a time you dealt with data compliance.”

Describe a specific project. What regulation or policy drove it? What technical controls did you implement? What was the outcome? Good examples: building a GDPR deletion pipeline, implementing column-level masking for a new dataset, setting up audit logging for SOX compliance, or migrating PII to a tokenized format.

Data Governance FAQ

What is data governance in practice for a data engineer?
For data engineers, data governance means implementing the technical controls that enforce governance policies. This includes access control (who can read or write which tables), data classification (tagging columns as PII, sensitive, or public), PII handling (masking, hashing, or encrypting personal data), audit trails (logging who accessed what and when), and retention management (automatically archiving or deleting data past its retention period). Governance is not just policy documents; it is infrastructure that data engineers build and maintain.
How do you handle PII in a data warehouse?
Three common approaches. First, masking: replace PII with redacted values in views that non-privileged users query (dynamic masking in Snowflake, column-level security in BigQuery). Second, hashing: apply a one-way hash (SHA-256 with a salt) to PII columns so they can still be used as join keys without exposing raw values. Third, tokenization: replace PII with tokens via a separate service, allowing authorized users to detokenize when needed. The choice depends on whether downstream consumers need the original value. For analytics, hashing usually suffices. For customer support, tokenization with controlled detokenization is better.
What regulations should data engineers know about?
GDPR (European Union) requires right to deletion, consent tracking, and data processing records. CCPA/CPRA (California) requires disclosure of collected data and opt-out mechanisms. HIPAA (US healthcare) requires encryption, access controls, and audit trails for health data. SOX (US financial reporting) requires data integrity controls for financial data. Data engineers do not need to memorize legal text, but they need to build pipelines that support these requirements: deletion pipelines, audit logging, encryption at rest and in transit, and access control by data classification.
How does data governance come up in interviews?
In system design rounds, interviewers ask how you would handle sensitive data in the pipeline you are designing. They want to hear about column-level access control, PII masking or hashing, audit logging, and encryption. In behavioral rounds, they ask about a time you dealt with a governance or compliance requirement. Strong answers describe specific technical implementations: 'I built a dynamic masking layer using Snowflake policies so analysts could query customer tables without seeing raw emails.' Vague answers about 'following best practices' do not score well.

Draw The Control Plane On The Whiteboard, Not The Happy Path

Staff-level system design rounds grade you on the controls around the pipeline, not the pipeline itself. Practice the architecture that wins that second hour of the onsite.