Data Governance for Data Engineers
Governance is the control plane that sits above every table in the warehouse, and every decision you make in the DDL is a governance decision whether you know it or not. Who can read the column, who can join it, who gets masked, who gets raw, how long it lives, where the audit event goes, which classification tag propagates downstream through dbt lineage, and which upstream producer can rewrite...
What Governance Means for Data Engineers
Governance is the set of processes and technical controls that ensure data is secure, compliant, discoverable, and trustworthy. For data engineers, this breaks down into five areas of responsibility. Each one has concrete implementation patterns.
Access control: Who can read, write, and administer each dataset? This includes table-level permissions, column-level security, row-level filtering, and service account management.
Data classification: What type of data does each column contain? PII, financial, health-related, internal, or public? Classification drives access control decisions and determines which masking rules apply.
PII handling: How do you protect personal data throughout the pipeline? From ingestion through transformation to the analytics layer, PII must be masked, hashed, or encrypted based on who is consuming it.
Audit trails: Can you answer the question 'who accessed this data and when' at any point? Audit logging captures reads, writes, permission changes, and data exports.
Data retention: How long do you keep data, and how do you delete it when required? GDPR right-to-deletion, regulatory retention periods, and storage cost management all drive retention policies.
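The retention question in the last item can be made concrete with a small check that a cleanup job might run. This is a minimal sketch under stated assumptions: the dataset names, the day counts in `RETENTION_DAYS`, and the helper names are all hypothetical and are not taken from any regulation.

```python
from datetime import date, timedelta

# Hypothetical retention periods per dataset, in days (illustrative values only).
RETENTION_DAYS = {
    "web_clickstream": 395,
    "support_tickets": 730,
}


def expiry_date(dataset: str, created_on: date) -> date:
    """Return the first date on which a record becomes eligible for deletion."""
    return created_on + timedelta(days=RETENTION_DAYS[dataset])


def is_expired(dataset: str, created_on: date, today: date) -> bool:
    """True when the record has outlived its dataset's retention period."""
    return today >= expiry_date(dataset, created_on)
```

Passing `today` in explicitly, rather than reading the clock inside the function, keeps the rule deterministic and easy to test.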
Audit Trails
An audit trail answers two questions: who accessed the data, and what did they do with it? Regulations like GDPR, HIPAA, and SOX require demonstrable audit capability. Even without regulatory pressure, audit trails help you investigate incidents and prove that access controls are working.
Query logging: Capture every query executed against sensitive tables. Snowflake provides QUERY_HISTORY. BigQuery provides INFORMATION_SCHEMA.JOBS. Redshift provides STL_QUERY. Route these logs to a tamper-resistant store (S3 with Object Lock, or a dedicated audit database) so they outlive the warehouse's own log retention.
Access logging: Track GRANT, REVOKE, and role assignment changes. These logs prove that access control changes were authorized and timely.
Data change logging: For tables that require it, maintain a change history. Use slowly changing dimensions (SCD Type 2), event sourcing, or database-level change data capture (CDC) to preserve the before-and-after state of every record modification.
Retention of audit logs: Audit logs themselves need a retention policy. Required periods vary by regulation and commonly fall between 1 and 7 years, so confirm the figure that applies to you. Store the logs in a cost-effective tier (cold storage, compressed Parquet on S3) and ensure they remain queryable when needed.
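To illustrate the "who accessed this data and when" question, here is a minimal sketch of an audit record and a lookup over it. The `AuditEvent` fields, the `who_accessed` helper, and the sample events are hypothetical. In practice the events would be loaded from the warehouse's query log rather than built in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One immutable audit record; the field names are illustrative."""
    actor: str             # user or service account
    action: str            # e.g. "SELECT", "INSERT", "GRANT"
    resource: str          # fully qualified table name
    occurred_at: datetime


def who_accessed(events, resource, start, end):
    """Return (actor, action) pairs for `resource` within [start, end), oldest first."""
    matching = [
        e for e in events
        if e.resource == resource and start <= e.occurred_at < end
    ]
    matching.sort(key=lambda e: e.occurred_at)
    return [(e.actor, e.action) for e in matching]


# Hypothetical sample events.
UTC = timezone.utc
EVENTS = [
    AuditEvent("jane", "SELECT", "analytics.public.customers", datetime(2024, 3, 1, 9, 30, tzinfo=UTC)),
    AuditEvent("etl_service", "INSERT", "analytics.public.customers", datetime(2024, 3, 1, 2, 0, tzinfo=UTC)),
    AuditEvent("jane", "SELECT", "analytics.public.orders", datetime(2024, 3, 1, 9, 45, tzinfo=UTC)),
    AuditEvent("sam", "SELECT", "analytics.public.customers", datetime(2024, 3, 5, 14, 0, tzinfo=UTC)),
]
```

The record is frozen so that an event cannot be altered after it is written, which mirrors the append-only property expected of the real log store.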
Role-Based Access Control (RBAC)
-- Snowflake RBAC example
CREATE ROLE analyst_role;
GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE analyst_role;
GRANT USAGE ON DATABASE analytics TO ROLE analyst_role;
-- USAGE on the schema is also required before the role can query its tables
GRANT USAGE ON SCHEMA analytics.public TO ROLE analyst_role;
-- Covers tables that exist now; tables created later need a separate grant
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.public
  TO ROLE analyst_role;
-- Assign to user
GRANT ROLE analyst_role TO USER jane;

Create roles that map to job functions (analyst, engineer, finance_viewer). Grant table and schema permissions to roles, not individual users. When someone changes teams, revoke the old role and grant the new one.
Column-Level Security
-- Snowflake dynamic masking policy
CREATE MASKING POLICY email_mask AS (val STRING)
RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('DATA_ADMIN', 'SUPPORT')
THEN val
ELSE '***@***.***'
END;
ALTER TABLE customers
MODIFY COLUMN email
SET MASKING POLICY email_mask;

Column-level security restricts which roles can see specific columns. Snowflake uses masking policies. BigQuery uses column-level access control with policy tags. Redshift uses column-level GRANT.
Row-Level Security
-- PostgreSQL row-level security
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY region_access ON orders
  USING (region = current_setting('app.user_region'));

Row-level security filters rows based on the querying user's role or attributes. A regional manager sees only their region's data. Implemented via security policies (PostgreSQL, SQL Server) or row access policies (Snowflake, BigQuery).
Data Classification Levels
Classification is the primary key of governance. Every other control (RBAC, masking, retention, audit) reads from this tag and branches on it. If the column is not labeled, nothing downstream can enforce anything.
| Level | Examples | Controls |
|---|---|---|
| Public | Product names, published prices | Standard access, no masking |
| Internal | Revenue figures, user counts | Employee-only access |
| Confidential | Salary, contract terms | Role-restricted, audit logged |
| Restricted / PII | SSN, email, phone, health records | Masked by default, encryption at rest, audit logged, retention enforced |
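One way to picture how other controls "read from this tag and branch on it" is a lookup from column to level to controls. This is a minimal sketch: the column names in `COLUMN_LEVELS` are hypothetical, the flags reflect only what the table above lists for each level, and raising an error for untagged columns is an assumption about how a pipeline might surface missing labels.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"


@dataclass(frozen=True)
class Controls:
    masked_by_default: bool
    audit_logged: bool
    retention_enforced: bool


# Flags reflect only what the table lists for each level.
CONTROLS_BY_LEVEL = {
    Level.PUBLIC: Controls(False, False, False),
    Level.INTERNAL: Controls(False, False, False),
    Level.CONFIDENTIAL: Controls(False, True, False),
    Level.RESTRICTED: Controls(True, True, True),
}

# Hypothetical column tags; in practice these would come from a catalog.
COLUMN_LEVELS = {
    "products.name": Level.PUBLIC,
    "employees.salary": Level.CONFIDENTIAL,
    "customers.email": Level.RESTRICTED,
}


def controls_for(column: str) -> Controls:
    """Return the controls for a column, refusing to guess for untagged columns."""
    level = COLUMN_LEVELS.get(column)
    if level is None:
        raise LookupError(f"{column} has no classification tag")
    return CONTROLS_BY_LEVEL[level]
```

Failing loudly on an untagged column makes the gap visible at build time instead of letting an unlabeled column pass through without controls.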
PII Handling Patterns
Personally Identifiable Information (PII) requires special treatment at every stage of the pipeline. The approach depends on whether downstream consumers need the original value, a pseudonymized version, or no access at all.
Hashing
Apply SHA-256 (with a salt) to PII columns. The hash preserves the ability to join records across tables without exposing the raw value. Use a consistent salt so the same email hashes to the same value everywhere. Store the salt in a secrets manager, not in the pipeline code.
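A minimal sketch of the salted SHA-256 approach described above. The function name `pseudonymize` is hypothetical, and trimming and lowercasing the input is an added assumption that suits emails but may not suit other fields. The salt is passed in by the caller and would be loaded from a secrets manager at runtime.

```python
import hashlib


def pseudonymize(value: str, salt: bytes) -> str:
    """Return the hex SHA-256 digest of the salt followed by the normalized value."""
    # Normalization is an assumption: it makes "Jane@Example.com " and
    # "jane@example.com" hash to the same join key.
    normalized = value.strip().lower()
    return hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest()
```

Because the same salt and input always yield the same digest, two tables hashed with the same salt can still be joined on the hashed column. Where a keyed construction is preferred, `hmac.new(salt, msg, hashlib.sha256)` from the standard library is a common alternative.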
Dynamic Masking
The raw PII exists in the table, but a masking policy hides it from unauthorized roles. Privileged users (support agents, data admins) see the real value. Everyone else sees a masked version. This is zero-copy: you do not maintain separate tables for different access levels.
Tokenization
Replace PII with a random token and store the mapping in a separate, tightly controlled service. Authorized applications can detokenize when needed. This is the strongest protection because the warehouse never contains the raw PII, and the tokenization service has its own access controls and audit logging.
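The tokenization flow can be sketched with an in-memory stand-in for the mapping service. Everything here is hypothetical: the `TokenVault` class, the `tok_` prefix, and the caller names. A real service would persist the mapping in a separate, access-controlled store instead of Python dictionaries.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a tokenization service (illustrative only)."""

    def __init__(self, authorized_callers):
        self._authorized = set(authorized_callers)
        self._token_by_value = {}
        self._value_by_token = {}
        self.audit_log = []  # (caller, outcome, token) tuples

    def tokenize(self, value: str) -> str:
        """Return the existing token for a value, or mint a new random one."""
        token = self._token_by_value.get(value)
        if token is None:
            token = "tok_" + secrets.token_hex(16)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return token

    def detokenize(self, token: str, caller: str) -> str:
        """Return the raw value to authorized callers and log every attempt."""
        allowed = caller in self._authorized
        self.audit_log.append((caller, "allowed" if allowed else "denied", token))
        if not allowed:
            raise PermissionError(f"{caller} may not detokenize")
        return self._value_by_token[token]
```

The token is random rather than derived from the value, so it reveals nothing about the underlying PII. Only the vault's mapping can reverse it, and every reversal attempt leaves an audit entry.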
Data Governance in Interviews
Governance questions appear in system design and behavioral rounds. Interviewers are not looking for you to recite GDPR articles. They want to know whether you can build systems that protect data and comply with regulations.
System Design: How would you handle PII in this pipeline?
Strong answer structure: classify which columns are PII at ingestion. Hash PII columns for analytics. Apply dynamic masking policies so that only authorized roles see raw values. Log all access to tables containing PII. Implement a deletion pipeline for GDPR right-to-erasure requests. Mention specific tools (Snowflake masking policies, BigQuery policy tags) if you have used them.
Behavioral: Tell me about a time you dealt with data compliance.
Describe a specific project. What regulation or policy drove it? What technical controls did you implement? What was the outcome? Good examples: building a GDPR deletion pipeline, implementing column-level masking for a new dataset, setting up audit logging for SOX compliance, or migrating PII to a tokenized format.