Data Engineering

Data Governance for Data Engineers

Governance is the control plane that sits above every table in the warehouse, and every decision you make in the DDL is a governance decision whether you realize it or not: who can read a column, who can join on it, who gets masked values and who gets raw, how long data lives, where the audit event goes, which classification tag propagates downstream through dbt lineage, and which upstream producer can rewrite its schema without paging the security team. This page maps those controls to where they live in the architecture, starting at the role hierarchy and ending at the audit log retention policy, because that's how governance actually shows up on the job and on the system design whiteboard.

5 — Control Layers
17% — Staff Design Rounds
632 — Senior Rounds Tracked
7y — Typical Audit Retention

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

What Governance Means for Data Engineers

Governance is the set of processes and technical controls that ensure data is secure, compliant, discoverable, and trustworthy. For data engineers, this breaks down into five areas of responsibility. Each one has concrete implementation patterns.

Access control: Who can read, write, and administer each dataset? This includes table-level permissions, column-level security, row-level filtering, and service account management.

Data classification: What type of data does each column contain? PII, financial, health-related, internal, or public? Classification drives access control decisions and determines which masking rules apply.

PII handling: How do you protect personal data throughout the pipeline? From ingestion through transformation to the analytics layer, PII must be masked, hashed, or encrypted based on who is consuming it.

Audit trails: Can you answer the question “who accessed this data and when” at any point? Audit logging captures reads, writes, permission changes, and data exports.

Data retention: How long do you keep data, and how do you delete it when required? GDPR right-to-deletion, regulatory retention periods, and storage cost management all drive retention policies.
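The deletion side of retention can be a scheduled sweep rather than a manual process. A minimal sketch in warehouse SQL, assuming a hypothetical governance.deletion_requests queue table that the privacy team populates with approved right-to-erasure requests:

-- Hypothetical right-to-erasure sweep; run both statements in one transaction
DELETE FROM analytics.public.customers
WHERE customer_id IN (
  SELECT customer_id
  FROM governance.deletion_requests
  WHERE status = 'approved'
);

-- Mark the requests as fulfilled so the sweep is idempotent
UPDATE governance.deletion_requests
SET status = 'completed'
WHERE status = 'approved';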

Access Control Patterns

Access control in a data warehouse operates at multiple levels. The goal is least privilege: every user and service account gets exactly the access it needs and nothing more. Here are the layers from coarsest to finest.

Role-Based Access Control (RBAC)

Create roles that map to job functions (analyst, engineer, finance_viewer). Grant table and schema permissions to roles, not individual users. When someone changes teams, revoke the old role and grant the new one. Every modern warehouse (Snowflake, BigQuery, Redshift, Databricks) supports RBAC.

-- Snowflake RBAC example
CREATE ROLE analyst_role;
GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE analyst_role;
GRANT USAGE ON DATABASE analytics TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.public
  TO ROLE analyst_role;

-- Assign to user
GRANT ROLE analyst_role TO USER jane;

Column-Level Security

Some columns in a table are sensitive while others are not. Column-level security restricts which roles can see specific columns. Snowflake uses masking policies. BigQuery uses column-level access control with policy tags. Redshift uses column-level GRANT.

-- Snowflake dynamic masking policy
CREATE MASKING POLICY email_mask AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_ADMIN', 'SUPPORT')
      THEN val
    ELSE '***@***.***'
  END;

ALTER TABLE customers
  MODIFY COLUMN email
  SET MASKING POLICY email_mask;
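In Redshift the same idea is expressed directly in the GRANT statement rather than as a policy object. A sketch, assuming the customers table and an analyst_role role already exist (column names are illustrative):

-- Redshift column-level GRANT: analysts see only non-sensitive columns
GRANT SELECT (customer_id, signup_date, plan)
  ON customers TO ROLE analyst_role;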

Row-Level Security

Row-level security filters rows based on the querying user's role or attributes. A regional manager sees only their region's data. A customer support agent sees only their assigned accounts. This is implemented via security policies (PostgreSQL, SQL Server) or row access policies (Snowflake, BigQuery).

-- PostgreSQL row-level security
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY region_access ON orders
  USING (region = current_setting('app.user_region'));
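The Snowflake equivalent is a row access policy. A sketch, assuming a hypothetical security.region_mapping table that maps role names to the regions they may see:

-- Snowflake row access policy backed by a mapping table
CREATE ROW ACCESS POLICY region_policy AS (region_col STRING)
RETURNS BOOLEAN ->
  EXISTS (
    SELECT 1 FROM security.region_mapping m
    WHERE m.role_name = CURRENT_ROLE()
      AND m.region = region_col
  );

ALTER TABLE orders
  ADD ROW ACCESS POLICY region_policy ON (region);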

Data Classification

Classification is the primary key of governance. Every other control (RBAC, masking, retention, audit) reads from this tag and branches on it. If the column isn't labeled, nothing downstream can enforce anything, and the whole architecture collapses back to trust-based access. Start here, propagate the tag through dbt lineage, and let the rest of the stack react to it.

Level            | Examples                           | Controls
Public           | Product names, published prices    | Standard access, no masking
Internal         | Revenue figures, user counts       | Employee-only access
Confidential     | Salary, contract terms             | Role-restricted, audit logged
Restricted / PII | SSN, email, phone, health records  | Masked by default, encryption at rest, audit logged, retention enforced

Tooling for classification includes data catalogs (Alation, Atlan, DataHub, OpenMetadata), cloud-native tagging (BigQuery policy tags, Snowflake object tagging), and automated scanning tools that detect PII patterns in column names and sample data.
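Whichever catalog you use, the tag ultimately has to land on the column so downstream controls can branch on it. A sketch using Snowflake object tagging, assuming a governance schema exists to hold tag objects:

-- Create a classification tag and attach it to a sensitive column
CREATE TAG governance.pii_level
  ALLOWED_VALUES 'public', 'internal', 'confidential', 'restricted';

ALTER TABLE customers
  MODIFY COLUMN email
  SET TAG governance.pii_level = 'restricted';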

PII Handling Patterns

Personally Identifiable Information requires special treatment at every stage of the pipeline. The approach depends on whether downstream consumers need the original value, a pseudonymized version, or no access at all.

Hashing

Apply SHA-256 (with a salt) to PII columns. The hash preserves the ability to join records across tables without exposing the raw value. Use a consistent salt so the same email hashes to the same value everywhere. Store the salt in a secrets manager, not in the pipeline code.
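In warehouse SQL this is usually a one-line expression in the staging model. A sketch in Snowflake-style SQL, assuming the salt is injected at run time as a session variable set from the secrets manager (never hard-coded):

-- Salted SHA-256: the same email + salt yields the same join key everywhere
SELECT
  SHA2(CONCAT($pii_salt, email), 256) AS email_hash,
  order_total,
  created_at
FROM raw.orders;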

Dynamic Masking

The raw PII exists in the table, but a masking policy hides it from unauthorized roles. Privileged users (support agents, data admins) see the real value. Everyone else sees a masked version. This is zero-copy: you do not maintain separate tables for different access levels.

Tokenization

Replace PII with a random token and store the mapping in a separate, tightly controlled service. Authorized applications can detokenize when needed. This is the strongest protection because the warehouse never contains the raw PII, and the tokenization service has its own access controls and audit logging.
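A minimal sketch of the vault side, assuming a separate token_vault store that only the tokenization service's role can read; the warehouse itself stores only the token column:

-- Lives outside the warehouse, behind its own role and audit log
CREATE TABLE token_vault.mappings (
  token       STRING PRIMARY KEY,  -- random value, e.g. a UUID
  raw_value   STRING,              -- the actual PII, encrypted at rest
  created_at  TIMESTAMP
);

-- Detokenization is a privileged lookup, not a join the warehouse can run
SELECT raw_value
FROM token_vault.mappings
WHERE token = :token;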

Audit Trails

An audit trail answers two questions: who accessed the data, and what did they do with it? Regulations like GDPR, HIPAA, and SOX require demonstrable audit capability. Even without regulatory pressure, audit trails help you investigate incidents and prove that access controls are working.

Query logging: Capture every query executed against sensitive tables. Snowflake provides QUERY_HISTORY. BigQuery provides INFORMATION_SCHEMA.JOBS. Redshift provides STL_QUERY. Route these logs to a tamper-proof store (S3 with object lock, or a dedicated audit database).
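For example, in Snowflake you can pull recent reads against a sensitive table from the account usage views (note these views lag real time by up to a few hours):

-- Who queried the customers table in the last 7 days?
SELECT user_name, query_text, start_time
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%customers%'
  AND start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;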

Access logging: Track GRANT, REVOKE, and role assignment changes. These logs prove that access control changes were authorized and timely.
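A sketch of the same Snowflake view filtered to permission changes, since GRANT and REVOKE appear as query types in their own right:

-- Recent access-control changes and who made them
SELECT user_name, query_text, start_time
FROM snowflake.account_usage.query_history
WHERE query_type IN ('GRANT', 'REVOKE')
ORDER BY start_time DESC;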

Data change logging: For tables that require it, maintain a change history. Use slowly changing dimensions (SCD Type 2), event sourcing, or database-level change data capture (CDC) to preserve the before-and-after state of every record modification.
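An SCD Type 2 history table makes the before-and-after state a query rather than a forensic exercise. A minimal sketch, with hypothetical column names:

-- Each change closes the current row and inserts a new one
CREATE TABLE customers_history (
  customer_id  NUMBER,
  plan         STRING,
  valid_from   TIMESTAMP,
  valid_to     TIMESTAMP,   -- NULL while the row is current
  is_current   BOOLEAN
);

-- State of a record as of a given point in time
SELECT * FROM customers_history
WHERE customer_id = 42
  AND '2024-06-01' >= valid_from
  AND ('2024-06-01' < valid_to OR valid_to IS NULL);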

Retention of audit logs: Audit logs themselves need a retention policy. Most regulations require 1 to 7 years. Store them in a cost-effective tier (cold storage, compressed Parquet on S3) and ensure they are queryable when needed.

Data Governance in Interviews

Governance questions appear in system design and behavioral rounds. Interviewers are not looking for you to recite GDPR articles. They want to know whether you can build systems that protect data and comply with regulations.

System Design: “How would you handle PII in this pipeline?”

Strong answer structure: classify which columns are PII at ingestion. Hash PII columns for analytics. Apply dynamic masking policies for authorized roles. Log all access to tables containing PII. Implement a deletion pipeline for GDPR right-to-erasure requests. Mention specific tools (Snowflake masking policies, BigQuery policy tags) if you have used them.

Behavioral: “Tell me about a time you dealt with data compliance.”

Describe a specific project. What regulation or policy drove it? What technical controls did you implement? What was the outcome? Good examples: building a GDPR deletion pipeline, implementing column-level masking for a new dataset, setting up audit logging for SOX compliance, or migrating PII to a tokenized format.

Data Governance FAQ

What is data governance in practice for a data engineer?
For data engineers, data governance means implementing the technical controls that enforce governance policies. This includes access control (who can read or write which tables), data classification (tagging columns as PII, sensitive, or public), PII handling (masking, hashing, or encrypting personal data), audit trails (logging who accessed what and when), and retention management (automatically archiving or deleting data past its retention period). Governance is not just policy documents; it is infrastructure that data engineers build and maintain.
How do you handle PII in a data warehouse?
Three common approaches. First, masking: replace PII with redacted values in views that non-privileged users query (dynamic masking in Snowflake, column-level security in BigQuery). Second, hashing: apply a one-way hash (SHA-256 with a salt) to PII columns so they can still be used as join keys without exposing raw values. Third, tokenization: replace PII with tokens via a separate service, allowing authorized users to detokenize when needed. The choice depends on whether downstream consumers need the original value. For analytics, hashing usually suffices. For customer support, tokenization with controlled detokenization is better.
What regulations should data engineers know about?
GDPR (European Union) requires right to deletion, consent tracking, and data processing records. CCPA/CPRA (California) requires disclosure of collected data and opt-out mechanisms. HIPAA (US healthcare) requires encryption, access controls, and audit trails for health data. SOX (US financial reporting) requires data integrity controls for financial data. Data engineers do not need to memorize legal text, but they need to build pipelines that support these requirements: deletion pipelines, audit logging, encryption at rest and in transit, and access control by data classification.
How does data governance come up in interviews?
In system design rounds, interviewers ask how you would handle sensitive data in the pipeline you are designing. They want to hear about column-level access control, PII masking or hashing, audit logging, and encryption. In behavioral rounds, they ask about a time you dealt with a governance or compliance requirement. Strong answers describe specific technical implementations: 'I built a dynamic masking layer using Snowflake policies so analysts could query customer tables without seeing raw emails.' Vague answers about 'following best practices' do not score well.

Draw The Control Plane On The Whiteboard, Not The Happy Path

Staff-level system design rounds grade you on the controls around the pipeline, not the pipeline itself. Practice the architecture that wins that second hour of the onsite.