Data Catalog: What Interviewers Test

A data catalog is a centralized inventory of every dataset in your organization. It stores metadata (schemas, owners, quality scores, lineage, usage stats) and provides search so engineers and analysts can find, understand, and trust data without asking around on Slack.

Gartner estimates that data workers spend 30% of their time searching for and validating data. Organizations with mature catalog adoption report a 40% reduction in time-to-insight. The global data catalog market reached $1.2B in 2025 (IDC), growing at 22% CAGR.

What Is a Data Catalog

A data catalog answers one question for every person on your data team: what data do we have, and can I trust it?

1. The Core Idea

A data catalog is a centralized inventory of every dataset in your organization. It stores metadata: schemas, descriptions, ownership, quality scores, usage stats, and tags. Think of it as a search engine for your data platform. When an analyst asks 'do we have customer churn data?', the catalog should answer that question in seconds, not Slack threads.

2. Metadata as First-Class Asset

Catalogs treat metadata as a product. Technical metadata (column types, table sizes, freshness) is collected automatically. Business metadata (descriptions, domain tags, PII classifications) is curated by data owners. The catalog connects both layers so consumers see technical context and business meaning in one place.

3. Governance Built In

Modern catalogs are not just search tools. They enforce governance: who owns each dataset, who can access it, what PII it contains, and whether its quality meets SLA thresholds. A catalog without governance is just a wiki that goes stale. A catalog with governance becomes the control plane for your data platform.

How a Data Catalog Works

[Architecture diagram] Data sources (warehouse, data lake, BI tools, orchestrator) feed a metadata ingestion layer of crawlers, connectors, query log parsers, and change event hooks. The catalog core combines a search index, metadata store, lineage graph, quality engine, access policies, and classification tags. Consumers include analysts, engineers, governance teams, and ML engineers.

The catalog sits between raw data sources and consumers, providing a governed metadata layer that makes data discoverable and trustworthy.

Core Components

A data catalog is not a single feature. It is six capabilities working together. Interviewers expect you to name these and explain how they interact.

Metadata Ingestion

Automated crawlers connect to warehouses, lakes, BI tools, and orchestrators to harvest technical metadata. Schema, column types, row counts, freshness, and partition info are pulled on a schedule or via event-driven hooks. The best catalogs also capture query logs to build usage and popularity signals.
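As a minimal sketch of this idea, the snippet below normalizes one source-specific record into a common catalog model. The `DatasetMetadata` shape and the input field names (`schema`, `table`, `altered`) are hypothetical, standing in for whatever a real warehouse connector returns:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DatasetMetadata:
    """Normalized technical metadata harvested by a crawler (illustrative model)."""
    name: str
    columns: dict[str, str]          # column name -> type
    row_count: int
    last_updated: datetime
    tags: set[str] = field(default_factory=set)

def harvest(raw: dict) -> DatasetMetadata:
    """Normalize one source-specific record (here, an imagined
    information_schema-style row) into the catalog's common model."""
    return DatasetMetadata(
        name=f"{raw['schema']}.{raw['table']}",
        columns={c["name"]: c["type"] for c in raw["columns"]},
        row_count=raw["rows"],
        last_updated=datetime.fromisoformat(raw["altered"]),
    )

record = harvest({
    "schema": "sales", "table": "orders", "rows": 120_000,
    "altered": "2025-01-15T08:00:00+00:00",
    "columns": [{"name": "order_id", "type": "BIGINT"},
                {"name": "email", "type": "VARCHAR"}],
})
print(record.name)  # sales.orders
```

Every connector targets this one shape, which is what lets search, lineage, and policies treat a Snowflake table and an S3 dataset uniformly.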

Search and Discovery

Full-text search across table names, column names, descriptions, tags, and even sample values. Relevance ranking uses popularity (query frequency), freshness, and ownership signals. Think of it as Google for your data warehouse: analysts type a keyword and find the trusted, documented dataset in seconds.
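A hedged sketch of how those ranking signals might combine; the weights and field names here are illustrative choices, not taken from any real catalog:

```python
import math
from datetime import datetime, timezone

def relevance(asset: dict, now: datetime) -> float:
    """Blend popularity, freshness, and curation signals into one
    ranking score. Weights are arbitrary for illustration."""
    popularity = math.log1p(asset["queries_30d"])   # diminishing returns
    age_days = (now - asset["last_updated"]).days
    freshness = 1.0 / (1.0 + age_days)              # decays with staleness
    curated = 1.0 if asset.get("owner") and asset.get("description") else 0.0
    return 0.6 * popularity + 0.3 * freshness + 0.1 * curated

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
assets = [
    {"name": "sales.orders", "queries_30d": 900,
     "last_updated": datetime(2025, 5, 30, tzinfo=timezone.utc),
     "owner": "sales-eng", "description": "Certified orders fact table"},
    {"name": "tmp.orders_copy", "queries_30d": 3,
     "last_updated": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]
ranked = sorted(assets, key=lambda a: relevance(a, now), reverse=True)
print([a["name"] for a in ranked])  # certified table ranks first
```

The log on query counts is a deliberate choice: it keeps one extremely popular table from drowning out every other signal.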

Data Quality Scores

Quality metrics are surfaced directly in the catalog: freshness (when was this table last updated?), completeness (null rate per column), volume (row count trends), and schema drift. Some catalogs integrate with tools like Great Expectations or Monte Carlo to pull quality check results into each asset page.
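The three signals above can be derived mechanically. This is a simplified sketch with made-up field names, not the API of Great Expectations or Monte Carlo:

```python
def quality_signals(table: dict, sla_hours: int = 24) -> dict:
    """Compute the quality signals the catalog would surface:
    freshness vs an SLA, completeness (1 - null rate), and volume
    drift (today's row count vs the trailing average)."""
    freshness_ok = table["hours_since_update"] <= sla_hours
    null_rate = table["null_cells"] / max(table["total_cells"], 1)
    avg = sum(table["daily_rows"]) / len(table["daily_rows"])
    drift = abs(table["daily_rows"][-1] - avg) / max(avg, 1)
    return {
        "freshness_ok": freshness_ok,
        "completeness": round(1 - null_rate, 3),
        "volume_drift": round(drift, 3),
    }

signals = quality_signals({
    "hours_since_update": 6,
    "null_cells": 50, "total_cells": 1000,
    "daily_rows": [100, 102, 98, 101, 99, 100, 100],
})
print(signals)  # {'freshness_ok': True, 'completeness': 0.95, 'volume_drift': 0.0}
```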

Ownership and Stewardship

Every dataset has a designated owner (usually a team or individual). Owners are responsible for descriptions, quality SLAs, and responding to questions. Stewardship workflows let consumers request access, flag issues, or propose corrections. Without clear ownership, catalogs decay into graveyards of undocumented tables.

Lineage Integration

Catalogs display upstream and downstream dependencies for each asset. This is not lineage in isolation; it is lineage in context. When you find a dataset in the catalog, you immediately see where it came from, what transforms built it, and what dashboards depend on it.
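Under the hood this is graph traversal. The sketch below runs a breadth-first walk over a tiny hypothetical edge list to answer the impact-analysis question "what breaks if this table changes?":

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> downstream assets
EDGES = {
    "raw.events": ["staging.events_clean"],
    "staging.events_clean": ["marts.daily_active_users", "marts.churn_features"],
    "marts.daily_active_users": ["dashboard.exec_kpis"],
}

def downstream(asset: str) -> list[str]:
    """Impact analysis: every asset reachable downstream of `asset`."""
    seen, queue = [], deque([asset])
    while queue:
        for child in EDGES.get(queue.popleft(), []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen

print(downstream("raw.events"))
# ['staging.events_clean', 'marts.daily_active_users',
#  'marts.churn_features', 'dashboard.exec_kpis']
```

The same traversal run over reversed edges answers the upstream question: where did this number come from?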

Access Control and Compliance

Tag-based policies tie data classification (PII, HIPAA, internal) to access rules. When a column is tagged as PII, the catalog can enforce masking or restrict access to approved roles. Compliance teams use the catalog to audit who accessed what and whether PII handling meets regulatory requirements.
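A minimal sketch of tag-based enforcement, assuming a hypothetical policy table that maps classification tags to the roles allowed to see values in the clear:

```python
# Hypothetical policy: columns tagged PII are visible in the clear
# only to approved roles; everyone else sees a mask.
POLICIES = {
    "pii.email": {"compliance", "support"},
    "pii.ssn": {"compliance"},
}

def render(value: str, column_tags: set[str], role: str) -> str:
    """Return the value, or a mask if any tag forbids this role."""
    for tag in column_tags:
        allowed = POLICIES.get(tag)
        if allowed is not None and role not in allowed:
            return "****"
    return value

print(render("ana@example.com", {"pii.email"}, "analyst"))     # ****
print(render("ana@example.com", {"pii.email"}, "compliance"))  # ana@example.com
```

The key property is that the policy lives on the tag, not the column: tag a new column as `pii.email` and the masking rule applies with no further configuration.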

Real Catalog Systems and Tools

Interviewers expect you to know the ecosystem. Name the tool, but more importantly, explain where it fits and what trade-offs it makes.

DataHub (Open Source)

Metadata platform originally built at LinkedIn, now maintained by Acryl Data. Supports ingestion from 50+ sources, GraphQL API, column-level lineage, and fine-grained access policies. The most popular open-source catalog.

Apache Atlas (Open Source)

Metadata governance for the Hadoop ecosystem. Strong type system, classification propagation, and glossary support. Common in legacy on-prem environments with Hive, HBase, and Kafka.

Collibra (Enterprise)

Enterprise data intelligence platform. Combines catalog, glossary, data quality, and policy management. Workflow engine for stewardship and access requests. Targeted at Fortune 500 organizations with complex governance needs.

Alation (Enterprise)

Pioneered the data catalog category. Strong in search, curation, and BI integration. Behavioral analysis tracks query patterns to surface popular and trusted datasets. Used heavily in analytics-driven organizations.

Atlan (Modern SaaS)

Cloud-native active metadata platform. Embedded collaboration (Slack-like threads on datasets), automated lineage, and persona-based UI for analysts, engineers, and governance teams. Growing fast in modern data stacks.

Unity Catalog (Databricks)

Databricks native governance layer. Manages tables, volumes, models, and functions across workspaces. Fine-grained ACLs, automatic lineage, and AI/ML asset governance. Tight integration with Delta Lake and MLflow.

AWS Glue Data Catalog (AWS)

Managed Hive metastore for AWS. Stores table definitions for Athena, Redshift Spectrum, EMR, and Lake Formation. Crawlers auto-discover schemas from S3. The default catalog for AWS-native data lakes.

Google Data Catalog (GCP)

Fully managed metadata service for BigQuery, Pub/Sub, Cloud Storage, and Dataproc. Tag templates for custom metadata, policy tags for column-level security, and integration with Data Lineage API.

Data Catalog vs Data Dictionary vs Data Lineage

Interviewers frequently test whether you can distinguish these three concepts. They overlap but serve fundamentally different purposes.

Data Catalog
  • Centralized inventory of all data assets across the organization
  • Search, discovery, and browsing by keyword, tag, or domain
  • Metadata from automated ingestion plus human curation
  • Ownership, quality scores, usage popularity, and access policies
  • Scope: the entire data platform as a searchable product
Data Dictionary
  • Detailed definitions for columns and fields within a specific dataset
  • Business meaning, allowed values, data types, and constraints
  • Typically maintained as documentation or embedded in schema comments
  • Static, often per-table or per-database, no cross-platform scope
  • Scope: one dataset described in depth
Data Lineage
  • Tracks where data comes from, how it transforms, and where it goes
  • Upstream and downstream dependency graphs at table or column level
  • Captured by parsing SQL, instrumenting orchestrators, or runtime hooks
  • Enables impact analysis, debugging, and compliance auditing
  • Scope: the flow of data through pipelines and transformations

How Interviewers Test This

Interviewers rarely ask “define data catalog.” They create scenarios that require you to apply catalog thinking to real organizational problems.

Scenario 1: Self-service analytics

Analysts spend hours searching for the right table. How would you fix this?

What they want to hear

Deploy a data catalog with automated metadata ingestion, quality signals, and ownership. The catalog becomes the entry point for data discovery. Candidates should describe ingestion connectors, search UX, curation workflows, and how to bootstrap adoption across teams.

Scenario 2: PII compliance audit

An auditor asks you to list every system that stores customer email addresses. How do you answer?

What they want to hear

Use the catalog's classification system: tag columns containing PII, propagate tags through lineage, and query the catalog for all assets tagged 'email' or 'PII-direct'. Candidates should mention automated classification (regex, ML-based) and manual review workflows.

Scenario 3: Data mesh ownership

We are moving to a data mesh. How do domains publish and discover each other's data products?

What they want to hear

The catalog serves as the marketplace. Each domain registers data products with schema, SLAs, quality scores, and ownership. Consumers search the catalog, request access, and depend on published contracts. Lineage tracks cross-domain dependencies.

Scenario 4: Duplicate and conflicting data

Two teams have tables called 'revenue' with different numbers. How do you resolve this?

What they want to hear

Use the catalog to surface both tables, compare their lineage and definitions, and designate one as the certified or golden dataset. Add quality badges, deprecation labels, and redirect consumers to the trusted source. Governance workflow enforces the decision.

Scenario 5: Catalog adoption strategy

We bought a catalog tool six months ago and nobody uses it. What went wrong?

What they want to hear

Catalogs fail without curation, ownership, and integration into daily workflows. The fix: automate metadata ingestion (no manual entry), embed catalog links in BI tools and query editors, assign owners per domain, and gamify curation with completeness dashboards. Adoption is a product problem, not a tooling problem.

Interview Questions with Guidance

Q1. What is a data catalog and why does it matter?

A strong answer includes:

A data catalog is a centralized metadata store that enables discovery, governance, and trust across a data platform. It matters because without it, analysts waste hours asking 'does this table exist?' and 'can I trust this number?' The catalog answers both by combining technical metadata, business context, quality scores, and ownership in one searchable interface.

Q2. How does a data catalog differ from a data dictionary?

A strong answer includes:

A dictionary defines columns within a single dataset (types, meanings, constraints). A catalog spans the entire platform: all datasets, all sources, with search, lineage, quality, and access control. A dictionary is a chapter; a catalog is the library. Most catalogs contain dictionary information as a subset.

Q3. What metadata does a catalog store?

A strong answer includes:

Technical metadata: schema, column types, row counts, freshness, partitions, storage format. Business metadata: descriptions, domain tags, PII classifications, glossary terms. Operational metadata: query frequency, last queried by, pipeline dependencies, quality check results. Social metadata: ratings, comments, questions from consumers.

Q4. How would you design metadata ingestion for a catalog?

A strong answer includes:

Connector-based architecture: each source (warehouse, lake, BI tool, orchestrator) has a crawler or push integration. Crawlers run on schedule or trigger on schema changes. Ingested metadata is normalized into a common model (datasets, fields, tags, owners). Important: ingestion must handle schema evolution, deleted assets, and freshness tracking.
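The trickiest part of that answer is the last sentence. One way to handle deleted assets, sketched below under the assumption of a simple dict-backed store, is a diff-based reconciliation pass that soft-deletes vanished tables instead of dropping them (which would destroy lineage history):

```python
def sync(crawled: dict[str, dict], stored: dict[str, dict]) -> dict:
    """One reconciliation pass: upsert what the crawler saw, and
    soft-delete catalog entries whose source table has vanished."""
    changes = {"upserted": [], "soft_deleted": []}
    for name, meta in crawled.items():
        if stored.get(name) != meta:       # new asset or evolved schema
            stored[name] = meta
            changes["upserted"].append(name)
    for name in list(stored):
        if name not in crawled and not stored[name].get("deleted"):
            stored[name] = {**stored[name], "deleted": True}
            changes["soft_deleted"].append(name)
    return changes

stored = {"sales.orders": {"cols": 12}, "tmp.old": {"cols": 3}}
crawled = {"sales.orders": {"cols": 13}}   # schema evolved; tmp.old dropped
result = sync(crawled, stored)
print(result)  # {'upserted': ['sales.orders'], 'soft_deleted': ['tmp.old']}
```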

Q5. What is active metadata and why does it matter?

A strong answer includes:

Active metadata goes beyond passive documentation. It is metadata that drives automation: triggering alerts when freshness SLAs breach, auto-classifying PII columns, recommending datasets based on query patterns, and propagating governance tags through lineage. Active metadata turns the catalog from a reference manual into an operational control plane.
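The "triggering alerts" example can be made concrete. This sketch assumes each asset record carries its own `sla_hours`; real catalogs would emit these alerts to a pager or Slack channel rather than returning a list:

```python
from datetime import datetime, timedelta, timezone

def sla_breaches(assets: list[dict], now: datetime) -> list[str]:
    """Active-metadata sketch: instead of merely displaying freshness,
    the catalog emits an alert when an asset misses its SLA."""
    alerts = []
    for a in assets:
        if now - a["last_updated"] > timedelta(hours=a["sla_hours"]):
            alerts.append(f"{a['name']} stale: SLA {a['sla_hours']}h breached")
    return alerts

now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
alerts = sla_breaches([
    {"name": "marts.revenue", "sla_hours": 6,
     "last_updated": datetime(2025, 5, 31, 20, 0, tzinfo=timezone.utc)},
    {"name": "marts.users", "sla_hours": 24,
     "last_updated": datetime(2025, 6, 1, 9, 0, tzinfo=timezone.utc)},
], now)
print(alerts)  # ['marts.revenue stale: SLA 6h breached']
```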

Q6. How do you handle PII classification in a catalog?

A strong answer includes:

Layer automated and manual approaches. Automated: regex patterns (email, SSN, phone), ML classifiers for free-text fields, and tag propagation through lineage (if a source column is PII, downstream columns inherit the tag). Manual: data stewards review and confirm classifications. Tie PII tags to access policies so classification has enforcement teeth.
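A toy version of the automated layer, combining sample-based regex detection with lineage propagation. The patterns and the 60% match threshold are illustrative; production classifiers use far more patterns plus ML and steward sign-off:

```python
import re

# Illustrative detectors only; real systems use many more patterns.
DETECTORS = {
    "pii.email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "pii.ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def classify(samples: list[str]) -> set[str]:
    """Tag a column when at least 60% of sampled values match a pattern."""
    tags = set()
    for tag, pattern in DETECTORS.items():
        hits = sum(bool(pattern.match(s)) for s in samples)
        if samples and hits / len(samples) >= 0.6:
            tags.add(tag)
    return tags

def propagate(tags: dict[str, set[str]], lineage: dict[str, str]) -> None:
    """Downstream columns inherit tags from their lineage parent."""
    for child, parent in lineage.items():
        tags.setdefault(child, set()).update(tags.get(parent, set()))

tags = {"raw.users.email": classify(["a@x.com", "b@y.org", "oops"])}
propagate(tags, {"marts.users.email": "raw.users.email"})
print(tags["marts.users.email"])  # {'pii.email'}
```

The threshold matters: sampled columns are noisy, so requiring a perfect match rate would miss real PII columns with a few malformed rows.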

Q7. How do you measure catalog adoption?

A strong answer includes:

Key metrics: search volume (are people using it?), curation coverage (percentage of tables with descriptions and owners), time-to-discovery (how fast analysts find data), and consumer satisfaction (survey or NPS). Also track stale metadata: tables with no queries in 90 days that still show as 'active' indicate catalog decay.
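Two of those metrics are straightforward to compute from catalog records. A minimal sketch, assuming each asset record exposes `owner`, `description`, and a 90-day query count:

```python
def adoption_metrics(assets: list[dict]) -> dict:
    """Curation coverage (owned + described) and the share of
    catalog entries with no queries in 90 days (staleness proxy)."""
    total = len(assets)
    curated = sum(1 for a in assets if a.get("owner") and a.get("description"))
    stale = sum(1 for a in assets if a["queries_90d"] == 0)
    return {
        "curation_coverage": round(curated / total, 2),
        "stale_share": round(stale / total, 2),
    }

metrics = adoption_metrics([
    {"owner": "growth", "description": "DAU mart", "queries_90d": 412},
    {"owner": None, "description": None, "queries_90d": 0},
    {"owner": "finance", "description": "Revenue mart", "queries_90d": 88},
    {"owner": "ml", "description": None, "queries_90d": 0},
])
print(metrics)  # {'curation_coverage': 0.5, 'stale_share': 0.5}
```

Trending these two numbers per domain makes adoption visible: coverage should rise and the stale share should fall as the catalog takes hold.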

Q8. How does a catalog support data governance?

A strong answer includes:

The catalog is the governance hub. It enforces ownership (every asset has an accountable team), classification (PII and sensitivity labels), access control (policy tags drive permissions), quality (SLA thresholds with alerting), and lifecycle management (deprecation workflows). Without a catalog, governance is policy documents that nobody follows.

Q9. How would you evaluate data catalog tools?

A strong answer includes:

Key criteria: ingestion breadth (number of source connectors), search quality (relevance ranking, filters), lineage depth (table vs column level), governance features (classification, access policies), extensibility (APIs, custom metadata), and adoption friction (SSO, embedded integrations, UX quality). Also consider open source vs SaaS vs cloud-native trade-offs.

Q10. How does a data catalog fit into a modern data stack?

A strong answer includes:

The catalog sits at the metadata layer, connecting warehouse (Snowflake, BigQuery), transformation (dbt), orchestration (Airflow), and BI (Looker, Tableau). It ingests metadata from all layers and provides a unified view. In a mature stack, the catalog is the entry point: analysts start in the catalog, find the dataset, then open it in their query tool.

Common Interview Mistakes

Treating the catalog as a one-time documentation project

Catalogs require continuous automation. Manual documentation goes stale within weeks. Automate ingestion, enforce ownership, and build curation into team workflows.

Confusing a data catalog with a data dictionary

A dictionary describes one dataset in depth. A catalog is a platform-wide inventory with search, lineage, governance, and access control. Interviewers will test whether you understand the difference in scope.

Naming catalog tools without explaining the architecture

Saying 'we use Collibra' is not an answer. Explain: connectors ingest metadata, a store normalizes it, search indexes it, lineage provides context, and policies enforce governance. Tools implement architecture.

Ignoring the adoption problem

Most catalog failures are adoption failures, not technology failures. If the catalog is not embedded in daily workflows (query editors, BI tools, Slack), people will not use it. Treat the catalog as a product with users, not infrastructure with admins.

Skipping data quality integration

A catalog without quality signals is a table of contents with no reviews. Consumers need to know: is this table fresh? Are nulls within tolerance? Did the last pipeline run succeed? Quality context is what makes catalog entries trustworthy.

Frequently Asked Questions

What is a data catalog?
A data catalog is a centralized inventory of all data assets in an organization. It stores metadata (schemas, descriptions, owners, quality scores, lineage) and provides search and discovery so analysts and engineers can find, understand, and trust data without asking around on Slack.
What are the best data catalog tools?
Open source: DataHub (most popular), Apache Atlas (Hadoop-centric). Enterprise: Collibra, Alation. Modern SaaS: Atlan. Cloud-native: Unity Catalog (Databricks), AWS Glue Data Catalog, Google Data Catalog. Choice depends on your stack, governance needs, and budget.
Do interviews ask about data catalogs?
Yes. Catalog questions appear in system design rounds, governance discussions, and scenario-based questions about data discovery and compliance. They signal that a candidate thinks about the full data platform, not just individual pipelines.
How is a data catalog different from a data warehouse?
A warehouse stores actual data (rows and columns). A catalog stores metadata about that data (what tables exist, who owns them, what they mean, how fresh they are). The catalog helps you find and understand data; the warehouse lets you query it.
What is metadata ingestion?
The automated process of collecting technical metadata from data sources into the catalog. Crawlers connect to warehouses, lakes, and BI tools to harvest schemas, column types, row counts, and freshness. Good ingestion also captures query logs to measure dataset popularity.
How does a data catalog help with GDPR?
The catalog tracks which datasets contain PII through classification tags. Combined with lineage, it shows every downstream system that processes personal data. This enables data subject access requests, right-to-deletion audits, and regulatory reporting.
What is active metadata?
Metadata that drives automation rather than sitting as passive documentation. Examples: auto-classifying PII columns, triggering alerts when freshness SLAs breach, recommending datasets based on query patterns, and propagating governance tags through lineage automatically.
Can you build a data catalog or should you buy one?
For most organizations, buy or adopt open source. Building a catalog requires metadata ingestion, search infrastructure, lineage integration, access control, and a UI. DataHub (open source) or cloud-native options (Glue, Unity Catalog) provide a strong starting point. Only build custom if you have unique requirements that no existing tool addresses.
