Data Catalog: What Interviewers Test

A data catalog is a centralized inventory of every dataset in your organization. It stores metadata (schemas, owners, quality scores, lineage, usage stats) and provides search so engineers and analysts can find, understand, and trust data without asking around on Slack.

Gartner estimates that data workers spend 30% of their time searching for and validating data. Organizations with mature catalog adoption report a 40% reduction in time-to-insight. The global data catalog market reached $1.2B in 2025 (IDC), growing at 22% CAGR.

What Is a Data Catalog

A data catalog answers one question for every person on your data team: what data do we have, and can I trust it?

1. The Core Idea

A data catalog is a centralized inventory of every dataset in your organization. It stores metadata: schemas, descriptions, ownership, quality scores, usage stats, and tags. Think of it as a search engine for your data platform. When an analyst asks 'do we have customer churn data?', the catalog should answer that question in seconds, not Slack threads.

2. Metadata as First-Class Asset

Catalogs treat metadata as a product. Technical metadata (column types, table sizes, freshness) is collected automatically. Business metadata (descriptions, domain tags, PII classifications) is curated by data owners. The catalog connects both layers so consumers see technical context and business meaning in one place.

3. Governance Built In

Modern catalogs are not just search tools. They enforce governance: who owns each dataset, who can access it, what PII it contains, and whether its quality meets SLA thresholds. A catalog without governance is just a wiki that goes stale. A catalog with governance becomes the control plane for your data platform.

How a Data Catalog Works

[Architecture diagram] Data sources (warehouse, data lake, BI tools, orchestrator) feed a metadata ingestion layer of crawlers, connectors, query log parsers, and change event hooks. The catalog core combines a search index, metadata store, lineage graph, quality engine, access policies, and classification tags. Consumers include analysts, engineers, governance teams, and ML engineers.

The catalog sits between raw data sources and consumers, providing a governed metadata layer that makes data discoverable and trustworthy.

Core Components

A data catalog is not a single feature. It is six capabilities working together. Interviewers expect you to name these and explain how they interact.

Metadata Ingestion

Automated crawlers connect to warehouses, lakes, BI tools, and orchestrators to harvest technical metadata. Schema, column types, row counts, freshness, and partition info are pulled on a schedule or via event-driven hooks. The best catalogs also capture query logs to build usage and popularity signals.
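As a minimal sketch of this idea, the snippet below normalizes one source-specific record into a common catalog model. The `DatasetMetadata` shape and the input field names (`schema`, `table`, `altered`) are hypothetical, standing in for whatever a real warehouse connector returns:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DatasetMetadata:
    """Normalized technical metadata harvested by a crawler (illustrative model)."""
    name: str
    columns: dict[str, str]          # column name -> type
    row_count: int
    last_updated: datetime
    tags: set[str] = field(default_factory=set)

def harvest(raw: dict) -> DatasetMetadata:
    """Normalize one source-specific record (here, an imagined
    information_schema-style row) into the catalog's common model."""
    return DatasetMetadata(
        name=f"{raw['schema']}.{raw['table']}",
        columns={c["name"]: c["type"] for c in raw["columns"]},
        row_count=raw["rows"],
        last_updated=datetime.fromisoformat(raw["altered"]),
    )

record = harvest({
    "schema": "sales", "table": "orders", "rows": 120_000,
    "altered": "2025-01-15T08:00:00+00:00",
    "columns": [{"name": "order_id", "type": "BIGINT"},
                {"name": "email", "type": "VARCHAR"}],
})
print(record.name)  # sales.orders
```

Every connector targets this one shape, which is what lets search, lineage, and policies treat a Snowflake table and an S3 dataset uniformly.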

Search and Discovery

Full-text search across table names, column names, descriptions, tags, and even sample values. Relevance ranking uses popularity (query frequency), freshness, and ownership signals. Think of it as Google for your data warehouse: analysts type a keyword and find the trusted, documented dataset in seconds.
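A hedged sketch of how those ranking signals might combine; the weights and field names here are illustrative choices, not taken from any real catalog:

```python
import math
from datetime import datetime, timezone

def relevance(asset: dict, now: datetime) -> float:
    """Blend popularity, freshness, and curation signals into one
    ranking score. Weights are arbitrary for illustration."""
    popularity = math.log1p(asset["queries_30d"])   # diminishing returns
    age_days = (now - asset["last_updated"]).days
    freshness = 1.0 / (1.0 + age_days)              # decays with staleness
    curated = 1.0 if asset.get("owner") and asset.get("description") else 0.0
    return 0.6 * popularity + 0.3 * freshness + 0.1 * curated

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
assets = [
    {"name": "sales.orders", "queries_30d": 900,
     "last_updated": datetime(2025, 5, 30, tzinfo=timezone.utc),
     "owner": "sales-eng", "description": "Certified orders fact table"},
    {"name": "tmp.orders_copy", "queries_30d": 3,
     "last_updated": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]
ranked = sorted(assets, key=lambda a: relevance(a, now), reverse=True)
print([a["name"] for a in ranked])  # certified table ranks first
```

The log on query counts is a deliberate choice: it keeps one extremely popular table from drowning out every other signal.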

Data Quality Scores

Quality metrics are surfaced directly in the catalog: freshness (when was this table last updated?), completeness (null rate per column), volume (row count trends), and schema drift. Some catalogs integrate with tools like Great Expectations or Monte Carlo to pull quality check results into each asset page.
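The three signals above can be derived mechanically. This is a simplified sketch with made-up field names, not the API of Great Expectations or Monte Carlo:

```python
def quality_signals(table: dict, sla_hours: int = 24) -> dict:
    """Compute the quality signals the catalog would surface:
    freshness vs an SLA, completeness (1 - null rate), and volume
    drift (today's row count vs the trailing average)."""
    freshness_ok = table["hours_since_update"] <= sla_hours
    null_rate = table["null_cells"] / max(table["total_cells"], 1)
    avg = sum(table["daily_rows"]) / len(table["daily_rows"])
    drift = abs(table["daily_rows"][-1] - avg) / max(avg, 1)
    return {
        "freshness_ok": freshness_ok,
        "completeness": round(1 - null_rate, 3),
        "volume_drift": round(drift, 3),
    }

signals = quality_signals({
    "hours_since_update": 6,
    "null_cells": 50, "total_cells": 1000,
    "daily_rows": [100, 102, 98, 101, 99, 100, 100],
})
print(signals)  # {'freshness_ok': True, 'completeness': 0.95, 'volume_drift': 0.0}
```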

Ownership and Stewardship

Every dataset has a designated owner (usually a team or individual). Owners are responsible for descriptions, quality SLAs, and responding to questions. Stewardship workflows let consumers request access, flag issues, or propose corrections. Without clear ownership, catalogs decay into graveyards of undocumented tables.

Lineage Integration

Catalogs display upstream and downstream dependencies for each asset. This is not lineage in isolation; it is lineage in context. When you find a dataset in the catalog, you immediately see where it came from, what transforms built it, and what dashboards depend on it.
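Under the hood this is graph traversal. The sketch below runs a breadth-first walk over a tiny hypothetical edge list to answer the impact-analysis question "what breaks if this table changes?":

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> downstream assets
EDGES = {
    "raw.events": ["staging.events_clean"],
    "staging.events_clean": ["marts.daily_active_users", "marts.churn_features"],
    "marts.daily_active_users": ["dashboard.exec_kpis"],
}

def downstream(asset: str) -> list[str]:
    """Impact analysis: every asset reachable downstream of `asset`."""
    seen, queue = [], deque([asset])
    while queue:
        for child in EDGES.get(queue.popleft(), []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen

print(downstream("raw.events"))
# ['staging.events_clean', 'marts.daily_active_users',
#  'marts.churn_features', 'dashboard.exec_kpis']
```

The same traversal run over reversed edges answers the upstream question: where did this number come from?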

Access Control and Compliance

Tag-based policies tie data classification (PII, HIPAA, internal) to access rules. When a column is tagged as PII, the catalog can enforce masking or restrict access to approved roles. Compliance teams use the catalog to audit who accessed what and whether PII handling meets regulatory requirements.
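A minimal sketch of tag-based enforcement, assuming a hypothetical policy table that maps classification tags to the roles allowed to see values in the clear:

```python
# Hypothetical policy: columns tagged PII are visible in the clear
# only to approved roles; everyone else sees a mask.
POLICIES = {
    "pii.email": {"compliance", "support"},
    "pii.ssn": {"compliance"},
}

def render(value: str, column_tags: set[str], role: str) -> str:
    """Return the value, or a mask if any tag forbids this role."""
    for tag in column_tags:
        allowed = POLICIES.get(tag)
        if allowed is not None and role not in allowed:
            return "****"
    return value

print(render("ana@example.com", {"pii.email"}, "analyst"))     # ****
print(render("ana@example.com", {"pii.email"}, "compliance"))  # ana@example.com
```

The key property is that the policy lives on the tag, not the column: tag a new column as `pii.email` and the masking rule applies with no further configuration.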

Real Catalog Systems and Tools

Interviewers expect you to know the ecosystem. Name the tool, but more importantly, explain where it fits and what trade-offs it makes.

DataHub (Open Source)

Metadata platform originally built at LinkedIn, now maintained by Acryl Data. Supports ingestion from 50+ sources, GraphQL API, column-level lineage, and fine-grained access policies. The most popular open-source catalog.

Apache Atlas (Open Source)

Metadata governance for the Hadoop ecosystem. Strong type system, classification propagation, and glossary support. Common in legacy on-prem environments with Hive, HBase, and Kafka.

Collibra (Enterprise)

Enterprise data intelligence platform. Combines catalog, glossary, data quality, and policy management. Workflow engine for stewardship and access requests. Targeted at Fortune 500 organizations with complex governance needs.

Alation (Enterprise)

Pioneered the data catalog category. Strong in search, curation, and BI integration. Behavioral analysis tracks query patterns to surface popular and trusted datasets. Used heavily in analytics-driven organizations.

Atlan (Modern SaaS)

Cloud-native active metadata platform. Embedded collaboration (Slack-like threads on datasets), automated lineage, and persona-based UI for analysts, engineers, and governance teams. Growing fast in modern data stacks.

Unity Catalog (Databricks)

Databricks native governance layer. Manages tables, volumes, models, and functions across workspaces. Fine-grained ACLs, automatic lineage, and AI/ML asset governance. Tight integration with Delta Lake and MLflow.

AWS Glue Data Catalog (AWS)

Managed Hive metastore for AWS. Stores table definitions for Athena, Redshift Spectrum, EMR, and Lake Formation. Crawlers auto-discover schemas from S3. The default catalog for AWS-native data lakes.

Google Data Catalog (GCP)

Fully managed metadata service for BigQuery, Pub/Sub, Cloud Storage, and Dataproc. Tag templates for custom metadata, policy tags for column-level security, and integration with Data Lineage API.

Data Catalog vs Data Dictionary vs Data Lineage

Interviewers frequently test whether you can distinguish these three concepts. They overlap but serve fundamentally different purposes.

Data Catalog
  • Centralized inventory of all data assets across the organization
  • Search, discovery, and browsing by keyword, tag, or domain
  • Metadata from automated ingestion plus human curation
  • Ownership, quality scores, usage popularity, and access policies
  • Scope: the entire data platform as a searchable product
Data Dictionary
  • Detailed definitions for columns and fields within a specific dataset
  • Business meaning, allowed values, data types, and constraints
  • Typically maintained as documentation or embedded in schema comments
  • Static, often per-table or per-database, no cross-platform scope
  • Scope: one dataset described in depth
Data Lineage
  • Tracks where data comes from, how it transforms, and where it goes
  • Upstream and downstream dependency graphs at table or column level
  • Captured by parsing SQL, instrumenting orchestrators, or runtime hooks
  • Enables impact analysis, debugging, and compliance auditing
  • Scope: the flow of data through pipelines and transformations

How Interviewers Test This

Interviewers rarely ask “define data catalog.” They create scenarios that require you to apply catalog thinking to real organizational problems.

Scenario 1: Self-service analytics

Analysts spend hours searching for the right table. How would you fix this?

What they want to hear

Deploy a data catalog with automated metadata ingestion, quality signals, and ownership. The catalog becomes the entry point for data discovery. Candidates should describe ingestion connectors, search UX, curation workflows, and how to bootstrap adoption across teams.

Scenario 2: PII compliance audit

An auditor asks you to list every system that stores customer email addresses. How do you answer?

What they want to hear

Use the catalog's classification system: tag columns containing PII, propagate tags through lineage, and query the catalog for all assets tagged 'email' or 'PII-direct'. Candidates should mention automated classification (regex, ML-based) and manual review workflows.

Scenario 3: Data mesh ownership

We are moving to a data mesh. How do domains publish and discover each other's data products?

What they want to hear

The catalog serves as the marketplace. Each domain registers data products with schema, SLAs, quality scores, and ownership. Consumers search the catalog, request access, and depend on published contracts. Lineage tracks cross-domain dependencies.

Scenario 4: Duplicate and conflicting data

Two teams have tables called 'revenue' with different numbers. How do you resolve this?

What they want to hear

Use the catalog to surface both tables, compare their lineage and definitions, and designate one as the certified or golden dataset. Add quality badges, deprecation labels, and redirect consumers to the trusted source. Governance workflow enforces the decision.

Scenario 5: Catalog adoption strategy

We bought a catalog tool six months ago and nobody uses it. What went wrong?

What they want to hear

Catalogs fail without curation, ownership, and integration into daily workflows. The fix: automate metadata ingestion (no manual entry), embed catalog links in BI tools and query editors, assign owners per domain, and gamify curation with completeness dashboards. Adoption is a product problem, not a tooling problem.

Interview Questions with Guidance

Q1. What is a data catalog and why does it matter?

A strong answer includes:

A data catalog is a centralized metadata store that enables discovery, governance, and trust across a data platform. It matters because without it, analysts waste hours asking 'does this table exist?' and 'can I trust this number?' The catalog answers both by combining technical metadata, business context, quality scores, and ownership in one searchable interface.

Q2. How does a data catalog differ from a data dictionary?

A strong answer includes:

A dictionary defines columns within a single dataset (types, meanings, constraints). A catalog spans the entire platform: all datasets, all sources, with search, lineage, quality, and access control. A dictionary is a chapter; a catalog is the library. Most catalogs contain dictionary information as a subset.

Q3. What metadata does a catalog store?

A strong answer includes:

Technical metadata: schema, column types, row counts, freshness, partitions, storage format. Business metadata: descriptions, domain tags, PII classifications, glossary terms. Operational metadata: query frequency, last queried by, pipeline dependencies, quality check results. Social metadata: ratings, comments, questions from consumers.

Q4. How would you design metadata ingestion for a catalog?

A strong answer includes:

Connector-based architecture: each source (warehouse, lake, BI tool, orchestrator) has a crawler or push integration. Crawlers run on schedule or trigger on schema changes. Ingested metadata is normalized into a common model (datasets, fields, tags, owners). Important: ingestion must handle schema evolution, deleted assets, and freshness tracking.
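The trickiest part of that answer is the last sentence. One way to handle deleted assets, sketched below under the assumption of a simple dict-backed store, is a diff-based reconciliation pass that soft-deletes vanished tables instead of dropping them (which would destroy lineage history):

```python
def sync(crawled: dict[str, dict], stored: dict[str, dict]) -> dict:
    """One reconciliation pass: upsert what the crawler saw, and
    soft-delete catalog entries whose source table has vanished."""
    changes = {"upserted": [], "soft_deleted": []}
    for name, meta in crawled.items():
        if stored.get(name) != meta:       # new asset or evolved schema
            stored[name] = meta
            changes["upserted"].append(name)
    for name in list(stored):
        if name not in crawled and not stored[name].get("deleted"):
            stored[name] = {**stored[name], "deleted": True}
            changes["soft_deleted"].append(name)
    return changes

stored = {"sales.orders": {"cols": 12}, "tmp.old": {"cols": 3}}
crawled = {"sales.orders": {"cols": 13}}   # schema evolved; tmp.old dropped
result = sync(crawled, stored)
print(result)  # {'upserted': ['sales.orders'], 'soft_deleted': ['tmp.old']}
```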

Q5. What is active metadata and why does it matter?

A strong answer includes:

Active metadata goes beyond passive documentation. It is metadata that drives automation: triggering alerts when freshness SLAs breach, auto-classifying PII columns, recommending datasets based on query patterns, and propagating governance tags through lineage. Active metadata turns the catalog from a reference manual into an operational control plane.
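The "triggering alerts" example can be made concrete. This sketch assumes each asset record carries its own `sla_hours`; real catalogs would emit these alerts to a pager or Slack channel rather than returning a list:

```python
from datetime import datetime, timedelta, timezone

def sla_breaches(assets: list[dict], now: datetime) -> list[str]:
    """Active-metadata sketch: instead of merely displaying freshness,
    the catalog emits an alert when an asset misses its SLA."""
    alerts = []
    for a in assets:
        if now - a["last_updated"] > timedelta(hours=a["sla_hours"]):
            alerts.append(f"{a['name']} stale: SLA {a['sla_hours']}h breached")
    return alerts

now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
alerts = sla_breaches([
    {"name": "marts.revenue", "sla_hours": 6,
     "last_updated": datetime(2025, 5, 31, 20, 0, tzinfo=timezone.utc)},
    {"name": "marts.users", "sla_hours": 24,
     "last_updated": datetime(2025, 6, 1, 9, 0, tzinfo=timezone.utc)},
], now)
print(alerts)  # ['marts.revenue stale: SLA 6h breached']
```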

Q6. How do you handle PII classification in a catalog?

A strong answer includes:

Layer automated and manual approaches. Automated: regex patterns (email, SSN, phone), ML classifiers for free-text fields, and tag propagation through lineage (if a source column is PII, downstream columns inherit the tag). Manual: data stewards review and confirm classifications. Tie PII tags to access policies so classification has enforcement teeth.
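A toy version of the automated layer, combining sample-based regex detection with lineage propagation. The patterns and the 60% match threshold are illustrative; production classifiers use far more patterns plus ML and steward sign-off:

```python
import re

# Illustrative detectors only; real systems use many more patterns.
DETECTORS = {
    "pii.email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "pii.ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def classify(samples: list[str]) -> set[str]:
    """Tag a column when at least 60% of sampled values match a pattern."""
    tags = set()
    for tag, pattern in DETECTORS.items():
        hits = sum(bool(pattern.match(s)) for s in samples)
        if samples and hits / len(samples) >= 0.6:
            tags.add(tag)
    return tags

def propagate(tags: dict[str, set[str]], lineage: dict[str, str]) -> None:
    """Downstream columns inherit tags from their lineage parent."""
    for child, parent in lineage.items():
        tags.setdefault(child, set()).update(tags.get(parent, set()))

tags = {"raw.users.email": classify(["a@x.com", "b@y.org", "oops"])}
propagate(tags, {"marts.users.email": "raw.users.email"})
print(tags["marts.users.email"])  # {'pii.email'}
```

The threshold matters: sampled columns are noisy, so requiring a perfect match rate would miss real PII columns with a few malformed rows.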

Q7. How do you measure catalog adoption?

A strong answer includes:

Key metrics: search volume (are people using it?), curation coverage (percentage of tables with descriptions and owners), time-to-discovery (how fast analysts find data), and consumer satisfaction (survey or NPS). Also track stale metadata: tables with no queries in 90 days that still show as 'active' indicate catalog decay.
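Two of those metrics are straightforward to compute from catalog records. A minimal sketch, assuming each asset record exposes `owner`, `description`, and a 90-day query count:

```python
def adoption_metrics(assets: list[dict]) -> dict:
    """Curation coverage (owned + described) and the share of
    catalog entries with no queries in 90 days (staleness proxy)."""
    total = len(assets)
    curated = sum(1 for a in assets if a.get("owner") and a.get("description"))
    stale = sum(1 for a in assets if a["queries_90d"] == 0)
    return {
        "curation_coverage": round(curated / total, 2),
        "stale_share": round(stale / total, 2),
    }

metrics = adoption_metrics([
    {"owner": "growth", "description": "DAU mart", "queries_90d": 412},
    {"owner": None, "description": None, "queries_90d": 0},
    {"owner": "finance", "description": "Revenue mart", "queries_90d": 88},
    {"owner": "ml", "description": None, "queries_90d": 0},
])
print(metrics)  # {'curation_coverage': 0.5, 'stale_share': 0.5}
```

Trending these two numbers per domain makes adoption visible: coverage should rise and the stale share should fall as the catalog takes hold.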

Q8. How does a catalog support data governance?

A strong answer includes:

The catalog is the governance hub. It enforces ownership (every asset has an accountable team), classification (PII and sensitivity labels), access control (policy tags drive permissions), quality (SLA thresholds with alerting), and lifecycle management (deprecation workflows). Without a catalog, governance is policy documents that nobody follows.

Q9. How would you evaluate data catalog tools?

A strong answer includes:

Key criteria: ingestion breadth (number of source connectors), search quality (relevance ranking, filters), lineage depth (table vs column level), governance features (classification, access policies), extensibility (APIs, custom metadata), and adoption friction (SSO, embedded integrations, UX quality). Also consider open source vs SaaS vs cloud-native trade-offs.

Q10. How does a data catalog fit into a modern data stack?

A strong answer includes:

The catalog sits at the metadata layer, connecting warehouse (Snowflake, BigQuery), transformation (dbt), orchestration (Airflow), and BI (Looker, Tableau). It ingests metadata from all layers and provides a unified view. In a mature stack, the catalog is the entry point: analysts start in the catalog, find the dataset, then open it in their query tool.

Common Interview Mistakes

Treating the catalog as a one-time documentation project

Catalogs require continuous automation. Manual documentation goes stale within weeks. Automate ingestion, enforce ownership, and build curation into team workflows.

Confusing a data catalog with a data dictionary

A dictionary describes one dataset in depth. A catalog is a platform-wide inventory with search, lineage, governance, and access control. Interviewers will test whether you understand the difference in scope.

Naming catalog tools without explaining the architecture

Saying 'we use Collibra' is not an answer. Explain: connectors ingest metadata, a store normalizes it, search indexes it, lineage provides context, and policies enforce governance. Tools implement architecture.

Ignoring the adoption problem

Most catalog failures are adoption failures, not technology failures. If the catalog is not embedded in daily workflows (query editors, BI tools, Slack), people will not use it. Treat the catalog as a product with users, not infrastructure with admins.

Skipping data quality integration

A catalog without quality signals is a table of contents with no reviews. Consumers need to know: is this table fresh? Are nulls within tolerance? Did the last pipeline run succeed? Quality context is what makes catalog entries trustworthy.

Frequently Asked Questions

What is a data catalog?
A data catalog is a centralized inventory of all data assets in an organization. It stores metadata (schemas, descriptions, owners, quality scores, lineage) and provides search and discovery so analysts and engineers can find, understand, and trust data without asking around on Slack.
What are the best data catalog tools?
Open source: DataHub (most popular), Apache Atlas (Hadoop-centric). Enterprise: Collibra, Alation. Modern SaaS: Atlan. Cloud-native: Unity Catalog (Databricks), AWS Glue Data Catalog, Google Data Catalog. Choice depends on your stack, governance needs, and budget.
Do interviews ask about data catalogs?
Yes. Catalog questions appear in system design rounds, governance discussions, and scenario-based questions about data discovery and compliance. They signal that a candidate thinks about the full data platform, not just individual pipelines.
How is a data catalog different from a data warehouse?
A warehouse stores actual data (rows and columns). A catalog stores metadata about that data (what tables exist, who owns them, what they mean, how fresh they are). The catalog helps you find and understand data; the warehouse lets you query it.
What is metadata ingestion?
The automated process of collecting technical metadata from data sources into the catalog. Crawlers connect to warehouses, lakes, and BI tools to harvest schemas, column types, row counts, and freshness. Good ingestion also captures query logs to measure dataset popularity.
How does a data catalog help with GDPR?
The catalog tracks which datasets contain PII through classification tags. Combined with lineage, it shows every downstream system that processes personal data. This enables data subject access requests, right-to-deletion audits, and regulatory reporting.
What is active metadata?
Metadata that drives automation rather than sitting as passive documentation. Examples: auto-classifying PII columns, triggering alerts when freshness SLAs breach, recommending datasets based on query patterns, and propagating governance tags through lineage automatically.
Can you build a data catalog or should you buy one?
For most organizations, buy or adopt open source. Building a catalog requires metadata ingestion, search infrastructure, lineage integration, access control, and a UI. DataHub (open source) or cloud-native options (Glue, Unity Catalog) provide a strong starting point. Only build custom if you have unique requirements that no existing tool addresses.
