Data Catalog: Interview Questions & Tools Guide (2026)
A data catalog is a centralized inventory of every dataset in your organization. It stores metadata (schemas, owners, quality scores, lineage, usage stats) and provides search so engineers and analysts can find, understand, and trust data without asking around on Slack.
What Is a Data Catalog
A data catalog answers one question for every person on your data team: what data do we have, and can I trust it?
The Core Idea
The catalog stores metadata for every dataset: schemas, descriptions, ownership, quality scores, usage stats, and tags. Think of it as a search engine for your data platform. When an analyst asks 'do we have customer churn data?', the catalog should answer in seconds, without a Slack thread.
Metadata as First-Class Asset
Catalogs treat metadata as a product. Technical metadata (column types, table sizes, freshness) is collected automatically. Business metadata (descriptions, domain tags, PII classifications) is curated by data owners. The catalog connects both layers so consumers see technical context and business meaning in one place.
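One way to picture the two layers is a single catalog entry that holds both. The sketch below is illustrative only: the class and field names (`TechnicalMetadata`, `BusinessMetadata`, `CatalogEntry`) are hypothetical and do not come from any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TechnicalMetadata:
    # Collected automatically by crawlers.
    column_types: dict[str, str]
    row_count: int
    last_updated: str  # ISO 8601 timestamp


@dataclass
class BusinessMetadata:
    # Curated by data owners.
    description: str = ""
    domain: Optional[str] = None
    pii_columns: set[str] = field(default_factory=set)


@dataclass
class CatalogEntry:
    name: str
    technical: TechnicalMetadata
    business: BusinessMetadata = field(default_factory=BusinessMetadata)

    def summary(self) -> str:
        """Render technical context and business meaning in one line."""
        pii = ", ".join(sorted(self.business.pii_columns)) or "none"
        desc = self.business.description or "(no description yet)"
        return f"{self.name}: {desc} | {self.technical.row_count} rows | PII: {pii}"


entry = CatalogEntry(
    name="analytics.customers",
    technical=TechnicalMetadata(
        {"id": "INTEGER", "email": "TEXT"}, 1200, "2026-01-15T08:00:00Z"
    ),
)
# The owner adds business context after the crawler created the entry.
entry.business.description = "One row per customer"
entry.business.pii_columns.add("email")
print(entry.summary())
# analytics.customers: One row per customer | 1200 rows | PII: email
```

Keeping the two layers as separate objects makes it easy for a crawler to overwrite the technical half on every run without touching what owners have curated.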
Governance Built In
Modern catalogs are not just search tools. They enforce governance: who owns each dataset, who can access it, what PII it contains, and whether its quality meets SLA thresholds. A catalog without governance is just a wiki that goes stale. A catalog with governance becomes the control plane for your data platform.
How a Data Catalog Works
Data Sources: Warehouse, Data Lake, BI Tools, Orchestrator
Metadata Ingestion Layer: Crawlers, connectors, query log parsers, change event hooks
Catalog Core: Search Index, Metadata Store, Lineage Graph, Quality Engine, Access Policies, Classification Tags
Consumers: Analysts, Engineers, Governance Teams, ML Engineers
The catalog sits between raw data sources and consumers, providing a governed metadata layer that makes data discoverable and trustworthy.
Core Components
A data catalog is not a single feature. It is six capabilities working together. Interviewers expect you to name these and explain how they interact.
Metadata Ingestion
Automated crawlers connect to warehouses, lakes, BI tools, and orchestrators to harvest technical metadata. Schema, column types, row counts, freshness, and partition info are pulled on a schedule or via event-driven hooks. The best catalogs also capture query logs to build usage and popularity signals.
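A scheduled crawler can be sketched in a few lines. The example below assumes an in-memory SQLite database as a stand-in for a warehouse, and the function name `crawl_sqlite` and the output fields are hypothetical; a real connector would read the source's own information schema or metadata API instead.

```python
import sqlite3
from datetime import datetime, timezone


def crawl_sqlite(conn: sqlite3.Connection) -> list[dict]:
    """Harvest table names, column types, and row counts from one source."""
    harvested = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = {
            row[1]: row[2]
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        }
        (row_count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        harvested.append({
            "table": table,
            "columns": columns,
            "row_count": row_count,
            "crawled_at": datetime.now(timezone.utc).isoformat(),
        })
    return harvested


# Stand-in source with one table and two rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, customer_email TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 9.5, "a@example.com"), (2, 20.0, None)],
)
catalog_rows = crawl_sqlite(conn)
```

Running the same function on a schedule and storing each result gives the catalog a history it can use for freshness and row count trends.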
Search and Discovery
Full-text search across table names, column names, descriptions, tags, and even sample values. Relevance ranking uses popularity (query frequency), freshness, and ownership signals. Think of it as Google for your data warehouse: analysts type a keyword and find the trusted, documented dataset in seconds.
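The ranking idea can be illustrated with a toy scorer. This is a minimal sketch, not how any specific catalog ranks results: the weights, the `ASSETS` sample data, and the field names are all assumptions, and freshness is left out to keep the example short.

```python
import math

# Hypothetical catalog assets with usage and ownership signals.
ASSETS = [
    {"name": "analytics.customer_churn", "description": "Monthly churn by segment",
     "tags": ["customer", "certified"], "queries_30d": 420, "has_owner": True},
    {"name": "tmp.churn_test_v2", "description": "",
     "tags": [], "queries_30d": 3, "has_owner": False},
    {"name": "finance.revenue_daily", "description": "Daily recognized revenue",
     "tags": ["finance"], "queries_30d": 900, "has_owner": True},
]


def score(asset: dict, keyword: str) -> float:
    """Combine a text match with popularity and ownership signals."""
    kw = keyword.lower()
    text_match = 0.0
    if kw in asset["name"].lower():
        text_match += 2.0
    if kw in asset["description"].lower():
        text_match += 1.0
    if any(kw in tag.lower() for tag in asset["tags"]):
        text_match += 1.0
    if text_match == 0:
        return 0.0  # popularity alone never surfaces an unrelated asset
    popularity = math.log1p(asset["queries_30d"])  # dampen very hot tables
    ownership = 1.0 if asset["has_owner"] else 0.0
    return text_match + popularity + ownership


def search(keyword: str) -> list[str]:
    scored = [(score(asset, keyword), asset["name"]) for asset in ASSETS]
    return [name for s, name in sorted(scored, reverse=True) if s > 0]


results = search("churn")
# ['analytics.customer_churn', 'tmp.churn_test_v2']
```

The documented, owned, frequently queried table outranks the scratch table even though both names contain the keyword, which is the behavior analysts expect from a catalog search.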
Data Quality Scores
Quality metrics are surfaced directly in the catalog: freshness (when was this table last updated?), completeness (null rate per column), volume (row count trends), and schema drift. Some catalogs integrate with tools like Great Expectations or Monte Carlo to pull quality check results into each asset page.
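Two of these metrics, completeness and freshness, are simple enough to compute directly. The sketch below uses hypothetical helper names (`null_rate`, `is_fresh`) and hard-coded sample rows; in practice the numbers would come from profiling queries or from a quality tool's results.

```python
from datetime import datetime, timedelta, timezone


def null_rate(rows: list[dict], column: str) -> float:
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def is_fresh(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the table was updated within the allowed window."""
    return now - last_updated <= max_age


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 4, "email": None},
]
now = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
last_updated = datetime(2026, 1, 15, 6, 0, tzinfo=timezone.utc)

# The kind of summary a catalog could show on an asset page.
report = {
    "completeness": {col: 1 - null_rate(rows, col) for col in ("id", "email")},
    "fresh": is_fresh(last_updated, now, max_age=timedelta(hours=24)),
    "row_count": len(rows),
}
```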
Ownership and Stewardship
Every dataset has a designated owner (usually a team or individual). Owners are responsible for descriptions, quality SLAs, and responding to questions. Stewardship workflows let consumers request access, flag issues, or propose corrections. Without clear ownership, catalogs decay into graveyards of undocumented tables.
Lineage Integration
Catalogs display upstream and downstream dependencies for each asset. This is not lineage in isolation; it is lineage in context. When you find a dataset in the catalog, you immediately see where it came from, what transforms built it, and what dashboards depend on it.
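Upstream and downstream views are both traversals of the same dependency graph. The following sketch assumes lineage is already available as a simple adjacency map; the asset names in `EDGES` are made up for illustration.

```python
from collections import deque

# Each key is an asset; the value lists its direct downstream assets.
EDGES = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.revenue"],
    "staging.customers": ["marts.revenue", "marts.churn"],
    "marts.revenue": ["dashboard.exec_kpis"],
}


def downstream(asset: str, edges: dict[str, list[str]]) -> set[str]:
    """Breadth-first walk to everything that depends on the asset."""
    seen: set[str] = set()
    queue = deque([asset])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(asset: str, edges: dict[str, list[str]]) -> set[str]:
    """Reverse the edges, then reuse the same traversal."""
    reverse: dict[str, list[str]] = {}
    for parent, children in edges.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return downstream(asset, reverse)


impact = downstream("staging.customers", EDGES)
sources = upstream("marts.revenue", EDGES)
```

`impact` answers "what breaks if I change this table?", while `sources` answers "where did this number come from?".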
Access Control and Compliance
Tag-based policies tie data classification (PII, HIPAA, internal) to access rules. When a column is tagged as PII, the catalog can enforce masking or restrict access to approved roles. Compliance teams use the catalog to audit who accessed what and whether PII handling meets regulatory requirements.
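Tag-based enforcement can be sketched as a lookup from column to tags and from tag to allowed roles. Everything here is hypothetical (the tag names, the roles, the `***MASKED***` placeholder); real platforms enforce this in the query engine rather than in application code.

```python
# Classification tags attached to columns.
COLUMN_TAGS = {"email": {"PII"}, "diagnosis": {"HIPAA"}, "plan": {"internal"}}

# For each tag, the roles allowed to see unmasked values.
POLICIES = {
    "PII": {"privacy_officer"},
    "HIPAA": {"clinical_analyst"},
    "internal": {"analyst", "privacy_officer", "clinical_analyst"},
}


def apply_policies(row: dict, role: str) -> dict:
    """Mask every column whose tags the role is not cleared for."""
    result = {}
    for column, value in row.items():
        tags = COLUMN_TAGS.get(column, set())
        # Untagged columns pass; a tag with no policy denies everyone.
        allowed = all(role in POLICIES.get(tag, set()) for tag in tags)
        result[column] = value if allowed else "***MASKED***"
    return result


row = {"id": 7, "email": "a@example.com", "diagnosis": "J45", "plan": "pro"}
analyst_view = apply_policies(row, "analyst")
officer_view = apply_policies(row, "privacy_officer")
```

Because the policy hangs off the tag rather than the column, tagging a new column as PII is enough to bring it under the same rule.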
Real Catalog Systems and Tools
Interviewers expect you to know the ecosystem. Name the tool, but more importantly, explain where it fits and what trade-offs it makes.
DataHub
Metadata platform originally built at LinkedIn, now maintained by Acryl Data. Supports ingestion from 50+ sources, GraphQL API, column-level lineage, and fine-grained access policies. One of the most widely adopted open-source catalogs.
Apache Atlas
Metadata governance for the Hadoop ecosystem. Strong type system, classification propagation, and glossary support. Common in legacy on-prem environments with Hive, HBase, and Kafka.
Collibra
Enterprise data intelligence platform. Combines catalog, glossary, data quality, and policy management. Workflow engine for stewardship and access requests. Targeted at Fortune 500 organizations with complex governance needs.
Alation
Pioneered the data catalog category. Strong in search, curation, and BI integration. Behavioral analysis tracks query patterns to surface popular and trusted datasets. Used heavily in analytics-driven organizations.
Atlan
Cloud-native active metadata platform. Embedded collaboration (Slack-like threads on datasets), automated lineage, and persona-based UI for analysts, engineers, and governance teams. Growing fast in modern data stacks.
Unity Catalog
Databricks native governance layer. Manages tables, volumes, models, and functions across workspaces. Fine-grained ACLs, automatic lineage, and AI/ML asset governance. Tight integration with Delta Lake and MLflow.
AWS Glue Data Catalog
Managed, Hive metastore-compatible catalog for AWS. Stores table definitions for Athena, Redshift Spectrum, EMR, and Lake Formation. Crawlers auto-discover schemas from S3. The default catalog for AWS-native data lakes.
Google Cloud Data Catalog
Fully managed metadata service for BigQuery, Pub/Sub, Cloud Storage, and Dataproc. Tag templates for custom metadata, policy tags for column-level security, and integration with Data Lineage API.
Data Catalog vs Data Dictionary vs Data Lineage
Interviewers frequently test whether you can distinguish these three concepts. They overlap but serve fundamentally different purposes.
Data Catalog
- Centralized inventory of all data assets across the organization
- Search, discovery, and browsing by keyword, tag, or domain
- Metadata from automated ingestion plus human curation
- Ownership, quality scores, usage popularity, and access policies
- Scope: the entire data platform as a searchable product
Data Dictionary
- Detailed definitions for columns and fields within a specific dataset
- Business meaning, allowed values, data types, and constraints
- Typically maintained as documentation or embedded in schema comments
- Static, often per-table or per-database, no cross-platform scope
- Scope: one dataset described in depth
Data Lineage
- Tracks where data comes from, how it transforms, and where it goes
- Upstream and downstream dependency graphs at table or column level
- Captured by parsing SQL, instrumenting orchestrators, or runtime hooks
- Enables impact analysis, debugging, and compliance auditing
- Scope: the flow of data through pipelines and transformations
How Interviewers Test This
Interviewers rarely ask “define data catalog.” They create scenarios that require you to apply catalog thinking to real organizational problems.
“Analysts spend hours searching for the right table. How would you fix this?”
Deploy a data catalog with automated metadata ingestion, quality signals, and ownership. The catalog becomes the entry point for data discovery. Candidates should describe ingestion connectors, search UX, curation workflows, and how to bootstrap adoption across teams.
“An auditor asks you to list every system that stores customer email addresses. How do you answer?”
Use the catalog's classification system: tag columns containing PII, propagate tags through lineage, and query the catalog for all assets tagged 'email' or 'PII-direct'. Candidates should mention automated classification (regex, ML-based) and manual review workflows.
“We are moving to a data mesh. How do domains publish and discover each other's data products?”
The catalog serves as the marketplace. Each domain registers data products with schema, SLAs, quality scores, and ownership. Consumers search the catalog, request access, and depend on published contracts. Lineage tracks cross-domain dependencies.
“Two teams have tables called 'revenue' with different numbers. How do you resolve this?”
Use the catalog to surface both tables, compare their lineage and definitions, and designate one as the certified or golden dataset. Add quality badges, deprecation labels, and redirect consumers to the trusted source. Governance workflow enforces the decision.
“We bought a catalog tool six months ago and nobody uses it. What went wrong?”
Catalogs fail without curation, ownership, and integration into daily workflows. The fix: automate metadata ingestion (no manual entry), embed catalog links in BI tools and query editors, assign owners per domain, and gamify curation with completeness dashboards. Adoption is a product problem, not a tooling problem.
Interview Questions with Guidance
Ten questions that appear across system design, governance, and platform design rounds.
What is a data catalog and why does it matter?
A data catalog is a centralized metadata store that enables discovery, governance, and trust across a data platform. It matters because without it, analysts waste hours asking 'does this table exist?' and 'can I trust this number?' The catalog answers both by combining technical metadata, business context, quality scores, and ownership in one searchable interface.
How does a data catalog differ from a data dictionary?
A dictionary defines columns within a single dataset (types, meanings, constraints). A catalog spans the entire platform: all datasets, all sources, with search, lineage, quality, and access control. A dictionary is a chapter; a catalog is the library. Most catalogs contain dictionary information as a subset.
What metadata does a catalog store?
Technical metadata: schema, column types, row counts, freshness, partitions, storage format. Business metadata: descriptions, domain tags, PII classifications, glossary terms. Operational metadata: query frequency, last queried by, pipeline dependencies, quality check results. Social metadata: ratings, comments, questions from consumers.
How would you design metadata ingestion for a catalog?
Connector-based architecture: each source (warehouse, lake, BI tool, orchestrator) has a crawler or push integration. Crawlers run on schedule or trigger on schema changes. Ingested metadata is normalized into a common model (datasets, fields, tags, owners). Important: ingestion must handle schema evolution, deleted assets, and freshness tracking.
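Handling schema evolution and deleted assets usually comes down to comparing the latest crawl against the previous one. The sketch below assumes each crawl yields a snapshot that maps dataset name to a column-to-type dict; `diff_snapshots` and the sample snapshots are hypothetical.

```python
def diff_snapshots(previous: dict, current: dict) -> dict:
    """Compare two crawls; each maps dataset name -> {column: type}."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    changed = {}
    for name in sorted(set(previous) & set(current)):
        old, new = previous[name], current[name]
        if old != new:
            changed[name] = {
                "added_columns": sorted(set(new) - set(old)),
                "dropped_columns": sorted(set(old) - set(new)),
                "retyped_columns": sorted(
                    c for c in set(old) & set(new) if old[c] != new[c]
                ),
            }
    return {"added": added, "removed": removed, "changed": changed}


previous = {
    "orders": {"id": "INTEGER", "amount": "REAL"},
    "legacy_events": {"id": "INTEGER"},
}
current = {
    "orders": {"id": "BIGINT", "amount": "REAL", "currency": "TEXT"},
    "customers": {"id": "INTEGER"},
}
changes = diff_snapshots(previous, current)
```

A catalog would typically mark `removed` datasets as deleted or deprecated rather than dropping their entries, so that lineage and history stay intact.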
What is active metadata and why does it matter?
Active metadata goes beyond passive documentation. It is metadata that drives automation: triggering alerts when freshness SLAs breach, auto-classifying PII columns, recommending datasets based on query patterns, and propagating governance tags through lineage. Active metadata turns the catalog from a reference manual into an operational control plane.
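The freshness case shows the difference between storing metadata and acting on it. In this sketch the asset records, the per-asset `sla` field, and the shape of the emitted action are assumptions; a real system would hand the action to a paging or chat integration.

```python
from datetime import datetime, timedelta, timezone


def evaluate_freshness(assets: list[dict], now: datetime) -> list[dict]:
    """Turn stored metadata into actions: one alert per breached SLA."""
    actions = []
    for asset in assets:
        age = now - asset["last_updated"]
        if age > asset["sla"]:
            late = age - asset["sla"]
            actions.append({
                "action": "alert",
                "asset": asset["name"],
                "notify": asset["owner"],
                "hours_late": round(late.total_seconds() / 3600, 1),
            })
    return actions


now = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
assets = [
    {"name": "marts.orders", "owner": "team-commerce",
     "last_updated": now - timedelta(hours=30), "sla": timedelta(hours=24)},
    {"name": "marts.customers", "owner": "team-growth",
     "last_updated": now - timedelta(hours=2), "sla": timedelta(hours=24)},
]
alerts = evaluate_freshness(assets, now)
```

The alert is routed using the ownership metadata the catalog already holds, which is what makes the metadata "active" rather than descriptive.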
How do you handle PII classification in a catalog?
Layer automated and manual approaches. Automated: regex patterns (email, SSN, phone), ML classifiers for freetext fields, and tag propagation through lineage (if source column is PII, downstream columns inherit the tag). Manual: data stewards review and confirm classifications. Tie PII tags to access policies so classification has enforcement teeth.
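The automated half of this approach, pattern matching on sample values followed by tag propagation, can be sketched as below. The regexes are deliberately simple, and the 0.8 match threshold, the tag names, and the column-level `lineage` map are illustrative assumptions; results like these would still go to a steward for confirmation.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def classify_column(samples: list, threshold: float = 0.8):
    """Return a PII tag when most non-null samples match a pattern."""
    values = [s for s in samples if s is not None]
    if not values:
        return None
    for tag, pattern in (("PII:email", EMAIL_RE), ("PII:ssn", SSN_RE)):
        hits = sum(1 for v in values if pattern.match(str(v)))
        if hits / len(values) >= threshold:
            return tag
    return None


def propagate(tags: dict, column_lineage: dict) -> dict:
    """Copy each tag to every column derived from a tagged column."""
    result = dict(tags)
    stack = list(tags)
    while stack:
        source = stack.pop()
        for target in column_lineage.get(source, []):
            if target not in result:
                result[target] = result[source]
                stack.append(target)
    return result


samples = {
    "raw.users.email": ["a@example.com", "b@example.org", None],
    "raw.users.ssn": ["123-45-6789", "987-65-4321"],
    "raw.users.plan": ["free", "pro"],
}
detected = {
    col: tag for col, vals in samples.items() if (tag := classify_column(vals))
}
lineage = {
    "raw.users.email": ["staging.users.email"],
    "staging.users.email": ["marts.contacts.email_address"],
}
all_tags = propagate(detected, lineage)
```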
How do you measure catalog adoption?
Key metrics: search volume (are people using it?), curation coverage (percentage of tables with descriptions and owners), time-to-discovery (how fast analysts find data), and consumer satisfaction (survey or NPS). Also track stale metadata: tables with no queries in 90 days that still show as 'active' indicate catalog decay.
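Curation coverage and stale-asset detection are straightforward to compute from catalog records. The record fields below (`description`, `owner`, `status`, `days_since_last_query`) are hypothetical; the 90-day window matches the one mentioned above.

```python
def curation_coverage(assets: list[dict]) -> float:
    """Share of assets that have both a description and an owner."""
    if not assets:
        return 0.0
    curated = sum(1 for a in assets if a["description"] and a["owner"])
    return curated / len(assets)


def stale_assets(assets: list[dict], max_idle_days: int = 90) -> list[str]:
    """Assets still marked active that nobody has queried recently."""
    return [
        a["name"]
        for a in assets
        if a["status"] == "active" and a["days_since_last_query"] > max_idle_days
    ]


assets = [
    {"name": "marts.revenue", "description": "Recognized revenue", "owner": "finance",
     "status": "active", "days_since_last_query": 1},
    {"name": "tmp.backfill_2023", "description": "", "owner": None,
     "status": "active", "days_since_last_query": 200},
    {"name": "marts.churn", "description": "Monthly churn", "owner": "growth",
     "status": "active", "days_since_last_query": 12},
    {"name": "old.events", "description": "Legacy events", "owner": None,
     "status": "deprecated", "days_since_last_query": 400},
]
coverage = curation_coverage(assets)
stale = stale_assets(assets)
```

Tracking both numbers over time shows whether curation is keeping pace with the number of new tables.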
How does a catalog support data governance?
The catalog is the governance hub. It enforces ownership (every asset has an accountable team), classification (PII and sensitivity labels), access control (policy tags drive permissions), quality (SLA thresholds with alerting), and lifecycle management (deprecation workflows). Without a catalog, governance is policy documents that nobody follows.
How would you evaluate data catalog tools?
Key criteria: ingestion breadth (number of source connectors), search quality (relevance ranking, filters), lineage depth (table vs column level), governance features (classification, access policies), extensibility (APIs, custom metadata), and adoption friction (SSO, embedded integrations, UX quality). Also consider open source vs SaaS vs cloud-native trade-offs.
How does a data catalog fit into a modern data stack?
The catalog sits at the metadata layer, connecting warehouse (Snowflake, BigQuery), transformation (dbt), orchestration (Airflow), and BI (Looker, Tableau). It ingests metadata from all layers and provides a unified view. In a mature stack, the catalog is the entry point: analysts start in the catalog, find the dataset, then open it in their query tool.
Common Interview Mistakes
Treating the catalog as a one-time documentation project
Catalogs require continuous automation. Manual documentation goes stale within weeks. Automate ingestion, enforce ownership, and build curation into team workflows.
Confusing a data catalog with a data dictionary
A dictionary describes one dataset in depth. A catalog is a platform-wide inventory with search, lineage, governance, and access control. Interviewers will test whether you understand the difference in scope.
Naming catalog tools without explaining the architecture
Saying 'we use Collibra' is not an answer. Explain: connectors ingest metadata, a store normalizes it, search indexes it, lineage provides context, and policies enforce governance. A tool is one implementation of that architecture.
Ignoring the adoption problem
Most catalog failures are adoption failures, not technology failures. If the catalog is not embedded in daily workflows (query editors, BI tools, Slack), people will not use it. Treat the catalog as a product with users, not infrastructure with admins.
Skipping data quality integration
A catalog without quality signals is a table of contents with no reviews. Consumers need to know: is this table fresh? Are nulls within tolerance? Did the last pipeline run succeed? Quality context is what makes catalog entries trustworthy.
Frequently Asked Questions
What is a data catalog?
What are the best data catalog tools?
Do interviews ask about data catalogs?
How is a data catalog different from a data warehouse?
What is metadata ingestion?
How does a data catalog help with GDPR?
What is active metadata?
Can you build a data catalog or should you buy one?