Data Catalog: Interview Questions & Tools Guide (2026)
A data catalog is a centralized inventory of every dataset in your organization. It stores metadata (schemas, owners, quality scores, lineage, usage stats) and provides search so engineers and analysts can find, understand, and trust data without asking around on Slack.
What this guide actually says
A data catalog answers two questions for every person on your data team: what data do we have, and can I trust it. Modern catalogs are not just search engines. They are governance control planes. Without active metadata (alerts, auto-classification, policy enforcement), the catalog becomes a wiki that goes stale. Without integration into daily workflows, it becomes shelfware. Adoption is the failure mode 80% of catalog projects hit.
What is a data catalog
Three things a catalog actually does. Interviewers expect you to name all three.
The core idea
A data catalog is a centralized inventory of every dataset in your organization. It stores metadata: schemas, descriptions, ownership, quality scores, usage stats, and tags. Think of it as a search engine for your data platform. When an analyst asks 'do we have customer churn data?', the catalog should answer in seconds, not Slack threads.
Metadata as first-class asset
Catalogs treat metadata as a product. Technical metadata (column types, table sizes, freshness) is collected automatically. Business metadata (descriptions, domain tags, PII classifications) is curated by data owners. The catalog connects both layers so consumers see technical context and business meaning in one place.
Governance built in
Modern catalogs aren't just search tools. They enforce governance: who owns each dataset, who can access it, what PII it contains, and whether its quality meets SLA thresholds. A catalog without governance is just a wiki that goes stale. A catalog with governance becomes the control plane for your data platform.
Core components
A catalog isn't a single feature. It's six capabilities working together. Interviewers expect you to name these and explain how they interact.
Metadata Ingestion
Automated crawlers connect to warehouses, lakes, BI tools, and orchestrators to harvest technical metadata. Schema, column types, row counts, freshness, and partition info are pulled on a schedule or via event-driven hooks. The best catalogs also capture query logs to build usage and popularity signals.
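A minimal sketch of what a crawler does, assuming an in-memory SQLite database stands in for the warehouse. The function name `harvest_tables` and the record layout are illustrative assumptions, not any particular catalog's API.

```python
import sqlite3


def harvest_tables(conn):
    """Return one metadata record per table: columns, types, and row count."""
    records = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [
            {"name": row[1], "type": row[2]}
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        records.append({"table": table, "columns": columns, "row_count": row_count})
    return records


# Hypothetical source system with a single table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
harvested = harvest_tables(conn)
```

A real connector would run this on a schedule or on schema-change events and write the records into the catalog's metadata store.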
Search and Discovery
Full-text search across table names, column names, descriptions, tags, even sample values. Relevance ranking uses popularity (query frequency), freshness, ownership signals. Think of it as Google for your warehouse: analysts type a keyword and find the trusted, documented dataset in seconds.
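One way to picture relevance ranking is a text match boosted by popularity, freshness, and ownership. The sketch below is an assumption about how such signals could be combined. The weights and the field names (`queries_30d`, `days_since_update`) are hypothetical.

```python
import math


def score(asset, query):
    """Text match scaled by popularity, freshness, and ownership signals."""
    terms = query.lower().split()
    text = " ".join(
        [asset["name"], asset.get("description", ""), " ".join(asset.get("tags", []))]
    ).lower()
    matched = sum(1 for term in terms if term in text)
    if matched == 0:
        return 0.0
    text_score = matched / len(terms)
    popularity = math.log1p(asset.get("queries_30d", 0))  # dampen heavy hitters
    freshness = 1.0 / (1.0 + asset.get("days_since_update", 0))
    owned = 1.0 if asset.get("owner") else 0.0
    return text_score * (1.0 + popularity + freshness + owned)


def search(assets, query):
    """Return matching asset names, best match first."""
    scored = [(score(asset, query), asset["name"]) for asset in assets]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [name for value, name in ranked if value > 0]
```

With this shape, a frequently queried, owned, recently refreshed table outranks an abandoned scratch table that matches the same keyword.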
Data Quality Scores
Quality metrics surfaced directly in the catalog: freshness (when last updated?), completeness (null rate per column), volume (row count trends), schema drift. Some catalogs integrate with Great Expectations or Monte Carlo to pull quality check results into each asset page.
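Two of these metrics are simple enough to sketch directly. The helpers below are illustrative assumptions. They compute completeness as a per-column null rate and volume as the relative change in row count between two snapshots.

```python
def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def volume_change(previous_count, current_count):
    """Relative change in row count; None when there is no baseline."""
    if previous_count == 0:
        return None
    return (current_count - previous_count) / previous_count
```

A catalog would typically store these values per run and flag an asset when the null rate or the volume swing crosses a threshold agreed with the owner.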
Ownership and Stewardship
Every dataset has a designated owner (usually a team or individual). Owners are responsible for descriptions, quality SLAs, and responding to questions. Stewardship workflows let consumers request access, flag issues, or propose corrections. Without clear ownership, catalogs decay into graveyards of undocumented tables.
Lineage Integration
Catalogs display upstream and downstream dependencies for each asset. Not lineage in isolation; lineage in context. When you find a dataset in the catalog, you immediately see where it came from, what transforms built it, and what dashboards depend on it.
Access Control and Compliance
Tag-based policies tie data classification (PII, HIPAA, internal) to access rules. When a column is tagged as PII, the catalog can enforce masking or restrict access to approved roles. Compliance teams use the catalog to audit who accessed what and whether PII handling meets regulatory requirements.
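The idea of tags driving enforcement can be sketched as a lookup from tag to allowed roles. The policy table, role names, and mask string here are hypothetical. Real platforms enforce this in the query engine rather than in application code.

```python
MASK = "***MASKED***"

# Hypothetical policy: which roles may read columns carrying each tag.
POLICIES = {"PII": {"privacy_officer", "support_lead"}}


def can_read(role, tags, policies):
    """A role may read a column only if every restricted tag allows it."""
    return all(role in policies[tag] for tag in tags if tag in policies)


def apply_masking(row, column_tags, role, policies=POLICIES):
    """Return a copy of the row with restricted columns masked for this role."""
    return {
        column: value if can_read(role, column_tags.get(column, set()), policies) else MASK
        for column, value in row.items()
    }
```

Tags without an entry in the policy table are treated as unrestricted in this sketch. A stricter design would deny by default.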
Real catalog systems and tools
Know the ecosystem. Name the tool, but more importantly, explain where it fits and what trade-offs it makes.
DataHub
Metadata platform from LinkedIn, now maintained by Acryl Data. Supports ingestion from 50+ sources, GraphQL API, column-level lineage, fine-grained access policies. One of the most widely adopted open-source catalogs.
Apache Atlas
Metadata governance for the Hadoop ecosystem. Strong type system, classification propagation, glossary support. Common in legacy on-prem environments with Hive, HBase, Kafka.
Collibra
Enterprise data intelligence platform. Combines catalog, glossary, data quality, policy management. Workflow engine for stewardship and access requests. Targeted at Fortune 500 with complex governance needs.
Alation
Pioneered the data catalog category. Strong in search, curation, BI integration. Behavioral analysis tracks query patterns to surface popular and trusted datasets. Used heavily in analytics-driven organizations.
Atlan
Cloud-native active metadata platform. Embedded collaboration (Slack-like threads on datasets), automated lineage, persona-based UI for analysts, engineers, governance. Growing fast in modern data stacks.
Unity Catalog
Databricks' native governance layer. Manages tables, volumes, models, functions across workspaces. Fine-grained ACLs, automatic lineage, AI/ML asset governance. Tight integration with Delta Lake and MLflow.
AWS Glue Data Catalog
Managed, Hive metastore-compatible catalog for AWS. Stores table definitions for Athena, Redshift Spectrum, EMR, Lake Formation. Crawlers auto-discover schemas from S3. Default catalog for AWS-native data lakes.
Google Data Catalog
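Fully managed metadata service for BigQuery, Pub/Sub, Cloud Storage, Dataproc. Tag templates for custom metadata, policy tags for column-level security, integration with Data Lineage API. Google now positions Dataplex Universal Catalog as its successor, so check current status before naming it as a live option.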
Data catalog vs data dictionary vs data lineage
Interviewers frequently test whether you can distinguish these three. They overlap but serve fundamentally different purposes.
| Dimension | Data Catalog | Data Dictionary | Data Lineage |
|---|---|---|---|
| Scope | Entire data platform as a searchable product | One dataset described in depth | Flow of data through pipelines and transformations |
| Primary content | Inventory of all assets with descriptions, ownership, quality, tags | Definitions for columns and fields within a specific dataset | Upstream/downstream dependency graphs at table or column level |
| Sources | Automated ingestion plus human curation | Documentation or embedded in schema comments | Parsing SQL, instrumenting orchestrators, runtime hooks |
| Use cases | Search, governance, access policies, quality signals | Understanding what a column means and what values are allowed | Impact analysis, debugging, compliance auditing |
How interviewers test this
Scenarios that require you to apply catalog thinking to real organizational problems.
Analysts spend hours searching for the right table. How do you fix this?
Deploy a catalog with automated metadata ingestion, quality signals, ownership. The catalog becomes the entry point for data discovery. Describe ingestion connectors, search UX, curation workflows, and how to bootstrap adoption across teams.
An auditor asks you to list every system that stores customer email addresses.
Use the catalog's classification system: tag columns containing PII, propagate tags through lineage, query the catalog for all assets tagged 'email' or 'PII-direct'. Mention automated classification (regex, ML-based) and manual review workflows.
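Tag propagation through lineage is a graph traversal. The sketch below assumes column-level lineage is available as a mapping from each column to its direct downstream columns. The column identifiers are made up for illustration.

```python
from collections import deque


def propagate_tag(lineage, tagged):
    """Return every column reachable downstream from the tagged columns."""
    seen = set(tagged)
    queue = deque(tagged)
    while queue:
        column = queue.popleft()
        for child in lineage.get(column, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage: system.table.column -> direct downstream columns.
lineage = {
    "crm.users.email": ["dw.dim_customer.email"],
    "dw.dim_customer.email": ["mart.churn.contact_email"],
    "crm.users.id": ["dw.dim_customer.id"],
}
email_columns = propagate_tag(lineage, {"crm.users.email"})
systems = sorted({column.split(".")[0] for column in email_columns})
```

The `systems` list is the answer the auditor is asking for, limited to what the lineage graph actually captures. Copies made outside tracked pipelines still need manual review.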
We're moving to a data mesh. How do domains publish and discover each other's data products?
The catalog serves as the marketplace. Each domain registers data products with schema, SLAs, quality scores, ownership. Consumers search the catalog, request access, depend on published contracts. Lineage tracks cross-domain dependencies.
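As a rough sketch of the marketplace idea, a registry can refuse data products that lack an owner or a published schema and let consumers search what remains. The class and field names below are hypothetical and do not describe any specific tool.

```python
from dataclasses import dataclass


@dataclass
class DataProduct:
    name: str
    domain: str
    owner: str
    schema: dict  # column name -> type
    freshness_sla_hours: int


class Marketplace:
    def __init__(self):
        self._products = {}

    def register(self, product):
        """Publish a product; reject entries that break the minimum contract."""
        if not product.owner or not product.schema:
            raise ValueError("a data product needs an owner and a published schema")
        self._products[f"{product.domain}.{product.name}"] = product

    def search(self, keyword):
        """Match the keyword against product keys and column names."""
        needle = keyword.lower()
        return sorted(
            key
            for key, product in self._products.items()
            if needle in key.lower()
            or any(needle in column.lower() for column in product.schema)
        )
```

Rejecting incomplete registrations at publish time is one way to keep ownership and contracts from becoming optional.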
Two teams have tables called 'revenue' with different numbers. How do you resolve this?
Use the catalog to surface both tables, compare lineage and definitions, designate one as the certified or golden dataset. Add quality badges, deprecation labels, redirect consumers to the trusted source. Governance workflow enforces the decision.
We bought a catalog tool six months ago and nobody uses it. What went wrong?
Catalogs fail without curation, ownership, and integration into daily workflows. Fix: automate metadata ingestion (no manual entry), embed catalog links in BI tools and query editors, assign owners per domain, gamify curation with completeness dashboards. Adoption is a product problem, not a tooling problem.
Interview questions with guidance
Ten questions across system design, governance, and platform design rounds.
What is a data catalog and why does it matter?
A centralized metadata store enabling discovery, governance, and trust across a data platform. Matters because without it, analysts waste hours asking 'does this table exist?' and 'can I trust this number?' The catalog answers both by combining technical metadata, business context, quality scores, and ownership in one searchable interface.
How does a data catalog differ from a data dictionary?
A dictionary defines columns within a single dataset (types, meanings, constraints). A catalog spans the entire platform: all datasets, all sources, with search, lineage, quality, access control. A dictionary is a chapter; a catalog is the library.
What metadata does a catalog store?
Technical: schema, column types, row counts, freshness, partitions, storage format. Business: descriptions, domain tags, PII classifications, glossary terms. Operational: query frequency, last queried by, pipeline dependencies, quality check results. Social: ratings, comments, questions from consumers.
How would you design metadata ingestion for a catalog?
Connector-based architecture: each source has a crawler or push integration. Crawlers run on schedule or trigger on schema changes. Ingested metadata normalized into a common model (datasets, fields, tags, owners). Must handle schema evolution, deleted assets, freshness tracking.
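Handling schema evolution and deleted assets usually comes down to comparing the previous crawl with the current one. A minimal sketch, assuming each snapshot is a mapping from dataset name to its column types:

```python
def reconcile(previous, current):
    """Diff two crawl snapshots of the form {dataset: {column: type}}."""
    added = sorted(set(current) - set(previous))
    deleted = sorted(set(previous) - set(current))
    schema_changed = sorted(
        name
        for name in set(previous) & set(current)
        if previous[name] != current[name]
    )
    return {"added": added, "deleted": deleted, "schema_changed": schema_changed}
```

Deleted datasets are often better marked as removed than erased, so that lineage and audit history survive. That choice is a design assumption here, not something the diff itself enforces.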
What is active metadata and why does it matter?
Goes beyond passive documentation. Metadata that drives automation: triggering alerts when freshness SLAs breach, auto-classifying PII columns, recommending datasets based on query patterns, propagating governance tags through lineage. Turns the catalog from a reference manual into an operational control plane.
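A freshness SLA check is a small example of metadata driving an action. The sketch below assumes each asset record carries `last_updated`, `freshness_sla`, and `owner` fields. Those names are illustrative.

```python
from datetime import datetime, timedelta


def freshness_breaches(assets, now):
    """Return one alert per asset whose age exceeds its freshness SLA."""
    alerts = []
    for asset in assets:
        age = now - asset["last_updated"]
        if age > asset["freshness_sla"]:
            overdue = age - asset["freshness_sla"]
            alerts.append(
                {
                    "asset": asset["name"],
                    "owner": asset["owner"],
                    "hours_late": round(overdue.total_seconds() / 3600, 1),
                }
            )
    return alerts
```

In practice the alerts would be routed to the owning team's channel or paging tool, which is what separates active metadata from a stale "last updated" field on a page.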
How do you handle PII classification in a catalog?
Layer automated and manual. Automated: regex patterns (email, SSN, phone), ML classifiers for free text, tag propagation through lineage. Manual: data stewards review and confirm. Tie PII tags to access policies so classification has enforcement teeth.
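The regex layer can be sketched as pattern matching over sampled column values, with a threshold so that a few stray matches do not tag a column. The patterns and the 0.8 threshold are simplified assumptions. Production classifiers use stricter patterns and send suggestions to stewards for confirmation.

```python
import re

# Deliberately simple patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(samples, threshold=0.8):
    """Suggest a PII label when enough non-empty string samples match a pattern."""
    values = [value for value in samples if value]
    if not values:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for value in values if pattern.match(value))
        if hits / len(values) >= threshold:
            return label
    return None
```

The return value is best treated as a suggested tag that a steward confirms, which matches the automated-plus-manual layering described above.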
How do you measure catalog adoption?
Search volume (are people using it?), curation coverage (% of tables with descriptions and owners), time-to-discovery, consumer satisfaction. Also track stale metadata: tables with no queries in 90 days that still show as 'active' indicate catalog decay.
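Two of these metrics can be computed straight from catalog records. The sketch assumes each table record has `description`, `owner`, `status`, and `days_since_last_query` fields, which are hypothetical names.

```python
def curation_coverage(tables):
    """Share of tables that have both a description and an owner."""
    if not tables:
        return 0.0
    curated = sum(1 for t in tables if t.get("description") and t.get("owner"))
    return curated / len(tables)


def stale_but_active(tables, max_idle_days=90):
    """Tables still marked active despite no queries within the idle window."""
    return [
        t["name"]
        for t in tables
        if t["status"] == "active" and t["days_since_last_query"] > max_idle_days
    ]
```

Tracking both numbers over time shows whether curation is keeping pace with table growth or whether the catalog is drifting toward decay.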
How does a catalog support data governance?
The catalog is the governance hub. Enforces ownership (every asset has an accountable team), classification (PII labels), access control (policy tags drive permissions), quality (SLA thresholds with alerting), lifecycle (deprecation workflows). Without a catalog, governance is policy documents nobody follows.
How would you evaluate data catalog tools?
Ingestion breadth (source connectors), search quality (relevance ranking, filters), lineage depth (table vs column level), governance features (classification, access policies), extensibility (APIs, custom metadata), adoption friction (SSO, embedded integrations, UX). Consider open source vs SaaS vs cloud-native trade-offs.
How does a catalog fit into a modern data stack?
Sits at the metadata layer, connecting warehouse (Snowflake, BigQuery), transformation (dbt), orchestration (Airflow), BI (Looker, Tableau). Ingests metadata from all layers and provides a unified view. In a mature stack, the catalog is the entry point: analysts start there, find the dataset, then open it in their query tool.
Common interview mistakes
Treating the catalog as a one-time documentation project
Catalogs require continuous automation. Manual documentation goes stale within weeks. Automate ingestion, enforce ownership, build curation into team workflows.
Confusing a catalog with a data dictionary
A dictionary describes one dataset in depth. A catalog is a platform-wide inventory with search, lineage, governance, access control. Interviewers will test whether you understand the difference in scope.
Naming catalog tools without explaining the architecture
'We use Collibra' is not an answer. Explain: connectors ingest metadata, a store normalizes it, search indexes it, lineage provides context, policies enforce governance. Tools implement architecture.
Ignoring the adoption problem
Most catalog failures are adoption failures, not technology failures. If the catalog isn't embedded in daily workflows (query editors, BI tools, Slack), people won't use it. Treat the catalog as a product with users, not infrastructure with admins.
Skipping data quality integration
A catalog without quality signals is a table of contents with no reviews. Consumers need to know: is this fresh? Are nulls within tolerance? Did the last pipeline succeed? Quality context is what makes catalog entries trustworthy.
Frequently asked questions
What is a data catalog?
What are the best data catalog tools?
How is a catalog different from a warehouse?
How does a catalog help with GDPR?
What is active metadata?
Should you build or buy a catalog?