Concepts
A data catalog is a centralized inventory of every dataset in your organization. It stores metadata (schemas, owners, quality scores, lineage, usage stats) and provides search so engineers and analysts can find, understand, and trust data without asking around on Slack.
Gartner estimates that data workers spend 30% of their time searching for and validating data. Organizations with mature catalog adoption report a 40% reduction in time-to-insight. The global data catalog market reached $1.2B in 2025 (IDC), growing at 22% CAGR.
A data catalog answers two questions for every person on your data team: what data do we have, and can I trust it?
Think of the catalog as a search engine for your data platform: when an analyst asks 'do we have customer churn data?', it should answer that question in seconds, not Slack threads.
Catalogs treat metadata as a product. Technical metadata (column types, table sizes, freshness) is collected automatically. Business metadata (descriptions, domain tags, PII classifications) is curated by data owners. The catalog connects both layers so consumers see technical context and business meaning in one place.
Modern catalogs are not just search tools. They enforce governance: who owns each dataset, who can access it, what PII it contains, and whether its quality meets SLA thresholds. A catalog without governance is just a wiki that goes stale. A catalog with governance becomes the control plane for your data platform.
Architecture overview: Data Sources (Warehouse, Data Lake, BI Tools, Orchestrator) feed the Metadata Ingestion Layer (crawlers, connectors, query log parsers, change event hooks), which populates the Catalog Core (Search Index, Metadata Store, Lineage Graph, Quality Engine, Access Policies, Classification Tags), serving Consumers: Analysts, Engineers, Governance Teams, and ML Engineers.
The catalog sits between raw data sources and consumers, providing a governed metadata layer that makes data discoverable and trustworthy.
A data catalog is not a single feature. It is six capabilities working together. Interviewers expect you to name these and explain how they interact.
Automated crawlers connect to warehouses, lakes, BI tools, and orchestrators to harvest technical metadata. Schema, column types, row counts, freshness, and partition info are pulled on a schedule or via event-driven hooks. The best catalogs also capture query logs to build usage and popularity signals.
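A minimal crawler sketch of this idea, using Python's sqlite3 as a stand-in for a real warehouse connection (a production connector would query information_schema or a vendor metadata API instead); the table and dictionary field names are illustrative assumptions:

```python
import sqlite3
from datetime import datetime, timezone

def crawl_tables(conn):
    """Harvest technical metadata from a DB-API connection.

    sqlite3 stands in for a real warehouse here; PRAGMA table_info
    plays the role of information_schema.columns.
    """
    assets = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        (rows,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        assets.append({
            "name": table,
            # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
            "columns": [{"name": c[1], "type": c[2]} for c in cols],
            "row_count": rows,
            "crawled_at": datetime.now(timezone.utc).isoformat(),
        })
    return assets

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 9.99)")
metadata = crawl_tables(conn)
```

The same shape generalizes to event-driven ingestion: instead of a scheduled crawl, a schema-change hook calls the harvester for just the affected table.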
Full-text search across table names, column names, descriptions, tags, and even sample values. Relevance ranking uses popularity (query frequency), freshness, and ownership signals. Think of it as Google for your data warehouse: analysts type a keyword and find the trusted, documented dataset in seconds.
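A toy ranking function along these lines; the signal weights (0.5 match, 0.3 popularity, 0.2 freshness) and field names are invented for illustration, and a real catalog would tune them against query logs and click-through data:

```python
def rank_results(assets, query):
    """Order catalog hits by a weighted blend of relevance signals."""
    q = query.lower()

    def score(asset):
        text = (asset["name"] + " " + asset.get("description", "")).lower()
        match = 1.0 if q in text else 0.0                    # text relevance
        popularity = min(asset.get("queries_30d", 0) / 100.0, 1.0)
        fresh = 1.0 if asset.get("hours_since_update", 1e9) <= 24 else 0.2
        return 0.5 * match + 0.3 * popularity + 0.2 * fresh

    return sorted(assets, key=score, reverse=True)

hits = rank_results(
    [
        {"name": "tmp_churn_backup", "queries_30d": 0, "hours_since_update": 900},
        {"name": "dim_customer_churn", "queries_30d": 250, "hours_since_update": 2},
    ],
    "churn",
)
```

Both assets match the keyword, but popularity and freshness push the maintained, heavily queried table above the stale backup.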
Quality metrics are surfaced directly in the catalog: freshness (when was this table last updated?), completeness (null rate per column), volume (row count trends), and schema drift. Some catalogs integrate with tools like Great Expectations or Monte Carlo to pull quality check results into each asset page.
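Completeness is the simplest of these metrics to sketch; a minimal per-column null-rate check over rows represented as dictionaries (the row shape is an assumption for illustration):

```python
def completeness(rows, column):
    """Fraction of non-null values in a column (1.0 = fully complete)."""
    values = [row.get(column) for row in rows]
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 4, "email": None},
]
```

A catalog would run such checks on a sample or via warehouse SQL, store the result per column, and surface it on the asset page next to freshness and volume trends.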
Every dataset has a designated owner (usually a team or individual). Owners are responsible for descriptions, quality SLAs, and responding to questions. Stewardship workflows let consumers request access, flag issues, or propose corrections. Without clear ownership, catalogs decay into graveyards of undocumented tables.
Catalogs display upstream and downstream dependencies for each asset. This is not lineage in isolation; it is lineage in context. When you find a dataset in the catalog, you immediately see where it came from, what transforms built it, and what dashboards depend on it.
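Under the hood this is a graph traversal; a minimal sketch with lineage stored as an adjacency map (asset names are hypothetical):

```python
def downstream(lineage, asset):
    """Transitive downstream dependents via depth-first traversal.

    lineage maps each asset to its direct downstream assets.
    """
    seen, stack = set(), [asset]
    while stack:
        node = stack.pop()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

lineage = {
    "raw.events": ["stg.events"],
    "stg.events": ["mart.daily_active_users"],
    "mart.daily_active_users": ["dashboard.growth"],
}
```

The same traversal run over a reversed map answers the upstream question: what would break this dashboard if it changed?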
Tag-based policies tie data classification (PII, HIPAA, internal) to access rules. When a column is tagged as PII, the catalog can enforce masking or restrict access to approved roles. Compliance teams use the catalog to audit who accessed what and whether PII handling meets regulatory requirements.
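A toy enforcement check showing the tag-to-policy link; the tag name 'PII' and role 'pii_reader' are placeholders, and real catalogs push this decision down into warehouse masking policies rather than application code:

```python
def read_column(column, value, user_roles):
    """Return the value, masked unless the user holds an approved role."""
    if "PII" in column.get("tags", []) and "pii_reader" not in user_roles:
        return "***MASKED***"
    return value

email_col = {"name": "email", "tags": ["PII"]}
```

The payoff of tag-based policies is that classifying a column once (tagging it PII) changes behavior everywhere the tag is checked, instead of per-table grants.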
Interviewers expect you to know the ecosystem. Name the tool, but more importantly, explain where it fits and what trade-offs it makes.
Metadata platform originally built at LinkedIn, now maintained by Acryl Data. Supports ingestion from 50+ sources, GraphQL API, column-level lineage, and fine-grained access policies. The most popular open-source catalog.
Metadata governance for the Hadoop ecosystem. Strong type system, classification propagation, and glossary support. Common in legacy on-prem environments with Hive, HBase, and Kafka.
Enterprise data intelligence platform. Combines catalog, glossary, data quality, and policy management. Workflow engine for stewardship and access requests. Targeted at Fortune 500 organizations with complex governance needs.
Pioneered the data catalog category. Strong in search, curation, and BI integration. Behavioral analysis tracks query patterns to surface popular and trusted datasets. Used heavily in analytics-driven organizations.
Cloud-native active metadata platform. Embedded collaboration (Slack-like threads on datasets), automated lineage, and persona-based UI for analysts, engineers, and governance teams. Growing fast in modern data stacks.
Databricks native governance layer. Manages tables, volumes, models, and functions across workspaces. Fine-grained ACLs, automatic lineage, and AI/ML asset governance. Tight integration with Delta Lake and MLflow.
Managed, Hive-compatible metastore for AWS. Stores table definitions for Athena, Redshift Spectrum, EMR, and Lake Formation. Crawlers auto-discover schemas from S3. The default catalog for AWS-native data lakes.
Fully managed metadata service for BigQuery, Pub/Sub, Cloud Storage, and Dataproc. Tag templates for custom metadata, policy tags for column-level security, and integration with Data Lineage API.
Interviewers frequently test whether you can distinguish these three concepts. They overlap but serve fundamentally different purposes.
Interviewers rarely ask “define data catalog.” They create scenarios that require you to apply catalog thinking to real organizational problems.
“Analysts spend hours searching for the right table. How would you fix this?”
Deploy a data catalog with automated metadata ingestion, quality signals, and ownership. The catalog becomes the entry point for data discovery. Candidates should describe ingestion connectors, search UX, curation workflows, and how to bootstrap adoption across teams.
“An auditor asks you to list every system that stores customer email addresses. How do you answer?”
Use the catalog's classification system: tag columns containing PII, propagate tags through lineage, and query the catalog for all assets tagged 'email' or 'PII-direct'. Candidates should mention automated classification (regex, ML-based) and manual review workflows.
“We are moving to a data mesh. How do domains publish and discover each other's data products?”
The catalog serves as the marketplace. Each domain registers data products with schema, SLAs, quality scores, and ownership. Consumers search the catalog, request access, and depend on published contracts. Lineage tracks cross-domain dependencies.
“Two teams have tables called 'revenue' with different numbers. How do you resolve this?”
Use the catalog to surface both tables, compare their lineage and definitions, and designate one as the certified or golden dataset. Add quality badges, deprecation labels, and redirect consumers to the trusted source. Governance workflow enforces the decision.
“We bought a catalog tool six months ago and nobody uses it. What went wrong?”
Catalogs fail without curation, ownership, and integration into daily workflows. The fix: automate metadata ingestion (no manual entry), embed catalog links in BI tools and query editors, assign owners per domain, and gamify curation with completeness dashboards. Adoption is a product problem, not a tooling problem.
What is a data catalog and why does it matter?
A data catalog is a centralized metadata store that enables discovery, governance, and trust across a data platform. It matters because without it, analysts waste hours asking 'does this table exist?' and 'can I trust this number?' The catalog answers both by combining technical metadata, business context, quality scores, and ownership in one searchable interface.
How does a data catalog differ from a data dictionary?
A dictionary defines columns within a single dataset (types, meanings, constraints). A catalog spans the entire platform: all datasets, all sources, with search, lineage, quality, and access control. A dictionary is a chapter; a catalog is the library. Most catalogs contain dictionary information as a subset.
What metadata does a catalog store?
Technical metadata: schema, column types, row counts, freshness, partitions, storage format. Business metadata: descriptions, domain tags, PII classifications, glossary terms. Operational metadata: query frequency, last queried by, pipeline dependencies, quality check results. Social metadata: ratings, comments, questions from consumers.
How would you design metadata ingestion for a catalog?
Connector-based architecture: each source (warehouse, lake, BI tool, orchestrator) has a crawler or push integration. Crawlers run on schedule or trigger on schema changes. Ingested metadata is normalized into a common model (datasets, fields, tags, owners). Important: ingestion must handle schema evolution, deleted assets, and freshness tracking.
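A sketch of the reconciliation step that makes ingestion handle deleted assets, under the assumption that the common model is a dictionary keyed by asset name; soft-deleting keeps lineage references resolvable:

```python
def reconcile(catalog, crawl_result):
    """Merge one crawl into the catalog's common model.

    Assets missing from the latest crawl are soft-deleted rather
    than dropped, so history and lineage references survive.
    """
    crawled = {asset["name"] for asset in crawl_result}
    for asset in crawl_result:
        entry = catalog.setdefault(asset["name"], {})
        entry.update(asset)          # schema evolution: newest metadata wins
        entry["deleted"] = False
    for name, entry in catalog.items():
        if name not in crawled:
            entry["deleted"] = True  # vanished from the source system
    return catalog

catalog = {
    "orders": {"name": "orders", "deleted": False},
    "legacy_orders": {"name": "legacy_orders", "deleted": False},
}
catalog = reconcile(catalog, [{"name": "orders", "row_count": 42},
                              {"name": "payments", "row_count": 7}])
```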
What is active metadata and why does it matter?
Active metadata goes beyond passive documentation. It is metadata that drives automation: triggering alerts when freshness SLAs breach, auto-classifying PII columns, recommending datasets based on query patterns, and propagating governance tags through lineage. Active metadata turns the catalog from a reference manual into an operational control plane.
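The freshness-SLA alert is the simplest active-metadata behavior to sketch; field names here are assumptions, and a real system would emit the result to Slack or PagerDuty rather than return a list:

```python
def freshness_alerts(assets):
    """Names of assets whose hours-since-update exceed their SLA."""
    return [asset["name"] for asset in assets
            if asset["hours_since_update"] > asset["sla_hours"]]

assets = [
    {"name": "fct_orders", "hours_since_update": 30, "sla_hours": 24},
    {"name": "dim_customer", "hours_since_update": 2, "sla_hours": 24},
]
```

The point is that the same metadata the catalog already displays (freshness, SLA) drives an action with no human in the loop.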
How do you handle PII classification in a catalog?
Layer automated and manual approaches. Automated: regex patterns (email, SSN, phone), ML classifiers for free-text fields, and tag propagation through lineage (if a source column is PII, downstream columns inherit the tag). Manual: data stewards review and confirm classifications. Tie PII tags to access policies so classification has enforcement teeth.
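A minimal sketch of the automated half: regex classification over sampled values plus one-hop tag propagation through lineage. The two patterns are illustrative only; production classifiers carry many more rules and route results to stewards for confirmation:

```python
import re

# Illustrative patterns only; production classifiers add many more
# and back every automated hit with manual steward review.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify_column(sample_values):
    """Tag a column by matching sampled values against PII regexes."""
    tags = set()
    for value in sample_values:
        for tag, pattern in PII_PATTERNS.items():
            if pattern.search(str(value)):
                tags.add(tag)
    return tags

def propagate_tags(tags_by_column, lineage):
    """One-hop tag inheritance: derived columns pick up source tags.

    lineage maps a source column to the columns derived from it;
    transitive chains need repeated passes until a fixpoint.
    """
    for source, derived in lineage.items():
        for column in derived:
            tags_by_column.setdefault(column, set()).update(
                tags_by_column.get(source, set()))
    return tags_by_column

tags = {"raw.users.email": classify_column(["a@example.com", "b@example.com"])}
tags = propagate_tags(tags, {"raw.users.email": ["stg.users.email"]})
```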
How do you measure catalog adoption?
Key metrics: search volume (are people using it?), curation coverage (percentage of tables with descriptions and owners), time-to-discovery (how fast analysts find data), and consumer satisfaction (survey or NPS). Also track stale metadata: tables with no queries in 90 days that still show as 'active' indicate catalog decay.
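Curation coverage is the easiest of these metrics to compute; a sketch assuming each asset record carries description and owner fields:

```python
def curation_coverage(assets):
    """Share of catalog assets with both a description and an owner."""
    if not assets:
        return 0.0
    curated = [asset for asset in assets
               if asset.get("description") and asset.get("owner")]
    return len(curated) / len(assets)

assets = [
    {"name": "orders", "description": "All orders", "owner": "commerce"},
    {"name": "tmp_fix", "description": "", "owner": None},
    {"name": "users", "description": "Registered users", "owner": "identity"},
    {"name": "events_raw", "description": None, "owner": "platform"},
]
```

Tracked per domain over time, this one number makes curation progress visible and supports the completeness dashboards mentioned earlier.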
How does a catalog support data governance?
The catalog is the governance hub. It enforces ownership (every asset has an accountable team), classification (PII and sensitivity labels), access control (policy tags drive permissions), quality (SLA thresholds with alerting), and lifecycle management (deprecation workflows). Without a catalog, governance is policy documents that nobody follows.
How would you evaluate data catalog tools?
Key criteria: ingestion breadth (number of source connectors), search quality (relevance ranking, filters), lineage depth (table vs column level), governance features (classification, access policies), extensibility (APIs, custom metadata), and adoption friction (SSO, embedded integrations, UX quality). Also consider open source vs SaaS vs cloud-native trade-offs.
How does a data catalog fit into a modern data stack?
The catalog sits at the metadata layer, connecting warehouse (Snowflake, BigQuery), transformation (dbt), orchestration (Airflow), and BI (Looker, Tableau). It ingests metadata from all layers and provides a unified view. In a mature stack, the catalog is the entry point: analysts start in the catalog, find the dataset, then open it in their query tool.
Treating the catalog as a one-time documentation project
Catalogs require continuous automation. Manual documentation goes stale within weeks. Automate ingestion, enforce ownership, and build curation into team workflows.
Confusing a data catalog with a data dictionary
A dictionary describes one dataset in depth. A catalog is a platform-wide inventory with search, lineage, governance, and access control. Interviewers will test whether you understand the difference in scope.
Naming catalog tools without explaining the architecture
Saying 'we use Collibra' is not an answer. Explain: connectors ingest metadata, a store normalizes it, search indexes it, lineage provides context, and policies enforce governance. Tools implement architecture.
Ignoring the adoption problem
Most catalog failures are adoption failures, not technology failures. If the catalog is not embedded in daily workflows (query editors, BI tools, Slack), people will not use it. Treat the catalog as a product with users, not infrastructure with admins.
Skipping data quality integration
A catalog without quality signals is a table of contents with no reviews. Consumers need to know: is this table fresh? Are nulls within tolerance? Did the last pipeline run succeed? Quality context is what makes catalog entries trustworthy.