# Data Quality Report

> The data is not as clean as it looks.

Canonical URL: <https://datadriven.io/problems/data_quality_report>

Domain: Python · Difficulty: hard · Seniority: L4

## Problem

Given a list of record dicts, return a dict per column name with 'null_count' and 'non_null_count'. Consider a value null when it is Python None.

## Worked solution and explanation

### Why this problem exists in real interviews

Profiling null rates per column is the first step in any data quality pipeline. This tests **iterating over records and columns simultaneously**, accumulating counts in a nested dictionary structure, and correctly identifying `None` values.

---

### Break down the requirements

#### Step 1: Discover all column names

The column set comes from the union of all keys across all records, since some records may be missing keys.

#### Step 2: Count nulls and non-nulls per column

For each record and each known column, check if the value is `None` and increment the appropriate counter.

#### Step 3: Handle missing keys as null

If a record lacks a column entirely, that counts as a null for that column.

---

### The solution

**Per-column null profiling with nested dict accumulation**

```python
def data_quality_report(records):
    columns = set()
    for record in records:
        for key in record:
            columns.add(key)
    result = {}
    for col in columns:
        result[col] = {'null_count': 0, 'non_null_count': 0}
    for record in records:
        for col in columns:
            value = record.get(col)
            if value is None:
                result[col]['null_count'] += 1
            else:
                result[col]['non_null_count'] += 1
    return result
```

> **Time and Space Complexity**
>
> **Time:** O(n * c) where n is the number of records and c is the number of unique columns. Each cell is visited once.
> 
> **Space:** O(c) for the result dict, one entry per column.

> **Interviewers Watch For**
>
> Using `record.get(col)` instead of `record[col]` to handle missing keys gracefully. Also, using `is None` rather than falsy checks to avoid treating `0` or `''` as null.

> **Common Pitfall**
>
> Only checking keys present in the first record. If different records have different columns, you must discover the full column set by scanning all records.

---

## Common follow-up questions

- How would you add type distribution per column? _(Tests extending the profiler to track int/str/float counts alongside null counts.)_
- What if the dataset has millions of records and hundreds of columns? _(Tests awareness of columnar profiling strategies and memory efficiency.)_
- How would you detect columns that are entirely null? _(Tests post-processing: filter the report for columns where `non_null_count` is 0.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/data_quality_report)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.