# The Molecule Report

> Four letters. A lot of math hidden in the sequence.

Canonical URL: <https://datadriven.io/problems/the_molecule_report>

Domain: Python · Difficulty: easy · Seniority: L4

## Problem

Given a DNA sequence string, return a dict with four keys: 'nucleotide_counts' (a dict of the A/C/G/T counts), 'gc_content' ((G+C)/total*100 as a float, or 0.0 for an empty sequence), 'most_common_dinucleotide' (the most frequent 2-character substring, with ties broken alphabetically, or an empty string if len<2), and 'is_valid' (True iff the sequence contains only A, C, G, T).

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **multi-metric computation from a single pass** over string data. It checks whether candidates can extract counts, a percentage, the most common dinucleotide, and a validation flag from one input, which demonstrates structured problem decomposition.

---

### Break down the requirements

#### Step 1: Count each nucleotide base

Iterate through the string and tally occurrences of A, T, G, and C.

#### Step 2: Compute GC content

GC content is `(count_G + count_C) / len(sequence) * 100`. Handle empty sequences to avoid division by zero.
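
As a quick check of the formula, here is a minimal sketch. The helper name `gc_percent` is illustrative and not part of the problem's required interface.

```python
def gc_percent(sequence):
    """Return GC content as a percentage; 0.0 for an empty sequence."""
    if not sequence:
        return 0.0  # guard against division by zero
    gc = sequence.count('G') + sequence.count('C')
    return gc / len(sequence) * 100
```

For example, `gc_percent("ATGC")` evaluates to `50.0`, because two of the four bases are G or C.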

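#### Step 3: Find the most common 2-character substring

Count all consecutive (overlapping) pairs and find the one with the highest frequency. When several pairs share the highest count, return the alphabetically smallest one. A sequence shorter than 2 characters has no pairs, so the result is an empty string.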
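
One way to sketch this step in isolation uses `collections.Counter` and a composite sort key. The function name below is hypothetical, chosen to mirror the output key in the problem statement.

```python
from collections import Counter

def most_common_dinucleotide(sequence):
    """Most frequent 2-character substring; ties go to the alphabetically first."""
    if len(sequence) < 2:
        return ''
    # Overlapping windows: "AAA" yields "AA" twice.
    pair_counts = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    # Key: highest count first, then alphabetical order among ties.
    return min(pair_counts, key=lambda pair: (-pair_counts[pair], pair))
```

Negating the count lets a single `min` call handle both criteria: the pair with the largest count wins, and among equal counts the smaller string wins.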

#### Step 4: Validate the sequence

Check that every character is one of A, T, G, C.
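
A standalone version of this check might look like the sketch below. It assumes the check is case-sensitive (lowercase letters count as invalid) and that an empty sequence is valid because it contains no disallowed characters. The names `VALID_BASES` and `is_valid_dna` are illustrative.

```python
VALID_BASES = frozenset('ACGT')

def is_valid_dna(sequence):
    """True if every character is one of A, C, G, T (case-sensitive)."""
    return all(ch in VALID_BASES for ch in sequence)
```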

---

### The solution

**Single pass with multi-metric aggregation**

```python
def dna_summary(sequence):
    # Keys follow the A/C/G/T order used in the problem statement.
    nucleotide_counts = {'A': 0, 'C': 0, 'G': 0, 'T': 0}
    is_valid = True
    pair_counts = {}
    for i, ch in enumerate(sequence):
        if ch in nucleotide_counts:
            nucleotide_counts[ch] += 1
        else:
            # Any character outside A/C/G/T invalidates the sequence.
            is_valid = False
        if i < len(sequence) - 1:
            pair = sequence[i:i + 2]
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    total = len(sequence)
    gc_content = 0.0
    if total > 0:
        gc_content = (nucleotide_counts['G'] + nucleotide_counts['C']) / total * 100
    most_common_dinucleotide = ''
    max_pair_count = 0
    # Visit pairs alphabetically so the strict '>' keeps the smallest pair on ties.
    for pair in sorted(pair_counts):
        if pair_counts[pair] > max_pair_count:
            max_pair_count = pair_counts[pair]
            most_common_dinucleotide = pair
    # Output keys match the names required by the problem statement.
    return {
        'nucleotide_counts': nucleotide_counts,
        'gc_content': gc_content,
        'most_common_dinucleotide': most_common_dinucleotide,
        'is_valid': is_valid,
    }
```

> **Time and Space Complexity**
>
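> **Time:** O(n). A single pass counts bases and pairs, and the final selection step only touches the distinct pairs, which is constant-size work for a valid sequence.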
> 
> **Space:** O(k) where k is the number of distinct 2-character pairs. At most 16 for a valid DNA sequence (4^2).

> **Interviewers Watch For**
>
> Do you handle the empty sequence edge case? Division by zero on GC content is a common bug. Also, returning a structured dict shows you can design clean API responses.

> **Common Pitfall**
>
> Checking validity by comparing `sum(base_counts.values()) == len(sequence)`. This works only if `base_counts` does not count invalid characters, which depends on your implementation. An explicit per-character check is safer.
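
To make the pitfall concrete, the sketch below assumes a hypothetical implementation that tallies every character with `collections.Counter`. In that case the length comparison passes even though the sequence contains an invalid `N`.

```python
from collections import Counter

ALLOWED = {'A', 'C', 'G', 'T'}
sequence = "ACGTN"

# A Counter tallies every character, including the invalid 'N'.
all_counts = Counter(sequence)
length_check = sum(all_counts.values()) == len(sequence)

# The explicit per-character check does not depend on how counts are stored.
explicit_check = all(ch in ALLOWED for ch in sequence)
```

Here `length_check` is `True` while `explicit_check` is `False`, so only the explicit check reports the sequence as invalid.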

---

## Common follow-up questions

- What if the sequence could be RNA (containing U instead of T)? _(Tests parameterizing the valid character set.)_
- How would you find the longest repeated subsequence? _(Tests moving beyond 2-character pairs to variable-length pattern matching.)_
- What if the sequence were gigabytes long? _(Tests streaming computation vs loading everything into memory.)_
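
The RNA and large-input follow-ups can be explored together with a rough sketch like the one below. It is one possible approach, not a reference answer: the function name `summarize_chunks`, the `alphabet` parameter, and the idea of receiving the sequence as an iterable of string chunks are all assumptions made for illustration.

```python
from collections import Counter

def summarize_chunks(chunks, alphabet="ACGT"):
    """Summarize a sequence delivered as an iterable of string chunks."""
    base_counts = dict.fromkeys(alphabet, 0)
    pair_counts = Counter()
    is_valid = True
    total = 0
    prev = ''  # last character seen, so pairs can span chunk boundaries
    for chunk in chunks:
        for ch in chunk:
            if ch in base_counts:
                base_counts[ch] += 1
            else:
                is_valid = False
            if prev:
                pair_counts[prev + ch] += 1
            prev = ch
            total += 1
    gc = base_counts.get('G', 0) + base_counts.get('C', 0)
    # Highest count first, then alphabetical order among ties.
    best = min(pair_counts, key=lambda p: (-pair_counts[p], p)) if pair_counts else ''
    return {
        'nucleotide_counts': base_counts,
        'gc_content': gc / total * 100 if total else 0.0,
        'most_common_dinucleotide': best,
        'is_valid': is_valid,
    }
```

Only the running counters and the previous character are kept in memory, so the chunks could come from a file read in fixed-size blocks. Passing `alphabet="ACGU"` switches the valid character set to RNA.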

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_molecule_report)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.