# Loan Management Schema

> Money out, payments back. The balance has to be exact.

Canonical URL: <https://datadriven.io/problems/loan_management_schema>

Domain: Data Modeling · Difficulty: easy · Seniority: L4

## Problem

We're a consumer lending company that offers personal loans, auto loans, and mortgages. Customers make monthly payments, but sometimes they pay early, miss payments, or refinance. The operations team needs outstanding balances and the risk team needs to flag delinquent accounts. Can you design the schema?

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether a candidate can recognize **normalization opportunities** without over-normalizing and model a transaction log that supports derived balances. It is an easy prompt that still rewards grain discipline: outstanding balance is not stored, it is computed.

> **Trick to Solving**
>
> When a prompt lists several “types” of something (personal, auto, mortgage) with their own rates and terms, the trick is to lift those attributes into a dimension. Before drawing tables, a strong candidate asks: is outstanding balance a stored column or a derived aggregate?
> 
> 1. Pull loan types into a dim table with rate and term columns
> 2. Keep loans as the instance fact of a customer taking a loan
> 3. Model every payment as a row in loan_transactions
> 4. Derive outstanding balance via SUM, never store it

---

### Break down the requirements

#### Step 1: Identify the four entities

`customers`, `loan_types`, `loans`, `loan_transactions`. Each has a clean grain and no redundant attributes.

#### Step 2: Normalize loan types

Rate, term, and product name belong on `loan_types`. Repeating them per loan row invites update anomalies when the product team renames a product.

#### Step 3: Grain of loan_transactions

One row equals one money movement on one loan: payment, disbursement, fee, or refund. `paid_amount` is signed or typed.

#### Step 4: Compute outstanding balance

`principal_amount - SUM(paid_amount)` via a GROUP BY on `loan_id`. The aggregate is always in sync because the underlying fact is the ledger.

---

### The solution

Below is one defensible model. The transaction log as source of truth is the conceptual anchor: balances are derived, never cached in-row.

> **Why this works**
>
> Deriving balance from a ledger is the accounting-grade pattern. Storing balance as a column looks efficient until the first missed update creates drift, at which point reconciliation is a nightmare. A SUM over a properly indexed fact is both correct and fast enough.

> **Interviewers watch for**
>
> A strong candidate says “the balance is a derived view” out loud and pushes back on any design that caches it as an UPDATEable column without a corresponding event row.

> **Common pitfall**
>
> Storing `outstanding_balance` as a column on `loans` and keeping it in sync via triggers. The first replication lag or missed trigger creates a silent drift between the ledger and the balance, which is exactly the failure the ledger model exists to prevent.

---

### The analysis pattern

**Current outstanding balance per loan**

```sql
SELECT
    l.loan_id,
    c.full_name,
    lt.product_name,
    l.principal_amount - COALESCE(SUM(tx.paid_amount), 0) AS outstanding_balance
FROM loans l
JOIN customers c ON c.customer_id = l.customer_id
JOIN loan_types lt ON lt.loan_type_id = l.loan_type_id
LEFT JOIN loan_transactions tx
  ON tx.loan_id = l.loan_id
 AND tx.transaction_type = 'payment'
WHERE l.status = 'active'
GROUP BY l.loan_id, c.full_name, lt.product_name, l.principal_amount
```

---

### Trade-offs and alternatives

**Derived balance from ledger**

Balance is a SUM over loan_transactions at read time.

* One source of truth
* No drift possible
* Read cost scales with transaction count per loan

**Stored balance column**

Balance maintained as an UPDATEable column on loans.

* O(1) read per loan
* Silent drift on missed updates
* Reconciliation jobs become mandatory

---

## Common follow-up questions

- How do you handle a refinance where an old loan is closed and a new one is opened? _(Tests whether the candidate models refinance as a status transition plus a parent_loan_id link.)_
- How do you compute days past due for each loan? _(Tests whether the candidate derives it from expected payment schedule vs actual transactions.)_
- How would you partition loan_transactions if volume reached 500M rows? _(Tests scale thinking: date partitioning and clustering by loan_id.)_
- What changes if a payment applies partially to interest and partially to principal? _(Tests whether transaction_type is fine-grained enough to decompose.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/loan_management_schema)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.