# Same-Day Session and Transaction Correlation

> Same day session and purchase. Connected?

Canonical URL: <https://datadriven.io/problems/same_day_session_and_transaction_correlation>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

Find users who started a session and placed a transaction on the same calendar day. For those users, show user ID, the date, total transactions, and total transaction amount for that day.

## Worked solution and explanation

### Why this problem exists in real interviews

Built around the `user_sessions` and `transactions` tables, this challenge probes your ability to join two event tables on a derived calendar date in a session-analysis setting. A working solution must match rows on `user_id` and on the date portions of `transaction_date` and `session_start`.

---

### Break down the requirements

#### Step 1: Join `transactions` to `user_sessions`

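The join connects the two tables on `user_id` and on the calendar date derived from `transaction_date` and `session_start`. Only transactions with at least one session by the same user on the same day survive, and the matched rows carry the columns needed for aggregation.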
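
To make the join behaviour concrete, here is a minimal sketch on hypothetical inline rows. The sample values are invented for illustration, and both date columns are assumed to be `timestamp`. The CTE names shadow the real tables so the query runs without any setup.

```sql
-- Hypothetical sample rows; real column types may differ.
WITH user_sessions (user_id, session_start) AS (
    VALUES (1, TIMESTAMP '2024-03-01 09:00:00'),
           (1, TIMESTAMP '2024-03-01 18:30:00'),
           (2, TIMESTAMP '2024-03-02 10:00:00')
),
transactions (transaction_id, user_id, transaction_date, total_amount) AS (
    VALUES (100, 1, TIMESTAMP '2024-03-01 12:15:00', 50.00),
           (101, 2, TIMESTAMP '2024-03-03 08:00:00', 20.00)
)
-- Match each transaction to every session by the same user on the same day.
SELECT t.transaction_id, t.user_id, t.transaction_date, us.session_start
FROM transactions t
INNER JOIN user_sessions us
    ON t.user_id = us.user_id
   AND date(t.transaction_date) = date(us.session_start)
ORDER BY t.transaction_id, us.session_start;
```

Transaction 100 comes back twice, once per matching session, while transaction 101 drops out because user 2 had no session on that day. That repetition is what the later steps have to account for.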

#### Step 2: Aggregate by `t.user_id` and transaction date

`GROUP BY t.user_id, date(t.transaction_date)` collapses the joined rows to one row per user per calendar day. `COUNT` and `SUM` then compute the number of transactions and the total transaction amount for each group.

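#### Step 3: Deduplicate with DISTINCT

A user can start several sessions on the same day, so the join repeats each transaction once per matching session. `COUNT(DISTINCT t.transaction_id)` protects the count, but `SUM(t.total_amount)` still adds every repeated row. To keep the sum correct, reduce the sessions to one row per user per day before joining, for example with `SELECT DISTINCT user_id, date(session_start)`.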
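
As a minimal sketch of that deduplication, the query below reduces sessions to one row per user per calendar day. The rows are hypothetical, `session_start` is assumed to be a `timestamp`, and `session_day` is an illustrative alias.

```sql
-- Hypothetical sample rows: user 1 has two sessions on 2024-03-01.
WITH user_sessions (user_id, session_start) AS (
    VALUES (1, TIMESTAMP '2024-03-01 09:00:00'),
           (1, TIMESTAMP '2024-03-01 18:30:00'),
           (1, TIMESTAMP '2024-03-02 07:45:00')
)
-- DISTINCT keeps a single row for each (user_id, session_day) pair.
SELECT DISTINCT user_id, date(session_start) AS session_day
FROM user_sessions
ORDER BY user_id, session_day;
```

The three input sessions collapse to two rows, one for each day on which user 1 was active. Joining transactions against this set matches each transaction at most once.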

#### Step 4: Sort the final output

The `ORDER BY` clause sorts by user and then by date, which makes the output deterministic. When a prompt specifies a sort order or direction, interviewers check that your query matches it.

---

### The solution

**Join `transactions` to `user_sessions` to find same-day session and...**

```sql
SELECT
    t.user_id,
    date(t.transaction_date) AS the_date,
    COUNT(DISTINCT t.transaction_id) AS total_transactions,
    SUM(t.total_amount) AS total_amount
FROM transactions t
INNER JOIN (
    -- One row per user per session day, so repeated same-day sessions cannot inflate SUM
    SELECT DISTINCT user_id, date(session_start) AS session_day
    FROM user_sessions
) us ON t.user_id = us.user_id AND date(t.transaction_date) = us.session_day
GROUP BY t.user_id, date(t.transaction_date)
ORDER BY t.user_id, the_date

> **Cost Analysis**
>
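> With ~180M rows, the join is the expensive step: `GROUP BY` only shrinks the row set after the join has produced its matches, so repeated same-day sessions multiply the rows that must be aggregated. Because the date comparison wraps both columns in `date(...)`, a plain index on `transaction_date` or `session_start` cannot serve that predicate directly. An expression index on `user_id` plus the date expression is one way to let the planner seek instead of scan.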
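
One possible shape for such indexes in PostgreSQL is sketched below. The index names are hypothetical, and the sketch assumes both columns are `timestamp` without time zone. With `timestamptz`, `date(...)` depends on the session time zone and is not accepted in an index expression.

```sql
-- Assumption: transaction_date and session_start are `timestamp` (no time zone),
-- so date(...) is immutable and may appear in an index expression.
-- Index names are hypothetical.
CREATE INDEX idx_transactions_user_day
    ON transactions (user_id, date(transaction_date));

CREATE INDEX idx_user_sessions_user_day
    ON user_sessions (user_id, date(session_start));
```

Whether the planner actually uses these depends on table statistics. For a join that touches most rows of both tables, it may still prefer a hash join over full scans, so check the plan with `EXPLAIN`.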

> **Interviewers Watch For**
>
> Interviewers watch for whether you know when DISTINCT is needed and when it masks a logic error.

> **Common Pitfall**
>
> Using `COUNT(*)` instead of `COUNT(DISTINCT col)` counts duplicates. If the prompt says 'unique', you need DISTINCT inside the aggregate.
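
The sketch below compares the aggregates side by side on hypothetical rows, using a join that matches every same-day session. The sample data is invented, and the column types are assumed to be `timestamp` and `numeric`.

```sql
-- Hypothetical sample rows: two sessions and two transactions, all on one day.
WITH user_sessions (user_id, session_start) AS (
    VALUES (1, TIMESTAMP '2024-03-01 09:00:00'),
           (1, TIMESTAMP '2024-03-01 18:30:00')
),
transactions (transaction_id, user_id, transaction_date, total_amount) AS (
    VALUES (100, 1, TIMESTAMP '2024-03-01 12:15:00', 50.00),
           (101, 1, TIMESTAMP '2024-03-01 13:00:00', 25.00)
)
SELECT
    t.user_id,
    date(t.transaction_date) AS the_date,
    COUNT(*) AS row_count,                                   -- counts joined rows
    COUNT(DISTINCT t.transaction_id) AS distinct_transactions, -- counts transactions
    SUM(t.total_amount) AS summed_amount                     -- sums joined rows
FROM transactions t
INNER JOIN user_sessions us
    ON t.user_id = us.user_id
   AND date(t.transaction_date) = date(us.session_start)
GROUP BY t.user_id, date(t.transaction_date);
```

Each transaction matches both sessions, so `row_count` is 4 while `distinct_transactions` is 2. `summed_amount` comes out as 150.00 even though the two transactions total 75.00. `DISTINCT` inside `COUNT` repairs the count but does nothing for the sum.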

---

## Common follow-up questions

- What would happen to your result if `user_sessions.session_start` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `session_start` and uses DISTINCT or deduplication where needed.)_
- `transactions.transaction_id` has roughly 100,000,000 distinct values. What index strategy would you use to avoid a full scan on `transactions`? _(Tests indexing knowledge specific to the high-cardinality `transaction_id` column in `transactions`.)_
- If `user_sessions` contained late-arriving rows that were inserted after your query ran, how would you design an incremental update instead of re-aggregating? _(Tests understanding of incremental aggregation patterns.)_
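
For the first follow-up, one way to stay robust against repeated or duplicated sessions is a semi-join with `EXISTS`, sketched here on hypothetical rows. It assumes `transaction_id` is unique, so `COUNT(*)` counts transactions. The sample values and `timestamp` column types are illustrative.

```sql
-- Hypothetical sample rows: user 1 has two sessions on the same day.
WITH user_sessions (user_id, session_start) AS (
    VALUES (1, TIMESTAMP '2024-03-01 09:00:00'),
           (1, TIMESTAMP '2024-03-01 18:30:00')
),
transactions (transaction_id, user_id, transaction_date, total_amount) AS (
    VALUES (100, 1, TIMESTAMP '2024-03-01 12:15:00', 50.00),
           (101, 1, TIMESTAMP '2024-03-01 13:00:00', 25.00)
)
SELECT
    t.user_id,
    date(t.transaction_date) AS the_date,
    COUNT(*) AS total_transactions,
    SUM(t.total_amount) AS total_amount
FROM transactions t
-- EXISTS only tests for a matching session; it never multiplies transaction rows.
WHERE EXISTS (
    SELECT 1
    FROM user_sessions us
    WHERE us.user_id = t.user_id
      AND date(us.session_start) = date(t.transaction_date)
)
GROUP BY t.user_id, date(t.transaction_date)
ORDER BY t.user_id, the_date;
```

On these rows the query returns a single row for user 1 on 2024-03-01 with 2 transactions and a total of 75.00, regardless of how many sessions exist for that day.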

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/same_day_session_and_transaction_correlation)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.