Writing Legible SQL with Common Table Expressions

By Lucas Gray on July 26, 2016

Metal letters Image by bark

When it comes to writing SQL, it’s a common predicament: You start with something clear and simple that grows and grows into something completely illegible. One subselect becomes a subselect within a subselect, quickly turning into a ball of mud. Enter Common Table Expressions (CTEs).

Benefits of Common Table Expressions

CTEs shine as a way to…

  1. Break down your problem into smaller bits
  2. Keep things things tidy and legible
  3. Reuse resultsets (unlike subqueries)

Think of CTEs as temporary resultsets you can use in the rest of your query. Sometimes they’re called WITH clauses.

Not all databases support CTEs, but the ones that do typically feature a syntax like the following…

WITH commonTableExpression AS (SELECT a,b,c FROM stuff WHERE...)

A Practical Example

Consider the following use case I encountered recently. We have two tables: deliveries and clients. Clients have many deliveries. We need to figure out how many deliveries each client has had in the last week, broken down by day. So I whipped up a simple query…

SELECT d.created_at::date, c.name, count(1) 
FROM deliveries d
JOIN clients c on c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
GROUP BY c.name, d.created_at::date
ORDER BY c.name DESC

In English, we’re asking for delivery counts, broken down by client, for the last week. We get something like this –

"created_at","name","count"
"2016-06-23","Beekeeper","7"
"2016-06-24","Beekeeper","11"
"2016-06-27","Beekeeper","4"
"2016-06-28","Beekeeper","4"
"2016-06-29","Beekeeper","4"
"2016-06-27","Client A","448"
"2016-06-23","Client B","3"
"2016-06-24","Client B","3"
"2016-06-27","Client B","3"
"2016-06-28","Client B","3"
"2016-06-29","Client B","4"
"2016-06-28","Client C","21"
"2016-06-24","Client D","496"
...etc

Not too shabby for such a simple query. But a couple wild business requirements appear!

Business Requirements
  1. We need rows for days where no clients delivered anything (one row per client) – filling the gaps.
  2. If we have more than 5 clients, let’s just grab the top five for the week, and don’t include us (Beekeeper) in the report.

This shouldn’t be too hard, right?

Requirement 1: Filling Gaps

Let’s try to tackle #1 first. We will need to start by considering every date in our daterange, instead of using only dates provided to us in the deliveries table. No easy way to add that dynamically springs to mind, and the best answer that comes up in a google search involves a calendar table. At some point I suppose it usually seems necessary to create one to join to for these sorts of things. But what a pain – let’s just write some SQL!

Let’s see if we can combine a CTE with a nifty Postgres function named generate_series to do exactly what we need.

WITH weekdaysToUse AS (
  SELECT date 
  FROM generate_series(
    date_trunc('day', now()) - INTERVAL '7 days', 
    current_date,
    '1 day'::interval
  ) date)

Great, CTEs and a Postgres function bailed us out. The nice thing about this approach is I can now use the weekdaysToUse resultset for any subsequent CTE and the final query as well, and we’ll definitely need it.

We’re getting closer. Let’s add the joins to deliveries and clients, then run the query and see what we get.

WITH weekdaysToUse AS (
  SELECT date 
  FROM generate_series(
    date_trunc('day', now()) - INTERVAL '7 days', 
    current_date,
    '1 day'::interval
  ) date
)

SELECT wtu.date, c.name, count(1) as cnt
FROM weekdaysToUse wtu
LEFT JOIN deliveries d ON wtu.date = d.created_at::date
JOIN clients c ON c.id = d.client_id
GROUP BY wtu.date, c.name
"date","name","cnt"
"2016-06-28","Client A","4"
"2016-06-21","Client B","40"
"2016-06-21","Client C","11"
"2016-06-23","Client B","3"
"2016-06-22","Client A","4"
"2016-06-27","Beekeeper","448"
"2016-06-29","Client A","4"
"2016-06-21","Client A","5"
"2016-06-24","Client B","3"
"2016-06-24","Client C","496"
"2016-06-28","Client A","21"
"2016-06-27","Client A","4"
"2016-06-29","Beekeeper","4"
"2016-06-24","Client A","11"
"2016-06-21","Beekeeper","3"
"2016-06-27","Beekeeper","3"
"2016-06-28","Beekeeper","3"
"2016-06-23","Client A","7"
"2016-06-22","Beekeeper","4"

Wait a sec, where are my zero delivery rows? We started the query off with weekdaysToUse, but will only get a row returned where we had at least a delivery from a client for that day. Shoot. Let’s make a query to fill in rows where a client did not have a delivery for that day. If we had that resultset, we could combine it with the results from the previous, and we’d be set.

Maybe we can try to use what we’ve learned so far and build up to it with CTEs.

WITH gapFiller AS (
  SELECT wtu.date, c.name, 0 as cnt
  FROM weekdaysToUse wtu, clients c
  WHERE NOT EXISTS(
    SELECT 1 
    FROM deliveries d 
    WHERE d.created_at = wtu.date 
    and d.client_id = c.id
  )
)
"date","name","cnt"
"2016-06-21","Client A","0"
"2016-06-22","Client A","0"
"2016-06-23","Client A","0"
"2016-06-24","Client A","0"
"2016-06-25","Client A","0"
"2016-06-26","Client A","0"
"2016-06-27","Client A","0"
"2016-06-28","Client A","0"
"2016-06-29","Client A","0"
"2016-06-21","Client B","0"
"2016-06-22","Client B","0"
"2016-06-23","Client B","0"
"2016-06-24","Client B","0"
"2016-06-25","Client B","0"
"2016-06-26","Client B","0"
"2016-06-27","Client B","0"
"2016-06-28","Client B","0"
"2016-06-29","Client B","0"
"2016-06-21","Client C","0"
"2016-06-22","Client C","0"
"2016-06-23","Client C","0"
"2016-06-24","Client C","0"
"2016-06-25","Client C","0"
"2016-06-26","Client C","0"
"2016-06-27","Client C","0"
"2016-06-28","Client C","0"
"2016-06-29","Client C","0"
"2016-06-21","Client D","0"
"2016-06-22","Client D","0"
"2016-06-23","Client D","0"
"2016-06-24","Client D","0"
"2016-06-25","Client D","0"
"2016-06-26","Client D","0"
"2016-06-27","Client D","0"
"2016-06-28","Client D","0"
"2016-06-29","Client D","0"

If we take that resultset and UNION ALL the rows that weren’t zero earlier, we’ll be all set.

Requirement 2: Using Only Top Clients

Remember our business requirement about top clients that aren’t us?

Here is a subquery solution,

SELECT d.created_at::date, c.name, COUNT(1) 
FROM deliveries d
JOIN clients c ON c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
AND c.id IN (
  SELECT c1.id FROM clients c1 
  JOIN deliveries d1 ON c1.id = d1.client_id
  WHERE d1.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
    AND c1.name <> 'Beekeeper'
  GROUP BY c1.id
  ORDER BY count(d1.id) DESC
  LIMIT 5
)
group by c.name, d.created_at::date
order by c.name desc

and here is a CTE solution.

WITH topFiveClientsThisWeekThatArentUs AS (
SELECT c.id, c.name FROM clients c 
  JOIN deliveries d ON c.id = d.client_id
  WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
    AND c.name <> 'Beekeeper'
  GROUP BY c.id
  ORDER BY COUNT(d.id) DESC
  LIMIT 5
)

SELECT d.created_at::date, c.name, count(1) 
FROM deliveries d
JOIN topFiveClientsThisWeekThatArentUs c ON c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
GROUP BY c.name, d.created_at::date
ORDER BY c.name desc

The CTE allows you to logically structure your thinking, and improves readability 6 months from now when another developer has to grasp what the query you wrote was trying to do. Bonus points for choosing a smart name for the intermediate resultset. Mine is long, but describes exactly what it is – top five clients this week that aren’t us.

Putting It All Together

Now for the big reveal – but first, I wrote a naive subquery solution to the problem to demonstrate the alternative:

Subquery Solution
SELECT d AS date, name AS clientName, cnt

FROM (SELECT d FROM generate_series(
  date_trunc('day', now()) - INTERVAL '7 days',
  current_date,
  '1 day'::interval
) d) as weekdaysToUse --the days to use

JOIN ( --client/delivery combos
  SELECT d.created_at::date, c.name, d.client_id, count(1) AS cnt
  FROM clients c
  LEFT JOIN deliveries d ON d.client_id = c.id
  WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
  GROUP BY d.created_at::date, c.name, d.client_id 
  
  UNION ALL
  
  SELECT weekdaysToUse.d, c.name, c.id, 0 AS cnt
  FROM clients c, (SELECT d FROM generate_series(
      date_trunc('day', now()) - INTERVAL '7 days', 
      current_date,
      '1 day'::interval
    ) d) AS weekdaysToUse
  WHERE NOT EXISTS(
      SELECT 1 FROM deliveries del 
      WHERE del.created_at::date = weekdaysToUse.d
        AND del.client_id = c.id
    ) --gap filler
) AS deliveries

ON weekdaysToUse.d = deliveries.created_at

WHERE deliveries.client_id IN (
  SELECT c.id
  FROM deliveries d
  JOIN clients c on c.id = d.client_id
  WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
    AND c.name <> 'Beekeeper'
  GROUP BY c.id
  ORDER BY cnt DESC LIMIT 5
) --only top 5 that aren't us

ORDER BY date asc, clientName ASC

Yuck, all the requirements are just jumbled together. I also need to specify the date math and generate series all over the place, since I can’t reuse it without a CTE, and don’t want to scan too much data.

Now, at last, here is our complete solution featuring CTEs.

CTE Solution
WITH weekdaysToUse AS (
  SELECT date 
  FROM generate_series(
    date_trunc('day', now()) - INTERVAL '7 days', 
    current_date,
    '1 day'::interval
  ) date
),

topClients AS (
  SELECT c.id, c.name FROM clients c
    JOIN deliveries d ON c.id = d.client_id
    WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
      AND c.name <> 'Beekeeper'
    GROUP BY c.id
    ORDER BY COUNT(d.id) DESC
    LIMIT 5
),

dayClientCombosWithData AS (
  SELECT wtu.date, c.name, count(1) AS cnt
  FROM weekdaysToUse wtu
  LEFT JOIN deliveries d ON wtu.date = d.created_at::date
  JOIN topClients c ON c.id = d.client_id
  GROUP BY wtu.date, c.name
),

gapFiller AS (
  SELECT wtu.date, c.name, 0 as cnt
  FROM weekdaysToUse wtu, topClients c
  WHERE NOT EXISTS(
    SELECT 1 FROM DELIVERIES d WHERE d.created_at = wtu.date
  )
)

SELECT merged.date, merged.name, max(merged.cnt)
FROM (
  SELECT date, name, cnt FROM dayClientCombosWithData
  UNION ALL
  SELECT date, name, cnt FROM gapFiller 
  ) merged
GROUP BY merged.date, merged.name
ORDER BY merged.date, merged.name

Straightforward, easy to understand at every step. Not too shabby!

Still Writing SQL in a Terminal?

Beekeeper is a free and full featured SQL IDE with query autocomplete, error detection, dataset previews, and result visualization. We support SparkSQL, Hive, Postgres and more!

Create your account »

Lucas Gray bio photo

Lucas Gray

Lead Software Engineer. Runner. ETL jockey.

Email Twitter Github Stackoverflow