Select first row in each GROUP BY group?
As the title suggests, I'd like to select the first row of each set of rows grouped with a GROUP BY.
Specifically, if I've got a purchases table that looks like this:
SELECT * FROM purchases;
id | customer | total
---+----------+------
 1 | Joe      | 5
 2 | Sally    | 3
 3 | Joe      | 2
 4 | Sally    | 1
I'd like to query for the id of the largest purchase (total) made by each customer. Something like this:
SELECT FIRST(id), customer, FIRST(total)
FROM purchases
GROUP BY customer
ORDER BY total DESC;

FIRST(id) | customer | FIRST(total)
----------+----------+-------------
        1 | Joe      | 5
        2 | Sally    | 3
Testing the most interesting candidates with Postgres 9.4 and 9.5 with a halfway realistic table of 200k rows in purchases and 10k distinct customer_id (avg. 20 rows per customer).
For Postgres 9.5 I ran a 2nd test with effectively 86446 distinct customers. See below (avg. 2.3 rows per customer).
CREATE TABLE purchases (
  id          serial
, customer_id int  -- REFERENCES customer
, total       int  -- could be amount of money in Cent
, some_column text -- to make the row bigger, more realistic
);
I use a serial (PK constraint added below) and an integer customer_id since that's a more typical setup. Also added some_column to make up for typically more columns.
Dummy data, PK, index - a typical table also has some dead tuples:
INSERT INTO purchases (customer_id, total, some_column)    -- insert 200k rows
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,200000) g;

ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);

DELETE FROM purchases WHERE random() > 0.9; -- some dead rows

INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,20000) g;  -- add 20k to make it ~ 200k

CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);

VACUUM ANALYZE purchases;
customer table - for superior query
CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM   purchases
GROUP  BY 1
ORDER  BY 1;

ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);

VACUUM ANALYZE customer;
In my second test for 9.5 I used the same setup, but with random() * 100000 to generate customer_id to get only few rows per customer_id.
Object sizes for table purchases
Generated with this query.
               what                | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
 core_relation_size                | 20496384 | 20 MB        |           102
 visibility_map                    |        0 | 0 bytes      |             0
 free_space_map                    |    24576 | 24 kB        |             0
 table_size_incl_toast             | 20529152 | 20 MB        |           102
 indexes_size                      | 10977280 | 10 MB        |            54
 total_size_incl_toast_and_indexes | 31506432 | 30 MB        |           157
 live_rows_in_text_representation  | 13729802 | 13 MB        |            68
 ------------------------------    |          |              |
 row_count                         |   200045 |              |
 live_tuples                       |   200045 |              |
 dead_tuples                       |    19955 |              |
1. row_number() in CTE (see other answer)
WITH cte AS (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;
2. row_number() in subquery (my optimization)
SELECT id, customer_id, total
FROM  (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   ) sub
WHERE  rn = 1;
3. DISTINCT ON (see other answer)
SELECT DISTINCT ON (customer_id)
       id, customer_id, total
FROM   purchases
ORDER  BY customer_id, total DESC, id;
4. rCTE with LATERAL subquery (see here)
WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, customer_id, total
   FROM   purchases
   ORDER  BY customer_id, total DESC
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, customer_id, total
      FROM   purchases
      WHERE  customer_id > c.customer_id  -- lateral reference
      ORDER  BY customer_id, total DESC
      LIMIT  1
      ) u
   )
SELECT id, customer_id, total
FROM   cte
ORDER  BY customer_id;
5. customer table with LATERAL (see here)
SELECT l.*
FROM   customer c
,      LATERAL (
   SELECT id, customer_id, total
   FROM   purchases
   WHERE  customer_id = c.customer_id  -- lateral reference
   ORDER  BY total DESC
   LIMIT  1
   ) l;
6. array_agg() with ORDER BY (see other answer)
SELECT (array_agg(id ORDER BY total DESC))[1] AS id
     , customer_id
     , max(total) AS total
FROM   purchases
GROUP  BY customer_id;
Execution time for above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.
All queries used an Index Only Scan on purchases2_3c_idx (among other steps). Some of them just for the smaller size of the index, others more effectively.
A. Postgres 9.4 with 200k rows and ~ 20 per customer_id
1. 273.274 ms
2. 194.572 ms
3. 111.067 ms
4.  92.922 ms
5.  37.679 ms  -- winner
6. 189.495 ms
B. The same with Postgres 9.5
1. 288.006 ms
2. 223.032 ms
3. 107.074 ms
4.  78.032 ms
5.  33.944 ms  -- winner
6. 211.540 ms
C. Same as B., but with ~ 2.3 rows per customer_id
1. 381.573 ms
2. 311.976 ms
3. 124.074 ms  -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms
Original (outdated) benchmark from 2011
I ran three tests with PostgreSQL 9.1 on a real life table of 65579 rows and single-column btree indexes on each of the three columns involved and took the best execution time of 5 runs.
Comparing @OMGPonies' first query (A) to the above DISTINCT ON solution (B):
1. Select the whole table, results in 5958 rows in this case.
A: 567.218 ms
B: 386.673 ms
2. WHERE customer BETWEEN x AND y, resulting in 1000 rows.
A: 249.136 ms
B:  55.111 ms
3. Select a single customer with WHERE customer = x.
A: 0.143 ms
B: 0.072 ms
Same test repeated with the index described in the other answer
CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
1A: 277.953 ms
1B: 193.547 ms
2A: 249.796 ms -- special index not used
2B:  28.679 ms
3A:   0.120 ms
3B:   0.048 ms
On Oracle 9.2+ (not 8i+ as originally stated), SQL Server 2005+, PostgreSQL 8.4+, DB2, Firebird 3.0+, Teradata, Sybase, Vertica:
WITH summary AS (
    SELECT p.id,
           p.customer,
           p.total,
           ROW_NUMBER() OVER(PARTITION BY p.customer
                                 ORDER BY p.total DESC) AS rk
      FROM PURCHASES p)
SELECT s.*
  FROM summary s
 WHERE s.rk = 1
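To try the ROW_NUMBER() approach on the sample data from the question, here is a minimal sketch that runs it through Python's built-in sqlite3 module. It is an illustration only: SQLite merely stands in for the databases listed above and needs version 3.25 or later for window functions, the function name largest_purchase_per_customer is made up for this example, and p.id is added to the window ORDER BY as an assumed tie-breaker that the query above does not have.

```python
import sqlite3


def largest_purchase_per_customer(rows):
    """Return one (id, customer, total) row per customer: the highest total.

    Hypothetical helper for illustration; SQLite (>= 3.25) is a stand-in
    for the databases named in the answer.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER)"
    )
    con.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    result = con.execute(
        """
        WITH summary AS (
            SELECT p.id, p.customer, p.total,
                   ROW_NUMBER() OVER (PARTITION BY p.customer
                                      ORDER BY p.total DESC, p.id) AS rk  -- p.id: assumed tie-breaker
            FROM purchases p
        )
        SELECT s.id, s.customer, s.total
        FROM summary s
        WHERE s.rk = 1
        ORDER BY s.customer
        """
    ).fetchall()
    con.close()
    return result


sample = [(1, "Joe", 5), (2, "Sally", 3), (3, "Joe", 2), (4, "Sally", 1)]
print(largest_purchase_per_customer(sample))  # [(1, 'Joe', 5), (2, 'Sally', 3)]
```

Without a tie-breaker in the window ORDER BY, which of several rows sharing the maximum total gets rk = 1 is up to the database.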
Supported by any database:
But you need to add logic to break ties:
SELECT MIN(x.id),  -- change to MAX if you want the highest
       x.customer,
       x.total
  FROM PURCHASES x
  JOIN (SELECT p.customer,
               MAX(total) AS max_total
          FROM PURCHASES p
      GROUP BY p.customer) y ON y.customer = x.customer
                            AND y.max_total = x.total
GROUP BY x.customer, x.total
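The effect of the MIN / MAX tie-breaker is easiest to see on data with a tie. The following sketch runs the join query through Python's built-in sqlite3 module; SQLite is only a convenient stand-in, the function name largest_purchase_ids and its tie_break parameter are invented for this example, and an ORDER BY is added so the output order is stable.

```python
import sqlite3


def largest_purchase_ids(rows, tie_break="MIN"):
    """Run the join-on-MAX(total) query; tie_break picks MIN(id) or MAX(id).

    Hypothetical helper for illustration only.
    """
    if tie_break not in ("MIN", "MAX"):
        raise ValueError("tie_break must be 'MIN' or 'MAX'")
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER)"
    )
    con.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    # tie_break is checked against a whitelist above, so formatting it in is safe.
    query = f"""
        SELECT {tie_break}(x.id), x.customer, x.total
        FROM purchases x
        JOIN (SELECT p.customer, MAX(total) AS max_total
              FROM purchases p
              GROUP BY p.customer) y
          ON y.customer = x.customer
         AND y.max_total = x.total
        GROUP BY x.customer, x.total
        ORDER BY x.customer  -- added for stable output
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


# Joe has two purchases sharing the maximum total of 5 (ids 1 and 3).
tied = [(1, "Joe", 5), (2, "Sally", 3), (3, "Joe", 5), (4, "Sally", 1)]
print(largest_purchase_ids(tied, "MIN"))  # [(1, 'Joe', 5), (2, 'Sally', 3)]
print(largest_purchase_ids(tied, "MAX"))  # [(3, 'Joe', 5), (2, 'Sally', 3)]
```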
In PostgreSQL this is typically simpler and faster (more performance optimization below):
SELECT DISTINCT ON (customer) id, customer, total FROM purchases ORDER BY customer, total DESC, id;
Or shorter (if not as clear) with ordinal numbers of output columns:
SELECT DISTINCT ON (2) id, customer, total FROM purchases ORDER BY 2, 3 DESC, 1;
If total can be NULL (won't hurt either way, but you'll want to match existing indexes):
... ORDER BY customer, total DESC NULLS LAST, id;
DISTINCT ON is a PostgreSQL extension of the standard (where only DISTINCT on the whole SELECT list is defined).
List any number of expressions in the DISTINCT ON clause, the combined row value defines duplicates. The manual:
Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.
Bold emphasis mine.
DISTINCT ON can be combined with ORDER BY. Leading expressions have to match leading DISTINCT ON expressions in the same order. You can add additional expressions to ORDER BY to pick a particular row from each group of peers. I added id as last item to break ties:
"Pick the row with the smallest id from each group sharing the highest total."
If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST like demonstrated.
The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way. (Not needed in the simple case above):
- You don't have to include any of the expressions in DISTINCT ON or ORDER BY.
- You can include any other expression in the SELECT list. This is instrumental for replacing much more complex queries with subqueries and aggregate / window functions.
I tested with Postgres versions 8.3 – 10. But the feature has been there at least since version 7.1, so basically always.
The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:
CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
May be too specialized for real world applications. But use it if read performance is crucial. If you have DESC NULLS LAST in the query, use the same in the index so Postgres knows sort order matches.
Effectiveness / Performance optimization
You have to weigh cost and benefit before you create a tailored index for every query. The potential of above index largely depends on data distribution.
The index is used because it delivers pre-sorted data, and in Postgres 9.2 or later the query can also benefit from an index only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.
For few rows per customer, this is very efficient (even more so if you need sorted output anyway). The benefit shrinks with a growing number of rows per customer.
Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. Generally setting work_mem too high can have adverse effects. Consider SET LOCAL for exceptionally big queries. Find how much you need with EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more.
For many rows per customer, a loose index scan would be (much) more efficient, but that's not currently implemented in Postgres (up to v10).
There are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't.
I had a simple benchmark here which is outdated by now. I replaced it with a detailed benchmark in this separate answer.
In Postgres you can use array_agg like this:
SELECT customer,
       (array_agg(id ORDER BY total DESC))[1],
       max(total)
FROM purchases
GROUP BY customer
This will give you the id of each customer's largest purchase.
Some things to note:
- array_agg is an aggregate function, so it works with GROUP BY.
- array_agg lets you specify an ordering scoped to just itself, so it doesn't constrain the structure of the whole query. There is also syntax for how you sort NULLs, if you need to do something different from the default.
- Once we build the array, we take the first element. (Postgres arrays are 1-indexed, not 0-indexed).
- You could use array_agg in a similar way for your third output column, but max(total) is simpler.
- Unlike DISTINCT ON, using array_agg lets you keep your GROUP BY, in case you want that for other reasons.
The accepted OMG Ponies' "Supported by any database" solution has good speed from my test.
Here I provide a more complete and cleaner any-database solution using the same approach. Ties are handled (assuming you want only one row per customer, even when several records share the maximum total), and the other purchase fields (e.g. purchase_payment_id) are selected from the actual matching rows in the purchase table.
Supported by any database:
select * from purchase
join (
    select min(id) as id from purchase
    join (
        select customer, max(total) as total from purchase
        group by customer
    ) t1 using (customer, total)
    group by customer
) t2 using (id)
order by customer
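As a quick check that this query returns exactly one full row per customer even with ties, here is a small sketch using Python's built-in sqlite3 module (SQLite is one of the databases where USING works). The function name best_purchase_rows and the sample payment values are invented for this example; purchase_payment_id is the extra field mentioned above.

```python
import sqlite3


def best_purchase_rows(rows):
    """Return the full purchase row with the highest total per customer.

    Hypothetical helper; ties on total are resolved by the lowest id,
    as in the query it wraps.
    """
    con = sqlite3.connect(":memory:")
    con.row_factory = sqlite3.Row  # lets us read columns by name
    con.execute(
        "CREATE TABLE purchase ("
        "id INTEGER PRIMARY KEY, customer TEXT, total INTEGER, purchase_payment_id TEXT)"
    )
    con.executemany("INSERT INTO purchase VALUES (?, ?, ?, ?)", rows)
    cursor = con.execute(
        """
        select * from purchase
        join (
            select min(id) as id from purchase
            join (
                select customer, max(total) as total from purchase
                group by customer
            ) t1 using (customer, total)
            group by customer
        ) t2 using (id)
        order by customer
        """
    )
    result = [dict(row) for row in cursor.fetchall()]
    con.close()
    return result


# Joe has two purchases with the same maximum total (ids 1 and 3).
purchase_rows = [
    (1, "Joe", 5, "pay_a"),
    (2, "Sally", 3, "pay_b"),
    (3, "Joe", 5, "pay_c"),
    (4, "Sally", 1, "pay_d"),
]
for row in best_purchase_rows(purchase_rows):
    print(row["id"], row["customer"], row["total"], row["purchase_payment_id"])
```

With this data the sketch keeps id 1 for Joe and id 2 for Sally, each with its own purchase_payment_id.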
This query is reasonably fast especially when there is a composite index like (customer, total) on the purchase table.
t1 and t2 are subquery aliases, which can be removed depending on the database.
The using (...) clause is currently not supported in MS-SQL and Oracle db as of this edit on Jan 2017. You have to expand it yourself to e.g. on t2.id = purchase.id etc. The USING syntax works in SQLite, MySQL and PostgreSQL.
This solution is not very efficient, as pointed out by Erwin, because of the presence of subqueries:
select * from purchases p1
where total in (select max(total) from purchases where p1.customer = customer)
order by total desc;
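Apart from speed, note that this query does not break ties: every row that shares a customer's maximum total is returned. A minimal sketch with Python's built-in sqlite3 module shows this (SQLite is only a stand-in, and the function name rows_with_max_total is invented for this example):

```python
import sqlite3


def rows_with_max_total(rows):
    """Run the correlated-subquery version and return all matching rows.

    Hypothetical helper for illustration only.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER)"
    )
    con.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    result = con.execute(
        """
        select * from purchases p1
        where total in (select max(total) from purchases
                        where p1.customer = customer)
        order by total desc
        """
    ).fetchall()
    con.close()
    return result


# Joe has two purchases sharing the maximum total of 5.
tied_rows = [(1, "Joe", 5), (2, "Sally", 3), (3, "Joe", 5), (4, "Sally", 1)]
matches = rows_with_max_total(tied_rows)
print(len(matches))  # 3: both of Joe's tied rows plus Sally's best row
```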
Very fast solution
SELECT a.*
FROM purchases a
JOIN (
    SELECT customer, min( id ) as id
    FROM purchases
    GROUP BY customer
) b USING ( id );
and really very fast if table is indexed by id:
create index purchases_id on purchases (id);
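One caveat worth checking before relying on this query: as written, it picks the row with the lowest id per customer, which is the largest purchase only when the lowest id happens to carry the highest total. The sketch below runs it through Python's built-in sqlite3 module on data where the two differ; SQLite is only a stand-in and the function name lowest_id_row_per_customer is invented for this example.

```python
import sqlite3


def lowest_id_row_per_customer(rows):
    """Run the min(id) join and return the selected rows.

    Hypothetical helper for illustration only.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER)"
    )
    con.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    result = con.execute(
        """
        SELECT a.*
        FROM purchases a
        JOIN (
            SELECT customer, min( id ) as id
            FROM purchases
            GROUP BY customer
        ) b USING ( id )
        """
    ).fetchall()
    con.close()
    return sorted(result)  # sorted because the query has no ORDER BY


# Joe's largest purchase (total 5) has id 3, but his lowest id is 1 (total 2).
reordered = [(1, "Joe", 2), (2, "Sally", 3), (3, "Joe", 5), (4, "Sally", 1)]
print(lowest_id_row_per_customer(reordered))  # [(1, 'Joe', 2), (2, 'Sally', 3)]
```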
- If you want to select any row (by some specific condition of your own) from the set of aggregated rows.
- If you want to use another aggregate function (sum/avg) in addition to max/min, so that you cannot use the approach with DISTINCT ON.
You can use the following subquery:
SELECT
    (
        SELECT id
        FROM t2
        WHERE id = ANY ( ARRAY_AGG( tf.id ) ) AND amount = MAX( tf.amount )
    ) id,
    name,
    MAX(amount) ma,
    SUM( ratio )
FROM t2 tf
GROUP BY name
You can replace amount = MAX( tf.amount ) with any condition you want, with one restriction: this subquery must not return more than one row.
But if you want to do such things, you are probably looking for window functions.