- Returning TOP N Records
- Like this? Follow me ↯
- If you like reading about sql, mysql, oracle, postgresql, top, limit, or rownum then you might also like:
- Comments
- Select Top N Rows in PostgreSQL
- Create a Table in PostgreSQL
- Use the FETCH Clause to Select the Top N Rows in PostgreSQL
- Use the LIMIT Clause to Select the Top N Rows in PostgreSQL
- Use the ROW_NUMBER() Function to Select the Top N Rows in PostgreSQL
- Use the RANK() Function to Select the Top N Rows in PostgreSQL
- Use the DENSE_RANK() Function to Select the Top N Rows in PostgreSQL
- Table Indexing in PostgreSQL
- Performance Analysis
- Atomic Spin
- Atomic Object’s blog on everything we find fascinating.
- Selecting Top N Per Group in PostgreSQL
- citusdata/postgresql-topn
- Launching GitHub Desktop
- Launching GitHub Desktop
- Launching Xcode
- Launching Visual Studio Code
- Latest commit
- Git stats
- Files
- README.md
- About
Returning TOP N Records
Returning only the first N records in a SQL query differs quite a bit between database platforms. For example, you can’t say on oracle select top 100. Here’s some of the different ways you can return the top 10 records on various database platforms (SQL Server, PostgreSQL, MySQL, etc):
Microsoft SQL Server
PostgreSQL and MySQL
Oracle
Sybase
Firebird
Due to these differences if you want to keep your code database independent you should use the maxrows attribute in the cfquery tag in ColdFusion. The tradeoffs to database independence is performance, I would expect maxrows to be slower than specifying the rows in the SQL.
PostgreSQL and MySQL have a cool feature that will let you return an arbitrary range of rows (eg return rows 10-20). This is very handy for displaying pages of records:
The above query will return rows 20-30
Like this? Follow me ↯
Returning TOP N Records was first published on April 18, 2003.
If you like reading about sql, mysql, oracle, postgresql, top, limit, or rownum then you might also like:
Comments
MySQL also supports arbitrary range of rows.
e.g.
SELECT column FROM table
LIMIT 10,20
Hi all,
Every thing is well .but when we are using query for randomly fetching record from large table
This Query take more time
«select * from
(select * from table
order by dbms_random.value)
where rownum *
from
select top
from
for all n = no.of records u need to fetch at a time.
and k = multiples of n(eg. n=5; k=0,5,10,15. )
Sybase also admits SQL Server notattion:
SELECT TOP 10 column FROM table
And it is best than
SET rowcount 10
because rowcount is a param of the connection, and have to reconfigure the connection to set rowcount = 0 for the following queries, , if you do not want to have unexpected results.
Meanwhile, Firebird supports getting arbitrary rows, too akin to PostGreSQL/MySQL
SELECT column FROM table
LIMIT 10 OFFSET 20
In Firebird 2 (released a long time ago) and newer, it’s
SELECT column FROM table
ROWS 20 TO 30
fetch records except first 10 records in the table
answer: in sql
select * from (select rownum r,emp.*from emp) where r not between 1 and 10;
Select Top N Rows in PostgreSQL
Getting the highest values in a table is an important query to know. One of its areas of application is in generating reports.
This tutorial walks you through different ways to get the N highest scores in a PostgreSQL table.
Create a Table in PostgreSQL
Consider a student_score table with N students and their respective scores as shown here:
id | studentid | score |
---|---|---|
1 | 1 | 54 |
2 | 2 | 75 |
3 | 3 | 52 |
4 | 4 | 55 |
5 | 5 | 93 |
6 | 6 | 74 |
7 | 7 | 92 |
8 | 8 | 64 |
9 | 9 | 89 |
10 | 10 | 81 |
Here’s the INSERT statement to populate the table with 1 million rows of test data:
Use the FETCH Clause to Select the Top N Rows in PostgreSQL
The FETCH clause returns only the specified number of scores. The table is sorted in descending order, with the highest value at the top and the score decreasing as there is a progression down the table.
Then the clause is used to return only 10 of the highest scores, as specified in the query below.
This query, however, returns the top 10 students with the highest scores. In this example, 100 is the highest score any student can achieve, and the query’s output shows that at least 10 students scored 100.
It doesn’t give the 10 different highest values that any student has achieved. If there is a need to get the highest 10 unique scores acquired by any student, the DISTINCT clause is used as shown here:
In addition, if there is a requirement to know how many students had the highest 10 scores recorded, this query can be used to accomplish that:
Use the LIMIT Clause to Select the Top N Rows in PostgreSQL
The LIMIT clause returns the maximum number of rows specified. In this case, 10 is the maximum number of rows to be returned.
Using this clause also requires that the rows be sorted in descending order. Here’s the query:
Use the ROW_NUMBER() Function to Select the Top N Rows in PostgreSQL
The ROW_NUMBER() function can be used to obtain the same result. The query is given below.
Use the RANK() Function to Select the Top N Rows in PostgreSQL
Here’s a query using the RANK() function:
Use the DENSE_RANK() Function to Select the Top N Rows in PostgreSQL
Here’s a query using the DENSE_RANK() function:
Table Indexing in PostgreSQL
To improve performance on a large database, you can use indexes. The score column sorted in descending order can be indexed in this example.
It is done by creating an index and adding specific conditions, in this case, sorting the score column in descending order, as shown below.
As mentioned earlier, indexing improves query performance for tables with larger records most effectively. But it isn’t advisable to use it if the table is frequently modified (witH frequent inserts and updates).
Also, PostgreSQL will not use indexes in certain cases, such as when a table scan is quicker.
Performance Analysis
Before creating the score_desc_idx index, the execution time for each operation was given as,
Clause/Function | Execution time (ms) |
---|---|
FETCH | 844 |
LIMIT | 797 |
ROW_NUMBER() | 745 |
RANK() | 816 |
DENSE_RANK() | 701 |
However, this was just after a single run. Running it multiple times gave varying times which did not exceed a range.
You can use the EXPLAIN ANALYZE clause to determine which approach is less costly for your database.
It can be seen that PostgreSQL did not use the index created earlier, as scanning the table is relatively fast in this case. Since adding an index does not make any significant change to the execution time, the index can be removed by executing this command:
It’s up to you to decide which approach works best, and any approach can be used to achieve the same result.
Atomic Spin
Atomic Object’s blog on everything we find fascinating.
Selecting Top N Per Group in PostgreSQL
For a recent project, I found myself needing to get certain data out of our database. I wanted to select the top two items, grouped by color, and sorted by when they were created. My project was being built on PostgreSQL, so I turned to my handy dandy window functions.
Window functions are similar to aggregate functions, but instead of operating on groups of records to produce a single output record, they operate on records related to the current row. We can use that to add a column to our result set that represents the rank of each item in its respective group. Here’s the completed query:
Let’s break down what’s going on in each piece. The most interesting piece is the inner query that utilizes the OVER clause.
OVER describes how to window the results. PARTITION BY splits, or partitions, the data into groups with the same value for the specified column. In this case, it builds groups of the same color. ORDER BY within an OVER clause tells PostgreSQL how to order each group.
Once you’ve built your window, you’re ready for the window function: rank() returns the position of the record within its group or window. The above results might look like:
color | other_value | created_at | rank |
red | 12 | 2016-01-22 | 1 |
red | 18 | 2016-01-23 | 2 |
red | 19 | 2016-01-24 | 3 |
blue | 12 | 2016-01-22 | 1 |
blue | 18 | 2016-01-23 | 2 |
blue | 19 | 2016-01-24 | 3 |
Now we’re getting close to the goal. However, we want just the top two from each group. We cannot use a simple LIMIT because LIMIT doesn’t know about our partitions. If we use a sub-select from these results, we can now filter by rank.
This gives us just what we wanted:
color | other_value | created_at | rank |
red | 12 | 2016-01-22 | 1 |
red | 18 | 2016-01-23 | 2 |
blue | 12 | 2016-01-22 | 1 |
blue | 18 | 2016-01-23 | 2 |
This is such a common pattern that PostgreSQL has a built-in to shortcut the process, which is great if you’re only interested in the top ranking record from each group. It’s called DISTINCT ON .
Here’s a quick example:
So next time you need to grab the top N records from each group, reach for a rank()/PARTITION query.
citusdata/postgresql-topn
Use Git or checkout with SVN using the web URL.
Work fast with our official CLI. Learn more.
Launching GitHub Desktop
If nothing happens, download GitHub Desktop and try again.
Launching GitHub Desktop
If nothing happens, download GitHub Desktop and try again.
Launching Xcode
If nothing happens, download Xcode and try again.
Launching Visual Studio Code
Your codespace will open once ready.
There was a problem preparing your codespace, please try again.
Latest commit
Git stats
Files
Failed to load latest commit information.
README.md
TopN is an open source PostgreSQL extension that returns the top values in a database according to some criteria. TopN takes elements in a data set, ranks them according to a given rule, and picks the top elements in that data set. When doing this, TopN applies an approximation algorithm to provide fast results using few compute and memory resources.
The TopN extension becomes useful when you want to materialize top values, incrementally update these top values, and/or merge top values from different time intervals. If you’re familiar with the PostgreSQL HLL extension, you can think of TopN as its cousin.
When to use TopN
TopN becomes helpful when serving customer-facing dashboards or running analytical queries that need sub-second responses. Ranking events, users, or products in a given dimension becomes important for these workloads.
TopN is used by customers in production to serve real-time analytics queries over terabytes of data.
Calculating TopN elements in a set by applying count, sort, and limit is simple. As data sizes increase however, this method becomes slow and resource intensive.
The open source TopN extension enables you to serve instant and approximate results to TopN queries. To do this, you first materialize top values according to some criteria in a data type. You can then incrementally update these top values, or merge them on-demand across different time intervals.
TopN was originally created to help Citus Data customers, who needed to scale out their PostgreSQL databases across dozens of machines. These customers needed to compute top values over terabytes of data in less than a second. We realized that the broader Postgres community could benefit from TopN , and decided to open source it for all users.
How does TopN work
The TopN approximation algorithm keeps a predefined number of frequent items and counters. If a new item already exists among these frequent items, the algorithm increases the item’s frequency counter. Else, the algorithm inserts the new item into the counter list when there is enough space. If there isn’t enough space, the algorithm evicts the bottom half of all counters. Since we typically keep counters for many more items (e.g. 100*N) than we are actually interested in, the actual top N items are unlikely to get evicted and will typically have accurate counts.
You can increase the algoritm’s accuracy by increasing the predefined number of frequent items/counters.
Once you have PostgreSQL, you’re ready to build TopN. For this, you will need to include the pg_config directory path in your make command. This path is typically the same as your PostgreSQL installation’s bin/ directory path. For example:
You can run the regression tests as the following.
In this example, we take example customer reviews data from Amazon. We’re then going to analyze the most reviewed products based on different criteria.
Let’s start by downloading and decompressing source data files.
Next, we’re going to connect to PostgreSQL and create the TopN extension.
Let’s then create our example table and load data into it.
Now, we’re going to create an aggregation table that captures the most popular products for each month. We’re then going to materialize top products for each month.
From this table, you can compute the most popular/reviewed product for each day, in the blink of an eye.
You can also instantly find the top 10 reviewed products across any time interval, in this case January.
Or, you can quickly find the most reviewed product for each month in 2000.
TopN provides the following user-defined functions and aggregates.
A PostgreSQL type to keep the frequent items and their frequencies.
This is the aggregate add function. It creates an empty JSONB and inserts series of item from given column to create aggregate summary of these items. Note that the value must be TEXT type or casted to TEXT .
This is the aggregate for union operation. It merges the JSONB counter lists and returns the final JSONB which stores overall result.
Gives the most frequent n elements and their frequencies as set of rows from the given JSONB .
Adds the given text value as a new counter into the JSONB and returns a new JSONB if there is an enough space for one more counter. If not, the counter is added and then the counter list is pruned.
Takes the union of both JSONB s and returns a new JSONB .
Sets the number of counters to be tracked in a JSONB . If at some point, the current number of counters exceed topn.number_of_counters * 3, the list is pruned. The default value is 1000 for topn.number_of_counters . When you increase this setting, TopN uses more space and provides more accurate estimates.
TopN is compatible with the PostgreSQL 9.6, 10, 11, 12, and 13 releases. TopN is also compatible with all supported Citus releases, including Citus 6.x, 7.x, 8.x, and 9.x. If you need to run TopN on a different version of PostgreSQL or Citus, please open an issue. Opening a pull request (PR) is also highly appreciated.
The TopN extension to Postgres was created by and is maintained by the Citus database team, now part of Microsoft. The Citus team also created Citus, an open source extension to Postgres that transforms Postgres into a distributed database.
About
TopN is an open source PostgreSQL extension that returns the top values in a database according to some criteria