Virtual Indexing Part 4: The NTally Table

A set-based, more performant alternative to SQL Server’s NTILE function coming right up.

Introduction

Few authors have impacted my career like Itzik Ben-Gan. If you are trying to get a grasp on how to develop high-performing, set-based SQL then a good Ben-Gan book will be a vital part of your SQL toolbox. Today Mr. Ben-Gan posted a challenge to create the fastest number generator. I offered up rangeAB which leverages Jeff Moden’s fnTally function which the fastest “number generator” in the business based on my experience.

When I first learned about the how to use a numbers table I began trying to use them to solve problems I’d never seen other people solve without a loop or other iterative method. I thought it was fun and doing so helped me grow my skills. One thing I never saw anyone do is use a numbers table to develop a set-based alternative to NTILE. I even heard people say that it’s not possible. Today we’re going to come up with two functions, each which blow the doors of NTILE with respect to speed and IO. The fist using a virtual auxiliary table of numbers (AKA “Tally Table”), the second using a physical numbers tables.

Quick NTILE overview

Before I dive into the algorithm let’s quickly review what NTILE does. NTILE is a window ranking function used to divide a set of rows as evenly as possible.

In High-Performance T-SQL Using Window Functions Ben-Gan explains it like this:

The NTILE function allows you to arrange the rows within the window partition in roughly equally sized tiles, based on the input number of tiles and specified window ordering. For example, suppose that you want to arrange the rows from the OrderValues view in 10 equally sized tiles based on val ordering. There are 830 rows in the view; hence, with 10 requested tiles, the tile size is 83 (that’s 830 divided by 10). So the first 83 rows (the first tenth) based on val ordering will be assigned with tile number 1, the next 83 with tile number 2, and so on.

High-Performance T-SQL Using Window Functions, Itzik Ben Gan

Let’s use fnTally let’s demo generate 13 rows and use NTILE to divide the rows into 4 tile groups.

Figure 1: NTILE test

DECLARE  @tiles BIGINT = 4,
         @rows  BIGINT = 13;

--==== NTILE CLASSIC
SELECT f.N, Tile = NTILE(@tiles) OVER (ORDER BY f.N)
FROM   dbo.fnTally(1,@rows) AS f;

Figure 1.2: NTILE test results

N     Tile
----- -------
1     1
2     1
3     1
4     1
5     2
6     2
7     2
8     3
9     3
10    3
11    4
12    4
13    4

To calculate the number of tiles required, for each number, from 1 to @tile, the quantity is calculated by taking the @rows/@tiles, then adding 1 for each number that is less than, or equal to: @rows%@tiles. In the example below we get 3 tile numbers for 1,2,3 and 4 (13/4 = 3). Then the number 1 gets an additional member because 1<=(13%4). If add a row then 2 will also get an additional member (14%4 = 2), if we set @rows to 15 then the number 1,2&3 will have 4 members (15%4=3). If we set @rows to 16 we get 4 rows for the numbers 1,2,3 & 4 (16/4 = 4, 16%4 = 0.)

This is how NTILE works, this is how our alternative must also work.

The NTally Table – What it is and how it replaces NTILE

This heading is a nod to The “Numbers” or “Tally” Table: What it is and how it replaces a loop. You can think of an NTally table as a tally table with tiles or a tiled tally table. I developed my first “NTally” in 2014 using a physical numbers table (the DDL for my numbers table can be found here.) An example of my first NTally table is posted here: Odd (to me) Behavior with NTILE() and NULL Values in Source Data. I have since improved on the physical numbers table version and developed one using a virtual table. Let’s examine the logic.

Figure 2.1: NTally Logic

DECLARE  @tiles BIGINT = 4,
         @rows  BIGINT = 13,
         @desc  BIT    = 0;

--==== NTILE CLASSIC
SELECT RN = f.N, Tile = NTILE(@tiles) OVER (ORDER BY f.N)
FROM   dbo.fnTally(1,@rows) AS f;

--==== ALTERNATIVE #1: PHYSICAL Auxiliary table of Numbers (dbo.NTally)
SELECT RN = ROW_NUMBER() OVER (ORDER BY t.N), Tile = t.N
FROM   dbo.tally AS t
CROSS APPLY
(
  SELECT TOP(@rows/@tiles+IIF(t.N<=@rows%@tiles,1,0)) $
  FROM       dbo.tally AS t1
  CROSS JOIN dbo.tally AS t2
) AS x(x)
WHERE t.N<=@tiles;

--==== ALTERNATIVE #2: PHYSICAL Auxiliary table of Numbers (dbo.NTallyRangeAB)
SELECT      RN     = ROW_NUMBER() OVER (ORDER BY t.RN),
            Tile   = t.RN
FROM        core.rangeAB(1,@tiles,1,1)                                  AS t
CROSS APPLY (VALUES(IIF(@desc=0,t.RN,t.OP)))                            AS d(D)
CROSS APPLY core.rangeAB(1,@rows/@tiles+IIF(d.D<=@rows%@tiles,1,0),1,1) AS x;

Each query returns the same thing except that I renamed “N” to “RN”.

Figure 2.2: NTally Results

RN    Tile
----- ------
1     1
2     1
3     1
4     1
5     2
6     2
7     2
8     3
9     3
10    3
11    4
12    4
13    4
The functions:
Simple Demo

The NTally table is not ready to replace NTILE out of the box, a small amount of assembly is required; this is what the RN column is for. Below is an example of how to “tile” a table using NTILE followed by examples which leverage dbo.NTally and dbo.NTallyRangeAB, which leverages dbo.rangeAB.

The NTally solutions are easy to assemble: instead of “NTILE(@tiles) OVER (ORDER BY s.SomeID)” we’ll use ROW_NUMBER inside a CTE with the same ORDER BY. With the row number in place (RN) we can use it to join to our NTally table and return a “tiled” result set without NTILE.

Note that the function needs the number of rows ahead of time, a simple COUNT of the rows passed to the function as the @rows parameter does the trick. This will add a little penalty but we’ll still see a huge time/IO gains nonetheless.

Figure 3.1: NTally Functional Demo

--==== 1. Sample Data
IF OBJECT_ID('tempdb..#SomeTable') IS NOT NULL DROP TABLE #SomeTable;
CREATE TABLE #SomeTable
(
  SomeId BIGINT PRIMARY KEY
);
INSERT #SomeTable(SomeId) SELECT f.N FROM dbo.GetNums(1,14) AS f;

--==== 2. Parameters and examples
DECLARE  @tiles BIGINT = 5,
         @rows  BIGINT = 14,
         @desc  BIT    = 0;

PRINT CHAR(13)+'NTILE CLASSIC'+CHAR(13)+REPLICATE('-',90);
  SELECT s.SomeId, Tile = NTILE(@tiles) OVER (ORDER BY s.SomeID)
  FROM   #SomeTable AS s;

PRINT CHAR(13)+'NTally'+CHAR(13)+REPLICATE('-',90);
  WITH anchor AS
  (
   SELECT s.SomeID, ROW_NUMBER() OVER (ORDER BY s.SomeID) AS rn
   FROM   #SomeTable AS s
  )
  SELECT      a.SomeID, nt.Tile
  FROM        anchor AS a
  CROSS JOIN  dbo.NTally(@tiles, (SELECT COUNT(*) FROM #SomeTable)) AS nt
  WHERE       a.rn = nt.rn

PRINT CHAR(13)+'NTallyRangeAB'+CHAR(13)+REPLICATE('-',90);
  WITH anchor AS
  (
   SELECT s.SomeID, ROW_NUMBER() OVER (ORDER BY s.SomeID) AS rn
   FROM   #SomeTable AS s
  )
  SELECT      a.SomeID, nt.Tile
  FROM        anchor AS a
  CROSS APPLY core.NTallyRangeAB(@tiles, (SELECT COUNT(*) FROM #SomeTable),@desc) AS nt
  WHERE       a.rn = nt.rn;

Each query returns the correct result set:

Figure 3.2: Results

SomeID  Tile
------- -------
1       1
2       1
3       1
4       2
5       2
6       2
7       3
8       3
9       3
10      4
11      4
12      4
13      5
14      5

If you have kept up with this series you can think of dbo.NTally as an indexed function and dbo.NTallyRangeAB as a function that leverages the virtual index. Now for performance.

Test #1: 1,000,000 Rows – Time and IO

I’m running SQL 2019 which allows for batch mode on rowstore. I’m going to run the NTILE query twice, first as-is, then with batch mode disabled. I can’t compete with batch mode but we’ll get close. We’ll compare the results with the NTally functions. Unlike the NTILE examples here, the NTally versions get faster when they get a parallel plan; I excluded the parallel NTILE test because it was too awful.

Note, too, that the optimizer tries to use a Hash Match join algorithm for joining the row numbers. A merge join is the right join for this task; in the examples below I’ll let the optimizer pick the join, then force a merge join so you can see the performance difference.

Figure 4: 1M Row Test

IF OBJECT_ID('tempdb..#SomeTable') IS NOT NULL DROP TABLE #SomeTable;
CREATE TABLE #SomeTable
(
  SomeId BIGINT PRIMARY KEY
);
INSERT #SomeTable(SomeId) SELECT f.N FROM dbo.GetNums(1,1000000) AS f;

SET STATISTICS TIME, IO ON;
  DECLARE @Tiles BIGINT = 3, @NT BIGINT;

  PRINT CHAR(13)+'NTILE CLASSIC with Batch Mode'+CHAR(13)+REPLICATE('-',90);
  SELECT @NT = NTILE(@Tiles) OVER (ORDER BY s.SomeID)
  FROM   #SomeTable AS s;

  PRINT CHAR(13)+'NTILE CLASSIC No Batch Mode'+CHAR(13)+REPLICATE('-',90);
  SELECT @NT = NTILE(@Tiles) OVER (ORDER BY s.SomeID)
  FROM   #SomeTable AS s
  OPTION(USE HINT('DISALLOW_BATCH_MODE')); -- Crawls when Parallel

  PRINT CHAR(13)+'dbo.NTALLY SERIAL, Hash Join'+CHAR(13)+REPLICATE('-',90);
  WITH anchor AS
  (
   SELECT s.SomeID, ROW_NUMBER() OVER (ORDER BY s.SomeID) AS rn
   FROM   #SomeTable AS s
  )
  SELECT      @NT = nt.Tile -- a.SomeID, nt.* --AS TileGroup
  FROM        anchor AS a
  CROSS APPLY dbo.NTally(3, (SELECT COUNT(*) FROM #SomeTable)) AS nt
  WHERE       a.rn = nt.rn
  OPTION (MAXDOP 1);

  PRINT CHAR(13)+'dbo.NTally SERIAL, Merge Join'+CHAR(13)+REPLICATE('-',90);
  WITH anchor AS
  (
   SELECT s.SomeID, ROW_NUMBER() OVER (ORDER BY s.SomeID) AS rn
   FROM   #SomeTable AS s
  )
  SELECT      @NT = nt.Tile -- a.SomeID, nt.* --AS TileGroup
  FROM        anchor AS a
  INNER 
  MERGE JOIN dbo.NTally(@Tiles, (SELECT COUNT(*) FROM #SomeTable)) AS nt
    ON       a.rn = nt.rn
  OPTION (MAXDOP 1);

  PRINT CHAR(13)+'dbo.NTally PARALLEL, Merge Join'+CHAR(13)+REPLICATE('-',90);
  WITH anchor AS
  (
   SELECT s.SomeID, ROW_NUMBER() OVER (ORDER BY s.SomeID) AS rn
   FROM   #SomeTable AS s
  )
  SELECT      @NT = nt.Tile -- a.SomeID, nt.* --AS TileGroup
  FROM        anchor AS a
  INNER 
  MERGE JOIN dbo.NTally(@Tiles, (SELECT COUNT(*) FROM #SomeTable)) AS nt
    ON       a.rn = nt.rn
  CROSS JOIN dbo.make_parallel() AS x;

  PRINT CHAR(13)+'dbo.NTallyRangeAB SERIAL, Hash Join'+CHAR(13)+REPLICATE('-',90);
  WITH anchor AS
  (
   SELECT s.SomeID, ROW_NUMBER() OVER (ORDER BY s.SomeID) AS rn
   FROM   #SomeTable AS s
  )
  SELECT      @NT = nt.Tile -- a.SomeID, nt.* --AS TileGroup
  FROM        anchor AS a
  CROSS APPLY core.NTallyRangeAB(@Tiles, (SELECT COUNT(*) FROM #SomeTable),0) AS nt
  WHERE       a.rn = nt.rn
  OPTION (MAXDOP 1);

  PRINT CHAR(13)+'dbo.NTallyRangeAB SERIAL, Merge Join'+CHAR(13)+
    REPLICATE('-',90);
  WITH anchor AS
  (
   SELECT s.SomeID, ROW_NUMBER() OVER (ORDER BY s.SomeID) AS rn
   FROM   #SomeTable AS s
  )
  SELECT      @NT = nt.Tile -- a.SomeID, nt.* --AS TileGroup
  FROM        anchor AS a
  INNER 
  MERGE JOIN core.NTallyRangeAB(@Tiles, (SELECT COUNT(*) FROM #SomeTable),0) AS nt
    ON       a.rn = nt.rn
  OPTION (MAXDOP 1);

  PRINT CHAR(13)+'dbo.NTallyRangeAB Parallel, Merge Join'+CHAR(13)+
    REPLICATE('-',90);
  WITH anchor AS
  (
   SELECT s.SomeID, ROW_NUMBER() OVER (ORDER BY s.SomeID) AS rn
   FROM   #SomeTable AS s
  )
  SELECT      @NT = nt.Tile -- a.SomeID, nt.* --AS TileGroup
  FROM        anchor AS a
  INNER 
  MERGE JOIN core.NTallyRangeAB(@Tiles, (SELECT COUNT(*) FROM #SomeTable),0) AS nt
    ON       a.rn = nt.rn
  CROSS JOIN dbo.make_parallel() AS x;
SET STATISTICS TIME, IO OFF;
GO

Figure 4.2 Results:

NTILE CLASSIC with Batch Mode
------------------------------------------------------------------------------------------
Table '#SomeTable_00000001CF14'. Scan count 1, logical reads 2108...
Table 'Worktable'. Scan count 1, logical reads 46274, physical reads 0...

 SQL Server Execution Times: CPU time = 281 ms,  elapsed time = 284 ms.

NTILE CLASSIC No Batch Mode
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 3, logical reads 2867958, physical reads 0... 
Table '#SomeTable_00000001CF14'. Scan count 1, logical reads 2108...

 SQL Server Execution Times: CPU time = 2375 ms,  elapsed time = 2378 ms.

dbo.NTALLY SERIAL, Hash Join
------------------------------------------------------------------------------------------
Table '#SomeTable_00000001CF14'. Scan count 7, logical reads 14750
Table 'tally'. Scan count 3, logical reads 1756, physical reads 0...

 SQL Server Execution Times: CPU time = 984 ms,  elapsed time = 997 ms.

dbo.NTally SERIAL, Merge Join
------------------------------------------------------------------------------------------
Table '#SomeTable_00000001CF14'. Scan count 7, logical reads 14750, physical reads 0...
Table 'Worktable'. Scan count 0...
Table 'tally'. Scan count 3, logical reads 599, physical reads 0...

 SQL Server Execution Times: CPU time = 922 ms,  elapsed time = 917 ms.

dbo.NTally PARALLEL, Merge Join
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 0...
Table 'Worktable'. Scan count 0...
Table 'tally'. Scan count 9, logical reads 601, physical reads 0...
Table '#SomeTable_00000001CF14'. Scan count 7, logical reads 14750, physical reads 0

 SQL Server Execution Times: CPU time = 2408 ms,  elapsed time = 790 ms.

dbo.NTallyRangeAB SERIAL, Hash Join
------------------------------------------------------------------------------------------
Table '#SomeTable_00000001CF14'. Scan count 7, logical reads 14750, physical reads 0

 SQL Server Execution Times: CPU time = 906 ms,  elapsed time = 946 ms.

dbo.NTallyRangeAB SERIAL, Merge Join
------------------------------------------------------------------------------------------
Table '#SomeTable_00000001CF14'. Scan count 7, logical reads 14750, physical reads 0...
Table 'Worktable'. Scan count 0...

 SQL Server Execution Times: CPU time = 969 ms,  elapsed time = 977 ms.

dbo.NTallyRangeAB Parallel, Merge Join
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 0, logical reads 0...
Table '#SomeTable_00000001CF14'. Scan count 7, logical reads 14750, physical reads 0...

 SQL Server Execution Times: CPU time = 1639 ms,  elapsed time = 535 ms.

Completion time: 2020-12-10T20:02:28.6874120-06:00

NTILE with batch mode runs for 284MS with a serial plan. Without batch mode, however, just can’t compete with the dbo.NTally or dbo.NTallyRangeAB. First, check those reads – 2,867,958 and it runs for 2 seconds. Each NTally solution runs for under 1 second with the best performance being NTallyRangeAB as the winner at 1/2 a second and 14,750 reads – 4X speed increase and a 99.5% I/O reduction.

No Looking Back

I snuck a bonus feature on dbo.NTallyRangeAB. You may have noticed that I added a third column named TileOP, this is just the tile group in descending order with the distribution flipped, e.g. the “overflow” tiles are assigned to the highest tile numbers first. For this we simply return TileOP instead of the Tile row. Let’s use our 14 row test from figure 3.2

Figure 5.1 NTallyRangeAB TileOP Demo:

DECLARE  @tiles BIGINT = 5,
         @rows  BIGINT = 14,
         @desc  BIT    = 0;

WITH anchor AS
(
 SELECT s.SomeID, ROW_NUMBER() OVER (ORDER BY s.SomeID) AS rn
 FROM   #SomeTable AS s
)
SELECT      a.SomeID, nt.Tile, nt.TileOP
FROM        anchor AS a
CROSS APPLY core.NTallyRangeAB(@tiles, (SELECT COUNT(*) FROM #SomeTable),@desc) AS nt
WHERE       a.rn = nt.rn
ORDER BY    a.SomeId

I returned the Tile and TileOP columns so you can see the difference.

Figure 5.2. Results:

SomeID   Tile   TileOP
-------- ------ --------
1        1      5
2        1      5
3        1      5
4        2      4
5        2      4
6        2      4
7        3      3
8        3      3
9        3      3
10       4      2
11       4      2
12       4      2
13       5      1
14       5      1

Some interesting lessons here. First, NTILE teamed with batch mode is nasty fast and tough to beat. If batch mode is not an option then NTally and NTallyRangeAB are the only game in town as far as I’m concerned .

This concludes Part 4 of this series. Forgive any typos – this is the first/last draft. Cheers!

Author: Alan Burstein

SQL Performance Ninja with 20+ years of writing high-performing, elegant SQL code. Expert at developing super fast NLP algorithms

3 thoughts on “Virtual Indexing Part 4: The NTally Table”

  1. So have you tried testing the performance of MAXDOP 1 when calling your fnTally table, per Itzik’s example in that number series discussion we’ve been having? He claims there’s a performance benefit…didn’t know if that would apply to this?

Leave a Reply