Virtual Indexing Part 5: The Multidimensional Virtual Index

In this installment of the Virtual Indexing series we’ll examine how to combine indexes to return a distinct set of numbers faster than using GROUP BY or DISTINCT

Introduction

As you may be aware, the challenge is on! The primary objective of this article is to further illustrate the power of rangeAB.  In Part 1 of this series I introduced the Virtual Index, the suitably sorted stream returned by SQL window functions which can be used for sorting, grouping and alike. In Part 3 we introduced the Indexed Function as well as my RangeAB functions. In this installment we’re going to combine a physical index to a virtual one to create what I call, a multi-dimensional virtual Index, and use it to return s distinct set of numbers from a very large set. Based on my testing, this technique completely decimates DISTINCT, GROUP BY, ROW_NUMBER or any other method for returning a distinct set.

The Problem

A common annoyance for me when developing ETL packages, is when I need to extract millions of non-distinct numbers into a table then use that table as a lookup downstream. The first solution to this problem is to perform a SELECT DISTINCT against that table each time I need unique values. A better solution is to create a new table to store a distinct set of values and then reference that table downstream. Either way, a DISTINCT against millions of rows can be slow and IO intensive, even with an index in place.

Sample Data

WARNING – the code below, as-is, creates One Billion rows in ~45 minutes. I need to use that many rows because of how efficient this technique is. If you are playing along at home be sure to adjust the @ii as needed, each iteration returns 10,000,000 rows. @ii is set to 100 for my billion-row test (100*10000000)= one billion. With the clustered index in place the sample data takes up 200MB/100 Million rows, a total of 20GB for one billion.

If you’re playing along at home you’ll need a copy of my random numbers function. The routine works like this: first it creates 15,000 random numbers between 1 and 25,000. This creates gaps in the sequence which will matter later in this series. The while loop then cross joins #t to itself and re-inserts 10,000,000 duplicates for each iteration (@ii). The random numbers function is attached here:

Figure 1. The Setup Routine

--==== WARNING – AS-IS, this routine will create 1,000,000,000 rows in 45 minutes...
--==== On my PC this runs about 4 minutes per 10 iterations
--==== With that in mind – set @ii accordingly!
IF OBJECT_ID('tempdb..#t') IS NOT NULL DROP TABLE #t;
CREATE TABLE #t (N INT NOT NULL INDEX nc_n CLUSTERED(N));
GO
INSERT #t SELECT r.RND FROM dbo.randomNumbers(15000,1,25000) AS r;
GO
--==== 1 Billion in 45 minutes  at 10M 100X batches (100M in 4 minutes)
DECLARE 
  @st DATETIME = GETDATE(),
  @i  INT      = 1,
  @ii INT      = 100;

WHILE @i<=@ii
BEGIN
  PRINT  @i;
  INSERT #t SELECT TOP (10000000)  t.N FROM #t AS t, #t AS t2 ORDER BY (SELECT NULL);
  SET @i+=1
END;

Returning unique values from a large set – Old School

Below is the typical way we would retrieve a unique set from the sample data stored in #t. Let’s run a quick performance test, first with a serial execution plan, the second with a parallel plan.

Figure 2.1. Returning a DISTINCT set using GROUP BY

PRINT CHAR(10)+'GROUP BY - serial'+CHAR(10)+REPLICATE('-',90);
GO
DECLARE @st DATETIME = GETDATE(), @N BIGINT;
  SELECT   @N=t.N
  FROM     #t AS t
  GROUP BY t.N
  OPTION (MAXDOP 1);
PRINT DATEDIFF(MS,@st,GETDATE());
GO
PRINT CHAR(10)+'GROUP BY - parallel'+CHAR(10)+REPLICATE('-',90);
GO
DECLARE @st DATETIME = GETDATE(), @N BIGINT;
  SELECT   @N=t.N
  FROM     #t AS t
  GROUP BY t.N
  OPTION (QUERYTRACEON 8649);
PRINT DATEDIFF(MS,@st,GETDATE());
GO

Note the straightforward execution plan. All billion+ rows are retrieved from #t then a hash match is used to filter out duplicates. With a serial plan the query runs for 2 minutes, with a parallel plan it runs for 12 seconds.

Figure 2.2. Old School Execution plan

Returning a distinct set using GROUP BY

Returning a distinct set from a large set of numbers – New School

Now we’re going to use RangeAB to return a unique set, note my comments.

Figure 3.1. Using RangeAB to return a distinct set

  SELECT N = fnFilter.N1
  FROM  (SELECT MAX(t.N) FROM #t AS t) AS b(Mx) -- (1) get the highest N from #t: MAX(N)
  CROSS APPLY
  (                         -- (2) Call RangeAB to retrieve the numbers 1 through MAX(N)
    SELECT       r.RN, r.N1 -- (3) Only return numbers from rangeAB that exist in #t
    FROM         core.rangeAB(1,b.Mx,1,1) AS r  -- Note that b.Mx is the MAX(#t.N) 
    CROSS APPLY (SELECT TOP (1) 1 FROM #t AS t WHERE t.N = r.N1) AS z(x) -- gets row goals
  ) AS fnFilter -- Inline Function for filtering
  GROUP BY fnFilter.N1      -- (4) Use GROUP BY to eliminate the 1-many between r.N1 & t.N
  ORDER BY fnFilter.N1;     -- Not required but no penalty

The result set comes back instantly. It doesn’t matter if I force a serial or parallel plan. This, regardless of whether I leverage batch mode processing. This is true with both RangeAB and RangeAB2. If I run this with statistics time and IO on, we see:

Figure 3.2. STATISTICS TIME, IO results

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
Table 'Worktable'. Scan count 0, logical reads 0...
Table '#t_00000001E3BF'. Scan count 25000, logical reads 108801, physical reads 0...

 SQL Server Execution Times: CPU time = 78 ms,  elapsed time = 174 ms.

That’s one heck of an improvement. Next let’s review what’s happening.

How it Works

Instead of dragging out all billion+ rows we are retrieving the numbers 1 to MAX(#t.N) from our tally table – approximately 25,000 rows. The execution plan reveals how the magic happens:

Figure 4.1. New School (using RangeAB) execution plan details

Note that we see 27,000 rows are pulled from our tally table – I don’t know what the extra 2,000 rows are about, but it won’t impact performance here

The query scans #t once for each unique number (N1) returned by our numbers table function but only returns them if there is a match. For each of the 25,000 possible numbers returned from the range function a scan is performed against #t for the number’s existence. This is where the “Scan count 25000” from Figure 3.2 comes from. 25,000 scans may seem excessive, but that number represents one-row index seeks which is nothing compared to a scan against a billion rows. In this example, there are 11,335 unique numbers – the exact number of rows returned from #t.

This two dimensional matrix will help visualize what is happening: instead of scanning four rows from #t to determine that 3 is one of the unique values, we are visiting each possible value only once from virtual index on RangeAB.

Figure 4.2. 2-Dimensional Matrix

Multidimensional Index Performance Test

For the test data I am leveraging the setup routine from Figure 1. Note that I run my alternative queries five times each but the old school, GROUP BY versions only run once. This is because they are that much slower.

Figure 5.1. New School Test Harness

DECLARE @rowz INT = (SELECT COUNT(*) FROM #t);
PRINT CONCAT('Multidimensional Virtual Index Test:, ',@rowz,' rows, N=1-25K'+
  CHAR(10),REPLICATE('-',90));
GO
PRINT CHAR(10)+'GROUP BY - serial'+CHAR(10)+REPLICATE('-',90);
GO
DECLARE @st DATETIME = GETDATE(), @N BIGINT;
  SELECT   @N=t.N
  FROM     #t AS t
  GROUP BY t.N
  OPTION (MAXDOP 1);
PRINT DATEDIFF(MS,@st,GETDATE());
GO
PRINT CHAR(10)+'GROUP BY - parallel'+CHAR(10)+REPLICATE('-',90);
GO
DECLARE @st DATETIME = GETDATE(), @N BIGINT;
  SELECT   @N=t.N
  FROM     #t AS t
  GROUP BY t.N
  OPTION (QUERYTRACEON 8649);
PRINT DATEDIFF(MS,@st,GETDATE());
GO
--==== MAGIC WITH CLUSTERED INDEX, NO INDEX = DEATH ON HUGE GAPS
PRINT CHAR(10)+'APPLY + TOP 1 - Serial'+CHAR(10)+REPLICATE('-',90);
GO
DECLARE @st DATETIME = GETDATE(), @N BIGINT;
  SELECT @N = fnFilter.N1
  FROM  (SELECT MAX(t.N) FROM #t AS t) AS b(Mx) -- (1) get the highest N from #t: MAX(N)
  CROSS APPLY               -- (2) Call RangeAB to retrieve the numbers 1 through MAX(N)
  (
    SELECT       r.RN, r.N1 -- (3) Only return numbers from rangeAB that exist in #t
    FROM         core.rangeAB(1,b.Mx,1,1) AS r
    CROSS APPLY (SELECT TOP (1) 1 FROM #t AS t WHERE t.N = r.N1) AS z(x)
  ) AS fnFilter
  GROUP BY fnFilter.N1      -- (4) Use GROUP BY to eliminate the 1-many between r.N1 & t.N
  ORDER BY fnFilter.N1      -- Not required but no penalty
  OPTION (MAXDOP 1);
PRINT DATEDIFF(MS,@st,GETDATE());
GO 5
PRINT CHAR(10)+'APPLY + TOP 1 - Parallel'+CHAR(10)+REPLICATE('-',90);
GO
DECLARE @st DATETIME = GETDATE(), @N BIGINT;
  SELECT @N = fnFilter.N1
  FROM  (SELECT MAX(t.N) FROM #t AS t) AS b(Mx)
  CROSS APPLY
  (
    SELECT      r.RN, r.N1
    FROM        core.rangeAB(b.Mn,b.Mx,1,1) AS r
    CROSS APPLY (SELECT TOP (1) 1 FROM #t AS t WHERE t.N = r.N1) AS z(x)
  ) AS fnFilter
  GROUP BY fnFilter.N1
  ORDER BY fnFilter.N1 -- No penalty
  OPTION (QUERYTRACEON 8649);
PRINT DATEDIFF(MS,@st,GETDATE());
GO 5

Figure 5.2. New School test results

Multidimensional Virtual Index, 1000015000 rows, N=1-25K
------------------------------------------------------------------------------------------

GROUP BY - serial
------------------------------------------------------------------------------------------
111353

GROUP BY - parallel
------------------------------------------------------------------------------------------
12147

APPLY + TOP 1 - Serial
------------------------------------------------------------------------------------------
Beginning execution loop
70
66
70
63
64
Batch execution completed 5 times.

APPLY + TOP 1 - Parallel
------------------------------------------------------------------------------------------
Beginning execution loop
84
97
100
77
100
Batch execution completed 5 times.

Completion time: 2020-12-12T12:21:49.8432474-06:00

The old school GROUP BY solution runs for 113 seconds with a serial plan and speeds up by a factor of ten (nearly) with a parallel plan. The rangeAB version runs for 100MS or less with serial or parallel execution plans. It’s interesting to note that parallel execution does not improve the time for the RangeAB solution. In Part 6 of this series we’ll add an additional dimension to speed things up even more especially with a parallel execution plan. See you there, thanks for reading!

Author: Alan Burstein

SQL Performance Ninja with 20+ years of writing high-performing, elegant SQL code. Expert at developing super fast NLP algorithms

One thought on “Virtual Indexing Part 5: The Multidimensional Virtual Index”

Leave a Reply

%d bloggers like this: