Before you continue
This post is intended for new and old data professionals alike. To get the most of this article you should have a basic understanding of tally tables and Jeff Moden’s tally table function, fnTally. You also need a basic understanding of how to use APPLY table operator; if not read Understanding and Using APPLY (Part 1), Part 2 and The Cascading (CROSS) APPLY. Each is a quick read and can change your career.
Less Code, More Fast – this is my coding philosophy in four words and the tagline for this site. Replacing bloated iterative procedural code with clean, set-based, pure, reusable code makes it easy to deploy code faster and with fewer bugs. Veteran programmers who write high-performing declarative code, such as SQL Developers, know that pure, set-based, loop-free code performs exponentially faster than code that uses loops, mutable data structures and procedural constructs such as “goto”. A correctly written tally table function, such as fnTally, count from 0 to 1,000,000,000 in 90 seconds on my laptop. I wouldn’t even try with a loop. Replacing loops with a tally table (or some other form of lazy sequence) is a low-risk, low-cost, high-reward way to write cleaner, faster code.
The Virtual Index
One small benefit to writing loops to perform tasks in your functions and procedures is that you don’t have to worry about sorting. With a set-based approach that leverages a tally table, if the order of the numbers returned matters then the optimizer will have to sort the numbers for you unless they are presorted, AKA indexed. When using physical tally table that is properly indexed sorting is not a concern, but what about with a tally table function like fnTally? You can’t index a function right?
Returning the numbers 1 to 7, in order, without a sort
An important consideration when replacing a loop with a tally table is sorting. The most common way to handle sorting in the SQL world is to presort using an index. fnTally is a function, you can’t add an index to a function so how do you return the numbers [ 1 2 3 4 5 6 7] in that order without sorting them? You can’t unless they are already sorted. fnTally returns it’s numbers as an ordered set. This ordered stream of numbers is what I refer to as the virtual index. Virtual indexing can be thought of as the art/science of:
- Understanding what the virtual index is
- Knowing when\how to use the virtual index
- Understanding the alternatives and when to use them
If you are a set- based SQL coding warrior then you know that you can use ROW_NUMBER to create a Virtual Auxiliary Table of Numbers, AKA: tally table. With that virtual table comes with a virtual index. To understand and appreciate the difference between a virtual index and a physical index, lets create a mini-numbers temp table with the numbers 1 to 6. We won’t include an index which means that the numbers are unordered.
Figure 1 – Mini tally table (#nums)
--==== Create a small Number's table IF OBJECT_ID('tempdb..#nums','U') IS NOT NULL DROP TABLE #nums; CREATE TABLE #nums (N INT NOT NULL); INSERT #nums VALUES (1),(2),(3),(4),(5),(6);
This table could be better described as a “tally heap” or a “virtual auxiliary heap of numbers” Let’s run a three queries against #nums without and index present then review execution plans.
Heap vs Index Tests
First to test our queries against #nums without an index in place:
Figure 2 – Test #1: No index (Heap)
SELECT TOP(3) t.N FROM #nums AS t ORDER BY t.N; SELECT t.N, AVG(t.N) FROM #nums AS t GROUP BY t.N; SELECT [SUM] = SUM(t.N) OVER (ORDER BY t.N), [LAG] = LAG(t.N,1,0) OVER (ORDER BY t.N) FROM #nums AS t;
Figure 3 – Test #1 execution plans without an index on #nums
Each query required a sort operator. TOP with an ORDER BY, grouping (the GROUP BY), and the window functions (SUM OVER and LAG) are examples of operations which require an ordered set, AKA “suitably sorted stream.” If the stream of input values is unordered, the most common remedy is a sort operator to create the suitably sorted stream when the query runs.
Hover over the properties for the stream aggregate and sequence operators in the second and third plans to see what each operator does and what they need:
Figure 3.1. – Stream Aggregate and Sequence Project Details
Suitable Sorting with a Physical Index
Sorting at runtime should be avoided, especially when dealing with a large volume of rows. The typical sort cost is n log n which is slower than linear. It’s like the exact opposite of a bulk discount – the cost per row increases as you increase rows. Let’s presort with a unique clustered index on #nums.
Figure 4 – Index for our numbers table (#nums)
CREATE UNIQUE CLUSTERED INDEX uq_nums ON #nums(N);
Now execute the same queries from Test #1(Figure 2) with the clustered index in place.
Figure 4.1. – Execution plan with index in place on #nums
Now that #nums is presorted there isn’t a need to sort at runtime.
The Suitably Sorted Virtual Index
First for the same queries but using fnTally and the execution plan.
Figure 5: Virtual Index Test Query
SELECT TOP(3) t.N FROM dbo.fnTally(1,6) AS t ORDER BY t.N; SELECT t.N, AVG(t.N) FROM dbo.fnTally(1,6) AS t GROUP BY t.N; SELECT [SUM] = SUM(t.N) OVER (ORDER BY t.N), [LAG] = LAG(t.N,1,0) OVER (ORDER BY t.N) FROM dbo.fnTally(1,6) AS t;
Figure 5.1: Virtual Index Test Execution Plans
That’s the exact same execution plan as in figure 4.1, after the clustered index was added to #nums! The only difference is the nested loop joins on the right which are used to generate the rows. fnTally returns the numbers as an ordered set. In other words, for each query:
- Using #nums without an index forces the optimizer to add a sort operator to execution plan for handling TOP, GROUP BY and a couple window function functions (LAG and SUM() OVER())
- Adding an index to #nums presorts the values and eliminates the requirement to sort the values at runtime
- fnTally is an inline table values function(iTVF) and cannot be indexed
- In this example fnTally’s N column behaves like the indexed N column on #nums
The CTE tally table returns the numbers as an ordered set and generates an identical execution plan as when using a persisted tally table with an index. This is what I mean by, a virtual index. The virtual index here is the ordered set of numbers that ROW_NUMBER creates. There is a lot more to this topic that I look forward to exploring in Part 2 of this series.
9 thoughts on “Virtual Indexing Part 1: The Suitably Sorted Stream”
Brilliant again Alan, very interesting and thought provoking!
In other words, “I will be thinking all night” 😉
Again, good job!
Thanks again sir! It means a so much!
In Dwain Camps’ simpletalk bio it reads,
“[Dwain] has a special interest in developing solutions to complex, data intensive problems using high performance SQL because the declarative nature of SQL allows development of algorithmically unique solutions that procedural languages may not be capable of”
I’m trying to prove him right 😉