Virtual Indexing Part 6: Combinatorial Index Optimization

In Part 5 we learned how to return a distinct set of numbers from a one-billion-row table in under a tenth of a second. We can do better!

Disclaimer

I’ve never made it to Part 6 of a technical series. For me, a huge obstacle to getting quality content out in a timely manner is the proof-reading and final editing process. Like a few of my friends in this field, I have many new techniques and programming concepts that I’m dying to share which have been decaying in MS Word Docs and .sql files for years. I started my NTally article in 2014 for example.

In September, while quarantined and after listening to too much David Goggins, I decided to step up my writing (and research) game and simply began publishing the first draft of my work the moment I’m done authoring it… What I’m trying to say is, I’ve decided to err on the side of delivering content. This means there will be typos, now and in the future. Sorry. I feel better getting it done, typos and all, than taking my good ideas to the grave. If you are a writer/blogger or are aspiring to be one, I suggest following my lead and just getting it done – it feels so much better! đŸ˜‰

Introduction

This series began with the introduction of the Virtual Index, the suitably sorted stream of numbers returned by T-SQL ROW_NUMBER which acts much like a physical index with respect to sorting, grouping and filtering. Unlike a physical index, created using DDL, the virtual index is invoked with a SELECT statement against fnTally and goes away when your query is finished. No disk space to consume, no fragmentation to address, nothing to check into your CI/CD system – just a magical, in-memory index (suitably sorted stream) which comes and goes as requested.
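To make the idea concrete outside of SQL Server, here is a tiny Python sketch of the concept – a loose analogy, not T-SQL. A tally function is just a lazily generated, pre-sorted stream of integers that exists only while it is being consumed:

```python
def fn_tally(low, high):
    """Loose model of an fnTally-style function: a suitably sorted
    stream of integers generated on demand (no storage, no
    fragmentation, nothing persisted)."""
    return range(low, high + 1)  # lazy; materializes nothing up front

# The "index" exists only while the query consumes it:
stream = fn_tally(1, 5)
assert list(stream) == [1, 2, 3, 4, 5]
```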

In Part 5 we reviewed how to return a distinct set of numbers from a one-billion-row table in under a tenth of a second using what I have coined a multidimensional virtual index – that is, combining a virtual index with a real one speeds things up. More technically, we’re giving the optimizer a wider range of options than it would have with only one index in place.

So, if adding one virtual index to your query can speed things up, can we speed them up even further by adding more? I mean, they’re basically free and can be pulled out of thin air – why not? That’s what we’re going to review today.

Dimensionality

Seasoned SQL developers, as well as anyone who has written MDX expressions, are familiar with the concept of dimensionality. Using SSAS, for example, we can write an MDX expression to return a matrix with two or more dimensions as shown below. The Month dimension is on the x-axis, AKA the columns(0) axis in MDX; the Vehicle dimension is on the y-axis, AKA rows(1). For brevity I included two Month members and three Vehicle members. The numbers can represent anything (e.g. Sales, Inventory, etc.). Using MDX we can easily return something that looks like this:

Figure 1. Two-dimensional matrix with Month on the x-axis, Vehicle on the y-axis.

Two-dimensional matrix with a dimension for months and a second dimension for vehicles.

Now let’s say we wanted a deeper level of granularity for each vehicle color. We could add a third dimension for color to our expression to return something like this:

Figure 2. Three-dimensional matrix; a 3rd dimension added for color

Here we’ve added a third dimension for “color” to transform it into a three-dimensional matrix.
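Outside of MDX, the same idea can be sketched in a few lines of Python (the month/vehicle/color numbers below are purely illustrative, not from the figures): adding a dimension deepens the coordinate of each cell, and each 2-D cell becomes the aggregate of its 3-D children.

```python
# Hypothetical counts keyed by (month, vehicle); values are illustrative only.
matrix_2d = {
    ("Jan", "Car"): 10, ("Jan", "Truck"): 7,
    ("Feb", "Car"): 12, ("Feb", "Truck"): 5,
}

# Adding a color dimension deepens the key, not the values:
matrix_3d = {
    ("Jan", "Car", "Red"): 6, ("Jan", "Car", "Blue"): 4,
    ("Feb", "Car", "Red"): 7, ("Feb", "Car", "Blue"): 5,
}

# A 2-D cell is the aggregate of its 3-D children:
jan_car = sum(v for (m, vh, _c), v in matrix_3d.items()
              if (m, vh) == ("Jan", "Car"))
assert jan_car == matrix_2d[("Jan", "Car")]
```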

What does this have to do with indexing or performance, you ask? Let’s apply this same concept, leveraging RangeAB, to perform what I’ve coined Combinatorial Index Optimization. Think of it as combinatorial optimization for virtual indexes and indexed functions.

Combinatorial Index Optimization

Below is our query from Part 5 of this series, where we reduced the time it took to return a distinct set of numbers from a one-billion-row set from 111 seconds using traditional methods to 0.07 seconds using a two-dimensional virtual index.

  SELECT fnFilter.N1 --fnFilter.RN-(b.Mn-1)
  FROM  (SELECT MIN(t.N), MAX(t.N) FROM #t AS t) AS b(Mn,Mx)              -- bounds of the set
  CROSS APPLY
  (
    SELECT      r.RN, r.N1
    FROM        core.rangeAB(b.Mn,b.Mx,1,1) AS r                          -- the virtual index (tally)
    CROSS APPLY (SELECT TOP (1) 1 FROM #t AS t WHERE t.N = r.N1) AS z(x)  -- "is N1 in #t?"
  ) AS fnFilter
  GROUP BY fnFilter.N1, b.Mn
  ORDER BY fnFilter.N1;

As you may recall, we reduced the runtime by using the tally table for the left side of the one-to-many relationship and only returning numbers that exist in the source table (#t). We’ve reduced the question from “where is the next distinct value?” to a simple yes/no question: “Is 1 in here? Is 2 in here?” This is an example of a two-dimensional virtual index created by way of combinatorial index optimization.
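If it helps to see the shape of that yes/no approach outside of T-SQL, here is a minimal Python model of it – a sketch of the idea, not the author’s code. The tally drives the scan from MIN to MAX, and each candidate is kept only if it exists in the source (the semi-join played by the CROSS APPLY above):

```python
def distinct_2d(source):
    """Model of the 2-D virtual index: walk a tally from MIN(source)
    to MAX(source) and keep each number only if it exists in the
    source (a yes/no membership test per candidate)."""
    members = set(source)          # stands in for the index seek on #t
    lo, hi = min(members), max(members)
    return [n for n in range(lo, hi + 1) if n in members]

assert distinct_2d([5, 2, 2, 9, 5]) == [2, 5, 9]
```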

Figure 3. Two-dimensional Virtual Index

The tally table (RangeAB in this example)

Virtual Indexing while wearing 3-D glasses

As I have said before, the great thing about a virtual index is that it’s free. Armed with a correctly constructed tally table function I can conjure up one any time I wish just like an AD&D Level 7 Magic User.

In the real world, for a table with millions/billions of rows there would almost certainly be a table or other indexed object (e.g. an indexed view) that I could join to for filtering. Either way, using rangeAB and its @Gap parameter, we can add another index for an even cooler type of filtering. First let’s review our 2-D solution. Note the code and my comments:

Figure 4. How the Two-dimensional Virtual Index works

Next our 3-D solution.

DECLARE @Gap BIGINT = 100,
        @N   BIGINT; -- assignment target for the test harness

SELECT @N = fnFilter.N1
  FROM        (SELECT MAX(t.N) FROM #t AS t) AS mx(N)
  CROSS APPLY core.rangeAB(1,mx.N,@Gap,1) AS b -- outer tally: @Gap-sized partitions
  CROSS APPLY
  (
    SELECT r.RN, r.N1
    FROM   core.rangeAB(b.N1,b.N2-1,1,1) AS r  -- inner tally: one partition's numbers
    CROSS APPLY (SELECT TOP (1) 1 FROM #t AS t WHERE t.N = r.N1) AS z(x)
  ) AS fnFilter
  WHERE    EXISTS (SELECT 1 FROM #t AS t WHERE t.N BETWEEN b.N1 AND b.N2-1) -- skip empty partitions $$$
  GROUP BY fnFilter.N1
  ORDER BY fnFilter.N1;

This image will help explain the changes.

Figure 5. How the Three-dimensional Virtual Index works

Adding this 3rd dimension will dramatically improve performance in two distinct ways. First, it reduces the amount of work required simply by ignoring entire partitions of our sequence once the EXISTS check determines it’s safe to do so. Just as Figure 2 should help you understand a 2-D virtual index, here’s what a 3-D virtual index looks like.

Figure 6. How the Three-dimensional virtual index works

Partitions without members are ignored
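Here is the same partition-skipping idea as a small Python model – again a conceptual sketch, not the T-SQL above. The outer loop plays the role of the @Gap-sized tally, the disjointness test plays the role of the EXISTS filter, and the inner scan only runs for partitions that actually have members:

```python
def distinct_3d(source, gap):
    """Model of the 3-D virtual index: partition the number line into
    gap-sized buckets, skip buckets with no members (the EXISTS test),
    and only scan inside buckets that survive."""
    members = set(source)
    hi = max(members)
    out = []
    for b in range(1, hi + 1, gap):             # outer tally: bucket starts
        bucket = range(b, min(b + gap, hi + 1))
        if members.isdisjoint(bucket):          # EXISTS ... BETWEEN b.N1 AND b.N2-1
            continue                            # whole partition ignored
        out.extend(n for n in bucket if n in members)  # inner tally scan
    return out

# Sparse set: the 101..200 and 201..299 buckets are skipped entirely.
assert distinct_3d([2, 14, 14, 15, 300], 100) == [2, 14, 15, 300]
```

Note the trade-off the tests below this section demonstrate: when the data is dense, no bucket can be skipped and the extra pass buys little; when the data is sparse, whole partitions fall away.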

Performance Test 1

For brevity I set @Gap to 3 for the example above. Notice how, for 1-3 and 13-15, our second tally table (by order of appearance in this article) is pre-filtering entire subsets of work. Next, the performance test. For this test I used the same test harness from Part 5 of this series to create one billion rows with numbers ranging from 1 to 10,000 with very few gaps. For the 3-D solution I set @Gap to 100.

Figure 7. Test #1 Results – 1 Billion, Dense

The 3-D index improves with a parallel plan

In this case the 3-D index with a serial plan slowed things down just a smidgen, because I purposely made sure there were few gaps. I included this test to demonstrate how, unlike the 2-D index, we get a faster result set with a parallel plan, despite the fact that the additional rangeAB call couldn’t really filter anything. Adding an additional dimension gives the optimizer better multithreading options. That is the second distinct advantage of the 3-D index over a 2-D one.

Performance Test 2

For this test I filled a table with 50,000 numbers ranging from 1 to 1,000,000. Next I used my test harness to duplicate the rows until I had two billion. This means my tally table must generate 1,000,000 rows for the left side of my query. This is where the 3-D index shines. I set my @Gap parameter to 6400 for this test.

Figure 8. Test #2 Results – 2 Billion, Sparse

3-D virtual index with gap reduction saves the day.

The 2-D solution runs for 12 seconds regardless of how many CPUs you throw at it. The 3-D solution runs for 1.3 seconds with a serial plan, 0.396 seconds with a parallel plan. That’s correct: less than half a second to extract 50K distinct values from a set of 2,000,000,000.

Conclusion

This, again, is what I mean by More Fast, Less Code. In Part 7 we’ll dive into what is arguably my favorite topic: Finite Opposite Numbers. You’ll want to be seated for that one.

Author: Alan Burstein

SQL Performance Ninja with 20+ years of writing high-performing, elegant SQL code. Expert at developing super fast NLP algorithms
