Virtual Indexing Part 8: The Indexed Splitter Function

The fastest way to split a string is to have already done so.

Introduction

Happy 2021! Indexed views and tally tables – a virtually unexplored topic, one I will write about at length this year. This will be a quick read and there is likely to be at least a couple typos; time is short so today is about quality, not quantity. In Part 3: The Indexed Function I introduced you to I said “the fastest way to perform a calculation is to have already done so;” in this installment I will demonstrate that the fastest way to split a string is also to have already done so.

When I introduced the Indexed function I used prime numbers as my example. 7 is always a prime number, Factorial(5) always equals 120. Creating an Indexed prime numbers, factorial or any function where a function always returns the same output for a given argument. What if we wanted a more complex indexed function, such as one that splits a string?

I know about database normalization but I am not responsible for designing most of the databases I work with. If you’ve been in this racket for a couple fortnights you have run across that vender DB written by great application developers with terrible back-end databases. It’s not uncommon to see fields that contain a delimited list – ‘1,4,13,99’ instead of a single atomic value. You can write an ETL package that leverages a good string splitter (such as DelimitedSplit8K or DelimitedSplit8k_Lead) to regularly split and copy the values into a correctly normalized table. In this article I will show you a much faster, maintenance-free alternative which leverages a Numbers Table and an indexed view.

Introducing The “Indexed Splitter”

For this you will need a tally table with at least as many rows as the longest string you intend to split. 8,000 is enough unless you need to split string longer than 8K. My tally table (dbo.tally) has 1,000,000 rows. First the sample data.

Sample data

CREATE TABLE dbo.Test (ID INT, Column1 VARCHAR(20));
INSERT INTO  dbo.Test (ID,Column1) 
  VALUES (1,'ORM;PS;SUP'),(2,'ABC;XYZ;999;123');

Next we’re going to borrow Jeff Moden’s DelimitedSplit8K logic to create a set-based splitter that leverages permanent tally table (e.g. an regular ol properly indexed numbers table named dbo.tally. There’s a small performance hit compared to delimitedSplit8K but the upside makes it more than worth it, as you will see in a moment. Here’s our splitter as a view

The Splitter as a View

CREATE OR ALTER VIEW dbo.TestSplit WITH SCHEMABINDING AS 
SELECT 
  Id   = t.ID, 
  item = 
    SUBSTRING
    (
      t.Column1,
      tt.N+SIGN(tt.N-1),
      ISNULL(NULLIF((CHARINDEX(';',t.Column1,tt.N+1)),0),LEN(t.Column1)+1)-(tt.N)-SIGN(tt.N-1)
    ),
  ItemIndex = tt.N+1
FROM       dbo.Test  AS t
CROSS JOIN dbo.tally AS tt
WHERE      tt.N <= LEN(t.Column1)
AND        (tt.N = 1 OR SUBSTRING(t.column1,tt.N,1) = ';');
GO

The values are split and returned with their associated ID. This logic is going to be applied to an index view which means we can’t include parameters; the delimiter and source of the string need to be embedded in our logic. Above we are splitting the contents of dbo.test.Column1 using a semicolon as the delimiter.

SELECT f.Id, F.Item FROM dbo.TestSplit AS f;

Below is exactly what we expected.

Function Output

Result Set

Next the execution plan.

Splitter View Execution Plan

That’s a good plan: set-based, linear and fast. There are faster splitters out there but none that work without the APPLY table operator. APPLY is not allowed in Indexed Views. Subqueries either. My splitter logic doesn’t use APPLY , subqueries or any other logic that prevents me from adding an index to my view so let’s see if we can add a unique clustered index and turn this view into an indexed Split where the split happens ahead of time and only once (unless the value is modified or deleted).

CREATE UNIQUE CLUSTERED INDEX uq_cl__testSplit ON dbo.TestSplit(Id,Item)

No error. Let’s run the same SELECT query from earlier and check the execution plan.

The Indexed Split Execution plan

It doesn’t get cleaner or faster than that. This is a powerful example of the what you can accomplish when using a physical numbers table to build an Indexed view. This barely scratches the surface as I hope to demonstrate in the near future.

Make my index a double

In 2020 “make it a double” became:

2o2o got me like ^^^

I digress.

Now that your indexed function is in place you can expand it’s power by adding more indexes! I’ll have my clustered index with a nonclustered chaser. Here we’ll create an index for item.

CREATE NONCLUSTERED INDEX nc__testSplit__item ON dbo.TestSplit(Item);

Now a query that filters on item.

SELECT f.* 
FROM   dbo.TestSplit AS f WITH (NOEXPAND)
WHERE  f.item <> '999,123';

To my chagrin, the optimizer often ignores nonclustered indexes on indexed views which is the case here. This is why I’m using NOEXPAND. Nonetheless, the execution plan is even better than before: an indexed seek against a narrower and lighter nonclustered index.

Execution plan leveraging a second, nonclustered index

The Indexed Splitter Function

Let’s add our RowNumber column and wrap the guy in an iTVF for a Nasty fast index function that splits a string ahead of time and maintains itself as needed! We’ll include the option to exclude one or more values using an @exclude parameter.

The Function

CREATE FUNCTION dbo.itvf_TestSplit
(
  @exclude VARCHAR(20)
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT f.ID, f.Item, f.ItemIndex
FROM   dbo.TestSplit AS f WITH (NOEXPAND)
WHERE  f.Item <> @exclude;
GO

Let’s give’r a spin and check the execution plan.

Execution Plan

This image has an empty alt attribute; its file name is image-7.png

Same as before but now we are accepting parameters which means we now have an indexed function that can split strings ahead of time. Not even STRING_SPLIT or a CLR can compete here.

What about an “Item Number”?

Sometimes you need the item’s ordinal in the string; for this we need an item number (labeled as “ItemNumber” in my function logic. To return an item number (without a sort in the execution plan) some assembly will be required. DelimitedSplit8K and DelimitedSplit8K_LEAD both use ROW_NUMBER() to produce the ItemNumber column. ROW_NUMBER is not allowed in indexed views which is why I included the “ItemIndex” . It represents the item’s position in the string, something I normally don’t need. Here it’s perfect however – it allows me to add an ItemNumber while still avoiding a sort in my final plan. Not a big deal with strings 8K or less, but it is for longer strings (something an indexed function can handle at no extra cost.) First the new index:

EATE NONCLUSTERED INDEX nc_poc__testSplit__ItemIndex ON dbo.TestSplit(Id, ItemIndex);

With that index in place let’s run a query which includes the item number.

SELECT 
  Id= split.Id,
  ItemNumber = ROW_NUMBER() OVER (PARTITION BY split.ID ORDER BY split.ItemIndex),
  split.Item
FROM dbo.itvf_TestSplit('999,123') AS split;

Output

Result set that includes ItemNumber

Execution plan

easy peasy lemon-squeezy

Still a fantastic execution plan the same low cost as before with more complexity. Again, the calculations happened already, this function is leveraging one or more indexes to retrieve, not compute, the required value.

Conclusion

The Indexed Split is another game changer yet barely scratches the surface of the power of the Virtual Index and the Indexed Function. When you add a physical numbers tally to you Indexed views they can do more than pre-join or pre-aggregate values. Here we introduced the world’s first “indexed split” and we’re still just getting started! Tomorrow I’m going to introduce the Indexed Hierarchy. Thanks for reading!

Author: Alan Burstein

SQL Performance Ninja with 20+ years of writing high-performing, elegant SQL code. Expert at developing super fast NLP algorithms

Leave a Reply