How to find strings longer than 8K in SQL Server (Part 1)

Ever try searching a string for a substring larger than 8,000 characters? It’s actually more difficult than you think… without a good N-Grams function, that is.

Before you continue

FIRST, this article leverages my NGrams2B function which can be found here: Nasty Fast N-Grams (Part 1). NGrams2B is the same as the 8K version but handles VARCHAR(MAX) strings. Since publishing this article I have since dropped the “2B” from the function name; you will see it referred to simply as dbo.ngrams in the code examples below.

SECOND, For people following my N-Grams series on SQL Server Central (SSC), let not your heart be troubled: I am in the process of wrapping up my N-Grams Series (Parts 2-6). This is a quick “one-off” intended to keep my new blog posts moving along…

Introduction

Riddle me this: what will the query below return?

DECLARE @SearchText VARCHAR(MAX) = REPLICATE('X',9000),
        @SomeString VARCHAR(MAX) = REPLICATE('X',8001);

SELECT CHARINDEX(@SearchText,@SomeString);

You would expect it to return a Zero(0) – @SearchText is a string of 9000 X’s, @SomeString is 8001 X’s. The query, however, returns a 1. That’s not right, how can this be? To get to the bottom of this let’s append this line to our query:

SELECT LEN(@SearchText), LEN(@SomeString)

This returns: 8000 for each. Ahhhhh, our sample strings are being implicitly converted from VARCHAR(MAX) to VARCHAR(8000). I make this mistake periodically when using REPLICATE for creating sample strings. To get around the 8K limitation we can case the input string as VARCHAR(MAX):

--==== Casting "X" as VARCHAR(MAX) circumvents the truncation issue
DECLARE 
  @SearchText VARCHAR(MAX) =       REPLICATE(CAST('X' AS VARCHAR(MAX)),9000),
  SomeString  VARCHAR(MAX) = '###'+REPLICATE(CAST('X' AS VARCHAR(MAX)),9001);

SELECT LEN(@SearchText), LEN(@SomeString);

This prevents the truncation. Let’s run add our CHARINDEX query again:

SELECT CHARINDEX(@SearchText,@SomeString);
Results:
Msg 8152, Level 16, State 10, Line 132
String or binary data would be truncated.

Not as easy as you thought huh?. Perhaps PATINDEX can save the day…

SELECT PATINDEX('%'+@SearchText+'%',@SomeString);

Same error, grrrr… What about LIKE? Surely we can solve this using LIKE right?

SELECT CASE WHEN @SearchText LIKE '%'+@SomeString+'%' THEN 1 ELSE 0 END;

Fail – same error. I encountered this problem for the first time earlier this year and attempting to solve it took a toll on my keyboard and self-esteem.

The Problem

To my knowledge, SQL Server doesn’t provide a way to search for a VACHAR longer than 8,000 or 4,000 for NVARCHAR. Based on my research it’s obvious that this is not a common problem. It will be more common as Developers continue to push the limits of what you can do with strings in SQL Server. And if I’m wrong? So what. Solving problems like this is fun and talking about N-Grams never gets old.

dbo.NGrams to the Rescue (as per usual)

dbo.NGrams and dbo.NGrams8K are Super heroes

Ok-LIKE, CHARINDEX and PATINDEX are not valid options. To quote Ned Flanders, “As melon scratchers go, this one is a honeydoodle!” Fortunately SQL Server is powerful enough to check two VARCHAR(MAX) strings for equality. This is where a good N-Grams function is a super hero.

First we’ll tokenize (split) the @SomeString into tokens as long as @SearchText. It’s six characters long so we can split @SomeString into 6-grams.

--==== Sample Search Text and String to analyze
DECLARE @SearchText VARCHAR(MAX) = 'ABC123',
        @SomeString VARCHAR(MAX) = 'AAABC123XYZ';

--==== Tokenize @SomeString into @SearchText-sized tokens
SELECT   ng.Position, ng.Token
FROM     dbo.ngrams(@SomeString,LEN(@SearchText)) AS ng -- LEN(@SeachText=6)
ORDER BY ng.Position;
Results:
Position  Token
--------- ---------
1         AAABC1
2         AABC12
3         ABC123
4         BC123X
5         C123XY
6         123XYZ

The value of @SearchText (“ABC123”) is at Position 3: AAABC123XYZ. All that’s left is to add a WHERE filter to compare @SearchText to each token returned by dbo.NGrams.

SELECT   ng.Position, ng.Token
FROM     dbo.ngrams(@SomeString,LEN(@SearchText)) AS ng
WHERE    ng.Token = @SearchText
ORDER BY ng.Position;
Returns:
Position   Token   
---------- ---------------
3          ABC123

This is what we want. Note that, thanks to the virtual index that my N-Grams function uses, I can sort by the token’s position in the string without a sort in the execution plan. That’s going to help us because we need to add a TOP(1) clause to identify the location of the first instance of @SearchText . Remember, we only need to know if the text exists. Adding TOP(1), with an ORDER BY to keep it deterministic, has multiple benefits: not only do we get a predictable result, this also assists the cardinality estimator and enables us to tap into the SQL optimizer’s Row Goals which, when correctly leveraged, allow for obscene performance gains.

Here we’re going to use strings (8K+) as variables from earlier but with a couple small changes then:

  1. 1. Add the TOP(1) clause along with the proper ORDER BY clause
  2. 2. Wrap the logic up in a subquery named f
  3. 3. Retrieve the value MAX(f.ItemIndex) from f
  4. 4. Add an ISNULL our MAX() to return 0 when there isn’t a match
DECLARE @SearchText VARCHAR(MAX) = REPLICATE(CAST('X' AS VARCHAR(MAX)),9000),
        @SomeString VARCHAR(MAX) = '###'+REPLICATE(CAST('X' AS VARCHAR(MAX)),9001);

SELECT ItemIndex = ISNULL(MAX(f.ItemIndex),0)
FROM
(
  -- Returns the first position of @SearchText
  SELECT TOP(1) ng.Position
  FROM     dbo.ngrams(@SomeString,LEN(@SearchText)) AS ng
  WHERE    ng.Token = @SearchText
  ORDER BY ng.Position
) AS f(ItemIndex);

This returns: 4 which is correct – @SearchText is a series of 9000 X’s and exists in Position 4 in @SomeString. And that’s it – easy peasy. Next for performance.

Introducing the 2 Billion+ row test harness

I said two Billion, with a B – not a typo. I began the heading with “Introducing” because any test harness exceeding a billion rows deserves a proper introduction. I say, rows, not characters, because dbo.NGrams tokenizes (splits) strings into rows. You’re going to see many more billion+ row test harnesses as we explore the PerfML Universe.

The VARCHAR(MAX) character limit is 2,147,483,647. I rounded my sample values down to two billion, created a search string 50K characters long then stuffed that value into @SomeString, nested between 100,011 junk characters followed by two billion more for a grand total of 2,000,150,011 characters. We’ll run the query twice: once with a serial plan, another with a parallel plan using make_parallel() by Adam Machanic.

Test Harness

SET STATISTICS TIME ON;
PRINT 'Build the test string'+CHAR(13)+REPLICATE('-',90);

  DECLARE @Match VARCHAR(MAX) = REPLICATE(CAST('X' AS VARCHAR(MAX)),50000),
          @Junk1 VARCHAR(MAX) = REPLICATE(CAST('Z' AS VARCHAR(MAX)),100011),
          @Junk2 VARCHAR(MAX) = REPLICATE(CAST('#' AS VARCHAR(MAX)),2000000000);

  DECLARE @SearchText VARCHAR(MAX) = @Match,
          @SomeString VARCHAR(MAX) = @Junk1+@Match+@Junk2;

PRINT 'Performance Test #1: Serial Execution'+CHAR(13)+REPLICATE('-',90);

  SELECT ItemIndex = ISNULL(MAX(f.ItemIndex),0)
  FROM
  (
    -- Returns the first position of @SearchText
    SELECT TOP(1) ng.Position
    FROM     dbo.ngrams(@SomeString,LEN(@SearchText)) AS ng
    WHERE    ng.Token = @SearchText
    ORDER BY ng.Position
  ) AS f(ItemIndex)
  OPTION (MAXDOP 0);

PRINT 'Performance Test #2: Parallel Execution'+CHAR(13)+REPLICATE('-',90);

  SELECT ItemIndex = ISNULL(MAX(f.ItemIndex),0)
  FROM
  (
    -- Returns the first position of @SearchText
    SELECT TOP(1) ng.Position
    FROM     dbo.ngrams(@SomeString,LEN(@SearchText)) AS ng
    WHERE    ng.Token = @SearchText
    ORDER BY ng.Position
  ) AS f(ItemIndex)
  CROSS JOIN dbo.make_parallel() --  NOOOO >> OPTION (QUERYTRACEON 8649);

SET STATISTICS TIME OFF;

How do you think we’ll do? How many minutes do you think this will run? Drum roll………….

2B Row Test Results

Build the test string
------------------------------------------------------------------------------------------
 SQL Server Execution Times: CPU time = 12453 ms,  elapsed time = 12576 ms.
 SQL Server Execution Times: CPU time = 7032 ms,  elapsed time = 7060 ms.

Performance Test #1: Serial Execution
------------------------------------------------------------------------------------------
 SQL Server Execution Times: CPU time = 1953 ms,  elapsed time = 1950 ms.

Performance Test #2: Parallel Execution
------------------------------------------------------------------------------------------
 SQL Server Execution Times: CPU time = 3568 ms,  elapsed time = 3570 ms.

This ran for about 25 seconds on my laptop. These results are more impressive when we dig deeper. SQL Server spent ~20 seconds building the string, another 1.9 seconds to find what we were looking for with a serial plan, 3.5 seconds for parallel. make_parallel adds a bit of spaghetti to the plan but, aside from that, the plans are identical.

Execution plan for the first (serial) query

If we hover over the second Nested Loops Join operator in the plan we can see the actual number of rows generated by dbo.NGrams:

The matching search text is 10,011 characters deep which explains the 100,012 operations

100,012 rows (pseudo iterations) to parse a 2,000,150,011 character string; this in under 2 seconds with only one CPU. That’s Nasty Fast! Can you guess how we got to 100,012 rows? I buried the search string 100,011 characters deep. The deeper the search string is, the number of rows dbo.NGrams creates increases, something we’ll address in Part 2.

make_parallel is a Super Power

Queries which leverage dbo.NGrams and dbo.NGrams8k generally perform better with parallel execution but not in this case. Here I’d likely force a serial plan to prevent a slower parallel plan. I included a parallel solution to help you understand the magic which is make_parallel. I first discussed the obvious benefits of make_parallel in PerfML Part 2 Self-Tuning with a Tally Table. Now for a not-so-obvious benefit, and it’s big.

In my experience, make_parallel and TraceFlag 8649 basically perform identically ~95% of the time. The only reason I ever use TraceFlag 8649 is during development as make_parallel makes the execution plans less readable. Every once in a while, not often, TraceFlag 8649 causes a query to run for minutes (even hours) instead of seconds/milliseconds. This is due to Intra-Query Parallel Deadlocking. What makes this type of deadlocking distinctly painful is the pure randomness of it. Sometimes it happens every 10-100 query executions, sometimes it’s a few times a month. This an example of the type of adventures undocumented trace flags will take you on. make_parallel doesn’t have this issue. Just another reason make_parallel deserves a cape.

Conclusion

2B strings are not common but this should give you a taste for how this technique will hold up against millions or even billions of smaller strings. This is the power of the tally table and dbo.NGrams. Unless someone can point me to a book, article or forum post that details a faster way to find a string longer that 8K in SQL Server I’m going to aver that this is the fastest known method for finding VARCHARs 8001 characters long in SQL Server without an Index. But can we make it faster, much faster, unimaginably faster! See you at Part 2 of this series. Thanks for reading!

Author: Alan Burstein

SQL Performance Ninja with 20+ years of writing high-performing, elegant SQL code. Expert at developing super fast NLP algorithms

Leave a Reply