Disclaimer
This article is a little more advanced than previous ones but will still be useful to programmers of all skill levels. To get the most from this article I suggest running the attached code locally, then reading what I have to say about it. I rushed to get this out so there may be typos. I’ll let my code do a lot of the talking.
Functions used in this article:
Introduction
In my very first article I mentioned that I was going to leverage the techniques detailed in this blog to present a solution to the Longest Common Subsequence (LCSQ) between three strings, which is an NP-Complete [1] problem. In April 2013 I posted an article titled, A Set-Based Solution to the Longest Common Subsequence Problem. It was my first attempt at solving this problem using only a lazy sequence (AKA tally table or numbers table), meaning no loops, cursors, temp tables or any other procedural/iterative techniques. It became my most popular article even though my solution was wrong, as a reader kindly pointed out. Instead of taking the article down I tried to develop a correct, purely set-based LCSQ function and update the article once I did. I tried and tried, to no avail. This was infuriating because, though not easy, the problem seemed simple enough to solve with 10-20 hours of effort, 30 tops. For two years I would periodically take a swing at it, as it was such an intriguing problem. This is when and how I learned about NP-Completeness, Complexity Theory and the unsolved math problem known simply as P vs NP.
Long story short: if you can come up with an algorithm that solves any NP-Complete problem in deterministic polynomial time, you have proven that P=NP and, in theory, can be awarded $1,000,000. Alternatively, if you can prove that no such algorithm is possible, that P!=NP, you will also get paid. Most people who study this problem, roughly 99%, believe P != NP. I am in the tiny minority that believes otherwise, in part for the reasons discussed here (6:30 minutes in).
25 years of programming has taught me that the tools our math teachers and professors have given us are not adequate to solve complex problems quickly. The purpose of this entire blog is to introduce new tools that empower developers to solve any programming problem faster, including NP-Hard and NP-Complete ones, in almost any programming language. The code here is in T-SQL because it’s what I know best, but I intend to rewrite the code in this series in Python and JavaScript.
What’s ahead
In this installment we’re going to:
- Use a Longest Common Subsequence (LCSQ) function for some basic AI as we determine what someone meant to say.
- Learn how to speed up any LCSQ function (two or more strings), written in any programming language, without changing any code, leveraging a technique I have coined Sequence Reduction.
- Introduce a set-based solution to the LCSQ between two strings (LCSQ2) and three strings (LCSQ3).
Unless someone can demonstrate otherwise, I will aver that the LCSQ3 function posted below is the first purely set-based solution: immutable, loop- and recursion-free, leveraging nothing more than a lazy sequence and some new ideas. Better yet, our set-based function is going to give you:
- A deterministic result
- The length and text of the longest common subsequence
- One row per subsequence when there are ties
- The locations where the subsequence(s) start and end, along with their pattern
- A tuning parameter (discussed here) that allows you to transform the function into an approximation or heuristic, or a combination of the three for a much, much faster, deterministic result
Let’s do a quick demo:
DECLARE @S1 VARCHAR(8000) = 'XA.BCD123',
@S2 VARCHAR(8000) = 'ABC-D-12Z3',
@S3 VARCHAR(8000) = '12DABCYP123';
SELECT lcsq3.* FROM samd.LCSQ3_AJB_POC(@S1,@S2,@S3,1) AS lcsq3; --==== Longest Common Subsequence – three strings
Results:

The compressed length, 6 in this case, represents the length of the longest common subsequence between the strings XA.BCD123, ABC-D-12Z3 and 12DABCYP123. S1_SfxIdx is where the subsequence begins, and the compression key represents the pattern of the sequence. Consider this table, which will help you visualize how the compression key works:

The function’s name ends with _POC, for Proof Of Concept. I wanted to show that it can be done. In Part 2 we’ll dive deep into how to make our LCSQ functions even faster leveraging techniques I’ve covered in earlier articles, including finite opposite numbers, indexed functions, combinatorial index optimization and much more. Throughout this series we’ll learn how to make your function faster the more CPUs you throw at it. Better yet, we’re going to learn how to teach your LCSQ3 function to solve problems faster on its own. From here on in, I intend to talk as little as possible and let my portable declarative code do the talking for me.
Playing along at home
Attached in the intro is all the code used in this series. There are 31 functions total, which you’ll become more familiar with as this series progresses. There are ~1,800 lines of code but 90% of it is inline documentation. Please feel free to load it locally and play mad data scientist. Remember, however, to give credit where credit is due.
The code is formatted in a way that you can go to Edit > Outlining > Toggle all Outlining to collapse the code for a high-level view of what’s included.

Using the Longest Common Subsequence for some basic AI
The challenge
Black Mirror is my favorite show of all time. In the White Christmas episode we learned about cookies. The irony here is that we’re using N-Grams to analyze the names of characters whose digital consciousness was transferred to “the cloud” using a recaller to extract and upload their memory ngrams. I digress.
We have a table named @cookie containing the ID and name of three Black Mirror characters. A second table, @unmatched, contains the names of those characters spelled incorrectly. We need to match each misspelled name to the best guess for what the user was trying to spell.
Cookie Sample data
DECLARE @cookie TABLE (CookieId INT IDENTITY, CookieName VARCHAR(20));
INSERT @cookie VALUES('Joe Porter'),('Mr. Matt Trent'),('Greta Cookie');
DECLARE @unmatched TABLE (Unmatched VARCHAR(30));
INSERT @unmatched VALUES('Mr. Matthew Trent'),('Gretta Cookie'),('Joey Porter');
Get the similarity between all possible combinations
First, a basic cross join (Cartesian product) to get the LCSQ between each possible combination. We are using Phil Factor’s LCSQ function to extract the LCSQ between each pair of names.
--==== How to get the similarity between all strings
SELECT
UnmatchedName = u.Unmatched,
QualifiedName = c.CookieName,
LCSq = dbo.LongestCommonSubsequence(c.CookieName,u.Unmatched)
FROM @cookie AS c
CROSS JOIN @unmatched AS u;
Results:
UnmatchedName QualifiedName LCSq
------------------- -------------------- -----------------
Mr. Matthew Trent Joe Porter e rt
Mr. Matthew Trent Mr. Matt Trent Mr. Matt Trent
Mr. Matthew Trent Greta Cookie rete
Gretta Cookie Joe Porter e oe
Gretta Cookie Mr. Matt Trent rtt e
Gretta Cookie Greta Cookie Greta Cookie
Joey Porter Joe Porter Joe Porter
Joey Porter Mr. Matt Trent rtr
Joey Porter Greta Cookie e oe
Next, we can use our Bernie8K function to calculate the similarity between the two strings by dividing the LCSQ length by L2 (the length of the longer string).
--== 1.1.2. How to get the similarity between all strings
SELECT
UnmatchedName = u.Unmatched,
QualifiedName = c.CookieName,
LCSq = lcsq.Txt,
LCSqLength = Txt.Ln,
Similarity = CAST(1.*Txt.Ln/b.L2 AS DECIMAL(3,3))
FROM @cookie AS c
CROSS JOIN @unmatched AS u
CROSS APPLY samd.bernie8k(c.CookieName,u.Unmatched) AS b
CROSS APPLY (VALUES(dbo.LongestCommonSubsequence(b.S1,b.S2))) AS lcsq(Txt)
CROSS APPLY (VALUES(LEN(lcsq.Txt))) AS Txt(Ln);
Results:
UnmatchedName QualifiedName LCSq LCSqLength Similarity
-------------------- -------------------- ------------------ ----------- -----------
Mr. Matthew Trent Joe Porter e rt 4 0.235
Mr. Matthew Trent Mr. Matt Trent Mr. Matt Trent 14 0.824
Mr. Matthew Trent Greta Cookie rete 4 0.235
Gretta Cookie Joe Porter e oe 4 0.308
Gretta Cookie Mr. Matt Trent rtt e 5 0.357
Gretta Cookie Greta Cookie Greta Cookie 12 0.923
Joey Porter Joe Porter Joe Porter 10 0.909
Joey Porter Mr. Matt Trent rt 3 0.214
Joey Porter Greta Cookie e oe 4 0.333
Now let’s get the best suggestion for each misspelled name. For this we’ll wrap our logic in a subquery then, using TOP(1) WITH TIES, we can get the best suggestion for each with this ORDER BY clause:
ORDER BY ROW_NUMBER()
OVER (PARTITION BY similarity.Unmatched ORDER BY similarity.Sim DESC);
--== 1.1.3. Get the best suggestion
SELECT TOP(1) WITH TIES similarity.*
FROM
(
SELECT
Unmatched = u.Unmatched,
Qualified = c.CookieName,
LCSq = lcsq.Txt,
LCSqLength = Txt.Ln,
Sim = CAST(1.*Txt.Ln/b.L2 AS DECIMAL(3,3))
FROM @cookie AS c
CROSS JOIN @unmatched AS u
CROSS APPLY samd.bernie8k(c.CookieName,u.Unmatched) AS b
CROSS APPLY (VALUES(dbo.LongestCommonSubsequence(b.S1,b.S2))) AS lcsq(Txt)
CROSS APPLY (VALUES(LEN(lcsq.Txt))) AS Txt(Ln)
) AS Similarity
ORDER BY ROW_NUMBER()
OVER (PARTITION BY similarity.Unmatched ORDER BY similarity.Sim DESC);
Results:
UnmatchedName QualifiedName LCSq LCSqLength Similarity
------------------ ---------------- --------------- ----------- ----------
Mr. Matthew Trent Mr. Matt Trent Mr. Matt Trent 14 0.824
Gretta Cookie Greta Cookie Greta Cookie 12 0.923
Joey Porter Joe Porter Joe Porter 10 0.909
And there you have it! We just wrote a query that figured out (presumably) what name the user intended to type. Hopefully this demystifies what’s going on behind the scenes with things like spell-check and auto-suggest algorithms.
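If it helps to see the moving parts outside of T-SQL, here is a quick Python sketch of the same idea (my own illustration, not part of the attached code): a textbook dynamic-programming LCSQ plus the similarity ratio, LCSQ length divided by the longer string's length (L2).

```python
def lcsq(s1: str, s2: str) -> str:
    """Classic dynamic-programming LCSQ; returns one longest common subsequence."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # dp[i][j] = LCSQ length of prefixes
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk the table backwards to recover the subsequence text
    out, i, j = [], m, n
    while i and j:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return ''.join(reversed(out))

def best_match(unmatched: str, candidates: list):
    """Similarity = LCSQ length / L2, where L2 is the longer string's length."""
    return max(((c, len(lcsq(unmatched, c)) / max(len(unmatched), len(c)))
                for c in candidates), key=lambda t: t[1])

cookies = ['Joe Porter', 'Mr. Matt Trent', 'Greta Cookie']
for name in ['Mr. Matthew Trent', 'Gretta Cookie', 'Joey Porter']:
    print(name, '->', best_match(name, cookies))
```

The three misspelled names resolve to the same suggestions the query above produced.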
Sequence Reduction
As mentioned earlier, the general LCSQ problem (over an arbitrary number of strings) is NP-Hard and LCSQ3 is NP-Complete. We’ll use sequence reduction against two strings so that you can calculate similarity faster, then against three strings to fulfill my promise to demonstrate how to “solve problems in NP without any code changes.”
First the LCSQ between two strings. The longest common subsequence between the strings: “AWWWDABGAGWAWADADAXAAWA” and “BBAXZZXXXBZBGZGZZZZGGGZ” is ABGG. I colored the first instance of ABGG in red for each string. Leveraging sequence cleaning we can reduce the characters in the set and get the correct answer by comparing AABGAGAAAAXAAA & BBAXXXXBBGGGGG. Leveraging sequence compression, we get the correct answer by comparing AABGAGAAAAXAA & BAXXBBGGGG. Sequence reduction allows us to get the correct answer by comparing ABGAGAXA & BAXBGG.
This chart will help you visualize the savings we enjoy for each technique.

The original pair of strings are each 23 characters long; the pair of strings transformed via sequence reduction are 8 and 6 characters long. To be clear, in this example it’s possible to calculate the longest common subsequence between two 23-character strings by shrinking the size of the input, for a dramatic performance increase, while still guaranteeing the correct answer. Furthermore, since we’re only using functionally pure, set-based code, our sequence reduction functions will run in nasty fast time (0 ms).
In the example above we reduced the original input, which totals 46 characters, to 14 characters leveraging sequence reduction. Here, we’ve reduced the complexity of LCSQ2 from O(2^46) to O(2^14). That’s good news for any exact algorithm, approximation, or heuristic. Now let’s review how to clean, compress, and reduce a set of two or more strings.
1. Sequence Cleaning
Sequence cleaning is performed by removing characters that exist in one string but not the other (orphans). Predicate logic dictates that characters that exist in only one string cannot be part of any common subsequence. Applying this concept to our strings, these:

Become:

The W’s, D’s and Z’s have been removed. Simple enough, eh?
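The cleaning step is easy to express in any language. Here is a minimal Python sketch (an illustration of the concept, not one of the attached functions):

```python
def sequence_clean(s1: str, s2: str):
    """Drop orphans: characters that appear in only one of the two strings
    can never participate in a common subsequence."""
    common = set(s1) & set(s2)
    return (''.join(c for c in s1 if c in common),
            ''.join(c for c in s2 if c in common))

# The W's and D's vanish from the first string, the Z's from the second
print(sequence_clean('AWWWDABGAGWAWADADAXAAWA', 'BBAXZZXXXBZBGZGZZZZGGGZ'))
# -> ('AABGAGAAAAXAAA', 'BBAXXXXBBGGGGG')
```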
2. Sequence Compression
Taking this concept to the next level, let’s say string1 contains only one “A” and string2 contains three. If “A” is part of the LCSQ, and only one “A” exists in string1, then only one of the three A’s in string2 will be matched. It’s not possible to determine which “A” will be included in the common subsequence, but there is a way to determine, in some instances, which ones can be removed. Enter Sequence Compression.
Consider the LCSQ between “ABC“ and each of these: “ABAAC”, “ABBBCA”, “ABCCDC”. “ABC” contains only one A, one B and one C. Therefore any consecutive series of A’s, B’s and/or C’s can be compressed to one. Compare the before and after for these three “sequence compressed” strings.

Sequence compression can be applied whenever one of the strings in the set has more of any specific character than the other strings in the set. When comparing ABCB to ABBBBC we can reduce the second string to ABBC: there are only two B’s in the first string, so BBBB can become BB. Think of sequence compression as an advanced form of sequence cleaning. Applying this concept to our strings, these:

…are transformed into these:

Did you notice that the letter D was removed (cleaned) from the third string? The letter D is an orphan, so it compresses to zero occurrences. With sequence compression, sequence cleaning occurs organically: you are both cleaning orphan characters and compressing islands of repeated characters (when possible).
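Here is my reading of the compression rule as a Python sketch (illustrative only; the attached T-SQL is the reference implementation): every maximal run of a character in one string is capped at that character's total count in the other string, and orphans are capped at zero, which is why cleaning happens organically.

```python
from collections import Counter
from itertools import groupby

def sequence_compress(s1: str, s2: str):
    """Cap each run of a repeated character at the character's total count
    in the other string; orphans (count 0 on the other side) disappear."""
    c1, c2 = Counter(s1), Counter(s2)
    def cap(s, other):
        return ''.join(ch * min(sum(1 for _ in run), other[ch])
                       for ch, run in groupby(s))
    return cap(s1, c2), cap(s2, c1)

print(sequence_compress('ABCB', 'ABBBBC'))  # BBBB -> BB (only two B's in ABCB)
print(sequence_compress('AWWWDABGAGWAWADADAXAAWA', 'BBAXZZXXXBZBGZGZZZZGGGZ'))
```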
So, with sequence compression we’re cleaning and compressing strings at the same time. That’s a good thing but it’s also a bad thing. What we really want is to clean then compress our strings! Why? Because cleaning leads to better compression. Enter Sequence Reduction.
3. Clean, then Compress = Sequence Reduction
Sequence Reduction is performed by cleaning your string first, then performing sequence compression. Consider these queries:
DECLARE @S1 VARCHAR(8000) = 'AWWWDABGAGWAWADADAXAAWA',
@S2 VARCHAR(8000) = 'BBAXZZXXXBZBGZGZZZZGGGZ',
@A VARCHAR(256) = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
;--==== No Reduction
SELECT S1 = @S1, S2 = @S2, L1 = LEN(@S1), L2 = LEN(@S2),
LCSS = dbo.LongestCommonSubsequence(@S1,@S2);
;--==== Cleaning Only
SELECT sc.S1, sc.S2, L1 = LEN(sc.S1), L2 = LEN(sc.S2),
LCSS = dbo.LongestCommonSubsequence(sc.S1,sc.S2)
FROM samd.SequenceClean8KLight(@S1,@S2) AS sc;-- AABGAGAAAAXAAA BBAXXXXBBGGGGG
;--==== Cleaning AND Compression
SELECT sr.S1_New, sr.S2_New, L1 = LEN(sr.S1_New), L2 = LEN(sr.S2_New),
LCSS = dbo.LongestCommonSubsequence(sr.S1_New,sr.S2_New)
FROM samd.SequenceCompressAll8K_NG(@S1,@S2,@A,0) AS sr; -- AABGAGAAAAXAA BAXXBBGGGG
;--==== Clean THEN Compress (Proper Sequence Reduction: Cleaning THEN Compression)
SELECT sr.S1_New, sr.S2_New, L1 = LEN(sr.S1_New), L2 = LEN(sr.S2_New),
LCSS = dbo.LongestCommonSubsequence(sr.S1_New,sr.S2_New)
FROM samd.SequenceReduce8K_NG(@S1,@S2,@A,0) AS sr --ABGAGAXA BAXBGG
The chart below details which function is responsible for which transformation.

For brevity, this article focuses on sequence cleaning; we’ll explore sequence compression in Part 2 and sequence reduction in Part 3.
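Conceptually, the whole pipeline is just function composition. A Python sketch of the idea (using the same interpretation of cleaning and compression as the sketches above; not the attached code):

```python
from collections import Counter
from itertools import groupby

def clean(s1, s2):
    # Keep only characters present in both strings
    common = set(s1) & set(s2)
    return (''.join(c for c in s1 if c in common),
            ''.join(c for c in s2 if c in common))

def compress(s1, s2):
    # Cap each run at the character's total count in the other string
    c1, c2 = Counter(s1), Counter(s2)
    cap = lambda s, o: ''.join(ch * min(sum(1 for _ in r), o[ch])
                               for ch, r in groupby(s))
    return cap(s1, c2), cap(s2, c1)

def reduce_pair(s1, s2):
    """Sequence reduction = cleaning, then compression of the cleaned strings."""
    return compress(*clean(s1, s2))

print(reduce_pair('AWWWDABGAGWAWADADAXAAWA', 'BBAXZZXXXBZBGZGZZZZGGGZ'))
# -> ('ABGAGAXA', 'BAXBGG'): 46 characters down to 14
```

Cleaning first matters: removing orphans shortens the runs the compressor sees, which is why clean-then-compress beats compress-alone.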
Sequence Cleaning with samd.SequenceClean8KLight
samd.SequenceClean8KLight is simple; it’s built from three functions:
- samd.SequenceGroup8K
- samd.SequenceExtract8K
- samd.PatReplace8K
samd.SequenceGroup8K, our sequence grouper, identifies characters that exist in one string but not the other. samd.SequenceExtract8K returns a distinct set of unmatched characters as a string. That string is then passed to PatReplace8K and all unmatched characters are removed from both strings.
Sequence Grouping is the process of returning all characters in two or more strings along with a count of each character for each string compared. This is how we can determine why two strings are different. Consider this query:
SELECT sg.* FROM samd.SequenceGroup8K('AABDE23', 'ACC123X') AS sg;
Returns:

The first string (S1) contains one “B” and the second string (S2) doesn’t contain any. The letter “C” appears twice in S2 but not at all in S1. Any character that has a 0 count under S1 or S2 can be removed.
Sequence Extraction is the process of extracting a distinct set of common (matched) characters (i.e., characters that appear in both strings) or a distinct set of unmatched characters. The results are returned in the form of a string for consumption by samd.PatReplace8K. Consider these two queries:
DECLARE @S1 VARCHAR(8000) = 'AABDE23',
@S2 VARCHAR(8000) = 'ACC123X';
SELECT se.* FROM samd.SequenceExtract8K(@S1,@S2,0) AS se; -- Returns: 1BCDEX
SELECT se.* FROM samd.SequenceExtract8K(@S1,@S2,1) AS se; -- Returns: 23A
The first returns the orphans: 1BCDEX; the second query returns the common or matched characters: 23A. The matched characters represent your alphabet; we’ll discuss this later in the series.
Sequence Cleaning, as mentioned earlier, is the process of removing unmatched characters from both strings, a vital step in performing Sequence Reduction. For this article I’ll be using samd.SequenceClean8KLight, the barebones version of a more advanced sequence cleaning function. Let’s clean @S1 and @S2.
DECLARE @S1 VARCHAR(8000) = 'AABDE23',
@S2 VARCHAR(8000) = 'ACC123X';
SELECT S1_Original = @S1,
S2_Original = @S2,
S1_New = sc.S1,
S2_New = sc.S2
FROM samd.SequenceClean8KLight(@S1,@S2) AS sc;

Next, the underlying programming logic behind each function.
Set-Based Grouping, Extraction and Cleaning
Set-based, functionally pure code is vital. Loops, recursion, mutable variables/data structures, and impure functions will ruin everything. NGrams8k has your back.
1. Sequence Grouping via samd.SequenceGroup8K
For this article we’ll use samd.SequenceGroup8K. It’s nasty fast but not the fastest; we’ll introduce a much better sequence grouper later in this series. I started with this version as it’s the easiest to understand.
DECLARE @S1 VARCHAR(8000) = 'AABDE23',
@S2 VARCHAR(8000) = 'ACC123X';
SELECT
Token = ISNULL(S1.Token,S2.Token),
S1 = f.S1,
S2 = f.S2,
Total = ABS(f.S1-f.S2) -- Diffs
FROM
( --==== Split @S1 into unigrams
SELECT ng.Token, Total = COUNT(*)
FROM samd.ngrams8k(@S1,1) AS ng
GROUP BY ng.Token --==== Group and count @S1 unigrams
) AS S1
FULL JOIN
( --==== Split @S2 into unigrams
SELECT ng.Token, Total = COUNT(*)
FROM samd.ngrams8k(@S2,1) AS ng
GROUP BY ng.Token --==== Group and count @S2 unigrams
) AS S2
ON S1.Token = S2.Token --==== Find matching unigrams
CROSS APPLY (VALUES(ISNULL(S1.Total,0),ISNULL(S2.Total,0))) AS f(S1,S2); --==== Pre-Agg
The logic is simple provided you are comfortable with full outer joins. We get a count of each character with a simple GROUP BY, then join on the character, returning a 0 when there isn’t a match. If we wanted to perform this operation on three or more strings, we would simply add another subquery per additional string, and ISNULL(S1.Token,S2.Token) would be replaced with COALESCE(S1.Token, S2.Token, S3.Token, ...). We will do this later while performing sequence cleaning on three strings.
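For comparison, the same grouping expressed as a Python sketch (illustrative only): two Counters over the union of characters play the role of the full outer join, and a zero on either side marks an orphan.

```python
from collections import Counter

def sequence_group(s1: str, s2: str):
    """Per-character counts side by side - the analog of the full join above.
    A zero on either side marks an orphan that can be cleaned away."""
    c1, c2 = Counter(s1), Counter(s2)
    return {ch: (c1[ch], c2[ch]) for ch in sorted(set(s1) | set(s2))}

# Token, count in S1, count in S2, absolute difference
for token, (in_s1, in_s2) in sequence_group('AABDE23', 'ACC123X').items():
    print(token, in_s1, in_s2, abs(in_s1 - in_s2))
```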
2. Sequence Extraction Logic
Now the sequence extractor. This guy takes two input strings, performs a sequence grouping, then leverages STRING_AGG to return the matched or unmatched characters as a string.
Logic:
SELECT Pattern = STRING_AGG(sg.Token,'') -- Order doesn't matter; no need for WITHIN GROUP
FROM samd.SequenceGroup8K(@S1,@S2) AS sg
WHERE SIGN(sg.S1*sg.S2) = 0; -- 0 for unmatched, 1 for matched
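A Python sketch of the same filter (illustrative only): SIGN(S1*S2) = 0 keeps the orphans, 1 keeps the matched characters.

```python
from collections import Counter

def sequence_extract(s1: str, s2: str, matched: bool) -> str:
    """Return the distinct orphan (matched=False) or common (matched=True)
    characters as a single sorted string."""
    c1, c2 = Counter(s1), Counter(s2)
    return ''.join(sorted(
        ch for ch in set(s1) | set(s2)
        if (c1[ch] > 0 and c2[ch] > 0) == matched))

print(sequence_extract('AABDE23', 'ACC123X', False))  # -> 1BCDEX (orphans)
print(sequence_extract('AABDE23', 'ACC123X', True))   # -> 23A (matched)
```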
The result set for these queries will help you better understand samd.SequenceExtract8K.
DECLARE @S1 VARCHAR(8000) = 'ABC123AABBCC##',
@S2 VARCHAR(8000) = 'ABCXXYYYZZ123CC';
SELECT sg.* FROM samd.SequenceGroup8K(@S1,@S2) AS sg;
SELECT se.Pattern FROM samd.SequenceExtract8K(@S1,@S2,0) AS se;
SELECT se.Pattern FROM samd.SequenceExtract8K(@S1,@S2,1) AS se;
Results:

3. Sequence Cleaning Logic
The most straightforward way to accomplish this is to use samd.SequenceExtract8K to extract a distinct string of unmatched characters and pass it to samd.PatReplace8K as a pattern of characters to replace with nothing (i.e., remove them).
3.1. PatReplace8K version logic
SELECT
S1 = S1.NewString,
S2 = S2.NewString
FROM samd.SequenceExtract8K(@S1,@S2,0) AS se
CROSS APPLY samd.patReplace8K(@S1,STUFF('[]',2,0,se.Pattern),'') AS S1
CROSS APPLY samd.patReplace8K(@S2,STUFF('[]',2,0,se.Pattern),'') AS S2;
3.2. TRANSLATE version logic
Included with the attached code is a T-SQL TRANSLATE version of SequenceReduce8KLight named SequenceReduce8KLight_T2. If you are on SQL Server 2017+ you may want to try it out; it performs a wee bit better than the PatReplace8K version.
SELECT
S1 = S1.NewString,
S2 = S2.NewString
FROM samd.SequenceExtract8K(@S1,@S2,0) AS se
CROSS APPLY (VALUES(REPLICATE(CHAR(2),LEN(se.Pattern)))) AS t(Pat)
CROSS APPLY (VALUES(REPLACE(TRANSLATE(@S1,se.Pattern,t.Pat),CHAR(2),''))) AS S1(NewString)
CROSS APPLY (VALUES(REPLACE(TRANSLATE(@S2,se.Pattern,t.Pat),CHAR(2),''))) AS S2(NewString);
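For what it’s worth, Python has a close cousin of this trick: str.translate with a None mapping deletes characters outright, with no CHAR(2) placeholder needed (again, just an illustrative sketch, not the attached code):

```python
def clean_via_translate(s1: str, s2: str):
    """Delete orphan characters in one pass, analogous to TRANSLATE + REPLACE."""
    orphans = (set(s1) | set(s2)) - (set(s1) & set(s2))
    table = str.maketrans({ch: None for ch in orphans})  # None = delete
    return s1.translate(table), s2.translate(table)

print(clean_via_translate('AABDE23', 'ACC123X'))  # -> ('AA23', 'A23')
```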
Applying sequence cleaning to our Black Mirror cookie example
Below is the original logic without cleaning; the second example demonstrates how to clean your strings before they are passed to the LCSQ function.
Original Logic
SELECT TOP(1) WITH TIES similarity.*
FROM
(
SELECT
Unmatched = u.Unmatched,
Qualified = c.CookieName,
LCSq = lcsq.Txt,
LCSqLength = Txt.Ln,
Sim = CAST(1.*Txt.Ln/b.L2 AS DECIMAL(3,3))
FROM @cookie AS c
CROSS JOIN @unmatched AS u
CROSS APPLY samd.bernie8k(c.CookieName,u.Unmatched) AS b
CROSS APPLY (VALUES(dbo.LongestCommonSubsequence(b.S1,b.S2))) AS lcsq(Txt)
CROSS APPLY (VALUES(LEN(lcsq.Txt))) AS Txt(Ln)
) AS Similarity
ORDER BY ROW_NUMBER()
OVER (PARTITION BY similarity.Unmatched ORDER BY similarity.Sim DESC);
Sequence cleaning added
SELECT TOP(1) WITH TIES similarity.*
FROM
(
SELECT
Unmatched = u.Unmatched,
Qualified = c.CookieName,
LCSq = lcsq.Txt,
LCSqLength = Txt.Ln,
Sim = CAST(1.*Txt.Ln/b.L2 AS DECIMAL(3,3))
FROM @cookie AS c
CROSS JOIN @unmatched AS u
CROSS APPLY samd.bernie8k(c.CookieName,u.Unmatched) AS b
CROSS APPLY samd.SequenceClean8KLight(b.S1,b.S2) AS r
CROSS APPLY (VALUES(dbo.LongestCommonSubsequence(r.S1,r.S2))) AS lcsq(Txt)
CROSS APPLY (VALUES(LEN(lcsq.Txt))) AS Txt(Ln)
) AS Similarity
ORDER BY ROW_NUMBER()
OVER (PARTITION BY similarity.Unmatched ORDER BY similarity.Sim DESC);
Both return the same thing, but the second example does so by comparing fewer characters. This query demonstrates the savings from sequence cleaning alone:
DECLARE @cookie TABLE (CookieId INT IDENTITY, CookieName VARCHAR(20));
INSERT @cookie VALUES('Joe Porter'),('Matthew Trent'),('Greta Cookie');
DECLARE @unmatched TABLE (Unmatched VARCHAR(30));
INSERT @unmatched VALUES('Matthew Trint'),('Gretta Cookie'),('Gretta'),('Joey Porter');
SELECT
S1 = c.CookieName,
S2 = u.Unmatched,
New = r.S1,
New = r.S2,
OriginalLen = LEN(c.CookieName+u.Unmatched),
NewLen = LEN(r.S1+r.S2),
Savings = LEN(c.CookieName+u.Unmatched)-LEN(r.S1+r.S2)
FROM @cookie AS c
CROSS JOIN @unmatched AS u
CROSS APPLY samd.SequenceClean8KLight(c.CookieName,u.Unmatched) AS r;
Returns:
S1 S2 New New OriginalLen NewLen Savings
------------- -------------- ---------- ------------- ----------- ------- --------
Joe Porter Matthew Trint e rter tte Trt 23 13 10
Joe Porter Gretta Cookie oe orte rett ooe 23 16 7
Joe Porter Gretta erter rett 16 9 7
Joe Porter Joey Porter Joe Porter Joe Porter 21 20 1
Matthew Trent Matthew Trint Matthew .. Matthew Trnt 26 25 1
Matthew Trent Gretta Cookie atte Tret retta e 26 16 10
Matthew Trent Gretta atteTret retta 19 13 6
Matthew Trent Joey Porter tte Tret e rter 24 14 10
Greta Cookie Matthew Trint reta ie atte Trit 25 16 9
Greta Cookie Gretta Cookie Greta Co.. Gretta Cookie 25 25 0
Greta Cookie Gretta Gretae Gretta 18 12 6
Greta Cookie Joey Porter ret ooe oe orter 23 15 8
That’s 75 fewer characters leveraging sequence cleaning. Score!
A Set Based Solution to the Longest Common Subsequence (Three Strings)
I’ll exclude my two-string LCSQ function for now, but it’s included with the attached code. We’ll dive deeper into the inner workings of both LCSQ functions in Part 2; samd.LCSQ3_AJB_POC is doing a quick cameo for the purpose of demonstrating the power of sequence cleaning against three strings (and that you can solve basically any problem without a loop).
What’s worth calling out now is how both LCSQ functions accept a gap parameter (@gap). Yes, we’re using a brute-force algorithm, but we’re doing so without the need to evaluate every subsequence. Both functions leverage the design pattern detailed in PerfML Part 2: Self-Tuning with a Tally Table. This design pattern allows you to solve LCSQ2 in 2^(N/G+G-1) time and LCSQ3 in 3^(N/G+G-1) time, something we’ll also dig deeper into in Part 2. All you need to know for now is that our LCSQ3 is set-based, relatively fast (for all of its functionality), runs faster with a parallel execution plan, and is deterministic. The function name ends with “_POC” as it’s my proof of concept. The focus here is more about learning than tuning.
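Assuming the 2^(N/G+G-1) cost formula above, a quick back-of-the-envelope calculation (my sketch, not the attached code) shows why the self-tuning pattern works: the exponent N/G + G - 1 bottoms out when G is near the square root of N.

```python
def exponent(n: int, g: int) -> float:
    """Exponent of the assumed cost 2**(N/G + G - 1) for a given gap G."""
    return n / g + g - 1

n = 36
best_g = min(range(1, n + 1), key=lambda g: exponent(n, g))
print(best_g, exponent(n, best_g))  # G = 6 (= sqrt(36)), exponent 11.0
print(exponent(n, 1))               # G = 1 leaves the full 2**36-style cost
```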
The Function
samd.LCSQ3_AJB_POC accepts three input strings and calculates the LCSQ between them. samd.Bernie8K_3S is used to identify the shortest of the three strings as L1; L1 represents the longest possible subsequence between the three strings. samd.SubSequenceExtractALL looks like it was named by Yoda but works as advertised. It takes three parameters: an input string (@string), a gap parameter (@gap) and a maximum subsequence length (@SL). When @gap=1, the function extracts every subsequence of size 1 through @SL. When @gap=2 the function returns every subsequence of size 1, 3, 5, etc., through @SL. We use L1 as the maximum subsequence length (@SL) when splitting our strings into subsequences, since there aren’t going to be any common subsequences longer than the shortest string in the set.
All three strings are passed to samd.SubSequenceExtractALL as separate calls and only subsequences that exist in all three strings are returned. Then we sort the matches by subsequence length in descending order, returning the longest. Leveraging TOP(1) WITH TIES, sorted by the length of the subsequence, we are able to return ties if they exist. The @gap parameter allows you to use the function for approximations. When @gap = 1 and the function returns a six-character subsequence, you know that the LCSQ is six characters long. When @gap = 2 the function is 2X as fast but can be off by one; in this case, for example, if the function returns a six-character string you know that the LCSQ is 6-7 characters long. When @gap = 3, the function is 3X faster but can be off by as much as two; if it says the LCSQ is six characters long, you know that it might be 6, 7 or 8. A classic case of trading performance for accuracy.
CREATE OR ALTER FUNCTION samd.LCSQ3_AJB_POC
(
@S1 VARCHAR(20),
@S2 VARCHAR(20),
@S3 VARCHAR(20),
@gap BIGINT
)
RETURNS TABLE AS RETURN
SELECT DISTINCT TOP(1) WITH TIES
CompressedLength = s1.CompressedLength,
S1_SfxIdx = s1.SuffixIndex,
S1_Compression_key = s1.CompressionKey,
S1_Uncompressed_Len = s1.UncompressedLength,
S1_Subsequence = s1.Subsequence,
S2_SfxIdx = s2.SuffixIndex,
S2_Compression_key = s2.CompressionKey,
S2_Uncompressed_Len = s2.UncompressedLength,
S2_Subsequence = s2.Subsequence,
S3_SfxIdx = s3.SuffixIndex,
S3_Compression_key = s3.CompressionKey,
S3_Uncompressed_Len = s3.UncompressedLength,
S3_Subsequence = s3.Subsequence
FROM samd.Bernie8K_3S(@S1,@S2,@S3) AS b
CROSS APPLY samd.SubSequenceExtractALL(b.S1,@gap,b.L1) AS s1
CROSS APPLY samd.SubSequenceExtractALL(b.S2,@gap,b.L1) AS s2
CROSS APPLY samd.SubSequenceExtractALL(b.S3,@gap,b.L1) AS s3
WHERE s1.Subsequence = s2.Subsequence
AND s1.Subsequence = s3.Subsequence
ORDER BY s1.CompressedLength DESC;
Again, we’ll keep making this guy faster and faster throughout this series until you’re sick of it. Stay tuned.
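For intuition about @gap, here is a toy Python brute force (strictly an illustration; the real function is set-based and does far less work): only subsequence sizes 1, 1+@gap, 1+2*@gap, ... of the shortest string are generated, so the reported length can undershoot the true LCSQ by at most @gap-1.

```python
from itertools import combinations

def is_subseq(sub: str, s: str) -> bool:
    it = iter(s)
    return all(ch in it for ch in sub)  # 'in' advances the iterator

def lcsq3_gap(s1: str, s2: str, s3: str, gap: int = 1) -> str:
    """Toy brute force: generate subsequences of the shortest string at sizes
    1, 1+gap, 1+2*gap, ... and keep the longest one found in all three."""
    shortest = min((s1, s2, s3), key=len)
    best = ''
    for size in range(1, len(shortest) + 1, gap):
        for idx in combinations(range(len(shortest)), size):
            cand = ''.join(shortest[i] for i in idx)
            if len(cand) > len(best) and all(is_subseq(cand, s) for s in (s1, s2, s3)):
                best = cand
    return best

answer = lcsq3_gap('XA.BCD123', 'ABC-D-12Z3', '12DABCYP123', gap=1)
print(answer, len(answer))  # a 6-character LCSQ, matching the demo above
```

With gap=2 only sizes 1, 3, 5, 7, 9 are tried, so the same inputs report a 5-character answer: twice the speed, off by at most one.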
A Three String Circus (Performance Test)
Unlike problems in P, with problems in NP, even reducing the input size (N) by one is a big gain. Let’s create a test harness with our example from earlier to demonstrate. Here we’re going to get the IO stats and CPU time for:
- The cleaning function only (SequenceReduce3s8KLight) which cleans three strings
- samd.LCSQ3_AJB_POC without any reduction
- samd.LCSQ3_AJB_POC with sequence cleaning, run three times: once for @gap=1, once for @gap=2 and once for @gap=3
LCSQ3 Test Harness
DECLARE @S1 VARCHAR(8000) = 'XA.BCD123',
@S2 VARCHAR(8000) = 'ABC-D-12Z3',
@S3 VARCHAR(8000) = '12DABCYP123';
SET STATISTICS IO, TIME ON;
PRINT CHAR(10)+'Cleaning Only'+CHAR(10)+REPLICATE('-',90);
SELECT sr3.* FROM samd.SequenceReduce3s8KLight(@S1,@S2,@S3) AS sr3;
PRINT CHAR(10)+'LCSQ3 CORE'+CHAR(10)+REPLICATE('-',90);
SELECT lcsq3.* FROM samd.LCSQ3_AJB_POC(@S1,@S2,@S3,1) AS lcsq3;
PRINT CHAR(10)+'Reduce Light serial @gap = 1'+CHAR(10)+REPLICATE('-',90);
SELECT lcsq3.*
FROM samd.SequenceReduce3s8KLight(@S1,@S2,@S3) AS sr3
CROSS APPLY samd.LCSQ3_AJB_POC(sr3.S1,sr3.S2,sr3.S3,1) AS lcsq3
OPTION (MAXDOP 1); -- default optimizer option
PRINT CHAR(10)+'Reduce Light parallel @gap = 1'+CHAR(10)+REPLICATE('-',90);
SELECT lcsq3.*
FROM samd.SequenceReduce3s8KLight(@S1,@S2,@S3) AS sr3
CROSS APPLY samd.LCSQ3_AJB_POC(sr3.S1,sr3.S2,sr3.S3,1) AS lcsq3
CROSS APPLY core.make_parallel();
PRINT CHAR(10)+'Reduce Light parallel @gap = 2'+CHAR(10)+REPLICATE('-',90);
SELECT lcsq3.*
FROM samd.SequenceReduce3s8KLight(@S1,@S2,@S3) AS sr3
CROSS APPLY samd.LCSQ3_AJB_POC(sr3.S1,sr3.S2,sr3.S3,2) AS lcsq3
CROSS APPLY core.make_parallel();
PRINT CHAR(10)+'Reduce Light parallel @gap = 3'+CHAR(10)+REPLICATE('-',90);
SELECT lcsq3.*
FROM samd.SequenceReduce3s8KLight(@S1,@S2,@S3) AS sr3
CROSS APPLY samd.LCSQ3_AJB_POC(sr3.S1,sr3.S2,sr3.S3,3) AS lcsq3
CROSS APPLY core.make_parallel();
GO
SET STATISTICS IO, TIME OFF;
Results:
Cleaning Only
------------------------------------------------------------------------------------------
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 1 ms.
LCSQ3 CORE
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 167, logical reads 534191, physical reads 0, page server reads 0, read-ahead reads 187, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
SQL Server Execution Times: CPU time = 2079 ms, elapsed time = 1840 ms.
LCSQ3 + Reduce Light serial @gap = 1
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 137, logical reads 210268, physical reads 0, page server reads 0, read-ahead reads 23, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
SQL Server Execution Times: CPU time = 750 ms, elapsed time = 769 ms.
Reduce Light parallel @gap = 1
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 129, logical reads 111171, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
SQL Server Execution Times: CPU time = 515 ms, elapsed time = 524 ms.
Reduce Light parallel @gap = 2
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 81, logical reads 61921, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
SQL Server Execution Times: CPU time = 219 ms, elapsed time = 210 ms.
Reduce Light parallel @gap = 3
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 65, logical reads 45475, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
SQL Server Execution Times: CPU time = 234 ms, elapsed time = 234 ms.
Completion time: 2021-08-10T20:54:09.6885549-05:00
Reviewing the results, the key take-aways are:
- The cleaning process is instantaneous – 0MS
- The core LCSQ3 version without any reduction is the slowest; even with a parallel plan the runtime is 1.84 seconds.
- Adding sequence cleaning speeds things up to 0.75 seconds with a serial plan, ~0.5 seconds with a parallel plan
- Sequence cleaning reduced the reads from 534,191 to 210,268 with a serial execution plan, 111,171 with a parallel plan
- Setting @gap=2 gets us down to 0.2 seconds with a margin of error of 1
- Setting @gap=3 actually slows things down a pinch, back to 0.23 seconds, with a margin of error of 2
- The @gap parameter allows for informed accuracy vs performance decisions
Conclusion
I think we’ve covered a lot of ground. We learned (or were reminded) that:
- Pretty much any complex problem can be solved without loops, recursion, mutable objects or any other procedural, iterative programming logic. Mark another one up for the tally table!
- You can speed up the Longest Common Subsequence, in any language, without changing any code, simply by reducing the size of the input.
- Leveraging pure set-based code while maintaining functional purity, we can make our function even faster when forcing a parallel execution plan.
- With RangeAB we can leverage the @gap parameter to turn our LCSQ2 and LCSQ3 functions into approximations.
Lastly, I want to send a quick shout-out to Sean Lange. Sean’s always been one of my unofficial internet SQL mentors and I’ve learned a ton from him over the last decade plus. A couple years ago I posted an answer on StackOverflow where I introduced some of these ideas. Sean’s short response lit a fire under my butt and deserves some of the credit.
I hope to have stimulated a few extra synapses out there. In Part2 we’re going to dive deeper into the world of sequence compression along with how to turbo charge the socks off our LCSQ2 and LCSQ3 functions.
Thanks for reading and happy hunting!