Solving NP-Complete Problems Faster Part 1 – Sequence Reduction

Disclaimer

This article is a little more advanced than previous ones but will still be useful to programmers of all skill levels. To get the most from this article I suggest running the attached code locally, then reading what I have to say about it. I rushed to get this out so there may be typos. I’ll let my code do a lot of the talking.

Functions used in this article include: samd.NGrams8K, samd.PatReplace8K, samd.Bernie8K, samd.Bernie8K_3S, samd.SequenceGroup8K, samd.SequenceExtract8K, samd.SequenceClean8KLight, samd.SequenceCompressAll8K_NG, samd.SequenceReduce8K_NG, samd.SequenceReduce3s8KLight, samd.SubSequenceExtractALL, samd.LCSQ3_AJB_POC, dbo.LongestCommonSubsequence (Phil Factor's) and core.make_parallel.

Introduction

In my very first article I mentioned that I was going to leverage the techniques detailed in this blog to present a solution to the Longest Common Subsequence (LCSQ) between three strings, an NP-Complete [1] problem. In April 2013 I posted an article titled, A Set-Based Solution to the Longest Common Subsequence Problem. It was my first attempt at solving this problem using a lazy sequence (AKA tally table or numbers table) exclusively, meaning no loops, cursors, temp tables or any other procedural/iterative techniques. It became my most popular article even though my solution was wrong, as a reader kindly pointed out. Instead of taking the article down I tried to develop a correct, purely set-based LCSQ function so I could update the article once I had it. I tried and tried, to no avail. This was infuriating because, though not easy, the problem seemed simple enough to solve with 10-20 hours of effort, 30 tops. For two years I would periodically take a swing at it, as it was such an intriguing problem. This is when/how I learned about NP Complexity, Complexity Theory and the unsolved math problem known simply as P vs NP.

Long story short: if you can come up with an algorithm that solves any NP-Complete problem in deterministic polynomial time, you have proven that P=NP and, in theory, can be awarded $1,000,000. Alternatively, if you can prove that no such algorithm is possible, that P!=NP, you will also get paid. Most people who study this problem, roughly 99%, believe P != NP. I am in the tiny minority that believes otherwise, in part, for the reasons discussed here (6:30 in).

25 years of programming has taught me that the tools our math teachers and professors have given us are not adequate to solve complex problems quickly, only quickly enough. The purpose of this entire blog is to introduce new tools that empower developers to solve any programming problem faster, including NP-Hard and NP-Complete ones, in almost any programming language. The code here is in T-SQL because it's what I know best, but I intend to re-write the code in this series in Python and JavaScript.

What’s ahead

In this installment we’re going to:

  1. Use a Longest Common Subsequence (LCSQ) function for some basic AI as we determine what someone meant to say.
  2. Learn how to speed up any LCSQ function (two or more strings), written in any programming language, without changing any code, leveraging what I have coined: Sequence Reduction.
  3. Introduce a set-based solution to the LCSQ between two strings (LCSQ2) and three strings (LCSQ3).

Unless someone can demonstrate otherwise, I will aver that the LCSQ3 function posted below is the first purely set-based solution of its kind: immutable, loop and recursion free, leveraging nothing more than a lazy sequence and some new ideas. Better yet, our set-based function is going to give you:

  1. A deterministic result
  2. The length and text of the longest common subsequence
  3. One row per subsequence when there are ties
  4. The location where the subsequence(s) starts and ends, and its pattern
  5. A tuning parameter (discussed here) that allows you to transform the function into an approximation or heuristic, or a combination of the three for a much, much faster, deterministic result

Let’s do a quick demo:

DECLARE @S1  VARCHAR(8000) = 'XA.BCD123',
        @S2  VARCHAR(8000) = 'ABC-D-12Z3',
        @S3  VARCHAR(8000) = '12DABCYP123';

SELECT lcsq3.* FROM samd.LCSQ3_AJB_POC(@S1,@S2,@S3,1) AS lcsq3; --==== Longest Common Subsequence – three strings
Results:
LCSQ3 Results

The compressed length, 6 in this case, represents the length of the longest common subsequence between the strings XA.BCD123, ABC-D-12Z3 and 12DABCYP123. S1_SfxIdx is where the subsequence begins, and the compression key represents the pattern of the sequence. Consider this table, which will help you visualize how the compression key works:

The function's name ends with _POC, for Proof of Concept; I wanted to show that it can be done. In Part 2 we'll dive deep into how to make our LCSQ functions even faster, leveraging techniques I've covered in earlier articles including finite opposite numbers, indexed functions, combinatorial index optimization and much more. Throughout this series we'll learn how to make the function faster the more CPUs you throw at it. Better yet, we're going to teach your LCSQ3 function to teach itself to solve problems faster. From here on in, I intend to talk as little as possible and let my portable declarative code do my talking for me.

Playing along at home

Attached in the intro is all the code used in this series. There are 31 functions total, which you'll become more familiar with as this series progresses. There are ~1,800 lines of code but 90% of it is inline documentation. Please feel free to load it locally and play mad data scientist. Remember, however, to give credit where credit is due.

The code is formatted in a way that you can go to Edit > Outlining > Toggle all Outlining to collapse the code for a high-level view of what’s included.

Using the Longest Common Subsequence for some basic AI

The challenge

Black Mirror is my favorite show of all time. In the White Christmas episode we learned about cookies. The irony here is that we're using N-Grams to analyze the names of characters whose digital consciousness was transferred to "the cloud" using a recaller to extract and upload their memory ngrams. I digress.

We have a table named @cookie containing the ID and name of three Black Mirror characters. The second table, @unmatched, contains the name of each character spelled incorrectly. We need to match each misspelled name with the best guess for what the user was trying to spell.

Cookie Sample data

DECLARE @cookie TABLE (CookieId INT IDENTITY, CookieName VARCHAR(20));
INSERT  @cookie VALUES('Joe Porter'),('Mr. Matt Trent'),('Greta Cookie');

DECLARE @unmatched TABLE (Unmatched VARCHAR(30));
INSERT  @unmatched VALUES('Mr. Matthew Trent'),('Gretta Cookie'),('Joey Porter');

Get the similarity between all possible combinations

First, a basic cross join (Cartesian product) to get the LCSQ between each possible combination. We are using Phil Factor's LCSQ function to extract the LCSQ between each pair of names.

--==== How to get the similarity between all strings
SELECT      
  UnmatchedName = u.Unmatched, 	
  QualifiedName = c.CookieName,
  LCSq          = dbo.LongestCommonSubsequence(c.CookieName,u.Unmatched)
FROM        @cookie    AS c
CROSS JOIN  @unmatched AS u;
Results:
UnmatchedName       QualifiedName        LCSq
------------------- -------------------- -----------------
Mr. Matthew Trent   Joe Porter           e rt
Mr. Matthew Trent   Mr. Matt Trent       Mr. Matt Trent
Mr. Matthew Trent   Greta Cookie         rete
Gretta Cookie       Joe Porter           e oe
Gretta Cookie       Mr. Matt Trent       rtt e
Gretta Cookie       Greta Cookie         Greta Cookie
Joey Porter         Joe Porter           Joe Porter
Joey Porter         Mr. Matt Trent       rtr
Joey Porter         Greta Cookie         e oe

Next, we can use our Bernie8K function to calculate the similarity between the two strings by dividing the length of the LCSQ by L2, the length of the longer string. For example, for "Mr. Matthew Trent" and "Mr. Matt Trent" the LCSQ is 14 characters long and L2 = 17, giving a similarity of 14/17 = 0.824.

--== 1.1.2. How to get the similarity between all strings
SELECT
  UnmatchedName = u.Unmatched, 
  QualifiedName = c.CookieName,
  LCSq          = lcsq.Txt,
  LCSqLength    = Txt.Ln,
  Similarity    = CAST(1.*Txt.Ln/b.L2 AS DECIMAL(3,3))
FROM        @cookie                                           AS c
CROSS JOIN  @unmatched                                        AS u
CROSS APPLY samd.bernie8k(c.CookieName,u.Unmatched)           AS b
CROSS APPLY (VALUES(dbo.LongestCommonSubsequence(b.S1,b.S2))) AS lcsq(Txt)
CROSS APPLY (VALUES(LEN(lcsq.Txt)))                           AS Txt(Ln);
Results:
UnmatchedName        QualifiedName        LCSq               LCSqLength  Similarity
-------------------- -------------------- ------------------ ----------- -----------
Mr. Matthew Trent    Joe Porter           e rt               4           0.235
Mr. Matthew Trent    Mr. Matt Trent       Mr. Matt Trent     14          0.824
Mr. Matthew Trent    Greta Cookie         rete               4           0.235
Gretta Cookie        Joe Porter           e oe               4           0.308
Gretta Cookie        Mr. Matt Trent       rtt e              5           0.357
Gretta Cookie        Greta Cookie         Greta Cookie       12          0.923
Joey Porter          Joe Porter           Joe Porter         10          0.909	
Joey Porter          Mr. Matt Trent        rt                3           0.214
Joey Porter          Greta Cookie         e oe               4           0.333

Now let's get the best suggestion for each misspelled name. For this we'll wrap our logic in a subquery and then, using TOP(1) WITH TIES, grab the best suggestion for each. The trick is the ORDER BY clause below: ROW_NUMBER() restarts at 1 for each unmatched name, so TOP(1) WITH TIES returns every row that ranks first within its partition.

ORDER BY ROW_NUMBER()
OVER (PARTITION BY similarity.Unmatched ORDER BY similarity.Sim DESC);
--== 1.1.3. Get the best suggestion
SELECT TOP(1) WITH TIES similarity.*
FROM
(
  SELECT
    Unmatched  = u.Unmatched, 
    Qualified  = c.CookieName,
    LCSq       = lcsq.Txt,
    LCSqLength = Txt.Ln,
    Sim        = CAST(1.*Txt.Ln/b.L2 AS DECIMAL(3,3))
  FROM        @cookie                                           AS c
  CROSS JOIN  @unmatched                                        AS u
  CROSS APPLY samd.bernie8k(c.CookieName,u.Unmatched)           AS b
  CROSS APPLY (VALUES(dbo.LongestCommonSubsequence(b.S1,b.S2))) AS lcsq(Txt)
  CROSS APPLY (VALUES(LEN(lcsq.Txt)))                           AS Txt(Ln)
) AS Similarity  
ORDER BY ROW_NUMBER() 
 OVER (PARTITION BY similarity.Unmatched ORDER BY similarity.Sim DESC);
Results:
UnmatchedName      QualifiedName    LCSq            LCSqLength  Similarity
------------------ ---------------- --------------- ----------- ----------
Mr. Matthew Trent  Mr. Matt Trent   Mr. Matt Trent  14          0.824
Gretta Cookie      Greta Cookie     Greta Cookie    12          0.923
Joey Porter        Joe Porter       Joe Porter      10          0.909

And there you have it! We just wrote a query that figured out (presumably) what name the user intended to type. Hopefully this demystifies what's going on behind the scenes with things like spell-check and auto-suggestion algorithms.

Sequence Reduction

As mentioned earlier, LCSQ3 is NP-Complete; the general LCSQ over an arbitrary number of strings is NP-Hard (LCSQ2, by contrast, is solvable in polynomial time). We'll use sequence reduction against two strings so that you can calculate similarity faster, then against three strings to fulfill my promise to demonstrate how to "solve problems in NP without any code changes."

First, the LCSQ between two strings. The longest common subsequence between "AWWWDABGAGWAWADADAXAAWA" and "BBAXZZXXXBZBGZGZZZZGGGZ" is ABGG. I colored the first instance of ABGG in red in each string. Leveraging sequence cleaning we can reduce the characters in the set and get the correct answer by comparing AABGAGAAAAXAAA & BBAXXXXBBGGGGG. Leveraging sequence compression, we get the correct answer by comparing AABGAGAAAAXAA & BAXXBBGGGG. Sequence reduction allows us to get the correct answer by comparing ABGAGAXA & BAXBGG.

This chart will help you visualize the savings we enjoy for each technique.

The original pair of strings are each 23 characters long; the pair of strings transformed via sequence reduction are 8 and 6 characters long. To be clear: in this example it's possible to calculate the longest common subsequence between two 23-character strings by shrinking the size of the input, for a dramatic performance increase while still guaranteeing the correct answer. Furthermore, since we're only using functionally pure, set-based code, our sequence reduction functions run in nasty fast time (0 ms).

In the example above we reduced the original input, which totals 46 characters, to 14 characters leveraging sequence reduction. Here, we've reduced the complexity of a brute-force LCSQ2 from O(2^46) to O(2^14). That's good news for any exact algorithm, approximation, or heuristic. Now let's review how to clean, compress, and reduce a set of two or more strings.

1. Sequence Cleaning

Sequence cleaning is performed by removing characters that exist in one string but not the other (orphans). Predicate logic dictates that characters that exist in only one string cannot be part of any common subsequence. Applying this concept to our strings, these:

Become:

The W’s, D’s and Z’s have been removed. Simple enough, eh?
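To see cleaning in action without any helper functions, here is a minimal sketch that hard-codes the orphans from the example above (samd.SequenceClean8KLight derives them dynamically; TRANSLATE requires SQL Server 2017+):

DECLARE @S1 VARCHAR(8000) = 'AWWWDABGAGWAWADADAXAAWA',
        @S2 VARCHAR(8000) = 'BBAXZZXXXBZBGZGZZZZGGGZ';

--==== W and D exist only in @S1; Z exists only in @S2 (the orphans)
SELECT S1_Cleaned = REPLACE(TRANSLATE(@S1,'WD',CHAR(2)+CHAR(2)),CHAR(2),''), -- AABGAGAAAAXAAA
       S2_Cleaned = REPLACE(@S2,'Z','');                                     -- BBAXXXXBBGGGGG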

2. Sequence Compression

Taking this concept to the next level, let's say string1 contains only one "A" and string2 contains three. If "A" is part of the LCSQ, and only one "A" exists in string1, then only one of the three A's in string2 will be matched. It's not possible to determine which "A" will be included in the common subsequence, but there is a way to determine, in some instances, which ones can be removed. Enter Sequence Compression.

Consider the LCSQ between “ABC“ and each of these: “ABAAC”, “ABBBCA”, “ABCCDC”. “ABC” contains only one A, one B and one C. Therefore any consecutive series of A’s, B’s and/or C’s can be compressed to one. Compare the before and after for these three “sequence compressed” strings.
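  • ABAAC  becomes ABAC (AA compresses to A)
  • ABBBCA becomes ABCA (BBB compresses to B)
  • ABCCDC becomes ABCC (CC compresses to C; the orphan D compresses away entirely)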

Sequence compression can be applied whenever one of the strings in the set has more of any specific character than the other strings in the set. When comparing ABCB to ABBBBC we can reduce the second string to ABBC: there are only two B's in the first string, so BBBB can become BB. Think of sequence compression as an advanced form of sequence cleaning. Applying this concept to our strings, these:

…are transformed into these:

Did you notice that the letter D was removed (cleaned) from the third string? The letter D is an orphan, so it is compressed to zero occurrences. With sequence compression, sequence cleaning occurs organically: you are both cleaning orphan characters and compressing islands of repeated characters (when possible).

So, with sequence compression we’re cleaning and compressing strings at the same time. That’s a good thing but it’s also a bad thing. What we really want is to clean then compress our strings! Why? Because cleaning leads to better compression. Enter Sequence Reduction.
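To make the "islands" idea concrete, here is a minimal, hand-rolled sketch (not one of the attached functions) that caps runs of a single, hard-coded character using the classic gaps-and-islands technique, assuming NGrams8K's Position and Token columns; samd.SequenceCompressAll8K_NG generalizes this to every character at once. STRING_AGG ... WITHIN GROUP requires SQL Server 2017+.

DECLARE @S    VARCHAR(8000) = 'ABBBBC', -- compare against ABCB, which has two B's
        @Char CHAR(1)       = 'B',
        @Cap  INT           = 2;

WITH unigrams AS
( --==== Split @S into unigrams; GrpId is constant within each run of the same character
  SELECT ng.Position, ng.Token,
         GrpId = ng.Position - ROW_NUMBER() OVER (PARTITION BY ng.Token ORDER BY ng.Position)
  FROM   samd.ngrams8k(@S,1) AS ng
), islands AS
( --==== Number each character's position within its run (island)
  SELECT u.Position, u.Token,
         PosInRun = ROW_NUMBER() OVER (PARTITION BY u.Token, u.GrpId ORDER BY u.Position)
  FROM   unigrams AS u
)
SELECT Compressed = STRING_AGG(i.Token,'') WITHIN GROUP (ORDER BY i.Position)
FROM   islands AS i
WHERE  i.Token <> @Char OR i.PosInRun <= @Cap; -- Returns: ABBC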

3. Clean then Compress = Sequence Reduction

Sequence reduction is performed by cleaning your strings first, then performing sequence compression. Consider these queries:

DECLARE @S1 VARCHAR(8000) = 'AWWWDABGAGWAWADADAXAAWA',
        @S2 VARCHAR(8000) = 'BBAXZZXXXBZBGZGZZZZGGGZ',
        @A  VARCHAR(256)  = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';

;--==== No Reduction
SELECT S1 = @S1, S2 = @S2, L1 = LEN(@S1), L2 = LEN(@S2), 
       LCSS = dbo.LongestCommonSubsequence(@S1,@S2);

;--==== Cleaning Only
SELECT sc.S1, sc.S2, L1 = LEN(sc.S1), L2 = LEN(sc.S2),
       LCSS = dbo.LongestCommonSubsequence(sc.S1,sc.S2)
FROM   samd.SequenceClean8KLight(@S1,@S2) AS sc;-- AABGAGAAAAXAAA BBAXXXXBBGGGGG


;--==== Cleaning AND Compression
SELECT sr.S1_New, sr.S2_New, L1 = LEN(sr.S1_New), L2 = LEN(sr.S2_New), 
       LCSS = dbo.LongestCommonSubsequence(sr.S1_New,sr.S2_New)
FROM   samd.SequenceCompressAll8K_NG(@S1,@S2,@A,0) AS sr; -- AABGAGAAAAXAA BAXXBBGGGG

;--==== Clean THEN Compress (Proper Sequence Reduction: Cleaning THEN Compression) 
SELECT sr.S1_New, sr.S2_New, L1 = LEN(sr.S1_New), L2 = LEN(sr.S2_New), 
       LCSS = dbo.LongestCommonSubsequence(sr.S1_New,sr.S2_New)
FROM   samd.SequenceReduce8K_NG(@S1,@S2,@A,0) AS sr; -- ABGAGAXA BAXBGG

The chart below details which function is responsible for which transformation.

In this article I will focus on sequence cleaning for brevity; we'll explore sequence compression in Part 2 and sequence reduction in Part 3.

Sequence Cleaning with samd.SequenceClean8KLight

samd.SequenceClean8KLight is simply the combination of three functions:

  1. samd.SequenceGroup8K
  2. samd.SequenceExtract8K
  3. samd.PatReplace8K

samd.SequenceGroup8K, our sequence grouper, identifies characters that exist in one string but not the other. samd.SequenceExtract8K returns a distinct set of unmatched characters as a string. That string is then passed to PatReplace8K and all unmatched characters are removed from both strings.

Sequence Grouping is the process of returning all characters in two or more strings along with a count of each character for each string compared. This is how we can determine why two strings are different. Consider this query:

SELECT sg.* FROM samd.SequenceGroup8K('AABDE23', 'ACC123X') AS sg;
Returns:
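Token  S1   S2   Total
-----  ---  ---  -----
1      0    1    1
2      1    1    0
3      1    1    0
A      2    1    1
B      1    0    1
C      0    2    2
D      1    0    1
E      1    0    1
X      0    1    1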

The first string (S1) contains one "B" and the second string (S2) doesn't contain any. The letter "C" appears twice in S2 but not at all in S1. Any character that has a 0 count under S1 or S2 can be removed.

Sequence extraction is the process of extracting a distinct set of common (matched) characters (i.e., characters that appear in both strings) or a distinct set of unmatched characters. The results are returned in the form of a string for consumption by samd.PatReplace8K. Consider these two queries:

DECLARE @S1 VARCHAR(8000) = 'AABDE23',
        @S2 VARCHAR(8000) = 'ACC123X';

SELECT se.* FROM samd.SequenceExtract8K(@S1,@S2,0) AS se; -- Returns: 1BCDEX
SELECT se.* FROM samd.SequenceExtract8K(@S1,@S2,1) AS se; -- Returns: 23A

The first returns the orphans: 1BCDEX; the second query returns the common or matched characters: 23A. The matched characters represent your alphabet; we’ll discuss this later in the series.

Sequence cleaning, as mentioned earlier, is the process of removing unmatched characters from both strings, a vital step in performing sequence reduction. For this article I'll be using samd.SequenceClean8KLight; it is the barebones version of a more advanced sequence cleaning function. Let's clean @S1 and @S2.

DECLARE @S1 VARCHAR(8000) = 'AABDE23',
        @S2 VARCHAR(8000) = 'ACC123X';

SELECT S1_Original = @S1,
       S2_Original = @S2,
       S1_New      = sc.S1,
       S2_New      = sc.S2
FROM   samd.SequenceClean8KLight(@S1,@S2) AS sc;
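Returns:
S1_Original  S2_Original  S1_New  S2_New
-----------  -----------  ------  ------
AABDE23      ACC123X      AA23    A23

The orphans are gone from both strings: B, D and E from S1; C, 1 and X from S2.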

Next, the underlying programming logic behind each function.

Set-Based Grouping, Extraction and Cleaning

Set-based, functionally pure code is vital. Loops, recursion, mutable variables/data structures, and impure functions will ruin everything. NGrams8k has your back.

1. Sequence Grouping via samd.SequenceGroup8K

For this article we'll use samd.SequenceGroup8K. It's nasty fast but not the fastest; we'll introduce a much better sequence grouper later in this series. I started with this version as it's the easiest to understand.

DECLARE @S1 VARCHAR(8000) = 'AABDE23',
        @S2 VARCHAR(8000) = 'ACC123X';

SELECT 
  Token  = ISNULL(S1.Token,S2.Token),
  S1     = f.S1,
  S2     = f.S2,	
  Total  = ABS(f.S1-f.S2) -- Diffs
FROM
( --==== Split @S1 into unigrams
  SELECT   ng.Token, Total = COUNT(*)
  FROM     samd.ngrams8k(@S1,1) AS ng
  GROUP BY ng.Token --==== Group and count @S1 unigrams
) AS S1
FULL JOIN
( --==== Split @S2 into unigrams	
  SELECT   ng.Token, Total = COUNT(*)
  FROM     samd.ngrams8k(@S2,1) AS ng
  GROUP BY ng.Token      --==== Group and count @S2 unigrams
) AS S2 
  ON S1.Token = S2.Token --==== Find matching unigrams
CROSS APPLY (VALUES(ISNULL(S1.Total,0),ISNULL(S2.Total,0))) AS f(S1,S2); --==== Pre-Agg

The logic is simple provided you are comfortable with full outer joins. We get a count of each character with a simple GROUP BY, then join on the character, returning a 0 when there isn't a match. If we wanted to perform this operation on three or more strings, we would simply add another subquery per additional string, and ISNULL(S1.Token,S2.Token) would be replaced with COALESCE(S1.Token,S2.Token,S3.Token…). We will do this later while performing sequence cleaning on three strings.
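As a preview, here is a minimal sketch of what that three-string grouper might look like (this exact query isn't part of the attached code; @S3 is just for illustration):

DECLARE @S1 VARCHAR(8000) = 'AABDE23',
        @S2 VARCHAR(8000) = 'ACC123X',
        @S3 VARCHAR(8000) = 'AB12Y';

SELECT
  Token = COALESCE(S1.Token, S2.Token, S3.Token),
  S1    = ISNULL(S1.Total,0),
  S2    = ISNULL(S2.Total,0),
  S3    = ISNULL(S3.Total,0) --==== A zero under any column disqualifies the character
FROM      (SELECT ng.Token, Total = COUNT(*) FROM samd.ngrams8k(@S1,1) AS ng GROUP BY ng.Token) AS S1
FULL JOIN (SELECT ng.Token, Total = COUNT(*) FROM samd.ngrams8k(@S2,1) AS ng GROUP BY ng.Token) AS S2
  ON S1.Token = S2.Token
FULL JOIN (SELECT ng.Token, Total = COUNT(*) FROM samd.ngrams8k(@S3,1) AS ng GROUP BY ng.Token) AS S3
  ON S3.Token = COALESCE(S1.Token, S2.Token);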

2. Sequence Extraction Logic

Now the sequence extractor. This guy takes two input strings, performs a sequence grouping, then leverages STRING_AGG to return the matched or unmatched characters as a string.

Logic:
SELECT Pattern = STRING_AGG(sg.Token,'') -- Order doesn't matter; no need for WITHIN GROUP
FROM   samd.SequenceGroup8K(@S1,@S2) AS sg
WHERE  SIGN(sg.S1*sg.S2) = 0; -- 0 for unmatched, 1 for matched

The result set for these queries will help you better understand samd.SequenceExtract8K.

DECLARE @S1 VARCHAR(8000) = 'ABC123AABBCC##',
        @S2 VARCHAR(8000) = 'ABCXXYYYZZ123CC';

SELECT sg.* FROM samd.SequenceGroup8K(@S1,@S2) AS sg;

SELECT se.Pattern FROM samd.SequenceExtract8K(@S1,@S2,0) AS se;
SELECT se.Pattern FROM samd.SequenceExtract8K(@S1,@S2,1) AS se;
Results:
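Token  S1   S2   Total
-----  ---  ---  -----
#      2    0    2
1      1    1    0
2      1    1    0
3      1    1    0
A      3    1    2
B      3    1    2
C      3    3    0
X      0    2    2
Y      0    3    3
Z      0    2    2

The two extract queries return the patterns #XYZ (unmatched) and 123ABC (matched); the character order within each pattern may vary since no ORDER BY is applied.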

3. Sequence Cleaning Logic

The most straightforward way to accomplish this is to use SequenceExtract8K to extract a distinct string of unmatched characters and pass it to PatReplace8K as a pattern of characters to replace with nothing (i.e., remove).

3.1. PatReplace8K version logic

SELECT
  S1 = S1.NewString,
  S2 = S2.NewString
FROM        samd.SequenceExtract8K(@S1,@S2,0)                    AS se
CROSS APPLY samd.patReplace8K(@S1,STUFF('[]',2,0,se.Pattern),'') AS S1
CROSS APPLY samd.patReplace8K(@S2,STUFF('[]',2,0,se.Pattern),'') AS S2;

3.2. TRANSLATE version logic

Included with the attached code is a T-SQL TRANSLATE version of SequenceReduce8KLight named SequenceReduce8KLight_T2. If you are on SQL Server 2017+ you may want to try it out; it performs a wee bit better than the PatReplace8K version.

SELECT
  S1 = S1.NewString,
  S2 = S2.NewString
FROM        samd.SequenceExtract8K(@S1,@S2,0)                             AS se
CROSS APPLY (VALUES(REPLICATE(CHAR(2),LEN(se.Pattern))))                  AS t(Pat)
CROSS APPLY (VALUES(REPLACE(TRANSLATE(@S1,se.Pattern,t.Pat),CHAR(2),''))) AS S1(NewString)
CROSS APPLY (VALUES(REPLACE(TRANSLATE(@S2,se.Pattern,t.Pat),CHAR(2),''))) AS S2(NewString);

Applying sequence cleaning to our Black Mirror cookie example

Below is the original logic without cleaning; the second example demonstrates how to clean your strings before they are passed to the LCSQ function.

Original Logic

SELECT TOP(1) WITH TIES similarity.*
FROM
(
  SELECT
    Unmatched  = u.Unmatched, 
    Qualified  = c.CookieName,
    LCSq       = lcsq.Txt,
    LCSqLength = Txt.Ln,
    Sim        = CAST(1.*Txt.Ln/b.L2 AS DECIMAL(3,3))
  FROM        @cookie                                           AS c
  CROSS JOIN  @unmatched                                        AS u
  CROSS APPLY samd.bernie8k(c.CookieName,u.Unmatched)           AS b
  CROSS APPLY (VALUES(dbo.LongestCommonSubsequence(b.S1,b.S2))) AS lcsq(Txt)
  CROSS APPLY (VALUES(LEN(lcsq.Txt)))                           AS Txt(Ln)
) AS Similarity  
ORDER BY ROW_NUMBER() 
 OVER (PARTITION BY similarity.Unmatched ORDER BY similarity.Sim DESC);

Sequence cleaning added

SELECT TOP(1) WITH TIES similarity.*
FROM
(
  SELECT
    Unmatched  = u.Unmatched, 
    Qualified  = c.CookieName,
    LCSq       = lcsq.Txt,
    LCSqLength = Txt.Ln,
    Sim        = CAST(1.*Txt.Ln/b.L2 AS DECIMAL(3,3))
  FROM        @cookie                                           AS c
  CROSS JOIN  @unmatched                                        AS u
  CROSS APPLY samd.bernie8k(c.CookieName,u.Unmatched)           AS b
  CROSS APPLY samd.SequenceClean8KLight(b.S1,b.S2)              AS r
  CROSS APPLY (VALUES(dbo.LongestCommonSubsequence(r.S1,r.S2))) AS lcsq(Txt)
  CROSS APPLY (VALUES(LEN(lcsq.Txt)))                           AS Txt(Ln)
) AS Similarity  
ORDER BY ROW_NUMBER() 
 OVER (PARTITION BY similarity.Unmatched ORDER BY similarity.Sim DESC);

Both return the same thing, but the second example gets there by comparing fewer characters. This query demonstrates the savings from sequence cleaning alone:

DECLARE @cookie TABLE (CookieId INT IDENTITY, CookieName VARCHAR(20));
INSERT  @cookie VALUES('Joe Porter'),('Matthew Trent'),('Greta Cookie');

DECLARE @unmatched TABLE (Unmatched VARCHAR(30));
INSERT  @unmatched VALUES('Matthew Trint'),('Gretta Cookie'),('Gretta'),('Joey Porter');

SELECT 
  S1  = c.CookieName,
  S2  = u.Unmatched, 
  New = r.S1,
  New = r.S2,
  OriginalLen = LEN(c.CookieName+u.Unmatched),
  NewLen      = LEN(r.S1+r.S2),
  Savings     = LEN(c.CookieName+u.Unmatched)-LEN(r.S1+r.S2)
FROM        @cookie    AS c
CROSS JOIN  @unmatched AS u
CROSS APPLY samd.SequenceClean8KLight(c.CookieName,u.Unmatched) AS r;
Returns:
S1            S2             New        New           OriginalLen NewLen  Savings
------------- -------------- ---------- ------------- ----------- ------- --------
Joe Porter    Matthew Trint  e rter     tte Trt       23          13      10
Joe Porter    Gretta Cookie  oe orte    rett ooe      23          16      7
Joe Porter    Gretta         erter      rett          16          9       7
Joe Porter    Joey Porter    Joe Porter Joe Porter    21          20      1
Matthew Trent Matthew Trint  Matthew .. Matthew Trnt  26          25      1
Matthew Trent Gretta Cookie  atte Tret  retta e       26          16      10
Matthew Trent Gretta         atteTret   retta         19          13      6
Matthew Trent Joey Porter    tte Tret   e rter        24          14      10
Greta Cookie  Matthew Trint  reta ie    atte Trit     25          16      9
Greta Cookie  Gretta Cookie  Greta Co.. Gretta Cookie 25          25      0
Greta Cookie  Gretta         Gretae     Gretta        18          12      6
Greta Cookie  Joey Porter    ret ooe    oe orter      23          15      8

That’s 75 fewer characters leveraging sequence cleaning. Score!

A Set Based Solution to the Longest Common Subsequence (Three Strings)

I'll exclude my two-string LCSQ function for now, but it's included with the attached code. We'll dive deeper into the inner workings of both LCSQ functions in Part 2; samd.LCSQ3_AJB_POC is doing a quick cameo here to demonstrate the power of sequence cleaning against three strings (and that you can solve basically any problem without a loop).

What's worth calling out now is that both LCSQ functions accept a gap parameter (@gap). Yes, we're using a brute-force algorithm, but we're doing so without the need to evaluate every subsequence. Both functions leverage the design pattern detailed in PerfML Part 2: Self-Tuning with a Tally Table. This design pattern allows you to solve LCSQ2 in 2^(N/G+G-1) time and LCSQ3 in 3^(N/G+G-1) time, where G is the gap (with N = 30 and G = 3, for example, that's 2^12 instead of 2^30); something we'll also dig deeper into in Part 2. All you need to know for now is that our LCSQ3 is set-based, relatively fast (for all of its functionality), runs faster with a parallel execution plan, and is deterministic. The function name ends with "_POC" as it's my proof of concept. The focus here is more about learning than tuning.

The Function

samd.LCSQ3_AJB_POC accepts three input strings and calculates the LCSQ between the three. samd.Bernie8K_3S is used to identify the length of the shortest of the three strings (L1); L1 represents the longest possible subsequence length between the three strings. samd.SubSequenceExtractALL looks like it was named by Yoda but works as advertised. It takes three parameters: an input string (@string), a gap parameter (@gap) and a maximum subsequence length (@SL). When @gap=1, the function extracts every subsequence of size 1 through @SL. When @gap=2 the function returns every subsequence of size 1, 3, 5, etc. through @SL. We use L1 as the maximum subsequence length (@SL) when splitting our strings into subsequences, since there aren't going to be any common subsequences longer than the shortest string in the set.

All three strings are passed to samd.SubSequenceExtractALL as separate calls and only subsequences that exist in all three strings are returned. Then we sort by the length of the matched subsequences in descending order, returning the longest match. Leveraging TOP(1) WITH TIES, sorted by the length of the subsequence, we are able to return ties if they exist. The @gap parameter allows you to use the function for approximations. When @gap = 1 and the function returns a six-character subsequence, you know that the LCSQ is six characters long. When @gap = 2 the function is 2X as fast but can be off by one; in this case, if the function returns a six-character string, you know the LCSQ is 6-7 characters long. When @gap = 3, the function is 3X as fast but can be off by as much as two; if it says the LCSQ is six characters long, it might be 6, 7 or 8. A classic case of trading accuracy for performance.

CREATE OR ALTER FUNCTION samd.LCSQ3_AJB_POC
(
  @S1  VARCHAR(20),
  @S2  VARCHAR(20),
  @S3  VARCHAR(20),
  @gap BIGINT
)
RETURNS TABLE AS RETURN
SELECT DISTINCT TOP(1) WITH TIES
  CompressedLength    = s1.CompressedLength,
  S1_SfxIdx           = s1.SuffixIndex,
  S1_Compression_key  = s1.CompressionKey,
  S1_Uncompressed_Len = s1.UncompressedLength,
  S1_Subsequence      = s1.Subsequence,
  S2_SfxIdx           = s2.SuffixIndex,
  S2_Compression_key  = s2.CompressionKey,
  S2_Uncompressed_Len = s2.UncompressedLength,
  S2_Subsequence      = s2.Subsequence,
  S3_SfxIdx           = s3.SuffixIndex,
  S3_Compression_key  = s3.CompressionKey,
  S3_Uncompressed_Len = s3.UncompressedLength,
  S3_Subsequence      = s3.Subsequence
FROM        samd.Bernie8K_3S(@S1,@S2,@S3)              AS b
CROSS APPLY samd.SubSequenceExtractALL(b.S1,@gap,b.L1) AS s1
CROSS APPLY samd.SubSequenceExtractALL(b.S2,@gap,b.L1) AS s2
CROSS APPLY samd.SubSequenceExtractALL(b.S3,@gap,b.L1) AS s3
WHERE       s1.Subsequence      = s2.Subsequence
  AND       s1.Subsequence      = s3.Subsequence
ORDER BY    s1.CompressedLength DESC;

Again, we’ll keep making this guy faster and faster throughout this series until you’re sick of it. Stay tuned.

A Three String Circus (Performance Test)

Unlike problems in P, with problems in NP even reducing the input size (N) by one is a big gain. Let's create a test harness with our example from earlier to demonstrate. Here we're going to capture the I/O statistics and CPU time for:

  • The cleaning function by itself (samd.SequenceReduce3s8KLight), which cleans three strings
  • samd.LCSQ3_AJB_POC without any reduction
  • samd.LCSQ3_AJB_POC with sequence cleaning, run three times: once with @gap=1, once with @gap=2 and once with @gap=3

LCSQ3 Test Harness

DECLARE @S1  VARCHAR(8000) = 'XA.BCD123',
        @S2  VARCHAR(8000) = 'ABC-D-12Z3',
        @S3  VARCHAR(8000) = '12DABCYP123';

SET STATISTICS IO, TIME ON;
PRINT CHAR(10)+'Cleaning Only'+CHAR(10)+REPLICATE('-',90);
 SELECT sr3.* FROM  samd.SequenceReduce3s8KLight(@S1,@S2,@S3) AS sr3;

PRINT CHAR(10)+'LCSQ3 CORE'+CHAR(10)+REPLICATE('-',90);
 SELECT lcsq3.*  FROM samd.LCSQ3_AJB_POC(@S1,@S2,@S3,1) AS lcsq3;

PRINT CHAR(10)+'Reduce Light serial @gap = 1'+CHAR(10)+REPLICATE('-',90);
 SELECT      lcsq3.* 
 FROM        samd.SequenceReduce3s8KLight(@S1,@S2,@S3)  AS sr3
 CROSS APPLY samd.LCSQ3_AJB_POC(sr3.S1,sr3.S2,sr3.S3,1) AS lcsq3
 OPTION (MAXDOP 1); -- default optimizer option

PRINT CHAR(10)+'Reduce Light parallel @gap = 1'+CHAR(10)+REPLICATE('-',90);
 SELECT      lcsq3.* 
 FROM        samd.SequenceReduce3s8KLight(@S1,@S2,@S3)  AS sr3
 CROSS APPLY samd.LCSQ3_AJB_POC(sr3.S1,sr3.S2,sr3.S3,1) AS lcsq3
 CROSS APPLY core.make_parallel();

PRINT CHAR(10)+'Reduce Light parallel @gap = 2'+CHAR(10)+REPLICATE('-',90);
 SELECT      lcsq3.* 
 FROM        samd.SequenceReduce3s8KLight(@S1,@S2,@S3)  AS sr3
 CROSS APPLY samd.LCSQ3_AJB_POC(sr3.S1,sr3.S2,sr3.S3,2) AS lcsq3
 CROSS APPLY core.make_parallel();

PRINT CHAR(10)+'Reduce Light parallel @gap = 3'+CHAR(10)+REPLICATE('-',90);
  SELECT      lcsq3.* 
  FROM        samd.SequenceReduce3s8KLight(@S1,@S2,@S3)  AS sr3
  CROSS APPLY samd.LCSQ3_AJB_POC(sr3.S1,sr3.S2,sr3.S3,3) AS lcsq3
  CROSS APPLY core.make_parallel();
GO
SET STATISTICS IO, TIME OFF;

Results:

Cleaning Only
------------------------------------------------------------------------------------------
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
 SQL Server Execution Times: CPU time = 0 ms,  elapsed time = 1 ms.

LCSQ3 CORE
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 167, logical reads 534191, physical reads 0, page server reads 0, read-ahead reads 187, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
 SQL Server Execution Times: CPU time = 2079 ms,  elapsed time = 1840 ms.

Reduce Light serial @gap = 1
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 137, logical reads 210268, physical reads 0, page server reads 0, read-ahead reads 23, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
 SQL Server Execution Times: CPU time = 750 ms,  elapsed time = 769 ms.

Reduce Light parallel @gap = 1
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 129, logical reads 111171, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
 SQL Server Execution Times: CPU time = 515 ms,  elapsed time = 524 ms.

Reduce Light parallel @gap = 2
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 81, logical reads 61921, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
 SQL Server Execution Times: CPU time = 219 ms,  elapsed time = 210 ms.

Reduce Light parallel @gap = 3
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 65, logical reads 45475, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
 SQL Server Execution Times: CPU time = 234 ms,  elapsed time = 234 ms.

Completion time: 2021-08-10T20:54:09.6885549-05:00

Reviewing the results, the key takeaways are:

  • The cleaning process is instantaneous (0 ms)
  • The core LCSQ3 version without any reduction is the slowest; even with a parallel plan the runtime is 1.84 seconds
  • Adding sequence cleaning speeds things up to 0.75 seconds with a serial plan, ~0.5 seconds when parallel
  • Sequence cleaning reduced the reads from 534,191 to 210,268 with a serial execution plan, and to 111,171 with a parallel plan
  • Setting @gap=2 gets us down to ~0.2 seconds with a margin of error of one
  • Setting @gap=3 actually slows things down a pinch, back up to 0.23 seconds, with a margin of error of two
  • The @gap parameter allows for informed accuracy-versus-performance decisions

Conclusion                                                     

I think we've covered a lot of ground. We learned (or were reminded) that:

  1. Pretty much any complex problem can be solved without loops, recursion, mutable objects or any other procedural, iterative programming logic. Mark another one up for the tally table!
  2. You can speed up the Longest Common Subsequence, in any language, without changing any code, simply by reducing the size of the input.
  3. Leveraging pure set-based code while maintaining functional purity, we can make our functions even faster by forcing a parallel execution plan.
  4. With RangeAB we can leverage the @gap parameter to turn our LCSQ2 and LCSQ3 functions into approximations.

Lastly, I want to send a quick shout-out to Sean Lange. Sean’s always been one of my unofficial internet SQL mentors and I’ve learned a ton from him over the last decade plus. A couple years ago I posted an answer on StackOverflow where I introduced some of these ideas. Sean’s short response lit a fire under my butt and deserves some of the credit.

I hope to have stimulated a few extra synapses out there. In Part 2 we're going to dive deeper into the world of sequence compression, along with how to turbocharge the socks off our LCSQ2 and LCSQ3 functions.

Thanks for reading and happy hunting!