By the end of this article should understand and will have sample code which can be used to write what I call a self-tuning function – a function which can continuously tune itself better performance over time. This without libraries or third party code, only algorithms and concepts. What you may appreciate most about this approach to writing self-tuning functions is that you can apply it any major programming languages that supports side-effect free functions; especially Functional Programming (FP), Python, .NET and pretty much any flavor of SQL. I came up with this concept while working on a version of the traveling salesperson problem for a logistics company. I quickly discovered that these ideas could be easily applied to common everyday problems Throughout this series we will apply this concept to a simple string and pattern search algorithm then to some more complex but useful functions, all the way through the longest common subsequence between three strings which is NP-Complete (for the general case of an arbitrary number of input sequences).
By the way, welcome to my blog! I created the CodingBlackArts to introduce a whole new approach to writing modular, scalable nasty fast code, language-agnostic. My background has been with Microsoft SQL Server but I’m experienced with most with various flavors of SQL and other declarative programming languages such as MDX and XSLT. Writing high-performing set-based SQL code is my second passion by my first true love was functional programming (FP) and, when possible I like to dabble with XSLT 3.0, Scala and Clojure. Sadly, I’ve never had an opportunity to focus on FP professionally so I’ve come up with ways b ring my favorite FP concepts into my SQL development. Pure functions, higher-order functions, functions as first Class Citizens, lazy sequences, modularity and even code as data. Some of this you will see throughout the series.
PerfML is a collection of ideas, concepts and algorithms intended to enable you to: (1) write super-fast, clean and deterministic functions, (2) teach your functions to make themselves faster (3) teach your functions how to teach themselves how to be faster. The self-tuning function is a core component of PerfML. Perf – as in performance, ML as in Machine Learning. PerfML relies heavily on some new approximation algorithms, reductions and even a new kind of parameter I developed. This is the first of a many articles on this topic. I’ve spent nearly seven years developing this new method of solving old problems and I’m excited to share it with you.
First, functions are the center of the PerfML Universe and a the top of that hierarchy is fnTally, a T-SQL function that produces an ordered lazy sequence of numbers similar to python and Clojure’s range function.
This article has the first iteration of code examples which involve loops and T-SQL scalar user defined functions (referred to herein as scalar UDFs). This is for learning purposes only, loops and scalar UDFs perform poorly in MS SQL Server. Loops are generally gross as a math concept in my opinion, especially when developing functions. For now it’s important to understand the concepts, especially if you are new to programming, we’ll address performance later in the series.
Second, as with all CodingBlackArts content, this article is a living document and will be updated frequently. Please forgive any spelling/grammar issues, sample code issues and such. My TOP(1) priority is to deliver my my code, concepts and algorithms as quickly and frequently as possible. than getting my grammar perfect. Please comment below of ping me with ideas, thought, complaints and suggestions.
Third, don’t be discouraged if you are new to programming or SQL specifically. The subject matter is a little advanced but I will do my best to present this subject matter in a way that as many developers and curious onlookers can grasp. When I was new to programming I read a lot of articles that middle little or no sense to me – running and playing with code with was a great learning tool early on and still is.
Let’s begin with a basic string search algorithm then turn it into a self-tuning function. The requirement is a function that searches a string(S) for the longest consecutive series of a specific character(C), between L and H characters long. Consider these T-SQL parameters and function:
DECLARE @C CHAR(1) = 'x', @S VARCHAR(MAX) = 'ABC.xxxyz.xxxx!!', @L BIGINT = 3, @H BIGINT `= 6; SELECT dbo.stringSearch_exact(@C,@S,@L,@H);
Per the above parameters, we need dbo.stringSearch_exact to return the length of the longest series of the letter, “x” in the string ABC.xxxyz.xxxx!!, that is between 3 and 6 characters long. The correct result is 4 because the text, “xxxx” appears at position 11.
Solution 1: An Exact Algorithm
First for the most direct but least performant solution leveraging a loop and a basic brute force search algorithm. The routine will search for a series of the character, @C that that is @L characters long. If no match is found the routine returns zero (0). If a match does exist the routine begins a search for a series with a length of @H, decrementing @H by 1 until a match is found or until @H=@L.
-- KEY: Note that, with this algorithm: When @H=@L you know @L is the correct answer. DECLARE @C CHAR(1) = 'x', @S VARCHAR(MAX) = 'zzzabc123xxxxxxxxgggggggfffxxxxxxx', @L BIGINT = 5, @H BIGINT = 20; PRINT CHAR(13)+'Function logic as a loop'+CHAR(13)+REPLICATE('-',90); WHILE @L<=@H BEGIN DECLARE @hIdx BIGINT = CHARINDEX(REPLICATE(@C,@H),@S); PRINT CONCAT('@H:',@H,'; Idx:',@hIdx,';-',IIF(@hIdx>0,' Hit',' No Hit')); IF CHARINDEX(REPLICATE(@C,@L),@S)=0 BEGIN SELECT 0; BREAK END; -- You done IF @L=@H BEGIN SELECT @H; BREAK END; -- You done IF CHARINDEX(REPLICATE(@C,@H),@S)>0 BEGIN SELECT @H; BREAK END; -- You done SET @H=@H-1; -- Lets try again (variable mutation bad - how dare me) END;
There are a couple subtle but slick optimizations worth pointing out. First – if there isn’t a match only one operation is performed. It would be easy to just start counting down from @H, performing O(H) operations to discover that there’s no match. Next, when @H-1 =@L the routine skips the final iteration because it’s done. This means that we a spared one operation in worst case scenarios. The former optimization is a big deal, the latter is a big deal when dealing with really big strings but not smaller ones. I point these optimizations out because they are both math victories. In programming I like to think of each math victory as points in a video game that help you advance to the next level. Even the small ones add up. I digress.
A PRINT statement is included to see what’s happening under the hood.
@H:20; Idx:0; - No Hit @H:19; Idx:0; - No Hit @H:18; Idx:0; - No Hit @H:17; Idx:0; - No Hit @H:16; Idx:0; - No Hit @H:15; Idx:0; - No Hit @H:14; Idx:0; - No Hit @H:13; Idx:0; - No Hit @H:12; Idx:0; - No Hit @H:10; Idx:0; - No Hit @H:11; Idx:0; - No Hit @H:9; Idx:0; - No Hit @H:8; Idx:10; - Hit
The parameters above will require the routine perform 13 iterations for the correct result. Things quickly get worse when the string gets longer and/or the distance between @L and @H increases. Try the routine above with different parameters. Note the best- and worst-case scenarios. Below is our routine wrapped in a scalar UDF :
CREATE FUNCTION dbo.stringSearch_Exact ( @C CHAR(1), -- Search character @S VARCHAR(MAX), -- Input String @L BIGINT, -- Lower Boundary @H BIGINT -- Upper Boundary ) /* Created by Alan Burstein 20200916 WARNING - SLOW CODE, INTENDED FOR LEARNING PURPOSES ONLY Note the RETURNS NULL ON NULL INPUT. This has performance benefits and will prevent an infinite loop for when @H is NULL */ RETURNS BIGINT WITH SCHEMABINDING, RETURNS NULL ON NULL INPUT AS BEGIN WHILE @L<=@H BEGIN IF CHARINDEX(REPLICATE(@C,@L),@S)=0 RETURN 0; IF @L=@H RETURN @H; IF CHARINDEX(REPLICATE(@C,@H),@S)>0 RETURN @H; SET @H=@H-1; END RETURN NULL; END;
We call dbo.stringSearch_Exact like this:
SELECT dbo.stringSearch_Exact('C', 'ABCCCCDEF', 1, 5); -- RETURNS: 4
One small thing to note is how the value of @H changes with each iteration. This is referred to as variable mutation a programming anti-pattern which is why this is the last time you will see it here. Variable mutation will be replaced with recursion later in our next solution.
Solution 2: An Approximation Algorithm
What if we didn’t need an exact match? What if we introduced a margin of error, allowing the answer can be off by a factor of N (N as a user-supplied boundary)? Say we are seeking a match between 5 and 10 characters long, but it doesn’t have to be exact.
Enter the approximation algorithm. Approximations are a vital component of PerfML, consider this definition from Wikipedia:
In computer science and operations research, approximation algorithms are efficient algorithms that find approximate solutions to optimization problems (in particular NP-hard problems) with provable [deterministic] guarantees on the distance of the returned solution to the optimal one.
Approximation Algorithms – Wikipedia
The intended class of problems are NP-hard, a a class of problems most people outside the data science universe rarely deal with. Approximation algorithms were created for this class of problems but work great on the easier stuff, especially when brute force is involved. This needs to change, too many developers are missing out.
Let’s transform our exact algorithm into an approximation algorithm. For this we’ll use a gap parameter. A gap parameter (we’ll call @Gap or @G for short) allows us to decrease the size of a sequence (AKA set) by decrementing by the value of @G instead of 1. So, for example, when: @L=1, @H=10, @G=1, the routine counts down from 10,9,8…2,1. That’s the exact algorithm from above (the a static value instead of @G. If you set @G=2 the routine counts down – 10,8,6…
DECLARE @L BIGINT = 1, @H BIGINT = 20, @G BIGINT = 2; PRINT CONCAT('@N:',@H,' (Start)'); WHILE @L<@H BEGIN SET @H=IIF(@H-@G < @L, @L, @H-@G); PRINT CONCAT('@H:',@H); END
To prevent the final iteration from being negative or less than @L:
Again, we still have the option of @G=1 for an exact result by setting the gap parameter to 1: @G=1, or it can be greater. For brevity I’m not adding logic to prevent you from setting @G less than 1, NULL or longer than your string but keep that in mind for now.
DECLARE @C CHAR(1) = 'x', @S VARCHAR(MAX) = '123xxxyyyzzz123xxxxxxxxgggggggxxxxxxx', @L BIGINT = 5, @H BIGINT = 20, @G BIGINT = 2; PRINT CHAR(13)+'Function logic as a loop'+CHAR(13)+REPLICATE('-',90); DECLARE @I BIGINT = 1; WHILE @L<=@H BEGIN DECLARE @hIdx BIGINT = CHARINDEX(REPLICATE(@C,@H),@S); PRINT CONCAT('@H:',@H,'; Idx:',@hIdx,';-',IIF(@hIdx>0,' Hit',' No Hit')); IF CHARINDEX(REPLICATE(@C,@L),@S)=0 BEGIN SELECT RESULTS=0; BREAK END; IF @L=@H BEGIN SELECT RESULTS=@H; BREAK END; IF CHARINDEX(REPLICATE(@C,@H),@S)>0 BEGIN SELECT RESULTS=@H; BREAK END; SELECT @H=@H-@G, @I=@I+@G; END
To understand how different values for @G impact the workload consider:
The beauty here is how easy it is to calculate the work/accuracy trade-off for as @G increases reducing the workload while increasing your margin of error. In the analysis above the cost/accuracy trade-off appears to diminish when @G=3. Here, the “sweet spot” is somewhere between 2 and 3. The sweet spot is, however, a constantly moving target; finding it is both Art and Science. The important thing is to understand is the best/worst case scenarios and associated trade-offs.
In the world of Business Intelligence there is a saying, “if you can’t measure it, you can’t manage it.” For example, how could you manage your direct reports if you didn’t know how many you had? Once we go set-based with fnTally we’ll be able to capture real metrics from the execution plan to determine which parameter would have been best for the different kinds of data you are dealing with. Now, when we do a performance test, we aren’t only testing performance, we running a simulation! Black Mirror fans should be excited and nervous.
The Approximation Function
Our new logic is going to be in the form of a self-referencing recursive scalar UDF. I chose a recursive scalar UDF because:
- They allow for the cleanest, most lightweight, universal and portable syntax possible in T-SQL. Even programmers unfamiliar with SQL should understand and apply this code to the language of their choosing provided they understand basic recursion (don’t feel bad if you don’t, just remember that recursion should cause you to think about recursion).
Note that logic for dbo.stringSearchV1 is only three lines.
- There will be good recursion use later on leveraging recursive CTEs which is a little more complicated, the more simple recursive scalar UDF will serve as a good warm-up.
- A slow scalar recursive UDF is a great way to demonstrate the power of the algorithm; in the test below we’ll look for matching patterns in strings millions and even billions of characters long; we’ll get our answer in seconds.
- There’s is barely any documentation available on recursive Scalar UDFs in SQL and nothing anywhere about tail recursion in T-SQL server functions – an important topic.
Approximation Function Version #1: dbo.stringSearchV1
CREATE OR ALTER FUNCTION dbo.stringSearchV1 ( @C VARCHAR(100), -- Search String @S VARCHAR(MAX), -- Input String @L BIGINT, -- Lower Boundary @H BIGINT, -- Upper Boundary @G BIGINT -- Gap Optimizer ) /* Created by Alan Burstein on 20200916 WARNING: SLOW FUNCTION - FOR LEARNING PURPOSES ONLY */ RETURNS BIGINT WITH RETURNS NULL ON NULL INPUT AS BEGIN IF CHARINDEX(REPLICATE(@C,@L),@S)=0 RETURN 0; -- Startup Predicate IF CHARINDEX(REPLICATE(@C,@H),@S)>0 OR @H=@L RETURN @H; -- Final Recursive call RETURN dbo.stringSearchV1(@C,@S,@L,IIF(@H-@G<@L,@L,@H-@G),@G); -- Next Recursive call END;
For each iteration where there is still no match without a match, the function calls itself reducing @H by @G using with this formula:
If, on the final iteration, @H-@G is less than @L (and therefore invalid), @L is used instead of @H-@G. E.g when @H=10, @L=1, @G=4: the pattern size evaluated would be 10,6,2, then 1. If @H=20, @L=10, @G=6: the sizes would be 20,14,10.
As mentioned earlier, and especially in the world of declarative programming (which includes T-SQL), performance is best when parameters and variables are immutable. WRONG:
SET @H = IIF(@H-@G < @L, @L, @H-@G) RETURN dbo.stringSearchV1(@C, @S, @L, @H, @G)
Instead of mutating the value of @H the function recursively calls itself with new parameter values. CORRECT:
RETURN dbo.stringSearchV1(@C, @S, @L, IIF(@H-@G<@L,@L,@H-@G), @G)
Approximation Function Performance Test
Again, this article is to get you familiar with the algorithm in Part 2 we’ll turbo charge things with dbo.fnTally, a tally table function.
For our performance test we’re going to:
- Set up a sample string 21,470,000+ characters long, a good value for stress testing
- Create and populate a table variable (@gaps) with different values to assign to our gap parameter (@G)
- Run a performance test with each parameter values for @G
- Prints the results for review
21,470,000 characters is a good for stress testing, it’s ~1% of the VARCHAR-MAX limit (2,147,483,647). Using the parameters below, dbo.stringSearchV1 takes roughly 2 seconds when @G=10. Since it’s performance is linear, I know (and have confirmed) that 214,700,000 characters will take ~20 seconds, 2.147 billion characters (the limit) takes roughly 200 seconds (~3.5 minutes.) Not bad for a scalar UDF.
-- 1. Set up the sample string and parameters DECLARE @X VARCHAR(MAX) = '?'; DECLARE @C VARCHAR(100) = 'x', @S VARCHAR(MAX) = '555xx'+@X+'xxxxxxx'+REPLICATE(@X,21470000)+'zzz123xxxggffxxx', @L BIGINT = 1, @H BIGINT = 20, @G BIGINT, @O BIGINT, @ST DATETIME; -- 2. Create and populate a temp table with gap parameter(@G) values DECLARE @gaps TABLE (G BIGINT PRIMARY KEY); INSERT @gaps VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(20),(50),(100),(500),(1000); -- 3. Start the test PRINT 'APPROXIMAION TEST'+CHAR(13)+REPLICATE('-',90); WHILE EXISTS (SELECT * FROM @gaps) BEGIN SELECT TOP(1) @G = g.G FROM @gaps AS g ORDER BY G; DELETE FROM @gaps WHERE g=@G; SELECT @ST=GETDATE(); -- Set the test start time SELECT L = reduce.L FROM (VALUES(dbo.stringSearch(@C,@S,@L,@H,@G))) AS reduce(L); -- 4. Print the results PRINT CONCAT(@G,': ',DATEDIFF(MS,@st,GETDATE())); END;
Approximation Function Test Results
I colored the values green for what I consider the best performance/accuracy trade-offs. The slowest, but correct, answer is 7 at 4.3 seconds for an exact result (@G=1). 2 seconds to determine that the value is between 5 and 9 (@G=5), 1 second to determine that it’s between 4 & 11 (@G=8) and 0.5 seconds to determine that it’s between 1 & 20 (@G=20.) This is another example of how to trade accuracy for performance in a very predictable fashion.
Solution 3: The Dynamic Algorithm
Can we eat our cake and have it too? Can we enjoy the performance benefits of the approximation algorithm without compromising on accuracy? What if we combined our approximation with an exact algorithm?
This is where things get good. Let’s revisit our approximation performance test and make a quick change. Let’s use a simplified string and set @G=5, which will assign the lowest allowable value @Low. Next assign the highest possible value to @High, which is @Low+@G-1.
DECLARE @C VARCHAR(100) = 'x', @S VARCHAR(MAX) = '999xxxxxxxABC123', @L BIGINT = 1, @H BIGINT = 20, @G BIGINT = 5; -- Gap Parameter set to 5 -- 1. Reduce the search area leveraging the approximation algorithm: DECLARE @Low BIGINT = dbo.stringSearchV1(@C,@S,@L,@H,@G); -- Approximation algorithm DECLARE @High BIGINT = @Low+@G-1; -- 2. Exact Algorithm SELECT LowerBound = @Low, UpperBound = @High, Exact = dbo.stringSearchV1(@C,@S,@Low,@High,1);
Using the exact algorithm (@G=1) would require 13 iterations to determine the correct answer is 7. Seeking an approximation, accurate to within 4, we set @G=5. First we perform 4 iterations to determine that the search item is at least 5 but no more than 9. Next we make a second function call with @L=5, @H=9, and set @G=1 for an exact result, which takes 3 iterations. That’s a total of 7 iterations instead of 13! We just reduced the work by almost 1/2 without any loss of accuracy. That’s huge! Data Scientists and Complexity Theory nerds refer to this type of optimization as a combinatorial optimization leveraging a dynamic algorithm. A dynamic algorithm can be defined as:
A general class of algorithms which solve problems by solving smaller versions of the problem, saving the solutions to the small problems and then combining them to solve the larger problem.
The trick is to choose the correct value for @G or, better yet, write a routine that picks the best value for you. Consider the chart below which shows the worst case scenario for when @L=1 and @H=20. “@G” is the gap parameter values from 1-10. “Approx” represents the maximum number of approximation iterations, “Exact” represents the maximum number of exact iterations. “Total” combines the the Approx+Exact for the worst case scenario. The worst case scenario will always be when @G=1, things improve as @G get’s higher.
Notice that 4 and 5 appear to be the sweet spot. This does not mean that when @G=4, or @G=5 the query is guaranteed to be the fastest, only that they are most likely to be best values. Lets run our performance test from earlier but this time we’ll leverage our new dynamic algorithm for an exact result.
Dynamic Algorithm Performance Test
DECLARE @X VARCHAR(MAX) = '?'; DECLARE @C VARCHAR(100) = 'x', @S VARCHAR(MAX) = '555xx'+@X+'xxxxxxx'+REPLICATE(@X,21470000)+'zzz123xxxggffxxx', @L BIGINT = 1, @H BIGINT = 20, @G BIGINT, @O BIGINT, @ST DATETIME; PRINT 'APPROXIMAION TEST'+CHAR(13)+REPLICATE('-',90); DECLARE @gaps TABLE (G BIGINT PRIMARY KEY); PRINT 'EXACT APPROXIMAION (DYNAMIC)'+CHAR(13)+REPLICATE('-',90); INSERT @gaps VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(15),(20),(30); WHILE EXISTS (SELECT * FROM @gaps) BEGIN SELECT TOP(1) @G = g.G FROM @gaps AS g ORDER BY G; DELETE FROM @gaps WHERE g=@G; SELECT @ST=GETDATE(); -- Start the test DECLARE @Low BIGINT = dbo.stringSearchV1(@C,@S,@L,@H,@G); -- Approximation DECLARE @High BIGINT = @Low+@G-1; SELECT dbo.stringSearchV1(@C,@S,@Low,@High,1); -- @G=1: Exact PRINT CONCAT(@G,': ',DATEDIFF(MS,@st,GETDATE())); -- Print the results END;
Dynamic Algorithm Performance Test Results:
With the final, dynamic algorithm it appears that 4 and 6 were the best for this specific problem. We were able to determine this by running a test against a single string while trying out different values for our gap parameter (@G). In Part 2 I will show you how to automate the selection of the best value for @G so as to transform our function from one we’ve tuned manually into one that tunes itself. See you there!