Improvement Of Classic KMP Matching Algorithm: Apply To An Unlimited Sequence

Author/Beta-TNT

 

As a classic string/sequence matching algorithm, I believe every reader who is graduated from computer science college knows “KMP”. But every examples you can find on the Internet are just boring codes, matching a certain given target string. One of the advantages of KMP algorithm is that there is no need to “watch back”, so, indeed KMP is capable for not just fixed or certain given target string/sequence, but also an unlimited one. In the field of Network Security, we often need to check whether a certain subsequence or sequence fragment matches a ruleset, the KMP is perfect for this situation.

The implementation is quite simple: all we need is a queue which has a max length that equals to the length of pattern string, we use this queue as a “window”. When mismatch happens, the window slides to right, which means append new char to the rear, if the window reached its maximum length, it will pop first char automatically. We check the window’s contents only when it is full.

Window
Target a b a b a b a b
Pattern a b a b b

If you don’t want KMP thing, check all the contents every time when it slides, that’s the way Brute Force matching applied to an unlimited sequence. The KMP way is sliding a distance of mismatch index (we call it [i]) of pattern string minus next[i] (i – next[i]), and matching the pattern from the index of next[i] next time when the window is filled. Here is an example:

Pattern string and its NEXT array
Index 0 1 2 3 4
Pattern a b a b b
Next 0 0 0 1 2

 

Initial status
Window
Target a b a b a b a b b
Pattern a b a b b
Next 0 0 0 1 2
Index 0 1 2 3 4

 

Enqueueing, filling the window
Window
Target a b a b a b a b b
Pattern a b a b b
Next 0 0 0 1 2
Index 0 1 2 3 4

 

Window filled, matching starts
Window
Target a b a b a b a b b
Pattern a b a b b
Next 0 0 0 1 2
Index 0 1 2 3 4

 

First mismatch
Window
Target a b a b a b a b b
Pattern a b a b b
Next 0 0 0 1 2
Index 0 1 2 3 4

 

Sliding distance = Index-Next[Index] = 4 – 2 = 2
Start matching from = Next[Index] = 2
popped Window
Target a b a b a b a b b
Pattern a b a b b
Next 0 0 0 1 2
Index 0 1 2 3 4

 

Second mismatch
popped Window
Target a b a b a b a b b
Pattern a b a b b
Next 0 0 0 1 2
Index 0 1 2 3 4

 

Sliding distance = Index-Next[Index] = 4 – 2 = 2
Start matching from = Next[Index] = 2
popped Window
Target a b a b a b a b b
Pattern a b a b b
Next 0 0 0 1 2
Index 0 1 2 3 4

 

Matched
popped Window
Target a b a b a b a b b
Pattern a b a b b
Next 0 0 0 1 2
Index 0 1 2 3 4

As the way how to match an unlimited sequence, I use the “feed data” mechanism. If the window slides a length longer than 2 chars, matching will not begin until the window is filled. The feed data function will return None if the sequence doesn’t meet the pattern, or “i” pointer (index of the begin of matched subsequence) and window content if matches.

Here is my Python implementation: https://github.com/Beta-TNT/KMP_Unlimited/blob/main/kmp_unlimited.py