Author/Beta-TNT
As a classic string/sequence matching algorithm, I believe every reader who is graduated from computer science college knows “KMP”. But every examples you can find on the Internet are just boring codes, matching a certain given target string. One of the advantages of KMP algorithm is that there is no need to “watch back”, so, indeed KMP is capable for not just fixed or certain given target string/sequence, but also an unlimited one. In the field of Network Security, we often need to check whether a certain subsequence or sequence fragment matches a ruleset, the KMP is perfect for this situation.
The implementation is quite simple: all we need is a queue which has a max length that equals to the length of pattern string, we use this queue as a “window”. When mismatch happens, the window slides to right, which means append new char to the rear, if the window reached its maximum length, it will pop first char automatically. We check the window’s contents only when it is full.
Window | ||||||||||
Target | … | a | b | a | b | a | b | a | b | … |
Pattern | a | b | a | b | b |
If you don’t want KMP thing, check all the contents every time when it slides, that’s the way Brute Force matching applied to an unlimited sequence. The KMP way is sliding a distance of mismatch index (we call it [i]) of pattern string minus next[i] (i – next[i]), and matching the pattern from the index of next[i] next time when the window is filled. Here is an example:
Index | 0 | 1 | 2 | 3 | 4 |
Pattern | a | b | a | b | b |
Next | 0 | 0 | 0 | 1 | 2 |
Window | ||||||||||||||
Target | a | b | a | b | a | b | a | b | b | |||||
Pattern | a | b | a | b | b | |||||||||
Next | 0 | 0 | 0 | 1 | 2 | |||||||||
Index | 0 | 1 | 2 | 3 | 4 |
Window | |||||||||||
Target | a | b | a | ← | ← | b | a | b | a | b | b |
Pattern | a | b | a | b | b | ||||||
Next | 0 | 0 | 0 | 1 | 2 | ||||||
Index | 0 | 1 | 2 | 3 | 4 |
Window | |||||||||
Target | a | b | a | b | a | b | a | b | b |
Pattern | a | b | a | b | b | ||||
Next | 0 | 0 | 0 | 1 | 2 | ||||
Index | 0 | 1 | 2 | 3 | 4 |
Window | |||||||||
Target | a | b | a | b | a | b | a | b | b |
Pattern | a | b | a | b | b | ||||
Next | 0 | 0 | 0 | 1 | 2 | ||||
Index | 0 | 1 | 2 | 3 | 4 |
popped | Window | ||||||||
Target | a | b | a | b | a | b | a | b | b |
Pattern | a | b | a | b | b | ||||
Next | 0 | 0 | 0 | 1 | 2 | ||||
Index | 0 | 1 | 2 | 3 | 4 |
popped | Window | ||||||||
Target | a | b | a | b | a | b | a | b | b |
Pattern | a | b | a | b | b | ||||
Next | 0 | 0 | 0 | 1 | 2 | ||||
Index | 0 | 1 | 2 | 3 | 4 |
popped | Window | ||||||||
Target | a | b | a | b | a | b | a | b | b |
Pattern | a | b | a | b | b | ||||
Next | 0 | 0 | 0 | 1 | 2 | ||||
Index | 0 | 1 | 2 | 3 | 4 |
popped | Window | ||||||||
Target | a | b | a | b | a | b | a | b | b |
Pattern | a | b | a | b | b | ||||
Next | 0 | 0 | 0 | 1 | 2 | ||||
Index | 0 | 1 | 2 | 3 | 4 |
As the way how to match an unlimited sequence, I use the “feed data” mechanism. If the window slides a length longer than 2 chars, matching will not begin until the window is filled. The feed data function will return None
if the sequence doesn’t meet the pattern, or “i” pointer (index of the begin of matched subsequence) and window content if matches.
Here is my Python implementation: https://github.com/Beta-TNT/KMP_Unlimited/blob/main/kmp_unlimited.py