
# Exploring Regular Expressions for Fuzzy-Match Queries [Translation]

### Original: Fuzzy Scoring Regex Mayhem

Autocompletion is never an entirely solved problem. Can anyone really say what on earth a user typing “uni” into a country input field actually intends to select? It could match any of these:

• Tanzania, [U][n][i]ted Republic of
• [U][n][i]ted Arab Emirates
• [U][n][i]ted Kingdom
• [U][n][i]ted States
• T[u][n][i]sia

Of course, it’s probably not the last one, but that right there is a human intuition that we often forget to instil into these UI interactions.

We can divine what the user probably intends most of the time but it’ll always be a game of heuristics. Most solutions shy away from this game, opting instead to match the query letter-for-letter in each potential value, and this is usually sufficient, but without any other logic not only will “la” match “Latvia” but also “Angola”. And usually “Ltvia” will match nothing whatsoever, even though it’s seemingly obvious what the user is trying to type.
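For contrast, here is a sketch of a boolean subsequence matcher in the spirit of the fuzzysearch library linked at the end of this post (the name `fuzzyBoolean` is mine). It checks that the query's characters appear in order, but it can only answer yes or no:

```javascript
// Boolean subsequence matching: true if every character of `needle`
// appears, in order, somewhere in `haystack`. There is no score,
// so "la" matching "Latvia" cannot be ranked above "la" in "Angola".
function fuzzyBoolean(needle, haystack) {
  needle = needle.toLowerCase();
  haystack = haystack.toLowerCase();
  var from = 0;
  for (var i = 0; i < needle.length; i++) {
    from = haystack.indexOf(needle[i], from) + 1;
    if (from === 0) return false; // Character not found in order
  }
  return true;
}

fuzzyBoolean('la', 'Latvia'); // => true
fuzzyBoolean('la', 'Angola'); // => true (same answer; no way to prefer Latvia)
```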

If you try implementing a fuzzy matcher to solve this, the first revelation is that you can’t just boolean-match the query against the data like so many solutions do. You need to score each potential match. Hopefully, in the case of country selection, you end up with a sensible subset of countries that match the query to some reasonable degree. This scoring is necessary so that you know what you’re putting at the top of the list. When typing “U”, the user expects Ukraine or Uzbekistan sooner than Mauritius or Sudan, for example.

(Translator's note: an example of boolean matching: https://github.com/bevacqua/fuzzysearch)

Oddly, if you looked at the most common autocompletion widget out there (jQuery UI), it doesn’t appear to follow this intuition.

Even the most graceful solutions tend to avoid the muddiness of dealing with mistakes like “untied states” or “leichtenstein”. Sure, the likeliness of a person having to type the name of a country they aren’t intimately familiar with is probably quite low, but people still make mistakes.

I’ve been intrigued by this topic for quite a while and it’s why I originally made relevancy.js. It solves the problem quite well, I think, and it does so in a pretty transparent way, with scores applied for various qualities such as the index of the query within the target string (“king” scores higher than “dom” in “kingdom”, for example), but it’s still quite a lot of code for such a tiny part of an overall user experience.

I have once again been playing with this problem (thanks to a certain tweet) and have so wanted to come up with something stupefyingly graceful.

It all starts with a scratch in the back of your mind — the one that tells you that your time has come. The world requires you to use regular expressions.

Warning: I don’t sincerely recommend doing any of this. It’s just a bit of fun. It’s probably an inefficient, unreliable, obscure and ultimately foolish endeavour!


## Let’s begin!


A static France might look like this:

/^France$/


A more lenient France might be less sensitive to its case:

/^france$/i


We could then allow the characters to be optional too:

/^f?r?a?n?c?e?$/i


This would match “f” and “franc” and “FaE”, etc. But… users make even more grievous mistakes sometimes, and our regular expression should be able to handle those. So let’s add a single character of leniency between each legitimate character, and at the beginning and end of the string:

/^.?f?.?r?.?a?.?n?.?c?.?e?.?$/i
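A quick sanity check of what this lenient pattern accepts, and where it breaks down (a sketch assuming the regex above):

```javascript
// Every legitimate character is optional, with an optional single
// "mistake" character allowed between each.
var lenient = /^.?f?.?r?.?a?.?n?.?c?.?e?.?$/i;

lenient.test('france');   // => true
lenient.test('fr4nce');   // => true (a single mistake per slot is fine)
lenient.test('FaE');      // => true (optional characters, case-insensitive)
lenient.test('fafafafa'); // => true (the flaw: contiguous mistakes slip through)
```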


But then this would allow contiguous mistakes like “fafafafa”. We only want to allow a single mistake after each successfully entered character. For this we can use groups to force each character to be matched, and a lazy quantifier on the mistake character, to ensure that legitimate characters get to match successfully.

So:

/f.?otherStuff/


Becomes:

/(?:f.??)?otherStuff/


In English: Try to match f followed by otherStuff. If impossible then try to match any character after f but before otherStuff. (This is why lazy quantifiers (e.g. ??) are so useful!)
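To see the laziness in action, here is a small sketch with otherStuff concretely set to 'rance':

```javascript
// The lazy .?? is skipped on the first attempt; the engine only
// backtracks to consume a mistake character if the rest of the
// pattern cannot match without it.
var lazyMistake = /^(?:f.??)?rance$/;

'france'.match(lazyMistake)[0];  // => "france" (the .?? consumed nothing)
'fxrance'.match(lazyMistake)[0]; // => "fxrance" (the .?? absorbed 'x' on backtracking)
lazyMistake.test('fxxrance');    // => false (only one mistake character is allowed)
```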

The entire regex would become:

/^(?:f.??)?(?:r.??)?(?:a.??)?(?:n.??)?(?:c.??)?(?:e.??)?$/i


We should probably capture each individual match (f should be (f)) so that we can analyze the result and score it appropriately:

var r = /^(?:(f).??)?(?:(r).??)?(?:(a).??)?(?:(n).??)?(?:(c).??)?(?:(e).??)?$/i

'f$R-aN_cEx'.match(r); // => ["f$R-aN_cEx", "f", "R", "a", "N", "c", "E"]


The regular expression, broken down:

/
^       # Start of string

(?:     # Non-captured group
  (f)   # Match and capture 'f'
  .??   # Followed lazily by any character
)?      # Entire group is optional

(?:     # Non-captured group
  (r)   # Match and capture 'r'
  .??   # Followed lazily by any character
)?      # Entire group is optional

...     # Etc.

$       # End of string
/i


A quick note: “lazy” or “lazily” in the context of regular expressions simply means that the thing in question will be intentionally excluded from the first match attempt, and will only be used if the subsequent regular expression is unsuccessful without it.

One caveat with the above regex is that it doesn’t allow a mistake at the beginning of the string. We could fix this with a lookahead to the effect of “allow a mistake here as long as it’s followed by a non-mistake”, but since a “non-mistake” could effectively be any character in the legitimate string, it’s easier to just make allowances for that initial mistake in each group. Additionally, we probably want to capture every single mistake, in addition to legitimate characters. Here’s our next iteration:

/
^         # Start of string

(?:       # Non-captured group
  (^.)?   # Captured optional mistake at the beginning of the string
  (f)     # Match and capture 'f'
  (.??)   # Followed lazily by any character (captured)
)?        # Entire group is optional

...       # Etc.

$         # End of string
/i


The check (^.)? has to be specified in each group, to account for mistakes that don’t involve “f”, like “krance” or “ttance”, etc.
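We can verify this against the expanded “france” regex from the iteration above (a quick sketch):

```javascript
// The full per-character pattern for "france", with (^.)? in every group:
var r = /^(?:(^.)?(f)(.??))?(?:(^.)?(r)(.??))?(?:(^.)?(a)(.??))?(?:(^.)?(n)(.??))?(?:(^.)?(c)(.??))?(?:(^.)?(e)(.??))?$/i;

var m = 'krance'.match(r);

m[1]; // => undefined (the "f" group never matched, so its (^.)? is void)
m[4]; // => "k" (the initial mistake is caught by the "r" group's (^.)?)
m[5]; // => "r"
```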

Since we’re aiming to genericize this entire mess, we should create a generator that assembles the regular expression given any piece of text:

function makeFuzzyRegex(string) {

  if (!string) { return /^$/; }

  // Escape any potential special characters:
  var cleansed = string.replace(/\W/g, '\\$&');

  return RegExp(
    '^' +
    cleansed.replace(
      // Find every escaped and non-escaped char:
      /(\\?.)/g,
      // Replace with fuzzy character matcher:
      '(?:(^.)?($1)(.??))?'
    ) +
    '$',
    'i'
  );
}

makeFuzzyRegex('omg');
// => /^(?:(^.)?(o)(.??))?(?:(^.)?(m)(.??))?(?:(^.)?(g)(.??))?$/i


This regex matched against ‘_o-m*g!’ produces:

[
  // Full match:
  "_o-m*g!",
  // Captures:
  "_",       // Mistake
  "o",       // Legit
  "-",       // Mistake
  undefined, // Void mistake
  "m",       // Legit
  "*",       // Mistake
  undefined, // Void mistake
  "g",       // Legit
  "!"        // Mistake
]


The captures come in groups of three, with every second capture being the legitimate character (case-insensitive), and every first and third capture being a potential mistake.

We can then loop through these captures and apply weights as we see fit:

var fullMatch = makeFuzzyRegex('omg').exec('_o-m*g!');
var captures = fullMatch.slice(1); // Get captures specifically

var score = 0;

for (var i = 0, l = captures.length; i < l; i += 3) {
  if (captures[i]) score -= 1;
  if (captures[i+1]) score += 10;
  if (captures[i+2]) score -= 1;
}

score; // => 26


That scoring is quite arbitrary, but we’ve at least prescribed our wish to reward successes more than we punish mistakes (10 vs 1).

We can start to play with the heuristics of this if we wrap it all up:

function createFuzzyScorer(text) {

  var matcher = makeFuzzyRegex(text);

  return function(query) {

    var match = matcher.exec(query);

    if (!match) return 0;

    var captures = match.slice(1);
    var score = 0;

    for (var i = 0, l = captures.length; i < l; i += 3) {
      if (captures[i]) score -= 1;
      if (captures[i+1]) score += 10;
      if (captures[i+2]) score -= 1;
    }

    return score;
  };

  function makeFuzzyRegex(string) {

    if (!string) { return /^$/; }

    // Escape any potential special characters:
    var cleansed = string.replace(/\W/g, '\\$&');

    return RegExp(
      '^' +
      cleansed.replace(
        // Find every escaped and non-escaped char:
        /(\\?.)/g,
        // Replace with fuzzy character matcher:
        '(?:(^.)?($1)(.??))?'
      ) +
      '$',
      'i'
    );
  }
}


Our first attempt isn’t too bad:

var score = createFuzzyScorer('omg');

score('omg');     // => 30
score('xOmg');    // => 29
score('.o.m.g.'); // => 26
score('om');      // => 20
score('og');      // => 20
score('o');       // => 10
score('nope');    // => 0


These seem like sensible enough scores, generally, but we’re more interested in autocompletion, and so there’s an obvious predictive element there. If a user types ‘o’ then that should probably score higher than ‘g’ if we’re testing against ‘omg’, but with the above mechanism they both receive a standard 10:

var score = createFuzzyScorer('omg');

score('o'); // => 10
score('g'); // => 10


We can fix this by applying a higher weight to matches that appear earlier in the string:

// The scoring loop:
for (var i = 0, l = captures.length; i < l; i += 3) {
if (captures[i]) score -= 0.1;
if (captures[i+1]) score += (l - i) / l; // the magic
if (captures[i+2]) score -= 0.1;
}


Now the score given for any singular legitimate match will decrease as the index (i) increases. Here are the results:

var score = createFuzzyScorer('omg');

score('omg');     // => 1.99
score('xOmg');    // => 1.90
score('om');      // => 1.66
score('.o.m.g.'); // => 1.59
score('og');      // => 1.33
score('o');       // => 1.00
score('nope');    // => 0.00


This is getting closer to our intuition. The next step would be to try to create a real autocompletion widget. I’ve done it so I know that we’ll want to make one more change. The problem with our scoring right now is that it’ll award legitimate characters relative to the length of the string. But when comparing scores across multiple subject strings, this approach seems broken.

createFuzzyScorer('RuneScript')('Ru'); // 1.9
createFuzzyScorer('Ruby')('Ru');       // 1.7


These should both score equally, as “Ru” is just as likely to become “Ruby” as it is to become “RuneScript”. To achieve this we should only take into account the index, and make the weight of any scoring decision inversely proportional to that index, in this case via an exponential taper (pow(index, -2)).

// The scoring loop:
for (var i = 0, l = captures.length; i < l; i += 3) {
var relevancyOfCharacter = Math.pow(i + 1, -2);
if (captures[i]) score -= relevancyOfCharacter * 0.1;
if (captures[i+1]) score += relevancyOfCharacter * 1;
if (captures[i+2]) score -= relevancyOfCharacter * 0.1;
}
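Assembled from the pieces above into one self-contained sketch, the tapered scorer now treats “Ru” identically against both targets:

```javascript
// makeFuzzyRegex and createFuzzyScorer as shown earlier, with the
// index-tapered scoring loop swapped in.
function makeFuzzyRegex(string) {
  if (!string) { return /^$/; }
  // Escape any potential special characters:
  var cleansed = string.replace(/\W/g, '\\$&');
  return RegExp(
    '^' + cleansed.replace(/(\\?.)/g, '(?:(^.)?($1)(.??))?') + '$',
    'i'
  );
}

function createFuzzyScorer(text) {
  var matcher = makeFuzzyRegex(text);
  return function(query) {
    var match = matcher.exec(query);
    if (!match) return 0;
    var captures = match.slice(1);
    var score = 0;
    for (var i = 0, l = captures.length; i < l; i += 3) {
      // Weight decays with the character's position in the target:
      var relevancyOfCharacter = Math.pow(i + 1, -2);
      if (captures[i]) score -= relevancyOfCharacter * 0.1;
      if (captures[i+1]) score += relevancyOfCharacter * 1;
      if (captures[i+2]) score -= relevancyOfCharacter * 0.1;
    }
    return score;
  };
}

createFuzzyScorer('RuneScript')('Ru') === createFuzzyScorer('Ruby')('Ru'); // => true
createFuzzyScorer('omg')('nope'); // => 0
```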


(Final version of createFuzzyScorer available as a gist.)

See this demo using programming languages as the dataset. Try intentionally misspelling something (jawascript), or missing out characters (jaascit), or just going a little crazy (jahskt). It works beautifully.

To achieve speedy sorting, a fuzzy scorer is created for every single value before the user types anything:

var data = PROGRAMMING_LANGUAGES.map(function(lang, i) {
return {
actualValue: lang,
score: createFuzzyScorer(lang),
i: i,
toString: function() { return lang; }
};
});


This means we can iterate through data on every relevant input event, and call the score() method with the current query. We can then bundle this into a filter->sort->slice flow to get our list of sensible suggestions:

var sorted = data.filter(function(item) {

// Get rid of any very unlikely matches (and cache the score!)
return (item._cachedScore = item.score(query)) >= .5;

}).sort(function(a, b) {

var as = a._cachedScore;
var bs = b._cachedScore;

// Sort by score, and if score is equal, then by original index:
// (We would typically return 0 in that case but some engines don't stable-sort)
return as > bs ? -1 : as == bs && a.i < b.i ? -1 : 1;

}).slice(0, 10); // We only need the top 10...


And.. we’re done. It’s never really finished though: you’ll find endless tweaks that can be made to the scorer to make it more believably resemble human-like intuition.

For those wanting to test the resulting country autocompletion interaction: See the demo.

I guess, despite my initial warning, I wouldn’t actually mind using this in production, as long as there were a decent number of unit tests. I’d probably also assemble the regular expressions on the server and serve them up as literals. It’s also worth mentioning that almost everything in this post has been exploring the fuzzy-matching of very short strings in small datasets. Even in the case of the country demo, to get more applicable results, I broke up long names into the component parts and then scored against each. E.g.

// E.g. Macedonia, the Former Yugoslav Republic of:
var scorers = [
"Macedonia, the Former Yugoslav Republic of",
"Macedonia",
"the",
"former",
"yugoslav",
"republic",
"of"
].map(createFuzzyScorer);
// Etc.


And this would be terribly inefficient on a larger scale, so with any dataset longer than a list of countries you’re probably best to explore Trie-based approaches to autocompletion.
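For a sense of what that alternative looks like, here is a minimal trie sketch (illustrative only; makeTrie and prefixLookup are my own names, not from the original post):

```javascript
// Build a character trie; lookup cost grows with the query length
// rather than with the size of the dataset.
function makeTrie(words) {
  var root = {};
  words.forEach(function(word) {
    var node = root;
    word.toLowerCase().split('').forEach(function(ch) {
      node = node[ch] || (node[ch] = {});
    });
    node.$ = word; // Terminal marker storing the original word
  });
  return root;
}

function prefixLookup(root, prefix) {
  var node = root;
  var chars = prefix.toLowerCase().split('');
  for (var i = 0; i < chars.length; i++) {
    node = node[chars[i]];
    if (!node) return []; // No word has this prefix
  }
  // Collect every word below this node:
  var results = [];
  (function walk(n) {
    if (n.$) results.push(n.$);
    Object.keys(n).forEach(function(key) {
      if (key !== '$') walk(n[key]);
    });
  })(node);
  return results;
}

var trie = makeTrie(['Tunisia', 'Turkey', 'Tuvalu', 'Ukraine']);
prefixLookup(trie, 'tu'); // => ["Tunisia", "Turkey", "Tuvalu"]
```

Note that this only handles exact prefixes; combining trie traversal with mistake tolerance is a much bigger topic.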

And with that, I’ll shut up and wish you merry regex’ing!

https://github.com/bevacqua/fuzzysearch

https://github.com/gf3/Levenshtein

(Translator's note: the approach used by https://github.com/bevacqua/fuzzysearch can only answer true or false, whereas padolsey's approach introduces the concept of weights and is smarter, though it is very inefficient on large datasets. The comments on the author's blog also mention https://github.com/gf3/Levenshtein; Levenshtein distance is worth looking up.)