Commit 16a4935

Merge pull request #87 from nol13/autojunk2
Autojunk2
2 parents 136b02d + f86b8ec commit 16a4935

6 files changed: 72 additions and 14 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -454,7 +454,7 @@ Pass options to fuzz.unique_tokens as the second argument if you're using wildca
 ### Alternate Ratio Calculations
 
 
-If you want to use difflib's ratio function for all ratio calculations, which differs slightly from the default python-Levenshtein style behavior, you can specify options.ratio_alg = "difflib". The difflib calculation is a bit different in that it's based on matching characters rather than true minimum edit distance, but the results are usually pretty similar. Difflib uses the formula 2.0*M / T where M is the number of matches, and T is the total number of elements in both sequences. This mirrors the behavior of fuzzywuzzy when not using python-Levenshtein. Not all features (wildcards, collation) supported when using difflib ratio.
+If you want to use difflib's ratio function for all ratio calculations, which differs slightly from the default python-Levenshtein style behavior, you can specify options.ratio_alg = "difflib". The difflib calculation is a bit different in that it's based on matching characters rather than true minimum edit distance, but the results are usually pretty similar. Difflib uses the formula 2.0*M / T where M is the number of matches, and T is the total number of elements in both sequences. This mirrors the behavior of fuzzywuzzy when not using python-Levenshtein. When using difflib, you can also set `options.autojunk` to `false` to disable the automatic junk heuristic that treats popular elements as junk. Not all features (wildcards, collation) supported when using difflib ratio.
 
 Except when using difflib, the ratios are calculated as ((str1.length + str2.length) - distance) / (str1.length + str2.length), where distance is calculated with a substitution cost of 2. This follows the behavior of python-Levenshtein, however the fuzz.distance function still uses a cost of 1 by default for all operations if just calculating distance and not a ratio.
 
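The non-difflib ratio described in the README paragraph above can be illustrated with a small standalone sketch. This is not fuzzball's implementation: the function names `weightedDistance` and `levenshteinStyleRatio` are hypothetical, and the handling of two empty strings is an assumption made only to avoid dividing by zero.

```javascript
// Weighted edit distance: insertions and deletions cost 1,
// substitutions cost `subCost` (2 for ratio-style scoring, 1 for plain distance).
function weightedDistance(a, b, subCost) {
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const same = a[i - 1] === b[j - 1];
      curr.push(Math.min(
        prev[j] + 1,                        // deletion
        curr[j - 1] + 1,                    // insertion
        prev[j - 1] + (same ? 0 : subCost)  // match or substitution
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// ((len1 + len2) - distance) / (len1 + len2), scaled to 0-100.
function levenshteinStyleRatio(a, b) {
  const total = a.length + b.length;
  if (total === 0) return 0; // assumption for this sketch: avoid dividing by zero
  const distance = weightedDistance(a, b, 2);
  return Math.round(100 * (total - distance) / total);
}

console.log(weightedDistance("kitten", "sitting", 1)); // 3
console.log(weightedDistance("kitten", "sitting", 2)); // 5
console.log(levenshteinStyleRatio("kitten", "sitting")); // 62
```

With a substitution cost of 2, a substitution is never cheaper than a deletion plus an insertion, which is why the same pair of strings yields a distance of 3 with unit costs but 5 when computing the ratio.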

dist/esm/fuzzball.esm.min.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

dist/fuzzball.umd.min.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

fuzzball.d.ts

Lines changed: 35 additions & 9 deletions
@@ -29,6 +29,24 @@ export interface FuzzballBaseOptions {
     normalize?: boolean;
 }
 
+export interface FuzzballRatioOptions extends FuzzballBaseOptions {
+    /**
+     * A string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein"
+     */
+    ratio_alg?: 'levenshtein' | 'difflib';
+    /**
+     * Autojunk argument passed to difflib if you're using the ratio_alg option, default true
+     */
+    autojunk?: boolean;
+}
+
+export interface FuzzballPartialRatioOptions extends FuzzballBaseOptions {
+    /**
+     * Autojunk argument passed to difflib, default true
+     */
+    autojunk?: boolean;
+}
+
 export interface FuzzballTokenSetOptions extends FuzzballBaseOptions {
     /**
      * Include ratio as part of token set test suite
@@ -64,7 +82,15 @@ interface FuzzballExtractBaseOptions extends FuzzballBaseOptions {
     /**
      * Sort tokens by similarity before combining with token set scorers
      */
-    sortBySimilarity?: boolean
+    sortBySimilarity?: boolean;
+    /**
+     * A string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein"
+     */
+    ratio_alg?: 'levenshtein' | 'difflib';
+    /**
+     * Autojunk argument passed to difflib if you're using the ratio_alg option, default true
+     */
+    autojunk?: boolean;
 }
 
 interface AbortController {
@@ -170,14 +196,14 @@ export interface FuzzballDedupeObjOptionsWithMap extends FuzzballExtractObjectOp
 }
 
 export function distance(str1: string, str2: string, opts?: FuzzballBaseOptions): number;
-export function ratio(str1: string, str2: string, opts?: FuzzballBaseOptions): number;
-export function partial_ratio(str1: string, str2: string, opts?: FuzzballBaseOptions): number;
-export function token_set_ratio(str1: string, str2: string, opts?: FuzzballTokenSetOptions): number;
-export function token_sort_ratio(str1: string, str2: string, opts?: FuzzballBaseOptions): number;
-export function token_similarity_sort_ratio(str1: string, str2: string, opts?: FuzzballTokenSetOptions): number;
-export function partial_token_set_ratio(str1: string, str2: string, opts?: FuzzballTokenSetOptions): number;
-export function partial_token_sort_ratio(str1: string, str2: string, opts?: FuzzballBaseOptions): number;
-export function partial_token_similarity_sort_ratio(str1: string, str2: string, opts?: FuzzballTokenSetOptions): number;
+export function ratio(str1: string, str2: string, opts?: FuzzballRatioOptions): number;
+export function partial_ratio(str1: string, str2: string, opts?: FuzzballPartialRatioOptions): number;
+export function token_set_ratio(str1: string, str2: string, opts?: FuzzballRatioOptions & FuzzballTokenSetOptions): number;
+export function token_sort_ratio(str1: string, str2: string, opts?: FuzzballRatioOptions): number;
+export function token_similarity_sort_ratio(str1: string, str2: string, opts?: FuzzballRatioOptions & FuzzballTokenSetOptions): number;
+export function partial_token_set_ratio(str1: string, str2: string, opts?: FuzzballPartialRatioOptions & FuzzballTokenSetOptions): number;
+export function partial_token_sort_ratio(str1: string, str2: string, opts?: FuzzballPartialRatioOptions): number;
+export function partial_token_similarity_sort_ratio(str1: string, str2: string, opts?: FuzzballPartialRatioOptions & FuzzballTokenSetOptions): number;
 export function WRatio(str1: string, str2: string, opts?: FuzzballTokenSetOptions): number;
 export function full_process(str: string, options?: FuzzballExtractOptions | boolean): string;
 export function process_and_sort(str: string): string;

fuzzball.js

Lines changed: 18 additions & 2 deletions
@@ -81,6 +81,8 @@
      * @param {string} [options_p.wildcards] - characters that will be used as wildcards if provided
      * @param {number} [options_p.astral] - Use astral aware calculation
      * @param {string} [options_p.normalize] - Normalize unicode representations
+     * @param {string} [options_p.ratio_alg] - a string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein"
+     * @param {boolean} [options_p.autojunk] - autojunk argument passed to difflib if you're using the ratio_alg option, default true
      * @returns {number} - the levenshtein ratio (0-100).
      */
     var options = clone_and_set_option_defaults(options_p);
@@ -108,6 +110,7 @@
      * @param {string} [options_p.wildcards] - characters that will be used as wildcards if provided
      * @param {number} [options_p.astral] - Use astral aware calculation
      * @param {string} [options_p.normalize] - Normalize unicode representations
+     * @param {boolean} [options_p.autojunk] - autojunk argument passed to difflib, default true
      * @returns {number} - the levenshtein ratio (0-100).
      */
     var options = clone_and_set_option_defaults(options_p);
@@ -136,6 +139,8 @@
      * @param {string} [options_p.wildcards] - characters that will be used as wildcards if provided
      * @param {number} [options_p.astral] - Use astral aware calculation
      * @param {string} [options_p.normalize] - Normalize unicode representations
+     * @param {string} [options_p.ratio_alg] - a string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein"
+     * @param {boolean} [options_p.autojunk] - autojunk argument passed to difflib if you're using the ratio_alg option, default true
      * @returns {number} - the levenshtein ratio (0-100).
      */
     var options = clone_and_set_option_defaults(options_p);
@@ -164,6 +169,7 @@
      * @param {string} [options_p.wildcards] - characters that will be used as wildcards if provided
      * @param {number} [options_p.astral] - Use astral aware calculation
      * @param {string} [options_p.normalize] - Normalize unicode representations
+     * @param {boolean} [options_p.autojunk] - autojunk argument passed to difflib, default true
      * @returns {number} - the levenshtein ratio (0-100).
      */
     var options = clone_and_set_option_defaults(options_p);
@@ -191,6 +197,8 @@
      * @param {string} [options_p.wildcards] - characters that will be used as wildcards if provided
      * @param {number} [options_p.astral] - Use astral aware calculation
      * @param {string} [options_p.normalize] - Normalize unicode representations
+     * @param {string} [options_p.ratio_alg] - a string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein"
+     * @param {boolean} [options_p.autojunk] - autojunk argument passed to difflib if you're using the ratio_alg option, default true
      * @returns {number} - the levenshtein ratio (0-100).
      */
     var options = clone_and_set_option_defaults(options_p);
@@ -221,6 +229,7 @@
      * @param {string} [options_p.wildcards] - characters that will be used as wildcards if provided
      * @param {number} [options_p.astral] - Use astral aware calculation
      * @param {string} [options_p.normalize] - Normalize unicode representations
+     * @param {boolean} [options_p.autojunk] - autojunk argument passed to difflib, default true
      * @returns {number} - the levenshtein ratio (0-100).
      */
     var options = clone_and_set_option_defaults(options_p);
@@ -252,6 +261,8 @@
      * @param {string} [options_p.wildcards] - characters that will be used as wildcards if provided
      * @param {number} [options_p.astral] - Use astral aware calculation
      * @param {string} [options_p.normalize] - Normalize unicode representations
+     * @param {string} [options_p.ratio_alg] - a string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein"
+     * @param {boolean} [options_p.autojunk] - autojunk argument passed to difflib if you're using the ratio_alg option, default true
      * @returns {number} - the levenshtein ratio (0-100).
      */
     var options = clone_and_set_option_defaults(options_p);
@@ -278,6 +289,7 @@
      * @param {string} [options_p.wildcards] - characters that will be used as wildcards if provided
      * @param {number} [options_p.astral] - Use astral aware calculation
      * @param {string} [options_p.normalize] - Normalize unicode representations
+     * @param {boolean} [options_p.autojunk] - autojunk argument passed to difflib, default true
      * @returns {number} - the levenshtein ratio (0-100).
      */
     var options = clone_and_set_option_defaults(options_p);
@@ -364,6 +376,8 @@
      * @param {boolean} [options_p.sortBySimilarity] - sort tokens by similarity to each other before combining instead of alphabetically
      * @param {string} [options_p.wildcards] - characters that will be used as wildcards if provided
      * @param {boolean} [options_p.returnObjects] - return array of object instead of array of tuples; default false
+     * @param {string} [options_p.ratio_alg] - a string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein"
+     * @param {boolean} [options_p.autojunk] - autojunk argument passed to difflib if you're using the ratio_alg option, default true
      * @returns {Array[] | Object[]} - array of choice results with their computed ratios (0-100).
      */
     var options = clone_and_set_option_defaults(options_p);
@@ -510,6 +524,8 @@
      * @param {Object} [options_p.abortController] - track abortion
      * @param {Object} [options_p.cancelToken] - track cancellation
      * @param {number} [options_p.asyncLoopOffset] - number of rows to run in between every async loop iteration, default 256
+     * @param {string} [options_p.ratio_alg] - a string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein"
+     * @param {boolean} [options_p.autojunk] - autojunk argument passed to difflib if you're using the ratio_alg option, default true
      * @param {function} callback - node style callback (err, arrayOfResults)
      */
     var options = clone_and_set_option_defaults(options_p);
@@ -894,7 +910,7 @@
     if (!validate(str1)) return 0;
     if (!validate(str2)) return 0;
     if (options.ratio_alg && options.ratio_alg === "difflib") {
-        var m = new SequenceMatcher(null, str1, str2);
+        var m = new SequenceMatcher(null, str1, str2, options.autojunk);
         var r = m.ratio();
         return Math.round(100 * r);
     }
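To make the `m.ratio()` step in the hunk above more concrete, here is a simplified, standalone sketch of the 2.0*M / T calculation that the README attributes to difflib. It finds the longest common block and recurses on the pieces to its left and right, with no junk handling at all, so it is not a substitute for the `SequenceMatcher` class the library uses. The names `longestMatch`, `countMatches`, and `difflibStyleRatio` are hypothetical.

```javascript
// Longest common contiguous block of a[alo..ahi) and b[blo..bhi).
// The first block found with the maximal size wins.
function longestMatch(a, b, alo, ahi, blo, bhi) {
  let best = { i: alo, j: blo, size: 0 };
  let prev = {}; // prev[j] = length of the common run ending at a[i - 1], b[j]
  for (let i = alo; i < ahi; i++) {
    const curr = {};
    for (let j = blo; j < bhi; j++) {
      if (a[i] !== b[j]) continue;
      const k = (prev[j - 1] || 0) + 1;
      curr[j] = k;
      if (k > best.size) best = { i: i - k + 1, j: j - k + 1, size: k };
    }
    prev = curr;
  }
  return best;
}

// M: total size of all matching blocks, found recursively.
function countMatches(a, b, alo, ahi, blo, bhi) {
  const m = longestMatch(a, b, alo, ahi, blo, bhi);
  if (m.size === 0) return 0;
  return m.size
    + countMatches(a, b, alo, m.i, blo, m.j)
    + countMatches(a, b, m.i + m.size, ahi, m.j + m.size, bhi);
}

// 2.0 * M / T, scaled to 0-100 and rounded like the hunk above.
function difflibStyleRatio(a, b) {
  const total = a.length + b.length;
  if (total === 0) return 100; // Python's difflib returns 1.0 for two empty sequences
  const matches = countMatches(a, b, 0, a.length, 0, b.length);
  return Math.round(100 * (2.0 * matches) / total);
}

console.log(difflibStyleRatio("abcd", "bcde")); // 75 (M = 3, T = 8)
```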
@@ -929,7 +945,7 @@
         var shorter = str2
         var longer = str1
     }
-    var m = new SequenceMatcher(null, shorter, longer);
+    var m = new SequenceMatcher(null, shorter, longer, options.autojunk);
     var blocks = m.getMatchingBlocks();
     var scores = [];
     for (var b = 0; b < blocks.length; b++) {
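For context on what the `autojunk` flag toggles, the sketch below reproduces the "popular element" rule from Python's difflib: when the second sequence has at least 200 elements, any element occurring more than floor(n / 100) + 1 times is treated as junk. Whether the JavaScript `SequenceMatcher` used here applies exactly the same thresholds is an assumption, and `popularElements` is a hypothetical helper, not part of fuzzball.

```javascript
// Returns the set of elements of `b` that the autojunk heuristic would
// treat as junk, following the rule in Python's difflib.
function popularElements(b, autojunk = true) {
  const popular = new Set();
  const n = b.length;
  // The heuristic only applies to sequences of 200 or more elements.
  if (!autojunk || n < 200) return popular;

  // Count occurrences of each element.
  const counts = new Map();
  for (let i = 0; i < n; i++) {
    counts.set(b[i], (counts.get(b[i]) || 0) + 1);
  }

  // Elements that occur more than 1% of the length (plus one) are "popular".
  const threshold = Math.floor(n / 100) + 1;
  for (const [element, count] of counts) {
    if (count > threshold) popular.add(element);
  }
  return popular;
}

const long = "x".repeat(197) + "abc"; // 200 characters, dominated by "x"
console.log([...popularElements(long)]);        // [ 'x' ]
console.log([...popularElements(long, false)]); // []
```

Under this rule a long string dominated by a few characters can lose most of its matches, which is the case where passing `autojunk: false` is likely to change the score.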

jsdocs/fuzzball.md

Lines changed: 16 additions & 0 deletions
@@ -59,6 +59,8 @@ Calculate levenshtein ratio of the two strings.
 | [options_p.wildcards] | <code>string</code> | characters that will be used as wildcards if provided |
 | [options_p.astral] | <code>number</code> | Use astral aware calculation |
 | [options_p.normalize] | <code>string</code> | Normalize unicode representations |
+| [options_p.ratio_alg] | <code>string</code> | a string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein" |
+| [options_p.autojunk] | <code>boolean</code> | autojunk argument passed to difflib if you're using the ratio_alg option, default true |
 
 <a name="module_fuzzball..partial_ratio"></a>
 
@@ -80,6 +82,7 @@ Calculate partial levenshtein ratio of the two strings.
 | [options_p.wildcards] | <code>string</code> | characters that will be used as wildcards if provided |
 | [options_p.astral] | <code>number</code> | Use astral aware calculation |
 | [options_p.normalize] | <code>string</code> | Normalize unicode representations |
+| [options_p.autojunk] | <code>boolean</code> | autojunk argument passed to difflib, default true |
 
 <a name="module_fuzzball..token_set_ratio"></a>
 
@@ -102,6 +105,8 @@ Calculate token set ratio of the two strings.
 | [options_p.wildcards] | <code>string</code> | characters that will be used as wildcards if provided |
 | [options_p.astral] | <code>number</code> | Use astral aware calculation |
 | [options_p.normalize] | <code>string</code> | Normalize unicode representations |
+| [options_p.ratio_alg] | <code>string</code> | a string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein" |
+| [options_p.autojunk] | <code>boolean</code> | autojunk argument passed to difflib if you're using the ratio_alg option, default true |
 
 <a name="module_fuzzball..partial_token_set_ratio"></a>
 
@@ -124,6 +129,7 @@ Calculate partial token ratio of the two strings.
 | [options_p.wildcards] | <code>string</code> | characters that will be used as wildcards if provided |
 | [options_p.astral] | <code>number</code> | Use astral aware calculation |
 | [options_p.normalize] | <code>string</code> | Normalize unicode representations |
+| [options_p.autojunk] | <code>boolean</code> | autojunk argument passed to difflib, default true |
 
 <a name="module_fuzzball..token_sort_ratio"></a>
 
@@ -144,6 +150,8 @@ Calculate token sort ratio of the two strings.
 | [options_p.wildcards] | <code>string</code> | characters that will be used as wildcards if provided |
 | [options_p.astral] | <code>number</code> | Use astral aware calculation |
 | [options_p.normalize] | <code>string</code> | Normalize unicode representations |
+| [options_p.ratio_alg] | <code>string</code> | a string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein" |
+| [options_p.autojunk] | <code>boolean</code> | autojunk argument passed to difflib if you're using the ratio_alg option, default true |
 
 <a name="module_fuzzball..partial_token_sort_ratio"></a>
 
@@ -164,6 +172,7 @@ Calculate partial token sort ratio of the two strings.
 | [options_p.wildcards] | <code>string</code> | characters that will be used as wildcards if provided |
 | [options_p.astral] | <code>number</code> | Use astral aware calculation |
 | [options_p.normalize] | <code>string</code> | Normalize unicode representations |
+| [options_p.autojunk] | <code>boolean</code> | autojunk argument passed to difflib, default true |
 
 <a name="module_fuzzball..token_similarity_sort_ratio"></a>
 
@@ -184,6 +193,8 @@ Calculate token sort ratio of the two strings.
 | [options_p.wildcards] | <code>string</code> | characters that will be used as wildcards if provided |
 | [options_p.astral] | <code>number</code> | Use astral aware calculation |
 | [options_p.normalize] | <code>string</code> | Normalize unicode representations |
+| [options_p.ratio_alg] | <code>string</code> | a string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein" |
+| [options_p.autojunk] | <code>boolean</code> | autojunk argument passed to difflib if you're using the ratio_alg option, default true |
 
 <a name="module_fuzzball..partial_token_similarity_sort_ratio"></a>
 
@@ -204,6 +215,7 @@ Calculate token sort ratio of the two strings.
 | [options_p.wildcards] | <code>string</code> | characters that will be used as wildcards if provided |
 | [options_p.astral] | <code>number</code> | Use astral aware calculation |
 | [options_p.normalize] | <code>string</code> | Normalize unicode representations |
+| [options_p.autojunk] | <code>boolean</code> | autojunk argument passed to difflib, default true |
 
 <a name="module_fuzzball..WRatio"></a>
 
@@ -253,6 +265,8 @@ Return the top scoring items from an array (or assoc array) of choices
 | [options_p.sortBySimilarity] | <code>boolean</code> | sort tokens by similarity to each other before combining instead of alphabetically |
 | [options_p.wildcards] | <code>string</code> | characters that will be used as wildcards if provided |
 | [options_p.returnObjects] | <code>boolean</code> | return array of object instead of array of tuples; default false |
+| [options_p.ratio_alg] | <code>string</code> | a string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein" |
+| [options_p.autojunk] | <code>boolean</code> | autojunk argument passed to difflib if you're using the ratio_alg option, default true |
 
 <a name="module_fuzzball..extractAsync"></a>
 
@@ -283,5 +297,7 @@ Return the top scoring items from an array (or assoc array) of choices
 | [options_p.abortController] | <code>Object</code> | track abortion |
 | [options_p.cancelToken] | <code>Object</code> | track cancellation |
 | [options_p.asyncLoopOffset] | <code>number</code> | number of rows to run in between every async loop iteration, default 256 |
+| [options_p.ratio_alg] | <code>string</code> | a string representing the ratio algorithm to use, either "levenshtein" or "difflib", default "levenshtein" |
+| [options_p.autojunk] | <code>boolean</code> | autojunk argument passed to difflib if you're using the ratio_alg option, default true |
 | callback | <code>function</code> | node style callback (err, arrayOfResults) |
 