FuzzyWuzzy Python Library

What is FuzzyWuzzy?

The FuzzyWuzzy library is used for fuzzy string matching. It helps in calculating the similarity between two strings and is commonly used when working with datasets that may have inconsistencies like typos, alternate spellings, or word order variations.

Key Features

  1. String Comparison: Measures the similarity between two strings using a score.
  2. Partial Matching: Matches substrings within larger strings.
  3. Token-Based Matching: Splits strings into words (tokens) and sorts or de-duplicates them, so that word order and repetition do not affect the score.
  4. Score Representation: Provides similarity scores as percentages (0-100).

Installation

To install FuzzyWuzzy:

pip install fuzzywuzzy

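To improve performance, install it together with the python-Levenshtein package by using the speedup extra:

pip install fuzzywuzzy[speedup]

The python-Levenshtein package is a compiled C extension that computes edit distances much faster than pure Python code. Without it, FuzzyWuzzy falls back to Python's built-in difflib.SequenceMatcher and issues a warning at import time.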

Core Concepts

  • Levenshtein Distance: The minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into another. FuzzyWuzzy builds its scores on this kind of edit-based comparison.
  • Similarity Scores: Integers from 0 to 100; a higher value indicates greater similarity.
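To make the edit-distance idea concrete, here is a minimal pure-Python sketch of the classic dynamic-programming algorithm. The function name levenshtein is chosen for illustration only; it is not FuzzyWuzzy's own code, which delegates this work to difflib or python-Levenshtein.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # previous[j] = distance between the processed prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The three edits here are k→s, e→i, and the appended g.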

Key Functions

1. fuzz.ratio

This function compares two strings and returns a similarity score.

from fuzzywuzzy import fuzz

string1 = "Apple Inc."
string2 = "Apple Incorporated"

score = fuzz.ratio(string1, string2)
print(score)

Output:

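64

The two strings share the nine-character prefix "Apple Inc" out of 28 characters in total, so the score is 2 × 9 / 28 ≈ 0.64.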
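As a rough illustration of where such a score comes from, the sketch below reproduces the "twice the matched characters over the total length" calculation with the standard library's difflib.SequenceMatcher, which FuzzyWuzzy falls back to when python-Levenshtein is not installed. The helper name simple_ratio is hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Similarity as an integer 0-100: 2 * matched characters / total length."""
    matcher = SequenceMatcher(None, s1, s2)
    return int(round(100 * matcher.ratio()))


# "Apple Inc" (9 characters) matches; the trailing "." does not: 2 * 9 / 19
print(simple_ratio("Apple Inc", "Apple Inc."))  # 95
```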

2. fuzz.partial_ratio

This function compares the shorter string with the best-matching substring of the longer one, making it ideal when one string is contained in, or is a fragment of, the other.

from fuzzywuzzy import fuzz

string1 = "Apple Inc."
string2 = "Inc."

score = fuzz.partial_ratio(string1, string2)
print(score)

Output:

100
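The idea of partial matching can be sketched as sliding a window the length of the shorter string across the longer one and keeping the best score. This is a simplified illustration built on difflib; the library chooses its candidate windows differently, so results may not agree in every case. The name sliding_partial_ratio is hypothetical.

```python
from difflib import SequenceMatcher


def sliding_partial_ratio(s1: str, s2: str) -> int:
    """Best 0-100 score of the shorter string against any equally long
    window of the longer string."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    window = len(shorter)
    best = 0.0
    for start in range(len(longer) - window + 1):
        candidate = longer[start:start + window]
        best = max(best, SequenceMatcher(None, shorter, candidate).ratio())
    return int(round(100 * best))


print(sliding_partial_ratio("Apple Inc.", "Inc."))  # 100
```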

3. fuzz.token_sort_ratio

This function splits each string into words, sorts the words alphabetically, and compares the rejoined strings, so differences in word order are ignored.

from fuzzywuzzy import fuzz

string1 = "Incorporated Apple"
string2 = "Apple Incorporated"

score = fuzz.token_sort_ratio(string1, string2)
print(score)

Output:

100
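The sort-then-compare step can be sketched with the standard library as follows. This is a simplified approximation: it only lowercases and splits on whitespace, whereas the library also strips punctuation before comparing. The name token_sort_sketch is hypothetical.

```python
from difflib import SequenceMatcher


def token_sort_sketch(s1: str, s2: str) -> int:
    """Compare two strings after sorting their lowercase words alphabetically."""
    def normalize(text: str) -> str:
        return " ".join(sorted(text.lower().split()))

    a, b = normalize(s1), normalize(s2)
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Both inputs normalize to "apple incorporated"
print(token_sort_sketch("Incorporated Apple", "Apple Incorporated"))  # 100
```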

4. fuzz.token_set_ratio

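This function splits each string into a set of unique words, then compares the words the two strings share against each string's remaining words. Repeated words and extra words in one string therefore have little effect on the score.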

from fuzzywuzzy import fuzz

string1 = "Apple Inc."
string2 = "Apple Inc. Inc."

score = fuzz.token_set_ratio(string1, string2)
print(score)

Output:

100
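A minimal sketch of the set-based comparison, assuming whitespace tokenization and lowercase folding only (the library also strips punctuation). It scores the shared words against each side's full word set and keeps the best result. The names _ratio and token_set_sketch are hypothetical.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_sketch(s1: str, s2: str) -> int:
    """Compare shared words against shared-plus-unique words on each side."""
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


# The repeated "Inc." collapses into one token, so both sets are identical
print(token_set_sketch("Apple Inc.", "Apple Inc. Inc."))  # 100
```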

Advanced Usage: Matching Against a Collection

FuzzyWuzzy provides functionality to match a string against a list of strings using the process module.

process.extract

This function returns a list of (match, score) tuples sorted from best to worst. By default it returns at most five results (limit=5).

from fuzzywuzzy import process

query = "Apple"
choices = ["Apple Inc.", "Apple Incorporated", "Microsoft", "Google"]

results = process.extract(query, choices)
print(results)

Output:

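[('Apple Inc.', 90), ('Apple Incorporated', 90), ('Google', 36), ('Microsoft', 0)]

The scores come from the default scorer, fuzz.WRatio, which combines several of the ratios above and applies weights to them, so they differ from plain fuzz.ratio values. Exact numbers may vary slightly between versions.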

process.extractOne

Returns the single best match for a string as a (match, score) tuple, or None if no choice reaches the optional score_cutoff.

from fuzzywuzzy import process

query = "Apple"
choices = ["Apple Inc.", "Apple Incorporated", "Microsoft", "Google"]

best_match = process.extractOne(query, choices)
print(best_match)

Output:

('Apple Inc.', 90)
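The score-everything-then-rank pattern behind these two functions can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: it scores with a plain character ratio rather than the library's default scorer, so the numbers differ from those returned by process.extract. The names rank_choices and best_choice are hypothetical; the limit and score_cutoff parameters mirror the library's parameters of the same names.

```python
import heapq
from difflib import SequenceMatcher


def rank_choices(query, choices, limit=5):
    """Score every choice against the query; return the top `limit`
    as (choice, score) tuples, best first."""
    def score(choice):
        ratio = SequenceMatcher(None, query.lower(), choice.lower()).ratio()
        return int(round(100 * ratio))

    scored = [(choice, score(choice)) for choice in choices]
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


def best_choice(query, choices, score_cutoff=0):
    """Return the best (choice, score) tuple, or None if it scores below the cutoff."""
    ranked = rank_choices(query, choices, limit=1)
    if ranked and ranked[0][1] >= score_cutoff:
        return ranked[0]
    return None


choices = ["Apple Inc.", "Apple Incorporated", "Microsoft", "Google"]
print(rank_choices("Apple", choices, limit=2))
print(best_choice("Apple", choices, score_cutoff=90))  # None: nothing scores 90 or more
```

A cutoff is useful when "no match" is a better answer than a weak match.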

Performance Optimization

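  1. Install python-Levenshtein to make string comparisons faster.
  2. Use token-based functions (token_sort_ratio or token_set_ratio) when strings differ in word order or contain repeated words; this improves match quality rather than speed.
  3. Use partial_ratio when one string may be a fragment of the other, instead of running several scorers on every pair.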

Applications

  1. Data Cleaning: Find and correct spelling errors or inconsistent entries within datasets.
  2. Search Engines: Implement approximate string matching in search features.
  3. Deduplication: Remove duplicate records from a database.
  4. Record Matching: Pair similar records between datasets.
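As an illustration of the deduplication use case, the sketch below keeps the first record of each group of near-duplicates, using a score threshold. It relies on the standard library's difflib rather than FuzzyWuzzy, and the function name drop_near_duplicates and the threshold of 90 are assumptions chosen for this example.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(records, threshold=90):
    """Keep a record only if it scores below `threshold` (0-100)
    against every record already kept."""
    kept = []
    for record in records:
        is_duplicate = any(
            100 * SequenceMatcher(None, record.lower(), seen.lower()).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(record)
    return kept


names = ["Apple Inc.", "apple inc", "Microsoft", "Microsft", "Google"]
print(drop_near_duplicates(names))  # ['Apple Inc.', 'Microsoft', 'Google']
```

This compares every record with every kept record, so it suits small lists; larger datasets usually need candidates to be grouped first.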

Limitations

  1. Speed: Without python-Levenshtein, it can be slow on large datasets.
  2. Not Context-Aware: It compares characters, not the meaning or semantics of the strings.
  3. False Positives: A high similarity score does not guarantee a correct match, so results should be checked or filtered with a score threshold.
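The last two limitations can be seen with a plain character-level score. The sketch below uses the standard library's difflib as a stand-in for a character-based scorer; the helper name char_score is hypothetical.

```python
from difflib import SequenceMatcher


def char_score(a: str, b: str) -> int:
    """Character-level similarity as an integer 0-100, ignoring case."""
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))


# Different companies, nearly identical spelling: a likely false positive
print(char_score("Apple Inc.", "Maple Inc."))  # 90

# Same meaning, almost no shared characters: semantics are invisible
print(char_score("car", "automobile"))  # 15
```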