String Distance Metrics for Address Comparison: Levenshtein, Damerau, Jaro, and Jaro-Winkler
Comparing addresses is not a simple equality check
When you compare two address strings to decide if they refer to the same location, an exact match is rarely enough. Real-world addresses come with typos, abbreviations, inconsistent casing, and missing components. "15 Baker St" and "15 Baker Street" are the same address, but "15 Baker St" == "15 Baker Street" evaluates to False.
String distance metrics can be used to solve this problem. They assign a number to two strings that quantifies how different (or similar) they are. This number lets you set a threshold: "if the distance is less than 2, or the similarity is above 0.9, treat them as the same address."
In this guide, we cover the four most widely used metrics for address comparison:
- Levenshtein distance - the classic edit distance
- Damerau-Levenshtein distance - edit distance with transpositions
- Jaro similarity - character matching with position awareness
- Jaro-Winkler similarity - Jaro with a bonus for matching prefixes
For each one, we explain how it works, walk through a concrete address example, and discuss when it is (and is not) a good fit. At the end, a summary table and practical recommendations.
Table of contents:
- Comparing addresses is not a simple equality check
- A quick note before diving in
- 1. Levenshtein distance
- 2. Damerau-Levenshtein distance
- 3. Jaro similarity
- 4. Jaro-Winkler similarity
- Summary table
- Normalization: the step that matters more than metric choice
- Putting it all together: a complete address comparison
- Where to go from here
A quick note before diving in
There are two flavors of metric:
- Distance metrics (Levenshtein, Damerau-Levenshtein): return 0 for identical strings, and a higher integer for more different strings. Think of them as counting "how many operations to fix this?"
- Similarity metrics (Jaro, Jaro-Winkler): return 1.0 for identical strings, and a lower value (down to 0.0) for more different strings. Think of them as "what fraction of characters match?"
Neither type is universally better. Which one to use depends on your data and your use case.
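The two conventions are also easy to convert between when you want a single thresholding scheme. As a sketch (the `distance_to_similarity` helper below is ours for illustration, not a library function), an edit distance can be mapped onto the 0.0-1.0 similarity scale by dividing by the longer string's length:

```python
def distance_to_similarity(dist, s1, s2):
    # 1.0 means identical; 0.0 means roughly every character needs an edit
    longest = max(len(s1), len(s2))
    return 1.0 - dist / longest if longest else 1.0

distance_to_similarity(2, "15 baker srteet", "15 baker street")  # → ~0.867
```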
1. Levenshtein distance
How it works
The Levenshtein distance between two strings is the minimum number of single-character edits needed to transform one into the other. The allowed operations are:
- Insertion: add a character (e.g., "Stret" - insert an 'e' - to "Street")
- Deletion: remove a character (e.g., "Streeet" - delete one 'e' - to "Street")
- Substitution: replace one character with another (e.g., "Streat" - replace the 'a' with 'e' - to "Street")
A distance of 0 means the strings are identical. A distance of 1 means a single operation is enough.
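The definition translates directly into the classic dynamic-programming recurrence. Here is a minimal pure-Python sketch (the `levenshtein` name is ours, not from any library):

```python
def levenshtein(a, b):
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

levenshtein("15 baker srteet", "15 baker street")  # → 2
```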
Address example
Compare "15 baker srteet" (a common keyboard transposition typo) to "15 baker street":
```
15 baker srteet
15 baker street
```
To fix "srteet" into "street", Levenshtein needs 2 substitutions: replace the 'r' with a 't', then the 't' that follows with an 'r'.
Distance = 2.
This is technically correct, but it feels like a lot for what is clearly a single typing mistake. We will fix this in the next section.
In Python
```python
import jellyfish

jellyfish.levenshtein_distance("15 baker srteet", "15 baker street")  # → 2
```
Install with `pip install jellyfish`.
Pros and cons
| Pros | Cons |
|---|---|
| Simple and well understood | Counts a transposition (swap of two adjacent chars) as 2 operations instead of 1 |
| Handles insertions and deletions well (great for abbreviations like "St" vs "Street") | Raw distance is hard to compare across string pairs of different lengths |
| Available in virtually every language and library | Does not give extra weight to matching prefixes |
| Fast on short strings | |
When to use it for addresses
Levenshtein is a solid default, especially when the main source of variation is abbreviations or missing characters. Set a relative threshold to make comparisons fair across different string lengths:
```python
import jellyfish

def normalized_levenshtein(a, b):
    dist = jellyfish.levenshtein_distance(a, b)
    return dist / max(len(a), len(b))

# Treat as same address if < 15% different
normalized_levenshtein("15 baker street", "15 baker st")     # → 0.267
normalized_levenshtein("15 baker street", "15 baker sreet")  # → 0.067
```
2. Damerau-Levenshtein distance
How it works
Damerau-Levenshtein extends Levenshtein by adding one more operation:
- Transposition: swap two adjacent characters (e.g., "srteet" to "street" - swap 'r' and 't')
This single addition makes a big practical difference. Studies of typing errors show that transpositions (adjacent character swaps) are one of the most frequent mistakes people make when typing by hand. Recognizing them as a single error rather than two substitutions gives a much more realistic picture of "how many mistakes did the user make?"
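The extra operation needs only one more case in the Levenshtein recurrence. Below is a sketch of the restricted variant (often called optimal string alignment); `osa_distance` is our own illustrative name, and library implementations such as jellyfish's may implement the unrestricted variant, which can differ on unusual inputs:

```python
def osa_distance(a, b):
    # Full Levenshtein DP table, plus one extra case for adjacent transpositions
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

osa_distance("15 baker srteet", "15 baker street")  # → 1
```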
Address example
Same pair as before: "15 baker srteet" vs "15 baker street".
```
15 baker srteet
15 baker street
          ^^  ← 'rt' transposed to 'tr'
```
Damerau-Levenshtein distance = 1 (one transposition).
Compare this to the Levenshtein distance of 2 for the same pair. One error, one operation - much more intuitive.
Another example: "rue de al paix" vs "rue de la paix" (the words "al" and "la" are swapped - a common cut-and-paste error):
```
rue de al paix
rue de la paix
       ^^  ← 'a' and 'l' transposed
```
Levenshtein = 2, Damerau-Levenshtein = 1.
In Python
```python
import jellyfish

jellyfish.damerau_levenshtein_distance("15 baker srteet", "15 baker street")  # → 1
jellyfish.damerau_levenshtein_distance("rue de al paix", "rue de la paix")    # → 1
```
Pros and cons
| Pros | Cons |
|---|---|
| More realistic for human typing errors | Slightly more complex to implement from scratch |
| Transpositions (very frequent mistakes) cost 1, not 2 | Same issue with raw distance across different string lengths |
| Strict superset of Levenshtein - always <= Levenshtein distance | |
When to use it for addresses
Prefer Damerau-Levenshtein over plain Levenshtein whenever your input comes from users typing addresses manually (web forms, search boxes, mobile apps). It gives a better answer to the question "how careless was this input?" and lets you set tighter thresholds while still tolerating common slip-of-the-finger errors.
3. Jaro similarity
How it works
Jaro takes a completely different approach. Instead of counting edit operations, it measures the proportion of characters that match between two strings, with a penalty for characters that are far apart or out of order.
Two characters are considered matching if:
1. They are the same character, and
2. Their positions are not too far apart. The maximum allowed distance between two matching characters is: floor(max(len(s1), len(s2)) / 2) - 1
Once you count the matching characters m and the number of matching characters that are out of order t (transpositions), the Jaro similarity is:
Jaro(s1, s2) = (1/3) * (m / |s1| + m / |s2| + (m - t) / m)
The result is between 0.0 (nothing matches) and 1.0 (identical).
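The matching-window and transposition bookkeeping is easier to follow in code. Here is a minimal pure-Python sketch of the definition above (`jaro` is our own name for the helper):

```python
def jaro(s1, s2):
    # Count characters that match within the sliding window,
    # then discount matched characters that are out of order.
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Compare matched characters in order; each pair of mismatched
    # positions counts as one transposition.
    half = 0
    k = 0
    for i, c in enumerate(s1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if c != s2[k]:
                half += 1
            k += 1
    t = half // 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

jaro("church rd", "church road")  # → ~0.939
```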
Address example
Compare "church rd" (abbreviated) to "church road" (full form):
- s1 = "church rd" (9 characters), s2 = "church road" (11 characters)
- Matching window = floor(11 / 2) - 1 = 4
Every character in "church rd" can be matched to a character in "church road" within the window: c, h, u, r, c, h, (space), r, d all appear in the same relative order. So m = 9, t = 0.
Jaro = (1/3) * (9/9 + 9/11 + 9/9) = (1/3) * (1.000 + 0.818 + 1.000) ≈ 0.939
A similarity of 0.939 for what is clearly the same street, just abbreviated. That is the right answer.
In Python
```python
import jellyfish

jellyfish.jaro_similarity("church rd", "church road")    # → ~0.939
jellyfish.jaro_similarity("main street", "mian street")  # typo: 'i' and 'a' transposed
# → ~0.970
```
Pros and cons
| Pros | Cons |
|---|---|
| Normalized score between 0 and 1 - easy to threshold | Formula is less intuitive than edit distance |
| Handles abbreviations and length differences naturally | Can give surprisingly high scores to long strings with many common characters |
| Well suited for comparing individual address components | Does not give extra weight to matching prefixes - we fix this with Jaro-Winkler |
When to use it for addresses
Jaro is a good fit for comparing individual address fields (just the street name, just the city) rather than full concatenated address strings. It is also useful for detecting near-duplicate records in a database.
4. Jaro-Winkler similarity
How it works
Jaro-Winkler builds on Jaro by adding a prefix bonus: if two strings share the same first characters, their similarity is boosted. The motivation is that strings that start the same way are more likely to refer to the same thing than strings that diverge immediately.
The formula is:
JaroWinkler(s1, s2) = Jaro(s1, s2) + l * p * (1 - Jaro(s1, s2))
Where:
- l = length of the common prefix, up to a maximum of 4 characters
- p = scaling factor, conventionally 0.1
The effect is a modest upward adjustment when the prefix matches. With p = 0.1 and l = 4, the maximum bonus is +4 * 0.1 * (1 - Jaro), which boosts a Jaro of 0.90 by at most 0.04. Small but meaningful for ranking candidates.
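The bonus arithmetic is easy to check by hand. The sketch below (the `winkler_adjust` helper is ours for illustration) applies the formula to a precomputed Jaro score; note that many real implementations apply the bonus only when the Jaro score already exceeds a threshold, commonly 0.7:

```python
def winkler_adjust(jaro_score, s1, s2, p=0.1, max_l=4):
    # l = length of the common prefix, capped at max_l characters
    l = 0
    for a, b in zip(s1, s2):
        if a != b or l == max_l:
            break
        l += 1
    return jaro_score + l * p * (1 - jaro_score)

winkler_adjust(0.939, "church rd", "church road")  # → ~0.963
```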
Address example
Take "church rd" vs "church road" again (Jaro = 0.939, common prefix "chur", l = 4):
JaroWinkler = 0.939 + 4 * 0.1 * (1 - 0.939)
= 0.939 + 0.4 * 0.061
≈ 0.963
Now consider two streets that differ at the very beginning:
- "avenue du general leclerc" vs "boulevard du general leclerc"
These share no prefix at all (l = 0), so Jaro-Winkler equals Jaro exactly. No bonus is applied, and the score is lower than it would be for two strings with the same beginning. This is exactly the intended behavior: a difference at the start is a stronger signal of a mismatch than a difference in the middle.
Now consider a house number mismatch:
- "12 rue de la paix" vs "21 rue de la paix": the prefix is "1" vs "2", so l = 0. The score penalizes the mismatch at the start, which is what we want - these are two completely different addresses.
In Python
```python
import jellyfish

jellyfish.jaro_winkler_similarity("church rd", "church road")
# → ~0.963

jellyfish.jaro_winkler_similarity("12 rue de la paix", "21 rue de la paix")
# → ~0.980 (no prefix bonus: the strings differ at the first character)

jellyfish.jaro_winkler_similarity("12 rue de la paix", "12 rue de la pai")
# → ~0.988 (strong prefix, only one char missing at the end)
```
Pros and cons
| Pros | Cons |
|---|---|
| Prefix bonus aligns with how we read addresses (number first, then street name) | Prefix bonus requires consistent formatting - mixed case will kill the bonus |
| Often more accurate than plain Jaro for addresses and proper names | p = 0.1 is conventional; tuning it for your data takes experimentation |
| Normalized 0-1 score, easy to threshold | Like Jaro, harder to explain to non-technical stakeholders than edit distance |
When to use it for addresses
Jaro-Winkler is the best general-purpose choice for full address comparison when strings are properly normalized (lowercase, consistent formatting). The prefix bonus is particularly useful when the house number is at the start of the string - it correctly penalizes address strings that start with different numbers, since those represent different physical locations regardless of the rest.
Summary table
| Metric | Type | Score range | Transpositions | Prefix bonus | Best for |
|---|---|---|---|---|---|
| Levenshtein | Distance | 0 to +inf | Costs 2 | No | Abbreviations, insertions, deletions |
| Damerau-Levenshtein | Distance | 0 to +inf | Costs 1 | No | User-typed input, keyboard typos |
| Jaro | Similarity | 0 to 1 | Handled | No | Individual address fields |
| Jaro-Winkler | Similarity | 0 to 1 | Handled | Yes (up to 4 chars) | Full normalized address strings |
Quick decision guide:
- User types addresses into a form - use Damerau-Levenshtein. Transpositions are very common in manual input.
- You are deduplicating a database of street names - use Jaro or Jaro-Winkler on the street name component alone.
- You want a single score for a full address string - use Jaro-Winkler on normalized strings.
- Your main problem is abbreviations ("Blvd" vs "Boulevard") - normalize abbreviations first (see below), then any metric will do.
Normalization: the step that matters more than metric choice
No string metric can compensate for what normalization ignores. Before comparing two addresses, apply at least these basics:
- Lowercase everything: "Baker Street" and "baker street" should score 1.0, not 0.9.
- Expand abbreviations consistently: "St" -> "street", "Blvd" -> "boulevard", "Ave" -> "avenue", "Rd" -> "road". Or shorten everything to a consistent short form.
- Remove punctuation: "St. James's" and "St Jamess" will score very differently if you keep periods and apostrophes.
- Normalize whitespace: collapse multiple spaces, strip leading/trailing whitespace.
With these steps, "15 Baker St." and "15 Baker Street" become "15 baker street" and "15 baker street" - identical strings that score perfectly on any metric.
```python
import re

def normalize_address(addr):
    addr = addr.lower()
    addr = re.sub(r"[.,'\-]", "", addr)
    addr = re.sub(r"\bst\b", "street", addr)
    addr = re.sub(r"\bblvd\b", "boulevard", addr)
    addr = re.sub(r"\bave\b", "avenue", addr)
    addr = re.sub(r"\brd\b", "road", addr)
    addr = re.sub(r"\s+", " ", addr).strip()
    return addr

normalize_address("15 Baker St.")     # → "15 baker street"
normalize_address("15 Baker Street")  # → "15 baker street"
```
After normalization, both strings are identical and any metric returns a perfect score.
Putting it all together: a complete address comparison
Here is a short example that combines normalization with all four metrics:
```python
import re

import jellyfish

def normalize_address(addr):
    addr = addr.lower()
    addr = re.sub(r"[.,'\-]", "", addr)
    addr = re.sub(r"\bst\b", "street", addr)
    addr = re.sub(r"\bblvd\b", "boulevard", addr)
    addr = re.sub(r"\bave\b", "avenue", addr)
    addr = re.sub(r"\brd\b", "road", addr)
    addr = re.sub(r"\s+", " ", addr).strip()
    return addr

pairs = [
    ("15 Baker St.", "15 Baker Street"),
    ("15 Baker Srteet", "15 Baker Street"),
    ("12 Rue de la Paix", "21 Rue de la Paix"),
    ("Church Rd", "Church Road"),
]

for a, b in pairs:
    na, nb = normalize_address(a), normalize_address(b)
    lev = jellyfish.levenshtein_distance(na, nb)
    dlev = jellyfish.damerau_levenshtein_distance(na, nb)
    jaro = jellyfish.jaro_similarity(na, nb)
    jw = jellyfish.jaro_winkler_similarity(na, nb)
    print(f"{a!r} vs {b!r}")
    print(f"  Levenshtein: {lev}, Damerau-Levenshtein: {dlev}")
    print(f"  Jaro: {jaro:.3f}, Jaro-Winkler: {jw:.3f}")
```
Expected output:
```
'15 Baker St.' vs '15 Baker Street'
  Levenshtein: 0, Damerau-Levenshtein: 0   # normalization made them identical
  Jaro: 1.000, Jaro-Winkler: 1.000
'15 Baker Srteet' vs '15 Baker Street'
  Levenshtein: 2, Damerau-Levenshtein: 1   # one transposition, not two substitutions
  Jaro: 0.978, Jaro-Winkler: 0.987
'12 Rue de la Paix' vs '21 Rue de la Paix'
  Levenshtein: 2, Damerau-Levenshtein: 1   # different house number
  Jaro: 0.980, Jaro-Winkler: 0.980         # no prefix bonus (1 != 2)
'Church Rd' vs 'Church Road'
  Levenshtein: 0, Damerau-Levenshtein: 0   # normalization expanded Rd -> Road
  Jaro: 1.000, Jaro-Winkler: 1.000
```
Note how the normalization step handles "St" vs "Street" and "Rd" vs "Road" completely, leaving the metrics to handle what normalization cannot catch (typos, transpositions, genuine differences).
Where to go from here
String similarity metrics are one tool in the address quality toolbox. In geocoding workflows, they are commonly used to:
- Verify that a geocoded result actually matches the input address (compare the returned address label to the input).
- Deduplicate address lists before batching geocoding requests.
- Build fuzzy search over a local address database before calling a paid API.
If you are working on address data quality as part of a geocoding pipeline, these resources may also be useful:
- Learn how to geocode large batches cost-effectively: How to reduce geocoding costs by 67%
- See what geocoding providers are available and how their prices compare: Geocoding prices in 2026
- Find the best geocoding provider for your country: Best geocoding providers for France, UK, Germany