String Distance Metrics for Address Comparison: Levenshtein, Damerau, Jaro, and Jaro-Winkler
Comparing addresses is not a simple equality check
When you compare two address strings to decide if they refer to the same location, an exact match is rarely enough. Real-world addresses come with typos, abbreviations, inconsistent casing, and missing components. "15 Baker St" and "15 Baker Street" are the same address, but "15 Baker St" == "15 Baker Street" evaluates to False.
String distance metrics can be used to solve this problem. They assign a number to two strings that quantifies how different (or similar) they are. This number lets you set a threshold: "if the distance is less than 2, or the similarity is above 0.9, treat them as the same address."
In this guide, we cover the four most widely used metrics for address comparison:
- Levenshtein distance - the classic edit distance
- Damerau-Levenshtein distance - edit distance with transpositions
- Jaro similarity - character matching with position awareness
- Jaro-Winkler similarity - Jaro with a bonus for matching prefixes
For each one, we explain how it works, walk through a concrete address example, and discuss when it is (and is not) a good fit. At the end, a summary table and practical recommendations.
Table of contents:
- Comparing addresses is not a simple equality check
- A quick note before diving in
- 1. Levenshtein distance
- 2. Damerau-Levenshtein distance
- 3. Jaro similarity
- 4. Jaro-Winkler similarity
- Summary table
- Normalization: the step that matters more than metric choice
- Putting it all together: a complete address comparison
- Where to go from here
A quick note before diving in
There are two flavors of metric:
- Distance metrics (Levenshtein, Damerau-Levenshtein): return 0 for identical strings, and a higher integer for more different strings. Think of them as counting "how many operations to fix this?"
- Similarity metrics (Jaro, Jaro-Winkler): return 1.0 for identical strings, and a lower value (down to 0.0) for more different strings. Think of them as "what fraction of characters match?"
Neither type is universally better. Which one to use depends on your data and your use case.
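The two conventions are also easy to convert between when you want a single thresholding scheme. As a sketch (the `distance_to_similarity` helper below is ours for illustration, not a library function), an edit distance can be mapped onto the 0.0-1.0 similarity scale by dividing by the longer string's length:

```python
def distance_to_similarity(dist, s1, s2):
    # 1.0 means identical; 0.0 means roughly every character needs an edit
    longest = max(len(s1), len(s2))
    return 1.0 - dist / longest if longest else 1.0

distance_to_similarity(2, "15 baker srteet", "15 baker street")  # → ~0.867
```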
1. Levenshtein distance
How it works
The Levenshtein distance between two strings is the minimum number of single-character edits needed to transform one into the other. The allowed operations are:
- Insertion: add a character (e.g., "Stret" - insert an 'e' - to "Street")
- Deletion: remove a character (e.g., "Streeet" - delete one 'e' - to "Street")
- Substitution: replace one character with another (e.g., "Streat" - replace the 'a' with 'e' - to "Street")
A distance of 0 means the strings are identical. A distance of 1 means a single operation is enough.
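The definition translates directly into the classic dynamic-programming recurrence. Here is a minimal pure-Python sketch (the `levenshtein` name is ours, not from any library):

```python
def levenshtein(a, b):
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

levenshtein("15 baker srteet", "15 baker street")  # → 2
```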
Address example
Compare "15 baker srteet" (a common keyboard transposition typo) to "15 baker street":
```
15 baker srteet
15 baker street
```
To fix "srteet" into "street", Levenshtein needs 2 substitutions: replace the 'r' with a 't', then the 't' that follows with an 'r'.
Distance = 2.
This is technically correct, but it feels like a lot for what is clearly a single typing mistake. We will fix this in the next section.
In Python
```python
import jellyfish

jellyfish.levenshtein_distance("15 baker srteet", "15 baker street")  # → 2
```
Install with `pip install jellyfish`.
Pros and cons
| Pros | Cons |
|---|---|
| Simple and well understood | Counts a transposition (swap of two adjacent chars) as 2 operations instead of 1 |
| Handles insertions and deletions well (great for abbreviations like "St" vs "Street") | Raw distance is hard to compare across string pairs of different lengths |
| Available in virtually every language and library | Does not give extra weight to matching prefixes |
| Fast on short strings | |
When to use it for addresses
Levenshtein is a solid default, especially when the main source of variation is abbreviations or missing characters. Set a relative threshold to make comparisons fair across different string lengths:
```python
import jellyfish

def normalized_levenshtein(a, b):
    dist = jellyfish.levenshtein_distance(a, b)
    return dist / max(len(a), len(b))

# Treat as same address if < 15% different
normalized_levenshtein("15 baker street", "15 baker st")     # → 0.267
normalized_levenshtein("15 baker street", "15 baker sreet")  # → 0.067
```
2. Damerau-Levenshtein distance
How it works
Damerau-Levenshtein extends Levenshtein by adding one more operation:
- Transposition: swap two adjacent characters (e.g., "srteet" to "street" - swap 'r' and 't')
This single addition makes a big practical difference. Studies of typing errors show that transpositions (adjacent character swaps) are one of the most frequent mistakes people make when typing by hand. Recognizing them as a single error rather than two substitutions gives a much more realistic picture of "how many mistakes did the user make?"
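The extra operation needs only one more case in the Levenshtein recurrence. Below is a sketch of the restricted variant (often called optimal string alignment); `osa_distance` is our own illustrative name, and library implementations such as jellyfish's may implement the unrestricted variant, which can differ on unusual inputs:

```python
def osa_distance(a, b):
    # Full Levenshtein DP table, plus one extra case for adjacent transpositions
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

osa_distance("15 baker srteet", "15 baker street")  # → 1
```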
Address example
Same pair as before: "15 baker srteet" vs "15 baker street".
```
15 baker srteet
15 baker street
          ^^  ← 'rt' transposed to 'tr'
```
Damerau-Levenshtein distance = 1 (one transposition).
Compare this to the Levenshtein distance of 2 for the same pair. One error, one operation - much more intuitive.
Another example: "rue de al paix" vs "rue de la paix" (the words "al" and "la" are swapped - a common cut-and-paste error):
```
rue de al paix
rue de la paix
       ^^  ← 'a' and 'l' transposed
```
Levenshtein = 2, Damerau-Levenshtein = 1.
In Python
```python
import jellyfish

jellyfish.damerau_levenshtein_distance("15 baker srteet", "15 baker street")  # → 1
jellyfish.damerau_levenshtein_distance("rue de al paix", "rue de la paix")    # → 1
```
Pros and cons
| Pros | Cons |
|---|---|
| More realistic for human typing errors | Slightly more complex to implement from scratch |
| Transpositions (very frequent mistakes) cost 1, not 2 | Same issue with raw distance across different string lengths |
| Strict superset of Levenshtein - always <= Levenshtein distance | |
When to use it for addresses
Prefer Damerau-Levenshtein over plain Levenshtein whenever your input comes from users typing addresses manually (web forms, search boxes, mobile apps). It gives a better answer to the question "how careless was this input?" and lets you set tighter thresholds while still tolerating common slip-of-the-finger errors.
3. Jaro similarity
How it works
Jaro takes a completely different approach. Instead of counting edit operations, it measures the proportion of characters that match between two strings, with a penalty for characters that are far apart or out of order.
Two characters are considered matching if:
1. They are the same character, and
2. Their positions are not too far apart. The maximum allowed distance between two matching characters is: floor(max(len(s1), len(s2)) / 2) - 1
Once you count the matching characters m and the number of matching characters that are out of order t (transpositions), the Jaro similarity is:
Jaro(s1, s2) = (1/3) * (m / |s1| + m / |s2| + (m - t) / m)
The result is between 0.0 (nothing matches) and 1.0 (identical).
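The matching-window and transposition bookkeeping is easier to follow in code. Here is a minimal pure-Python sketch of the definition above (`jaro` is our own name for the helper):

```python
def jaro(s1, s2):
    # Count characters that match within the sliding window,
    # then discount matched characters that are out of order.
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Compare matched characters in order; each pair of mismatched
    # positions counts as one transposition.
    half = 0
    k = 0
    for i, c in enumerate(s1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if c != s2[k]:
                half += 1
            k += 1
    t = half // 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

jaro("church rd", "church road")  # → ~0.939
```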
Address example
Compare "church rd" (abbreviated) to "church road" (full form):
- s1 = "church rd" (9 characters), s2 = "church road" (11 characters)
- Matching window = floor(11 / 2) - 1 = 4
Every character in "church rd" can be matched to a character in "church road" within the window: c, h, u, r, c, h, (space), r, d all appear in the same relative order. So m = 9, t = 0.
Jaro = (1/3) * (9/9 + 9/11 + 9/9) = (1/3) * (1.000 + 0.818 + 1.000) ≈ 0.939
A similarity of 0.939 for what is clearly the same street, just abbreviated. That is the right answer.
In Python
```python
import jellyfish

jellyfish.jaro_similarity("church rd", "church road")    # → ~0.939
jellyfish.jaro_similarity("main street", "mian street")  # typo: 'i' and 'a' transposed
# → ~0.970
```
Pros and cons
| Pros | Cons |
|---|---|
| Normalized score between 0 and 1 - easy to threshold | Formula is less intuitive than edit distance |
| Handles abbreviations and length differences naturally | Can give surprisingly high scores to long strings with many common characters |
| Well suited for comparing individual address components | Does not give extra weight to matching prefixes - we fix this with Jaro-Winkler |
When to use it for addresses
Jaro is a good fit for comparing individual address fields (just the street name, just the city) rather than full concatenated address strings. It is also useful for detecting near-duplicate records in a database.
4. Jaro-Winkler similarity
How it works
Jaro-Winkler builds on Jaro by adding a prefix bonus: if two strings share the same first characters, their similarity is boosted. The motivation is that strings that start the same way are more likely to refer to the same thing than strings that diverge immediately.
The formula is:
JaroWinkler(s1, s2) = Jaro(s1, s2) + l * p * (1 - Jaro(s1, s2))
Where:
- l = length of the common prefix, up to a maximum of 4 characters
- p = scaling factor, conventionally 0.1
The effect is a modest upward adjustment when the prefix matches. With p = 0.1 and l = 4, the maximum bonus is +4 * 0.1 * (1 - Jaro), which boosts a Jaro of 0.90 by at most 0.04. Small but meaningful for ranking candidates.
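The bonus arithmetic is easy to check by hand. The sketch below (the `winkler_adjust` helper is ours for illustration) applies the formula to a precomputed Jaro score; note that many real implementations apply the bonus only when the Jaro score already exceeds a threshold, commonly 0.7:

```python
def winkler_adjust(jaro_score, s1, s2, p=0.1, max_l=4):
    # l = length of the common prefix, capped at max_l characters
    l = 0
    for a, b in zip(s1, s2):
        if a != b or l == max_l:
            break
        l += 1
    return jaro_score + l * p * (1 - jaro_score)

winkler_adjust(0.939, "church rd", "church road")  # → ~0.963
```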
Address example
Take "church rd" vs "church road" again (Jaro = 0.939, common prefix "chur", l = 4):
JaroWinkler = 0.939 + 4 * 0.1 * (1 - 0.939)
= 0.939 + 0.4 * 0.061
≈ 0.963
Now consider two streets that differ at the very beginning:
- "avenue du general leclerc" vs "boulevard du general leclerc"
These share no prefix at all (l = 0), so Jaro-Winkler equals Jaro exactly. No bonus is applied, and the score is lower than it would be for two strings with the same beginning. This is exactly the intended behavior: a difference at the start is a stronger signal of a mismatch than a difference in the middle.
Now consider a house number mismatch:
- "12 rue de la paix" vs "21 rue de la paix": the prefix is "1" vs "2", so l = 0. The score penalizes the mismatch at the start, which is what we want - these are two completely different addresses.
In Python
```python
import jellyfish

jellyfish.jaro_winkler_similarity("church rd", "church road")
# → ~0.963

jellyfish.jaro_winkler_similarity("12 rue de la paix", "21 rue de la paix")
# → ~0.980 (no prefix bonus: the strings differ at the first character)

jellyfish.jaro_winkler_similarity("12 rue de la paix", "12 rue de la pai")
# → ~0.988 (strong prefix, only one char missing at the end)
```
Pros and cons
| Pros | Cons |
|---|---|
| Prefix bonus aligns with how we read addresses (number first, then street name) | Prefix bonus requires consistent formatting - mixed case will kill the bonus |
| Often more accurate than plain Jaro for addresses and proper names | p = 0.1 is conventional; tuning it for your data takes experimentation |
| Normalized 0-1 score, easy to threshold | Like Jaro, harder to explain to non-technical stakeholders than edit distance |
When to use it for addresses
Jaro-Winkler is the best general-purpose choice for full address comparison when strings are properly normalized (lowercase, consistent formatting). The prefix bonus is particularly useful when the house number is at the start of the string - it correctly penalizes address strings that start with different numbers, since those represent different physical locations regardless of the rest.
Summary table
| Metric | Type | Score range | Transpositions | Prefix bonus | Best for |
|---|---|---|---|---|---|
| Levenshtein | Distance | 0 to +inf | Costs 2 | No | Abbreviations, insertions, deletions |
| Damerau-Levenshtein | Distance | 0 to +inf | Costs 1 | No | User-typed input, keyboard typos |
| Jaro | Similarity | 0 to 1 | Handled | No | Individual address fields |
| Jaro-Winkler | Similarity | 0 to 1 | Handled | Yes (up to 4 chars) | Full normalized address strings |
Quick decision guide:
- User types addresses into a form - use Damerau-Levenshtein. Transpositions are very common in manual input.
- You are deduplicating a database of street names - use Jaro or Jaro-Winkler on the street name component alone.
- You want a single score for a full address string - use Jaro-Winkler on normalized strings.
- Your main problem is abbreviations ("Blvd" vs "Boulevard") - normalize abbreviations first (see below), then any metric will do.
Normalization: the step that matters more than metric choice
No string metric can compensate for what normalization ignores. Before comparing two addresses, apply at least these basics:
- Lowercase everything: "Baker Street" and "baker street" should score 1.0, not 0.9.
- Expand abbreviations consistently: "St" -> "street", "Blvd" -> "boulevard", "Ave" -> "avenue", "Rd" -> "road". Or shorten everything to a consistent short form.
- Remove punctuation: "St. James's" and "St Jamess" will score very differently if you keep periods and apostrophes.
- Normalize whitespace: collapse multiple spaces, strip leading/trailing whitespace.
With these steps, "15 Baker St." and "15 Baker Street" become "15 baker street" and "15 baker street" - identical strings that score perfectly on any metric.
```python
import re

def normalize_address(addr):
    addr = addr.lower()
    addr = re.sub(r"[.,'\-]", "", addr)
    addr = re.sub(r"\bst\b", "street", addr)
    addr = re.sub(r"\bblvd\b", "boulevard", addr)
    addr = re.sub(r"\bave\b", "avenue", addr)
    addr = re.sub(r"\brd\b", "road", addr)
    addr = re.sub(r"\s+", " ", addr).strip()
    return addr

normalize_address("15 Baker St.")     # → "15 baker street"
normalize_address("15 Baker Street")  # → "15 baker street"
```
After normalization, both strings are identical and any metric returns a perfect score.
Putting it all together: a complete address comparison
Here is a short example that combines normalization with all four metrics:
```python
import re

import jellyfish

def normalize_address(addr):
    addr = addr.lower()
    addr = re.sub(r"[.,'\-]", "", addr)
    addr = re.sub(r"\bst\b", "street", addr)
    addr = re.sub(r"\bblvd\b", "boulevard", addr)
    addr = re.sub(r"\bave\b", "avenue", addr)
    addr = re.sub(r"\brd\b", "road", addr)
    addr = re.sub(r"\s+", " ", addr).strip()
    return addr

pairs = [
    ("15 Baker St.", "15 Baker Street"),
    ("15 Baker Srteet", "15 Baker Street"),
    ("12 Rue de la Paix", "21 Rue de la Paix"),
    ("Church Rd", "Church Road"),
]

for a, b in pairs:
    na, nb = normalize_address(a), normalize_address(b)
    lev = jellyfish.levenshtein_distance(na, nb)
    dlev = jellyfish.damerau_levenshtein_distance(na, nb)
    jaro = jellyfish.jaro_similarity(na, nb)
    jw = jellyfish.jaro_winkler_similarity(na, nb)
    print(f"{a!r} vs {b!r}")
    print(f"  Levenshtein: {lev}, Damerau-Levenshtein: {dlev}")
    print(f"  Jaro: {jaro:.3f}, Jaro-Winkler: {jw:.3f}")
```
Expected output:
```
'15 Baker St.' vs '15 Baker Street'
  Levenshtein: 0, Damerau-Levenshtein: 0   # normalization made them identical
  Jaro: 1.000, Jaro-Winkler: 1.000
'15 Baker Srteet' vs '15 Baker Street'
  Levenshtein: 2, Damerau-Levenshtein: 1   # one transposition, not two substitutions
  Jaro: 0.978, Jaro-Winkler: 0.987
'12 Rue de la Paix' vs '21 Rue de la Paix'
  Levenshtein: 2, Damerau-Levenshtein: 1   # different house number
  Jaro: 0.980, Jaro-Winkler: 0.980         # no prefix bonus (1 != 2)
'Church Rd' vs 'Church Road'
  Levenshtein: 0, Damerau-Levenshtein: 0   # normalization expanded Rd -> Road
  Jaro: 1.000, Jaro-Winkler: 1.000
```
Note how the normalization step handles "St" vs "Street" and "Rd" vs "Road" completely, leaving the metrics to handle what normalization cannot catch (typos, transpositions, genuine differences).
Where to go from here
String similarity metrics are one tool in the address quality toolbox. In geocoding workflows, they are commonly used to:
- Verify that a geocoded result actually matches the input address (compare the returned address label to the input).
- Deduplicate address lists before batching geocoding requests.
- Build fuzzy search over a local address database before calling a paid API.
If you are working on address data quality as part of a geocoding pipeline, these resources may also be useful:
- Learn how to geocode large batches cost-effectively: How to reduce geocoding costs by 67%
- See what geocoding providers are available and how their prices compare: Geocoding prices in 2026
- Find the best geocoding provider for your country: Best geocoding providers for France, UK, Germany