How to clean inconsistent address strings in Python?

Question

I'm working on a web scraping project in Python to collect data from a real estate website. I'm running into an issue with the addresses, as they are not always consistent.

I've already handled simple issues like pipes (|) and newlines. The main problem is that some addresses have a repeated street name, separated by a comma.

For example, I'm getting addresses like this:

'747 Geary Street, 747 Geary St, Oakland, CA 94609'

The goal is to get a single, clean address without the repetition, like:

'747 Geary Street, Oakland, CA 94609'

I've tried a few things, but I'm having trouble handling both types of addresses in a single line.

This is a training project, and the goal is to write the code myself rather than use tools such as AI.

Here is an example:

# Here is an example of the addresses I am trying to clean.
addresses_to_clean = [
    'The Gantry | 1340 3rd St, San Francisco, CA',
    '845 Sutter, 845 Sutter St APT 509, San Francisco, CA',
    '1350 Washington Street | 1350 Washington St, San Francisco, CA',
    'Parkmerced 3711 19th Ave, San Francisco, CA',
    '747 Geary Street, 747 Geary St, Oakland, CA 94609'
]

# Here is the code I am using:
cleaned_addresses = [address.strip().replace("|", "") for address in addresses_to_clean]
# Of course, this does not solve the problem of repeated parts, which I am struggling with.

# This is what I want the list to look like after it's cleaned:
desired_output = [
    'The Gantry, 1340 3rd St, San Francisco, CA',
    '845 Sutter St APT 509, San Francisco, CA',
    '1350 Washington St, San Francisco, CA',
    'Parkmerced 3711 19th Ave, San Francisco, CA',
    '747 Geary Street, Oakland, CA 94609'
]

# How can I write the code to get from my 'addresses_to_clean' list
# to the 'desired_output' list?

I am trying to use a single list comprehension with .split() and .replace() to clean the addresses. I expected to get a single, clean address string for each property. However, my code either removed too much information (such as the city and state) or did not handle all the different formatting issues correctly.

André · Accepted Answer

You won't be able to solve this with split() and replace() alone. The following code works on your examples and uses a three-step approach:

1. Convert pipes to commas. You can extend this to other separator characters as needed.
2. Normalize street suffixes to a common abbreviation, e.g. Street becomes St.
3. Find and drop any part that is already contained in a later part of the address.

Feel free to adapt the steps to your needs. Your test set almost certainly does not cover every possible input, so you will have to handle further cases as they come up, but this should get you started. Note that because of step 2, the last address comes out as '747 Geary St' rather than the '747 Geary Street' in your desired output.

import re

def clean_address(address):
    # Normalize common street suffixes
    suffix_map = {
        r'\bStreet\b': 'St',
        r'\bAvenue\b': 'Ave',
        r'\bRoad\b': 'Rd',
        r'\bBoulevard\b': 'Blvd',
        r'\bDrive\b': 'Dr',
        r'\bLane\b': 'Ln',
        r'\bCourt\b': 'Ct',
    }
    # Treat pipes as separators, just like commas
    normalized = address.replace('|', ',')
    # Abbreviate suffixes (whole words only, case-insensitive)
    for pattern, abbr in suffix_map.items():
        normalized = re.sub(pattern, abbr, normalized, flags=re.IGNORECASE)
    parts = [part.strip() for part in normalized.split(',')]

    # Drop a part if a later part contains it (substring test);
    # of two identical parts, only the last one is kept
    cleaned = []
    for i, part in enumerate(parts):
        if not any(i < j and part in other for j, other in enumerate(parts)):
            cleaned.append(part)

    return ', '.join(cleaned)
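
The output below can be reproduced with a small driver loop. This is only a sketch: print_report is a hypothetical helper name, and the cleaning function is passed in as an argument so the snippet stands on its own.

```python
def print_report(addresses, clean):
    """Print each address before and after `clean` is applied."""
    for address in addresses:
        print(f"Original: {address}")
        print(f"Cleaned:  {clean(address)}")
        print()  # blank line between entries


# Intended usage with the names from the question and the answer:
# print_report(addresses_to_clean, clean_address)
```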

Output:

Original: The Gantry | 1340 3rd St, San Francisco, CA
Cleaned:  The Gantry, 1340 3rd St, San Francisco, CA

Original: 845 Sutter, 845 Sutter St APT 509, San Francisco, CA
Cleaned:  845 Sutter St APT 509, San Francisco, CA

Original: 1350 Washington Street | 1350 Washington St, San Francisco, CA
Cleaned:  1350 Washington St, San Francisco, CA

Original: Parkmerced 3711 19th Ave, San Francisco, CA
Cleaned:  Parkmerced 3711 19th Ave, San Francisco, CA

Original: 747 Geary Street, 747 Geary St, Oakland, CA 94609
Cleaned:  747 Geary St, Oakland, CA 94609
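
Two inputs that the substring test in the last step does not handle well are a short part that is merely a prefix of an unrelated word ('Oak' inside 'Oakland') and a shorter duplicate that comes after the longer one. One possible refinement is to compare whole words and to look at all other parts, not only later ones. The following is a minimal sketch, assuming that splitting on whitespace is a good enough tokenization; contains_tokens and drop_redundant_parts are hypothetical names and are not part of the code above.

```python
def contains_tokens(haystack, needle):
    """True if the words of `needle` appear as a contiguous run in `haystack`."""
    h = haystack.lower().split()
    n = needle.lower().split()
    return any(h[k:k + len(n)] == n for k in range(len(h) - len(n) + 1))


def drop_redundant_parts(parts):
    """Remove parts whose words are fully contained in another part."""
    kept = []
    for i, part in enumerate(parts):
        redundant = any(
            j != i
            and contains_tokens(other, part)
            # Identical parts: drop only if a later copy exists,
            # so that exactly one copy survives.
            and (other.lower().split() != part.lower().split() or i < j)
            for j, other in enumerate(parts)
        )
        if not redundant:
            kept.append(part)
    return kept


# The shorter duplicate is dropped even though it comes second.
print(drop_redundant_parts(['845 Sutter St APT 509', '845 Sutter', 'San Francisco', 'CA']))
# → ['845 Sutter St APT 509', 'San Francisco', 'CA']

# 'Oak' is kept because it is not a whole word of 'Oakland'.
print(drop_redundant_parts(['Oak', 'Oakland', 'CA']))
# → ['Oak', 'Oakland', 'CA']
```

To try it, replace the final loop in clean_address with `cleaned = drop_redundant_parts(parts)`. Whether word-level matching is the right rule depends on the real data, so treat it as a starting point.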

Tags: python, string, beautifulsoup, web-scraping

Adamzam15
