I'm working on a web scraping project in Python to collect data from a real estate website. I'm running into an issue with the addresses, as they are not always consistent.
I've already handled simple issues like pipes (|) and newlines. The main problem is that some addresses have a repeated street name, separated by a comma.
For example, I'm getting addresses like this:
'747 Geary Street, 747 Geary St, Oakland, CA 94609'
The goal is to get a single, clean address without the repetition, like:
'747 Geary Street, Oakland, CA 94609'
I've tried a few things, but I'm having trouble handling both types of addresses in a single line.
This is a training project, and the goal is to write the code myself rather than use tools such as AI.
Here is an example:
# Here is an example of the addresses I am trying to clean.
addresses_to_clean = [
    'The Gantry | 1340 3rd St, San Francisco, CA',
    '845 Sutter, 845 Sutter St APT 509, San Francisco, CA',
    '1350 Washington Street | 1350 Washington St, San Francisco, CA',
    'Parkmerced 3711 19th Ave, San Francisco, CA',
    '747 Geary Street, 747 Geary St, Oakland, CA 94609'
]
# Here is the code I am using:
cleaned_addresses = [address.strip().replace("|", "") for address in addresses_to_clean]
# Of course, this does not solve the problem of repeated parts, which I am struggling with.
# This is what I want the list to look like after it's cleaned:
desired_output = [
    'The Gantry, 1340 3rd St, San Francisco, CA',
    '845 Sutter St APT 509, San Francisco, CA',
    '1350 Washington St, San Francisco, CA',
    'Parkmerced 3711 19th Ave, San Francisco, CA',
    '747 Geary Street, Oakland, CA 94609'
]
# How can I write the code to get from my 'addresses_to_clean' list
# to the 'desired_output' list?
I am trying to use a single list comprehension with .split() and .replace() to clean the addresses. I was expecting to get a single, clean address string for each property. However, my code either removed too much information (like the city and state) or didn't correctly handle all the different formatting issues.
You won't be able to solve this using split() and replace() only. The following code works on your examples and uses a three-step approach:
1. Unify the delimiters: | becomes , (expandable to your needs by adding characters).
2. Normalize the street suffixes: Street becomes St.
3. Split on commas and drop every part that is completely contained in a later part.
Feel free to adapt the steps to your needs. I am pretty sure your test set is not complete in terms of potential input, so you will surely have to handle further cases. But this should get you started.
import re

def clean_address(address):
    # Normalize common street suffixes
    suffix_map = {
        r'\bStreet\b': 'St',
        r'\bAvenue\b': 'Ave',
        r'\bRoad\b': 'Rd',
        r'\bBoulevard\b': 'Blvd',
        r'\bDrive\b': 'Dr',
        r'\bLane\b': 'Ln',
        r'\bCourt\b': 'Ct',
    }
    normalized = address.replace('|', ',')
    for pattern, abbr in suffix_map.items():
        normalized = re.sub(pattern, abbr, normalized, flags=re.IGNORECASE)
    parts = [part.strip() for part in normalized.split(',')]
    # Check if one part is completely contained in another and remove the smaller or first equal one
    cleaned = []
    for i, part in enumerate(parts):
        if not any(i < j and part in other for j, other in enumerate(parts)):
            cleaned.append(part)
    return ', '.join(cleaned)
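One thing worth knowing about the containment check: because of the `i < j` condition, a part is only dropped when a later part contains it. If the longer form comes first, both survive. The snippet below isolates just that check to show the difference. It is a minimal sketch, and `drop_contained` is a hypothetical helper name, not part of the function above.

```python
def drop_contained(parts):
    """Keep a part unless some LATER part contains it as a substring."""
    return [
        part
        for i, part in enumerate(parts)
        if not any(i < j and part in other for j, other in enumerate(parts))
    ]

short_first = ['845 Sutter', '845 Sutter St APT 509', 'San Francisco', 'CA']
long_first = ['845 Sutter St APT 509', '845 Sutter', 'San Francisco', 'CA']

print(drop_contained(short_first))  # the short duplicate is dropped
print(drop_contained(long_first))   # nothing is dropped
```

If your real data can have the longer form first, you could drop the `i < j` condition for unequal parts and keep it only for exact duplicates.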
Output of clean_address for the five addresses from the question:
Original: The Gantry | 1340 3rd St, San Francisco, CA
Cleaned: The Gantry, 1340 3rd St, San Francisco, CA
Original: 845 Sutter, 845 Sutter St APT 509, San Francisco, CA
Cleaned: 845 Sutter St APT 509, San Francisco, CA
Original: 1350 Washington Street | 1350 Washington St, San Francisco, CA
Cleaned: 1350 Washington St, San Francisco, CA
Original: Parkmerced 3711 19th Ave, San Francisco, CA
Cleaned: Parkmerced 3711 19th Ave, San Francisco, CA
Original: 747 Geary Street, 747 Geary St, Oakland, CA 94609
Cleaned: 747 Geary St, Oakland, CA 94609
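Note that the last result reads '747 Geary St', while the desired output has '747 Geary Street'. That happens because the abbreviation is applied to the text that gets returned. One possible adaptation is to use the abbreviated form only for comparing parts and to return the original spelling. The sketch below does that under a few assumptions: the function names are hypothetical, the suffix table is a lowercase copy of the one above, and parts are compared word by word rather than as raw substrings. The desired list is not fully consistent on this point (it keeps 'St' for Washington but 'Street' for Geary), so this variant keeps whichever spelling appears first.

```python
# Sketch only: names are hypothetical; the table is an assumed lowercase
# copy of the suffix map used in clean_address.
SUFFIXES = {
    'street': 'st', 'avenue': 'ave', 'road': 'rd', 'boulevard': 'blvd',
    'drive': 'dr', 'lane': 'ln', 'court': 'ct',
}

def comparison_key(part):
    """Lowercase a part and abbreviate suffixes, for comparison only."""
    return ' '.join(SUFFIXES.get(word, word) for word in part.lower().split())

def clean_address_keep_spelling(address):
    parts = [p.strip() for p in address.replace('|', ',').split(',')]
    parts = [p for p in parts if p]  # ignore empty pieces
    keys = [comparison_key(p) for p in parts]
    cleaned = []
    for i, key in enumerate(keys):
        # Same key seen before: a repeat with a different spelling.
        repeats_earlier = key in keys[:i]
        # Whole-word match inside a longer part (padding avoids partial words).
        inside_longer = any(
            key != other and f' {key} ' in f' {other} ' for other in keys
        )
        if not (repeats_earlier or inside_longer):
            cleaned.append(parts[i])  # emit the original spelling
    return ', '.join(cleaned)

print(clean_address_keep_spelling('747 Geary Street, 747 Geary St, Oakland, CA 94609'))
```

With this variant the Washington example comes out as '1350 Washington Street, San Francisco, CA', so pick whichever behaviour suits your data.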