I have a .csv of about 8000 rows. In this csv, one of the fields is a one-line mailing address. I want to split this one-line mailing address into its constituent parts (street address, city, state, zip code). Here is what my csv more-or-less looks like:
Name| Full Address | Street | City | State | PostalCode
Mary| 123 Yor Street Apt 5 Los Angeles California 12345 |Null | Null| Null | Null |
Bob | 567 Other Ave Chicago Illinois 56789 | Null | Null | Null | Null|
Jim | 890 Last Street COlorado Springs Colorado 80919 | Null | Null | Null | Null|
So thats what I have. What I am looking for is this output:
Name Full Address Street City State PostalCode
Mary | 123 Yor Street Apt 5 Los Angeles California 12345 | 123 Yor Street Apt 5| Los Angeles| California| 12345|
Bob | 567 Other Ave Chicago Illinois 56789 | 567 Other Ave | Chicago | Illinois | 56789|
Jim | 890 Last Street COlorado Springs Colorado 80919 | 890 Last Street | Colorado SPrings | Colorado | 80919
Note: The most important parts that I'd like to be able to separate are City and State, so if the variations on street names (like including apartment #s) complicate things, that can be ignored if I can get City and State successfully extracted.
I don't have a lot of experience working with strings, So I'm kinda starting at square one. My first thought is that a regex might be able to parse this out, but I'm not sure how I would construct a regex to identify those things. Because of this, I'm not sure it's the correct approach.
There are no commas or anything like that which would make obvious separators, which is the other complicating part. Maybe using word.split() on the Full Address field and then using a isin() with a dictionary of states for the state column?
So, anyone who can provide some direction would be most appreciated!
Thank you for your time.
PS - I apologize for the table formatting.
I feel trying to regex your way out of this is going to be hard. Address data often is just straight up messy or bad. I've found the module usaddress to be pretty helpful. It's not 100% but what is with address matching?
https://github.com/datamade/usaddress
Here is an example:
import usaddress
def main():
address_list = [
"123 main street",
"99 elm st, manchester, nh",
"56 magic dr, town of, bedford, nh",
"123 Main St. Suite 100 IL, Faketown",
]
for addr in address_list:
data = usaddress.tag(addr)
if "PlaceName" in data[0].keys():
print(data[0]["PlaceName"])
else:
print("no city/town")
main()
Which results in:
no city/town
manchester
town of, bedford
no city/town
You can also have it bring back different data aside from PlaceName such as:
AddressNumber
StreetName
StreetNamePostType
StreetAddress
StateNmae
OccupancyType
OccupancyIdentifier
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With