Splitting a single-line address into its constituent parts? I'm thinking reg-ex, but I'm not sure?

Question

I have a .csv of about 8000 rows. In this csv, one of the fields is a one-line mailing address. I want to split this one-line mailing address into its constituent parts (street address, city, state, zip code). Here is what my csv more-or-less looks like:

Name| Full Address | Street | City | State | PostalCode

Mary| 123 Yor Street Apt 5 Los Angeles California 12345 |Null | Null| Null | Null |
Bob | 567 Other Ave Chicago Illinois 56789              | Null | Null | Null | Null|
Jim | 890 Last Street COlorado Springs Colorado 80919   | Null | Null | Null | Null|

So thats what I have. What I am looking for is this output:

Name     Full Address                                    Street                City          State       PostalCode
Mary | 123 Yor Street Apt 5 Los Angeles California 12345 | 123 Yor Street Apt 5| Los Angeles| California| 12345|
Bob  | 567 Other Ave Chicago Illinois 56789              | 567 Other Ave    | Chicago | Illinois | 56789|
Jim  | 890 Last Street COlorado Springs Colorado 80919   | 890 Last Street | Colorado SPrings | Colorado | 80919

Note: The most important parts that I'd like to be able to separate are City and State, so if the variations on street names (like including apartment #s) complicate things, that can be ignored if I can get City and State successfully extracted.

I don't have a lot of experience working with strings, So I'm kinda starting at square one. My first thought is that a regex might be able to parse this out, but I'm not sure how I would construct a regex to identify those things. Because of this, I'm not sure it's the correct approach.

There are no commas or anything like that which would make obvious separators, which is the other complicating part. Maybe using word.split() on the Full Address field and then using a isin() with a dictionary of states for the state column?

So, anyone who can provide some direction would be most appreciated!

Thank you for your time.

PS - I apologize for the table formatting.

sniperd · Accepted Answer

I feel trying to regex your way out of this is going to be hard. Address data often is just straight up messy or bad. I've found the module usaddress to be pretty helpful. It's not 100% but what is with address matching?

https://github.com/datamade/usaddress

Here is an example:

import usaddress

def main():
    address_list = [
    "123 main street",
    "99 elm st, manchester, nh",
    "56 magic dr, town of, bedford, nh",
    "123 Main St. Suite 100 IL, Faketown",
    ]

    for addr in address_list:
        data = usaddress.tag(addr)
        if "PlaceName" in data[0].keys():
            print(data[0]["PlaceName"])
        else:
            print("no city/town")

main()

Which results in:

no city/town
manchester
town of, bedford
no city/town

You can also have it bring back different data aside from PlaceName such as:

    AddressNumber
    StreetName
    StreetNamePostType
    StreetAddress
    StateNmae
    OccupancyType
    OccupancyIdentifier

Splitting a single-line address into its constituent parts? I'm thinking reg-ex, but I'm not sure?

Tags:

python

string

regex

pandas

T. Kelly

1 Answers

sniperd

Recent Activity

Donate For Us

Splitting a single-line address into its constituent parts? I'm thinking reg-ex, but I'm not sure?

Tags:

python

string

regex

pandas

T. Kelly

1 Answers

sniperd

Related questions

Recent Activity

Donate For Us