Python

Question

Say I have the following inputs in Python:

1) "$ 1,350,000"
2) "1.35 MM $"
3) "$ 1.35 M"
4) 1350000 (this one is a numeric value, not a string)

Obviously the number is the same, although the string representations differ. How can I match these inputs, or in other words, classify them as representing the same amount?

One way would be to model the possible patterns with regular expressions. However, there might be cases that I haven't thought of.

Does anyone see an NLP solution to this problem?

Thanks

smci · Accepted Answer

This is not an NLP problem, just a job for regexes, plus some code to ignore token order and to look up a dictionary (or ontology) of known abbreviations like "MM".

First, you can completely disregard the '$' character here (unless you need to disambiguate against other currencies or symbols).
So all this boils down to parsing number formats and mapping 'M'/'MM'/'million' to a 1e6 multiplier, and doing that parsing in an order-independent way (the multiplier, currency symbol and amount can appear in any relative order, and the multiplier or currency symbol may be missing altogether).

Here's some working code:

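def parse_numeric_string(s):
    """Parse a currency-like input into a (currency, amount) tuple."""

    # Accept plain numbers as well as strings
    if isinstance(s, (int, float)):
        s = str(s)

    amount = None
    currency = ''
    multiplier = 1.0

    # split() with no argument also collapses repeated whitespace
    for token in s.split():

        token = token.lower()

        if token in ('$', '€', '£', '¥'):
            currency = token

        # Extract multipliers from their string names/abbrevs
        elif token in ('million', 'm', 'mm'):
            multiplier = 1e6
        # ... or you could use a dict:
        # multiplier = {'million': 1e6, 'm': 1e6...}.get(token, 1.0)

        # Assume anything else is some string format of number/int/float/scientific
        else:
            try:
                amount = float(token.replace(',', ''))
            except ValueError:
                pass # Process your parse failures...

    # Without this check, a missing amount would fail with a TypeError below
    if amount is None:
        raise ValueError('no numeric amount found in %r' % (s,))

    # Return a tuple, or whatever you prefer
    return (currency, amount * multiplier)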

parse_numeric_string("$ 1,350,000")  # ('$', 1350000.0)
parse_numeric_string("1.35 MM $")    # ('$', 1350000.0)
parse_numeric_string("$ 1.35 M")     # ('$', 1350000.0)
parse_numeric_string(1350000)        # ('', 1350000.0)
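Since the question is really about deciding whether two inputs are "equal", one option is to compare the parsed (currency, amount) tuples rather than the strings. The helper below is only a sketch: the name same_amount, the tolerance, and the choice to treat a missing currency as a wildcard are all assumptions, not something the question specifies.

```python
import math

def same_amount(a, b, rel_tol=1e-9):
    """Compare two (currency, amount) tuples, such as those returned by a parser.

    Assumption: an empty currency string means "unknown" and matches anything.
    """
    currency_a, amount_a = a
    currency_b, amount_b = b

    # Two known but different currencies never match
    if currency_a and currency_b and currency_a != currency_b:
        return False

    # Compare with a tolerance, since multipliers introduce float arithmetic
    return math.isclose(amount_a, amount_b, rel_tol=rel_tol)

print(same_amount(('$', 1350000.0), ('', 1.35 * 1e6)))
print(same_amount(('$', 1350000.0), ('€', 1350000.0)))
```

If amounts must match exactly (for accounting, say), parsing into decimal.Decimal instead of float would be the safer design.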

For internationalization, beware that , and . can be swapped as thousands separator and decimal point, and that some locales use ' as the thousands separator (Swiss formatting, for example). There's also a third-party Python package 'parse', e.g. parse.parse('{fn}', '1,350,000') (it's the reverse of format()).
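As a rough illustration of the separator issue, the number-parsing step could take the separators as parameters instead of hard-coding the comma. This is a sketch only: parse_localized_number and its parameter names are made up for this example, and the caller is assumed to know which convention the input uses (it does not try to guess).

```python
def parse_localized_number(text, decimal_sep=".", thousands_sep=","):
    """Convert a formatted number string to float, given its separators."""
    if decimal_sep == thousands_sep:
        raise ValueError("decimal and thousands separators must differ")

    # Drop grouping characters first, then normalize the decimal point
    cleaned = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)

print(parse_localized_number("1,350,000"))
print(parse_localized_number("1.350.000,50", decimal_sep=",", thousands_sep="."))
print(parse_localized_number("1'350'000", thousands_sep="'"))
```

Guessing the convention from the string alone is ambiguous ("1,350" could be 1350 or 1.35), which is why the separators are passed in explicitly here.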
Using an ontology or general NLP library would probably be way more trouble than it's worth. For example, you'd need to disambiguate between 'mm' as in "accounting abbreviation for millions" vs "millimeters" vs 'Mm' as in "megameters, 10^6 meters", which is an almost-never-used but valid metric unit for distance. So less generality is probably better for this task.
You could also use a dict-based approach to map other currency signifiers, e.g. 'dollars', 'US', 'USD', 'US$', 'EU'...
Here I tokenized on whitespace, but you might want to tokenize on any word/numeric/whitespace/punctuation boundaries so you can parse e.g. USD1.3m.
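To make the last two notes concrete, here is one possible sketch that tokenizes with a regex instead of splitting on whitespace and looks up currencies and multipliers in dicts. The names TOKEN_RE, CURRENCIES, MULTIPLIERS and parse_amount are hypothetical, the alias tables are deliberately tiny, and the pattern assumes comma-grouped numbers with a dot as decimal point (no scientific notation).

```python
import re

# A number (optionally comma-grouped, optional decimals), a word, or a currency symbol
TOKEN_RE = re.compile(r"\d[\d,]*(?:\.\d+)?|[A-Za-z]+|[$€£¥]")

# Illustrative alias tables; extend as needed
CURRENCIES = {"$": "USD", "usd": "USD", "dollars": "USD", "€": "EUR", "eur": "EUR"}
MULTIPLIERS = {"m": 1e6, "mm": 1e6, "million": 1e6}

def parse_amount(s):
    """Return (currency_code, amount) for inputs like 'USD1.3m' or '$ 1,350,000'."""
    currency, amount, multiplier = "", None, 1.0
    for token in TOKEN_RE.findall(str(s)):
        key = token.lower()
        if key in CURRENCIES:
            currency = CURRENCIES[key]
        elif key in MULTIPLIERS:
            multiplier = MULTIPLIERS[key]
        elif token[0].isdigit():
            amount = float(token.replace(",", ""))
        # Any other word is silently skipped in this sketch
    if amount is None:
        raise ValueError("no numeric amount found in %r" % (s,))
    return currency, amount * multiplier

print(parse_amount("USD1.3m"))
print(parse_amount("$1,350,000"))
```

Because the regex finds tokens wherever they occur, the order of currency, amount and multiplier still does not matter, and no spaces are required between them.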

Python - match and parse strings containing numeric/currency amounts [closed]

Tags:

regex

parsing

currency

text-mining

mrt
