
Python - match and parse strings containing numeric/currency amounts [closed]

Say I have the following inputs in Python:

1) "$ 1,350,000"
2) "1.35 MM $"
3) "$ 1.35 M"
4) 1350000 (this one is a numeric value, not a string)

Obviously the number is the same in each case, although the string representation differs. How can I match these strings, or in other words classify them as equal?

One way would be to model the possible patterns with regular expressions. However, there might be cases that I haven't thought of.

Does anyone see an NLP solution to this problem?

Thanks

mrt asked Oct 15 '25 22:10


1 Answer

This is not an NLP problem, just a job for regexes, plus some code to ignore token order and to look up a dictionary (or ontology) of known abbreviations like "MM".

  • First, you can completely disregard the '$' character here (unless you need to disambiguate against other currencies or symbols).
  • So all this boils down to parsing number formats and mapping 'M'/'MM'/'million' -> a 1e6 multiplier, and doing that parsing in an order-independent way (the multiplier, currency symbol and amount can appear in any relative order, and the multiplier and currency symbol may be absent).

Here's some working code:

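def parse_numeric_string(s):
    """Parse input such as "$ 1.35 M" into a (currency, amount) tuple."""

    # Accept plain numbers (int or float) as well as strings
    if isinstance(s, (int, float)):
        s = str(s)

    amount = None
    currency = ''
    multiplier = 1.0

    # split() with no argument also copes with repeated whitespace
    for token in s.split():

        token = token.lower()

        if token in ['$', '€', '£', '¥']:
            currency = token
            continue

        # Extract multipliers from their string names/abbrevs
        if token in ['million', 'm', 'mm']:
            multiplier = 1e6
            continue
        # ... or you could use a dict:
        # multiplier = {'million': 1e6, 'm': 1e6...}.get(token, multiplier)

        # Assume anything else is some string format of number/int/float/scientific
        try:
            amount = float(token.replace(',', ''))
        except ValueError:
            pass  # Process your parse failures...

    # No numeric token found: return None rather than multiplying None
    if amount is None:
        return (currency, None)

    # Return a tuple, or whatever you prefer
    return (currency, amount * multiplier)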

parse_numeric_string("$ 1,350,000")
parse_numeric_string("1.35 MM $")
parse_numeric_string("$ 1.35 M")
parse_numeric_string(1350000)
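
To get back to the original question of classifying the inputs as equal, one option is to compare the (currency, amount) tuples that come out of the parser. The sketch below is only an illustration: same_amount is a hypothetical helper name, the relative tolerance is an assumed default, and treating a missing currency as "unknown, so it matches anything" is a design assumption you may not want.

```python
import math


def same_amount(a, b, rel_tol=1e-9):
    """Compare two (currency, amount) tuples as returned by the parser above.

    Hypothetical helper: the name and the tolerance are assumptions.
    """
    currency_a, amount_a = a
    currency_b, amount_b = b

    # A failed parse (amount is None) never counts as a match
    if amount_a is None or amount_b is None:
        return False

    # Assumption: an empty currency means "unknown", so it is compatible
    # with any currency; two different known currencies never match
    if currency_a and currency_b and currency_a != currency_b:
        return False

    # Amounts are floats, so compare with a tolerance rather than ==
    return math.isclose(amount_a, amount_b, rel_tol=rel_tol)


print(same_amount(('$', 1350000.0), ('', 1350000.0)))
print(same_amount(('$', 1350000.0), ('€', 1350000.0)))
```

If you need exact arithmetic (e.g. for accounting), parsing into decimal.Decimal instead of float would avoid the tolerance altogether.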
  • For internationalization, beware that , and . can swap roles as thousands separator and decimal point, and that ' is also used as a thousands separator (e.g. in Switzerland). There's also a third-party Python package 'parse', e.g. parse.parse('{:n}', '1,350,000') (it's the reverse of format()).
  • Using an ontology or general NLP library would probably be way more trouble than it's worth. For example, you'd need to disambiguate between 'mm' as in "accounting abbreviation for millions" vs "millimeters" vs 'Mm' as in "megameters, 10^6 meters", which is an almost-never-used but valid metric unit of distance. So less generality is probably better for this task.
  • You could also use a dict-based approach to map other currency signifiers, e.g. 'dollars', 'US', 'USD', 'US$', 'EU'...
  • Here I tokenized on whitespace, but you might want to tokenize on any word/numeric/whitespace/punctuation boundaries so you can parse e.g. USD1.3m.
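
The last two points can be sketched together: a regex tokenizer instead of whitespace splitting, plus dict lookups for currency signifiers and multipliers. Everything named here (TOKEN_RE, CURRENCIES, MULTIPLIERS, parse_amount) is hypothetical, the lookup tables are deliberately tiny, and the regex only handles "," as thousands separator and "." as decimal point. It also uses Decimal rather than float, which is a swapped-in choice so that the results compare exactly.

```python
import re
from decimal import Decimal

# Illustrative lookup tables (assumptions); extend them for your own data
CURRENCIES = {'$': 'USD', 'usd': 'USD', 'us$': 'USD', 'dollars': 'USD',
              '€': 'EUR', 'eur': 'EUR'}
MULTIPLIERS = {'m': Decimal(10) ** 6, 'mm': Decimal(10) ** 6,
               'million': Decimal(10) ** 6}

# A token is a number (optional thousands commas and decimal part),
# a word optionally followed by '$' (so 'us$' stays whole), or a symbol
TOKEN_RE = re.compile(r'\d[\d,]*(?:\.\d+)?|[a-z]+\$?|[$€£¥]')


def parse_amount(text):
    """Return (currency, amount); either element may be None."""
    currency, amount, multiplier = None, None, Decimal(1)
    for token in TOKEN_RE.findall(str(text).lower()):
        if token in CURRENCIES:
            currency = CURRENCIES[token]
        elif token in MULTIPLIERS:
            multiplier = MULTIPLIERS[token]
        elif token[0].isdigit():
            amount = Decimal(token.replace(',', ''))
        # Any other word or symbol is silently ignored in this sketch
    if amount is None:
        return currency, None
    return currency, amount * multiplier


for sample in ["$ 1,350,000", "1.35 MM $", "$ 1.35 M", 1350000, "USD1.3m"]:
    print(repr(sample), '->', parse_amount(sample))
```

Because tokens are found by pattern rather than by splitting on spaces, the order of currency, amount and multiplier still does not matter, and glued-together input such as USD1.3m is split into three tokens.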
smci answered Oct 18 '25 11:10
