Case Insensitive Regex Replacement From Dictionary

Question

I'm sorry, but I haven't been able to find a working solution among the results Google has been giving me. A couple of "recipes" on some site were pretty close, but they're way old, and I haven't found anything that gives me the result I'm looking for.

I'm renaming files, and I have a function that spits out the filename; for this question I'm just using test strings. All the dots (and underscores) and similar separators are removed first, since those are the thing these professors most often do differently, and the names are impossible to deal with (or look at) until they're gone. Five examples:

test_string_1 = 'legal.studies.131.race.relations.in.the.United.States.'

'legal.studies' --> 'Legal Studies'

test_string_2 = 'mediastudies the triumph of bluray over hddvd'

'mediastudies' --> 'Media Studies', 'bluray' --> 'Blu-ray', 'hddvd' --> 'HD DVD'

test_string_3 = 'computer Science Microsoft vs unix'

'computer Science' --> 'Computer Science', 'unix' --> 'UNIX'

test_string_4 = 'Perception - metamers dts'

'Perception' is already fine (but who cares); the big picture is that they want to keep the audio information in there, so 'dts' --> 'DTS'

test_string_5 = 'Perception - Cue Integration - flashing dot example aac20 xvid'

'aac20' --> 'AAC2.0', 'xvid' --> 'XviD'

Currently I'm running this through something like:

new_string = re.sub(r'(?i)Legal(\s|-|)Studies', 'Legal Studies', re.sub(r'(?i)Sociology', 'Sociology', re.sub(r'(?i)Media(\s|-|)Studies', 'Media Studies', re.sub(r'(?i)UNIX', 'UNIX', re.sub(r'(?i)Blu(\s|-|)ray', 'Blu-ray', re.sub(r'(?i)HD(\s|-|)DVD', 'HD DVD', re.sub(r'(?i)xvid(\s|-|)', 'XviD', re.sub(r'(?i)aac(\s|-|)2(\s|-|\.|)0', 'AAC2.0', re.sub(r'(?i)dts', 'DTS', re.sub(r'\.', r' ', original_string.title()))))))))))

I have them all smushed together on one line because I'm not changing or updating it much, and (the way my brain/ADD works) it's easier to keep it as minimal and out-of-the-way as possible while I'm doing other things, once I'm no longer messing with this part.

So, with my examples, the results should be:

new_test_string_1 = 'Legal Studies 131 Race Relations In The United States'
new_test_string_2 = 'Media Studies The Triumph Of Blu-ray Over HD DVD'
new_test_string_3 = 'Computer Science Microsoft Vs UNIX'
new_test_string_4 = 'Perception - Metamers DTS'
new_test_string_5 = 'Perception - Cue Integration - Flashing Dot Example AAC2.0 XviD'

However, as I collect more and more of these, it's really starting to become the kind of thing I want a dictionary or something for. I don't want to blow up the code into anything crazy, but I'd like to be able to add new replacements as real-life examples come up (for example, there are a lot of audio codecs/containers/whatevers out there, and it looks like I might have to just throw them all in). I have no opinion about the method used for this master list/dictionary/whatever.

Big picture: I'm fixing spaces and underscores in the filenames and correcting capitalization. At the moment I title-case everything and then apply the re.subs as exceptions; they handle cases where the capitalization isn't right and where the input may or may not contain a space, dash, or dot that the output should have.

A one-liner, unnamed function (such as a lambda) would be preferable.

P.S. Sorry for some of the weirdness and some of the initial lack of clarity. One of the problems here is that in my own major/studies most of the stuff is actually pretty straightforward; it's the other classes that need all the Blu-ray, HD DVD, DTS, AAC2.0, XviD, etc.

jamylak · Accepted Answer

>>> import re
>>> def string_fix(text, substitutions):
        # Turn dots into spaces, then apply every pattern case-insensitively.
        text_no_dots = text.replace('.', ' ').strip()
        for key, substitution in substitutions.items():
            text_no_dots = re.sub(key, substitution, text_no_dots, flags=re.IGNORECASE)
        return text_no_dots

>>> teststring = 'legal.studies.131.race.relations.in.the.U.S.'
>>> d = {
     r'Legal(\s|-|)Studies' : 'Legal Studies', 
     r'Sociology'           : 'Sociology', 
     r'Media(\s|-|)Studies' : 'Media Studies'
}
>>> string_fix(teststring,d)
'Legal Studies 131 race relations in the U S'
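As a rough sketch of how the same loop might cover all five test strings from the question: the version below title-cases the cleaned name first (as the question's one-liner does) and then applies the dictionary. The names `SUBSTITUTIONS`, `fix_name` and `fix_name_lambda` are made up for illustration, `[\s-]?` is just a shorter spelling of `(\s|-|)`, and the `\b` word boundaries are my own assumption to keep short tokens like `dts` from matching inside longer words.

```python
import re
from functools import reduce

# Pattern -> canonical spelling. Extend this dict as new cases turn up.
SUBSTITUTIONS = {
    r'Legal[\s-]?Studies': 'Legal Studies',
    r'Media[\s-]?Studies': 'Media Studies',
    r'Blu[\s-]?ray': 'Blu-ray',
    r'HD[\s-]?DVD': 'HD DVD',
    r'\bunix\b': 'UNIX',
    r'\bdts\b': 'DTS',
    r'\bxvid\b': 'XviD',
    r'\baac[\s-]?2[\s.-]?0\b': 'AAC2.0',
}

def fix_name(text, substitutions=SUBSTITUTIONS):
    # Dots and underscores become spaces, then everything is title-cased.
    cleaned = re.sub(r'[._]+', ' ', text).strip().title()
    # Each pattern then restores the spelling that title-casing got wrong.
    for pattern, replacement in substitutions.items():
        cleaned = re.sub(pattern, replacement, cleaned, flags=re.IGNORECASE)
    return cleaned

# The same loop folded into one expression, for those who prefer a lambda.
fix_name_lambda = lambda text: reduce(
    lambda acc, item: re.sub(item[0], item[1], acc, flags=re.IGNORECASE),
    SUBSTITUTIONS.items(),
    re.sub(r'[._]+', ' ', text).strip().title(),
)

print(fix_name('mediastudies the triumph of bluray over hddvd'))
# Media Studies The Triumph Of Blu-ray Over HD DVD
print(fix_name_lambda('Perception - Cue Integration - flashing dot example aac20 xvid'))
# Perception - Cue Integration - Flashing Dot Example AAC2.0 XviD
```

The substitutions run in dictionary order, so if two patterns could ever overlap, the earlier entry gets the first shot at the text.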

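And here is a way of doing it without a dictionary, using a replacement function that title-cases whatever matched (so it only suits terms whose correct form is plain title case):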

>>> teststring = 'legal.studies.131.race.relations.in.the.U.S.'
>>> def repl(match):
        # Rejoin the words of the match with single spaces and title-case them.
        return ' '.join(re.findall(r'\w+', match.group())).title()

>>> re.sub(r'Legal(\s|-|)Studies|Sociology|Media(\s|-|)Studies',repl,teststring.replace('.',' ').strip(),flags=re.IGNORECASE)
'Legal Studies 131 race relations in the U S'
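The single-pass idea can also be combined with a lookup table, so that spellings like 'UNIX' or 'Blu-ray' come out right. The following is only a sketch under a few assumptions: `RULES`, `COMBINED`, `canonical` and `fix_once` are hypothetical names, and each pattern must not contain capturing groups of its own, because the callback relies on `match.lastindex` to tell which alternative matched.

```python
import re

# (pattern, canonical spelling) pairs. Patterns must not contain their own
# capturing groups, so that group N always corresponds to rule N.
RULES = [
    (r'legal[\s-]?studies', 'Legal Studies'),
    (r'media[\s-]?studies', 'Media Studies'),
    (r'blu[\s-]?ray', 'Blu-ray'),
    (r'hd[\s-]?dvd', 'HD DVD'),
    (r'unix', 'UNIX'),
]

# One alternation with exactly one capturing group per rule.
COMBINED = re.compile(
    '|'.join('(%s)' % pattern for pattern, _ in RULES),
    flags=re.IGNORECASE,
)

def canonical(match):
    # lastindex is 1-based and names the group (hence the rule) that matched.
    return RULES[match.lastindex - 1][1]

def fix_once(text):
    # A single pass over the text; every match is replaced by its rule's spelling.
    return COMBINED.sub(canonical, text.replace('.', ' ').strip())

print(fix_once('legal.studies.131.race.relations.in.the.U.S.'))
# Legal Studies 131 race relations in the U S
print(fix_once('mediastudies the triumph of bluray over hddvd'))
# Media Studies the triumph of Blu-ray over HD DVD
```

Unlike the loop over a dictionary, this scans the string once, and text that has already been replaced is never matched again by a later rule.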
