What is the fastest way to remove items in a list that contain any of the substrings in a set?
For example,
the_list =
['Donald John Trump (born June 14, 1946) is an American businessman, television personality',
'and since June 2015, a candidate for the Republican nomination for President of the United States in the 2016 election.',
'He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
'Trumps career',
'branding efforts',
'personal life',
'and outspoken manner have made him a celebrity.',
'Trump is a native of New York City and a son of Fred Trump, who inspired him to enter real estate development.',
'While still attending college he worked for his fathers firm',
'Elizabeth Trump & Son. Upon graduating in 1968 he joined the company',
'and in 1971 was given control, renaming the company The Trump Organization.',
'Since then he has built hotels',
'casinos',
'golf courses',
'and other properties',
'many of which bear his name. He is a major figure in the American business scene and has received prominent media exposure']
The list is actually a lot longer than this (millions of string elements), and I'd like to remove every element that contains any of the strings in the set, for example,
{"Donald Trump", "Trump Organization","Donald J. Trump", "D.J. Trump", "dump", "dd"}
What would be the fastest way? Is looping through the list the fastest?
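The Aho-Corasick algorithm was designed for exactly this task: searching text for many substrings at once. Its distinct advantage is a much lower time complexity than nested loops. Once the automaton is built, it runs in roughly O(n+m), where n is the total length of the strings to find and m is the total length of the strings to be searched (plus the number of matches reported). Nested loops, by contrast, perform one substring test for every pair of search string and list element, so the work grows with the product of the two counts rather than their sum.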
There is a good Python implementation of Aho-Corasick with accompanying explanation. There are also a couple of implementations at the Python Package Index but I've not looked at them.
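To make the idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick automaton used as a filter. It is not the implementation referred to above; the function names (`build_automaton`, `contains_any`, `remove_matching`) are made up for illustration, matching is case-sensitive, and empty search strings are skipped as an assumption. For millions of strings, a compiled implementation from the Python Package Index will probably be faster than this; the sketch is only meant to show how the approach works.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton as three lists indexed by node id."""
    goto = [{}]         # goto[node][char] -> child node
    fail = [0]          # failure link of each node
    terminal = [False]  # True if some pattern ends at this node or at a suffix of it
    for pattern in patterns:
        if not pattern:
            continue  # assumption: ignore empty patterns (they would match everything)
        node = 0
        for ch in pattern:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                terminal.append(False)
            node = nxt
        terminal[node] = True

    # Breadth-first pass: compute failure links level by level.
    queue = deque(goto[0].values())  # depth-1 nodes keep fail == 0
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # A node is also terminal if a pattern ends at its failure target.
            terminal[child] = terminal[child] or terminal[fail[child]]
            queue.append(child)
    return goto, fail, terminal


def contains_any(text, automaton):
    """Return True as soon as any pattern is found in text."""
    goto, fail, terminal = automaton
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        if terminal[node]:
            return True
    return False


def remove_matching(items, patterns):
    """Keep only the items that contain none of the patterns."""
    automaton = build_automaton(patterns)  # built once, reused for every item
    return [item for item in items if not contains_any(item, automaton)]


the_set = {"Donald Trump", "Trump Organization", "Donald J. Trump",
           "D.J. Trump", "dump", "dd"}
sample = [
    'He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
    'Trumps career',
    'branding efforts',
    'and in 1971 was given control, renaming the company The Trump Organization.',
    'golf courses',
]
print(remove_matching(sample, the_set))
# ['Trumps career', 'branding efforts', 'golf courses']
```

The automaton is built once from the set and then each list element is scanned a single time, stopping at the first hit, so the cost per element does not grow with the number of search strings.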