
Python extract tags from string by string array

Tags:

python

I am new to Python and am looking for help extracting tags from a string using an array of strings. Let's say I have the string array ['python', 'c#', 'java', 'f#']

and the input string "I love Java and python".

The output should be the array ['java', 'python']

Thanks for any help.

user2172306 asked Dec 07 '25 04:12



2 Answers

Tags that cannot be split on whitespace (e.g. multi-word tags)

Regex solution

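import re

stringarray = ['python', 'c#', 'core java', 'f#' ]
string = "I love Core Java and python"

# re.escape keeps tags containing regex metacharacters (e.g. 'c++') from breaking the pattern
pattern = '|'.join(re.escape(tag) for tag in stringarray)
output = re.findall(pattern, string.lower())
# ['core java', 'python']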

Non-regex solution

stringarray = ['python', 'c#', 'core java', 'f#' ]
string = "I love Core Java and python"
output = [i for i in stringarray if i in string.lower()]
# ['python', 'core java']  (results follow the order of stringarray, not of the sentence)
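Note that both snippets above match substrings, so a tag such as 'java' would also be found inside 'javascript'. If that matters, here is a minimal sketch that only accepts whole tokens. The function name extract_tags and the sample sentence are my own, made up for illustration; the lookarounds are used instead of \b because '#' is not a word character, so \b would not work at the end of 'c#'.

```python
import re

def extract_tags(text, tags):
    """Return the tags found in text as whole tokens, in order of appearance."""
    # Longest tags first, so 'core java' is tried before 'java' when both are listed.
    ordered = sorted(tags, key=len, reverse=True)
    # re.escape guards against regex metacharacters such as '+' in 'c++'.
    alternatives = '|'.join(re.escape(t) for t in ordered)
    # (?<!\w) and (?!\w) reject matches that are glued to a letter, digit or underscore.
    pattern = re.compile(r'(?<!\w)(?:' + alternatives + r')(?!\w)')
    return pattern.findall(text.lower())

print(extract_tags("I love Core Java, C# and python, not javascript",
                   ['python', 'c#', 'core java', 'java', 'f#']))
# ['core java', 'c#', 'python']
```

Here 'javascript' is skipped because the 'java' inside it is followed by a word character.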

Tags that can be split on whitespace or another character (quicker!)

Using set and intersection

stringarray = ['python', 'c#', 'java', 'f#' ]
string = "I love Java and python"

output = list(set(string.lower().split()).intersection(stringarray))
# ['java', 'python']  (sets are unordered, so the order may differ)

Short explanation: string.lower().split() lower-cases the input string and splits it on whitespace (the default separator). Converting the result to a set gives access to the set method intersection, which returns the elements present in both collections. Finally we wrap the result in a list to get the desired output. As Joe Iddon commented, this will not return repeated tags, and because sets are unordered, the order of the result is not guaranteed.
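If you need the tags in the order they appear in the sentence, or the sentence contains punctuation ("Java," would not equal 'java' after a plain split), a loop with a set lookup may be a better fit. This is only a sketch: the function name tags_in_order and the choice of which punctuation to strip are my own assumptions, not part of the question.

```python
import string

def tags_in_order(text, tags):
    """Return each matching tag once, in order of first appearance in text."""
    tag_set = set(tags)   # set membership tests are fast
    seen = set()
    result = []
    # Strip surrounding punctuation, but keep '#' and '+' since tags like 'c#' need them.
    strip_chars = string.punctuation.replace('#', '').replace('+', '')
    for word in text.lower().split():
        word = word.strip(strip_chars)
        if word in tag_set and word not in seen:
            seen.add(word)
            result.append(word)
    return result

print(tags_in_order("I love Java, python and Java!", ['python', 'c#', 'java', 'f#']))
# ['java', 'python']
```

Like the intersection approach, this only works for tags that are a single whitespace-separated word.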

Counts

Are you interested in counts? Consider using collections.Counter and a dict comprehension (this reuses string and stringarray from the snippet above):

from collections import Counter

count = {k:v for k,v in Counter(string.lower().split()).items() if k in stringarray}
print(count)
#{'java': 1, 'python': 1}
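The Counter snippet relies on split(), so it cannot count multi-word tags such as 'core java'. A possible variation, again only a sketch with a made-up function name (count_tags) and sample sentence, feeds regex matches into Counter instead:

```python
import re
from collections import Counter

def count_tags(text, tags):
    """Count whole-token occurrences of each tag, including multi-word tags."""
    # Try longer tags first and escape regex metacharacters in each tag.
    alternatives = '|'.join(re.escape(t) for t in sorted(tags, key=len, reverse=True))
    # Lookarounds keep a tag from matching inside a longer word.
    pattern = re.compile(r'(?<!\w)(?:' + alternatives + r')(?!\w)')
    return dict(Counter(pattern.findall(text.lower())))

counts = count_tags("I love Core Java and python, and python loves core java",
                    ['python', 'c#', 'core java', 'f#'])
print(counts)
# {'core java': 2, 'python': 2}
```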
Anton vBR answered Dec 08 '25 22:12



You could use the following list comprehension, which converts your string to lowercase, iterates over each word (after using split), and keeps the words that are in your array:

arr = ['python', 'c#', 'java', 'f#' ]
s = "I love Java and python"

outp = [i for i in s.lower().split() if i in arr]

>>> outp
['java', 'python']

Or you could use regular expressions:

import re

arr = ['python', 'c#', 'java', 'f#' ]
s = "I love Java and python"

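# re.escape keeps tags containing regex metacharacters (e.g. 'c++') from breaking the pattern
outp = re.findall('|'.join(map(re.escape, arr)), s.lower())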

>>> outp 
['java', 'python']
sacuL answered Dec 08 '25 22:12
