
Group items by string pattern in python

Suppose I have this list:

list1=["House of Mine (1293) Item 21",
       "House of Mine (1292) Item 24",
       "The yard (1000) Item 1 ",
       "The yard (1000) Item 2 ",
       "The yard (1000) Item 4 "]

I want to add each item to a group (a list inside a list, in this case) if the substring up to the (XXXX) is the same.

So, in this case, I am expecting to have:

[["House of Mine (1293) Item 21",
  "House of Mine (1292) Item 24"],

 ["The yard (1000) Item 1 ",
  "The yard (1000) Item 2 ",
  "The yard (1000) Item 4 "]]

The following code is what I was able to make, but it's not working:

def group(list1):
    group=[]
    for i, itemg in enumerate(list1):
        try:
            group[i]
        except Exception:
            group.append([])
        for itemj in group[i]:
            if re.findall(re.split("\(\d{4}\)\(", itemg)[0], itemj):
                group[i].append(itemg)
            else:
                group.append([])
                group[-1].append(itemg)

    return group

Thanks to another topic on Stack Overflow, I've read this page on regular expressions: http://www.diveintopython3.net/regular-expressions.html

I know the answer lies in it, but I'm having difficulty understanding some of its concepts.

BrunoSXS asked Jan 22 '26 10:01


1 Answer

Set up the list to group:

>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]

Define a key function, used to sort and group items (this time using the number in parentheses):

>>> keyf = lambda text: text.split("(")[1].split(")")[0]
>>> keyf
<function __main__.<lambda>>
>>> keyf(list1[0])
'1293'

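Sort the list in place, using the same key, so that items sharing a key end up next to each other:

>>> list1.sort(key=keyf)
>>> list1
['The yard (1000) Item 1 ',
 'The yard (1000) Item 2 ',
 'The yard (1000) Item 4 ',
 'House of Mine (1292) Item 24',
 'House of Mine (1293) Item 21']

As Adam Smith noted, a plain alphabetical list1.sort() is also good enough for this data; sorting by the key is what produces the order shown here.

Take groupby from itertools:

>>> from itertools import groupby

Check the concept:

>>> for gr, items in groupby(list1, key=keyf):
...     print("gr", gr)
...     print("items", list(items))
...
gr 1000
items ['The yard (1000) Item 1 ', 'The yard (1000) Item 2 ', 'The yard (1000) Item 4 ']
gr 1292
items ['House of Mine (1292) Item 24']
gr 1293
items ['House of Mine (1293) Item 21']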

Note that we had to call list on items, because groupby yields each group as an iterator, not as a list.
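The sort step matters because groupby only merges neighbouring items with equal keys. The following is a minimal sketch of that behaviour; the sample list `unsorted_items` is made up for illustration and is not part of the question's data.

```python
from itertools import groupby

def keyf(text):
    # Same key as above: the text between the first "(" and the next ")".
    return text.split("(")[1].split(")")[0]

# Hypothetical sample, deliberately not sorted by key.
unsorted_items = ["A (1000) x", "B (2000) y", "C (1000) z"]

# Without sorting, the key '1000' shows up as two separate groups.
split_groups = [(k, list(g)) for k, g in groupby(unsorted_items, key=keyf)]

# Sorting by the same key first puts equal keys next to each other.
merged_groups = [
    (k, list(g))
    for k, g in groupby(sorted(unsorted_items, key=keyf), key=keyf)
]

print(split_groups)
print(merged_groups)
```

Because Python's sort is stable, items within each merged group keep their original relative order.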

Now using list comprehension:

>>> res = [list(items) for gr, items in groupby(list1, key=keyf)]
>>> res
[['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 '],
 ['House of Mine (1292) Item 24'],
 ['House of Mine (1293) Item 21']]

and we are done.

If you want to group by all the text before the first "(", the only change is to the key function:

>>> keyf = lambda text: text.split("(")[0]

Short version answering OP

>>> from itertools import groupby
>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]
>>> keyf = lambda text: text.split("(")[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]

Variation using re.findall

The solution above assumes that "(" is the delimiter and ignores the requirement that four digits follow it. That requirement can be handled with re.

>>> import re
>>> keyf = lambda text: re.findall(r".+(?=\(\d{4}\))", text)[0]
>>> text = 'House of Mine (1293) Item 21'
>>> keyf(text)
'House of Mine '

But it raises IndexError: list index out of range if the text does not have the expected content (we are trying to access the item at index 0 of an empty list).

>>> text = "nothing here"
>>> keyf(text)
Traceback (most recent call last):
  ...
IndexError: list index out of range

A simple trick avoids this: append the original text to the list of matches, so that there is always something at index 0:

>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> text = "nothing here"
>>> keyf(text)
'nothing here'
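If the lambda with the appended fallback feels too terse, the same key can be spelled out as a named function. This is only a sketch: the names `PREFIX_BEFORE_NUMBER` and `prefix_key` are made up here, and it relies on re.search returning the same first match that re.findall(...)[0] would.

```python
import re

# Same pattern as above: everything up to a four-digit number in parentheses.
PREFIX_BEFORE_NUMBER = re.compile(r".+(?=\(\d{4}\))")

def prefix_key(text):
    """Return the text before "(dddd)", or the whole text if there is none."""
    match = PREFIX_BEFORE_NUMBER.search(text)
    if match is None:
        # No "(dddd)" found: fall back to the full text, like the lambda does.
        return text
    return match.group(0)

print(repr(prefix_key("House of Mine (1293) Item 21")))
print(repr(prefix_key("nothing here")))
```

It can be passed as key=prefix_key to both sorted and groupby in place of keyf.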

Final solution using re

>>> import re
>>> from itertools import groupby
>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]
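Note that sorting changes the order of the items inside each group. If the original order should be kept, as in the expected output in the question, one possible alternative is to collect the groups in a dictionary instead of using groupby. This is a different technique from the one above, sketched under the assumption of Python 3.7+ (where dictionaries keep insertion order); the function name `group_by_prefix` is made up for illustration.

```python
import re
from collections import defaultdict

def group_by_prefix(items):
    """Group items by the text before "(dddd)", keeping the input order."""
    groups = defaultdict(list)
    for item in items:
        match = re.search(r".+(?=\(\d{4}\))", item)
        # Fall back to the whole item when there is no "(dddd)" part.
        key = match.group(0) if match else item
        groups[key].append(item)
    return list(groups.values())

list1 = ["House of Mine (1293) Item 21",
         "House of Mine (1292) Item 24",
         "The yard (1000) Item 1 ",
         "The yard (1000) Item 2 ",
         "The yard (1000) Item 4 "]

result = group_by_prefix(list1)
print(result)
```

No sorting is needed here, because items are matched to their group by key rather than by position.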
Jan Vlcinsky answered Jan 23 '26 22:01