Identifying groups of similar numbers in a list

Question

I have lists of numbers that I'd like to group by similarity. The order of the numbers in the list is fixed and important to preserve.

As an example, here's a visualisation of what I'm trying to achieve:

The black line represents the list of numbers I have. The green lines represent the groupings I would like to identify in this example list.

The list cannot be reordered (for example, it cannot be sorted in ascending or descending order). The values within a group are rarely identical: rather than 6, 6, 6, 6, a group is more likely to look like 5.85, 6.1, 5.96, 5.88.

Is there a method to do this?

Edit: example values, and desired groupings:

[4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]

would result in an approximate grouping of

[(4.1, 4.05, 4.14, 4.01, 3.97, 4.52), (4.97, 5.02, 5.05, 5.2, 5.18), (3.66, 3.77, 3.59, 3.72)]

In the grouping above, you could argue that 4.52 could belong to the first or second group. If visualised as I did in the example above, the groupings would be represented by the green lines. My lists are actually several hundred to several thousand values in length.

awesoon · Accepted Answer

You can use itertools.groupby: it groups consecutive elements for which the key function (round in this case) returns the same value:

In [7]: import itertools

In [8]: data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]

In [9]: [tuple(xs) for _, xs in itertools.groupby(data, round)]
Out[9]: 
[(4.1, 4.05, 4.14, 4.01, 3.97),
 (4.52, 4.97, 5.02, 5.05, 5.2, 5.18),
 (3.66, 3.77, 3.59, 3.72)]
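
As a rough sketch of how the key function controls the grouping, the variant below rounds each value to the nearest multiple of a bin width instead of to the nearest integer. The helper name group_by_bin and its width parameter are illustrative assumptions, not part of itertools.

```python
import itertools

def group_by_bin(data, width=1.0):
    """Group consecutive values that round to the same multiple of width.

    group_by_bin and width are hypothetical names used for illustration.
    """
    # The key maps each value to the index of its nearest multiple of width.
    def key(x):
        return round(x / width)
    return [tuple(xs) for _, xs in itertools.groupby(data, key)]

data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18,
        3.66, 3.77, 3.59, 3.72]

coarse = group_by_bin(data, 1.0)  # same key as plain round
fine = group_by_bin(data, 0.5)    # narrower bins
print(coarse)
print(fine)
```

With a width of 1.0 this reproduces the grouping above. With a width of 0.5 the last run is split, because 3.77 falls on the other side of a bin boundary from 3.66 and 3.59. Fixed bin boundaries can separate values that are close together, so the width needs to suit the data.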

Gary van der Merwe · Answer

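from statistics import mean

def ordered_cluster(data, max_diff):
    """Yield consecutive runs whose items all lie within max_diff of the run's mean."""
    current_group = ()
    for item in data:
        # Tentatively extend the current group and recompute its mean.
        test_group = current_group + (item, )
        test_group_mean = mean(test_group)
        if all(abs(test_group_mean - test_item) < max_diff for test_item in test_group):
            current_group = test_group
        else:
            # Adding the item breaks the rule: emit the group and start a new one.
            # The guard avoids yielding an empty tuple when max_diff <= 0.
            if current_group:
                yield current_group
            current_group = (item, )
    # Emit the final group, if any.
    if current_group:
        yield current_group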

data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]

print(list(ordered_cluster(data, 0.5)))

Output:

[(4.1, 4.05, 4.14, 4.01, 3.97, 4.52), (4.97, 5.02, 5.05, 5.2, 5.18), (3.66, 3.77, 3.59, 3.72)]

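This ensures that every item in a group differs from the group's mean by less than max_diff. When adding the next item would break that condition for any member of the group, the current group is closed and a new one is started with that item.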
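
For lists of several thousand values, recomputing the mean and re-checking every member on each step can become slow when groups are long. Checking every member against the mean is equivalent to checking only the group's smallest and largest values, so the same rule can be tracked with a running sum, minimum and maximum. The following is a minimal sketch under that assumption. The name ordered_cluster_running is hypothetical, and it uses plain float arithmetic rather than statistics.mean, so values sitting exactly on the threshold might be grouped differently.

```python
def ordered_cluster_running(data, max_diff):
    """Sketch: same rule as ordered_cluster, tracked with running totals."""
    group = []
    total = 0.0
    lo = hi = None
    for item in data:
        # Statistics of the group if the item were added.
        new_total = total + item
        new_lo = item if lo is None else min(lo, item)
        new_hi = item if hi is None else max(hi, item)
        new_mean = new_total / (len(group) + 1)
        # Only the extremes can be farthest from the mean.
        if new_hi - new_mean < max_diff and new_mean - new_lo < max_diff:
            group.append(item)
            total, lo, hi = new_total, new_lo, new_hi
        else:
            if group:
                yield tuple(group)
            # Start a new group containing only the current item.
            group = [item]
            total, lo, hi = item, item, item
    if group:
        yield tuple(group)

data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18,
        3.66, 3.77, 3.59, 3.72]

result = list(ordered_cluster_running(data, 0.5))
print(result)
```

On the example data with max_diff set to 0.5 this yields the same three groups as ordered_cluster.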
