I have lists of numbers that I'd like to group by similarity. The order of the numbers in the list is fixed and important to preserve.
As an example, here's a visualisation of what I'm trying to achieve:

The black line represents the list of numbers I have. The green lines represent the groupings I would like to identify in this example list.
The order of numbers in the list is important and cannot be changed (e.g. I cannot sort ascending or descending). The values within a group are similar rather than identical (i.e. there isn't likely to be a run of 6, 6, 6, 6, but there probably would be something like 5.85, 6.1, 5.96, 5.88).
Is there a method to do this?
Edit: example values, and desired groupings:
[4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]
would result in an approximate grouping of
[(4.1, 4.05, 4.14, 4.01, 3.97, 4.52), (4.97, 5.02, 5.05, 5.2, 5.18), (3.66, 3.77, 3.59, 3.72)]
In the grouping above, you could argue that 4.52 could belong to the first or second group. If visualised as I did in the example above, the groupings would be represented by the green lines. My lists are actually several hundred to several thousand values in length.
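You can use itertools.groupby: it groups consecutive elements for which the key function returns the same value (round in this case), so the original order is preserved. Note that with round the group boundaries are fixed at the .5 marks, which is why 4.52 lands in the second group here: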
In [7]: import itertools
In [8]: data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]
In [9]: [tuple(xs) for _, xs in itertools.groupby(data, round)]
Out[9]:
[(4.1, 4.05, 4.14, 4.01, 3.97),
 (4.52, 4.97, 5.02, 5.05, 5.2, 5.18),
 (3.66, 3.77, 3.59, 3.72)]
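If whole-number bins are too coarse or too fine for your data, the same idea can be sketched with an adjustable bin size. This is only an illustration: the helper name group_by_bin and its width parameter are made up for this example, and the bin edges are still fixed in advance rather than adapted to the data.

```python
import itertools


def group_by_bin(data, width=1.0):
    """Group consecutive values that round to the same multiple of `width`.

    `group_by_bin` and `width` are hypothetical names used for illustration.
    """
    def key(x):
        # Scale so that one bin corresponds to one integer, then round.
        return round(x / width)

    return [tuple(xs) for _, xs in itertools.groupby(data, key)]


data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18,
        3.66, 3.77, 3.59, 3.72]

# width=1.0 behaves like the plain round() key used above.
groups = group_by_bin(data, width=1.0)
print(groups)
```

A value that sits close to a bin edge can still split what looks like one group, so this works best when the levels in the data are well separated relative to width.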
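from statistics import mean

def ordered_cluster(data, max_diff):
    # Build groups left to right so the original order is preserved.
    current_group = ()
    for item in data:
        # Tentatively add the item and recompute the group's mean.
        test_group = current_group + (item, )
        test_group_mean = mean(test_group)
        # Accept the item only if every member stays within max_diff of the mean.
        if all(abs(test_group_mean - test_item) < max_diff for test_item in test_group):
            current_group = test_group
        else:
            # Otherwise emit the finished group and start a new one with this item.
            yield current_group
            current_group = (item, )
    # Emit the last group, if any.
    if current_group:
        yield current_group

data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]
print(list(ordered_cluster(data, 0.5)))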
Output:
[(4.1, 4.05, 4.14, 4.01, 3.97, 4.52), (4.97, 5.02, 5.05, 5.2, 5.18), (3.66, 3.77, 3.59, 3.72)]
This ensures that every item in a group is within max_diff of that group's mean. If adding the next item would break that condition, the current group is emitted and a new group is started with that item.
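For lists with thousands of values, recomputing the mean and re-checking every member on each step can get slow when groups are long. Below is a rough sketch of the same rule that keeps a running sum, minimum, and maximum instead; it relies on the fact that all members are within max_diff of the mean exactly when the smallest and largest members are. The name ordered_cluster_fast is made up for this example, max_diff is assumed to be positive, and because it uses a plain floating-point running sum rather than statistics.mean, results could differ for values sitting right on the threshold.

```python
def ordered_cluster_fast(data, max_diff):
    """Order-preserving grouping using running sum/min/max (illustrative sketch).

    Assumes max_diff > 0. `ordered_cluster_fast` is a hypothetical name.
    """
    group = []
    total = 0.0
    lo = hi = None
    for item in data:
        if group:
            # Statistics of the group if the item were added.
            new_total = total + item
            new_lo = min(lo, item)
            new_hi = max(hi, item)
            new_mean = new_total / (len(group) + 1)
            # Only the extremes need checking against the mean.
            if new_mean - new_lo < max_diff and new_hi - new_mean < max_diff:
                group.append(item)
                total, lo, hi = new_total, new_lo, new_hi
                continue
            # The item does not fit: emit the finished group.
            yield tuple(group)
        # Start a new group with the current item.
        group = [item]
        total = item
        lo = hi = item
    if group:
        yield tuple(group)


data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18,
        3.66, 3.77, 3.59, 3.72]
print(list(ordered_cluster_fast(data, 0.5)))
```

On the example data this produces the same three groups as the version above, with constant work per item instead of work proportional to the current group's length.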