
Python: Removing list duplicates based on first 2 inner list values

Question:

I have a list in the following format:

x = [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]

The algorithm:

  • Combine all inner lists that share the same first 2 values; the third value doesn't have to match for them to be combined
    • e.g. "hello",0,5 is combined with "hello",0,8
    • But not combined with "hello",1,1
  • The 3rd value becomes the average of the third values: sum(all 3rd vals) / len(all 3rd vals)
    • Note: by all 3rd vals I am referring to the 3rd value of each inner list of duplicates
    • e.g. "hello",0,5 and "hello",0,8 become "hello",0,6.5

Desired output: (Order of list doesn't matter)

x = [["hello",0,6.5], ["hi",0,6], ["hello",1,1]]

Question:

  • How can I implement this algorithm in Python?

Ideally it would be efficient as this will be used on very large lists.

If anything is unclear let me know and I will explain.

Edit: I have tried converting the list to a set to remove duplicates, but this doesn't account for the third value in the inner lists, so it doesn't work.
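
To illustrate why the set approach falls short, here is a minimal sketch. It assumes the inner lists are converted to tuples first, since lists are unhashable and cannot be placed in a set directly; the names `as_tuples` and `deduped` are illustrative only.

```python
x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]

# Lists are unhashable, so a set of the raw inner lists cannot be built.
try:
    set(x)
    raised = False
except TypeError:
    raised = True

# Converting to tuples works, but a set only drops *exact* duplicates.
as_tuples = [tuple(sub) for sub in x]
deduped = set(as_tuples)

print(raised)        # True
print(len(deduped))  # 4: ("hello", 0, 5) and ("hello", 0, 8) are both kept
```

Because the third value is part of each tuple, entries that share only the first two values are treated as distinct, so nothing gets merged or averaged.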

Solution Performance:

Thanks to everyone who has provided a solution to this problem! Here are the results based on a speed test of all the functions:

[Image: Performance Data]

RulerOfTheWorld asked Mar 04 '26 23:03


1 Answer

Update using running sum and count

I figured out how to improve my previous code (see original below). You can keep running totals and counts, then compute the averages at the end, which avoids recording all the individual numbers.

from collections import defaultdict

class RunningAverage:
    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def calculate(self):
        return self.total / self.count

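def func(lst):
    # Map each (first, second) key to a running total and count of third values.
    thirds = defaultdict(RunningAverage)
    for sub in lst:
        k = tuple(sub[:2])  # lists are unhashable, so use a tuple as the dict key
        thirds[k].add(sub[2])
    # Rebuild the inner lists with the group's average as the third value.
    lst_out = [[*k, v.calculate()] for k, v in thirds.items()]
    return lst_out

x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]  # from the question
print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]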

Original answer

This probably won't be very efficient since it has to accumulate all the values to average them. I think you could get around that by having a running average with a weighting factored in, but I'm not quite sure how to do that.

from collections import defaultdict

def avg(nums):
    return sum(nums) / len(nums)

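def func(lst):
    # Map each (first, second) key to the list of all its third values.
    thirds = defaultdict(list)
    for sub in lst:
        k = tuple(sub[:2])  # lists are unhashable, so use a tuple as the dict key
        thirds[k].append(sub[2])
    # Rebuild the inner lists with the group's average as the third value.
    lst_out = [[*k, avg(v)] for k, v in thirds.items()]
    return lst_out

x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]  # from the question
print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]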
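
The "running average with a weighting factored in" mentioned in the original answer can be sketched with the standard incremental mean update, `mean += (value - mean) / count`. This is only an illustration, not part of the answer as posted: `func_incremental` is a hypothetical name, and the sketch assumes every inner list has exactly three elements.

```python
def func_incremental(lst):
    # Hypothetical helper. Maps (first, second) -> [current mean, count].
    means = {}
    for first, second, third in lst:
        state = means.setdefault((first, second), [0.0, 0])
        state[1] += 1
        # Each new value pulls the mean toward it with weight 1 / count.
        state[0] += (third - state[0]) / state[1]
    return [[first, second, mean] for (first, second), (mean, _) in means.items()]

x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]
result = func_incremental(x)
print(result)  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
```

Like the running sum and count version, this stores only two numbers per key. With arbitrary floats the result may differ from `sum(nums) / len(nums)` by a small rounding error, so the sum-and-count approach is probably the simpler choice unless the totals themselves risk growing too large.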

wjandrea answered Mar 06 '26 11:03


