Python: Removing list duplicates based on first 2 inner list values

Question:

I have a list in the following format:

x = [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]

The algorithm:

Combine all inner lists that share the same first two values; the third value doesn't have to match for them to be combined.
- e.g. "hello",0,5 is combined with "hello",0,8
- but not with "hello",1,1
The third value becomes the average of the third values: sum(all 3rd vals) / len(all 3rd vals)
- Note: "all 3rd vals" means the third value of each inner list in a group of duplicates
- e.g. "hello",0,5 and "hello",0,8 become "hello",0,6.5

Desired output: (Order of list doesn't matter)

x = [["hello",0,6.5], ["hi",0,6], ["hello",1,1]]

Question:

How can I implement this algorithm in Python?

Ideally it would be efficient as this will be used on very large lists.

If anything is unclear let me know and I will explain.

Edit: I have tried converting the list to a set to remove duplicates; however, a set only drops entries that are identical in all three values, so it doesn't merge entries whose third value differs and therefore doesn't work here.
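To illustrate why the set approach falls short, here is a minimal sketch using the sample list from the question. The name `as_set` is only for illustration, and the conversion to tuples is an assumption about how the attempt was made, since lists themselves can't be placed in a set.

```python
x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]

# Inner lists are unhashable, so set(x) raises TypeError.
try:
    set(x)
    raised = False
except TypeError:
    raised = True

# Converting each inner list to a tuple makes it hashable, but a set
# only removes entries that are equal in all three positions.
as_set = set(tuple(sub) for sub in x)

print(raised)       # -> True
print(len(as_set))  # -> 4 (nothing was merged)
```

Since ("hello", 0, 5) and ("hello", 0, 8) differ in the third position, both survive, which is why grouping on the first two values is needed instead.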

Solution Performance:

Thanks to everyone who has provided a solution to this problem! Here are the results based on a speed test of all the functions:

[Figure: Performance Data]

wjandrea · Accepted Answer

Update using running sum and count

I figured out how to improve my previous code (see original below). You can keep running totals and counts, then compute the averages at the end, which avoids recording all the individual numbers.

from collections import defaultdict

class RunningAverage:
    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def calculate(self):
        return self.total / self.count

def func(lst):
    thirds = defaultdict(RunningAverage)
    for sub in lst:
        k = tuple(sub[:2])
        thirds[k].add(sub[2])
    lst_out = [[*k, v.calculate()] for k, v in thirds.items()]
    return lst_out

print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
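If the small class feels like more than you need, the same running-sum-and-count idea can probably be written with a plain dict. This is just a sketch of a variation, not a replacement for the code above; `average_by_key` and `totals` are names made up for this example, and it assumes every inner list has exactly three items.

```python
def average_by_key(rows):
    """Average the third value of rows that share the same first two values."""
    totals = {}  # (first, second) -> [running sum, count]
    for first, second, value in rows:
        entry = totals.setdefault((first, second), [0, 0])
        entry[0] += value  # running sum
        entry[1] += 1      # running count
    # Compute each average once, after all rows have been seen.
    return [[*key, total / count] for key, (total, count) in totals.items()]

x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]
print(average_by_key(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
```

Like the class-based version, this makes a single pass over the input and stores only two numbers per distinct key.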

Original answer

This probably won't be very efficient since it has to accumulate all the values to average them. I think you could get around that by having a running average with a weighting factored in, but I'm not quite sure how to do that.

from collections import defaultdict

def avg(nums):
    return sum(nums) / len(nums)

def func(lst):
    thirds = defaultdict(list)
    for sub in lst:
        k = tuple(sub[:2])
        thirds[k].append(sub[2])
    lst_out = [[*k, avg(v)] for k, v in thirds.items()]
    return lst_out

print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
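For what it's worth, the "running average with a weighting" idea mentioned above can be sketched as an incremental mean: each new value moves the current mean by (value - mean) / count. This is one common way to do it, offered as an illustration rather than as what the answer ended up using; `func_incremental` and `means` are names made up for this sketch.

```python
def func_incremental(lst):
    means = {}  # (first, second) -> [current mean, count]
    for sub in lst:
        k = tuple(sub[:2])
        state = means.setdefault(k, [0.0, 0])
        state[1] += 1
        # Shift the mean toward the new value, weighted by 1/count.
        state[0] += (sub[2] - state[0]) / state[1]
    return [[*k, mean] for k, (mean, _count) in means.items()]

x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]
print(func_incremental(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
```

Because the mean is updated with a floating-point division on every step, results on other inputs may differ from sum / len in the last few digits.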
