
Removing duplicate sets of items in a sequence

I have a list, for example data = [0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6], and I need to remove sets of items from it (maximum set length k = 3), but only when the sets immediately follow each other. data contains three such cases: [4, 4], [5, 8, 5, 8], and [1, 5, 6, 1, 5, 6], so the cleaned-up list should look like [0, 4, 2, 5, 8, 7, 1, 5, 6].

I tried this code and it works:

import numpy as np

data = np.array([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6])
for k in range(1, 3):
    kth_difference = data[k:] - data[:-k]
    ids = np.where(kth_difference)
    data = data[ids]

But if I change the input list to something like data = [0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5] (breaking the last set), the new output list is [0, 4, 2, 5, 8, 7, 1, 5], which has lost the 6 at the end.

What is a correct solution for this task, and how can it be made to work for any k?

asked Dec 01 '25 by Plaeryin_Bol

1 Answer

You added a numpy tag, so let's use that to our advantage. Start with an array:

import numpy as np

data = np.array([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6])

It's easy to make a mask of elements that repeat at a lag of k, for each k up to 3:

mask_1 = data[1:] == data[:-1]
mask_2 = data[2:] == data[:-2]
mask_3 = data[3:] == data[:-3]
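For the sample data above, the three masks pick out exactly the second copy of each repeated group (a quick check of my own, not part of the original answer):

```python
import numpy as np

data = np.array([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6])

# Lag-k masks: True where an element equals the one k positions earlier
mask_1 = data[1:] == data[:-1]
mask_2 = data[2:] == data[:-2]
mask_3 = data[3:] == data[:-3]

print(np.flatnonzero(mask_1))  # [1]        -> the second 4
print(np.flatnonzero(mask_2))  # [4 5]      -> the second 5, 8
print(np.flatnonzero(mask_3))  # [ 9 10 11] -> the second 1, 5, 6
```

Note that the printed indices are positions within the lagged array data[k:], so they point at elements k positions later in data itself.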

The first mask is True at each position where an element repeats the one immediately before it. The second mask is True wherever an element matches the one two positions back, so a duplicated pair shows up as a run of 2 consecutive True values; the same applies to the third mask, where a duplicated triple shows up as a run of 3. Filtering the mask also needs to take into account partial matches at the end of the array. You can effectively extend the mask with k - 1 elements to accomplish this:

delta = np.diff(np.r_[False, mask_3, np.ones(2, dtype=bool), False].view(np.int8))
edges = np.flatnonzero(delta).reshape(-1, 2)
lengths = edges[:, 1] - edges[:, 0]
delta[edges[lengths < 3, :]] = 0
mask = delta[:-3].cumsum(dtype=np.int8).view(bool)

In this arrangement, mask marks the duplicated elements that constitute a repeated group. If the repeated portion is truncated at the end of the array, the run may cover fewer than three elements, but thanks to the padding it still reaches the required length and gets flagged. The partial repeat is removed while the full first copy, including its trailing elements, is kept.
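To see the partial-match handling in action, here are the same steps run on the truncated input from the question (my own verification, standalone):

```python
import numpy as np

data = np.array([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5])  # last set broken
mask_3 = data[3:] == data[:-3]

# Pad with k - 1 = 2 True values so a truncated repeat still forms a long-enough run
delta = np.diff(np.r_[False, mask_3, np.ones(2, dtype=bool), False].view(np.int8))
edges = np.flatnonzero(delta).reshape(-1, 2)
lengths = edges[:, 1] - edges[:, 0]
delta[edges[lengths < 3, :]] = 0
mask = delta[:-3].cumsum(dtype=np.int8).view(bool)

# Shift by the lag of 3 to get positions in data: only the trailing partial
# repeat [1, 5] is flagged; the first full copy [1, 5, 6] survives
print(np.flatnonzero(mask) + 3)  # [12 13]
```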

For this exercise, I will assume that you don't have strange overlaps between different levels. I.e., each part of the array that belongs to a repeated segment belongs to exactly one possible repeated segment. Otherwise, the mask processing becomes much more complex.

Here is a function to wrap all this together:

def clean_mask(mask, k):
    delta = np.diff(np.r_[False, mask, np.ones(k - 1, bool), False].view(np.int8))
    edges = np.flatnonzero(delta).reshape(-1, 2)
    lengths = edges[:, 1] - edges[:, 0]
    delta[edges[lengths < k, :]] = 0
    return delta[:-k].cumsum(dtype=np.int8).view(bool)

def dedup(data, kmax):
    data = np.asarray(data)
    kmax = min(kmax, data.size // 2)

    remove = np.zeros(data.shape, dtype=bool)
    for k in range(kmax, 0, -1):
        remove[k:] |= clean_mask(data[k:] == data[:-k], k)
    return data[~remove]

Outputs for the two test cases you show in the question:

>>> dedup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6], 3)
array([0, 4, 2, 5, 8, 7, 1, 5, 6])

>>> dedup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5], 3)
array([0, 4, 2, 5, 8, 7, 1, 5, 6])

Timing

A quick benchmark shows that the numpy solution is also much faster than a pure-Python alternative (remdup, not shown here):

for n in range(2, 7):
    x = np.random.randint(0, 10, 10**n)
    y = list(x)
    %timeit dedup(x, 3)
    %timeit remdup(y)

Results:

# 100 elements
 dedup: 332 µs ± 5.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
remdup: 36.9 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# 1000 elements
 dedup: 412 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
remdup: > 1 minute

Caveats

This solution makes no attempt to cover corner cases. For example: data = [2, 2, 2, 2, 2, 2, 2] or similar, where multiple levels of k can overlap.
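For reference, this is what the functions above actually produce on that input (reproduced here, with plain bool instead of the removed np.bool alias, so the snippet is self-contained). Everything collapses to a single element, which may or may not be what you want:

```python
import numpy as np

def clean_mask(mask, k):
    # Pad with k - 1 True values so truncated repeats still form full runs
    delta = np.diff(np.r_[False, mask, np.ones(k - 1, bool), False].view(np.int8))
    edges = np.flatnonzero(delta).reshape(-1, 2)
    lengths = edges[:, 1] - edges[:, 0]
    delta[edges[lengths < k, :]] = 0          # discard runs shorter than k
    return delta[:-k].cumsum(dtype=np.int8).view(bool)

def dedup(data, kmax):
    data = np.asarray(data)
    kmax = min(kmax, data.size // 2)
    remove = np.zeros(data.shape, dtype=bool)  # np.bool is removed in NumPy >= 1.24
    for k in range(kmax, 0, -1):
        remove[k:] |= clean_mask(data[k:] == data[:-k], k)
    return data[~remove]

print(dedup([2, 2, 2, 2, 2, 2, 2], 3))  # [2]
```

Here every lag level flags almost the whole array, so only the first 2 survives; with overlapping levels like this, deciding which elements "should" remain is ambiguous and would need extra mask processing.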

answered Dec 04 '25 by Mad Physicist


