I'm creating a program to compare audio files that uses an algorithm similar to the one described here: http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf. I am plotting the times of matches between the two songs being compared and fitting a least-squares regression line to the plot. Here is an example plot for matching files: http://imgur.com/fGu7jhX&yOeMSK0. The plot is too messy, and the least-squares regression line does not produce a high correlation coefficient even though there is an obvious line in the graph. What other algorithm can I use to recognize this line?
This is an interesting question, but it's been pretty quiet. Maybe this answer will trigger some more activity.
For identifying lines with arbitrary slopes and intercepts within a collection of points, the Hough transform would be a good place to start. For your audio application, however, it looks like the slope should always be 1, so you don't need the full generality of the Hough transform.
Instead, you can think of the problem as one of clustering the differences x - y, where x and y are the vectors holding the x and y coordinates of the points.
One approach would be to compute a histogram of x - y. Points that are close to lying on the same line with slope 1 will have differences that fall in the same histogram bin, so the bin with the largest count corresponds to the largest collection of points that are approximately aligned. An issue with this approach is choosing the boundaries of the histogram bins: a bad choice could split points that belong together into neighboring bins.
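As a rough sketch of the histogram idea with NumPy (the function name largest_diag_bin and the default bin width are my own choices, and the bin width would need tuning for real data):

import numpy as np

def largest_diag_bin(x, y, bin_width=1.0):
    # Assign each difference d = x - y to a bin of the given width and
    # return the indices of the points in the most heavily populated bin.
    # Points near a common slope-1 line have nearly equal d.
    d = x - y
    bin_index = np.floor((d - d.min()) / bin_width).astype(int)
    counts = np.bincount(bin_index)
    k = counts.argmax()
    return np.where(bin_index == k)[0]

The returned indices select the points in the best-populated diagonal band, subject to the bin-boundary issue just described.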
A simple brute-force approach is to imagine a diagonal window with a given width sliding left to right across the (x, y) plane. The best candidate for a line corresponds to the position of the window that contains the most points. This is similar to a histogram of x - y, but instead of a collection of disjoint bins there are overlapping bins, one per point: all the bins have the same width, and each point determines the left edge of one bin.
The function count_diag_groups in the code below does that computation. For each point, it finds the points that lie in the diagonal window whose left edge is at that point. The best candidate for a line is the window containing the most points. Running the script produces a two-panel figure: the top panel is a scatter plot of the data, and the bottom panel is the same scatter plot with the best candidate points highlighted.
A nice feature of this method is that there is only one parameter, the window width. A not-so-nice feature is that its time complexity is O(n**2), where n is the number of points. There are surely algorithms with better time complexity that could do something similar; the article that you link to discusses this, and a sketch of one possibility follows the script below. To judge the quality of an alternative, however, you will need a more concrete specification of how "good" or robust the line identification must be.
import numpy as np
import matplotlib.pyplot as plt
def count_diag_groups(x, y, width):
    """
    Returns a list of arrays. The length of the list is the same
    as the length of x. The k-th array holds the indices into x
    (and y) of a set of points that are in a "diagonal" window with
    the given width whose left edge includes the point (x[k], y[k]).
    """
    d = x - y
    result = []
    for i in range(d.size):
        delta = d - d[i]
        neighbors = np.where((delta >= 0) & (delta <= width))[0]
        result.append(neighbors)
    return result
def generate_demo_data():
    # Generate some data.
    np.random.seed(123)
    xmin = 0
    xmax = 100
    ymin = 0
    ymax = 25
    # Uniformly distributed background points.
    nrnd = 175
    xrnd = xmin + (xmax - xmin)*np.random.rand(nrnd)
    yrnd = ymin + (ymax - ymin)*np.random.rand(nrnd)
    # Points that lie approximately on a line with slope 1.
    n = 25
    xx = xmin + 0.1*(xmax - xmin) + ymax*np.random.rand(n)
    yy = (xx - xx.min()) + 0.2*np.random.randn(n)
    x = np.concatenate((xrnd, xx))
    y = np.concatenate((yrnd, yy))
    return x, y
def plot_result(x, y, width, selection):
    xmin = x.min()
    xmax = x.max()
    ymin = y.min()
    ymax = y.max()
    xsel = x[selection]
    ysel = y[selection]
    # Plot...
    plt.figure(1)
    plt.clf()
    ax = plt.subplot(2, 1, 1)
    plt.plot(x, y, 'o', mfc='b', mec='b', alpha=0.5)
    plt.xlim(xmin - 1, xmax + 1)
    plt.ylim(ymin - 1, ymax + 1)
    plt.subplot(2, 1, 2, sharex=ax, sharey=ax)
    plt.plot(x, y, 'o', mfc='b', mec='b', alpha=0.5)
    plt.plot(xsel, ysel, 'o', mfc='w', mec='w')
    plt.plot(xsel, ysel, 'o', mfc='r', mec='r', alpha=0.65)
    # Draw the edges of the selected diagonal window. Its left edge
    # corresponds to the smallest difference x - y among the selected points.
    d_left = (xsel - ysel).min()
    xi = np.array([xmin, xmax])
    yi1 = xi - d_left
    yi2 = yi1 - width
    plt.plot(xi, yi1, 'r-', alpha=0.25)
    plt.plot(xi, yi2, 'r-', alpha=0.25)
    plt.xlim(xmin - 1, xmax + 1)
    plt.ylim(ymin - 1, ymax + 1)
    plt.show()
if __name__ == "__main__":
x, y = generate_demo_data()
# Find a selection of points that are close to being aligned
# with a slope of 1.
width = 0.75
r = count_diag_groups(x, y, width)
# Find the largest group.
sz = np.array(list(len(f) for f in r))
imax = sz.argmax()
# k holds the indices of the selected points.
selection = r[imax]
plot_result(x, y, width, selection)
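As mentioned above, something faster than the O(n**2) loop is possible. Here is a rough sketch of one O(n log n) alternative (my own variant, not taken from the linked paper, and reusing the numpy import from the script): sort the differences once, then for each candidate left edge use a binary search to count how many differences fall inside the window. Unlike count_diag_groups, it returns only the best group rather than all of them.

def count_diag_groups_sorted(x, y, width):
    # O(n log n) variant: sort d = x - y, then for each candidate left
    # edge ds[i] count, by binary search, the differences that lie in
    # [ds[i], ds[i] + width].
    d = x - y
    order = d.argsort()
    ds = d[order]
    right = np.searchsorted(ds, ds + width, side='right')
    counts = right - np.arange(ds.size)
    ibest = counts.argmax()
    # Map the sorted positions of the best window back to indices into x and y.
    return order[ibest:right[ibest]]

With the demo data, selection = count_diag_groups_sorted(x, y, 0.75) should pick out the same group of points as the brute-force version (up to ties).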
This looks like an excellent example of a task for random sample consensus (RANSAC). The Wikipedia article even uses your problem as an example!
The rough outline is something like this:
1. Randomly pick a minimal subset of the points (two points for a general line; here, with the slope fixed at 1, a single point is enough).
2. Fit the model to that subset (here, the intercept of a slope-1 line).
3. Count how many of the remaining points lie within some tolerance of that model; these are the inliers.
4. Repeat for a fixed number of iterations and keep the model with the most inliers, optionally refitting it to all of its inliers at the end.
Check the Wikipedia article for more information.
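To make that concrete for this problem, here is a minimal sketch, assuming the slope is fixed at 1 so a single randomly chosen point determines a candidate line and the only model parameter is the intercept; the threshold, iteration count, and function name are my own choices and would need tuning:

import numpy as np

def ransac_slope1(x, y, threshold=0.5, n_iter=500, seed=None):
    # RANSAC restricted to lines y = x + b: pick one point at random, use
    # its intercept b = y - x as the candidate model, count the inliers
    # whose intercepts lie within `threshold` of it, and keep the best model.
    rng = np.random.default_rng(seed)
    b_all = y - x
    best_inliers = np.array([], dtype=int)
    for _ in range(n_iter):
        b = b_all[rng.integers(b_all.size)]
        inliers = np.where(np.abs(b_all - b) <= threshold)[0]
        if inliers.size > best_inliers.size:
            best_inliers = inliers
    # With the slope fixed at 1, the least-squares intercept over the
    # inliers is simply the mean of their y - x values.
    return b_all[best_inliers].mean(), best_inliers

The returned intercept and inlier indices play the same roles as the fitted line and the selection in the answer above.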