Measure structure of xy points - python

I'm trying to measure the overall structure of a set of xy points in order to represent recurring particle formations. I'm hoping to take a pairwise approach and determine the structure from each point's position relative to its neighbouring points, rather than take an average of the raw Cartesian coordinates.

To achieve this, I want to calculate the vector between each point and the neighbouring points at every timestamp. The average of these vectors between each pair of points should then provide the overall structure.
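For concreteness, the "vector between each point and the neighbouring points" at a single timestamp can be computed with plain NumPy broadcasting (a minimal sketch; the coordinates here are made up for illustration):

```python
import numpy as np

# Hypothetical coordinates for one timestamp (one row per point)
pts = np.array([[1.0, 1.0],
                [3.0, 1.0],
                [2.0, 0.0]])

# vectors[i, j] is the displacement from point i to point j
vectors = pts[np.newaxis, :, :] - pts[:, np.newaxis, :]

print(vectors.shape)  # (3, 3, 2): one 2D vector per ordered pair
```

Stacking one such array per timestamp and averaging over the time axis gives the mean vector per pair, which is the quantity being described here.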

Note: the structure won't be identified correctly if the vectors are hard-coded between specific points. If points swap positions, or different points replace others while the overall structure stays the same, the end result won't be accurate. I'm hoping the function will be able to determine the overall structure based solely on the neighbouring points.

So the final spatial distribution should be determined with a pairwise approach: 1) set the centroid of the structure to the position of the point in the densest part of the structure, as determined by the average distance to the third-nearest neighbour; 2) identify the relative position of its nearest neighbour, then the relative position of that point's nearest neighbour, and so on, until the positions of all points have been determined.
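One reading of the density criterion in step 1 can be sketched directly from a distance matrix: for each point, take the distance to its third-nearest neighbour, and pick the point for which that distance is smallest. This is a minimal sketch with made-up coordinates (the code below uses a KDE for this step instead):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical coordinates: one point per row
pts = np.array([[1.0, 1.0], [3.0, 1.0], [2.0, 0.0], [2.0, 1.0], [2.0, 2.0]])
dists = squareform(pdist(pts))

# Sort each row; column 0 is the point itself (distance 0),
# so column 3 is the distance to the third-nearest neighbour
third_nn = np.sort(dists, axis=1)[:, 3]

# The "centroid" is the point whose third-nearest neighbour is closest
centroid_idx = np.argmin(third_nn)
```

With these coordinates the middle point at (2, 1) wins, since all four other points are at distance 1 from it.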

I'll generate two example DataFrames below. Using the first (df), frame 1 displays the vectors between points at the first timestamp. Frame 2 does the same with new positions for some points and swapped positions for others (points A and B swap between frames). The final frame displays every vector across all frames, while the points display the average structure.

import pandas as pd
from sklearn.neighbors import KernelDensity
from scipy.spatial.distance import pdist, squareform
import matplotlib.pyplot as plt
import numpy as np

# Example 1:
df = pd.DataFrame({   
    'Time' : [1,1,1,1,1,2,2,2,2,2],             
    'id' : ['A','B','C','D','E','B','A','C','D','E'],                 
    'X' : [1.0,2.8,4.0,2.0,2.0,1.5,3.0,5.0,3.0,2.5],
    'Y' : [1.0,1.0,0.0,0.0,2.0,1.0,1.0,0.0,0.0,2.0],
    })

def calculate_distances(group):
    group_distances = pd.DataFrame(
        squareform(pdist(group[["X", "Y"]].to_numpy())),  # Default is Euclidean distance
        columns=group["id"],
        index=group["id"],
    )

    return group_distances

# Calculate the distances between the points, per timeframe
df_distances = df.groupby("Time").apply(calculate_distances)

# Create a placeholder to store the relative positions at every timestamp
relative_positions = {timestamp: [] for timestamp in df["Time"].values}

# Go over the timeframes
for timestamp, group in df.groupby("Time"):

    # ---
    # "... first, we set the centroid of the structure to be the position of the point in the densest part of the structure ..."

    # Determine the density of the group, within this timeframe
    kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(group[["X", "Y"]])
    log_density = kde.score_samples(group[["X", "Y"]])

    # Centroid is the most dense point in the structure
    centroid = group.iloc[np.argmax(log_density)]

    # Make a list of the other points to keep track of which points we've handled
    other_points = group["id"].to_list()

    # Start by making the centroid the active point
    active_point_id = centroid["id"]

    # ---
    # "... the relative position of that point’s nearest neighbor (ignoring any point already considered
    # in the process) and so on, until the positions of all points in the team have been determined."

    # Keep handling the next point until there are no points left
    while len(other_points) > 1:

        # Remove the active point from the list
        other_points = [point for point in other_points if point != active_point_id]

        # For the active point, get the nearest neighbor
        nearest_neighbor = df_distances.loc[[timestamp]][active_point_id].droplevel(0).loc[other_points].sort_values().reset_index().iloc[0]["id"]

        # ---
        # "... We then identify the relative position of his nearest neighbor ..."

        # Determine the relative position of the nearest neighbor (compared to the active point)
        active_point_coordinates = group.loc[group["id"] == active_point_id, ["X", "Y"]].iloc[0].values
        nearest_neighbor_coordinates = group.loc[group["id"] == nearest_neighbor, ["X", "Y"]].iloc[0].values
        relative_position = active_point_coordinates - nearest_neighbor_coordinates

        # Add the relative position to the list, for this timestamp
        relative_positions[timestamp].append(relative_position)

        # The neighbor becomes the active point
        active_point_id = nearest_neighbor

# ---
# "... averaging the vectors between each pair of points over a specified time interval to gain a
# clear measure of their designated relative positions ..."

# Take the average vector, across timeframes
averages = np.mean([t for t in relative_positions.values()], axis=0)

# Plot the relative positions, NOTE: The centroid is always at (0, 0), and is not plotted

plt.scatter(averages[:,0], averages[:,1])

If I manually plot the centroid at (0, 0), the output is:

(image omitted)

point structure frame 1:

(image omitted)

point structure frame 2:

(image omitted)

The total vectors for both frames are highlighted. So the average point structure of these should be:

(image omitted)

If I generate an identical point structure but shift the points to the right in the subsequent frame, the underlying point structure should be the same.

df2 = pd.DataFrame({   
    'Time' : [1,1,1,1,1,2,2,2,2,2],             
    'id' : ['A','B','C','D','E','B','A','C','D','E'],                 
    'X' : [1.0,3.0,4.0,2.0,2.0,3.0,5.0,6.0,4.0,4.0],
    'Y' : [1.0,1.0,0.0,0.0,2.0,1.0,1.0,0.0,0.0,2.0],
    })

Intended structure:

(image omitted)

asked Dec 20 '25 by jonboy


1 Answer

I've tried to follow the paper you quoted to a T, but the description of their algorithm is quite vague. This is my solution:

import numpy
import pandas
import random

from sklearn.neighbors import KernelDensity
from scipy.spatial.distance import pdist, squareform

# From the paper:
# ---------------
# Formations are measured by calculating the vectors between each player and the rest of his
# teammates at successive instants during a match, averaging the vectors between each pair of
# players over a specified time interval to gain a clear measure of their designated relative positions.
# The final spatial distribution of the outfield players is determined by the following algorithm:
# first, we set the centroid of the formation to be the position of the player in the densest part of the
# team, as determined by the average distance to the third-nearest neighbor. We then identify the
# relative position of his nearest neighbor, the relative position of that player’s nearest neighbor
# (ignoring any player already considered in the process) and so on, until the positions of all players
# in the team have been determined.


# Your data, I've added some randomness to get a more realistic setting
df = pandas.DataFrame(
    {
        "Time": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "id": ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"],
        "Y": [element + random.random() * 0.25 for element in [1.0, 1.0, 0.0, 1.25, 2.0, 1.0, 1.0, 0.0, 1.25, 2.0]],
        "X": [element + random.random() * 0.25 for element in [1.0, 3.0, 2.0, 2.25, 2.0, 3.0, 5.0, 4.0, 4.25, 4.0]],
    }
)

# Plot the different timeframes (for reference)
for timestamp in df["Time"].unique():
    df.loc[df["Time"] == timestamp].plot(kind="scatter", x="X", y="Y")


def calculate_distances(group: pandas.DataFrame) -> pandas.DataFrame:
    """ Calculate the distances between the players, within a specific timeframe.

    Args:
        group (pandas.DataFrame): The data from a specific timeframe

    Returns:
        pandas.DataFrame: The distances
    """
    group_distances = pandas.DataFrame(
        squareform(pdist(group[["X", "Y"]].to_numpy())),  # Default is Euclidean distance
        columns=group["id"],
        index=group["id"],
    )
    return group_distances


# Calculate the distances between the points, per timeframe
df_distances = df.groupby("Time").apply(calculate_distances)

# Create a placeholder to store the relative positions at every timestamp
relative_positions = {timestamp: [] for timestamp in df["Time"].values}

# Go over the timeframes
for timestamp, group in df.groupby("Time"):

    # ---
    # "... first, we set the centroid of the formation to be the position of the player in the densest part of the team ..."

    # Determine the density of the group, within this timeframe
    kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(group[["X", "Y"]])
    log_density = kde.score_samples(group[["X", "Y"]])

    # Centroid is the most dense point in the formation
    centroid = group.iloc[numpy.argmax(log_density)]

    # Make a list of the other players to keep track of which players we've handled
    other_players = group["id"].to_list()

    # Start by making the centroid the active player
    active_player_id = centroid["id"]

    # ---
    # "... the relative position of that player’s nearest neighbor (ignoring any player already considered
    # in the process) and so on, until the positions of all players in the team have been determined."

    # Keep handling the next player until there are no players left
    while len(other_players) > 1:

        # Remove the active player from the list
        other_players = [player for player in other_players if player != active_player_id]

        # For the active player, get the nearest neighbor
        nearest_neighbor = df_distances.loc[[timestamp]][active_player_id].droplevel(0).loc[other_players].sort_values().reset_index().iloc[0]["id"]

        # ---
        # "... We then identify the relative position of his nearest neighbor ..."

        # Determine the relative position of the nearest neighbor (compared to the active player)
        active_player_coordinates = group.loc[group["id"] == active_player_id, ["X", "Y"]].iloc[0].values
        nearest_neighbor_coordinates = group.loc[group["id"] == nearest_neighbor, ["X", "Y"]].iloc[0].values
        relative_position = active_player_coordinates - nearest_neighbor_coordinates

        # Add the relative position to the list, for this timestamp
        relative_positions[timestamp].append(relative_position)

        # The neighbor becomes the active player
        active_player_id = nearest_neighbor


# ---
# "... averaging the vectors between each pair of players over a specified time interval to gain a
# clear measure of their designated relative positions ..."

# Take the average vector, across timeframes
averages = numpy.mean([t for t in relative_positions.values()], axis=0)

# Plot the relative positions, NOTE: The centroid is always at (0, 0), and is not plotted
pandas.DataFrame(averages, columns=["X", "Y"]).plot(kind="scatter", x="X", y="Y")

Previous answer:

The first part (fixing your code sample) is not too difficult. scipy has a function called pdist which calculates the pairwise distances between a set of points in multiple dimensions (2 in this case). If you want this per timeframe, you just have to use a groupby.
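As a minimal illustration of what pdist and squareform do (toy points, not your data):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# A 3-4-5 right triangle, one point per row
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])

condensed = pdist(pts)          # condensed form: one entry per unordered pair
matrix = squareform(condensed)  # full symmetric matrix with a zero diagonal

print(condensed)      # [3. 4. 5.]
print(matrix[1, 2])   # 5.0
```

The condensed form lists the pairs in row order: (0,1), (0,2), (1,2). squareform expands it into the square matrix that the code below indexes by id.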

The second part is harder because it's not totally clear what you hope to achieve. Finding the "densest" spot in the formation can be done without the previous distance calculations: sklearn has a KernelDensity class for this. Beyond that I can't really follow what you want, because in your formation there isn't a single nearest neighbor (the distance from the centroid to all other points is equal, so all neighbors are equally close). However, I think you can use the matrix of mean distances (df_distances_mean) for this purpose, since it does contain all distances. You'll just have to repeatedly select the next point with the lowest distance from the current one.

Here is the code that calculates the distances and finds the centroid with the KernelDensity class:

import numpy
import pandas
from sklearn.neighbors import KernelDensity
from scipy.spatial.distance import pdist, squareform

# Your data
df = pandas.DataFrame(
    {
        "Time": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "id": ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"],
        "X": [1.0, 3.0, 2.0, 2.0, 2.0, 3.0, 5.0, 4.0, 4.0, 4.0],
        "Y": [1.0, 1.0, 0.0, 1.0, 2.0, 1.0, 1.0, 0.0, 1.0, 2.0],
    }
)


def calculate_distances(group: pandas.DataFrame) -> pandas.DataFrame:
    group_distances = pandas.DataFrame(
        squareform(pdist(group[["X", "Y"]].to_numpy())),  # Default is Euclidean distance
        columns=group["id"],
        index=group["id"],
    )
    return group_distances


# Calculate the distances between the points, per timeframe
df_distances = df.groupby("Time").apply(calculate_distances)

# Take the mean distance across timeframes (since your points are just shifted right over time, this should be constant)
df_distances_mean = pandas.DataFrame(
    numpy.mean([group.to_numpy() for _, group in df_distances.groupby("Time")], axis=0),
    columns=df_distances.columns,
    index=df_distances.columns,
)
# df_distances_mean will now contain the mean distance between points

# =====================
# Not a 100% sure what you want to achieve from this point forward

# Determining the density of the formation (at each time step)
for _, group in df.groupby("Time"):

    # Determine the density
    kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(group[["X", "Y"]])
    log_density = kde.score_samples(group[["X", "Y"]])

    # Centroid is the most dense point in the formation (?)
    centroid = group.iloc[numpy.argmax(log_density)]
    print("Centroid based on density:", centroid)

Output:

Centroid based on density: Time    1
id      D
X       2
Y       1
Name: 3, dtype: object
Centroid based on density: Time    2
id      D
X       4
Y       1
Name: 8, dtype: object

print(df_distances_mean)

id         A         B         C    D         E
id                                             
A   0.000000  2.000000  1.414214  1.0  1.414214
B   2.000000  0.000000  1.414214  1.0  1.414214
C   1.414214  1.414214  0.000000  1.0  2.000000
D   1.000000  1.000000  1.000000  0.0  1.000000
E   1.414214  1.414214  2.000000  1.0  0.000000
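The "select the next point with the lowest distance" idea can be sketched as a greedy walk over a mean distance matrix like df_distances_mean. The matrix below is a made-up stand-in with the same shape; ties are broken by id order, so in your symmetric example the chosen chain is one of several equally valid ones:

```python
import pandas as pd

# Stand-in for df_distances_mean (hypothetical values)
ids = ["A", "B", "C", "D", "E"]
mean_dist = pd.DataFrame(
    [[0.000, 2.000, 1.414, 1.0, 1.414],
     [2.000, 0.000, 1.414, 1.0, 1.414],
     [1.414, 1.414, 0.000, 1.0, 2.000],
     [1.000, 1.000, 1.000, 0.0, 1.000],
     [1.414, 1.414, 2.000, 1.0, 0.000]],
    index=ids, columns=ids,
)

def greedy_chain(dist: pd.DataFrame, start: str) -> list:
    """Visit points by repeatedly jumping to the nearest unvisited one."""
    order, current = [start], start
    remaining = set(dist.index) - {start}
    while remaining:
        # sorted() makes tie-breaking deterministic (first id wins)
        current = dist.loc[current, sorted(remaining)].idxmin()
        order.append(current)
        remaining.remove(current)
    return order

print(greedy_chain(mean_dist, "D"))  # ['D', 'A', 'C', 'B', 'E']
```

Starting from the density-based centroid (D here), this yields the ordering in which the relative-position vectors are collected.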
answered Dec 22 '25 by Gijs Wobben


