Optimizing a complex, custom data problem - find the optimal polygon that contains a given fraction of data, with constraints

I am working with Uranium-Lead geochronological datasets. This type of data is typically interpreted in what is called a "Wetherill" concordia diagram, an x-y coordinate space where each axis contains Pb/U ratios that correspond to one age estimate. Given that both axes have independent age estimates, but the isotopic systems decay at different rates, the line of equal age (concordia) in this space has curvature to it.

Geochronologists interpret data arrays in this space; the relevant age information is contained in where those arrays intersect the concordia line.

Typical interpretations rely on single data arrays following a single line, which is readily regressed to find intersections with the concordia line, and uncertainty is robustly propagated on those regressions.

However, more complex datasets can be difficult to interpret.

The fundamental information in more complex scenarios can be represented by a polygon, where the corners of the polygon must fall on the concordia curve - these are the points that contain the age information. In essence, we can interpret the boundary lines in a dense data array as containing the geochronological information.

I am dealing with very large-n datasets (>1 million data points), and I am working on a method to identify the polygon shape that contains the largest fraction of the data while also tightly defining the outer limit of the data - in other words, the optimal polygon is not simply the largest polygon.

Below I generate some synthetic data that corresponds to a complex data array. The scenario modeled contains a large population of mineral grains that grew between 3 billion years ago (upper.int.age.max) and 2.5 billion years ago (upper.int.age.min). These grains then experienced an alteration event 1 billion years ago (lower.int.age.1), and another alteration event 100 million years ago (lower.int.age.2).

## set seed
set.seed( 12 ) 

## now create ages for a model dataset events
lower.int.age.1 <- 1000
lower.int.age.2 <- 100
upper.int.age.min <- 2500
upper.int.age.max <- 3000

## a couple of required equations
ratio735 <- function(age) {   ## age in Ma (the decay constant is per year, hence the 1e6)
  exp( 9.8485e-10 * age * 1e6 ) - 1 
}

ratio638 <- function(age) {   ## age in Ma
  exp( 1.55125e-10 * age * 1e6 ) - 1 
}

## calculate x and y values from the age information
x.values <- c( ratio735( lower.int.age.1 ),
               ratio735( lower.int.age.2 ) ) # calculate 7/35 ratios
y.values <- c( ratio638( lower.int.age.1 ),
               ratio638( lower.int.age.2 ) ) # calculate 6/38 ratios


## now make a data frame of example data
data.test <- data.frame( upper.age = runif( 5000, min = upper.int.age.min,
                                            max = upper.int.age.max ),
                         prop.1 = runif( 5000, min = 0.05, max = 1 ),
                         prop.2 = runif( 5000, min = 0.05, max = 1 ) )

data.test$x1 <- ( ratio735( data.test$upper.age ) - x.values[1] ) * data.test$prop.1 + x.values[1]
data.test$y1 <- ( ratio638( data.test$upper.age ) - y.values[1] ) * data.test$prop.1 + y.values[1]

data.test$x2 <- ( data.test$x1 - x.values[2] ) * data.test$prop.2 + x.values[2]
data.test$y2 <- ( data.test$y1 - y.values[2] ) * data.test$prop.2 + y.values[2]

Here is the equation for the line of equal age, the concordia line; I also generate some points along the line for plotting purposes.

## line equation (same relations as ratio735() and ratio638() above)
concordia.x <- function(t) {
  exp( 9.8485e-10 * t * 1e6 ) - 1
}
concordia.y <- function(t) {
  exp( 1.55125e-10 * t * 1e6 ) - 1
}


concordia.line <- data.frame( age = seq(0,3000,1) ) 
concordia.line$x <- ratio735( concordia.line$age )
concordia.line$y <- ratio638( concordia.line$age )
concordia.points <- subset( concordia.line,
                            age %in% c( lower.int.age.1, lower.int.age.2,
                                        upper.int.age.min, upper.int.age.max ) )

Now plot the synthetic data and the concordia line. The black points are the geological events.

## plot the data and line
ggplot( data.test, aes( x= x2, y = y2 ) ) +
  geom_line( data = concordia.line, aes( x = x, y = y ) ) +
  geom_point( color = "red" ) +
  geom_point( data = concordia.points, aes( x = x, y = y ),
              size = 4 )

Here is the resulting plot (the black points are the four event ages defined above: lower.int.age.1, lower.int.age.2, upper.int.age.min, and upper.int.age.max). Ideally a model will return a polygon with the four bold points making up the corners of the polygon.

[plot: synthetic data (red), the concordia line, and the four event points (black)]

I'm currently doing this by brute force, generating a suite of hypothetical polygons with corners defined along the concordia line. I generate the polygons by calculating four sequences of possible corner values, then using expand.grid to find all possible combinations. Then I filter that grid for combinations whose corners are in the correct order.

## create polygons
lower.1 = seq( from = 0, to = 100, by = 50 ) ## corner 1 
lower.2 = seq( from = 900, to = 1100, by = 100 ) ## corner 2 
upper.1 = seq( from = 2400, to = 2800, by = 100 ) ## corner 3
upper.2 = seq( from = 2800, to = 3200, by = 100 ) ## corner 4
# find all combinations
age.values = expand.grid( lower.1, lower.2, upper.1, upper.2 )
colnames( age.values ) <- c( "lower.1", "lower.2", "upper.1", "upper.2" )

##filter for polygons with a certain order to them
age.values.used <- subset( age.values, lower.1 < lower.2 & lower.2 < upper.1 & upper.1 < upper.2 ) 


## calculate a list of x,y values for each polygon
polygons <- list(NA)

## Fill the list with a for loop - not a great way, I know
for( i in 1:nrow( age.values.used ) ) {
  data.temp <- age.values.used[ i, ]
  polygons[[ i ]] <- data.frame( age = age.values.used[i,], 
                                 x = concordia.x( as.vector( t( data.temp ) ) ),
                                 y = concordia.y( as.vector( t( data.temp ) ) ) )
}

Once I have the polygons in a list, where each element contains the four corners of one polygon, I write a function to find the fraction of data inside a given polygon. This uses the sp::point.in.polygon function and simply calculates the fraction of the data points that fall inside the polygon.

library(sp)
library(pracma)

## calculate the fraction of data in one polygon
## takes two data frames: one with the data, one with the x and y coords of the polygon
in.poly <- function( poly.frame, points ) {
  ## use sp point.in.polygon function
  in.polygon <- sp::point.in.polygon( point.x = points[,1],
                                      point.y = points[,2],
                                      pol.x = poly.frame$x,
                                      pol.y = poly.frame$y )

  sum( in.polygon == 1 ) / length( in.polygon )
}

Now I loop through, using that function to find the fraction of data in each polygon.

## create a blank data frame to fill in a for loop
results.data <- data.frame( upper.1 = rep( NA, times = length( polygons ) ),
                            upper.2 = rep( NA, times = length( polygons ) ),
                            lower.2 = rep( NA, times = length( polygons ) ),
                            lower.1 = rep( NA, times = length( polygons ) ),
                            fraction.data.in = rep( NA, times = length( polygons ) ),
                            area = rep( NA, times = length( polygons ) ) )

Now use the for loop to add the corner ages and the fraction of data in each polygon, and to calculate the area of each polygon (using the sp::Polygon function).

for( j in 1:length( polygons ) ) {
  results.data$upper.1[ j ] <- polygons[[j]]$age.upper.1[1]
  results.data$upper.2[ j ] <- polygons[[j]]$age.upper.2[1]
  results.data$lower.2[ j ] <- polygons[[j]]$age.lower.2[1]
  results.data$lower.1[ j ] <- polygons[[j]]$age.lower.1[1]
  results.data$fraction.data.in[ j ] <- in.poly( polygons[[j]],
                                                 data.test[ , c("x2", "y2") ] )
  poly.def <- sp::Polygon( polygons[[j]][ , c("x", "y") ] )
  results.data$area[ j ] <- slot( poly.def, "area" )
}

I can then use this data to figure out which polygons contain the most data while also having the smallest area.
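For example, the final selection step could look something like the sketch below (assuming a simple combined score; the exact trade-off between coverage and area is the part I am still unsure about):

## score the polygons: favour high data coverage, penalise large area
## (the exponent and the ratio form are illustrative choices, not fixed)
results.data$score <- results.data$fraction.data.in^2 / results.data$area
best.polygon <- results.data[ which.max( results.data$score ), ]
best.polygon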

This is very much a brute force method and I'm hoping to find something that is more elegant (not hard I'm sure) and runs significantly faster.

Thanks for any help you can provide, and please comment if I should explain something further.

asked by Jesse

2 Answers

I suggest a genetic algorithm using the GA package.

The goal is to find a set of times within the interval

age.min <- 0
age.max <- 3000

such that the (x,y) coordinates

compute_x_from_age <- function(t){
    exp(9.8485e-10 * t * 1e6) - 1
}
compute_y_from_age <- function(t){
    exp(1.55125e-10 * t * 1e6) - 1
}

enclose most of the

red_points <- data.test[, c("x2", "y2")]

while having minimal area. Call a vector of four numbers in the interval a candidate. Then

library(GA)
library(sp)
library(pracma)
library(ggplot2)

fraction_in_candidate <- function(candidate){
    candidate <- sort(candidate)

    in.polygon <- sp::point.in.polygon( point.x = red_points[,1],
                                        point.y = red_points[,2],
                                        pol.x = compute_x_from_age(candidate),
                                        pol.y = compute_y_from_age(candidate))

    sum( in.polygon == 1 ) / length( in.polygon )
}

area_of_candidate <- function(candidate){
    candidate <- sort(candidate)

    sp::Polygon(data.frame(compute_x_from_age(candidate),
                           compute_y_from_age(candidate))) |>
        slot("area")
}

fitness_of_candidate <- function(candidate){
    (fraction_in_candidate(candidate)^5)/sqrt(area_of_candidate(candidate))
}

You should use theoretical and empirical domain knowledge to adjust the fitness function so that it gives reasonable output in the run below. My first few tries gave degenerate (extremely narrow) polygons.
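For instance, one adjustment that might help with the degenerate shapes (an untested assumption on my part, not something I tuned for this data) is to reward coverage alone until a minimum coverage is reached, and only then let area minimisation take over:

## hypothetical alternative fitness: a coverage floor before area matters
fitness_with_floor <- function(candidate, min.coverage = 0.95){
    frac <- fraction_in_candidate(candidate)
    if (frac < min.coverage) return(frac)        # below the floor, only coverage counts (values < 1)
    1 + 1 / (1 + area_of_candidate(candidate))   # above the floor, fitness is in (1, 2]: smaller area wins
}

Any candidate above the floor always outscores any candidate below it, so the search drives coverage up first and only then shrinks the polygon.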

Given a candidate, we can mutate it by randomly nudging a point left or right.

custom_mutation <- function(object, parent) {
    child <- object@population[parent, ]

    gene_to_mutate <- sample(1:4, 1)
    jitter_amount <- rnorm(1, mean = 0, sd = 100)
    child[gene_to_mutate] <- child[gene_to_mutate] + jitter_amount

    # clip to boundary
    child[gene_to_mutate] <- min(max(child[gene_to_mutate], age.min), age.max)

    child
}

Given two parent candidates, we can merge the eight values together and then separate them randomly into two child candidates.

custom_crossover <- function(object, parents) {
    parent1 <- object@population[parents[1], ]
    parent2 <- object@population[parents[2], ]

    combined_genes <- c(parent1, parent2)
    children <- sample(combined_genes,8)

    list(children = matrix(children, nrow = 2), fitness = c(NA, NA))
}

Then the genetic algorithm can be run with

ga_result <- ga(
    type = "real-valued",
    fitness = fitness_of_candidate,
    lower = rep(age.min, 4),
    upper = rep(age.max, 4),
    popSize = 500,
    maxiter = 50,
    mutation = custom_mutation,
    crossover = custom_crossover,
    elitism = 5,
    monitor = TRUE
)

# Best solution
solution <- sort(ga_result@solution[1,])
solution_df <- data.frame(x2 = compute_x_from_age(solution),
                          y2 = compute_y_from_age(solution))

ggplot(red_points, aes(x=x2, y = y2)) +
    geom_line( data = concordia.line, aes( x = x, y = y ) ) +
    geom_point(data = solution_df, color = "blue", size = 4) +
    geom_polygon(data = solution_df,
                 color = "blue",
                 alpha = 0.3
    ) +
    geom_point( color = "red" )

[plot: the GA solution polygon and its corners (blue) over the red data points and the concordia line]

My uneducated guess is that the main place to focus your geochronological knowledge is in the fitness function. I believe the rest is simply about the speed of convergence.

You could try letting the mutation function occasionally add or remove a point. Then you might get non-quadrilaterals.
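One way to sketch that idea without leaving the fixed-length, real-valued encoding (again, an untested sketch of mine) is to pair each age gene with an activity gene and drop inactive corners inside the fitness function:

## hypothetical variable-corner fitness: 6 age genes followed by 6 activity genes in [0, 1]
fitness_variable_corners <- function(chromosome){
    ages   <- chromosome[1:6]
    active <- chromosome[7:12] > 0.5     # keep a corner only if its activity gene exceeds 0.5
    if (sum(active) < 3) return(0)       # need at least a triangle
    candidate <- sort(ages[active])
    fraction_in_candidate(candidate)^5 / sqrt(area_of_candidate(candidate))
}

## run with lower = c(rep(age.min, 6), rep(0, 6)) and upper = c(rep(age.max, 6), rep(1, 6));
## the custom mutation and crossover above would also need their gene indices widened from 4 to 12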

answered by Michael Dewar

You could try fitting a segmented regression to, e.g., a cumulative maximum of the data, and predicting y values at the estimated break points.

> xord <- order(data.test$x2)
> yc_max <- cummax(data.test$y2[xord])
> x2_ord <- data.test$x2[xord]
> fit <- lm(yc_max ~ x2_ord)
> o <- segmented::segmented(fit, seg.Z = ~x2_ord, psi = c(2, 11))  ## c(2, 11) are guessed starting values for the optimizer
> x_brk <- unname(summary(o)$psi[, "Est."]) |> 
+   c(range(data.test$x2)) |> sort()
> yhat <- predict(o, newdata = data.frame(x2_ord = x_brk))
> (polygon_est <- cbind(x=x_brk, y=yhat))
           x          y
1  0.2181886 0.02226673
2  2.2488102 0.18087809
3 10.5529546 0.45421275
4 17.4122930 0.58586390
> 
> with(data.test, plot(x2, y2, pch = '.'))
> polygon(polygon_est, border = 'red')

[plot: the data with the estimated polygon outlined in red]
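If you also want the break points as ages rather than x coordinates, you can invert the 207Pb/235U ratio equation (a small follow-up sketch using the same decay constant as in the question):

## convert the estimated x break points back to ages (Ma) by inverting
## x = exp(9.8485e-10 * t * 1e6) - 1
age_from_x <- function(x) log(1 + x) / (9.8485e-10 * 1e6)
age_from_x(x_brk)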

answered by jay.sf