Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Need to fix Stan code for the generalized pareto distribution to take in real arguments rather than vectors

Tags:

r

statistics

stan

I am using the functions defined here: Extreme value analysis and user defined probability functions in Stan for modeling the data with a generalized pareto distribution, but my problem is that my model is in a for-loop and expects three real valued arguments, whereas, the gpd functions assume a vector, real, real argument.

I’m not so sure that my model chunk is so amenable to being vectorized, and so I was thinking I would need to have the gpd functions take in real valued arguments (but maybe I’m wrong).

I’d appreciate any help with switching the code around to achieve this. Here is my stan code

functions {
  real gpareto_lpdf(vector y, real k, real sigma) {
    // generalised Pareto log pdf 
    int N = rows(y);
    real inv_k = inv(k);
    if (k<0 && max(y)/sigma > -inv_k)
      reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma)
    if (sigma<=0)
      reject("sigma<=0; found sigma =", sigma)
    if (fabs(k) > 1e-15)
      return -(1+inv_k)*sum(log1p((y) * (k/sigma))) -N*log(sigma);
    else
      return -sum(y)/sigma -N*log(sigma); // limit k->0
  }

  real gpareto_lcdf(vector y, real k, real sigma) {
    // generalised Pareto log cdf
    real inv_k = inv(k);
    if (k<0 && max(y)/sigma > -inv_k)
      reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma)
    if (sigma<=0)
      reject("sigma<=0; found sigma =", sigma)
    if (fabs(k) > 1e-15)
      return sum(log1m_exp((-inv_k)*(log1p((y) * (k/sigma)))));
    else
      return sum(log1m_exp(-(y)/sigma)); // limit k->0
  }
}

data {
  // the input data
  int<lower = 1> n;                       // number of observations
  real<lower = 0> value[n];              // value measurements
  int<lower = 0, upper = 1> censored[n];  // vector of 0s and 1s

  // parameters for the prior
  real<lower = 0> a;
  real<lower = 0> b;
}

parameters {
  real k;
  real sigma;
}

model {
 // prior
 k ~ gamma(a, b);
 sigma ~ gamma(a,b);

 // likelihood
  for (i in 1:n) {
    if (censored[i]) {
      target += gpareto_lcdf(value[i] | k, sigma);  
    } else {
      target += gpareto_lpdf(value[i] | k, sigma);
    }
  }
}
like image 972
John Smith Avatar asked Oct 26 '25 23:10

John Smith


1 Answers

Here is how the log PDF could be adapted. This way, index arrays for subsetting y into censored and non-censored observations can be passed.

real cens_gpareto_lpdf(vector y, int[] cens, int[] no_cens, real k, real sigma) {
    
  // generalised Pareto log pdf 
  int N = size(cens);
  real inv_k = inv(k);
  if (k<0 && max(y)/sigma > -inv_k)
  reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma)

    if (fabs(k) > 1e-15)
      return -(1+inv_k)*sum(log1p((y[no_cens]) * (k/sigma))) -N*log(sigma) + 
      sum(log1m_exp((-inv_k)*(log1p((y[cens]) * (k/sigma)))));
    else
      return -sum(y[no_cens])/sigma -N*log(sigma) + 
      sum(log1m_exp(-(y[cens])/sigma));
}

Extend the data block: n_cens, n_not_cens, cens, and no_cens are values that need to supplied.

  int<lower = 1> n;               // total number of obs
  int<lower = 1> n_cens;          // number of censored obs
  int<lower = 1> n_not_cens;      // number of regular obs
  
  int cens[n_cens];               // index set censored
  int no_cens[n_not_cens];        // index set regular
  
  vector<lower = 0>[n] value;     // value measurements

Nonzero Parameters as suggested by gfgm:

parameters {
  real<lower=0> k;
  real<lower=0> sigma;
}

Rewrite the model block:

model {
  // prior
  k ~ gamma(a, b);
  sigma ~ gamma(a,b);
  // likelihood
  value ~ cens_gpareto(cens, no_cens, k, sigma);
}

Disclaimer: I neither checked the formulas for sanity nor ran the model using test data. Just compiled via rstan::stan_model() which worked fine. gfgm's suggestion may be more convenient for post-processing / computing stuff in generated quantities etc. I'm not a Stan expert :-).

Edit:

Fixed divergence issue found by gfgm through simulation. The likelihood was ill-defined (N= rows(y) instead of N=size(cens). Runs fine now with gfgm's data (using set.seed(123) and rstan):

         mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
k        0.16    0.00 0.10    0.02    0.08    0.14    0.21    0.42  1687    1
sigma    0.90    0.00 0.12    0.67    0.82    0.90    0.99    1.16  1638    1
lp__  -106.15    0.03 1.08 -109.09 -106.56 -105.83 -105.38 -105.09  1343    1
like image 193
Martin C. Arnold Avatar answered Oct 29 '25 15:10

Martin C. Arnold