I am using multivariate GAM models to learn more about fog trends in multiple regions. Fog is determined by visibility going below a certain threshold (< 400 meters). Our GAM model is used to determine the response of visibility to a range of meteorological variables.
However, my challenge right now is that I'd really like the y-axis to be the actual visibility observations rather than the centered smoothed. It is interesting to see how visibility is impacted by the covariates relative to the mean visibility in that location, but it's difficult to compare this for multiple locations where the mean visibility is different (and thus the 0 point in which visibility is enhanced or diminished has little comparable meaning).
In order to compare the results of multiple locations, I'm trying to make the y-axis actual visibility observations, and then I'll put a line at the visibility threshold we're interested in looking at (400 m) to evaluate what the predictor variables values are like below that threshold (eg what temperatures are associated with visibility below 400 m).
I'm still a beginner when it comes to GAMs and R in general, but I've figured out a few helpful pieces so far.
Helpful things so far:
Attempt 1. how to extract gam fit for each variable in model Extracting data used to make a smooth plot in mgcv
Attempt 2. how to use predict function to reconstruct a univariable model http://zevross.com/blog/2014/09/15/recreate-the-gam-partial-regression-smooth-plots-from-r-package-mgcv-with-a-little-style/
Attempt 3. how to get some semblance of a y-axis that looks like visibility observations using "fitted" -- though I don't think this is the correct approach since I'm not taking the intercept into account http://gsp.humboldt.edu/OLM/R/05_03_GAM.html
install.packages("mgcv") #for gam package
require(mgcv)
install.packages("pspline")
require(pspline)
#simulated GAM data for example
dataSet <- gamSim(eg=1,n=400,dist="normal",scale=2)
visibility <- dataSet[[1]]
temperature <- dataSet[[2]]
dewpoint <- dataSet[[3]]
windspeed <- dataSet[[4]]
#Univariable GAM model
gamobj <- gam(visibility ~ s(dewpoint))
plot(gamobj, scale=0, page=1, shade = TRUE, all.terms=TRUE, cex.axis=1.5, cex.lab=1.5, main="Univariable Model: Dew Point")
summary(gamobj)
AIC(gamobj)
abline(h=0)
Univariable Model of Dew Point https://i.sstatic.net/Q5jYj.jpg
#dummy var that spans length of original covariate
maxDP <-max(dewpoint)
minDP <-min(dewpoint)
DPtrial.seq <-seq(minDP,maxDP,length=3071)
DPtrial.seq <-data.frame(dewpoint=DPtrial.seq)
#predict only the DP term
preds <- predict(gamobj, type="terms", newdata=DPtrial.seq, se.fit=TRUE)
#determine confidence intervals
DPplot <-DPtrial.seq$dewpoint
fit <-preds$fit
fit.up95 <-fit-1.96*preds$se.fit
fit.low95 <-fit+1.96*preds$se.fit
#plot
plot(DPplot, fit, lwd=3,
main="Reconstructed Dew Point Covariate Plot")
#plot confident intervals
polygon(c(DPplot, rev(DPplot)),
c(fit.low95,rev(fit.up95)), col="grey",
border=NA)
lines(DPplot, fit, lwd=2)
rug(dewpoint)
Reconstructed Dew Point Covariate Plot https://i.sstatic.net/YtLqE.jpg
plot(dewpoint,fitted(gamobj), main="Fitted Response of Y (Visibility) Plotted Against Dew Point")
abline(h=mean(visibility))
rug(dewpoint)
Fitted Response of Y Plotted Against Dew Point https://i.sstatic.net/cQ3v5.jpg
Ultimately, I want a horizontal line where I can investigate the predictor variable relative to 400 meters, rather than just the mean of the response variable. This way, it will be comparable across multiple sites where the mean visibility is different. Most importantly, it needs to be for multiple covariates!
Gavin Simpson has explained the method in a couple of posts but unfortunately, I really don't understand how I would hold the mean of the other covariates constant as I use the predict function:
Changing the Y axis of default plot.gam graphs
Any deeper explanation into the method for doing this would be super helpful!!!
I'm not sure how helpful this will be as your Q is a little more open ended than we'd typically like on SO, but, here goes.
Firstly, I think it would help to think about modelling the response variable, which I assume is currently visibility. This is going to be a continuous variable, bounded at 0 (perhaps the data never reach zero?) which suggests modelling the data as conditionally distributed either
family = Gamma(link = 'log')
) for visibility that never takes a value of zero.family = tw()
) for data that do have zeroes.An alternative approach would be to model the occurrence of fog; if this is defined as an event <400m visibility then you could turn all your observations into 0/1 values for being a fog event or otherwise. Then you'd model the data as conditionally distributed Bernoulli, using family = binomial()
.
Having decided on a modelling approach, we need to model the response. This should be done using a multiple regression type of approach, with a GAM including multiple predictors. This way you get to estimate the effect of each potential predictor variable on the response while controlling for the effects of the other predictors. If you just do this using a single predictor at a time, say dewpoint
, that variable could well "explain" variation in the data that might be due to another predictor, windspeed
say, and you wouldn't know it.
Furthermore, there may well be interactions between predictors that you'll want to control for if they exist, which can only be done in
Then, to finally get to the crux of your problem, having fitted the multi-predictor model to "explain" visibility, you will need to predict from the model for sets of likely conditions. To look at how the visibility varies with dewpoint
in a model where other predictor variables have effects, you need to fix the other variables at some reasonable values; one option is to set them to their mean (or modal value in the case of any factor predictor variables), or some other value indicative of typically values for that variable. You'll have to use your domain knowledge for this.
If you have interactions in the model, then you'll need to vary the two variables in the interaction, whilst holding all other variable fixed at some values.
Let's assume you don't have interactions and are interested in dewpoint
but the model also includes windspeed
. The mean windspeed for the values used to fit the model can be found from the cmX
component of the fitted model. Of you could just calculate this from the observed windpseed
values or set it to some known number you want to use. Denote the fitted by m
, and the data frame with your data in it by df
, then we can create new data to predict at over the range of dewpoint
, whilst holding windspeed
fixed.
mn.windspd <- m$cmX['windspeed']
## or
mn.windspd <- with(df, mean(windspeed))
## or set it some some value
mn.windspd <- 10 # say
Then you can do
preddata <- with(df,
expand.grid(dewpoint = seq(min(dewpoint),
max(dewpoint),
length = 300),
windspeed = mn.windspd))
Then you use this to predict from the fitted model:
pred <- predict(m, newdata = preddata, type = "link", se.fit = TRUE)
pred <- as.data.frame(pred)
Now we want to put these predictions back on to the response scale, and we want a confidence interval so we have to create that first before back transforming:
ilink <- family(m)$linkinv
pred <- transform(pred,
Fitted = ilink(fit),
Upper = ilink(fit + (2 * se.fit)),
Lower = ilink(fit - (2 * se.fit)),
dewpoint = preddata = dewpoint)
Now you can visualised the effect of dewpoint
on the response whilst keeping windspeed
fixed.
In your case, you will have to extend this to keeping temperature
constant also, but that is done in the same way
mn.windspd <- m$cmX['windspeed']
mn.temp <- m$cmX['temperature']
preddata <- with(df,
expand.grid(dewpoint = seq(min(dewpoint),
max(dewpoint),
length = 300),
windspeed = mn.windspd,
temperature = mn.temp))
and then follow the steps above to do the prediction.
For one or two variables varying I have a function data_slice()
in my gratia package which will do the above expand.grid()
stuff for you so you don't have to specify the mean values of the other covariates:
preddata <- data_slice(m, 'dewpoint', n = 300)
technically this finds the value in the data closest to the median value (for the covariates not varying). If you want means, then do
fixdf <- data.frame(windspeed = mn.windspd, temperature = mn.temp)
preddata <- data_slice(m, 'dewpoint', data = fixdf, n = 300)
If you have an interaction, say between dewpoint
and windspeed
then you need to vary two variables. This is pretty easy again with expand.grid()
:
mn.temp <- m$cmX['temperature']
preddata <- with(df,
expand.grid(dewpoint = seq(min(dewpoint),
max(dewpoint),
length = 100),
windspeed = seq(min(windspeed),
max(windspeed),
length = 300),
temperature = mn.temp))
This will create a 100 x 100 grid of values of the covariates to predict at, whilst holding temperature constant.
For data_slice()
you'd need to do:
fixdf <- data.frame(temperature = mn.temp)
preddata <- data_slice(m, 'dewpoint', 'windpseed',
data = fixdf, n = 300)
And extending this on to more covariates you want to vary, is also easy following this pattern with expand.grid()
; I have yet to implement more than 2 variables varying in data_slice
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With