Generalized additive model for location, scale and shape
The generalized additive model for location, scale and shape (GAMLSS) is a distributional regression model in which a parametric statistical distribution is assumed for the response (target) variable but the parameters of this distribution can vary according to explanatory variables. Therefore the shape of this distribution for the target variable can change with explanatory variables.
GAMLSS is an input-output model, i.e. $X \to Y$, but it differs from the classical model in that the input $X$ affects the distribution of the target variable as a whole, not just its mean, i.e. $X \to D(Y)$, where $D(Y)$ denotes the distribution of $Y$.
GAMLSS allows flexible regression by using smoothing or machine learning techniques to model the parameters of the distribution of the target (response) variable. GAMLSS assumes that the response variable can follow any theoretical parametric distribution, which might be heavy- or light-tailed, and positively or negatively skewed. In addition, all the parameters of the distribution, which often are location (e.g., mean), scale (e.g., variance) and shape (skewness and kurtosis), can be modelled as linear, nonlinear or algorithmic functions of the explanatory variables. The distributional assumption for the target variable can be checked through diagnostic plots such as the Q–Q plot or the worm plot. GAMLSS is a supervised machine learning model, since the target value (the output) is always present.
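The diagnostic check mentioned above can be sketched as follows. This is a minimal illustration, assuming a continuous response: each observation is passed through its fitted cumulative distribution function and then through the inverse standard normal CDF (normalized quantile residuals), and the worm plot is taken to be the detrended Q–Q plot of those residuals. The exponential data, the two candidate means and all function names are hypothetical choices for the example, not part of any GAMLSS software.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def quantile_residuals(y, cdf):
    """Map each observation through a fitted CDF, then to the normal scale."""
    eps = 1e-12  # keep probabilities strictly inside (0, 1) for inv_cdf
    return [STD_NORMAL.inv_cdf(min(max(cdf(v), eps), 1.0 - eps)) for v in y]

def worm_plot_points(residuals):
    """Return (theoretical quantile, deviation) pairs of a detrended Q-Q plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    theoretical = [STD_NORMAL.inv_cdf((i + 0.5) / n) for i in range(n)]
    return [(t, r - t) for t, r in zip(theoretical, ordered)]

def exponential_cdf(mean):
    """CDF of an exponential distribution with the given mean."""
    return lambda v: 1.0 - math.exp(-v / mean)

def mean_abs_deviation(points):
    return sum(abs(dev) for _, dev in points) / len(points)

# Simulated exponential data with true mean 2 (rate 0.5).
rng = random.Random(0)
y = [rng.expovariate(0.5) for _ in range(500)]

# Residuals under the correct mean and under a deliberately wrong mean.
good = worm_plot_points(quantile_residuals(y, exponential_cdf(2.0)))
bad = worm_plot_points(quantile_residuals(y, exponential_cdf(4.0)))

mean_dev_good = mean_abs_deviation(good)
mean_dev_bad = mean_abs_deviation(bad)
print("adequate model, mean |deviation|:", round(mean_dev_good, 3))
print("misspecified model, mean |deviation|:", round(mean_dev_bad, 3))
```

When the assumed distribution is adequate, the deviations scatter around zero; a systematic offset or curvature, as in the misspecified case, suggests that the location, scale or shape is not captured.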
Overview of the model
The generalized additive model for location, scale and shape (GAMLSS) is a statistical model introduced by Rigby and Stasinopoulos (2005)[1] to overcome some of the limitations associated with the popular generalized linear models (GLMs) of Nelder and Wedderburn (1972)[2] and generalized additive models (GAMs) of Hastie and Tibshirani.[3] Most of the limitations arise from the limited choice of distributions for the response. Both GLMs and GAMs assume that the response comes from the exponential family, a family rich enough to allow continuous and discrete responses (and very good for modelling the mean of the distribution as a function of the explanatory variables) but not flexible enough to model other characteristics of the distribution, e.g. the tails.
In GAMLSS the exponential family distribution assumption for the response variable $y$ (essential in GLMs and GAMs) is relaxed and replaced by a general distribution family, including highly skewed and/or kurtotic continuous and discrete distributions.
The systematic part of the model is expanded to allow modelling not only of the mean (or location) but possibly also of other parameters of the distribution of the response, as linear and/or nonlinear, parametric and/or additive non-parametric functions of explanatory variables and/or random effects.
GAMLSS, for continuous responses, is especially suited for modelling leptokurtic or platykurtic and/or positively or negatively skewed variables, and therefore for modelling the tails of the distribution. For count-type response variables it deals with over-dispersion and zero-inflation by using suitable over-dispersed and zero-inflated distributions. Heterogeneity is also dealt with by modelling the scale or shape parameters using explanatory variables. GAMLSS also allows mixed distributions, that is, distributions which have discrete and continuous parts; the zero-inflated beta is one of them. Note that while the beta distribution allows values in $(0, 1)$, the beta inflated distribution allows values in $[0, 1]$.
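To make the link between zero-inflation and over-dispersion concrete, the sketch below evaluates a zero-inflated Poisson distribution, one common example of a zero-inflated count distribution. The parameter values and the function name `zip_pmf` are hypothetical and chosen only for illustration. The point is that adding extra probability mass at zero pushes the variance above the mean, which a plain Poisson model cannot represent.

```python
import math

def zip_pmf(k, lam, pi0):
    """P(Y = k) for a zero-inflated Poisson with extra mass pi0 at zero."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return (pi0 if k == 0 else 0.0) + (1.0 - pi0) * poisson

# Hypothetical parameter values for the illustration.
lam, pi0 = 3.0, 0.25

# Sum over a range wide enough that the neglected tail is negligible.
support = range(60)
total = sum(zip_pmf(k, lam, pi0) for k in support)
mean = sum(k * zip_pmf(k, lam, pi0) for k in support)
var = sum((k - mean) ** 2 * zip_pmf(k, lam, pi0) for k in support)

print("total probability:", total)
print("mean:", mean, " closed form:", (1 - pi0) * lam)
print("variance:", var, " closed form:", (1 - pi0) * lam * (1 + pi0 * lam))
```

In a GAMLSS-type model both `lam` and `pi0` could in turn depend on explanatory variables, each through its own link function.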
There are several packages written in R related to GAMLSS models,[4] and tutorials for using and interpreting GAMLSS.[5]
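A GAMLSS model assumes independent observations $y_i$ for $i = 1, 2, \dots, n$ with probability (density) function $f(y_i \mid \boldsymbol{\theta}_i)$ conditional on $\boldsymbol{\theta}_i$. The parameter $\boldsymbol{\theta}_i$ often is a vector of four distribution parameters, each of which can be a function of the explanatory variables, for example $\boldsymbol{\theta}_i = (\mu_i, \sigma_i, \nu_i, \tau_i)$. The first two distribution parameters $\mu_i$ and $\sigma_i$ are usually characterised as location and scale parameters, while the remaining parameter(s), if any, are characterised as shape parameters, e.g. skewness and kurtosis parameters. The model may be applied more generally to the parameters of any population distribution. The most general formulation of a GAMLSS model is

$$y_i \sim \mathcal{D}(\theta_{1i}, \dots, \theta_{Ki}), \qquad g_k(\theta_{ki}) = \mathrm{ML}_k(\mathbf{x}_i), \quad k = 1, \dots, K,$$

where $\mathrm{ML}_k$ is any machine learning (mathematical or algorithmic) model of the explanatory variables $\mathbf{x}_i$ and $K$ is the number of parameters in the distribution for the response. The original formulation of GAMLSS in the 2005 RSS paper had only four parameters and was written as

$$\begin{aligned}
g_1(\boldsymbol{\mu}) &= \boldsymbol{\eta}_1 = \mathbf{X}_1 \boldsymbol{\beta}_1 + \sum_{j=1}^{J_1} h_{j1}(\mathbf{x}_{j1}) \\
g_2(\boldsymbol{\sigma}) &= \boldsymbol{\eta}_2 = \mathbf{X}_2 \boldsymbol{\beta}_2 + \sum_{j=1}^{J_2} h_{j2}(\mathbf{x}_{j2}) \\
g_3(\boldsymbol{\nu}) &= \boldsymbol{\eta}_3 = \mathbf{X}_3 \boldsymbol{\beta}_3 + \sum_{j=1}^{J_3} h_{j3}(\mathbf{x}_{j3}) \\
g_4(\boldsymbol{\tau}) &= \boldsymbol{\eta}_4 = \mathbf{X}_4 \boldsymbol{\beta}_4 + \sum_{j=1}^{J_4} h_{j4}(\mathbf{x}_{j4})
\end{aligned}$$

where $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$, $\boldsymbol{\nu}$, $\boldsymbol{\tau}$ and $\boldsymbol{\eta}_k$ are vectors of length $n$, $\boldsymbol{\beta}_k$ is a parameter vector of length $J'_k$, $\mathbf{X}_k$ is a fixed known design matrix of order $n \times J'_k$ and $h_{jk}$ is a smooth non-parametric function of explanatory variable $\mathbf{x}_{jk}$, for $j = 1, \dots, J_k$ and $k = 1, 2, 3, 4$. The $g_k(\cdot)$ for $k = 1, 2, 3, 4$ are link functions that ensure that the parameters are in the correct range of values.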
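As a rough illustration of the simplest special case of such a model, the sketch below fits a normal response whose mean (identity link) and standard deviation (log link) are both linear in a single explanatory variable, with no smooth terms. It alternates a weighted least squares step for the location with a Fisher scoring step for the log of the scale. This is a simplified scheme written for this example under those assumptions; it is not the fitting algorithm of any particular GAMLSS implementation, and the simulated data, coefficients and function names are hypothetical.

```python
import math
import random

def wls_line(x, y, w):
    """Weighted least squares fit of y ~ a + b * x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b

def fit_normal_location_scale(x, y, n_iter=50):
    """Fit mu = b0 + b1 * x and log(sigma) = g0 + g1 * x by alternating steps."""
    n = len(y)
    # Starting values: ordinary least squares for mu, constant sigma.
    b0, b1 = wls_line(x, y, [1.0] * n)
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    g0, g1 = 0.5 * math.log(rss / n), 0.0
    for _ in range(n_iter):
        sigma = [math.exp(g0 + g1 * xi) for xi in x]
        # Location step: weighted least squares with weights 1 / sigma^2.
        b0, b1 = wls_line(x, y, [1.0 / s ** 2 for s in sigma])
        # Scale step: Fisher scoring on eta = log(sigma). The score per
        # observation is z^2 - 1 and the expected information is 2, which
        # gives the working response eta + (z^2 - 1) / 2 with equal weights.
        z2 = [((yi - b0 - b1 * xi) / s) ** 2 for xi, yi, s in zip(x, y, sigma)]
        work = [g0 + g1 * xi + (z - 1.0) / 2.0 for xi, z in zip(x, z2)]
        g0, g1 = wls_line(x, work, [1.0] * n)
    return (b0, b1), (g0, g1)

# Simulated data: mean 1 + 2x and log standard deviation -1 + 0.5x.
rng = random.Random(1)
x = [rng.uniform(0.0, 2.0) for _ in range(2000)]
y = [rng.gauss(1.0 + 2.0 * xi, math.exp(-1.0 + 0.5 * xi)) for xi in x]

(b0, b1), (g0, g1) = fit_normal_location_scale(x, y)
print("location coefficients:", round(b0, 3), round(b1, 3))
print("log-scale coefficients:", round(g0, 3), round(g1, 3))
```

The log link plays the role of a link function that keeps the scale parameter positive whatever the value of the linear predictor; a constant-variance regression would correspond to forcing `g1` to zero.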
Applications of the model
For centile estimation, the WHO Multicentre Growth Reference Study Group have recommended GAMLSS and the Box–Cox power exponential (BCPE) distributions[6] for the construction of the WHO Child Growth Standards.[7][8] Recent studies have used GAMLSS to predict the probability of cyanobacterial toxins exceeding critical health thresholds in lakes,[9] as well as in applications related to remote sensing,[10] biogeochemical modeling[11] and medicine.[12]
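Centile estimation from a fitted distribution can be sketched as below. The example deliberately uses the simpler Box–Cox normal transformation (the LMS-type relationship between a measurement and its z-score) rather than the full BCPE distribution, which has an additional kurtosis parameter. The values of `mu`, `sigma` and `nu` are hypothetical and stand in for fitted parameters at one value of the explanatory variable (for example, one age); they are not taken from any growth standard.

```python
import math
from statistics import NormalDist

def lms_centile(p, mu, sigma, nu):
    """Centile at probability p for median mu, scale sigma and Box-Cox power nu."""
    z = NormalDist().inv_cdf(p)
    if abs(nu) < 1e-12:
        return mu * math.exp(sigma * z)          # limiting case nu -> 0
    return mu * (1.0 + nu * sigma * z) ** (1.0 / nu)

def lms_zscore(y, mu, sigma, nu):
    """Inverse transformation: z-score of a measurement y."""
    if abs(nu) < 1e-12:
        return math.log(y / mu) / sigma
    return ((y / mu) ** nu - 1.0) / (nu * sigma)

# Hypothetical fitted parameters at a single value of the explanatory variable.
mu, sigma, nu = 10.0, 0.12, -0.5

c3 = lms_centile(0.03, mu, sigma, nu)
c50 = lms_centile(0.50, mu, sigma, nu)
c97 = lms_centile(0.97, mu, sigma, nu)
print("3rd, 50th and 97th centiles:", round(c3, 3), round(c50, 3), round(c97, 3))
print("z-score of the 97th centile:", round(lms_zscore(c97, mu, sigma, nu), 3))
```

Repeating the calculation over a grid of ages, with the parameters evaluated from their fitted smooth functions at each age, would trace out centile curves.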
What distributions can be used
The form of the distribution assumed for the response variable y is very general. For example, an implementation of GAMLSS in R[13] has around 100 different distributions available. Such implementations also allow use of truncated distributions and censored (or interval) response variables.[13]
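The following sketch shows, under simple assumptions, how truncation and interval censoring change the likelihood contribution of an observation. A normal distribution is used only as a convenient stand-in for any parametric family, and the helper names are hypothetical; they do not correspond to functions of the R implementation cited above.

```python
import math
from statistics import NormalDist

def truncated_pdf(y, dist, lower, upper):
    """Density of dist restricted to [lower, upper] and renormalised."""
    if not lower <= y <= upper:
        return 0.0
    return dist.pdf(y) / (dist.cdf(upper) - dist.cdf(lower))

def interval_censored_loglik(lower, upper, dist):
    """Log-likelihood of a response known only to lie in (lower, upper]."""
    return math.log(dist.cdf(upper) - dist.cdf(lower))

# Truncation: a standard normal restricted to [0, 3] still integrates to one.
base = NormalDist(0.0, 1.0)
steps = 3000
width = 3.0 / steps
area = sum(truncated_pdf((i + 0.5) * width, base, 0.0, 3.0) * width
           for i in range(steps))  # midpoint rule
print("area under truncated density:", round(area, 6))

# Interval censoring: the response is only known to lie between 1 and 2.
loglik = interval_censored_loglik(1.0, 2.0, NormalDist(1.5, 1.0))
print("interval-censored log-likelihood:", round(loglik, 4))
```

In both cases the distribution parameters can still depend on explanatory variables; only the expression entering the likelihood changes.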
References
- Rigby, R.A.; Stasinopoulos, D.M. (2005). "Generalized additive models for location, scale and shape". Journal of the Royal Statistical Society Series C: Applied Statistics. 54 (3): 507–554. doi:10.1111/j.1467-9876.2005.00510.x.
- Nelder, J.A.; Wedderburn, R.W.M (1972). "Generalized linear models". J. R. Stat. Soc. A. 135 (3): 370–384. doi:10.2307/2344614. JSTOR 2344614.
- Hastie, TJ; Tibshirani, RJ (1990). Generalized additive models. London: Chapman and Hall.
- Stasinopoulos, D. Mikis; Rigby, Robert A (December 2007). "Generalized additive models for location scale and shape (GAMLSS) in R". Journal of Statistical Software. 23 (7). doi:10.18637/jss.v023.i07.
- Bann, David; Wright, Liam; Cole, Tim J (2022). "Risk factors relate to the variability of health outcomes as well as the mean: A GAMLSS tutorial". eLife. 11 (11). doi:10.7554/eLife.72357. PMC 8791632. PMID 34985412.
- Rigby, Robert; Stasinopoulos, D. Mikis (February 2004). "Smooth Centile Curves for Skew and Kurtotic data Modelled Using the Box–Cox Power Exponential Distribution". Statistics in Medicine. 23 (19): 3053–3076. doi:10.1002/sim.1861. PMID 15351960.
- Borghi, E.; De Onis, M.; Garza, C.; Van Den Broeck, J.; Frongillo, E. A.; Grummer-Strawn, L.; Van Buuren, S.; Pan, H.; Molinari, L.; Martorell, R.; Onyango, A. W.; Martines, J. C.; WHO Multicentre Growth Reference Study Group (2006). "Construction of the World Health Organization child growth standards: Selection of methods for attained growth curves". Statistics in Medicine. 25 (2): 247–265. doi:10.1002/sim.2227. PMID 16143968.
- WHO Multicentre Growth Reference Study Group (2006). WHO Child Growth Standards: Length/height-for-age, weight-for-age, weight-for-length, weight-for-height and body mass index-for-age: Methods and development. Geneva: World Health Organization.
- Merder, Julian; Harris, Ted; Zhao, Gang; Stasinopoulos, Dimitrios M.; Rigby, Robert A.; Michalak, Anna M. (October 2023). "Geographic redistribution of microcystin hotspots in response to climate warming". Nature Water. 1 (10): 844–854. doi:10.1038/s44221-023-00138-w. ISSN 2731-6084.
- Merder, Julian; Zhao, Gang; Pahlevan, Nima; Rigby, Robert A.; Stasinopoulos, Dimitrios M.; Michalak, Anna M. (1 April 2024). "A novel algorithm for ocean chlorophyll-a concentration using MODIS Aqua data". ISPRS Journal of Photogrammetry and Remote Sensing. 210: 198–211. doi:10.1016/j.isprsjprs.2024.03.014. ISSN 0924-2716.
- Kida, Morimaru; Merder, Julian; Dittmar, Thorsten; Pawlowsky-Glahn, Vera; Egozcue, Juan Jose (13 October 2025). "Reframing natural organic matter research through compositional data analysis". doi:10.31223/x51x7p.
- Judah, Hannah R.; Rigby, Robert A.; Stasinopoulos, Mikis D.; Pateras, Konstantinos; Rahim, Mussarat N.; Heneghan, Michael A.; Nicolaides, Kypros H.; Kametas, Nikos A. (1 July 2025). "Reference ranges for liver function tests in pregnancy controlling for maternal characteristics". American Journal of Obstetrics & Gynecology. 0 (0). doi:10.1016/j.ajog.2025.06.056. ISSN 0002-9378. PMID 40609852.
- "The R packages | gamlss". The R packages | gamlss. Retrieved 4 May 2020.
Further reading
- Beyerlein, A.; Fahrmeir, L.; Mansmann, U.; Toschke, A. M. (2008). "Alternative regression models to assess increase in childhood BMI". BMC Medical Research Methodology. 8: 59. doi:10.1186/1471-2288-8-59. PMC 2543035. PMID 18778466.
- Cole, T. J., Stanojevic, S., Stocks, J., Coates, A. L., Hankinson, J. L., Wade, A. M. (2009), "Age- and size-related reference ranges: A case study of spirometry through childhood and adulthood", Statistics in Medicine, 28(5), 880–898.
- Fenske, N., Fahrmeir, L., Rzehak, P., Hohle, M. (25 September 2008), "Detection of risk factors for obesity in early childhood with quantile regression methods for longitudinal data", Department of Statistics: Technical Reports, No.38
- Hudson, I. L., Kim, S. W., Keatley, M. R. (2010), "Climatic Influences on the Flowering Phenology of Four Eucalypts: A GAMLSS Approach". In Phenological Research, Irene L. Hudson and Marie R. Keatley (eds), Springer Netherlands
- Hudson, I. L., Rea, A., Dalrymple, M. L., Eilers, P. H. C. (2008), "Climate impacts on sudden infant death syndrome: a GAMLSS approach", Proceedings of the 23rd international workshop on statistical modelling pp. 277–280.
- Nott, D (2006). "Semiparametric estimation of mean and variance functions for non-Gaussian data". Computational Statistics. 21 (3–4): 603–620. CiteSeerX 10.1.1.117.6518. doi:10.1007/s00180-006-0017-9. S2CID 16900583.
- Serinaldi, F (2011). "Distributional modeling and short-term forecasting of electricity prices by Generalized Additive Models for Location, Scale and Shape". Energy Economics. 33 (6): 1216–1226. doi:10.1016/j.eneco.2011.05.001.
- Serinaldi, F.; Cuomo, G. (2011). "Characterizing impulsive wave-in-deck loads on coastal bridges by probabilistic models of impact maxima and rise times". Coastal Engineering. 58 (9): 908–926. doi:10.1016/j.coastaleng.2011.05.010.
- Serinaldi, F., Villarini, G., Smith, J. A., Krajewski, W. F. (2008), "Change-Point and Trend Analysis on Annual Maximum Discharge in Continental United States", American Geophysical Union Fall Meeting 2008, abstract #H21A-0803
- van Ogtrop, F. F.; Vervoort, R. W.; Heller, G. Z.; Stasinopoulos, D. M.; Rigby, R. A. (2011). "Long-range forecasting of intermittent streamflow". Hydrology and Earth System Sciences Discussions. 8 (1): 681–713. doi:10.5194/hessd-8-681-2011.
- Villarini, G.; Serinaldi, F. (2011). "Development of statistical models for at-site probabilistic seasonal rainfall forecast". International Journal of Climatology. 32 (14): 2197–2212. doi:10.1002/joc.3393.
- Villarini, G.; Serinaldi, F.; Smith, J. A.; Krajewski, W. F. (2009). "On the stationarity of annual flood peaks in the continental United States during the 20th century". Water Resources Research. 45 (8). Bibcode:2009WRR....45.8417V. doi:10.1029/2008wr007645. Archived from the original on 6 June 2011. Retrieved 27 May 2010.
- Villarini, G.; Smith, J. A.; Napolitano, F. (2010). "Nonstationary modeling of a long record of rainfall and temperature over Rome". Advances in Water Resources. 33 (10): 1256–1267. Bibcode:2010AdWR...33.1256V. doi:10.1016/j.advwatres.2010.03.013.