Statistical models for data analyses

A statistical model must, foremost, reflect the biology of the problem. A true model would describe the pattern of the data perfectly, but it is usually unknown. An ideal model is one that comes close to the true model, based on an understanding of the problem. At times, because of missing information or computational problems, an ideal model has to be simplified to an operational model, that is, a model that permits predictions to be made with an acceptable level of accuracy. Whenever an operational model is used instead of an ideal one, it is recommended to outline the principles of the ideal model, give the reasons for not using it and describe the problems likely to arise as a result. The ultimate choice of model will depend on the traits being studied and on the pattern of variation they exhibit.

The statistical models commonly used in animal breeding are linear models, in which the factors are assumed to affect the observations additively. The choice of linear models has been influenced by the fact that most economically important traits studied are linear in nature (Schaeffer, 1991). More recently, non-linear models have been used to evaluate traits that exhibit categorical phenotypes (Ducrocq, 1997), and covariance functions have been used in the analysis of longitudinal data (Meyer, 1998).

Components of a model

Dependent vs. independent variables: A model comprises the factors or variables that influence a trait. The trait under study is termed the dependent variable, while the factors affecting it are termed independent variables. The essence of constructing a model is to determine which independent variables affect the dependent variable, estimate the magnitude of each effect and draw inferences that can be applied to change animal populations.

Characteristics of independent variables

Independent variables are broadly grouped into two categories: fixed effects and random effects. Fixed effects are those estimated using information from the data only, so any conclusion drawn about the estimated mean for the trait applies only to the study itself. Fixed effects can be either discrete or continuous. Discrete factors have distinct levels, whereas continuous variables have a range of values assumed to follow a certain pattern (generally linear or quadratic). For example, it is known that calf weight at birth can be influenced by the sex of the calf, the season when the dam calved, the age of the dam, the dam and the sire of the calf. For the sex of the calf there are two levels, i.e. male or female. Age of dam, however, can be considered as a continuous variable, say 3–12 years of age. When we fit a continuous variable, we may fit a straight line or a polynomial function of this variable. The slope of a fitted straight line is known as a regression coefficient [Biometrics example 1]. Instead of treating age as a continuous variable, it is also possible to classify age of dam into different age categories (e.g. 3, 4–6, 7–9, 10–12 years) and treat the factor as discrete with four levels [Biometrics example 2].
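
The two ways of handling age of dam can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are hypothetical and noise-free, the variable names (sex, age_dam, birth_wt) are not from the source, and ordinary least squares via NumPy stands in for whatever software is actually used.

```python
import numpy as np

# Hypothetical, noise-free records for illustration only (not real data).
sex = np.array([0, 1, 0, 1, 0, 1, 0, 1])           # 0 = female, 1 = male
age_dam = np.array([3, 4, 5, 6, 8, 9, 11, 12])     # age of dam in years
birth_wt = 25.0 + 2.0 * sex + 0.5 * age_dam        # kg, built from known values


def fit(X, y):
    """Return the ordinary least-squares solution of y = Xb."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b


# (a) Age of dam as a continuous variable:
#     columns are intercept, sex and a linear regression on age.
X_cont = np.column_stack([np.ones_like(birth_wt), sex, age_dam])
b_cont = fit(X_cont, birth_wt)        # b_cont[2] is the regression coefficient

# (b) Age of dam as a discrete factor with four classes: 3, 4-6, 7-9, 10-12.
age_class = np.digitize(age_dam, bins=[4, 7, 10])  # class index 0..3
class_dummies = np.eye(4)[age_class]               # one indicator column per class
X_disc = np.column_stack([sex, class_dummies])
b_disc = fit(X_disc, birth_wt)        # sex effect followed by four class effects

print("continuous fit:", np.round(b_cont, 3))
print("discrete fit:  ", np.round(b_disc, 3))
```

In (a) a single slope describes the effect of age, while in (b) each age class receives its own effect and no particular shape is imposed on the relationship.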

A covariable is a factor that is known to affect a performance trait and so adds ‘noise’ to the variable of interest. When there is a significant relationship between the trait being analysed and a covariable, a proportion of the natural variation among animals is explained by this covariable. Fitting the covariable therefore improves the precision of comparisons between the mean values of primary interest [Biometrics example 2].

When a factor is considered to be random, however, results of the study can be extrapolated to the wider population from which the sample under investigation is assumed to have been drawn at random. Sire, for example, is a factor that can be either fixed or random. If sires have been selected purposively for an experiment, then it is likely that we would treat the factor as fixed and calculate mean values for each sire separately. More often, though, it will be assumed that sires have been chosen at random from a wider population. In such cases the effect for sire is assumed to be random and any inferences made from the study are generalized to the wider population of which the sires are representative. To construct a model for data analysis, the researcher has to decide, based on an understanding of the data, whether each factor is fixed or random. As a rule of thumb, a factor is treated as random whenever one wants to make use of prior information about the variable of interest.
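
The practical consequence of this choice can be illustrated with a small sketch. It assumes a one-way model with an overall mean and a sire effect, hypothetical progeny records, and an assumed ratio lam of residual variance to sire variance; none of these values comes from the source. With sire fixed, each sire mean rests on that sire's records alone. With sire random, the mixed model equations add lam to the sire diagonal, which is how the prior information enters.

```python
import numpy as np

# Hypothetical progeny records (illustrative values only, not real data).
sire = np.array([0, 0, 0, 0, 1, 1, 2])                      # sire of each calf
y = np.array([31.0, 29.0, 33.0, 31.0, 27.0, 29.0, 36.0])    # birth weight, kg
lam = 15.0    # ASSUMED ratio of residual variance to sire variance

n_sires = 3
X = np.ones((len(y), 1))        # column for the overall mean (fixed)
Z = np.eye(n_sires)[sire]       # incidence matrix linking records to sires
n = Z.sum(axis=0)               # number of progeny per sire

# Sire as FIXED: each sire mean uses only that sire's own records.
sire_means = (Z.T @ y) / n

# Sire as RANDOM: mixed model equations, with lam added to the sire diagonal.
lhs = np.block([[X.T @ X, X.T @ Z],
                [Z.T @ X, Z.T @ Z + lam * np.eye(n_sires)]])
rhs = np.concatenate([X.T @ y, Z.T @ y])
sol = np.linalg.solve(lhs, rhs)
mu_hat, u_hat = sol[0], sol[1:]

print("sire means (fixed):       ", sire_means)
print("deviations from the mean: ", np.round(sire_means - mu_hat, 3))
print("sire effects (random):    ", np.round(u_hat, 3))
```

The random sire effects are the deviations of the sire means from the estimated overall mean, each multiplied by n / (n + lam). Sires with few progeny are therefore pulled most strongly towards zero.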

Types of models used

A model comprises three parts: (1) the equation which describes the factors (effects) and their levels; (2) the specification of the distribution characteristics of random effects; and (3) assumptions, restrictions and limitations in the use of the model. There are various types of linear models. The name given to a model depends on whether it contains only regression variables or fixed discrete effects, and on how many fixed effects it contains; whether there are any interactions between factors; and whether it contains only fixed effects, only random effects, or both. Thus, according to Searle (1971) and Snedecor and Cochran (1980), some of the names that one can come across are:

i. linear regression models—simple or multiple linear regression
ii. correlation models
iii. classification models—one-way, two-way, three-way classification of factors
iv. classification models with interactions
v. nested (or hierarchical) models
vi. cross-classification models
vii. random models—all factors considered random
viii. mixed models—combination of fixed and random effects
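
As an illustration of the three parts of a model, a mixed model (item viii) is commonly written in matrix notation as below. The symbols are assumptions introduced here, not taken from the source: y is the vector of observations, b the fixed effects, u the random effects, e the residuals, X and Z the incidence matrices, and G and R the covariance matrices of u and e.

```latex
% (1) the equation
\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e}

% (2) distribution characteristics of the random effects
E(\mathbf{u}) = \mathbf{0}, \quad E(\mathbf{e}) = \mathbf{0}, \quad
\operatorname{Var}(\mathbf{u}) = \mathbf{G}, \quad
\operatorname{Var}(\mathbf{e}) = \mathbf{R}

% (3) an example of an assumption
\operatorname{Cov}(\mathbf{u}, \mathbf{e}) = \mathbf{0}
```

Dropping the term in u gives a model with fixed effects only, while a random model keeps only an overall mean in b.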

Analytical models may be fitted to one trait at a time (single-trait models) or to several traits at the same time (multi-trait models). When several traits are assessed on the same individuals, the researcher is often interested in both the phenotypic and the genetic correlations between them. The models used generally involve various assumptions. A common one is that residuals are normally distributed and that each observation was obtained randomly and independently. This does not always hold: in a multi-trait analysis, observations of different traits on the same animal are not independent, and in a single-trait analysis with repeated measures, observations on the same animal are not independent either.

Repeated measures on an animal can cause difficulties because adjacent observations tend to be more closely correlated than those further apart. Statistical procedures are generally fairly robust, and slight departures from normality can be ignored. When data are clearly not normally distributed, they should be appropriately transformed, or alternative non-linear techniques should be applied.
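
One simple way to picture correlations that weaken with distance is a first-order autoregressive pattern, in which records k steps apart have correlation rho to the power k. This is only one possible structure, chosen here for illustration; the value of rho and the number of records are assumptions.

```python
import numpy as np

rho = 0.6                  # ASSUMED correlation between adjacent records
times = np.arange(5)       # five repeated measures on one animal

# Number of steps separating every pair of records.
lag = np.abs(times[:, None] - times[None, :])

# Correlation matrix: 1 on the diagonal, rho**k for records k steps apart.
R = rho ** lag

print(np.round(R, 3))
```

Moving away from the diagonal of the printed matrix, the correlations fall from 0.6 for adjacent records towards zero for records further apart.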

For small data sets described by simple models (with a small number of factors), solving the equations may be quite easy. However, data sets in animal breeding can be very large, and the results for the trait being evaluated can be influenced by many factors, some of which may have an uneven number of observations within each subgroup (unbalanced data). For example, dairy data can include records from thousands of herds, taken over many years, and some information can be missing for some herds or years. Cows within a herd can be of various genotypes and ages, may have been in lactation for different lengths of time, and so on. The statistical models required for such data sets can therefore be complicated, resulting in computational difficulties. Over time, different techniques have been developed to deal with such data, e.g. absorbing a factor to reduce the size of the system of equations to be solved, calculating the solutions of the equation system iteratively, or including certain covariables or secondary factors in a preliminary step and adjusting the data for them before fitting the final model (Henderson, 1984).
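
Solving a system of equations iteratively can be sketched with Gauss-Seidel iteration, one of several iterative schemes and not necessarily the one used in any particular evaluation. The coefficient matrix and right-hand side below are hypothetical, and the function name gauss_seidel is introduced only for this example.

```python
import numpy as np


def gauss_seidel(A, rhs, tol=1e-10, max_iter=10_000):
    """Solve A x = rhs one equation at a time, reusing the latest solutions."""
    x = np.zeros(len(rhs))
    for _ in range(max_iter):
        max_change = 0.0
        for i in range(len(rhs)):
            # Contribution of all other unknowns at their current values.
            off_diag = A[i] @ x - A[i, i] * x[i]
            new_value = (rhs[i] - off_diag) / A[i, i]
            max_change = max(max_change, abs(new_value - x[i]))
            x[i] = new_value
        if max_change < tol:      # stop once a full round changes nothing
            break
    return x


# Hypothetical small, diagonally dominant system (illustration only).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])
rhs = np.array([1.0, 2.0, 3.0])

x_iter = gauss_seidel(A, rhs)
x_direct = np.linalg.solve(A, rhs)
print("iterative:", np.round(x_iter, 6))
print("direct:   ", np.round(x_direct, 6))
```

Each round needs only one row of the coefficient matrix at a time, so the full matrix never has to be inverted. That is the attraction of iteration when the system is very large.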

Sometimes the trait of interest is measured qualitatively rather than quantitatively, and observations are assigned to distinct categories or classes based on a qualitative assessment of the trait. For example, cows may be diagnosed clinically as having mastitis and coded 1, or diagnosed as healthy and coded 0. Such data, when expressed as the proportion of cases occurring for different levels of a factor, often follow a binomial rather than a normal distribution. These data do not lend themselves to direct analysis by linear models for continuous traits, although, where large amounts of data have been collected, a normal approximation can be assumed (Harville and Mee, 1984). In some cases, use of a threshold (probit) model is advisable.
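
The 0/1 coding, the normal approximation and the idea behind a threshold model can be sketched as follows. The counts are hypothetical. The threshold calculation assumes an underlying liability that follows a standard normal distribution, with a cow recorded as a case when her liability exceeds the threshold.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical records: 1 = mastitis, 0 = healthy (illustration only).
records = [1] * 20 + [0] * 80
n = len(records)
p = sum(records) / n                  # observed incidence

# Normal approximation: standard error of a binomial proportion.
se = sqrt(p * (1 - p) / n)

# Threshold (probit) view: the point on a standard normal liability scale
# above which the observed proportion of cases falls.
threshold = NormalDist().inv_cdf(1 - p)

print(f"incidence = {p:.2f}, standard error = {se:.3f}")
print(f"threshold on the liability scale = {threshold:.3f}")
```

The normal approximation becomes less reliable as the incidence approaches 0 or 1 or when the number of records is small, which is where a threshold model is more likely to be preferred.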