Measuring genetic diversity from molecular data

Determining genetic structure and genetic variability between and within breeds

To understand the influence of selection, mating systems and other breeding interventions in population genetics, it is important to describe and quantify the amount of genetic variation in a population and the pattern of genetic variation among populations. Genetic variation may be measured at various levels, e.g. allelic variation at structural loci (see Module 2, Section 3). Genetic variation within breeds decreases as a result of selection for economically important traits, yet genetic variation between and within breeds is important as raw material for genetic improvement. Populations showing a great deal of variation will be able to adapt to changing circumstances, whereas populations with less genetic variability will be less adaptable to sudden environmental changes.

Allele frequency determination and allelic variability

The frequencies of the alleles at each locus are calculated by direct counting. The mean number of alleles (MNA) observed over a range of loci for different populations is considered to be a reasonable indicator of genetic variation, provided that the populations are at mutation-drift equilibrium and that the sample size is almost the same for each population. A low MNA indicates low genetic variation, which may result from genetic isolation, historical population bottlenecks or founder effects. A high MNA implies great allelic diversity, which could have been influenced by cross-breeding or admixture. Bar charts can be created for individual breeds to show variability in allelic distributions at loci.

Because sample sizes are rarely the same for each population analysed, other indicators of allelic variability are also used, including the effective number of alleles (ENA) and allelic richness (Ar). ENA denotes the number of equally frequent alleles it would take to achieve a given level of gene diversity. It allows one to compare populations where the number and distribution of alleles differ drastically. Ar is also a measure of the number of alleles per locus, but it allows comparisons to be made between samples of different sizes by using the rarefaction technique or a Bayesian simulation approach to standardize populations to a uniform sample size.
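
The measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by any particular program: the function names and the example genotypes are hypothetical, ENA is taken as the reciprocal of the sum of squared allele frequencies, and Ar is computed with the rarefaction formula (the expected number of distinct alleles in a random subsample of g gene copies).

```python
from collections import Counter
from math import comb


def allele_counts(genotypes):
    """Count gene copies per allele from diploid genotypes given as (allele, allele) tuples."""
    return Counter(allele for genotype in genotypes for allele in genotype)


def allele_frequencies(counts):
    """Convert allele counts to frequencies by direct counting."""
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def mean_number_of_alleles(counts_by_locus):
    """MNA: number of distinct alleles observed, averaged over loci."""
    return sum(len(c) for c in counts_by_locus.values()) / len(counts_by_locus)


def effective_number_of_alleles(freqs):
    """ENA = 1 / sum(p_i^2): equally frequent alleles giving the same gene diversity."""
    return 1.0 / sum(p * p for p in freqs.values())


def allelic_richness(counts, g):
    """Expected number of distinct alleles in a random subsample of g gene copies."""
    n = sum(counts.values())
    if g > n:
        raise ValueError("subsample size g exceeds the number of sampled gene copies")
    return sum(1.0 - comb(n - c, g) / comb(n, g) for c in counts.values())


# Hypothetical genotypes for one breed at one locus (illustrative data only).
genotypes = [("A", "A"), ("A", "B"), ("B", "C"), ("A", "C")]
counts = allele_counts(genotypes)
freqs = allele_frequencies(counts)
print(freqs)
print(effective_number_of_alleles(freqs))
print(allelic_richness(counts, 2))
```

To compare breeds, g would normally be set to the number of gene copies in the smallest sample, so that every breed is standardized to the same sample size.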

Variation in gene frequencies

The variation in gene frequencies at each locus can be used to determine genetic variability between breeds. Chi square analysis is used to test differences among loci and breeds.

Variation in genotype frequencies

Variability between breeds can be measured using the observed genotypes at each locus and between pairs of breeds. The assumption of independent distribution of genotypes over all breeds can be tested by contingency Chi square analysis. Comparisons between pairs of breeds are performed.
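
As a rough illustration of the contingency Chi square analysis, the sketch below computes the statistic and its degrees of freedom for a table with one row per breed and one column per genotype. The same function can be applied to a table of allele counts per breed. The counts are invented for the example, and the function assumes that every row and column total is greater than zero.

```python
def chi_square_contingency(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of breed and genotype.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


# Hypothetical genotype counts (AA, AB, BB) for two breeds at one locus.
genotype_counts = [
    [30, 50, 20],  # breed 1
    [10, 40, 50],  # breed 2
]
statistic, df = chi_square_contingency(genotype_counts)
print(statistic, df)
```

The statistic is then compared with the Chi square distribution for the given degrees of freedom; with 2 degrees of freedom the 5% critical value is about 5.99.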

Testing for Hardy-Weinberg equilibrium

Most deductions about populations and quantitative genetics depend on the relationship between gene frequencies and genotype frequencies. A population is said to be in Hardy-Weinberg equilibrium (HWE) when gene and genotype frequencies remain constant from generation to generation. There are factors which can cause changes in these frequencies (e.g. selection, migration and mutation) resulting in non-random union of gametes. Deviation from HWE in a population indicates possible inbreeding, population stratification and sometimes problems with the genotyping. In populations where individuals may be affected by particular ailments or may be under different selective pressures, these deviations can also provide evidence for association. The data required to perform HWE tests are gene and genotype frequencies and the size of sample population at each locus.

The deviation from HWE can be tested using any one of the following four methods:

  1. The Chi square statistic for asymptotic tests has been used to evaluate the overall discordance of genotype frequencies at each locus or population combination (Hammond et al., 1994; Deka et al., 1995). The test is performed for every breed at each locus.
  2. The likelihood ratio test criterion (G statistic) has also been used to contrast observed and expected genotype frequencies (Hammond et al., 1994; Deka et al., 1995).
  3. The third method uses an exact test of HWE (conditional exact test which is analogous to Fisher’s exact test for contingency tables). In addition, for loci or population combinations with five or more alleles, a Markov chain algorithm is used to obtain an unbiased estimate of the exact probability of being wrong in rejecting HWE. This method should be preferred for small sample sizes and multi-allelic loci since the Chi square test is not valid in such cases.
  4. Recently, there has been great interest in testing for HWE in genome-wide association analysis (GWAA), in which departures from HWE may indicate problems with quality control for the SNP in question. A fourth, recently derived method is based on Bayesian simulations and performs an exact test on the basis of the comparison between weighted likelihoods under the null and alternative hypotheses. The ratio of these two weighted likelihoods gives the Bayes factor (BF). A distribution of the BF under the null hypothesis defines a natural order in the sample space. The discreteness of the sample space causes no complications for the Bayesian approach because all inferences are conditional on the configuration of the observed counts, which negates the need to consider hypothetical data realizations. The test is therefore exact and unconditional and does not depend on asymptotic results. In addition, the test is desirable in terms of decision theory, as it minimizes a linear combination of type I and type II errors.

With the exception of the Bayesian approach, these tests for HWE can be carried out with GENEPOP, FSTAT, ARLEQUIN and the R programming language.
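
The first two tests can be illustrated for the simplest case of one locus with two alleles. The sketch below is an assumption-laden example rather than the implementation used by the programs named above: it takes the three genotype counts, estimates the allele frequency by counting, derives the expected counts under HWE, and returns the Chi square and G statistics (each with 1 degree of freedom in the biallelic case). It assumes that both alleles are present in the sample.

```python
from math import log


def hwe_tests(n_aa, n_ab, n_bb):
    """Chi-square and G statistics for deviation from HWE at a biallelic locus."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A by direct counting
    q = 1.0 - p
    observed = [n_aa, n_ab, n_bb]
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Classes with an observed count of zero contribute nothing to G.
    g_statistic = 2.0 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
    return chi_square, g_statistic, expected


# Hypothetical genotype counts (AA, AB, BB) for one breed at one locus.
chi_square, g_statistic, expected = hwe_tests(298, 489, 213)
print(chi_square, g_statistic, expected)
```

Both statistics are compared with the Chi square distribution; values above about 3.84 (1 degree of freedom, 5% level) would suggest a departure from HWE. For small samples or loci with many alleles the exact test remains the safer choice.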

Estimating average heterozygosity

Heterozygosity is a measure of genetic variation within a population. High heterozygosity values for a breed may be due to long-term natural selection for adaptation, to the mixed nature of the breeds or to historic mixing of strains of different populations. A low level of heterozygosity may be due to isolation with the subsequent loss of unexploited genetic potential. Locus heterozygosity is related to the polymorphic nature of each locus. A high level of average heterozygosity at a locus could be expected to correlate with high levels of genetic variation at loci with critical importance for adaptive response to environmental changes (Kotzé and Muller, 1994).

The observed heterozygosity is defined as the proportion of loci heterozygous per individual or the proportion of individuals heterozygous per locus. Average heterozygosity at each locus and for each breed can be estimated from allele frequencies at each locus. Individual breed average heterozygosity is estimated by summing heterozygosities at each locus and averaging these values over all loci. Locus heterozygosity is estimated by summing the heterozygosity at that locus for each breed and averaging this quantity over all breeds. The expected heterozygosity (also called gene diversity) is calculated from individual allele frequencies (Nei, 1987). The FSTAT (Goudet, 1995), GENETIX (Belkhir et al., 1996-2004), Microsatellite Analyzer (Dieringer and Schlötterer, 2003) and MSToolkit (Park, 2001) computer programs, as well as packages for R, can be used to estimate both observed and expected heterozygosity per locus and population and across all populations analysed.
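
A minimal sketch of these calculations for one breed is given below. The function names, locus names and genotypes are hypothetical. Expected heterozygosity is computed as one minus the sum of squared allele frequencies, with the commonly applied small-sample correction 2n/(2n - 1) as an option, where n is the number of individuals.

```python
from collections import Counter


def observed_heterozygosity(genotypes):
    """Proportion of individuals that are heterozygous at one locus."""
    return sum(a != b for a, b in genotypes) / len(genotypes)


def expected_heterozygosity(genotypes, unbiased=True):
    """Gene diversity at one locus, computed from allele frequencies."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    n_copies = sum(counts.values())  # 2n gene copies for n diploid individuals
    diversity = 1.0 - sum((c / n_copies) ** 2 for c in counts.values())
    if unbiased:
        diversity *= n_copies / (n_copies - 1)
    return diversity


def breed_average(genotypes_by_locus, estimator):
    """Average a per-locus estimator over all loci for one breed."""
    values = [estimator(g) for g in genotypes_by_locus.values()]
    return sum(values) / len(values)


# Hypothetical genotypes for one breed at two loci.
breed = {
    "locus1": [("A", "A"), ("A", "B"), ("B", "C"), ("A", "C")],
    "locus2": [("X", "Y"), ("X", "X"), ("Y", "Y"), ("X", "Y")],
}
print(breed_average(breed, observed_heterozygosity))
print(breed_average(breed, expected_heterozygosity))
```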

Estimating levels of inbreeding

Molecular data can also be used to estimate inbreeding values, even though two marker alleles may be alike for reasons other than common descent. Observed and expected heterozygosities at different loci can be used to estimate the extent of inbreeding. The locus inbreeding coefficients are averaged to estimate the average inbreeding coefficient for each population. Inbreeding coefficients should only be estimated for breeds which show significant deviation from HWE. A large value reflects a small number of heterozygote genotypes and an excess of homozygote genotypes. A small value indicates that heterozygote genotypes occur at a higher proportion than homozygote genotypes.
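
One common way to express this is the ratio F = 1 - Ho/He, where Ho and He are the observed and expected heterozygosities at a locus. The sketch below assumes this estimator and uses invented heterozygosity values and locus names; it is meant only to show how the locus values are averaged.

```python
def inbreeding_coefficient(h_observed, h_expected):
    """F = 1 - Ho/He for one locus; positive values indicate a heterozygote deficit."""
    return 1.0 - h_observed / h_expected


def mean_inbreeding(heterozygosity_by_locus):
    """Average the locus inbreeding coefficients for one population."""
    values = [inbreeding_coefficient(ho, he) for ho, he in heterozygosity_by_locus.values()]
    return sum(values) / len(values)


# Hypothetical (Ho, He) pairs for one breed at two loci.
breed = {"locus1": (0.60, 0.75), "locus2": (0.70, 0.70)}
print(mean_inbreeding(breed))
```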

Genetic differentiation

Population differentiation can be assessed by determining whether allelic composition is independent of population assignment (Raymond and Rousset, 1995a). The statistical test is based on analysis of contingency tables using a Markov chain procedure to derive an unbiased estimate of the exact probability of being wrong in rejecting the null hypothesis, i.e. allelic composition is independent of population assignment (no differentiation). The test is performed for pair-wise inter-population comparisons on contingency tables containing data from each of the microsatellite loci studied. The FSTAT, GENETIX and POPULATIONS statistical programs can be used to perform the computations.
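
The logic of the test can be illustrated with a simple permutation test for one locus and one pair of populations. This is a stand-in for, not a reproduction of, the Markov chain procedure cited above: gene copies are shuffled between the two populations, and the P-value is the proportion of shuffled tables whose Chi square statistic is at least as large as the observed one. Treating gene copies as exchangeable, the choice of statistic, the number of permutations and the fixed seed are all assumptions made for this sketch.

```python
import random


def chi_square(table):
    """Chi-square statistic for a populations x alleles table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def differentiation_p_value(pop1, pop2, n_perm=999, seed=0):
    """Permutation P-value for 'allelic composition is independent of population'."""
    rng = random.Random(seed)
    pooled = list(pop1) + list(pop2)
    n1 = len(pop1)
    alleles = sorted(set(pooled))

    def statistic(sample1, sample2):
        table = [[sample.count(a) for a in alleles] for sample in (sample1, sample2)]
        return chi_square(table)

    observed = statistic(list(pop1), list(pop2))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign gene copies to populations at random
        if statistic(pooled[:n1], pooled[n1:]) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical allele copies sampled from two breeds at one locus.
breed1 = ["A"] * 15 + ["B"] * 5
breed2 = ["A"] * 5 + ["B"] * 15
print(differentiation_p_value(breed1, breed2))
```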

Analysis of gene flow, genetic admixture and structure

  1. Use of diagnostic alleles

    Diagnostic alleles are alleles that are unique to certain breeds, e.g. alleles unique to indicine breeds or taurine breeds. They are used to determine the purity of breeds, the introgression by one breed type into a population and to determine the genetic composition of breeds. The frequencies of the diagnostic alleles or groups of alleles at a particular locus are averaged to give an estimate of the frequency of the diagnostic alleles in each population.
  2. Estimation of genetic admixture proportions from allele frequencies

    Genetic admixture proportions can be estimated directly using a method developed by Chakraborty (1985), which uses the concept of the gene identity coefficient, i.e. the probability that two genes chosen at random from one or more populations are identical in state. The underlying rationale of this method is that genetic similarity between populations can be expressed as a simple linear function of admixture proportions. This method requires that the parental populations represent the original populations that produced the dihybrid populations of interest. An example would be an Asian breed (or group of Asian breeds) representing an indicine population and a group of African breeds representing a taurine population.

    A computer program called ADMIX (Chakraborty, 1985) uses a vector-matrix approach to produce weighted least squares solutions for each individual admixture proportion with associated standard errors. It also produces correlation coefficients for the weighted least squares solutions that give an indication of the validity of the underlying admixture model (i.e. do present-day Asian zebu and the African breeds serve as adequate surrogates for the original parental populations).

    Another program called GENECLASS 2.0 (Piry et al., 2004) employs multilocus genotypes to select or exclude populations as origins of individuals (assignment and detection of migrants). Both of these tests compute likelihoods using Bayesian simulations, allele frequency data or genetic distances between individuals to assign individuals to their populations of origin or detect recent immigrants.

  3. Evaluating the genetic structure of populations

    The inherent genetic structure of populations can be assessed directly using a method developed by Pritchard et al. (2000) and implemented in the program STRUCTURE. The program implements a model-based clustering method to infer population structure, assign individuals to populations and identify migrants and admixed individuals using multilocus genotype data independent of prior population information. The approach implemented in STRUCTURE assumes a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned probabilistically to populations or jointly to two or more populations if their genotypes indicate them to be admixed.
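
The first two approaches in this list can be illustrated with a short sketch. The first function averages the frequency of diagnostic alleles over loci. The second estimates an admixture proportion by ordinary least squares under the linear model hybrid = m x parent1 + (1 - m) x parent2. This is a deliberately simplified stand-in: it is not the gene identity method of Chakraborty (1985) and not the weighted least squares solution produced by ADMIX. All names, loci and frequencies are hypothetical.

```python
def mean_diagnostic_frequency(freqs_by_locus, diagnostic_by_locus):
    """Average, over loci, the summed frequency of the diagnostic alleles at each locus."""
    per_locus = [
        sum(freqs_by_locus[locus].get(allele, 0.0) for allele in alleles)
        for locus, alleles in diagnostic_by_locus.items()
    ]
    return sum(per_locus) / len(per_locus)


def admixture_proportion(parent1, parent2, hybrid):
    """Least-squares estimate of m in: hybrid = m * parent1 + (1 - m) * parent2."""
    numerator = sum((h - p2) * (p1 - p2) for p1, p2, h in zip(parent1, parent2, hybrid))
    denominator = sum((p1 - p2) ** 2 for p1, p2 in zip(parent1, parent2))
    return numerator / denominator


# Hypothetical allele frequencies in one population at two loci.
population = {
    "locus1": {"a1": 0.2, "a2": 0.8},
    "locus2": {"b1": 0.4, "b2": 0.6},
}
# Hypothetical alleles taken to be diagnostic of the indicine type.
indicine_alleles = {"locus1": ["a1"], "locus2": ["b1"]}
print(mean_diagnostic_frequency(population, indicine_alleles))

# Frequencies of the same alleles in two parental groups and a hybrid breed.
indicine = [0.9, 0.1]
taurine = [0.1, 0.9]
hybrid = [0.3, 0.7]
print(admixture_proportion(indicine, taurine, hybrid))  # estimated indicine proportion
```

The estimate is only as good as the parental surrogates, which is why ADMIX also reports statistics on the fit of the admixture model.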

Tests for linkage disequilibrium

Linkage disequilibrium (LDE) is the non-random association between different loci which may arise from: (i) admixture of populations with different gene frequencies; (ii) chance in small populations (e.g. endangered breeds); (iii) selection favouring one combination of alleles over another; or (iv) the close association between markers in the same linkage group (Falconer and Mackay, 1996). A test can be carried out to check for the existence of an association between the markers studied. The null hypothesis for the LDE test is that the genotypes at one locus are independent of those at another locus. The GENEPOP program (Raymond and Rousset, 1995b) and FSTAT (Goudet, 1995) can be used to test for LDE. GENEPOP prepares contingency tables for all pairs of loci in each population and in a pooled sample of all populations. A probability test (or Fisher exact test) is then performed on each table, using the Markov chain method to obtain P-values.
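
The quantity being tested can be made concrete with the standard disequilibrium coefficient D and the squared correlation r2 between two biallelic loci. The sketch below assumes that gametic phase is known, so that haplotypes can be counted directly; this is a simplification, since the programs above work from genotype contingency tables. The haplotype counts are invented.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) from counts of the four haplotypes at two biallelic loci."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n  # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / n  # frequency of allele B at the second locus
    p_AB = n_AB / n          # frequency of the AB haplotype
    d = p_AB - p_A * p_B     # zero when the two loci are independent
    r_squared = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, r_squared


# Hypothetical haplotype counts (AB, Ab, aB, ab) in one population.
d, r_squared = linkage_disequilibrium(40, 10, 10, 40)
print(d, r_squared)
```

The function assumes both loci are polymorphic; a monomorphic locus would make the denominator of r2 zero.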

Distribution of genetic diversity (population differentiation)

When a population is divided into subpopulations, there is less heterozygosity than there would be if the population was undivided. Founder effects acting on different subpopulations generally lead to subpopulations with allele frequencies that are different from the larger population. Since allele frequency in each generation represents a sample of the previous generation’s allele frequency, there will be greater sampling error in these small groups than there would be in a larger undifferentiated population. Hence, genetic drift will push these smaller demes toward different allele frequencies and allele fixation more quickly than would take place in a larger undifferentiated population. There are two commonly used approaches to quantify the distribution of genetic diversity within and between populations.

  1. Wright’s F statistics 

    The decline in heterozygosity due to subdivision within a population has usually been quantified using an index known as Wright’s F statistic, also known as the fixation index. The F statistic is a measure of the difference between the mean heterozygosity among subdivisions in a population and the potential frequency of heterozygotes if all members of the population mix freely and non-assortatively (Hartl and Clark, 1997). The fixation index ranges from 0 (indicating no differentiation between the overall population and its subpopulations) to a theoretical maximum of 1. In practice, however, the observed fixation index is much less than 1 even in highly differentiated populations. Fixation indices can be determined for different hierarchical levels of a population structure, to indicate, for example, the degree of differentiation between sub-populations within a population, between populations within a group and between groups of populations. To determine the fixation index, the mean heterozygosity at each level must be determined.

  2. AMOVA (Analysis of molecular variance)

    The most commonly used programs for performing AMOVA are Arlequin, GDA and GenAlEx. To perform AMOVA, a distance matrix is created within any of the above programs or included within the input file. For example, Arlequin partitions the sum of squared deviations from the distance matrix into hierarchical variance components, which are tested for significance using permutation tests. The AMOVA approach used in Arlequin is essentially similar to other approaches based on analyses of variance of gene frequencies, but for certain types of data it can also take into account the number of mutations between molecular haplotypes (Φ; see p. 65 of the Arlequin manual and Excoffier et al., 1992).

    • For haplotypic data, Arlequin estimates Φ using information from both the allelic content and frequency of haplotypes (Excoffier et al., 1992).
    • For genotypic data, with an unknown gametic phase (as is the case for most natural populations) the AMOVA is based on F-statistics.

AMOVAs can be used to: (1) describe the partitioning of genetic variation among and within groups; and (2) test user-defined groupings of populations. AMOVA differs from a simple analysis of variance (ANOVA) in that data are arranged hierarchically and mean squares are computed for groupings at all levels of the hierarchy. This allows for hypothesis tests of between-group and within-group differences at several hierarchical levels.
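
Both approaches can be illustrated for the simplest possible design. The first function below computes the fixation index as (HT - HS)/HT for one biallelic locus, assuming equally sized subpopulations. The second partitions the sums of squared deviations from a matrix of squared distances into among-group and within-group components for a single level of grouping and returns the resulting Φ statistic, following the general scheme of Excoffier et al. (1992). It is a sketch only: permutation testing and deeper hierarchies are omitted, and the function names and data are hypothetical.

```python
def fixation_index(subpop_freqs):
    """F_ST = (H_T - H_S) / H_T for one biallelic locus, equal subpopulation sizes assumed."""
    k = len(subpop_freqs)
    h_s = sum(2 * p * (1 - p) for p in subpop_freqs) / k  # mean within-subpopulation value
    p_bar = sum(subpop_freqs) / k
    h_t = 2 * p_bar * (1 - p_bar)  # expected value if the population were undivided
    return (h_t - h_s) / h_t


def amova_phi_st(dist_sq, groups):
    """Single-level AMOVA from squared distances and one group label per individual."""
    n = len(groups)
    labels = sorted(set(groups))
    members_by_group = [[i for i, g in enumerate(groups) if g == lab] for lab in labels]

    def ssd(members):
        # Sum of squared deviations = sum of pairwise squared distances / group size.
        pair_sum = sum(dist_sq[i][j] for a, i in enumerate(members) for j in members[a + 1:])
        return pair_sum / len(members)

    ssd_total = ssd(list(range(n)))
    ssd_within = sum(ssd(m) for m in members_by_group)
    ssd_among = ssd_total - ssd_within
    k = len(labels)
    ms_among = ssd_among / (k - 1)
    ms_within = ssd_within / (n - k)
    # Average group size adjusted for unequal sampling.
    n0 = (n - sum(len(m) ** 2 for m in members_by_group) / n) / (k - 1)
    var_among = (ms_among - ms_within) / n0
    var_within = ms_within
    return var_among / (var_among + var_within)


# Hypothetical allele frequencies in two subpopulations.
print(fixation_index([0.2, 0.8]))

# Hypothetical one-dimensional scores for four individuals in two groups,
# converted to squared distances so the matrix can be fed to the AMOVA sketch.
scores = [0.0, 1.0, 4.0, 5.0]
groups = ["X", "X", "Y", "Y"]
dist_sq = [[(a - b) ** 2 for b in scores] for a in scores]
print(amova_phi_st(dist_sq, groups))
```

In a full analysis the observed Φ would be compared with the distribution obtained by repeatedly permuting individuals among groups.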