Polymorphism at the nucleotide level ranges over at least an order of magnitude within species, and average polymorphism ranges over two orders of magnitude between species. Homo sapiens is among the least polymorphic of all species, with a heterozygous single nucleotide polymorphism (SNP) generally occurring once every 500 to 1000 bp.
By contrast, marine invertebrates such as the sea squirt and echinoderms have an astonishing level of sequence diversity with an SNP every 5 to 10 bp. Diversity is a function of organism-level factors such as population size, generation time, and breeding structure, but variation within and among chromosomes signifies that recombination and mutation rates are also critical. In most species, centromeric and telomeric regions are less recombinogenic, hence have smaller effective population sizes, and tend to be less polymorphic. Even within a locus, polymorphism can vary over an order of magnitude, according primarily to the functional constraint: synonymous substitution rates tend to be uniform, whereas replacements can be excluded from highly conserved domains. Noncoding gene sequences are typically more polymorphic than exons and less polymorphic than intergenic DNA, but core regulatory sequences up to several hundred base pairs in length may often be the most conserved of all sequences.
Significant disparity between two measures of polymorphism, namely, the number of segregating sites and the average heterozygosity, provides evidence for departure from “neutrality”. However, neutrality comes in many flavors, and demographic processes are just as likely as selection to affect the difference between these two measures.
Heterozygosity is a function of allele frequency as well as density, so unexpectedly high or low numbers of heterozygotes relative to the number of SNPs in a population can arise as a result of several processes that may be superimposed on random drift. Thus, rapid population expansion and strong purifying selection both reduce heterozygosity, whereas admixture or balancing selection will increase heterozygosity.
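The two measures contrasted here can be computed directly from a sample of aligned sequences. The following is a minimal Python sketch, not a production implementation: it counts segregating sites, computes the mean number of pairwise differences (used here as the heterozygosity-based measure), and scales their difference with the standard constants of Tajima's D. The function names and the toy alignment are illustrative assumptions, and the sequences are assumed to be equal-length, gap-free strings.

```python
from itertools import combinations
from math import sqrt


def segregating_sites(seqs):
    """Count alignment columns in which more than one nucleotide is present."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def tajimas_d(seqs):
    """Tajima's D; None if the sample is too small or has no segregating sites."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if n < 4 or s == 0:
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    pi = mean_pairwise_differences(seqs)   # heterozygosity-based estimate
    theta_w = s / a1                       # estimate based on segregating sites
    return (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical toy alignment: four sequences, four sites.
sample = ["AAAA", "AAAT", "AATT", "ATTT"]
print(segregating_sites(sample), mean_pairwise_differences(sample), tajimas_d(sample))
```

Under this scaling, negative values correspond to a deficit of heterozygosity relative to the number of segregating sites (as after rapid expansion or under purifying selection), and positive values to an excess (as with admixture or balancing selection).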
Tests such as Tajima’s D have remained useful descriptors of diversity, but have been joined by a new series of tests that are more firmly rooted in coalescent theory. Rather than strictly interpreting test scores relative to theoretical expectations, comparison of the distribution of test scores across tens or hundreds of loci among species emphasizes that diversity is affected by a complex interplay of factors and that it is the location of a gene at either extreme of the continuum that marks it as a candidate target of selection, rather than a p-value per se.
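The empirical approach described above, in which a locus is flagged by its position in the genome-wide distribution of a test score rather than by a theoretical p-value, might be sketched as follows. This is only an illustration: the function name, the 5% tail fraction, and the synthetic scores are assumptions, not values taken from any study.

```python
def empirical_tail_loci(scores, tail_fraction=0.05):
    """Return the loci in the lower and upper tails of an empirical distribution.

    scores: dict mapping locus name -> test score (for example, Tajima's D).
    tail_fraction: share of loci to report from each extreme (assumed value).
    """
    ranked = sorted(scores, key=scores.get)          # locus names, lowest score first
    k = max(1, round(len(ranked) * tail_fraction))   # number of loci per tail
    return ranked[:k], ranked[-k:]


# Hypothetical scores for 100 loci, evenly spaced between -2.0 and about 1.96.
scores = {f"locus{i:03d}": -2.0 + 0.04 * i for i in range(100)}
low_tail, high_tail = empirical_tail_loci(scores)
print(low_tail)
print(high_tail)
```

Loci returned in either list are candidates only; the ranking says nothing about which process, demographic or selective, placed them at the extreme.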
A trend toward empirical evaluation of significance by permutation in light of genomic data is also seen in relation to population structure. Standard F-statistics introduced by Sewall Wright based on differences in genotype frequencies among populations have been extended into an analysis of molecular variance (AMOVA) framework, one popular implementation of which is the Arlequin software. Estimates of SNP, indel, haplotype, or microsatellite allele frequency differences are sensitive to sample size, so samples of at least 100 individuals per population are recommended. With genomic data, the multiple-comparison issue also arises: in a set of 500 sites, a single site with a testwise p-value of 0.0001 is not unexpected, but in a large sample, this may correspond to an allele frequency difference of just 10%.
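To make the permutation idea concrete, the sketch below computes a simple heterozygosity-based F_ST for one biallelic site in two populations and evaluates it by shuffling individuals between populations. It is a deliberately reduced illustration, not AMOVA and not the procedure implemented in Arlequin; the genotype coding (0, 1, or 2 copies of one allele), the function names, and the default of 999 permutations are assumptions.

```python
import random


def allele_freq(genotypes):
    """Frequency of the counted allele; genotypes are 0, 1, or 2 copies per diploid."""
    return sum(genotypes) / (2 * len(genotypes))


def fst(pop_a, pop_b):
    """F_ST = (H_T - H_S) / H_T with expected heterozygosity H = 2p(1 - p)."""
    n_a, n_b = len(pop_a), len(pop_b)
    p_a, p_b = allele_freq(pop_a), allele_freq(pop_b)
    p_t = allele_freq(list(pop_a) + list(pop_b))
    h_t = 2 * p_t * (1 - p_t)
    if h_t == 0:
        return 0.0  # site is monomorphic in the pooled sample
    h_s = (n_a * 2 * p_a * (1 - p_a) + n_b * 2 * p_b * (1 - p_b)) / (n_a + n_b)
    return (h_t - h_s) / h_t


def permutation_p_value(pop_a, pop_b, n_perm=999, seed=0):
    """Share of label shuffles whose F_ST is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = fst(pop_a, pop_b)
    pooled = list(pop_a) + list(pop_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if fst(pooled[:len(pop_a)], pooled[len(pop_a):]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical genotypes: two fully differentiated populations of ten individuals.
observed, p_value = permutation_p_value([0] * 10, [2] * 10)
print(observed, p_value)
```

The p-value returned here is testwise; applied across hundreds of sites, it would still need to be read against the number of tests performed.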
Consequently, population structure is best estimated from multilocus data. For example, Pritchard et al. have introduced Bayesian statistics to assign individuals to likely subpopulations, with numerous applications in evolutionary, conservation, quantitative, and human genetics. It is well known that over 90% of all human polymorphism is common to all populations, but the ability to genotype hundreds of loci has led to the recognition that, given sufficient data, there is a detectable signature of demographic history even in our species. Similarly, long-held assumptions of panmixia in Drosophila melanogaster are being challenged by deeper sampling, as are commonly held notions about the genetic uniformity of crops such as maize. In fact, the power to discriminate population structure in most species will have a profound impact on quantitative biology. An important implication of the ability to detect population structure is an inference of departure from neutrality, by comparison of the observed F-statistics with those obtained from a collection of assumed neutral markers.
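The reason multilocus data make assignment of individuals feasible can be illustrated with a small Bayesian calculation. The sketch below is a strong simplification and is not the method of Pritchard et al., which estimates population allele frequencies and memberships jointly: here the allele frequencies are assumed to be known, loci are assumed independent and in Hardy-Weinberg proportions, and the prior over populations is uniform. All names and numbers are hypothetical.

```python
from math import exp, log


def genotype_log_likelihood(genotype, freqs, eps=1e-6):
    """Log-probability of a multilocus genotype given one population's allele frequencies.

    genotype: copies (0, 1, or 2) of the counted allele at each locus.
    freqs: frequency of that allele at each locus in the population.
    """
    total = 0.0
    for g, p in zip(genotype, freqs):
        p = min(max(p, eps), 1 - eps)  # keep log() finite when p is 0 or 1
        prob = {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}[g]
        total += log(prob)
    return total


def assignment_posteriors(genotype, pop_freqs):
    """Posterior probability of each population under a uniform prior."""
    log_liks = {pop: genotype_log_likelihood(genotype, f) for pop, f in pop_freqs.items()}
    peak = max(log_liks.values())  # subtract the maximum for numerical stability
    weights = {pop: exp(ll - peak) for pop, ll in log_liks.items()}
    total = sum(weights.values())
    return {pop: w / total for pop, w in weights.items()}


# Hypothetical allele frequencies at three loci in two populations.
pop_freqs = {"A": [0.9, 0.8, 0.9], "B": [0.1, 0.2, 0.1]}
posteriors = assignment_posteriors([2, 2, 2], pop_freqs)
print(posteriors)
```

Because the per-locus log-likelihoods add, even modest frequency differences at each locus accumulate into a decisive posterior once enough loci are genotyped.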
The advent of new sequencing and genotyping technologies will only accelerate the data-driven nature of evolutionary genetic research. ABI 3730 automated DNA sequencing machines routinely generate traces with over 1 kb of high-quality sequence and have a throughput capacity exceeding 1 Mb per day. Single-molecule sequencing methods are expected to make the sequencing of complete eukaryotic genomes for $1000 each a reality, possibly in the next decade, while massively parallel resequencing by hybridization to wafers of tiled oligonucleotides has already been used to characterize polymorphism between primate species. Such studies have identified hundreds of loci that are candidates for adaptive evolution in the recent human lineage, some of which are likely to contribute to the etiology of common disease. Molecular evolutionary studies of single genes in samples of 30 individuals have been typical but will soon be dwarfed by genome-scale sampling. Increasingly, attention will be placed on efficient sampling design and on the formulation of hypotheses that use patterns of variation across the genome to interpret unusual patterns of variation at focal loci. Describing the variance of standard population-genetic parameters at a genome-wide scale is unprecedented territory, and developing approaches to quantify this variation across these expansive contiguous regions is the challenge for the near future. This type of data will also allow reexamination of some of the most basic assumptions underlying many population genetic approaches, such as the infinite sites and island migration models.