8. Marker-Assisted Selection
by Charles Brummer; Department of Crop & Soil Sciences, University of Georgia
Broadly, marker-assisted selection (MAS) can be divided into two categories: marker-assisted backcrossing (or introgression) and marker-assisted recurrent selection (or population improvement). In the former, the goal is to incorporate one or a few major genes or QTL into elite breeding lines (or in some situations, a breeding population). The latter involves using markers to improve the overall genetic value of a population with respect to some trait or suite of traits. Of the two, marker-assisted backcrossing, particularly of a single gene, is the easiest to put into practice; strategies to use markers in recurrent selection are still being developed, and the best strategy for a given situation is not yet clear.
Marker-assisted backcrossing uses DNA markers, which can be scored as a dominant or codominant trait prior to flowering, to facilitate the backcrossing program. This saves time when progeny testing would otherwise be needed and saves resources when phenotyping is difficult. Markers can be used to select for the gene being introgressed into the recurrent parent and to select against undesirable donor DNA. Markers also enable the pyramiding of resistance genes; that is, they enable the incorporation of alleles at multiple loci, each of which confers resistance to the same race of pathogen. This is difficult to do traditionally because one locus masks the presence of the others. The literature on marker-assisted backcrossing continues to grow as new tweaks to methodology are developed and new scenarios are encountered in different crop species.
Foreground selection vs. background selection
Foreground selection refers to using markers that are tightly linked to the gene of interest in order to select for the target allele or gene. Background selection refers to using markers that are not tightly linked to the gene of interest in order to select against other DNA from the donor parent (i.e., to select for recurrent parent alleles at other loci than the target).
Strategies for using markers for foreground selection
The first step for foreground selection is to associate a molecular marker with the target trait by some genetic mapping method. In the best case, the marker itself would be the functional polymorphism – that is, the DNA change that causes the phenotypic difference between alleles. These markers can be called “functional markers” (Andersen and Lübberstedt, 2003). Creating functional markers requires the gene of interest to be cloned (unless you are very, very lucky!). In the absence of a cloned gene, markers that are very tightly linked to the target gene are necessary to avoid recombination between the marker and the gene during backcrossing. A recombination, of course, would result in a situation where the breeder is selecting for a marker allele that is now linked to the undesirable trait allele.
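To see why tight linkage matters, a rough calculation helps. It assumes a constant recombination fraction r between marker and gene and one informative meiosis per backcross generation; both are simplifying assumptions introduced here, not values from any cited study. Under those assumptions, the chance that the marker allele and the target allele are still together after t generations of selecting on the marker alone is:

```latex
\[
P(\text{no recombination in } t \text{ generations}) = (1 - r)^{t}
\]
\[
r = 0.05,\ t = 4:\quad (0.95)^{4} \approx 0.81
\qquad\qquad
r = 0.01,\ t = 4:\quad (0.99)^{4} \approx 0.96
\]
```

With a marker at r = 0.05, roughly one selected lineage in five would be expected to have lost the target allele after four generations, whereas at r = 0.01 the loss is only about 4%.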
Assuming markers are available for the trait, then they can be applied at each backcross generation to select those plants carrying the desired allele (or gene, in the case of a novel transgene). Markers are used as follows:
(1) BCxF1 seeds are planted and DNA is extracted from leaf tissue prior to flowering.
(2) The markers are screened on the DNA of each plant, typically requiring a separate reaction assay for each marker on each plant.
(3) Plants that have the desired marker allele – the one known to be associated with the desired trait allele – are selected for backcrossing to the recurrent parent.
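The three steps above amount to a filter on marker scores. The following is a minimal sketch of that filter, not a description of any particular laboratory pipeline. The plant IDs, marker names (`m1` to `m3`), and the genotype codes `"H"` (heterozygous, carries the donor allele) and `"R"` (homozygous recurrent parent) are all hypothetical.

```python
def select_carriers(scores, target_markers):
    """Return IDs of plants that carry the donor allele at every target marker.

    `scores` maps plant ID -> {marker name: genotype call}. A call of "H"
    means the plant is heterozygous and therefore carries the donor allele;
    "R" means homozygous for the recurrent-parent allele; None means the
    assay failed. These codes are illustrative assumptions.
    """
    return [
        plant
        for plant, calls in scores.items()
        if all(calls.get(marker) == "H" for marker in target_markers)
    ]


# Hypothetical marker scores for four backcross progeny.
scores = {
    "plant_01": {"m1": "H", "m2": "H", "m3": "R"},
    "plant_02": {"m1": "H", "m2": "H", "m3": "H"},
    "plant_03": {"m1": "R", "m2": "H", "m3": "H"},
    "plant_04": {"m1": "H", "m2": None, "m3": "H"},  # failed assay at m2
}

selected = select_carriers(scores, ["m1", "m2", "m3"])
print(selected)  # ['plant_02']
```

Plants with a failed assay are treated conservatively here and are not selected; a breeder might instead choose to re-assay them.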
A clear summary of an example of foreground selection in rice is provided by Sanchez et al. (2000). Their objective was to introgress three different bacterial leaf blight resistance alleles (each at a different chromosomal location) into elite “new plant type” (NPT) rice breeding lines using marker-assisted selection (MAS). The donor parent was IRBB59, which carried all three resistance alleles, Xa21, xa13, and xa5, but was not an NPT line. The recurrent parents were IR65598-112, IR65600-42, and IR65600-96, all NPT lines.
Previous studies had identified markers linked to the genes (they converted these markers into new versions easier to analyze). Table 1 shows the markers and their primer sequences. In some cases, in order to see a polymorphism between the parents, the amplified fragment was digested with a restriction enzyme. The markers used were all tightly linked to the genes; only the marker linked to xa13 showed recombination in a previous study. Note that even though the other markers did not show any recombination in the previous studies (i.e., they are 0 cM from the target), that does not mean that they are the target itself, just that they are very close. (Further, if the previous study was based on a small population size, then the actual recombinational distance may be considerably larger than zero, but for the purposes of this example, having a marker linked to the target at zero cM is great.)
Crosses among the parents and the first backcross of the hybrids to the recurrent parents were done without any markers. They used MAS at each of three subsequent backcross generations. In each generation, ~50 plants were genotyped with markers for the three loci and plants with the correct marker alleles and having the best phenotypes were selected for the next cycle of backcrossing. After selecting BC3F1 plants, the selections were selfed to produce BC3F2 plants and selected for NPT. These plants were then genotyped to identify which resistance genes they had (assuming no recombination between markers and genes), phenotyped to determine resistance to various pathogen races, and then selfed to produce BC3F3 lines, which had a high degree of similarity to NPT.
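A quick calculation shows why genotyping about 50 plants per generation is a workable number for three loci. The sketch below assumes the three loci are unlinked and that the backcross parent is heterozygous at all of them, so each progeny plant has a probability of one half of carrying the donor allele at each locus. These assumptions are made for illustration and are not taken from Sanchez et al. (2000).

```python
def carrier_fraction(n_loci):
    """Expected fraction of backcross progeny carrying the donor allele
    at all n_loci unlinked loci (probability 1/2 per locus)."""
    return 0.5 ** n_loci


def prob_at_least_one(n_plants, n_loci):
    """Probability that at least one of n_plants carries all donor alleles."""
    return 1.0 - (1.0 - carrier_fraction(n_loci)) ** n_plants


n_plants, n_loci = 50, 3
expected_carriers = n_plants * carrier_fraction(n_loci)
print(expected_carriers)  # 6.25
print(prob_at_least_one(n_plants, n_loci))
```

On average about six of the 50 plants would carry all three alleles, and the chance of finding none is very small, which leaves some room to choose among carriers on phenotype.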
The accuracy of the markers at detecting the true genotype was >90%. The accuracy was determined using two segregating populations and scoring the F2 plants for marker genotype and their F3 lines for phenotype (see Table 4). Markers are often assumed to have 100% heritability. As shown here, mistakes in scoring the marker data or other problems resulting in incorrect data (mixed leaf samples, etc.) mean that markers are less than perfect, and hence have a heritability that is somewhat below 100%. Thus, selection based only on markers will occasionally result in selecting the wrong plant even in the absence of recombination, though this frequency is (hopefully) low.
Sanchez et al. (2000) were able to use markers to introgress desirable alleles at three different loci into NPT breeding lines. Importantly, two of the three were recessive alleles and some of the loci overlapped in race specificity. Thus, doing this in a conventional manner would have been nearly impossible, particularly in the time frame they used. Note that they did not use any background selection, only selecting for the target alleles with no concern for “linkage drag.”
Strategies for using markers for background selection
In contrast to foreground selection, background selection is designed to eliminate donor parent alleles other than the desired target gene. Of most importance is minimizing the size of the introgressed region, ideally limiting it to just the gene of interest. Also important, however, is the elimination of donor chromatin in regions of the genome unlinked to the target (i.e., non-carrier chromosomes and regions far from the target on the carrier chromosome).
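As a baseline for what background selection has to improve on, it helps to recall how quickly donor DNA is lost without any marker selection. For regions unlinked to the target, and assuming no selection and regular meiosis, the expected donor fraction halves with each backcross:

```latex
\[
E\left[\text{donor fraction in } \mathrm{BC}_t\right] = \left(\tfrac{1}{2}\right)^{t+1}
\]
\[
\mathrm{BC}_1:\ 25\% \qquad
\mathrm{BC}_2:\ 12.5\% \qquad
\mathrm{BC}_3:\ 6.25\% \qquad
\mathrm{BC}_4:\ 3.125\%
\]
```

These are averages across plants; individual plants vary around them, and that variation is what background markers exploit. The expectation does not apply to the segment linked to the target, which is held in place by foreground selection.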
To remove unlinked donor genome regions, assaying 2-3 evenly spaced markers per 100 cM is the most efficient approach in early generations, but higher marker density may be useful in later backcross generations (Hospital et al., 1992). A program including only a selection step on non-carrier chromosomes is more effective in later backcross generations (BC3 or BC4) than in early generations. Although this may be counterintuitive, it follows from two facts. In early generations, few recombinations have occurred to break up introgressed segments from the donor, so few markers are needed, and most of the donor material will be removed in subsequent backcrosses anyway. In later generations, the donor regions will be smaller, and hence more markers will be needed.
Selection against donor alleles in the region of the target (i.e., reducing the linkage block surrounding the target) has been discussed by numerous authors (e.g., Hospital, 2001; Frisch and Melchinger, 2001a; Frisch and Melchinger, 2001b; Frisch et al., 1999; Reyes-Valdés, 2000). The goal is to make the donor DNA segment surrounding the target gene as small as possible as early as possible in the backcrossing program. To do this, recombinants between the target and closely linked markers need to be identified. More precisely, individuals with a recombination on both sides of the target need to be identified (as shown in the figure below).
Marker-assisted selection is more efficient relative to traditional backcrossing for markers close to the target. If recombination is equally likely throughout a chromosome (which we know isn’t actually true), then the likelihood of a recombination between the target gene and a marker is proportional to the distance between the marker and the target. For markers far from the target, recombination is likely to occur early in the backcrossing process, and hence using markers to select plants that are homozygous for the recurrent parent in these regions is not particularly necessary. However, as the marker gets closer to the target, the use of markers helps identify the rare recombinants. The problem is that few recombinations occur between the trait and markers close to the target, so large populations need to be screened to find recombinants here. A double recombinant is even less frequent, necessitating a huge population size. Therefore, one possible approach is to identify a recombinant on one side of the target in one generation and on the other side in the next generation.
The numbers of individuals that need to be screened to identify one individual with a donor segment of a certain length are given in Tables 2 and 3 in Hospital (2001). For example, finding an individual containing a donor segment of only 1 cM on each side of the target would require screening ~93,959 individuals if the individual was to be identified in a single generation of a backcrossing program, or ~1801 individuals in a two-generation program. What these tables show is that finding plants with recurrent parent alleles at markers further away from the target is much easier, as explained above. However, using markers over a multiple-generation backcrossing program can eventually identify the close recombinants so that, for example, in a five-generation program, many fewer individuals need to be genotyped.
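The scale of these numbers can be approximated with a back-of-the-envelope calculation. The sketch below is not the model used by Hospital (2001). It assumes the Haldane mapping function (no interference), independent recombination on the two sides of the target, a probability of one half that a backcross plant carries the target, and a 99% chance of finding at least one suitable plant. All function names are hypothetical.

```python
import math


def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def min_population(p_desired, success=0.99):
    """Smallest n giving at least one desired plant with probability `success`."""
    return math.ceil(math.log(1.0 - success) / math.log1p(-p_desired))


def one_side_freq(d_cm):
    """Frequency of BC plants carrying the target and recombinant on one side."""
    return 0.5 * haldane_r(d_cm)


def double_recombinant_freq(d_left_cm, d_right_cm):
    """Frequency of BC plants carrying the target and recombinant on both sides."""
    return 0.5 * haldane_r(d_left_cm) * haldane_r(d_right_cm)


n_one_side = min_population(one_side_freq(1.0))
n_both_sides = min_population(double_recombinant_freq(1.0, 1.0))
print(n_one_side, n_both_sides)
```

Under these assumptions, a recombinant within 1 cM on one side needs a population of under a thousand plants, while a double recombinant in a single generation needs on the order of 94,000, which is close to the single-generation figure quoted above. The agreement should be read as a plausibility check only, since the published tables rest on a fuller treatment.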
Molecular marker assisted backcrossing of quantitative traits
As seen in the previous section, traditional backcrossing for quantitative traits is usually a long and rather difficult process – difficult from the standpoint of raising the trait value to the desired level, if nothing else. An alternative approach would be to use molecular markers and genetic linkage maps to identify (some of) the QTL involved in the expression of the trait and then to introgress those QTL, thereby improving the trait more efficiently.
The original concept of using molecular markers to backcross QTL suggested that markers could be associated with QTL in some segregating population and that those markers could then be used in other populations in a breeding program to select for the QTL. However, breeders soon realized that transferring marker information to another cross was problematic because the chromosomal regions associated with QTL varied in different environments and in different genetic backgrounds. Thus, markers identified in one population as associated with desirable QTL alleles may not be associated with the trait in a different population (or may not be segregating in that population). In the worst case, a marker allele linked to a high yield allele at a QTL in population A could be linked to a low yield allele in population B – now that’s working against yourself!
Two approaches can be envisioned to overcome this deficiency. First, advanced backcross QTL analysis (Tanksley and Nelson, 1996; Tanksley et al., 1996) uses the basic idea of combining backcrossing, phenotyping, and marker analysis in the same population. The general method is as follows:
- Season 1: Cross two lines or genotypes to produce an F1.
- Season 2: Generate ≥100 BC1 individuals, select against clearly undesirable types but otherwise backcross all to the recurrent parent.
- Season 3-4: Generate ≥200 BC2 individuals (say 2-3 plants per BC1), select against clearly undesirable types but otherwise self pollinate for one or two generations to develop BC2F2 and BC2F3 families for phenotypic evaluation. Develop a genetic map of the BC2F1 individuals; use the marker and phenotypic data to localize QTL controlling the trait(s) of interest.
- Season 5+: Identify BC2F2 lines (families) that are segregating for these QTL and continue backcrossing until a desired level of recurrent parent genome is reached, followed by self-fertilization to recover plants homozygous for the introgressed QTL region. Ideally, the breeder could develop nearly isogenic lines for each QTL (called “QTL-NIL”), using markers to narrow the introgressed region and to recover the recurrent parent genome elsewhere.
The goal of this process is to develop a set of lines each carrying a single desirable QTL allele introgressed from the donor. With this set of lines, the effect of each QTL on the phenotype can be assessed. Later, after evaluating individual QTL, crosses among NILs carrying different QTL can be made to pyramid multiple QTL for the given trait into a single line.
An alternative method of finding the QTL is the development of a large set of introgression lines (ILs – confusingly enough), which essentially are nearly isogenic lines differing by a single chromosomal region (Zamir, 2001). Each line in the set would be fixed for a different chromosomal region from the donor parent so that, collectively, the entire set would include each region from a donor parent isolated into a separate line. Rather than mapping to identify QTL and then introgressing only those regions, this approach would introgress all regions, and then evaluate the set of lines to find those that have an effect on the phenotype. The regions carried by the positive ILs can then be combined by crossing and pyramided using markers.
In either case, the goal is to “Mendelize” the loci underlying quantitative traits, so that each QTL could be manipulated as easily as one could a single gene trait. The obvious advantage of this system is that the actual effect of the QTL can then be evaluated in multiple genetic backgrounds. The pyramiding of multiple QTL affecting a given trait can be proposed based on an additive model – the more QTL, the better. Although we know this is not likely to be the case in reality, it is a good place to start, and further experimentation can define the actual results when two or more QTL are brought together in a single background.
The second use of markers is not to introgress desirable alleles from wild/unadapted donor parents but to concentrate favorable alleles already existing in breeding populations – in other words, to increase the frequency of beneficial QTL alleles resulting in improved populations and/or inbreds derived from the populations. An excellent summary of the main uses of markers for this purpose is provided in Bernardo (2008), and this paper serves as the background for much of the discussion below.
Two broad methods can be considered. The first, which is similar in some respects to the QTL introgression approach discussed above, refers to a situation where the goal is to combine several QTL into a single inbred line or germplasm. The second involves predicting the genetic merit of an individual in order to improve overall population performance through recurrent selection.
In the first instance, the QTL need to be clearly identified – that is, marginal QTL should not be the focus of the selection, but instead, those that have strong statistical support are the ones most usefully selected. The number of QTL to be selected has to be limited to avoid large population sizes. Selection can be done in an F2 population to eliminate plants homozygous for the undesirable QTL allele, for example, in order to increase the chances of identifying inbred lines that contain all the desired QTL alleles.
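A short calculation illustrates both points: why the number of QTL has to be limited and why culling in the F2 helps. It assumes n unlinked QTL, an F2 from two inbred parents, and inbred lines derived by selfing each retained F2 plant to fixation without further selection. These are simplifying assumptions made here for illustration.

```latex
\[
P(\text{F}_2 \text{ plant homozygous favorable at all } n \text{ loci}) = \left(\tfrac{1}{4}\right)^{n},
\qquad n = 5:\ \tfrac{1}{1024}
\]
\[
P(\text{inbred line fixed favorable at all } n \text{ loci, no marker selection}) = \left(\tfrac{1}{2}\right)^{n},
\qquad n = 5:\ \tfrac{1}{32} \approx 0.03
\]
\[
P(\text{same, after discarding F}_2 \text{ plants homozygous unfavorable at any locus})
= \left(\tfrac{1}{3} + \tfrac{2}{3}\cdot\tfrac{1}{2}\right)^{n} = \left(\tfrac{2}{3}\right)^{n},
\qquad n = 5:\ \tfrac{32}{243} \approx 0.13
\]
```

For five QTL, culling the unfavorable homozygotes in the F2 raises the expected proportion of fully favorable inbred lines roughly fourfold, while every additional QTL still shrinks that proportion geometrically.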
In the second case, QTL need to be identified and their effects need to be estimated. QTL effects are of interest in the first case (selecting the QTL with the largest effect makes sense), but for marker-assisted recurrent selection (MARS) estimating them is essential. Basically, MARS is a version of index selection, in which a marker index is constructed as follows:
$$ M_j = \sum_i b_i X_{ij} $$
where Mj is the marker index for individual j, which consists of a summation across loci of the product of Xij, the plant’s genotype at marker locus i (+1 for the desirable homozygote and -1 for the undesirable homozygote), and bi, the weight given to each marker based on each QTL’s effect. Thus, if five QTL are identified, each with some effect on the phenotype, then the set of genotypes being evaluated is scored for those five marker loci and each plant's genotype at a locus is multiplied by the effect of that QTL on the expression of the phenotype. The sum across the five loci gives a score for each of the genotypes being evaluated, and truncation selection is used to select the best, say, 10% for recombination. For the purposes of recurrent selection to improve a population’s mean, including “marginal” QTL in the index makes sense; even if their effect is not clear, they can still contribute to the phenotype and should be included here.
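The marker index and the truncation step described above can be sketched in a few lines. The weights, plant IDs, and genotype scores below are invented for illustration, and the coding of heterozygotes as 0 is an assumption that the text does not specify.

```python
def marker_index(genotypes, weights):
    """Marker index for one plant: sum over loci of weight * genotype code."""
    return sum(b * x for b, x in zip(weights, genotypes))


def truncation_select(scores, fraction):
    """Return IDs of the top `fraction` of plants, ranked by marker index."""
    n_keep = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


# Hypothetical weights (estimated QTL effects) for five marker loci.
weights = [0.8, 0.5, 0.3, 0.2, 0.1]

# Genotype codes: +1 desirable homozygote, -1 undesirable homozygote,
# 0 heterozygote (the heterozygote code is an assumption of this sketch).
plants = {
    "p1": [1, 1, 1, 1, 1],
    "p2": [1, -1, 1, -1, 1],
    "p3": [-1, -1, -1, -1, -1],
    "p4": [1, 1, -1, 0, -1],
    "p5": [-1, 1, 1, 1, 0],
}

scores = {plant: marker_index(g, weights) for plant, g in plants.items()}
selected = truncation_select(scores, 0.4)
print(selected)  # ['p1', 'p4']
```

Because the weights come from estimated QTL effects, any bias in those estimates carries straight through to the ranking.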
These QTL selection methods face several major problems. First, QTL mapped in one population may not segregate in another population, and even if they do, their effects may be different (and possibly even opposite) in the two populations. Second, QTL mapping will, in general, not identify all the loci contributing to the quantitative trait; as a consequence, markers will not explain all the genotypic variance, but only some portion of it. Third, even if QTL are mapped in the population of interest, their effects may change in different environments. In this case, markers known to be associated with a trait under the conditions in which mapping was conducted may no longer be associated in environments where selection is occurring. Fourth, the epistatic relationships among loci may change as a population is selected for specific loci, and this can adversely affect gain.
Fifth, unless very large populations are being used, the estimate of the effect each QTL has on the phenotype is likely biased upward – the so-called “Beavis effect.” A clear demonstration of this effect has been shown in a vast maize mapping project (Schön et al., 2004). Small population sizes of less than 250 individuals are often used in marker studies – and in some cases, more cannot be analyzed – and in these cases, the percentage of the genotypic variance explained by the QTL is largely a result of biased estimation. The main result of this is that QTL will have a much smaller effect on the trait than expected and in fact, the QTL collectively may not explain much of the variation at all – in which case, the breeder is wasting time using markers.
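The upward bias can be illustrated with a small simulation. This is a toy sketch of the selection bias behind the “Beavis effect,” not a reanalysis of Schön et al. (2004). It assumes a single marker with a true effect of 0.15 phenotypic standard deviations, equal numbers of plants in the two homozygous classes, and a detection threshold of three standard errors; all of these values are invented for illustration.

```python
import random
import statistics


def estimate_effect(true_effect, n, rng):
    """Estimate a marker effect from n plants, half in each homozygous class."""
    half = n // 2
    high = [true_effect + rng.gauss(0.0, 1.0) for _ in range(half)]
    low = [-true_effect + rng.gauss(0.0, 1.0) for _ in range(half)]
    return (statistics.fmean(high) - statistics.fmean(low)) / 2.0


def mean_declared_effect(true_effect, n, n_reps=2000, threshold_se=3.0, seed=1):
    """Average estimated effect over only those replicates in which the
    marker passes the detection threshold (i.e., a QTL is 'declared')."""
    rng = random.Random(seed)
    std_error = 1.0 / n ** 0.5  # standard error of the estimate when sigma = 1
    declared = [
        est
        for est in (estimate_effect(true_effect, n, rng) for _ in range(n_reps))
        if abs(est) > threshold_se * std_error
    ]
    return statistics.fmean(declared) if declared else float("nan")


true_effect = 0.15
small = mean_declared_effect(true_effect, n=100)
large = mean_declared_effect(true_effect, n=2000, n_reps=200)
print(small, large)
```

With 100 plants, only estimates that happen to be large cross the threshold, so the average declared effect is well above the true 0.15. With 2000 plants, nearly every replicate is detected and the average sits close to the true value.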
Citing a combination of the factors noted above, Moreau et al. (2004) found that phenotypic selection improved a maize population for a yield and moisture index, but that marker-only selection, based on markers identified in that population, did not. Distressingly, the desirable QTL alleles were in many cases fixed during the selection program, yet no gain was realized. Despite this example, a number of successful examples of marker-assisted selection (or marker-only selection) have been reported (see Bernardo (2008) for a summary).
Ultimately, marker-based selection will usually be less effective per cycle than direct phenotypic selection for most quantitative traits. However, the advantage of markers, of course, is that selection can be based on seedling plants in the greenhouse – the marker index can be used as the selection criterion in the absence of any phenotypic data – and hence several cycles of selection and recombination can be completed in off-season nurseries. On a per year basis, marker selection can be quite effective, and as costs decline, may even be more cost effective than phenotypic evaluation. Of course, marker effects, even if estimated without bias or error, will change over time as the population changes (remember that a QTL allele that has a large effect when rare in the population will have no effect at all when it is fixed in the population) so that remapping to re-identify QTL and re-estimate effects will need to be done every several generations.
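The parenthetical point about allele frequency can be made concrete. For a single biallelic locus with purely additive effect a (the difference between a homozygote and the midpoint of the two homozygotes) in a random-mating population with favorable-allele frequency p, the standard expression for the additive variance contributed by that locus is shown below. Additivity and random mating are assumptions made here for illustration.

```latex
\[
\sigma_A^2 = 2\,p\,(1 - p)\,a^2
\]
\[
p = 0.1:\ \sigma_A^2 = 0.18\,a^2 \qquad
p = 0.5:\ \sigma_A^2 = 0.50\,a^2 \qquad
p = 0.9:\ \sigma_A^2 = 0.18\,a^2 \qquad
p = 1.0:\ \sigma_A^2 = 0
\]
```

Once the favorable allele is fixed, the locus contributes no variance on which selection can act, so a marker weight estimated earlier no longer predicts any difference among plants.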
A step beyond the QTL mapping methods are “black box” genome-wide selection (GWS) methods (Bernardo and Yu, 2007). These methods assay markers throughout the genome with no a priori knowledge of QTL locations, estimate breeding values for each marker, and then make selections in subsequent generations using those breeding values. A model could be developed like this (Bernardo and Yu, 2007):
$$ Y = \sum_i X_i g_i + e $$
where Y is a vector of phenotypic values for each individual, Xi is a matrix of marker data for each individual plant, gi is a vector of breeding values for each marker, and e is a vector of residual effects.
The first step in genome-wide selection is to conduct a robust phenotypic evaluation of the genotypes, probably by replicated, multilocation testing of their progeny families. Simultaneously, markers throughout the genome will be scored on those genotypes, enabling the computation of breeding values for each marker. The best genotypes will be intercrossed based on their phenotypic data – realize that you will not beat the phenotypic data with markers! Individuals from the new population can then be grown, sampled for DNA, and scored for the set of markers evaluated previously. Based on the genotypes and the known marker breeding values, a prediction of the phenotype can be made, and the plants with the best predicted phenotypes selected and recombined. As with MARS above, the breeding values will change over time, so only several cycles of marker-only selection would be possible before the breeding values need to be re-estimated.
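The two stages described above (estimate marker breeding values in a phenotyped training set, then predict unphenotyped candidates from markers alone) can be sketched as follows. The sketch uses a ridge-type shrinkage estimate, which is one common way of fitting many marker effects at once and is not claimed to be the exact procedure of Bernardo and Yu (2007). The marker matrix, the effects, the noise-free phenotypes (expressed as deviations from the mean), and the shrinkage parameter `lam` are all invented for illustration. A real analysis would use a numerical library and far more markers.

```python
def predict(genotype, effects):
    """Predicted genetic value: sum over markers of genotype code * effect."""
    return sum(x * g for x, g in zip(genotype, effects))


def ridge_marker_effects(X, y, lam):
    """Solve (X'X + lam*I) g = X'y by Gaussian elimination with pivoting."""
    p = len(X[0])
    aug = []
    for a in range(p):
        row = [sum(r[a] * r[b] for r in X) + (lam if a == b else 0.0)
               for b in range(p)]
        row.append(sum(r[a] * yi for r, yi in zip(X, y)))
        aug.append(row)
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    effects = [0.0] * p
    for r in reversed(range(p)):
        tail = sum(aug[r][c] * effects[c] for c in range(r + 1, p))
        effects[r] = (aug[r][p] - tail) / aug[r][r]
    return effects


# Hypothetical training set: 6 plants x 3 markers, coded +1 / -1.
X_train = [[1, 1, 1], [1, 1, -1], [1, -1, 1],
           [-1, 1, 1], [-1, -1, 1], [-1, 1, -1]]
true_effects = [0.5, -0.2, 0.1]                      # invented for the sketch
y_train = [predict(row, true_effects) for row in X_train]  # noise-free

estimated = ridge_marker_effects(X_train, y_train, lam=1.0)

# Unphenotyped candidates from the next cycle, scored for the same markers.
candidates = {"c1": [1, -1, 1], "c2": [-1, 1, -1], "c3": [1, 1, 1]}
predicted = {name: predict(g, estimated) for name, g in candidates.items()}
best = max(predicted, key=predicted.get)
print(best)  # c1
```

The shrinkage pulls every estimated effect toward zero, which changes the predicted values but, in this small example, not the ranking of the candidates.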
Although this method sounds very promising, it has several limitations. Most importantly, the method depends on having adequate genome coverage so that QTL can be associated with markers. The strength of that association is determined by the extent of linkage disequilibrium (LD). If LD extends over only very short distances, as would be expected in a very large intermixed population (consider a population of all U.S. maize, for example), then the number of markers needed will be very large. Realize that even though we are not mapping QTL in this method (nor do we really care where QTL are), we need to have markers associated with many (ideally, all) QTL, because otherwise we will be unable to accurately predict breeding values. On the other hand, if LD extends for long distances across the genome, as may be expected in a narrow-based population like an F2 derived from two inbred parents, then relatively few markers are needed to adequately sample the genome. As marker costs decline, the ability to assay more markers increases, and therefore less LD is necessary to make gains using GWS. However, for very short LD situations, GWS is unlikely to be applicable to most crop situations at this time.
Several points are worth reiterating regarding molecular markers and selection.
1. Using markers to introgress single genes from wild species to elite breeding lines is reasonably straightforward and has been demonstrated to be successful in practical breeding.
2. Using markers to isolate individual QTL controlling important traits has been accomplished in some circumstances, enabling those QTL to be manipulated similarly to genes in #1.
3. Identifying and manipulating QTL with smaller effects or traits controlled by many QTL is more complicated, being hampered by the population sizes needed to fix a large number of desirable QTL alleles, particularly if some are linked in repulsion phases, and by instability in QTL effects across genetic backgrounds or environments.
4. Using markers to improve populations by recurrent selection can be done either by focusing on previously identified QTL or by using a black box genome-wide approach, although gain will be limited by poor estimates of QTL effects in the first case and by a limited extent of LD in the second.
5. Regardless of the method in which markers are used to manipulate quantitative traits, the markers are only as good as the phenotypic data on which they are based. If the heritability of a trait is zero because of poor phenotypic data, then no true QTL will be identified – even though some QTL may be found, they will not be real – and no gain will be made. Markers will not explain more of the genotypic variation than the phenotypic data will, and thus, will always be somewhat less effective than phenotypic evaluations (particularly if environmental fluctuations cause QTL x environment interactions) on a per cycle basis, a limitation that may be overcome by being able to select and recombine in one or more non-target environments. Even if markers are not as effective, if they can provide gain greater than the cost of achieving that gain during a season when phenotypic evaluations are not possible, markers will be useful.
Andersen, J.R. and T. Lübberstedt. 2003. Functional markers in plants. Trends Plant Sci. 8:554-560.
Bernardo, R. 2008. Molecular markers and selection for complex traits in plants: Learning from the last 20 years. Crop Sci. 48:1649-1664.
Bernardo, R. and J. Yu. 2007. Prospects for genomewide selection for quantitative traits in maize. Crop Sci. 47:1082-1090.
Frisch, M., M. Bohn, and A.E. Melchinger. 1999. Comparison of selection strategies for marker-assisted backcrossing of a gene. Crop Sci. 39:1295-1301.
Frisch, M. and A.E. Melchinger. 2001a. Marker-assisted backcrossing for simultaneous introgression of two genes. Crop Sci. 41:1716-1725.
Frisch, M. and A.E. Melchinger. 2001b. The length of the intact donor chromosome segment around a target gene in marker-assisted backcrossing. Genetics 157:1343-1356.
Hospital, F. 2001. Size of donor chromosome segments around introgressed loci and reduction of linkage drag in marker-assisted backcross programs. Genetics 158:1363-1379.
Hospital, F., C. Chevalet, and P. Mulsant. 1992. Using markers in gene introgression breeding programs. Genetics 132:1199-1210.
Moreau, L., A. Charcosset, and A. Gallais. 2004. Experimental evaluation of several cycles of marker-assisted selection in maize. Euphytica 137:111-118.
Reyes-Valdés, M.H. 2000. A model for marker-based selection in gene introgression breeding programs. Crop Sci. 40:91-98.
Sanchez, A.C., D.S. Brar, N. Huang, Z. Li, and G.S. Khush. 2000. Sequence tagged site marker-assisted selection for three bacterial blight resistance genes in rice. Crop Sci. 40:792-797.
Schön, C.C., H.F. Utz, S. Groh, B. Truberg, S. Openshaw, and A.E. Melchinger. 2004. Quantitative trait locus mapping based on resampling in a vast maize testcross experiment and its relevance to quantitative genetics for complex traits. Genetics 167:485-498.
Tanksley, S.D. and J.C. Nelson. 1996. Advanced backcross QTL analysis: a method for the simultaneous discovery and transfer of valuable QTLs from unadapted germplasm into elite breeding lines. Theor. Appl. Genet. 92:191-203.
Tanksley, S.D., S. Grandillo, T.M. Fulton, D. Zamir, Y. Eshed, V. Petiard, J. Lopez, and T. Beck-Bunn. 1996. Advanced backcross QTL analysis in a cross between an elite processing line of tomato and its wild relative L. pimpinellifolium. Theor. Appl. Genet. 92:213-224.
Zamir, D. 2001. Improving plant breeding with exotic genetic libraries. Nat. Rev. Genet. 2:983-989.