Genetic data sets

This is a MapMaker compatible format. CARTHAGENE uses a dedicated boosted EM for backcross data, which may be one or two orders of magnitude faster than a standard EM algorithm without any loss of precision [SCBM01]. All backcross data files must start with the following header line:

In the case of backcross data, each locus can be either homozygous or heterozygous. These two situations are encoded using the A and H characters respectively. Loci with an unknown status are encoded as -. This encoding can be redefined using aliasing in the second header line (see the beginning of the section).

Intercross F2 data

This is a MapMaker compatible format. All F2 (intercross) datasets must start with the following header line:

Depending on the dominance or codominance of each locus, several situations can be encoded. If we call A and B the two alleles of a heterozygous individual, the offspring can be either:

Recombinant inbred lines (RIL) data

These are MapMaker compatible formats. Both self and sib RIL data are handled by CARTHAGENE. Note that these data types are internally handled as pure backcross data (recombination frequencies being corrected to take into account the fact that the data represent self/sib RIL). It is therefore impossible to completely merge such data with other genetic data, that is, on both order and recombination frequencies (see the dsmergen command, section 2.2.2). Use the dsmergor command in this case.

Depending on the RIL type (self/sib), the first header line of the format file must be respectively:

The characters used to encode RIL data are the same as for backcross data (see section 2.1.1). This encoding can be redefined using aliasing in the second header line (see the beginning of the section).

Data from arbitrary series of mating operations in line crosses

Mating designs consisting of a series of backcrossing, selfing, random intercrossing, and/or haploid-doubling operations applied to F1 progeny of a cross between two homozygous individuals are accepted. The file header will look like this:

where the first bs (case-insensitive) is required and the final word on the first line is a variable sequence of letters that denote mating operations. In the example above, two backcrosses are followed by a selfing, another backcross, and a final selfing (the sequence bbsbs). The allowed codes are b, s, i, and d, and any sequence of up to eight operations is permitted.

If your design ends with a backcross operation, be sure that the recurrent parent is represented by the A character. You can use CarthaGene's aliasing notation to alter the character meanings, as described for the f2 backcross and other mating types.

Don't use this method to analyze a single backcross (coded as bs b) or RI design (coded, for the example of an F9, as bs ssssssss). For these, the standard CarthaGene types are handled with faster algorithms. The F2 design (coded as bs s) is handled at probably about the same speed as if it were coded f2 intercross.
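The rules stated above for the mating-operation word (codes b, s, i and d, at most eight operations) can be checked before a data file is written. The following Python sketch is only an illustration: the function names are hypothetical and not part of CarthaGene, and it assumes that the word is written in lowercase and is not empty, which the text does not state explicitly.

```python
# Operation codes as listed in the text; the long names are descriptive labels only.
OPERATIONS = {
    "b": "backcross",
    "s": "selfing",
    "i": "random intercrossing",
    "d": "haploid doubling",
}
MAX_OPERATIONS = 8  # "any sequence of up to eight operations is permitted"


def is_valid_design(ops: str) -> bool:
    """Return True if ops is a non-empty word of at most eight allowed codes.

    Assumption: codes are lowercase and an empty word is rejected.
    """
    return 0 < len(ops) <= MAX_OPERATIONS and all(c in OPERATIONS for c in ops)


def describe_design(ops: str) -> list[str]:
    """Spell out each mating operation of a valid design word, in order."""
    if not is_valid_design(ops):
        raise ValueError(f"invalid mating design: {ops!r}")
    return [OPERATIONS[c] for c in ops]


# The design described in the text: two backcrosses, a selfing,
# another backcross and a final selfing.
print(describe_design("bbsbs"))
```

Such a check only validates the syntax of the word; it says nothing about whether one of the faster dedicated data types would be a better choice for the design.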

Phase known outbred data

CARTHAGENE can handle outbred data as long as phases are fixed (either known or set to the most probable phases). Such phase-known outbred datasets can be handled using different strategies. A first simple method (which ignores part of the information) consists in projecting the information on each parent side: this gives two backcross datasets which can be merged using either the dsmergen or the dsmergor command. The first aims at computing a consensus map (with consensus distances), while the second aims at computing a consensus order with different recombination ratios for each parent. We will not detail this strategy here, although it has the advantage of relying on our "Boosted EM" algorithm, which means that it will be a lot faster than the approach below. In this section, we describe a more complex encoding which does not ignore any information. All outbred datasets must start with the following header line (same as for intercross data):

Because the ability to handle outbred data has evolved from a classical intercross situation using MapMaker syntax, the syntax used to encode such data is currently rather clumsy. This may change one of these days but you'd better not count on it...

At one locus, consider the cross $F_0\vert F_1 \times M_0\vert M_1$ where $F_0\vert F_1$ and $M_0\vert M_1$ stand for the alleles on each haplotype of the father and the mother respectively. The genotype of the child obtained may be either $F_0\vert M_0$ , $F_1\vert M_0$ , $F_0\vert M_1$ or $F_1\vert M_1$ . Depending on the heterozygosity of the parents, the number of different alleles, the dominance or codominance of the markers, and the observations available on the child's phenotype, only a subset of these 4 possibilities may be compatible with the data. For example, in the "usual" F2 intercross situation, the two parents are heterozygous and bear the same pair of alleles: $F_0\vert F_1 = A\vert B$ and $M_0\vert M_1 = A\vert B$ . In this simple case, observations on the phenotype of a child may lead to different situations:

In order to cope with all phase-known situations, including cases where one parent is homozygous or where 3 or 4 different alleles are present, Carthagène enables the user to express any subset of the 4 different possibilities. To do so, each of these 4 possibilities is associated with a number: 1 for $F_0\vert M_0$ , 2 for $F_0\vert M_1$ , 4 for $F_1\vert M_0$ and 8 for $F_1\vert M_1$ .

The user tells Carthagène which set of genotypes is actually possible at a given locus by summing the numbers of the possible genotypes and writing the sum as a single hexadecimal digit (1 to 9, then a to f).

The following table recapitulates all possible codes from 1 to f and the corresponding set of possible genotypes at the locus.

Notation	Synonym	Possible Genotypes
`1`	`A`	$F_0\vert M_0$
`2`		$F_0\vert M_1$
`3`		$F_0\vert M_0$ , $F_0\vert M_1$
`4`		$F_1\vert M_0$
`5`		$F_0\vert M_0$ , $F_1\vert M_0$
`6`	`H`	$F_0\vert M_1$ , $F_1\vert M_0$
`7`	`D`	$F_0\vert M_0$ , $F_0\vert M_1$ , $F_1\vert M_0$
`8`	`B`	$F_1\vert M_1$
`9`		$F_0\vert M_0$ , $F_1\vert M_1$
`a`		$F_0\vert M_1$ , $F_1\vert M_1$
`b`		$F_0\vert M_0$ , $F_0\vert M_1$ , $F_1\vert M_1$
`c`		$F_1\vert M_0$ , $F_1\vert M_1$
`d`		$F_0\vert M_0$ , $F_1\vert M_0$ , $F_1\vert M_1$
`e`	`C`	$F_0\vert M_1$ , $F_1\vert M_0$ , $F_1\vert M_1$
`f`	`-`	$F_0\vert M_0$ , $F_0\vert M_1$ , $F_1\vert M_0$ , $F_1\vert M_1$
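The table above is a 4-bit mask written as one hexadecimal digit. A minimal Python sketch of the conversion in both directions is given below; the function names `encode` and `decode` are hypothetical helpers for illustration, not CarthaGene commands. Note that the MapMaker synonyms A, B, C and D are uppercase while the hexadecimal codes a to f are lowercase, so the lookup has to be case-sensitive.

```python
# Weight of each elementary genotype, as in the table.
GENOTYPE_WEIGHTS = {"F0|M0": 1, "F0|M1": 2, "F1|M0": 4, "F1|M1": 8}

# MapMaker synonyms from the table, mapped to their hexadecimal code.
SYNONYMS = {"A": "1", "H": "6", "D": "7", "B": "8", "C": "e", "-": "f"}


def encode(genotypes):
    """Return the hexadecimal code for a non-empty set of possible genotypes."""
    mask = 0
    for genotype in genotypes:
        mask |= GENOTYPE_WEIGHTS[genotype]
    if mask == 0:
        raise ValueError("at least one genotype must be possible")
    return format(mask, "x")  # lowercase hexadecimal digit, 1 to f


def decode(code):
    """Return the list of genotypes allowed by a code or by a synonym."""
    # Synonyms are checked first: uppercase "A" means code 1, not hexadecimal a.
    digit = SYNONYMS.get(code, code)
    mask = int(digit, 16)
    if not 1 <= mask <= 15:
        raise ValueError(f"not a valid code: {code!r}")
    return [g for g, weight in GENOTYPE_WEIGHTS.items() if mask & weight]


print(encode(["F0|M0", "F0|M1", "F1|M0"]))  # 1 + 2 + 4
print(decode("H"))
```

With these definitions, `encode(["F0|M1", "F1|M1"])` gives `"a"` (2 + 8 = 10) and `decode("-")` lists all four genotypes.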

Let us see how this can be used in some practical phase-known outbred situations according to the segregation ratio of the marker at the current locus.

Segregation ratio 1:2:1

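This is the usual case in an F2 intercross with codominance. In this case, the parents $F_0\vert F_1$ and $M_0\vert M_1$ are such that typically $F_0 = M_0 = A$ and $F_1 = M_1 = B$ codominate. The usual observations on the child are either homozygous $A$ , heterozygous, or homozygous $B$ . According to the table, this leads to the following encoding: code 1 (synonym A) for $A\vert A$ , code 6 = 2+4 (synonym H) for the heterozygote ( $A\vert B$ or $B\vert A$ ), and code 8 (synonym B) for $B\vert B$ .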

When missing data occur, it is still possible to give partial information to Carthagène. For example, suppose the child is typed $A$ but the other allele is not known. The child's genotype can be either $A\vert A = F_0\vert M_0$ , $A\vert B = F_0\vert M_1$ or $B\vert A = F_1\vert M_0$ . This is encoded by the character 7 = 1+2+4 (synonym D).

The character e = 14 = 8+4+2 (synonym C) encodes a situation where the child has been typed $B$ but the other allele is not known. In this case, the child's genotype can be $B\vert B = F_1\vert M_1$ , $A\vert B = F_0\vert M_1$ or $B\vert A = F_1\vert M_0$ .

Segregation ratio 3:1

This type of segregation ratio occurs when dominance appears. Imagine $A$ dominates $B$ : it is then impossible to distinguish between $A\vert A$ , $A\vert B$ and $B\vert A$ . Precisely, if the child is typed $B$ , then the character 8 (or the synonym B of MapMaker) will be used. Otherwise the character 7 = 1+2+4 (or the synonym D of MapMaker) will be used to represent the fact that we simply know that the child genotype is either $A\vert A = F_0\vert M_0$ , $A\vert B = F_0\vert M_1$ or $B\vert A = F_1\vert M_0$ .

Conversely, if $B$ dominates $A$ , the code 1 (resp. e) will be used if the child is typed $A$ (resp. $B$ ). The respective synonyms A and C of MapMaker can also be used.
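The dominance rule can be expressed as a small computation: a genotype is kept whenever the phenotype it would show matches the observed one. The Python sketch below is an illustration under stated assumptions: the function name `dominant_marker_code` is hypothetical, the marker has exactly two alleles, and the parents are given as pairs of allele names in the order $F_0, F_1$ and $M_0, M_1$ .

```python
from itertools import product


def dominant_marker_code(father, mother, dominant, phenotype):
    """Return the hexadecimal code for an observed phenotype at a dominant marker.

    father and mother are pairs of alleles (F0, F1) and (M0, M1).
    Assumption: two alleles only, so a genotype without the dominant
    allele is homozygous for the recessive one.
    """
    mask = 0
    # product() yields F0|M0, F0|M1, F1|M0, F1|M1, matching the weights 1, 2, 4, 8.
    for (f, m), weight in zip(product(father, mother), (1, 2, 4, 8)):
        expressed = dominant if dominant in (f, m) else f
        if expressed == phenotype:
            mask |= weight
    if mask == 0:
        raise ValueError("phenotype is incompatible with the parental genotypes")
    return format(mask, "x")


parents = ("A", "B"), ("A", "B")
print(dominant_marker_code(*parents, dominant="A", phenotype="B"))  # B|B only
print(dominant_marker_code(*parents, dominant="A", phenotype="A"))  # 1 + 2 + 4
print(dominant_marker_code(*parents, dominant="B", phenotype="A"))  # A|A only
print(dominant_marker_code(*parents, dominant="B", phenotype="B"))  # 2 + 4 + 8
```

The four calls reproduce the codes 8, 7, 1 and e given in the text for the two dominance directions.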

Segregation ratio 1:1:1:1

This occurs when different alleles appear in the father and in the mother, for example in $A\vert B \times C\vert D$ . In this case the child is either $A\vert C$ , $A\vert D$ , $B\vert C$ or $B\vert D$ and the corresponding codes are respectively 1 (or A), 2, 4 and 8 (or B).

When some data are missing, it is still possible to give information to Carthagène. Imagine for example that the child is typed $A$ and that the second allele is unknown. Because we have 4 different alleles, we know for sure that the first allele of the child is $A$ . Therefore, the only possible genotypes are $A\vert C$ (code 1) or $A\vert D$ (code 2) and the corresponding code is 3 = 1+2.

Imagine that instead of having 4 alleles, we have 3 alleles in $A\vert B \times A\vert C$ . Then children may be $A\vert A$ , $A\vert C$ , $B\vert A$ or $B\vert C$ , i.e., there is still a 1:1:1:1 segregation ratio. If again the child is just typed $A$ , then there is more indetermination at hand: the child may be either $A\vert A$ (code 1), $A\vert C$ (code 2) or $B\vert A$ (code 4). In this case, the corresponding code is 7 = 1+2+4 (or the synonym D).

Segregation ratio 1:1

Imagine the second parent is homozygous at the current locus, i.e., we cross $A\vert B$ with $C\vert C$ (a backcross-like situation). The children's genotypes may then be either $A\vert C$ or $B\vert C$ . If we observe $A\vert C$ (or simply $A$ ), it is not known where the $C$ comes from, i.e., the child may be $A\vert C_0$ or $A\vert C_1$ . This case is encoded by 3 = 1+2, the sum of the codes of the two possible cases: $A\vert C$ with $C$ coming from one grandparent (code 1) or the other (code 2).

Similarly, if we observe $B\vert C$ , then we know that the first allele is $B$ and that the second is $C$ , but we don't know where this second allele comes from. The code will be c = 12 = 4+8.

If the homozygosity appears on the first parent ( $A\vert A \times B\vert C$ ) instead of the second one and we observe $A\vert B$ , we get the code 5, the sum of 1 and 4, corresponding to the fact that the origin of the first allele is unknown. If we observe $A\vert C$ , we get the code a, which is hexadecimal for 10, the sum of 8 and 2.
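The allele-by-allele reasoning used for the codominant cases (1:2:1, 1:1:1:1 and 1:1) follows one pattern: enumerate the four parental haplotype combinations and keep those compatible with the alleles observed in the child. The Python sketch below illustrates this under stated assumptions: the function name `outbred_code` and its `fully_typed` flag are hypothetical, markers are codominant, and alleles are single-character names so that a string such as "AB" stands for the pair (A, B).

```python
from itertools import product


def outbred_code(father, mother, observed, fully_typed=False):
    """Return the hexadecimal code for alleles observed in a child.

    father and mother list the alleles (F0, F1) and (M0, M1); observed
    lists the alleles seen in the child. With fully_typed=True the
    observation is taken to be the complete genotype; otherwise other
    alleles may be present but unobserved.
    """
    seen = set(observed)
    mask = 0
    # product() yields F0|M0, F0|M1, F1|M0, F1|M1, matching the weights 1, 2, 4, 8.
    for (f, m), weight in zip(product(father, mother), (1, 2, 4, 8)):
        alleles = {f, m}
        compatible = alleles == seen if fully_typed else seen <= alleles
        if compatible:
            mask |= weight
    if mask == 0:
        raise ValueError("observation is incompatible with the parental genotypes")
    return format(mask, "x")


print(outbred_code("AB", "CD", "A"))   # A|C or A|D
print(outbred_code("AB", "AC", "A"))   # A|A, A|C or B|A
print(outbred_code("AB", "CC", "BC"))  # B|C, origin of C unknown
print(outbred_code("AA", "BC", "AC"))  # A|C, origin of A unknown
```

The four calls give 3, 7, c and a, as in the examples of the text. Passing `fully_typed=True` with parents "AB" and "AB" and observation "AB" gives 6, the heterozygote code of the 1:2:1 case.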