- Backcross data
- Intercross F2 data
- Recombinant inbred lines (RIL) data
- Data from arbitrary series of mating operations in line crosses
- Phase known outbred data

Backcross data

This is a MapMaker compatible format. CARTHAGENE uses a dedicated boosted EM algorithm for backcross data, which may be one or two orders of magnitude faster than a standard EM algorithm without any loss of precision [SCBM01]. Every backcross data file must start with the following header line:

data type f2 backcross

In backcross data, each locus can be either homozygous or
heterozygous. These two situations are encoded using the `A`
and `H` characters respectively. Loci with an unknown status
are encoded as `-`. This encoding can be redefined using
aliasing in the second header line (see the beginning of the section).

Intercross F2 data

This is a MapMaker compatible format. All F2 (intercross) datasets must start with the following header line:

data type f2 intercross

Depending on the dominance or codominance of each locus, several
situations can be encoded. If we call `A` and `B` the
two alleles of a heterozygous individual, the offspring can be
either:

- homozygous `A`: this is encoded by the character `A`.
- homozygous `B`: this is encoded by the character `B`.
- heterozygous: this is encoded by the character `H`.
- known not to be homozygous `A`: this is encoded by the character `C`.
- known not to be homozygous `B`: this is encoded by the character `D`.
- unknown: this is encoded as `-`.

Recombinant inbred lines (RIL) data

These are MapMaker compatible formats. Both self and sib RIL data are
handled by CARTHAGENE. Note that these data types are internally
handled as pure backcross data, with recombination frequencies
corrected to account for the fact that the data represent self/sib
RIL. It is therefore impossible to completely merge such data with
other genetic data, that is, on both order and recombination
frequencies (see the `dsmergen` command, section 2.2.2). Use the
`dsmergor` command in this case.

Depending on the RIL type (self or sib), the first header line of the data file must be, respectively:

data type ri self

or

data type ri sib

The characters used to encode RIL data are the same as for backcross data (see section 2.1.1). This encoding can be redefined using aliasing in the second header line (see the beginning of the section).

Data from arbitrary series of mating operations in line crosses

Mating designs consisting of a series of backcrossing, selfing, random intercrossing, and/or haploid-doubling operations applied to F1 progeny of a cross between two homozygous individuals are accepted. The file header will look like this:

data type bs BBSBS

where the first `bs` (case-insensitive) is required and the final word on the first line is a variable sequence of letters that denote mating operations. In the example above, two backcrosses are followed by a selfing, another backcross, and a final self. The allowed codes are `b`, `s`, `i`, and `d`, and any sequence of up to eight operations is permitted.
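The rules above can be checked mechanically before a file is handed to CarthaGene. The following Python sketch is only an illustration, not part of CarthaGene: the function name `parse_bs_header` is hypothetical, and the pairing of `i` with random intercrossing and `d` with haploid doubling is an assumption based on the initial letters (the text only confirms `b` and `s` through its example).

```python
# Assumed meaning of each operation letter; only "b" and "s" are confirmed
# by the example in the text, "i" and "d" are inferred from their initials.
OPERATIONS = {
    "b": "backcross",
    "s": "selfing",
    "i": "random intercross",
    "d": "haploid doubling",
}


def parse_bs_header(line):
    """Validate a 'data type bs <operations>' header and list the operations."""
    words = line.split()
    if len(words) != 4 or words[:2] != ["data", "type"]:
        raise ValueError("expected: data type bs <operations>")
    if words[2].lower() != "bs":  # "bs" is case-insensitive
        raise ValueError("third word must be 'bs'")
    design = words[3].lower()
    if not 1 <= len(design) <= 8:
        raise ValueError("between one and eight operations are allowed")
    unknown = sorted(set(design) - set(OPERATIONS))
    if unknown:
        raise ValueError("unknown operation code(s): " + ", ".join(unknown))
    return [OPERATIONS[letter] for letter in design]


print(parse_bs_header("data type bs BBSBS"))
```

For the header shown above, the sketch lists two backcrosses, a selfing, another backcross, and a final selfing, in that order.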

If your design ends with a backcross operation, be sure that the recurrent parent is represented by the `A` character. You can use CarthaGene's aliasing notation to alter the character meanings, as described for the f2 backcross and other mating types.

Don't use this method to analyze a single backcross (coded as `bs b`) or an RI design (coded, for the example of an F9, as `bs ssssssss`). For these, the standard CarthaGene types are handled with faster algorithms. The F2 design (coded as `bs s`) is handled at probably about the same speed as if it were coded `f2 intercross`.

Phase known outbred data

CARTHAGENE can handle outbred data as long as phases are fixed (either
known or set to the most probable phases). Such phase-known outbred
datasets can be handled using different strategies. A first, simple
method (which ignores part of the information) consists in projecting
the information on each parent side: this gives two backcross datasets
which can be merged using either the `dsmergen` or the
`dsmergor` command. The first will compute a
consensus map (with consensus distances) while the second will
compute a consensus order with different recombination ratios for
each parent. We will not detail this strategy here, although it has the
advantage of relying on our "Boosted EM" algorithm, which means that
it will be a lot faster than the approach below. In this section, we
describe a more complex encoding which does not ignore any
information. All outbred datasets must start with the following
header line (the same as for intercross data):

data type f2 intercross

Because the ability to handle outbred data has evolved from a classical intercross situation using MapMaker syntax, the syntax used to encode such data is currently rather clumsy. This may change one of these days, but you'd better not count on it.

At one locus, consider the cross $f_1 f_2 \times m_1 m_2$, where $f_1$, $f_2$, $m_1$ and $m_2$ stand for the alleles on each haplotype of the father and the mother respectively. The genotype of the child, written with the allele inherited from the father first, may be either $f_1 m_1$, $f_1 m_2$, $f_2 m_1$ or $f_2 m_2$. Depending on the heterozygosity of the parents, on the number of different alleles, on the dominance or codominance of the markers, and on the observations available on the child's phenotype, only a subset of these 4 possibilities remains possible. For example, in the "usual" F2 intercross situation, the two parents are heterozygous and bear the same pair of alleles: $f_1 = m_1 = A$ and $f_2 = m_2 = B$. In this simple case, observations on the phenotype of a child may lead to different situations:

- the child is typed $A$ and the allele $A$ is not dominant. In this case the only possible genotype is $AA$, i.e., $f_1 m_1$ (the allele inherited from the father is written first). This is encoded by the character `A` in MapMaker.
- the child is typed $A$ and $A$ is dominant. Then the set of possible genotypes is $\{AA, AB, BA\}$, i.e., $\{f_1 m_1, f_1 m_2, f_2 m_1\}$. This is encoded by the character `D` in MapMaker.
- the child is typed $B$ and the allele $B$ is not dominant. In this case the only possible genotype is $BB$, i.e., $f_2 m_2$. This is encoded by the character `B` in MapMaker.
- the child is typed $B$ and $B$ is dominant. Then the set of possible genotypes is $\{AB, BA, BB\}$, i.e., $\{f_1 m_2, f_2 m_1, f_2 m_2\}$. This is encoded by the character `C` in MapMaker.
- the child is typed $AB$ (the marker is codominant). Then the set of possible genotypes is $\{AB, BA\}$, i.e., $\{f_1 m_2, f_2 m_1\}$. This is encoded by the character `H` in MapMaker.
- the child is untyped at this locus. The set of possible genotypes includes all possible genotypes, i.e., $\{f_1 m_1, f_1 m_2, f_2 m_1, f_2 m_2\}$. This is encoded by the character `-` in MapMaker.

To cope with all phase-known situations, including cases where one parent is homozygous or where 3 or 4 different alleles are present, Carthagène lets the user express any subset of the 4 possibilities. To do so, each of these 4 possibilities is associated with a number:

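- genotype $f_1 m_1$ (first paternal haplotype, first maternal haplotype) is associated with 1
- genotype $f_1 m_2$ (first paternal haplotype, second maternal haplotype) is associated with 2
- genotype $f_2 m_1$ (second paternal haplotype, first maternal haplotype) is associated with 4
- genotype $f_2 m_2$ (second paternal haplotype, second maternal haplotype) is associated with 8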

and the user can tell Carthagène which set of genotypes is actually possible at a given locus by:

- adding up the numbers associated with each possible genotype at the locus given the observations
- converting this sum to hexadecimal (base 16, where one counts 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f).
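The two steps above can be sketched in Python. This is a minimal illustration rather than CarthaGene code. The labels `f1m1`, `f1m2`, `f2m1` and `f2m2` are hypothetical names for the four child genotypes (paternal haplotype first), and the assignment of the values 1, 2, 4 and 8 to them is the one implied by the worked cases later in this section.

```python
# Value of each phase-known child genotype. Labels are illustrative only:
# "f1m2" means first paternal haplotype with second maternal haplotype.
GENOTYPE_BITS = {"f1m1": 1, "f1m2": 2, "f2m1": 4, "f2m2": 8}


def encode(possible):
    """Return the one-character hexadecimal code for a set of genotypes."""
    genotypes = set(possible)
    if not genotypes:
        raise ValueError("at least one genotype must remain possible")
    total = sum(GENOTYPE_BITS[g] for g in genotypes)  # step 1: add up
    return format(total, "x")                         # step 2: to base 16


def decode(code):
    """Return the set of genotypes described by a hexadecimal code."""
    value = int(code, 16)
    if not 1 <= value <= 15:
        raise ValueError("code must lie between 1 and f")
    return {g for g, bit in GENOTYPE_BITS.items() if value & bit}


print(encode({"f1m2", "f2m1"}))   # 2 + 4
print(sorted(decode("e")))        # 14 = 2 + 4 + 8
```

Because each genotype has its own power of two, every subset of genotypes has a distinct sum, which is why a single hexadecimal digit is enough.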

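The following table summarizes all possible codes from `1` to
`f` and the corresponding set of possible genotypes at the locus. Genotypes are written with the allele inherited from the father first.

| Notation | Synonym | Possible genotypes |
|---|---|---|
| 1 | A | $f_1 m_1$ |
| 2 | | $f_1 m_2$ |
| 3 | | $f_1 m_1$, $f_1 m_2$ |
| 4 | | $f_2 m_1$ |
| 5 | | $f_1 m_1$, $f_2 m_1$ |
| 6 | H | $f_1 m_2$, $f_2 m_1$ |
| 7 | D | $f_1 m_1$, $f_1 m_2$, $f_2 m_1$ |
| 8 | B | $f_2 m_2$ |
| 9 | | $f_1 m_1$, $f_2 m_2$ |
| a | | $f_1 m_2$, $f_2 m_2$ |
| b | | $f_1 m_1$, $f_1 m_2$, $f_2 m_2$ |
| c | | $f_2 m_1$, $f_2 m_2$ |
| d | | $f_1 m_1$, $f_2 m_1$, $f_2 m_2$ |
| e | C | $f_1 m_2$, $f_2 m_1$, $f_2 m_2$ |
| f | - | $f_1 m_1$, $f_1 m_2$, $f_2 m_1$, $f_2 m_2$ |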

Here is an example of a small outbred dataset:

data type f2 intercross
40 5 0 0
*M1 1118822228821414212414281812248128422488
*M2 1418822228821414212414281882248128422488
*M3 4418828288821414242414281881148122422488
*M4 4412288228411814211444881884248222124488
*M5 8412288224412814811484881281848822184188
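To make the layout of this example concrete, here is a small Python reader. It is a sketch under stated assumptions, not CarthaGene's own parser: it assumes that the first two integers after the header are the number of individuals (40) and the number of markers (5), as the example suggests, that the two following integers can be skipped, and that each marker is written as `*name` followed by one unbroken string of codes. The names `parse_dataset` and `SYNONYMS` are hypothetical.

```python
# MapMaker synonyms and their hexadecimal equivalents, as listed in the text.
SYNONYMS = {"A": "1", "H": "6", "D": "7", "B": "8", "C": "e", "-": "f"}

EXAMPLE = """data type f2 intercross
40 5 0 0
*M1 1118822228821414212414281812248128422488
*M2 1418822228821414212414281882248128422488
*M3 4418828288821414242414281881148122422488
*M4 4412288228411814211444881884248222124488
*M5 8412288224412814811484881281848822184188
"""


def parse_dataset(text):
    """Return {marker name: list of integer codes (1..15), one per individual}."""
    tokens = text.split()
    if tokens[:4] != ["data", "type", "f2", "intercross"]:
        raise ValueError("unexpected header line")
    n_individuals, n_markers = int(tokens[4]), int(tokens[5])
    rest = tokens[8:]  # skip the two remaining integers of the second line
    markers = {}
    for name, observations in zip(rest[0::2], rest[1::2]):
        if not name.startswith("*"):
            raise ValueError("marker names must start with '*': " + name)
        # Upper-case letters are synonyms; lower-case a..f are hexadecimal.
        codes = [int(SYNONYMS.get(ch, ch), 16) for ch in observations]
        if any(not 1 <= code <= 15 for code in codes):
            raise ValueError("invalid code for marker " + name)
        if len(codes) != n_individuals:
            raise ValueError("wrong number of observations for " + name)
        markers[name[1:]] = codes
    if len(markers) != n_markers:
        raise ValueError("wrong number of markers")
    return markers


data = parse_dataset(EXAMPLE)
print(sorted(data), data["M1"][:4])
```

Splitting on whitespace keeps the sketch indifferent to how the file is broken into lines, at the cost of not supporting observation strings that contain spaces.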

Let us see how this can be used in some practical phase-known outbred situations, according to the segregation ratio of the marker at the current locus.

The 1:2:1 segregation ratio is the usual case in F2 intercross with codominance. In this case, the parents $f_1 f_2$ and $m_1 m_2$ are such that typically $f_1 = m_1 = A$ and $f_2 = m_2 = B$, and $A$ and $B$ codominate. The usual observations on the child are either phenotype $A$, $B$ or $AB$. This leads to the following encoding (genotypes are written with the allele inherited from the father first):

- $A$: The only possible genotype is $AA$: this is coded by `1` (the synonym `A` used in MapMaker can also be used).
- $B$: The only possible genotype is $BB$: this is coded by `8` (the synonym `B` used in MapMaker can also be used).
- $AB$: The only possible genotypes are $AB$ and $BA$: this is coded by `6` = 4+2. The synonym `H` used in MapMaker can also be used.

When data is missing, it is still possible to give partial information
to Carthagène. For example, suppose the child is typed $A$ but the other
allele is not known. The child's genotype can be either $AA$, $AB$
or $BA$. This is encoded by the character `7` = 1+2+4
(synonym `D`).

The character `e` = 14 = 8+4+2 (synonym `C`) encodes a
situation where the child has been typed $B$ but the other allele is not
known. In this case, the child's genotype can be $BB$, $BA$ or $AB$.

The 3:1 segregation ratio occurs when dominance appears. Imagine
$A$ dominates $B$: it is then impossible to distinguish between $AA$,
$AB$ and $BA$. Precisely, if the child is typed $B$, then the character
`8` (or the synonym `B` of MapMaker) will be used. Otherwise the
character `7` = 1+2+4 (or the synonym `D` of MapMaker) will be used to
represent the fact that we simply know that the child genotype is
either $AA$, $AB$ or $BA$.

Conversely, if $B$ dominates $A$, the code `1` (resp. `e`)
will be used if the child is typed $A$ (resp. $B$). The respective synonyms
`A` and `C` of MapMaker can also be used.

The 1:1:1:1 segregation ratio occurs when different alleles appear in the
father and in the mother, for example in the cross $AB \times CD$. In this
case the child, written with the allele inherited from the father first,
is either $AC$, $AD$, $BC$ or $BD$, and the corresponding codes are
respectively `1` (or `A`), `2`, `4` and `8` (or `B`).

When some data is missing, it is still possible to give information to
Carthagène. Imagine for example that the child is typed $A$ and the second
allele is unknown. Because we have 4 different alleles, we know for sure
that the first allele of the child is $A$. Therefore, the only possible
genotypes are $AC$ (code `1`) or $AD$ (code `2`) and the
corresponding code is `3` = 1+2.

Imagine that instead of having 4 alleles, we have 3 alleles in
$AB \times AC$. Then children may be $AA$, $AC$, $BA$ or $BC$, i.e., there
is still a 1:1:1:1 segregation ratio. If again the child is just typed $A$,
there is more indetermination at hand: the child may be either $AA$ (code
`1`), $AC$ (code `2`) or $BA$ (code `4`). In this case, the
corresponding code is `7` = 1+2+4 (or the synonym `D`).

Imagine the second parent is homozygous at the current locus, i.e., we cross
$f_1 f_2$ with $m m$ (a backcross-like situation, with a 1:1 segregation
ratio). The children's genotypes may then be either $f_1 m$ or $f_2 m$. If we
observe $f_1 m$ (or simply $f_1$), it is not known where the $m$ comes from,
i.e., the child may be $f_1 m_1$ or $f_1 m_2$. This case is encoded by
`3` = 1+2. This is the sum of `1` and `2`, which correspond to the two
possible cases: $m$ coming from one grandparent (code `1`) or the other
(code `2`).

Similarly, if we observe $f_2 m$, then we know that the first allele is
$f_2$ and the second is $m$, but we don't know where this allele comes from.
The code will be `c` = 12 = 4+8.

If the homozygosity appears on the first parent ($f f \times m_1 m_2$)
instead of the second one and if we observe $f m_1$, we get the code `5`,
the sum of `1` and `4`, corresponding to the fact that the origin of
the first allele is unknown. If we observe $f m_2$, we get the code `a`,
which in hexadecimal corresponds to 10, the sum of `8` and
`2`.
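The codes derived in the cases above can be reproduced by enumerating the four possible transmissions and keeping those compatible with what was observed. The Python sketch below is an illustration only. The function `outbred_code` and its `complete` flag are hypothetical, the allele names are arbitrary, and dominance is not modelled: an observation is either the child's full set of alleles (`complete=True`) or a set of alleles the child is merely known to carry (`complete=False`).

```python
from itertools import product


def outbred_code(father, mother, observed, complete=True):
    """Hexadecimal code for a child, given the phased parental alleles.

    father, mother: pairs of allele names, haplotype 1 then haplotype 2.
    observed: alleles seen in the child.
    complete: True if `observed` is the child's full allele set, False if
    the child is only known to carry the observed allele(s).
    """
    seen = set(observed)
    total = 0
    # product() yields (f1, m1), (f1, m2), (f2, m1), (f2, m2), which are
    # the genotypes associated with the values 1, 2, 4 and 8 respectively.
    for bit, (f, m) in zip((1, 2, 4, 8), product(father, mother)):
        child = {f, m}
        compatible = (child == seen) if complete else (seen <= child)
        if compatible:
            total |= bit
    if total == 0:
        raise ValueError("observation is incompatible with the parents")
    return format(total, "x")


# Three alleles, child only known to carry A: codes 1 + 2 + 4.
print(outbred_code(("A", "B"), ("A", "C"), {"A"}, complete=False))
# Second parent homozygous, both alleles of the child observed: 4 + 8.
print(outbred_code(("A", "B"), ("C", "C"), {"B", "C"}))
```

Treating an observation as a set of alleles keeps the sketch short; markers with dominance would need an extra step mapping phenotypes to allele sets.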

Thomas Schiex 2009-10-27