Formatting the data

CARTHAGENE uses a dataset format inspired by MapMaker [LGA+87]. Every dataset is described by two header lines followed by the data itself, with one marker described per line.

The first line indicates the type of the data: backcross, RIL, F2 intercross, outbreds, haploid or diploid radiated hybrids, order, etc. See the following sections for more details.

The second line indicates the size of the data set: the number of individuals/hybrids typed, followed by the number of markers. For compatibility with MapMaker, it may contain 2 further numbers, which CARTHAGENE ignores, and alias declarations, which CARTHAGENE uses to analyze the typing data in the last section of the file. An alias declaration specifies that a given character (e.g. 0) will be used instead of a default one (e.g. A) to encode marker typing data. Such an alias is specified by the statement 0=A in the second line.
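As an illustration, the size line can be read as in the following Python sketch. The function name parse_size_line is hypothetical and is not part of CARTHAGENE. The sketch assumes that fields are separated by whitespace and that each alias is written without spaces around the = sign, as in 0=A.

```python
def parse_size_line(line):
    """Read the second header line of a dataset (illustrative sketch only)."""
    fields = line.split()
    # The first two fields are the number of individuals and of markers.
    n_individuals = int(fields[0])
    n_markers = int(fields[1])
    aliases = {}
    for field in fields[2:]:
        if "=" in field:
            # "0=A" means: character 0 is used in place of the default A.
            used, default = field.split("=", 1)
            aliases[used] = default
        # Any other field (the extra MapMaker numbers) is skipped.
    return n_individuals, n_markers, aliases


print(parse_size_line("40 4 0 0 1=H 0=A"))
# prints (40, 4, {'1': 'H', '0': 'A'})
```

The returned dictionary maps each character actually used in the file to the default character it stands for, so typing data can be translated back to the default convention before further processing.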

The last section contains marker typing data. Each line describes one marker. A line begins with a star character (*) immediately followed by the marker name, which is separated from the rest of the line by tabs or spaces. The rest of the line gives the typing data for the marker, one individual (hybrid) after the other. The number of individuals reported on each line must match the number of individuals indicated in the second line of the file, and the number of markers (lines of data) must not be lower than the number of markers indicated in the second line of the file (extra lines, if any, are ignored).
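The rules above can be sketched as a small Python check. This is not CARTHAGENE's own parser: the function name read_marker_lines and the marker names in the usage example are hypothetical, and the sketch assumes one typing character per individual, ignores any whitespace inside the typing data, and skips blank lines.

```python
def read_marker_lines(lines, n_individuals, n_markers):
    """Collect (name, typing) pairs for the first n_markers data lines."""
    markers = []
    for line in lines:
        if len(markers) == n_markers:
            break  # extra lines beyond the declared marker count are ignored
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are skipped
        # The star must be immediately followed by the marker name.
        if not line.startswith("*") or len(line) < 2 or line[1].isspace():
            raise ValueError("expected '*' followed by a marker name: %r" % line)
        parts = line[1:].split(None, 1)  # name, then the rest of the line
        name = parts[0]
        # Assumption: whitespace inside the typing data is not significant.
        typing = "".join(parts[1].split()) if len(parts) > 1 else ""
        if len(typing) != n_individuals:
            raise ValueError("marker %s: %d individuals typed, %d expected"
                             % (name, len(typing), n_individuals))
        markers.append((name, typing))
    if len(markers) < n_markers:
        raise ValueError("only %d markers found, %d expected"
                         % (len(markers), n_markers))
    return markers


# Hypothetical data: 5 individuals, 2 markers declared, one extra line.
example = ["*M1 0101-", "*M2\t11001", "*M3 00000"]
print(read_marker_lines(example, 5, 2))
# prints [('M1', '0101-'), ('M2', '11001')]
```

A list of pairs is used rather than a dictionary so that the order of the markers in the file is preserved and repeated names are not silently merged.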

The following is an example of a haploid radiation hybrid dataset with 4 markers and 40 hybrids, where the default A (Absent) and H (Here) typing characters have been aliased to 0 and 1 respectively.

data type radiated hybrid
40 4 0 0 1=H 0=A
*N7b 110110110-101100001011011111111000111010
*A11 0101001101101100001011011001111000101110
*N4 11010011011011000010110110011110001111101
*rA20 1101001101101100101011010001111111101010

The next section specifies which header line and which typing conventions must be used to encode marker data, depending on the dataset type.



Thomas Schiex 2009-10-27