Road map to BEAP
Overview
Since many options were inherently flexible in BEAP, we wanted to test a
number of scenarios that could alter BEAP performance. To determine how
BEAP output was changed when BLAST settings were altered, we varied the
E-value stringency, word size and number of databases. To determine how
different sequence attributes would affect BEAP performance, we compared
output generated when using intronic, exonic, exonic plus untranslated
region (UTR), and variably repetitive sequence as the template
sequence. To determine how template sequence size altered BEAP
performance, we tested individual template sequence sizes, the number of
sequences used as template, and the total amount of sequence used as
template. Last, we investigated the difference in BEAP output when using
local (megaBLAST) vs. network BLAST queries. All trial runs of BEAP used
the same settings, with the exception of the factor tested. The same
template sequence was used for testing in each case, with the exception of
tests looking at repetitive sequence content. Tests of increased sequence
size extended the sequence used in the other trials with additional
contiguous sequence. Default test settings used network BLAST to query six
bovine databases at NCBI: BAC end sequence, trace expressed sequence
tagged sites, trace other sequence, trace whole genome sequence, high
throughput genome sequence, and EST databases. The E-value was set to e-30
except when the effect of E-value itself was being tested. The database queries
using megaBLAST were always the same, and included the Bos taurus unique
sequence and Bos taurus tiger gene databases. In some cases, following
initial tests, the E-value was adjusted as needed to attempt to force BEAP
to run when no result could be obtained. These changes are noted and
reported only when the initial test failed to create contigs and the
revised method succeeded.
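The one-factor-at-a-time design described above can be sketched as follows. This is only an illustration of the trial layout, not BEAP's own configuration format: the dictionary keys, the function name `one_factor_trials`, and the example E-value levels are all assumptions, and "e-30" is read here as the numeric threshold 1e-30.

```python
# Hypothetical settings layout; BEAP itself is not invoked here.
DEFAULTS = {
    "blast_mode": "network",
    "evalue": 1e-30,  # "e-30" in the text, interpreted as 1e-30
    "databases": (
        "BAC end sequence",
        "trace expressed sequence tagged sites",
        "trace other sequence",
        "trace whole genome sequence",
        "high throughput genome sequence",
        "EST",
    ),
}


def one_factor_trials(defaults, factor, levels):
    """Return one settings dict per level, changing only `factor`."""
    trials = []
    for level in levels:
        settings = dict(defaults)  # shallow copy so the defaults stay untouched
        settings[factor] = level
        trials.append(settings)
    return trials


# Illustrative E-value levels (assumed; not the values tested in the paper).
evalue_trials = one_factor_trials(DEFAULTS, "evalue", [1e-10, 1e-30, 1e-50])
```

Each returned dictionary differs from the defaults in exactly one key, which mirrors the statement that all trial runs used the same settings except for the factor under test.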
Template Sequence
The user must define the appropriate template sequence and species. BLAST
uses the template sequence, much as PCR uses primers, to query the
species of interest. Cross-species comparative maps (e.g. radiation hybrid
(RH) maps) can be used to identify syntenic sequence blocks between
species and so find a suitable template sequence.
Retrieval of template sequence
In our test case, we defined the "best" template sequence as that obtained
from the Human-Bovine RH map. Bovine genetic markers were used to find the
human syntenic gene block that corresponded to the bovine chromosomal
block of interest. Genes and pseudogenes were deemed template sequence,
as the conservation between species is greatest in genic regions. Human
gene sequences were used as the "template" to identify the orthologous
genes in cattle. All template sequences were retrieved from the UCSC "golden
path" genome browser at http://genome.ucsc.edu/ and the ENSEMBL database
at http://www.ensembl.org/. Repetitive sequence elements are
often a hindrance to genomic assembly programs because they occur at
multiple sites across the genome with wide variation in flanking genomic
sequence. Repetitive sequences in the template (i.e. human sequence) were
masked with RepeatMasker software prior to use of BEAP. This was readily
done with the repeat masking feature of the table browser at the UCSC
website when retrieving the template sequence. Cattle
(Bos taurus) sequence was used in all tests of BEAP performance. The
sequences obtained in the application to the bovine dwarfism locus used
the sequence databases available in 2005, prior to full assembly. The BEAP
performance trials utilized the whole bovine genome sequence, version 3,
from 2007.
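As an illustration of the masking step, the sketch below assumes the template was downloaded with repeats soft-masked, i.e. shown in lowercase, which is one of the ways UCSC can return repeat-annotated sequence. It converts those bases to N and reports the fraction of the sequence that is masked. The function names are hypothetical and are not part of BEAP or RepeatMasker.

```python
def hard_mask(sequence: str) -> str:
    """Replace lowercase (soft-masked) bases with 'N'; other characters are kept."""
    return "".join("N" if base.islower() else base for base in sequence)


def masked_fraction(sequence: str) -> float:
    """Fraction of the sequence that is lowercase, i.e. flagged as repeat."""
    if not sequence:
        return 0.0
    return sum(base.islower() for base in sequence) / len(sequence)


example = "ACGTacgtacGGCC"
masked = hard_mask(example)  # "ACGTNNNNNNGGCC"
```

The masked fraction gives a simple way to compare templates with different levels of repetitive content, as in the repeat-content tests mentioned in the overview.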
Use Case: An Application
Use BEAP to construct contigs within the Angus dwarfism locus
Fine-mapping of the Angus dwarfism locus resulted in a critical region of
roughly 1-2 Mbp on Bos taurus autosome 6. Because the bovine genome
had not been fully sequenced when BEAP was first applied, many of the
candidate genes in this genomic region were unknown and unannotated. The
Human-Bovine RH map was used to define the template sequence, allowing for
some extra sequence proximal and distal to the homologous bovine
chromosome block. Homo sapiens autosome 4 genomic DNA sequence from
78,000,000 to 83,000,000 base pairs was defined as the "template" for BEAP
assembly of the bovine sequence. This genomic block contained 20 genes and
pseudogenes. The
template sequence used by BEAP included both exonic and UTR sequences. We
used RH markers within genes in both the bovine and human genome builds to
anchor one map to the other. An E-value of e-30 was used for all tests.
Databases tested were the same six listed above as default for the network
BLAST tests.
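To make the role of the E-value threshold concrete, the sketch below filters BLAST hits given in the standard 12-column tabular output, where the E-value is the eleventh column. How BEAP applies the threshold internally is not described here, so this is only an illustration; the function name and the sample hits are invented for the example.

```python
def filter_hits(tabular_lines, max_evalue=1e-30):
    """Keep (subject_id, evalue) for hits at or below the E-value threshold."""
    kept = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])  # 11th column of the tabular format
        if evalue <= max_evalue:
            kept.append((fields[1], evalue))
    return kept


# Invented hits: query, subject, %identity, length, mismatches, gap opens,
# query start/end, subject start/end, E-value, bit score.
sample = [
    "tmpl\thitA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-45\t350",
    "tmpl\thitB\t85.0\t120\t18\t0\t1\t120\t5\t124\t2e-12\t90",
]
strong_hits = filter_hits(sample)
```

With a threshold of 1e-30 only the first hit is retained; relaxing the threshold, as was done when an initial run produced no contigs, admits weaker matches.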