The sequence database compilers cooperate extensively: EMBL, DDBJ (DNA Data Bank of Japan), and GenBank exchange new sequences daily. The vast majority of the sequences in GenBank are also in EMBL.
The DNA databases, in particular, hold identical information for each sequence but organise it differently. Compare the header information for the HSHEPSH sequence as stored in EMBL vs. GenBank.
EMBL Format
ID   HSHEPSH standard; RNA; PRI; 2363 BP.
XX
AC   X07732; M18930;
XX
DT   16-JUL-1988 (Rel. 16, Created)
DT   22-SEP-1995 (Rel. 45, Last updated, Version 9)
XX
DE   Human hepatoma mRNA for serine protease hepsin
XX
KW   hepsin; membrane protein; serine protease; zymogen.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Animalia; Metazoa; Chordata; Vertebrata; Mammalia;
OC   Theria; Eutheria; Primates; Haplorhini; Catarrhini; Hominidae.
...
GenBank Format
LOCUS       HSHEPSH 2363 bp RNA PRI 22-SEP-1995
DEFINITION  Human hepatoma mRNA for serine protease hepsin.
ACCESSION   X07732 M18930
KEYWORDS    hepsin; membrane protein; serine protease; zymogen.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Sarcopterygii; Mammalia; Eutheria; Primates;
            Catarrhini; Hominidae; Homo.
...
In the example above, the accession number for HSHEPSH is X07732. It also has a secondary accession number, M18930, probably indicating another sequence was combined with HSHEPSH.
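As an illustration of how the primary and secondary accession numbers can be read from an EMBL header, here is a minimal Python sketch. It assumes each header line starts with a two-letter line code (ID, AC, DE, ...) followed by spaces, and that XX lines are empty separators, as in the HSHEPSH example. The function names `parse_embl_header` and `accessions` are hypothetical and are not part of E/GCG.

```python
def parse_embl_header(text):
    """Group EMBL header lines by their two-letter line code (assumed layout)."""
    fields = {}
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        value = value.strip()
        # Skip XX separator lines, blank lines, and anything without a 2-letter code.
        if len(code) != 2 or code == "XX" or not value:
            continue
        fields.setdefault(code, []).append(value)
    return fields


def accessions(fields):
    """Return all accession numbers from the AC lines, in the order listed."""
    accs = []
    for line in fields.get("AC", []):
        accs.extend(a.strip() for a in line.split(";") if a.strip())
    return accs


# Abbreviated copy of the HSHEPSH header shown above.
HEADER = """\
ID   HSHEPSH standard; RNA; PRI; 2363 BP.
XX
AC   X07732; M18930;
XX
DE   Human hepatoma mRNA for serine protease hepsin
"""

fields = parse_embl_header(HEADER)
accs = accessions(fields)
primary, secondary = accs[0], accs[1:]
print("primary:", primary)
print("secondary:", secondary)
```

The first accession on the AC line is treated as the primary one and any others as secondary, which matches the way the HSHEPSH example is described.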
Each sequence database has a corresponding data library, usually named after the database. For example, EMBL, SwissProt, and GenBank are the names of databases, and are also the logical names of E/GCG data libraries. The GenEMBL data library represents a fusion of EMBL with GenBank.
All these data library names have short forms to save typing: em refers to EMBL, gb refers to GenBank, ge refers to GenEMBL, etc.
To specify a particular sequence in a particular data library, you give the logical name (or short form) of the data library together with the sequence identifier, separated by a colon. "gb:humrep2" specifies the humrep2 sequence from GenBank.
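The library:entry convention can be sketched in a few lines of Python. This is only an illustration of how such a specification splits into its two parts; it is not how E/GCG itself resolves names. The `SHORT_FORMS` table contains just the three short forms mentioned above, the function name `parse_spec` is hypothetical, and treating library names as case-insensitive is an assumption.

```python
# Short forms named in the text; the real E/GCG installation defines many more.
SHORT_FORMS = {"em": "EMBL", "gb": "GenBank", "ge": "GenEMBL"}


def parse_spec(spec):
    """Split 'library:entry' into (library name, entry identifier)."""
    library, sep, entry = spec.partition(":")
    if not sep or not library or not entry:
        raise ValueError(f"expected 'library:entry', got {spec!r}")
    # Assumption: library names are matched without regard to case.
    library_name = SHORT_FORMS.get(library.lower(), library)
    return library_name, entry


print(parse_spec("gb:humrep2"))
print(parse_spec("EMBL:hshepsh"))
```

A specification with no colon, or with nothing on one side of it, is rejected rather than guessed at.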
Logical Name | Abbreviation | Subsection Accessed |
---|---|---|
phage:* | ph:* | Bacteriophages |
Viral:* | vi:* | Viral |
Bacterial:* | ba:* | Bacterial (prokaryotes) |
Eukaryote:* | or:* | Eukaryote organelles |
Organelle:* | or:* | Organelle sequences |
Fungal:* | fun:* | Fungal (EMBL only) |
Plant:* | pl:* | Plant (includes fungi in Genbank) |
Invertebrate:* | in:* | Invertebrates |
Human:* | hu:* | Human sequences |
Rodent:* | ro:* | Rodent sequences |
Primate:* | pr:* | Primate sequences |
other_mammalian:* | om:* | Other Mammalian (not primate or rodent) |
Other_vertebrate:* | ov:* | Other Vertebrate |
sts:* | sts:* | Sequence-tagged site sequences (NEW) |
est:* | est:* | Expressed sequence tags (NEW) |
tags:* | tags:* | STSs and ESTs (NEW) |
Structural:* | st:* | Structural RNA |
Synthetic:* | sy:* | Synthetic |
Unclassified:* | un:* | Unclassified |
Patent:* | pat:* | Patented sequences |
There are three relatively new DNA database divisions available as E/GCG data libraries: sequence-tagged sites, expressed sequence tags, and the union of these two, called simply "tags". These subsections have grown so quickly in number that if you wish to include these sequences in a database search, you must now ask for them explicitly.
Data Accessed | GenEMBL | EMBL | GenBank |
---|---|---|---|
Entire sequence database | GenEMBLPlus:*, geplus:*, gep:* | EMBLPlus:*, emplus:*, emp:* | GenBankPlus:*, gbplus:*, gbp:* |
All sequences except tags | genembl:*, ge:* | embl:*, em:* | genbank:*, gb:* |
Only tags | tags:* | em_tags:* | gb_tags:* |
Data Accessed | SwissProt | PIR | TREMBL |
---|---|---|---|
Entire sequence database (Annotated in PIR) | swissprot:*, swiss:*, sw:* | protein:*, prot:*, pir1:* | not avail |
PIR Preliminary sequences | | pir2:* | |
PIR Unverified seqs | | pir3:* | |
PIR Unencoded/untranslated seqs | | pir4:* | |
prompt> lookup

LookUp identifies sequences by name, accession number, author, organism, keyword, title, reference, feature, definition, length, or date. The output is a list of sequences.

The LookUp program is experimental in this release--please look carefully at your results.

LOOKUP in what sequence libraries:

a) sw_release
b) pir
c) embl
d) genbank
e) em_tags
f) gb_tags
g) gb_new
h) em_new
i) sw_new
j) epd
k) All libraries
q) quit

Please choose one or more (* k *): c

... a new screen is written ...

Complete the query form below:

All text:
Definition: mRNA
Author:
Keyword:
Sequence name:
Accession number:
Organism: Carassius auratus
Reference:
Title:
Feature:
On or after (dd-mmm-yy):
On or before (dd-mmm-yy):
Shortest sequence length:
Longest sequence length:
Inter-field operator: AND
Form of output list: Whole Entries

Press <Ctrl>D to continue.

Searching embl
53 entries were found.

Do you wish to:

1) write out this list to a file
2) preview the results
3) refine the query
4) choose different libraries
q) quit

Please choose one (* 1 *):
What should I call the output file (* lookup.list *) ? .

53 entries were written to "lookup.list"
prompt>
The resulting file "lookup.list" contains the set of EMBL database sequence entries, with comments describing the sequences indicated by an exclamation mark:
prompt> more lookup.list
LOOKUP in: embl of: "([SQ-DEF: mRNA*] & [SQ-ORG: Carassius auratus*])"
53 entries October 27, 1995 11:05 ..
EM_OV:CA07056 ! ID: a0000103
! DE Carassius auratus homeobox protein mRNA, complete cds.
EM_OV:CA08016 ! ID: a1000103
! DE Carassius auratus kainate receptor beta subunit mRNA, complete cds.
EM_OV:CA08017 ! ID: a2000103
! DE Carassius auratus kainate receptor alpha subunit mRNA, complete
! DE cds.
EM_OV:CA12018 ! ID: a3000103
! DE Carassius auratus glutamate receptor 4 (glur4) mRNA, partial cds.
...
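If you transfer a list file to another machine, the entry names and their comments can be pulled out with a short script. The Python sketch below assumes the layout shown above: everything up to and including the line ending in ".." is heading text, each entry specification sits on its own line, and anything after an exclamation mark is a comment, with comment-only lines belonging to the preceding entry. The function name `parse_list_file` is hypothetical, and the sample text is an abbreviated copy of the listing above.

```python
def parse_list_file(text):
    """Return a list of {'spec': ..., 'comments': [...]} from a list file."""
    entries = []
    in_body = False
    for raw in text.splitlines():
        line = raw.strip()
        if not in_body:
            # Heading text runs up to and including the line ending in "..".
            if line.endswith(".."):
                in_body = True
            continue
        if not line:
            continue
        spec, _, comment = line.partition("!")
        spec, comment = spec.strip(), comment.strip()
        if spec:
            entries.append({"spec": spec, "comments": [comment] if comment else []})
        elif entries and comment:
            # A comment-only line continues the previous entry's description.
            entries[-1]["comments"].append(comment)
    return entries


SAMPLE = """\
LOOKUP in: embl of: "([SQ-DEF: mRNA*] & [SQ-ORG: Carassius auratus*])"
53 entries October 27, 1995 11:05 ..
EM_OV:CA07056 ! ID: a0000103
! DE Carassius auratus homeobox protein mRNA, complete cds.
EM_OV:CA08017 ! ID: a2000103
! DE Carassius auratus kainate receptor alpha subunit mRNA, complete
! DE cds.
"""

entries = parse_list_file(SAMPLE)
for entry in entries:
    print(entry["spec"], "|", " ".join(entry["comments"]))
```

Keeping the comments attached to each entry makes it easy to scan the descriptions before deciding which sequences to fetch.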
prompt> lookup -out=rhodopsin.list
...Choose EMBL as the database ...
...Enter rhodopsin in the "All text:", "Definition:", & "Keyword:" fields, selecting OR as the "Inter-field operator:" ...
...Press <CTRL>D to continue, and accept the remaining defaults.
prompt> more rhodopsin.list
To copy a sequence entry from one of the E/GCG data libraries to a UNIX file, use the programme called fetch. It takes the database:entry you want as its argument. fetch responds by describing itself, and then prints the filename it has copied the database entry to.
prompt> fetch gb:hsef2
FETCH copies GCG sequences or data files from the GCG database into your directory or displays them on your terminal screen.
hsef2.gb_pr
The name of the new UNIX file holding the E/GCG format sequence data is "hsef2.gb_pr". Because it is a normal UNIX file, you may use any normal UNIX commands on it. You can type it to the screen (using "more"), delete it (using "rm"), edit it (please use "seqed", NOT "pico, vi, emacs, etc."!), transfer it to your local site over the computer network, and use it as an input file to other E/GCG programs.
prompt> fetch ge:hsef2
prompt> more hsef2.ge_pr
prompt> etc.
prompt> typedata ge:hsef2 | more
prompt> etc.
This can be frustrating if you want to fetch long sequences, rather than search through data libraries! Retrieving complete long sequences is easier with specialist sequence retrieval programmes like SRS.