Frequently Asked Questions:  Indexing of Sequence Identifiers

Copyright © 2001,2002 Warren R. Gish.

All Rights Reserved.

Last updated: 2002-09-26

 

 


What sequence identifiers do xdformat and xdget recognize?

 

All NCBI standard FASTA sequence identifiers (NSIDs) are supported for indexing. User-definable, uncontrolled identifiers (UCIDs) consisting of arbitrary text strings are also supported. The complete list of NSIDs is presented in Table 1.  Note: the NSIDs include three user types denoted by the tags: lcl, gnl, and oth. In contrast, identifiers of the flexible UCID class do not use any tags. For a more complete description of UCIDs, see below.

 

Table 1. The complete list of NCBI standard FASTA sequence identifiers (NSIDs)

 

 

Tag and Identifier Syntax         Identifier Source Description
bbm|integer                       NCBI GenInfo Backbone database identifier
bbs|integer                       NCBI GenInfo Backbone database identifier
dbj|coll-accession|locus          DNA Database of Japan
emb|coll-accession|entry          EBI EMBL Database
gb|coll-accession|locus           NCBI GenBank database
gi|integer                        NCBI GenInfo Integrated Database (“jee-aye”)
gim|integer                       NCBI GenInfo Import identifier
gnl|database|idstring             General (user-definable) database and identifier
gp|coll-accession|locus_cds#      GenPept (GenBank protein) identifier
lcl|integer                       Local (user-definable) identifier
oth|accession|name|release        Other (user-definable) identifier*
pat|country|patentid|serialno     Patent sequence identifier
pdb|entry|chainid                 Brookhaven Protein Database
pir|accession|entry               Protein Information Resource International
prf|accession|name                Protein Research Foundation
ref|coll-accession|locus          NCBI RefSeq
sp|coll-accession|locus           SWISS-PROT database
tpd|coll-accession|name           Third party annotation, DDBJ
tpe|coll-accession|name           Third party annotation, EMBL
tpg|coll-accession|name           Third party annotation, GenBank

 

*The NCBI has discontinued support for “oth” identifiers, but support for them is maintained in xdformat/xdget.

 


Are Accessions from the DDBJ/EBI/NCBI collaboration treated specially?

 

Yes. While “accession” appears in several of the identifiers described above, Accessions assigned by the International Nucleotide Sequence Database Collaboration between the DDBJ, EBI (EMBL) and NCBI are guaranteed unique by these organizations. To reflect their special nature, the collaboration’s Accessions are labeled coll-accession in Table 1. These Accessions are all treated as being derived from the same identifier name space. Consequently, xdget can retrieve a sequence by Accession (or rather coll-accession) without having to know specifically which of the collaborating organizations assigned the identifier. Locus and Entry identifiers do not work this way, however, as the uniqueness of these identifiers is not controlled between the collaborators.

 


What is a compound identifier?

 

A compound identifier is a concatenation of multiple NCBI standard FASTA sequence identifiers (NSIDs) each separated from the next by a single vertical-bar character, ‘|’ (also known as the “logical-or”, “pipe”, “pling”, “gozinta” or “pipesinta” character). White space (e.g., one or more blank or tab characters) is used to delimit the identifier string from the accompanying sequence description.

 

Here is an example of a definition line containing a simple or atomic sequence identifier:

 

>gi|12346 hypothetical protein 185 – wheat chloroplast

 

Here is an example of a compound identifier, containing both a gi and a gp (GenPept) identifier:

 

>gi|12346|gp|CAA44030.1|CHTAHSRA_4 hypothetical protein 185 – wheat chloroplast

 

While the order of identifiers in a compound identifier is technically irrelevant, gi identifiers typically appear first.

 


What is a compound definition?

 

A compound definition is a concatenation of multiple component definitions, each separated from the next by a single Control-A character (sometimes symbolized ^A; hex 0x01; or ASCII SOH [start of header]). Compound definitions are frequently seen in “nr” (quasi-non-redundant) databases, where multiple instances of the exact same sequence are replaced by a single instance of the sequence with a concatenated definition line. Note: each component of a compound definition begins with an identifier which itself may be compound.
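The two levels of structure described above can be illustrated with a minimal Python sketch. It is not part of xdformat or xdget; the function name is hypothetical, and the second component in the example line is made up for illustration.

```python
CTRL_A = "\x01"  # Control-A separates the component definitions

def split_definition(defline):
    """Split a definition line into (identifier_string, description) pairs."""
    if defline.startswith(">"):
        defline = defline[1:]
    components = []
    for part in defline.split(CTRL_A):
        # White space delimits the identifier string from the description.
        fields = part.strip().split(None, 1)
        if not fields:
            continue
        ident = fields[0]
        desc = fields[1] if len(fields) > 1 else ""
        components.append((ident, desc))
    return components

# The first component comes from the example above; the second is made up.
example = (">gi|12346|gp|CAA44030.1|CHTAHSRA_4 hypothetical protein 185"
           "\x01gi|99999 a second, made-up component")
for ident, desc in split_definition(example):
    print(ident, "->", desc)
```

Each identifier string returned here may itself be compound, so it would still need to be split on the vertical-bar character.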

 


Can I use my own identifiers?

 

Yes, xdformat can index uncontrolled identifiers of your choosing (UCIDs), either alone or in combination with NCBI standard FASTA sequence identifiers (NSIDs).  A UCID consists merely of a non-blank string of text, lacking any identifier tag that would be required of an NSID.

 

UCIDs are subject to a few restrictions:

 

The purpose of imposing the above restrictions on UCIDs is to aid in the detection of syntax errors on input.

 

When an error is encountered in the left-to-right parsing of a string of identifiers, parsing stops and all subsequent identifiers in the current identifier string are ignored.  Any identifiers parsed correctly prior to the error are indexed. In the case of a compound definition line, parsing and indexing resume at the identifier string in the next component definition. Regardless of whether any syntax errors are detected in the identifiers, the entire definition line will be stored in the XDF database “as is”.

 

Here are a few examples of definition lines whose identifiers will all be completely parsed and indexed. All but the first two examples contain a compound identifier.

 

>gi|12346

>MYID001 my first sequence (NOTE: UCID is acceptable as the first identifier, iff it is the only identifier in the string)

>gi|5902966|gp|AAD55586.1|AF055084_1 very large GPCR-1 [Homo sapiens]

>gp|AAD55586.1|AF055084_1|gi|5902966 (NOTE: order of NSIDs is unimportant)

>gp|AAD55586.1|AF055084_1| very large GPCR-1 [Homo sapiens] (NOTE: vertical-bar is acceptable at end of identifier string)

>gp|AAD55586.1|AF055084_1|gi|5902966|MYID001 my first sequence (NOTE: UCID at end of identifier string will be properly indexed)

 

 

Here are a few examples of improperly constructed strings that will cause an identifier – or the entire string of identifiers – to be omitted from the index.

 

>gi|5902966|gp|AAD55586.1 very large GPCR-1 [Homo sapiens] (NOTE: gp identifier is missing the locus token and will be skipped)

>fb|AAD55586.1|AF055084_1|gi|5902966 (NOTE: unrecognized tag “fb”; none of the identifiers will be indexed)

>gi|5902966|MYID001|gp|AAD55586|AF055084_1 (NOTE: UCID not listed last; gp identifier will not be indexed)

>MYID001|gp|AAD55586.1|AF055084_1|gi|5902966 (NOTE: UCID not listed last; none of the subsequent identifiers will be indexed)
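The left-to-right parsing behavior described above can be approximated by the following Python sketch. It is only an illustration, not the actual xdformat parser: the function name is hypothetical, the field counts are read off Table 1, and the sketch assumes that a misplaced UCID is itself not indexed, which the examples above do not settle.

```python
# Number of fields that follow each tag, taken from Table 1.
FIELD_COUNT = {
    "bbm": 1, "bbs": 1, "dbj": 2, "emb": 2, "gb": 2, "gi": 1, "gim": 1,
    "gnl": 2, "gp": 2, "lcl": 1, "oth": 3, "pat": 3, "pdb": 2, "pir": 2,
    "prf": 2, "ref": 2, "sp": 2, "tpd": 2, "tpe": 2, "tpg": 2,
}

def parse_identifiers(ident_string):
    """Return the identifiers parsed before the first syntax error."""
    tokens = ident_string.split("|")
    parsed = []
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        rest = tokens[i + 1:]
        if tag == "" and not rest:
            break  # a trailing vertical bar is acceptable
        if tag in FIELD_COUNT:
            n = FIELD_COUNT[tag]
            fields = rest[:n]
            if len(fields) < n:
                break  # too few fields: stop, keep what was parsed so far
            parsed.append((tag, tuple(fields)))
            i += 1 + n
        elif not rest:
            parsed.append(("user", (tag,)))  # UCID in the last position
            i += 1
        else:
            break  # unrecognized tag, or UCID not listed last (assumed not indexed)
    return parsed

print(parse_identifiers("gp|AAD55586.1|AF055084_1|gi|5902966|MYID001"))
print(parse_identifiers("gi|5902966|gp|AAD55586.1"))
```

The first call parses all three identifiers; the second keeps only the gi identifier, because the gp identifier lacks its locus field.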

 


Are all identifiers indexed in a compound definition line?

 

Yes, assuming no parse errors are encountered in any of the identifier strings among all component definition lines, all of the identifiers are indexed by default. If only a subset of identifier types needs to be indexed for later use in retrieval, indexing can be restricted to a subset of types with one or more -T specifications on the xdformat command line. Similarly, indexed retrieval can be restricted to a subset of identifier types by specifying one or more -T specifications on the xdget command line. Of course, -T restrictions are only effective if the corresponding identifiers actually appear in the database.

 

Any -T index restrictions imposed during database creation on the xdformat command line automatically (and unconditionally) remain in effect during appends of additional data to the same database; the restrictions need not be replicated on the xdget command line unless even tighter restrictions are desired during retrieval. Tighter restrictions upon retrieval can be obtained by specifying a subset of the -T restrictions originally indicated on the xdformat command line.

 

The size of the index and the speed of index creation and retrieval will be improved by limiting the index to those identifiers of interest.

 

NOTE: The left-to-right order of multiple -T specifications may be important in future versions of xdformat and xdget.

 


How can indexing and retrieval be restricted to my own identifiers?

 

Just as the -T<tag> option can be used to restrict indexing and retrieval to a subset of NSIDs, the special tag specification -Tuser will restrict indexing to UCIDs. NSID and UCID restrictions can be combined on the same command line. For example, “xdformat -Tuser -Tgi …” will restrict indexing to UCIDs and NCBI gi identifiers.

 


What is a “redundant” identifier?

 

When the definition line for a single sequence record contains multiple instances of the same identifier within the same name space, each instance following the first is called redundant. Redundant identifiers may appear in the same or different components of a compound definition line. Depending on circumstances, redundant identifiers may or may not be problematic, because they all refer to (are associated with) the same sequence record.

 

The xdformat program reports redundant identifiers.

 


What is a “duplicate” identifier?

 

When a database contains instances of the same identifier in a name space in different sequence records, the identifiers are called duplicate. Duplicate identifiers are more prone to being problematic than redundant identifiers, because the association between database records (sequences) and duplicate identifiers is not unique. An identifier can be both redundant and duplicate.

 

The xdformat program reports duplicate identifiers.
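The distinction between redundant and duplicate identifiers can be made concrete with a short Python sketch. It is not how xdformat detects them; the function name and the sample records are hypothetical, and each identifier is modeled simply as a (name space, identifier) pair.

```python
from collections import defaultdict

def classify(records):
    """records: one list of (name_space, identifier) pairs per sequence record.

    Returns (redundant, duplicate) as sets of (name_space, identifier) pairs.
    """
    seen_in = defaultdict(set)  # pair -> indices of the records containing it
    redundant = set()
    for idx, idents in enumerate(records):
        counts = defaultdict(int)
        for key in idents:
            counts[key] += 1
            seen_in[key].add(idx)
        # Repeated within one record: redundant.
        redundant.update(k for k, c in counts.items() if c > 1)
    # Present in more than one record: duplicate.
    duplicate = {k for k, recs in seen_in.items() if len(recs) > 1}
    return redundant, duplicate

records = [
    [("gi", "100"), ("gi", "100")],        # gi 100 repeated within one record
    [("gi", "200"), ("user", "MYID001")],
    [("gi", "200")],                       # gi 200 also appears in another record
]
redundant, duplicate = classify(records)
print("redundant:", redundant)
print("duplicate:", duplicate)
```

An identifier repeated within one record and also present in another record would land in both sets, matching the remark that an identifier can be both redundant and duplicate.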

 


What are “qualified” and “unqualified” identifiers?

 

A qualified identifier is one that conforms to the NCBI standard FASTA identifier (NSID) syntax outlined in Table 1. An unqualified identifier is just a bare word, lacking any indication of the database domain or name space in which it was assigned. For instance, while “U38670” could represent a GenBank Accession, it might also be an uncontrolled identifier (UCID). The string “gb|U38670|” tells us unambiguously that the identifier is a GenBank Accession.

 

Table 2. Examples of unqualified and qualified identifiers

 

Unqualified ID  Qualified ID      Interpretation
U85245          gb|U85245|        U85245 is a GenBank ACCESSION
1857636         gi|1857636        1857636 is a GenBank gi identifier
HSU85245        gb||HSU85245      HSU85245 is a GenBank LOCUS
AF218085.2      gb|AF218085.2|    AF218085.2 is a GenBank ACCESSION.VERSION
P18646          sp|P18646|        P18646 is a SWISS-PROT ACCESSION
11S3_HELAN      sp||11S3_HELAN    11S3_HELAN is a SWISS-PROT ENTRY name
A00008          pir|A00008|       A00008 is a PIR accession

 

 

Note that all fields in a qualified identifier must be accounted for by vertical-bars, but all fields need not contain data. A field can be left empty if its value is unset or unknown. Furthermore, retrieval of the corresponding database entry will succeed if one or more fields in a qualified identifier are instantiated.
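As a small illustration of why the vertical bars matter, the Python sketch below recovers which field positions of a qualified identifier are instantiated. The function name is hypothetical and the sketch handles a single (non-compound) identifier only.

```python
def instantiated_fields(qualified_id):
    """Return (tag, {field_number: value}) for the non-empty fields."""
    tag, *fields = qualified_id.split("|")
    # Field numbers start at 1; empty fields are left out.
    return tag, {n: v for n, v in enumerate(fields, start=1) if v}

print(instantiated_fields("gb|U85245|"))    # 1st field set, 2nd field empty
print(instantiated_fields("gb||HSU85245"))  # 1st field empty, 2nd field set
```

Without the placeholder bar, “gb|HSU85245” would be read as an Accession in the 1st field rather than a LOCUS in the 2nd.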

 


How are unqualified identifiers looked up in an index?

 

First of all, it is important to know that when indexing, all identifiers are assigned to a specific name space, with unqualified or uncontrolled identifiers in the UCID class being assigned to an ad hoc “user” name space. The xdformat and xdget programs maintain an internal priority list of the possible name spaces. When provided with an unqualified identifier, xdget works its way down the priority list, successively looking for the requested identifier in each name space. The program stops at the name space in which the first matching identifier is found; any further work the program must do (e.g., to identify the earliest appearance of the identifier in the database) will be performed in this one name space.

 

Name spaces are examined in the decreasing priority order shown in Table 3. The qualifiers 1 and 2 on any given tag correspond respectively to the 1st and 2nd fields in the tag’s full identifier syntax. Note that the nonstandard “accession” tag may be used with the -T option as a synonym for the unified name space of Accessions assigned by the DDBJ/EBI/NCBI collaboration. The nonstandard tags “locus” and “entry” are both synonyms for the 2nd field in all dbj, emb, gb, gp, ref, and sp identifiers, although xdformat actually stores these identifiers in 4 distinct name spaces; xdget then looks up unqualified identifiers using the priority list in Table 3.

 

Table 3. Priority order of identifier name spaces, from highest to lowest

 

-T<tag>                           Description                   Synonyms
user                              Uncontrolled UCID class
lcl
gi
dbj1, emb1, gb1, gp1, sp1, ref1   DDBJ/EMBL/GenBank Accession*  -Taccession
gb2, gp2, ref2                    GenBank locus                 -Tlocus, -Tentry
emb2                              EMBL ID                       -Tlocus, -Tentry
dbj2                              DDBJ ID                       -Tlocus, -Tentry
sp2                               SWISS-PROT entry              -Tlocus, -Tentry
pdb                               PDB entry|chain
pir1                              PIR accession
pir2                              PIR entry
prf1                              PRF accession
prf2                              PRF entry
pat                               country|number|seqno
gnl                               database|idstring
oth                               database|accession|release

 

 

NOTE: the priority list of Table 3 is currently used both in the presence and the absence of any -T options when xdget looks up unqualified identifiers. Future versions of xdformat and xdget will likely use the left-to-right order of -T specifications as the priority order for lookups; in the absence of any -T specifications on either program’s command line, the order shown in Table 3 will be used by default.
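The priority-ordered lookup described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not xdget's internal code: the index is modeled as nested dictionaries, the name-space labels are shorthand for the rows of Table 3, and the sample index contents (including the clash on “U85245”) are hypothetical.

```python
# Name spaces in the priority order of Table 3, highest first. "accession"
# stands for the unified dbj1/emb1/gb1/gp1/sp1/ref1 name space, and "gb2"
# stands for the gb2/gp2/ref2 row.
PRIORITY = ["user", "lcl", "gi", "accession", "gb2", "emb2", "dbj2", "sp2",
            "pdb", "pir1", "pir2", "prf1", "prf2", "pat", "gnl", "oth"]

def lookup_unqualified(index, ident):
    """index maps name space -> {identifier: [record numbers]}."""
    for name_space in PRIORITY:
        hits = index.get(name_space, {}).get(ident)
        if hits:
            return name_space, hits  # stop at the first name space with a match
    return None

index = {
    "gi": {"1857636": [0]},
    "accession": {"U85245": [0]},
    "gb2": {"HSU85245": [0], "U85245": [7]},  # hypothetical clash, for illustration
}
print(lookup_unqualified(index, "U85245"))
print(lookup_unqualified(index, "HSU85245"))
```

Because the accession name space outranks the locus name spaces, the clashing locus entry for “U85245” is never consulted.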

 


How can an entire class of identifiers be omitted from the index if it is not needed?

 

Tag specifications similar to those shown in Table 3 can be used to suppress indexing of certain classes of identifiers, while permitting all others to be indexed. If a tag specification simply ends with a 0 (zero), then that tag will be suppressed. For instance, to suppress indexing of identifiers appearing in the 2nd field of GenBank, EMBL, and DDBJ identifiers, one would specify -Tlocus0. Or to suppress indexing of gi identifiers, use -Tgi0. Such tag specifications may also be provided on the xdget command line to suppress the use of particular classes of identifiers during retrieval.
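One way to picture the combined effect of restricting and suppressing specifications is the Python sketch below. It is an assumption-laden model rather than a description of the programs' internals: the function name is hypothetical, tags are compared as plain strings (field qualifiers and the synonyms of Table 3 are ignored), and how the real programs resolve a mix of both kinds of specification is assumed here, not documented.

```python
def make_filter(specs):
    """specs: tag specifications without the leading -T, e.g. ["user", "gi"]."""
    wanted = {s for s in specs if not s.endswith("0")}
    suppressed = {s[:-1] for s in specs if s.endswith("0")}

    def is_indexed(tag):
        if tag in suppressed:
            return False
        # With no restricting specifications, every other tag is indexed.
        return not wanted or tag in wanted

    return is_indexed

only_user_and_gi = make_filter(["user", "gi"])  # models -Tuser -Tgi
all_but_gi = make_filter(["gi0"])               # models -Tgi0
print(only_user_and_gi("gi"), only_user_and_gi("gb"))
print(all_but_gi("gi"), all_but_gi("gb"))
```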

 


Can new sequences be appended to an existing database – and will they be indexed?

 

Yes, the rapid append mode (-a option) of xdformat is available for indexed databases; appends are only marginally slower when an index is being maintained. Appended sequences will have their identifiers indexed using the same -T restrictions (if any) that were specified when the database was first created. Indexing of identifiers occurs automatically and unconditionally during appends to a previously indexed database, without the need to specify the -I or -X option when appending.

 


How are ACCESSION.VERSION identifiers managed?

 

The numerical .VERSION extension that frequently accompanies Accessions assigned by the NCBI/EBI/DDBJ collaboration is automatically included in the index created by xdformat. Version information can then be used by xdget to identify the latest version of a sequence, when keyed by its Accession alone. Specific versions can also be retrieved if xdget is provided with an identifier of the form ACCESSION.VERSION (e.g., AAB33294.2). The -N option of xdget can be used to report instead the first (-N0) or last (-Nn) instance of an Accession in the database; the -A0 option can be used to report the lowest-numbered Version present in the database rather than the highest (the default or -An). All instances of an accession will be reported by xdget if the -d option is specified.
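The version-selection rules can be sketched in Python as follows. This is a simplified model, not xdget's implementation: the function names are hypothetical, the database is reduced to a flat list of ACCESSION.VERSION strings, and apart from AAB33294.2 the sample entries are made up.

```python
def split_accession(ident):
    """Split 'AAB33294.2' into ('AAB33294', 2); the version is None when absent."""
    accession, dot, version = ident.partition(".")
    return accession, (int(version) if dot and version.isdigit() else None)

def select_version(entries, query, lowest=False):
    """Pick the matching (accession, version) pair, or None if nothing matches."""
    accession, version = split_accession(query)
    candidates = [split_accession(e) for e in entries]
    matches = [(a, v) for a, v in candidates if a == accession and v is not None]
    if version is not None:
        # A specific ACCESSION.VERSION was requested.
        matches = [m for m in matches if m[1] == version]
    if not matches:
        return None
    pick = min if lowest else max  # highest version by default
    return pick(matches, key=lambda m: m[1])

entries = ["AAB33294.1", "AAB33294.2", "U85245.1"]  # hypothetical database contents
print(select_version(entries, "AAB33294"))               # keyed by Accession alone
print(select_version(entries, "AAB33294", lowest=True))  # lowest-numbered version
print(select_version(entries, "AAB33294.1"))             # specific version
```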

 

Indexing and retrieval can be restricted to Accessions assigned by the NCBI/EBI/DDBJ collaboration using the special option -Taccession (or -Tacc for short).

 

Remember: Version numbers assigned by NCBI/EBI/DDBJ are only tied to changes in the sequence data, not the associated annotation.  The annotation of a database record may change greatly, while the Version will remain the same if the sequence itself has not changed.

 


What limitations exist on identifier indexes?

 

Assuming the underlying computer operating system and hardware have the capacity, index files produced by xdformat are currently limited to 8 TB (8,192 GB) in size, a limit that can be readily increased to 256 TB in the program if necessary. With its current configuration, however, an index of 50+ million entries requires less than 3 GB storage; and because storage requirements for the index increase only marginally faster than linearly with the number of entries, the current limit seems likely to suffice for some time. If the size of an index is problematic, or if faster retrieval is required, indexing can be restricted to the most important classes of identifiers using the -T option.
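A back-of-the-envelope calculation in Python shows the scale involved. It assumes binary units (1 GB = 2^30 bytes, consistent with 8 TB being quoted as 8,192 GB) and treats growth as exactly linear, which the text says is slightly optimistic; the variable names are illustrative only.

```python
GB = 2**30
TB = 2**40

entries = 50_000_000   # "50+ million entries"
index_bytes = 3 * GB   # "less than 3 GB", so this is an upper bound

bytes_per_entry = index_bytes / entries
print(f"at most about {bytes_per_entry:.1f} bytes per entry")

# Entries that would fit under the 8 TB limit if growth were exactly linear.
linear_capacity = 8 * TB / bytes_per_entry
print(f"roughly {linear_capacity:.2e} entries")
```

Since storage actually grows marginally faster than linearly, the real capacity would be somewhat lower than this estimate.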

 

