FAQ

Format db Files

Formatdb should be used to format FASTA databases. This must be done before blastall or blastpgp can be run locally. A usage summary may be obtained by executing formatdb followed by a dash, i.e. 'formatdb -' (note that additional comments have been added here as indented paragraphs):

Formatdb Arguments

  -t Title for database file [String] Optional

  -i Input file for formatting (this parameter must be set) [File In]

  -l Logfile name: [File Out] Optional
     default = formatdb.log

  -p Type of file [T/F] Optional
       T - protein
       F - nucleotide
     default = T

     The "-p" option has two different meanings depending on whether the input database is in FASTA or ASN.1 format. For FASTA input, "-p" specifies the type of the input database. For ASN.1 input, it specifies the type of sequence to be indexed for BLAST.

  -o Parse options [T/F] Optional
       T - True: Parse SeqId and create indexes.
       F - False: Do not parse SeqId. Do not create indexes.
     default = F

     If the "-o" option is TRUE (and the input database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention described in the appendices of ftp://ncbi.nlm.nih.gov/blast/db/README.

  -a Input file is database in ASN.1 format (otherwise FASTA is expected) [T/F] Optional
       T - True
       F - False
     default = F

  -b ASN.1 database in binary mode [T/F] Optional
       T - binary
       F - text mode
     default = F

     An input ASN.1 database may be represented in two formats: ASCII text and binary. The "-b" option, if TRUE, specifies that the input ASN.1 database is in binary format. The option is ignored for a FASTA input database.

  -e Input is a Seq-entry [T/F] Optional
     default = F

     An input ASN.1 database (either ASCII text or binary) may contain a Bioseq-set or just one Bioseq. In the latter case the "-e" switch should be set to TRUE.

  -n Base name for BLAST files [String] Optional

     This option allows one to produce BLAST databases with a different name than the original FASTA file. One could have a file named 'NT' and format it as 'NR':

       formatdb -i NT -p F -o T -n NR

     One could also uncompress the original FASTA file on the fly and send it to formatdb through 'stdin' (under UNIX):

       uncompress -c nr.Z | formatdb -i stdin -o T -n NR

     This is useful when the original FASTA file is not required other than by formatdb, for example when disk space is tight.

  -v Number of sequence bases to be created in the volume [Integer] Optional
     default = 0

     This option breaks up large FASTA files into 'volumes' (each with a maximum size of 2 billion letters). As part of the creation of volumes, formatdb writes a new type of BLAST database file, called an alias file, with the extension 'nal' or 'pal'.

  -s Create indexes limited only to accessions - sparse [T/F] Optional
     default = F

     This option limits the indices for the string identifiers (used by formatdb) to accessions (i.e., no locus names). This is especially useful for sequence sets like the ESTs, where the accession and locus names are identical. Formatdb runs faster and produces smaller temporary files if this option is used. It is strongly recommended for ESTs, STSs, GSSs, and HTGSs.

FORMATDB NOTES

A.) Note on -o identifiers:

It is always advantageous to use the '-o' option if the database identifiers are in the format specified at ftp://ncbi.nlm.nih.gov/blast/db/README. If the database identifiers are parseable, formatdb produces additional indices that allow retrieval from the databases by identifier. The databases on the NCBI FTP site contain parseable identifiers. It is sufficient if the first word on the FASTA definition line is a unique identifier (e.g., ">3091 Alcoho de..."). It is necessary to use parseable identifiers in the following cases:

1.) ASN.1 is to be produced from blastall or blastpgp (in this case "-o" must be TRUE).

2.) Query-anchored alignments are desired (i.e., the '-m' option with a non-zero value is used).

3.) The gi's are desired as part of the output (i.e., '-I' is used).

4.) fastacmd is used to fetch sequences from the database by accession or gi.
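
Since the note above asks for the first word on each FASTA definition line to be a unique identifier, it can be worth checking a file before running formatdb with '-o T'. The following is a minimal sketch, not part of formatdb: the function name duplicate_fasta_ids is hypothetical, and the check only looks at the first word after '>', not at the full identifier convention described in the README.

```shell
# Hypothetical helper (not part of the NCBI toolkit): print every identifier,
# i.e. the first word after '>', that occurs more than once in a FASTA file.
# It prints nothing when all identifiers are unique.
duplicate_fasta_ids() {
    awk '/^>/ {
        id = substr($1, 2)            # drop the leading ">"
        if (++seen[id] == 2) print id # report each duplicate only once
    }' "$1"
}
```

Calling 'duplicate_fasta_ids mydb.fasta' (an assumed file name) and getting no output suggests the first-word identifiers are unique; any identifier that is printed should be fixed before formatting with '-o T'.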

B.) Note on "SORTFiles failed" message:

Formatdb uses the 'standard' temporary directory to sort the string indices on disk. Under UNIX this is often /var/tmp; if there is not enough space there, the error message "ERROR: [000.000] SORTFiles failed" is issued. This can be avoided by setting the TMPDIR environment variable to a directory on a partition with more free space. The message can also often be avoided by using the sparse option (-s) for formatdb described above.
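
One way to apply the TMPDIR advice above without changing the environment of the whole login session is a small wrapper. This is only a sketch: the function name with_tmpdir and the path /scratch/tmp used in the comment are assumptions, not anything provided by formatdb.

```shell
# Hypothetical wrapper: run a command with TMPDIR set for that command only.
#   $1     directory to use for temporary files (created if missing)
#   $2...  the command to run and its arguments
with_tmpdir() {
    tmpdir_target=$1
    shift
    mkdir -p "$tmpdir_target" || return 1
    TMPDIR=$tmpdir_target "$@"
}

# Example call (assumed path and file name, options as described above):
#   with_tmpdir /scratch/tmp formatdb -i est.fasta -p F -o T -s T
```

The wrapper leaves TMPDIR unchanged for the rest of the shell, so other programs keep using the default temporary directory.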

C.) Note on formatting large (4 Gig and larger) FASTA files:

A single BLAST database can contain up to 4 billion letters. To format a FASTA file containing more letters than this, it is necessary to use the '-v' (volume) option to produce a number of BLAST databases, each containing 4 billion or fewer letters. An example usage of formatdb for this would be:

formatdb -i hugefasta -p F -v 2000000000

Formatdb is run on the FASTA file 'hugefasta' and multiple BLAST databases, as well as an alias file that describes to BLAST how to search these databases, are produced. Each BLAST database contains at most 2000000000 letters. Note that 2000000000 is the current limitation on the NCBI toolkit command-line parser, explaining why 2000000000 rather than 4000000000 is used.
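
To get a rough idea of how many volumes a given '-v' value will produce, one can count the letters in the FASTA file and divide by the volume size, rounding up. The sketch below is an estimate under stated assumptions: the function names are hypothetical, every non-definition line is treated as sequence, and the actual number of volumes written by formatdb may differ.

```shell
# Hypothetical helper: count sequence letters in a FASTA file, i.e. all
# characters on lines that do not start with '>', ignoring whitespace.
count_letters() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# Hypothetical helper: ceiling division of total letters by letters per volume.
#   $1  total number of letters
#   $2  value intended for the '-v' option
estimate_volumes() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Example call (assumed file name):
#   estimate_volumes "$(count_letters hugefasta)" 2000000000
```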

The reason for using database volumes, as opposed to simply making the indices in the BLAST databases large enough to handle all conceivable databases with an eight-byte 'integer', is that this would have doubled the size of the indices for all searches no matter how small the database. Hence very large FASTA files are broken down into several databases. This process can also be inverted; a user could manually write an alias file (with a name like 'ntest.nal') to combine two databases formatted with separate runs of formatdb. Look at the *.nal file that is produced when formatdb is run to see the syntax.
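
Writing such an alias file by hand can be sketched as follows. The TITLE and DBLIST keywords used here are an assumption about the alias file syntax and should be compared with a *.nal file that formatdb itself produced; the function name write_alias and the database names in the comment are hypothetical.

```shell
# Hypothetical helper: write a minimal alias file that lists several databases.
# Assumes the alias file consists of a TITLE line and a DBLIST line; check
# this against a *.nal (or *.pal) file generated by formatdb.
#   $1     name of the alias file to write, e.g. ntest.nal
#   $2     title for the combined database
#   $3...  base names of the databases to combine
write_alias() {
    alias_file=$1
    alias_title=$2
    shift 2
    {
        printf 'TITLE %s\n' "$alias_title"
        printf 'DBLIST %s\n' "$*"
    } > "$alias_file"
}

# Example call (assumed database names):
#   write_alias ntest.nal "combined test database" db1 db2
```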

Formatdb must be able to open files larger than 2 Gig in order to work on very large files. This is not a problem on a 64-bit OS or on certain 32-bit operating systems that allow binaries to be made large-file aware. The 32-bit Solaris formatdb binary on the NCBI FTP site is now compiled to be large-file aware.

D.) Note on running formatdb on a database without uncompressing it:

To run formatdb without uncompressing the 'nt' database one should execute:

uncompress -c nt.Z | formatdb -i stdin -p F -o T -n NT

Note the use of the '-n' option, which specifies the name of the resulting BLAST database. Note also that 'stdin' specifies that input will be coming from 'standard input'. The NT FASTA file is not needed for running BLAST searches and nt.Z may be deleted after formatdb has been run.

