FAQ
Formatdb must be used to format FASTA databases before blastall or blastpgp can be run locally. Usage of formatdb may be obtained by executing 'formatdb -' (note that additional comments have been added here as indented paragraphs):
-t Title for database file [String] Optional
-i Input file for formatting (this parameter must be set) [File In]
-l Logfile name: [File Out] Optional
default = formatdb.log
-p Type of file
T - protein
F - nucleotide [T/F] Optional
default = T
The "-p" option has two different meanings depending on whether the input database is in FASTA or ASN.1 format. For FASTA input, "-p" specifies the type of the input database. For ASN.1 input, it specifies the type of sequence to be indexed for BLAST.
-o Parse options
T - True: Parse SeqId and create indexes.
F - False: Do not parse SeqId. Do not create indexes.
If the "-o" option is TRUE (and the input database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention described in the appendices of ftp://ncbi.nlm.nih.gov/blast/db/README.
[T/F] Optional
default = F
-a Input file is database in ASN.1 format (otherwise FASTA is expected)
T - True,
F - False.
[T/F] Optional
default = F
-b ASN.1 database in binary mode
T - binary,
F - text mode.
An input ASN.1 database may be represented in two formats: ASCII text and binary. The "-b" option, if TRUE, specifies that the input ASN.1 database is in binary format. The option is ignored for a FASTA input database.
[T/F] Optional
default = F
-e Input is a Seq-entry [T/F] Optional
default = F
An input ASN.1 database (either ASCII text or binary) may contain a Bioseq-set or just one Bioseq. In the latter case the "-e" switch should be set to TRUE.
-n Base name for BLAST files [String] Optional
This option allows one to produce BLAST databases with a different name than the original FASTA file. One could have a file named 'NT' and format it as 'NR':
formatdb -i NT -p F -o T -n NR
One could also uncompress the original FASTA file on the fly and send it to formatdb through 'stdin' (under UNIX):
uncompress -c nr.Z | formatdb -i stdin -o T -n NR
This is useful when the original FASTA file is not required other than by formatdb, for example when disk space is tight.
-v Number of sequence bases to be created in the volume [Integer] Optional
default = 0
This option breaks up large FASTA files into 'volumes' (each with a maximum size of 2 billion letters). As part of creating the volumes, formatdb writes a new type of BLAST database file, called an alias file, with the extension 'nal' or 'pal'.
-s Create indexes limited only to accessions - sparse [T/F] Optional
default = F
This option limits the indices for the string identifiers (used by formatdb) to accessions (i.e., no locus names). This is especially useful for sequence sets such as ESTs, where the accession and locus names are identical. Formatdb runs faster and produces smaller temporary files if this option is used. It is strongly recommended for ESTs, STSs, GSSs, and HTGSs.
A.) Note on -o identifiers:
It is always advantageous to use the '-o' option if the database identifiers are in the format specified at ftp://ncbi.nlm.nih.gov/blast/db/README. If the database identifiers are parseable, formatdb produces additional indices allowing retrieval from the databases by identifier. The databases on the NCBI FTP site contain parseable identifiers. It is sufficient if the first word on the FASTA definition line is a unique identifier (e.g., ">3091 Alcoho de..."). It is necessary to use parseable identifiers for the following cases:
1.) If ASN.1 is to be produced from blastall or blastpgp, then "-o" must be TRUE.
2.) Query-anchored alignments are desired (i.e., the '-m' option with a non-zero value is used).
3.) The gi's are desired as part of the output (i.e., '-I' is used).
4.) fastacmd is used to fetch sequences from the database by accession or gi.
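The uniqueness requirement described above can be checked before running formatdb with '-o T'. The following is a minimal sketch and not part of the BLAST distribution: the function name duplicate_ids and the file name mydb.fasta are hypothetical, and the check only tests that the first word of each definition line is unique, not that the identifiers follow the convention in the README.

```shell
# duplicate_ids FILE
# Print every first-word identifier that occurs on more than one
# FASTA definition line in FILE (no output means all are unique).
duplicate_ids() {
    grep '^>' "$1" |
        awk '{ id = $1; sub(/^>/, "", id); print id }' |
        sort | uniq -d
}

# Example (mydb.fasta is a hypothetical file name):
# duplicate_ids mydb.fasta
```

If the function prints nothing, every definition line starts with a distinct identifier.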
B.) Note on "SORTFiles failed" message:
Formatdb will use the 'standard' temporary directory to sort the string indices on disk.
Under UNIX this is often /var/tmp, and if there is not enough space there, the error message "ERROR: [000.000] SORTFiles failed" will be issued. This can be avoided by setting the TMPDIR environment variable to a directory on a partition with more free space. The message can often also be avoided by using the sparse option (-s) for formatdb described above.
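One way to apply the TMPDIR advice for a single run, without changing the environment of the whole login session, is sketched below. The helper name run_with_tmpdir, the directory /bigdisk/tmp, and the input name est_human are hypothetical; only the TMPDIR variable and the formatdb options come from the text above.

```shell
# run_with_tmpdir DIR COMMAND [ARGS...]
# Run COMMAND with TMPDIR set to DIR for that one invocation only.
run_with_tmpdir() {
    dir=$1
    shift
    if [ ! -d "$dir" ]; then
        echo "run_with_tmpdir: no such directory: $dir" >&2
        return 1
    fi
    TMPDIR="$dir" "$@"
}

# Example (/bigdisk/tmp and est_human are hypothetical names):
# df -k /bigdisk/tmp
# run_with_tmpdir /bigdisk/tmp formatdb -i est_human -p F -o T -s
```

Checking free space with df first shows whether the chosen partition is actually larger than the default one.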
C.) Note on formatting large (4 Gig and larger) FASTA files:
A single BLAST database can contain up to 4 billion letters.
If one wishes to format a FASTA file containing more letters
than this, then it is necessary to use the '-v' (volume)
option to produce a number of BLAST databases, each containing
4 billion or fewer letters. An example usage of formatdb
for this would be:
formatdb -i hugefasta -p F -v 2000000000
Formatdb is run on the FASTA file 'hugefasta' and multiple BLAST databases, as well as an alias file that tells BLAST how to search these databases, are produced. Each BLAST database contains at most 2000000000 letters. Note that 2000000000 is the current limit of the NCBI toolkit command-line parser, which is why 2000000000 rather than 4000000000 is used.
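To get a rough idea of how many volumes to expect for a given '-v' value, one can count the letters in the FASTA file and divide by the volume size, rounding up. The sketch below is only an estimate under that assumption: the helper names count_letters and volumes_needed are hypothetical, the shell must support 64-bit arithmetic, and the actual number of volumes may differ depending on how formatdb places sequences at volume boundaries.

```shell
# count_letters FILE
# Count sequence letters in a FASTA file, ignoring definition lines
# and whitespace.
count_letters() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# volumes_needed TOTAL_LETTERS LETTERS_PER_VOLUME
# Integer ceiling of TOTAL_LETTERS / LETTERS_PER_VOLUME.
volumes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# A file with 5 billion letters and -v 2000000000:
volumes_needed 5000000000 2000000000   # prints 3
```

For a real file, the two helpers can be combined as: volumes_needed "$(count_letters hugefasta)" 2000000000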
The reason for using database volumes, as opposed to simply
making the indices in the BLAST databases large enough to
handle all conceivable databases with an eight-byte 'integer',
is that this would have doubled the size of the indices
for all searches no matter how small the database. Hence
very large FASTA files are broken down into a couple of
databases. This process can also be inverted; a user could
manually write an alias file (with a name like 'ntest.nal')
to combine two databases formatted with separate runs of
formatdb.
Look at the *.nal file that is produced when formatdb is
run to see the syntax.
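A hand-written alias file might be generated as sketched below. The TITLE and DBLIST keywords reflect what formatdb-generated alias files are understood to contain, but that is an assumption here and should be compared against a '*.nal' file produced by your own formatdb version. The helper name write_alias and the database names est_a and est_b are hypothetical.

```shell
# write_alias NAME TITLE DB [DB...]
# Write NAME.nal listing the given nucleotide BLAST databases.
# Assumption: the alias file uses TITLE and DBLIST lines, with
# '#' lines as comments.
write_alias() {
    name=$1
    title=$2
    shift 2
    {
        echo "#"
        echo "# Alias file written by hand"
        echo "#"
        echo "TITLE $title"
        echo "#"
        echo "DBLIST $*"
        echo "#"
    } > "$name.nal"
}

# Example (est_a and est_b are hypothetical database names):
# write_alias ntest "combined test database" est_a est_b
```

The databases named in the list must already have been formatted with separate runs of formatdb.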
Formatdb must be able to open files larger than 2 Gig in order to work on very large files. This is not a problem on a 64-bit OS or on certain 32-bit OSes that allow binaries to be made large-file aware. The 32-bit Solaris formatdb binary on the NCBI FTP site is now compiled large-file aware.
D.) Note on running formatdb on a database without uncompressing it:
To run formatdb without uncompressing the 'nt' database
one should execute:
uncompress -c nt.Z | formatdb -i stdin -p F -o T -n NT
Note the use of the '-n' option that specifies the name of the
resulting BLAST database. Note also that 'stdin' specifies
that input will be coming from 'standard input'. The NT
FASTA file is not needed for running BLAST searches and
nt.Z may be deleted after formatdb has been run.