CDD FTP-archive                                        revised 16 January 2015
ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/README
===============================================================================

CONTENTS:

1. SUMMARY
2. SCOPE OF DATA IN FTP FILES
3. LIST OF FILES AND SUBDIRECTORIES, AND SUMMARY OF THEIR CONTENTS
4. DETAILS FOR EACH FILE AND SUBDIRECTORY (ALPHABETICALLY BY FILENAME)
5. CD-SEARCH WEB SERVICE VS. STANDALONE RPS-BLAST
   5.1 What accounts for the differences in search results generated by
       the CD-Search web service and standalone RPS-BLAST?
   5.2 How can I configure standalone RPS-BLAST to generate the same
       results as the CD-Search web service?

===============================================================================
1. SUMMARY
===============================================================================

This ftp-directory contains collections of position-specific scoring
matrices (PSSMs) that have been created for the CD-Search service
(http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). CD-Search can be used
to identify conserved domains in a query protein sequence and infer its
putative function (see: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml).
PSSMs are briefly described in the CDD help document:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CD_PSSM.

The PSSMs are meant to be used for compiling RPS-BLAST search databases,
which can be used with the standalone RPS-BLAST program. That program and
the makeprofiledb application needed to convert files in this directory are
part of the BLAST executables
(ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/) and the NCBI software
development toolkit distribution (ftp://ncbi.nlm.nih.gov/toolbox). The
makeprofiledb application is described at:
www.ncbi.nlm.nih.gov/books/NBK1763

Be sure to use recent BLAST executables in order to obtain the makeprofiledb
application that is compatible with the CDD FTP files.
(The formatrpsdb application packaged with earlier BLAST releases is not
compatible and will result in an error message, "unable to match element in
intermediateData... ERROR: no data found in file.")

The "little_endian" subdirectory of this CDD FTP site contains preformatted
databases, eliminating the need to use the makeprofiledb application. More
information about the "little_endian" subdirectory is provided in subsequent
sections of this file. However, if you prefer to create customized search
sets, you will still need to run makeprofiledb.

An "rpsbproc" subdirectory of this CDD FTP site contains a command line
utility that serves as an addition to the standalone RPS-BLAST executable.
The "rpsbproc" utility post-processes the results of local RPS-BLAST
searches in order to provide a non-redundant view of the search results, and
to provide additional annotation on query sequences, such as domain
superfamilies and conserved sites, similar to the annotation provided by the
corresponding web services. Additional details about both the standalone
RPS-BLAST program and the "rpsbproc" utility are provided in subsequent
sections of this file.

Finally, note that the E-values you get (for any given protein
query--conserved domain hit pair) on the CD-Search web service might differ
from those you get when using standalone RPS-BLAST on your local PC. The
last section of this document describes the differences between the web
service and standalone program, and provides a tip on how you can configure
standalone RPS-BLAST to generate the same results as those produced by the
CD-Search web service.

===============================================================================
2. SCOPE OF DATA IN FTP FILES
===============================================================================

Data accessible via the CD-Search tool and in the Entrez Conserved Domain
Database (CDD) originate from a number of source databases, including
NCBI-curated domain models as well as models from external sources. The data
are grouped by source database into the following subsets:

1 cd ....... alignment models curated at NCBI as part of the CDD project
             (see: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDSource_NCBI_curated)

2 Pfam ..... PSSMs from a mirror of the Pfam-A seed alignment database
             (see: http://pfam.sanger.ac.uk/)

3 Smart .... PSSMs from a mirror of the Smart domain alignment database
             (see: http://smart.embl-heidelberg.de/)

4 COG ...... PSSMs from automatically aligned sequences and sequence
             fragments classified in the COGs resource, which focuses
             primarily on prokaryotes
             (see: http://www.ncbi.nlm.nih.gov/COG/new/)

5 PRK ...... PSSMs from automatically aligned sequences and sequence
             fragments classified as stable clusters in the Protein
             Clusters database
             (see: http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search&db=proteinclusters)

6 TIGRFAM .. PSSMs from a mirror of the TIGRFAM database of protein families
             (see: http://www.jcvi.org/cms/research/projects/tigrfams/overview/)

7 KOG ...... PSSMs from automatically aligned sequences and sequence
             fragments classified in the KOGs resource, the eukaryotic
             counterpart to COGs
             (see: http://www.ncbi.nlm.nih.gov/COG/new/).
             These are available as a separate search set in CD-Search, but
             are not in the CD-Search tool's DEFAULT "cdd" database, and are
             not indexed for text searching in Entrez CDD.

8 LOAD ..... Library of Ancient Domains.
             These 55 models are available only as a data file on the FTP
             site; they are not searchable via CD-Search and are not indexed
             for text searching in Entrez CDD. The domains in this set are
             represented by domain models in the other data collections
             above.

(Additional details about source databases are provided in the CDD Help Doc:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDSource)

The CD-Search databases, Entrez CDD, and the FTP files in this directory
encompass various data sets. The scope of data covered by each FTP file is
noted in the FILE LIST and SUMMARY, below, and can be one of the following:

SCOPE A: ALL CD models accessible via the CD-Search tool (subsets 1-7,
         described above), plus subset 8 (which is accessible from this FTP
         site, but is not searchable via CD-Search and is not indexed for
         text searching in Entrez CDD).

SCOPE B: Data from the CD-Search tool's DEFAULT "cdd" database, which
         includes subsets 1-6, above. These subsets are also indexed and
         searchable in NCBI's Entrez CDD database.

SCOPE C: NCBI-curated CD models (subset 1, above).

SCOPE D: Data from specific, individual NCBI-curated CD models.

SCOPE E: Conserved domain models that are members of superfamilies; these
         can include models from subset 1, and models from subsets 2-6 that
         are not multidomains. Superfamilies and multidomains are described
         in the CDD Help document:
         http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#RPSB_hit_types

===============================================================================
3. LIST OF FILES AND SUBDIRECTORIES, AND SUMMARY OF THEIR CONTENTS
===============================================================================

The CDD FTP directory includes the files and subdirectories listed below.
Files are listed first, followed by subdirectories. The letter in the
"SCOPE" column indicates the set or subset of data in a file, as described
in the preceding section, "SCOPE OF DATA IN FTP FILES." Additional DETAILS
for each file and subdirectory are provided in the next section, "DETAILS
FOR EACH FILE AND SUBDIRECTORY."

-------------------------------------------------------------------------------
FILE NAME                  |scope| summary
-------------------------------------------------------------------------------
cdd.tar.gz                 |  A  | PSSMs originating from various alignment
                           |     | collections; can be used to build search
                           |     | databases for RPS-BLAST.
                           |     | (scope A: all CD models)
-------------------------------------------------------------------------------
acd.tar.gz                 | A+  | CD data as used by the CD-server for
                           |     | visualization of CD-search results
                           |     | (scope A, PLUS data for superfamily clusters)
-------------------------------------------------------------------------------
cddid_all.tbl.gz           |  A  | summary information about all CD models in
                           |     | this distribution
                           |     | (scope A: all CD models)
-------------------------------------------------------------------------------
fasta.tar.gz               |  A  | sequence alignments from the CDs in mFASTA
                           |     | format
                           |     | (scope A: all CD models)
-------------------------------------------------------------------------------
cdd.versions               |  A  | list of all conserved domain model
                           |     | accessions, versions, and PSSM IDs present
                           |     | in the current and previous versions of the
                           |     | Conserved Domain Database
                           |     | (scope A: all CD models)
-------------------------------------------------------------------------------
cdd.info                   |  B  | CDD release version number and details
                           |     | (scope B: default "cdd" database)
-------------------------------------------------------------------------------
cddid.tbl.gz               |  B  | summary information about the CD models in
                           |     | this distribution that are part of the
                           |     | CD-Search tool's default "cdd" database and
                           |     | are indexed in NCBI's Entrez CDD database
                           |     | (scope B: default "cdd" database)
-------------------------------------------------------------------------------
cddmasters.fa.gz           |  B  | FASTA-formatted representative sequences for
                           |     | each conserved domain model in the collection
                           |     | (scope B: default "cdd" database)
-------------------------------------------------------------------------------
cddannot.dat.gz            |  C  | information about conserved family features
                           |     | (such as binding and catalytic sites) as
                           |     | recorded for NCBI-curated CD models
                           |     | (scope C: NCBI-curated domain models)
-------------------------------------------------------------------------------
cddannot_generic.dat.gz    |  C  | information about generic conserved family
                           |     | features (such as binding and catalytic
                           |     | sites) in root CD models that are mapped to
                           |     | all hierarchy members
                           |     | (scope C: NCBI-curated domain models)
-------------------------------------------------------------------------------
cdtrack.txt                |  C  | information from NCBI's internal tracking
                           |     | system about hierarchies of related domain
                           |     | models in NCBI-curated domains
                           |     | (scope C)
-------------------------------------------------------------------------------
bitscore_specific_X.XX.txt |  C  | domain-specific score thresholds used by the
                           |     | CD-Search tool to determine whether hits to
                           |     | NCBI-curated domain models are specific or
                           |     | non-specific. The X.XX portion of the
                           |     | filename indicates the CDD release number.
                           |     | (scope C)
-------------------------------------------------------------------------------
cd00882_notree.acd         |  D  | versions of files distributed within acd.tar
cd01659_notree.acd         |  D  | that are meant for users of the old NCBI
cd02039_notree.acd         |  D  | C-toolkit
                           |     | (scope D: specific, individual NCBI-curated
                           |     | models)
-------------------------------------------------------------------------------
family_superfamily_links   |  E  | list of NCBI-curated and imported domain
                           |     | models that are members of CDD superfamilies,
                           |     | along with the superfamily accession (cl*) to
                           |     | which each domain model belongs
                           |     | (scope E: superfamily members)
-------------------------------------------------------------------------------

-------------------------------------------------------------------------------
DIRECTORY NAME             |scope| summary
-------------------------------------------------------------------------------
big_endian                 |     | These subdirectories contain pre-formatted
little_endian              |     | search databases for use with the standalone
(subdirectories)           |     | RPS-BLAST executable (available from
                           |     | ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/
                           |     | and described in
                           |     | ftp://ftp.ncbi.nih.gov/blast/documents/rpsblast.html).
                           |     | The databases are formatted for use with
                           |     | various architecture/OS combinations.
                           |     |
                           |     | The "little_endian" directory is up to date
                           |     | and contains databases formatted for
                           |     | Intel/Linux, Intel/Windows, and Intel/Solaris.
                           |     |
                           |     | The "big_endian" directory is no longer
                           |     | supported; it contains databases formatted
                           |     | for Sun/Solaris and SGI/IRIX.
                           |     |
                           | A-  | The scope of data in both directories is A-
                           |     | (i.e., just less than scope A). The
                           |     | directories contain ALL CD models accessible
                           |     | via the CD-Search tool (subsets 1-7,
                           |     | described in "SCOPE OF DATA IN FTP FILES",
                           |     | above) BUT NOT subset 8 (LOAD). The data are
                           |     | organized into files based on source
                           |     | database.
                           |     |
                           |  B  | The files named "Cdd_LE.tar.gz" and
                           |     | "Cdd_BE.tar.gz" have scope B. That is, they
                           |     | contain all domain models that are in the
                           |     | CD-Search tool's DEFAULT "cdd" database
                           |     | (subsets 1-6, described in "SCOPE OF DATA IN
                           |     | FTP FILES", above).
                           |     |
                           |  C  | The file named "Cdd_NCBI_LE.tar.gz" is
                           |     | available only in the little_endian directory
                           |     | and has scope C. That is, it contains only
                           |     | NCBI-curated CD models (subset 1, described
                           |     | in "SCOPE OF DATA IN FTP FILES", above).
                           |     |
                           |     | File names in the little_endian directory
                           |     | contain the fragment "LE" and file names in
                           |     | the big_endian directory contain the fragment
                           |     | "BE".
-------------------------------------------------------------------------------
rpsbproc                   | N/A | This directory contains the "rpsbproc"
                           |     | command line utility, which is an addition to
                           |     | the standalone RPS-BLAST executable. The
                           |     | "rpsbproc" utility post-processes the results
                           |     | of local RPS-BLAST searches in order to
                           |     | provide a non-redundant view of the search
                           |     | results, and to provide additional annotation
                           |     | on query sequences, such as domain
                           |     | superfamilies and conserved sites, similar to
                           |     | the annotation provided by the corresponding
                           |     | web services, such as the NCBI Batch
                           |     | CD-Search web service.
-------------------------------------------------------------------------------

===============================================================================
4. DETAILS FOR EACH FILE AND SUBDIRECTORY (ALPHABETICALLY BY FILENAME)
===============================================================================

Files are listed below, in alphabetical order. Subdirectories are described
last, also in alphabetical order.

--------------------------
FILES:
--------------------------

===============================================================================
acd.tar.gz
===============================================================================

"acd.tar.gz" is a gzipped archive that contains the CD data as used by the
CD-server for visualization of CD-search results.
The data are stored as ASN.1-formatted files. The types of information
provided for each conserved domain model are described in the CDD help
document section on "CDD Record (CD Summary page): What information is
displayed for each domain model?"
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDVisual.

(SCOPE A+: this file includes data from all CD models, PLUS data for
superfamily clusters
(http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#Superfamily).
Additional details about data coverage are provided in the earlier section,
"SCOPE OF DATA IN FTP FILES")

Technical note: The "acd" acronym is used at NCBI to denote "ASN.1 Cd
Datafile". It is also used as a file extension for CD data files (e.g., the
"cd0????_notree.acd" files in this FTP directory). However, the conserved
domain file extensions appear as "*.cn3" when using the CDD database web
server's "Structure View" function. That allows the conserved domain data
files to be uniquely associated with the Cn3D viewing program
(http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml), and differentiates
them from the *.acd file extension used by computer-aided design programs
such as Autocad.

===============================================================================
bitscore_specific_X.XX.txt
===============================================================================

"bitscore_specific_X.XX.txt" (e.g., "bitscore_specific_2.14.txt") contains
the domain-specific bit score thresholds used by the CD-Search tool to
determine whether hits to NCBI-curated domain models are specific or
non-specific (both hit types are described in:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#RPSB_hit_types).

This file is saved for each CDD release (the X.XX portion of the filename
indicates the CDD release number), allowing retrieval of current and
previous bit score thresholds for a domain model.

The file contains three columns:

1. conserved domain PSSM ID

   This is a unique identifier for a domain model's position-specific
   scoring matrix (PSSM). If a domain model's PSSM changes in any way as a
   result of updates to its multiple sequence alignment, it receives a new
   PSSM ID. This happens because a conserved domain model can evolve over
   time. For example, as new sequence data become available, curators might
   add sequences to a multiple sequence alignment or update the sequences
   already present. As a result of such changes to the domain model, the
   PSSM and its ID can change. Additional information about PSSMs is
   accessible from:
   http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDProcess

2. conserved domain accession number

   Domain-specific score thresholds are currently calculated only for
   NCBI-curated domains; therefore, all accessions in the file begin with
   the prefix "cd" (see:
   http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDSource_accession_prefix)

3. domain-specific score threshold, shown as bit score

   This column shows the lowest bit score among self-hits of a domain’s
   member protein sequences to the resulting domain model. This
   domain-specific score threshold can change for the same reasons the PSSM
   ID can change (explained in #1, above).

   An illustrated example and additional details about specific hits and
   domain-specific thresholds are provided in the CD-Search help document:
   http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#SpecificHit

(SCOPE C: this file includes data from NCBI-curated domain models; see
section on "SCOPE OF DATA IN FTP FILES" for details)

===============================================================================
cd0????_notree.acd (no longer updated)
===============================================================================

"cd0????_notree.acd" are versions of files distributed within "acd.tar",
which have been stored without data representing the sequence tree of the
underlying set of sequence fragments.
Trees in these particular examples are deeply nested and cannot be read with
the old NCBI C-toolkit object loaders. These separate files allow users of
the old NCBI C-toolkit to load the full set of conserved domain models into
their applications.

(SCOPE D: these files include data from specific, individual NCBI-curated
models; see section on "SCOPE OF DATA IN FTP FILES" for details)

===============================================================================
cdd.info
===============================================================================

"cdd.info" contains the CDD release version number and details the content
of the release (number of models from each data source).

(SCOPE B: this file includes data from the default "cdd" database; see
section on "SCOPE OF DATA IN FTP FILES" for details)

===============================================================================
cdd.tar.gz
===============================================================================

"cdd.tar.gz" is a gzipped archive file that contains Position-Specific
Scoring Matrices (PSSMs) originating from all of the alignment collections
encompassed by the Conserved Domain Database project.

(SCOPE A: this file includes data from ALL CD models; see section on "SCOPE
OF DATA IN FTP FILES" for details)

To build search databases for RPS-BLAST you need to unpack the archive and
extract its contents. It contains ASCII-formatted files only, with the
following extensions:

*.smp ...... Position Specific Scoring Matrices (PSSMs). These are stored in
             a new ASN.1 format ("scoremat"), which is shared between
             various BLAST applications.

*.pn ....... lists of PSSM file names, which allow for the compilation of
             the following RPS-BLAST search databases:

             Smart
             Pfam
             Cog
             Kog
             Prk
             Tigr
             Cdd_NCBI  (NCBI/CDD-curated domain models only)
             Cdd       (domains from Smart, Pfam, COG, PRK, and cd; this is
                       the set that is indexed in NCBI's Entrez)

The databases must be formatted with the "makeprofiledb" application that is
distributed with the BLAST executables
(ftp://ftp.ncbi.nih.gov/blast/executables/). Be sure to use recent BLAST
executables in order to obtain the makeprofiledb application that is
compatible with the CDD FTP files. (The formatrpsdb application packaged
with earlier BLAST releases is not compatible and will result in an error
message, "unable to match element in intermediateData... ERROR: no data
found in file.")

The following sequence of commands will build the search databases:

  makeprofiledb -title SMART.v6.0 -in Smart.pn -out Smart -threshold 9.82 -scale 100.0 -dbtype rps -index true
  makeprofiledb -title Pfam.v.26.0 -in Pfam.pn -out Pfam -threshold 9.82 -scale 100.0 -dbtype rps -index true
  makeprofiledb -title COG.v.1.0 -in Cog.pn -out Cog -threshold 9.82 -scale 100.0 -dbtype rps -index true
  makeprofiledb -title KOG.v.1.0 -in Kog.pn -out Kog -threshold 9.82 -scale 100.0 -dbtype rps -index true
  makeprofiledb -title CDD.v.3.12 -in Cdd.pn -out Cdd -threshold 9.82 -scale 100.0 -dbtype rps -index true
  makeprofiledb -title CDD_NCBI.v.3.12 -in Cdd_NCBI.pn -out Cdd_NCBI -threshold 9.82 -scale 100.0 -dbtype rps -index true
  makeprofiledb -title PRK.v.6.00 -in Prk.pn -out Prk -threshold 9.82 -scale 100.0 -dbtype rps -index true

Note that the parameter '-threshold' supplied with makeprofiledb, the
three-letter word score threshold for detecting and extending hits in
RPS-BLAST, will determine the size of the search database. A lower threshold
will result in larger databases and slightly increased search sensitivity,
at the cost of additional memory requirements and reduced search speed.
Matrices distributed for creating RPS-BLAST search databases are scaled by a
factor of 100 (parameter -scale). A score threshold value of 9.82 will
result in search databases of a size very similar to those obtained with
unscaled matrices and a threshold value of 11.

Note also that the RPS-BLAST search databases generated by makeprofiledb are
architecture dependent; it may not be possible to create them on one
platform and use them on another.

When searching with your local version of RPS-BLAST, use the command-line
argument "-db" to specify the database name and location (the legacy
C-toolkit rpsblast executable used "-d"). You need an executable version of
the "rpsblast" program; type "rpsblast -help" to obtain a list of
command-line options.

You can now take any arbitrary subset of PSSMs and compile them into an
RPS-BLAST search database. All that makeprofiledb needs is a list of file
names (such as "Smart.pn" in the example above) and the corresponding
"scoremat" (*.smp) files. Newer versions of PSI-BLAST (blastpgp) can now
write out "checkpoints" in the "scoremat" format as well (blastpgp parameter
-u1). These again can be combined with arbitrary subsets of the
scoremat-formatted PSSMs distributed here, to create customized RPS-BLAST
search sets. The scoremat-formatted PSSMs distributed here are scaled by a
factor of 100.0; if you combine them with PSI-BLAST-generated "scoremats",
the same scaling factor must be passed to makeprofiledb.

Note: If you prefer to use preformatted databases, see the little_endian
subdirectory of the CDD FTP site. It contains databases that have been
preformatted for use with Intel-based platforms, under Linux, Windows, and
other operating systems.
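
As a rough illustration of compiling a customized search set, the sketch
below writes a list file for a chosen subset of PSSMs and assembles a
makeprofiledb command line with the same flags as the commands shown in this
section. It is a minimal sketch, not an NCBI tool: the helper names, the
"Custom" database name, and the assumptions that each PSSM is stored as
<accession>.smp and that a .pn file holds one file name per line are all
hypothetical and should be checked against the unpacked archive.

```python
from pathlib import Path


def write_pn_list(smp_dir, accessions, pn_path):
    """Write a .pn list naming one .smp file per requested accession.

    Assumption (not stated in this README): PSSM files are named
    <accession>.smp and the list holds one file name per line.
    """
    smp_dir = Path(smp_dir)
    names = []
    for acc in accessions:
        smp = smp_dir / f"{acc}.smp"
        if not smp.is_file():
            raise FileNotFoundError(smp)
        names.append(smp.name)
    Path(pn_path).write_text("\n".join(names) + "\n")
    return names


def makeprofiledb_command(title, pn_path, out_name, threshold=9.82, scale=100.0):
    """Return the argument list for makeprofiledb, using only the flags
    that appear in the commands listed in this section."""
    return [
        "makeprofiledb",
        "-title", title,
        "-in", str(pn_path),
        "-out", out_name,
        "-threshold", str(threshold),
        "-scale", str(scale),
        "-dbtype", "rps",
        "-index", "true",
    ]
```

The returned list can be passed to subprocess.run() from the directory that
holds the .smp files, or joined with spaces and pasted into a shell. The
scale is left at 100.0 to match the scaling of the PSSMs distributed here.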

===============================================================================
cdd.versions
===============================================================================

"cdd.versions" lists all conserved domain model accessions, versions, and
PSSM IDs present in the current and previous versions of the Conserved
Domain Database.

(SCOPE A: this file includes data from ALL CD models; see section on "SCOPE
OF DATA IN FTP FILES" for details)

Example/Excerpt from file:

# Acc          ShortName         PssmId   Root Ver Lv Rl ER Time
# ------------ ----------------- -------- ---- --- -- -- -- -----------------
...
  pfam09006    Surfac_D-t           90442 N/A    4  1  1  0 01/09/08 09:49:00
  pfam09006    Surfac_D-t           87766 N/A    3  0  1  0 09/13/07 17:36:00
  pfam09006    Surfac_D-t           72424 N/A    2  0  1  0 05/07/07 17:24:00
  pfam09006    Surfac_D-t           72424 N/A    1  0  1  0 03/12/07 13:54:00
...

Column descriptions:

Acc       = conserved domain model accession number (e.g., pfam09006)

ShortName = first 10 characters of the domain model's short name, in this
            case, Surfac_D-t, for Surfac_D-trimer.

PssmId    = unique identifier for the position-specific scoring matrix
            (e.g., as the pfam09006 domain model has evolved, it has had
            three PSSMs, with IDs 72424, 87766, and 90442, respectively).
            If there are any changes in the protein sequence alignment of a
            domain model (for example, the addition/deletion of member
            protein sequences or changes in the span of aligned residues),
            or if there are changes in the interpretation of the alignment,
            a new PSSM will be calculated. In that case, it will receive a
            new PSSM ID, although the accession number of the conserved
            domain model will remain the same. If only the domain model
            description or other annotations have changed, but the PSSM did
            not change, the version of the model will be incremented but
            the PSSM ID will remain the same, as it did for versions 1 and
            2 of pfam09006, both of which had the PSSM ID 72424.

Root      = if the domain model is NCBI-curated, the "Root" column will
            show the accession number of the parent node of the curated
            domain hierarchy. If the domain hierarchy contains only a
            single node, the value in the "Root" column will be the same as
            that in the "Acc" column. The values will also be the same if
            the accession listed in the first column is the parent node of
            a multi-level hierarchy.

Ver       = version number of that particular domain model

Lv        = indicates the current live version of the record:
            1 = live status; 0 = dead, earlier version.

Rl        = indicates whether the domain model version has been released
            into the public database. This is a flag NCBI uses for internal
            data tracking. For most domain models, the value will be
            1 = released, which means at some point the model was live in
            the database. Occasionally a value of "0" might appear,
            primarily for NCBI-curated models. This indicates a newer
            version of a model is in preparation at NCBI and will be
            released in the future.

ER        = expendable or redundant models; the value in this column can
            be:
            0 = non-expendable or not redundant
            1 = expendable or redundant; indicates a model that has been
                removed from the default "cdd" search set because the
                information in it is represented in another domain model.

Time      = date and time on which the model was last updated in the
            internal conserved domain tracking database.

===============================================================================
cddannot.dat.gz and cddannot_generic.dat.gz
===============================================================================

"cddannot.dat.gz" contains information about conserved family features (such
as binding and catalytic sites) as recorded for curated CD models; these
features are thus ‘specific’ to the models for which they were recorded.
"cddannot_generic.dat.gz" contains so-called ‘generic’ features that are
present in the root CD model of a hierarchy and mapped to each descendant CD
model in that hierarchy. For the root CD, the generic and specific features
are identical.
If the root CD has no features, then there are no generic features for any
CD in that hierarchy.

These are tab-delimited text files, with a single row per "feature" and the
following columns:

  PSSM-Id (unique numerical identifier)
  CD accession (starting with 'cd')
  CD "short name"
  Feature number
  Feature description/name
  Feature sequence motif (given in Prosite syntax), or ‘0’ when not
    specified
  Boolean flag (0/1), indicating presence of structure-based feature
    evidence
  Boolean flag (0/1), indicating presence of reference-based feature
    evidence
  Boolean flag (0/1), indicating presence of additional comments
  comma-separated feature addresses
  site type (numerical)

The feature addresses are positions on the alignment's "master sequence",
which is a consensus sequence, and on the alignment's PSSM (the database
search model). Note that feature addresses are stored in a coordinate system
that counts the first residue in the consensus sequence as "0".

The site types are assigned as follows:

  0 ... unassigned or type "other"
  1 ... active site
  2 ... polypeptide binding site
  3 ... nucleic acid binding site
  4 ... ion binding site
  5 ... chemical binding site
  6 ... posttranslational modification site
  7 ... structural motifs

(SCOPE C: this file includes data from NCBI-curated domain models; see
section on "SCOPE OF DATA IN FTP FILES" for details)

===============================================================================
cddid.tbl.gz
===============================================================================

"cddid.tbl.gz" contains summary information about the CD models in this
distribution, which are part of the default "cdd" search database and are
indexed in NCBI's Entrez database.
This is a tab-delimited text file, with a single row per CD model and the
following columns:

  PSSM-Id (unique numerical identifier)
  CD accession (starting with 'cd', 'pfam', 'smart', 'COG', 'PRK' or 'CHL')
  CD "short name"
  CD description
  PSSM-Length (number of columns, the size of the search model)

(SCOPE B: this file includes data from the default "cdd" database; see
section on "SCOPE OF DATA IN FTP FILES" for details)

===============================================================================
cddid_all.tbl.gz
===============================================================================

"cddid_all.tbl.gz" contains summary information about all CD models in this
distribution. This is a tab-delimited text file, with a single row per CD
model and the following columns:

  PSSM-Id (unique numerical identifier)
  CD accession (starting with 'cd', 'pfam', 'smart', 'COG', 'PRK', 'CHL',
    'KOG', or 'LOAD')
  CD "short name"
  CD description
  PSSM-Length (number of columns, the size of the search model)

(SCOPE A: this file includes data from ALL CD models; see section on "SCOPE
OF DATA IN FTP FILES" for details)

===============================================================================
cddmasters.fa.gz
===============================================================================

"cddmasters.fa.gz" is an archive containing the FASTA-formatted
representative sequences for each conserved domain model in the collection.
The representative sequences are consensus sequences with an approximate
median length relative to all the sequence footprints used in the alignment.
They are constructed for calculating a position-specific score matrix
(PSSM); each residue in the representative sequence corresponds to a column
in the PSSM. When RPS-BLAST formats output, it will display pair-wise
alignments between the query and the PSSMs' representative sequences.
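
The five-column layout of "cddid.tbl.gz" and "cddid_all.tbl.gz" and the
FASTA records in "cddmasters.fa.gz" can be read with a few lines of Python.
The following is a minimal sketch under stated assumptions: the function and
field names are illustrative, every table row is assumed to hold exactly
five tab-separated fields, and nothing is assumed about the layout of the
FASTA header lines, so matching a sequence to a PSSM-Id is left to the
reader.

```python
import gzip
from collections import namedtuple

# Field names are illustrative; the column order follows the list above.
CddModel = namedtuple(
    "CddModel", "pssm_id accession short_name description pssm_length"
)


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def parse_cddid_lines(lines):
    """Yield one CddModel per non-empty tab-delimited row."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns, got {len(fields)}")
        pssm_id, accession, short_name, description, length = fields
        yield CddModel(int(pssm_id), accession, short_name, description, int(length))


def parse_fasta_lines(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

For example, list(parse_cddid_lines(open_text("cddid.tbl.gz"))) loads the
whole table. Because each residue of a representative sequence corresponds
to one PSSM column, the length of a sequence read by parse_fasta_lines should
correspond to the PSSM-Length recorded for the same model.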

(SCOPE B: this file includes data from the default "cdd" database; see
section on "SCOPE OF DATA IN FTP FILES" for details)

===============================================================================
cdtrack.txt
===============================================================================

"cdtrack.txt" lists information from NCBI's internal tracking system for
conserved domain models curated at NCBI. The intent of this file is to
provide information about hierarchies of related domain models. All models
that map to the same root accession have been linked together in a
hierarchical set, in which the alignment models are consistent with each
other. Columns in this table are:

  Acc .......... CD accession
  ShortName .... CD short name
  PssmId ....... CD PSSM-ID, a unique numerical identifier for each CD
  Root ......... Accession of the CD hierarchy root model
  Ver .......... CD version number
  Lv ........... is model live in the tracking system?
  Rl ........... has model been released to the public?
  ER ........... has model been flagged as "expendable or redundant"?
  Time ......... time stamp in the tracking system (last modification)

(SCOPE C: NCBI-curated domain models; see section on "SCOPE OF DATA IN FTP
FILES" for details)

===============================================================================
family_superfamily_links
===============================================================================

"family_superfamily_links" lists the conserved domain models that are
members of superfamilies, along with the superfamily cluster (cl*) accession
to which each domain model belongs. Superfamily members can include
NCBI-curated domain models as well as imported models that are not
multi-domains. More information about superfamilies and multidomains is
available at:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#Superfamily
and
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#RPSB_hit_types.

The file contains four columns:

1. conserved domain accession number

   For examples, see:
   http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDSource_accession_prefix.

2. conserved domain PSSM ID

   This is a unique identifier for a domain model's position-specific
   scoring matrix (PSSM). If a domain model's PSSM changes in any way as a
   result of updates to its multiple sequence alignment, it receives a new
   PSSM ID. This happens because a conserved domain model can evolve over
   time. For example, as new sequence data become available, the curators of
   a source database might add sequences to a multiple sequence alignment or
   update the sequences already present. As a result of such changes to the
   domain model, the PSSM and its ID can change. Additional information
   about PSSMs is accessible from:
   http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDProcess

3. superfamily cluster accession number

   If a conserved domain model belongs to a superfamily with two or more
   members, this column contains the accession of the corresponding
   superfamily (an alphanumeric string that starts with a "cl" prefix, which
   means "cluster," followed by a series of digits, e.g., cl02915).

   If a conserved domain model is a "singleton" (the sole member of a
   superfamily), this column simply repeats the conserved domain model's
   accession number that is shown in column 1. (Note: The majority of
   superfamilies are singletons, containing a single model from either Pfam,
   TIGRFAM, COGs, etc. While the CDD data processing pipeline does generate
   corresponding superfamily cluster models, they are not indexed in the
   Entrez search system in order to reduce redundancy in the presentation of
   search results.)

   Superfamily clusters are produced via an automated procedure each time
   there is a new CDD release. Information about clustering methodology is
   provided at:
   http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#Superfamily.
The composition of a cluster can change over time due to a variety of factors, such as (a) availability of new domain models, (b) changes to previously existing models, (c) new and/or updated sequence records in the Entrez Protein database, and (d) refinements to the automated clustering procedures. A superfamily cluster accession number will remain the same if at least 50 percent of its member models (conserved domain accessions) have not changed relative to the previous version of the cluster. If more than 50 percent of the conserved domain accessions from a previous version of a cluster are no longer present in the new build of that cluster, or if the cluster size more than doubles with a new build, then the superfamily cluster accession is retired and replaced by a new accession(s). If two previous clusters merge into a single new cluster, the superfamily cluster accession number of the larger component cluster is used for the new grouping. 4. superfamily cluster PSSM ID A superfamily's PSSM ID refers to the specific set of conserved domain PSSM IDs that comprise the superfamily, rather than to an actual position-specific scoring matrix for the overall superfamily. The superfamily cluster PSSM ID will change if there is any change to the set of member PSSM IDs relative to the previous version of the cluster (e.g., if a member conserved domain gets a new PSSM ID due to changes in its multiple sequence alignment, of if a new conserved domain model is added to the superfamily as the result of a CDD database update). The family_superfamily_links file for each CDD release will be saved on the FTP site and can be used to track changes in superfamily clusters over time. 
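
To make the accession-retention rule for column 3 more concrete, here is a
small Python sketch. It is only an illustration of the rule as worded above,
not the CDD pipeline's actual procedure. The function names are
hypothetical, and the parser assumes that the four columns are separated by
whitespace and contain no embedded spaces; the cluster-merge case is not
handled.

```python
from collections import defaultdict


def parse_links(lines):
    """Group conserved domain accessions by superfamily cluster accession.

    Each input line is expected to hold the four columns described above.
    Assumption: columns are whitespace-separated; other lines are skipped.
    """
    clusters = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        cd_accession, _cd_pssm_id, sf_accession, _sf_pssm_id = fields
        clusters[sf_accession].add(cd_accession)
    return dict(clusters)


def accession_retained(prev_members, new_members):
    """Return True if a cluster would keep its accession under the stated rule.

    The accession is retired when more than 50 percent of the previous
    member accessions are gone, or when the cluster more than doubles in size.
    """
    if not prev_members:
        return False
    lost = len(prev_members - new_members)
    if 2 * lost > len(prev_members):
        return False
    if len(new_members) > 2 * len(prev_members):
        return False
    return True
```

Applied to the parsed member sets from two consecutive releases, a check of
this kind gives a first impression of which clusters were stable and which
were likely to have been replaced.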
(Scope E: this file includes data from NCBI-curated and imported domain
models that are members of superfamilies;
see section on "SCOPE OF DATA in FTP FILES" for details)

===============================================================================
fasta.tar.gz
===============================================================================

"fasta.tar.gz" contains sequence alignments from the CDs in mFASTA format.
Note that sequence fragments are identified with GIs and/or accessions, but
the alignments do not necessarily contain full-length sequences: the
fragments span the region between the first and last aligned residue only.

(Scope A: this file includes data from ALL CD models;
see section on "SCOPE OF DATA in FTP FILES" for details)

--------------------------
DIRECTORIES:
--------------------------

===============================================================================
"little_endian" subdirectory (still supported) and
"big_endian" subdirectory (no longer supported)
===============================================================================

The "little_endian" and "big_endian" subdirectories contain pre-formatted
search databases for use with the standalone RPS-BLAST executable. The
executable is available from
ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/ and is described in
ftp://ftp.ncbi.nih.gov/blast/documents/rpsblast.html.

The databases are formatted for use with the following architecture/OS
combinations:

  little_endian (current):  Intel/Linux, Intel/Windows, Intel/Solaris
  big_endian (outdated):    Sun/Solaris, SGI/IRIX

******************************************************************
** We can no longer maintain pre-computed search databases      **
** for big_endian architectures. Search databases distributed   **
** via the big_endian FTP directory are outdated. The data in   **
** the little_endian FTP directory, however, are current.       **
******************************************************************

The subdirectories contain gzipped archives for each of the 5 different
search sets listed above. Simply download the set you need, unpack the
archive, and use the search set with rpsblast on your platform.

(Scope A-: The scope of data in both directories is A- (i.e., just less than
 scope A). The directories contain ALL CD models accessible via the
 CD-Search tool (subsets 1-7, described in the "SCOPE OF DATA in FTP FILES"
 section of this document) BUT NOT subset 8 (LOAD). The data are organized
 into files based on source database.

 Scope B: The files named "Cdd_LE.tar.gz" and "Cdd_BE.tar.gz" have scope B.
 That is, they contain all domain models that are in the CD-Search tool's
 DEFAULT "cdd" database (subsets 1-6, described in the "SCOPE OF DATA in FTP
 FILES" section of this document).

 Scope C: The file named "Cdd_NCBI_LE.tar.gz" is available only in the
 little_endian directory and has scope C. That is, it contains only
 NCBI-curated CD models (subset 1, described in the "SCOPE OF DATA in FTP
 FILES" section of this document).)

File names in the little_endian directory contain the fragment "LE" and
file names in the big_endian directory contain the fragment "BE."

Note that starting with CDD version v3.11, the pre-calculated RPS-BLAST
databases are presented in a new format that includes frequency tables, and
RPS-BLAST searches can now be run using composition-corrected scoring. This
requires a recent RPS-BLAST binary (BLAST versions 2.2.28 and up).

If you prefer to format the search databases on your own rather than use
preformatted databases, see the "cdd.tar.gz" file description.

===============================================================================
rpsbproc subdirectory
===============================================================================

The "rpsbproc" subdirectory contains the "rpsbproc" command line utility,
which is an addition to the standalone RPS-BLAST executable. Both programs
are described below:

-----------------------------
RPS-BLAST
-----------------------------

RPS-BLAST is used to identify conserved domains, or functional units, within
a query sequence. RPS-BLAST searches a protein sequence (or a protein
translation of a nucleotide sequence) against a database of profiles that
represent conserved domains. This is the opposite of PSI-BLAST, which
searches a profile against a database of protein sequences, hence the term
'Reverse'.

For each query sequence, standalone RPS-BLAST lists the conserved domain
models that score above a certain threshold (default set to an evalue of
10), sorted by scores. The information provided for each hit includes the
conserved domain's PSSMID, a set of scores (e-value, bitscore, etc) and the
actual sequence alignment between the conserved domain and the query
sequence.

A standalone version of the RPS-BLAST program is packaged with the BLAST
executables (available on the NCBI FTP site at
ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/), and is also available as
part of the NCBI toolkit distribution (see ftp://ftp.ncbi.nih.gov/toolbox).
Additional details are provided in
ftp://ftp.ncbi.nih.gov/blast/documents/rpsblast.html.

-----------------------------
rpsbproc command line utility
-----------------------------

The "rpsbproc" utility post-processes the results of local RPS-BLAST
searches in order to provide a non-redundant view of the search results, and
to provide additional annotation on query sequences, such as domain
superfamilies and conserved sites, similar to the annotation provided by the
corresponding web services (e.g., the NCBI Batch CD-Search web service at
http://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi).

Specifically, the rpsbproc utility reads the output of rpsblast/rpstblastn,
fills in domain superfamily and conserved site information for each region
of the sequence, re-sorts the hits by a different standard, and calculates a
set of non-redundant representative hits. In this way, it turns the raw
alignments into domain/site annotations on the query sequence at different
redundancy levels, just like the Batch CD-Search service does on the web.
The annotation data is presented in tab-delimited tables to be processed
either programmatically or manually with a spreadsheet.

The rpsbproc command line utility is available from the Conserved Domain
Database (CDD) FTP site:
ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/rpsbproc/
and additional details are provided in the corresponding README file:
ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/rpsbproc/README

===============================================================================
5. CD-SEARCH WEB SERVICE VS. STANDALONE RPS-BLAST
===============================================================================

----------------------------------------------------------
5.1 What accounts for the differences in search results
    generated by the CD-Search web service and standalone
    RPS-BLAST?
----------------------------------------------------------

There are several differences between the CD-Search web service and
standalone RPS-BLAST, as distributed by NCBI and used with search databases
as distributed by the CDD group.
The web server is optimized for the most common use of the CDD resource,
which is to annotate protein sequences with clearly identified and well
understood protein domains, and is also optimized for speed in order to
accommodate a high volume of searches.

As part of the optimization, we use some different statistical parameters
for the web service than for the standalone RPS-BLAST application.
Specifically, we use a constant, assumed search "database size" setting on
the web server for calculating E-values. This means that the actual size of
the search database can change (we are adding new models every few weeks),
but the E-value computed for any individual GI -- PSSM match will remain
constant. This approach:

  (a) ensures that pre-calculated results are not dependent on the actual
      size of the model collection (which is redundant and mostly grows by
      increasing that redundancy);
  (b) facilitates incremental updates of pre-computed sequence annotation
      with conserved domains; and
  (c) is used for the creation of protein-CDD links.

In contrast, standalone RPS-BLAST does not employ the constant, assumed
database size parameter. So when you use a search set downloaded from the
CDD FTP site, the database size might be different from the one used by the
CD-Search web service, and the same hit of your query protein to a model
will receive a different E-value in the standalone result.

For example, if the size of the FTP'ed database is smaller than what the
CD-Search web service assumes in its database size parameter, the same hit
of your query protein to a model will receive a lower E-value in the
standalone result. Conversely, if the size of the FTP'ed database is larger
than what the CD-Search web service assumes in its database size parameter,
the same hit of your query protein to a conserved domain model will receive
a higher E-value in the standalone result.
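
The dependence of E-values on database size can be illustrated with the
standard BLAST relation between a bit score S' and its expected number of
chance hits, E = m * n * 2^(-S'), where m is the query length and n is the
total database length. The Python sketch below is a simplified illustration
under that assumption. The function names are hypothetical, and BLAST itself
uses "effective" lengths and composition-based adjustments, so the numbers
will not match real program output exactly.

```python
def evalue_from_bit_score(bit_score, query_len, db_len):
    """Approximate E-value for a bit score in a search space of m * n.

    Simplification: raw lengths are used instead of BLAST's effective
    lengths, so this only shows the direction and scale of the effect.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def rescale_evalue(evalue, from_db_len, to_db_len):
    """Rescale an E-value computed for one database length to another.

    Under the relation above, E is directly proportional to database length.
    """
    return evalue * (to_db_len / from_db_len)
```

Under this approximation, halving the database length halves the E-value of
the same hit, which is the behavior described in the preceding paragraph.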

----------------------------------------------------------
5.2 How can I configure standalone RPS-BLAST to generate
    the same results as the CD-Search web service?
----------------------------------------------------------

If you want standalone RPS-BLAST to use the same database size parameter
that is used for the web server (and thereby reproduce the same E-values
with standalone RPS-BLAST that are generated by the web service), you can do
that by creating an "alias" file on your local computer and placing it in
the same directory as the standalone RPS-BLAST executable. The file can have
a name such as "mycdd.pal" and can have contents such as the following
(where lines starting with "#" are comments):

  #
  # RPSBLAST alias file
  #
  TITLE mycdd
  #
  DBLIST ./Cdd
  #
  STATS_TOTLEN 5000000
  STATS_NSEQ 21000

This will now let you search against the database named "Cdd" using the two
search set size parameters as specified, e.g.:

~$ rpsblast -query rpstest.tfa -db mycdd -seg no -comp_based_stats 1 -evalue 0.01 -outfmt 7
# RPSBLAST 2.2.30+
# Query: gi|156356500|ref|XP_001623960.1| predicted protein [Nematostella vectensis]
# Database: mycdd
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 10 hits found
gi|156356500|ref|XP_001623960.1| gnl|CDD|259194 28.45 116 73 7 434 545 1 110 2e-07 50.0
gi|156356500|ref|XP_001623960.1| gnl|CDD|259194 24.04 104 62 6 54 149 2 96 0.009 35.3
gi|156356500|ref|XP_001623960.1| gnl|CDD|215056 21.18 85 61 2 463 541 27 111 8e-04 40.9
gi|156356500|ref|XP_001623960.1| gnl|CDD|119391 23.53 51 34 2 493 542 1 47 0.001 38.0
gi|156356500|ref|XP_001623960.1| gnl|CDD|119391 21.57 51 35 2 375 424 1 47 0.004 36.0
gi|156356500|ref|XP_001623960.1| gnl|CDD|119391 25.58 43 27 2 111 152 1 39 0.009 34.9
gi|156356500|ref|XP_001623960.1| gnl|CDD|197660 31.91 47 29 2 432 475 4 50 0.002 36.9
gi|156356500|ref|XP_001623960.1| gnl|CDD|197660 31.48 54 31 3 493 545 6 54 0.002 36.5
gi|156356500|ref|XP_001623960.1| gnl|CDD|197660 33.33 42 27 1 312 352 2 43 0.007 35.3
gi|156356500|ref|XP_001623960.1| gnl|CDD|192197 40.00 40 20 1 235 274 4 39 0.003 36.9
# BLAST processed 1 queries

In addition to the different statistical parameters, the CD-Search web
service does not filter out, by default, compositionally biased regions in
the query sequence. In contrast, standalone RPS-BLAST versions 2.2.28 and up
filter them out by default. In the current RPS-BLAST version 2.2.30, you can
avoid filtering (masking) by specifying "-seg no", where "seg" represents
the SEG algorithm used to compute the filter.

The CD-Search web service also employs composition-corrected scoring; use
the standalone RPS-BLAST command line option "-comp_based_stats 1" to mirror
this behavior.

Finally, some advanced options in standalone RPS-BLAST are not available in
the web service, such as the ability to use a single-hit/two-pass mode in
order to detect more distant homologous relationships. Users who select such
options in the standalone version may get search results that differ from
those of the web service.

===============================================================================
Aron Marchler-Bauer, Renata Geer, 16 January 2015