Files in this directory comprise the GlobDB release 220. 

File descriptions and brief methods describing how the files were created are listed below.
More information can be found at https://globdb.org/.


### File descriptions 

globdb_r220_genome_fasta.tar.gz
Fasta files for all genomes in the GlobDB. Fasta headers have been renamed using the anvi'o script anvi-script-reformat-fasta to simplify the contig header lines and include the GlobDB genome ID at the start of every contig.

globdb_r220_anvio_dbs.tar.gz
Anvi'o contigs databases for all GlobDB genomes, generated using the program anvi-gen-contigs-database from the fasta files provided in the "genome fasta" archive. These anvi'o databases include the gene calls and annotations of the genomes, and can be used to generate the protein fasta and gff files. Furthermore, they can be used as a resource for phylogenomics, pangenomics, or other comparative analyses. Note that the contigs databases were created with the development version of anvi'o and can therefore only be used with that version.

globdb_r220_protein_faa.tar.gz
Fasta files for all proteins in the GlobDB genomes, predicted using Prodigal as integrated in anvi'o and exported using the program anvi-get-sequences-for-gene-calls. Note that the headers of all proteins were renamed after export to include the GlobDB genome ID and to be consistent with the IDs in the exported gff files.

globdb_r220_gff_cog.tar.gz
Files in GFF format exported from anvi'o for all GlobDB genomes, with COG annotations generated using the anvi'o program anvi-run-ncbi-cogs. See the "Methods" page for more details.

globdb_r220_gff_kegg.tar.gz
Files in GFF format exported from anvi'o for all GlobDB genomes, with KEGG annotations generated using the anvi'o program anvi-run-kegg-kofams. See the "Methods" page for more details.

globdb_r220_gff_pfam.tar.gz
Files in GFF format exported from anvi'o for all GlobDB genomes, with Pfam annotations generated using the anvi'o program anvi-run-pfams. See the "Methods" page for more details.

globdb_r220_cugo.tar.gz
Twelve-column tab-delimited files derived from the "gff cog" files for all GlobDB genomes. Used for consensus genomic context visualization.

globdb_r220_tmhmm.tar.gz
Two-column tab-delimited files for each GlobDB genome, listing the number of transmembrane segments predicted by TMHMM for each protein.

globdb_r220_stats.tsv
Output of the `statswrapper.sh` program from the BBMap package for all genomes in the GlobDB.

globdb_r220_tax.tsv
Tab-delimited file containing the GlobDB genome ID and the seven-level taxonomy (assigned by GTDB-Tk), with each level in a separate field.
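
As a minimal sketch of how this table might be read, the Python snippet below assumes a file without a header line, with the genome ID in the first field followed by seven taxonomy fields ordered from domain to species. The column order, the rank names, and the example record are assumptions made for illustration and should be checked against the actual file.

```python
import csv
import io

# Assumed order of the seven taxonomy fields (not confirmed by this README).
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def read_taxonomy(handle):
    """Return {genome_id: {rank: taxon}} from a tab-delimited file handle."""
    taxonomy = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != len(RANKS) + 1:
            continue  # skip blank lines or lines with an unexpected field count
        genome_id, *levels = row
        taxonomy[genome_id] = dict(zip(RANKS, levels))
    return taxonomy


# Hypothetical placeholder record; real identifiers and lineages will differ.
example = io.StringIO("GENOME_A\td__D\tp__P\tc__C\to__O\tf__F\tg__G\ts__G sp\n")
example_taxonomy = read_taxonomy(example)
print(example_taxonomy["GENOME_A"]["genus"])
```

For the real file, the same function could be called on a handle opened with `open("globdb_r220_tax.tsv", newline="")`.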

globdb_r220_md5sum
MD5 checksums of the downloadable files.
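
Downloads can be checked against this file. The sketch below assumes the checksum file uses the usual `md5sum` layout (one line per file: hexadecimal digest, whitespace, file name) and that the downloaded files sit in the same directory; both are assumptions, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum_lines(lines):
    """Yield (expected_digest, file_name) pairs from md5sum-style lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading '*' on the file name.
        yield expected.lower(), name.lstrip("*")


def verify(md5_file, directory="."):
    """Return {file_name: True/False}; missing files are reported as False."""
    results = {}
    with open(md5_file) as fh:
        for expected, name in parse_md5sum_lines(fh):
            target = Path(directory) / name
            results[name] = target.is_file() and md5_of(target) == expected
    return results
```

Calling `verify("globdb_r220_md5sum")` in the download directory would then report, per file, whether the checksum matches.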


### Brief methods

# Dereplication of the underlying databases
Each of the four data sources for the GlobDB release 220 (GTDB, GEM, SPIRE, SMAG) provides a dereplicated set of genomes, using a 95 % average nucleotide identity (ANI) dereplication criterion. Because each database is built from distinct underlying data, centroids from different datasets may belong to the same species by the operational definition of 95 % ANI, while together representing a cluster of genomes that extends beyond that operational species boundary. Thus, the genomes that two such centroids represent would not necessarily all be classified as the same species if all data were processed together.
The GlobDB therefore conservatively dereplicates the underlying (already dereplicated) datasets at 96 % ANI over a 50 % aligned fraction (AF). We chose these cutoffs after inspecting a crossplot of ANI vs AF values, which shows a clear ANI gap between the dereplicated datasets starting at approximately 96 % ANI. Furthermore, the order of priority for inclusion in the GlobDB is as follows:
1) all GTDB species representatives are included. 
2) GEM species representatives not present in the GTDB are included. 
3) SPIRE species representatives not present in either GTDB or GEM are included. 
4) SMAG species representatives not present in any of the other three datasets are included.

Dereplication of the data sources is done using fastANI, version 1.3.4, run with the default fragment length of 3000 bp.
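
The priority scheme described above can be illustrated with a short Python sketch. It is not the GlobDB pipeline itself: it assumes that pairwise comparisons have already been reduced to (genome_a, genome_b, ANI, AF) tuples, that the thresholds are inclusive, and that a genome counts as "present" when it matches any genome already included from a higher-priority source. All genome names and values in the example are hypothetical.

```python
from collections import defaultdict

PRIORITY = ["GTDB", "GEM", "SPIRE", "SMAG"]  # inclusion order from the list above


def dereplicate(genomes, hits, min_ani=96.0, min_af=0.5):
    """Select genomes by source priority.

    genomes: dict mapping genome ID -> source name
    hits:    iterable of (genome_a, genome_b, ani_percent, aligned_fraction)
    """
    # Record which genome pairs exceed both cutoffs (assumed inclusive).
    neighbours = defaultdict(set)
    for a, b, ani, af in hits:
        if a != b and ani >= min_ani and af >= min_af:
            neighbours[a].add(b)
            neighbours[b].add(a)

    included = []
    for source in PRIORITY:
        # Compare only against genomes kept from higher-priority sources;
        # each source is already dereplicated internally.
        higher_priority = set(included)
        for genome_id in sorted(g for g, s in genomes.items() if s == source):
            if neighbours[genome_id] & higher_priority:
                continue  # already represented, so dropped
            included.append(genome_id)
    return included


# Hypothetical toy data.
toy_genomes = {"gtdb1": "GTDB", "gem1": "GEM", "gem2": "GEM",
               "spire1": "SPIRE", "smag1": "SMAG"}
toy_hits = [("gem1", "gtdb1", 97.2, 0.8),    # above both cutoffs
            ("gem2", "gtdb1", 95.0, 0.9),    # ANI below cutoff
            ("spire1", "gem2", 96.5, 0.6),   # above both cutoffs
            ("smag1", "gtdb1", 98.0, 0.3)]   # AF below cutoff
toy_result = dereplicate(toy_genomes, toy_hits)
print(toy_result)
```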

As the GTDB updates conservatively, minimizing the number of genomes that are removed between versions, and the other three data sources are (thus far) static datasets, updates between versions are done by checking the GEM, SPIRE, and SMAG species representative genomes against the new GTDB release, and dropping any genomes now represented in the GTDB.
For convenience in later use, the names of the genomes in the GEM, SPIRE, and SMAG datasets are standardised to GEMOTU, SPIREOTU and SMAGOTU respectively. In addition, the SPIRE dataset includes genomes from the proGenomes3 resource, identified as SPECIV4.
Dictionary files that relate the GlobDB identifiers back to the identifiers from the original publications are available on the "Downloads" page.
 
 
# Generation of anvi'o databases and annotation
After dereplication, the resulting genome fasta files are turned into anvi'o databases and basic annotation is performed. 

First, contig names are standardized using anvi'o with anvi-script-reformat-fasta and the options: 
--seq-type NT			# specifies the sequence type
--simplify-names		# standardizes the contig name format
--overwrite-input		# overwrites the source file instead of generating a new file
--prefix <GlobDB_ID>		# prepends the GlobDB ID to the contig name

Then, contigs databases are created using anvi-gen-contigs-database, with the GlobDB ID as the database name (-n flag).
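
The two steps above can be sketched as argument lists that could be passed to `subprocess.run`. The anvi-script-reformat-fasta options are the ones listed above; the `-f` and `-o` options for anvi-gen-contigs-database and the file names are assumptions based on common anvi'o usage and should be checked against the installed version. The genome ID is a hypothetical placeholder.

```python
import shlex


def reformat_cmd(fasta_path, globdb_id):
    """Arguments to standardize contig names in place, prefixed with the GlobDB ID."""
    return ["anvi-script-reformat-fasta", fasta_path,
            "--seq-type", "NT",
            "--simplify-names",
            "--overwrite-input",
            "--prefix", globdb_id]


def contigs_db_cmd(fasta_path, globdb_id, db_path):
    """Arguments to build a contigs database named after the GlobDB ID."""
    return ["anvi-gen-contigs-database",
            "-f", fasta_path,   # input fasta (assumed option name)
            "-o", db_path,      # output database path (assumed option name)
            "-n", globdb_id]    # database name, as described above


# Hypothetical genome ID and file names, printed rather than executed.
genome_id = "EXAMPLE_ID"
for cmd in (reformat_cmd(f"{genome_id}.fa", genome_id),
            contigs_db_cmd(f"{genome_id}.fa", genome_id, f"{genome_id}.db")):
    print(shlex.join(cmd))
```

To actually run a command, each list could be passed to `subprocess.run(cmd, check=True)` in an environment where anvi'o is installed.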

These contigs databases are then populated with annotations using the following commands:
anvi-run-hmms			# annotates rRNA genes, single-copy marker genes, and tRNAs using the "--also-scan-trnas" flag
anvi-run-kegg-kofams		# annotates coding regions with KEGG KOfams, and a custom heuristic
anvi-run-pfams			# annotates coding regions with Pfams
anvi-run-cazymes		# annotates coding regions with dbCAN2 CAZymes
anvi-run-ncbi-cogs		# annotates coding regions with COGs (2020 update)

Next, protein fasta files and gff files for KEGG, COG and Pfam annotations are exported from the contigs databases.

These files can also be generated directly from the anvi'o databases using the anvi-get-sequences-for-gene-calls command with the flags listed below. By default, the GlobDB ID is not included in the protein fasta file header; it is added after export.
--get-aa-sequences --wrap 0					# protein fasta
--export-gff3 --annotation-source COG20_FUNCTION		# gff with COG annotation
--export-gff3 --annotation-source KOfam				# gff with KEGG annotation
--export-gff3 --annotation-source Pfam				# gff with Pfam annotation
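
As a rough illustration, the four exports can be expressed as argument lists for anvi-get-sequences-for-gene-calls. The flag sets are the ones listed above; the `-c` (contigs database) and `-o` (output file) options and the output file names are assumptions that should be verified against the installed anvi'o version, and the genome ID is a hypothetical placeholder.

```python
import shlex

# Flag sets taken from the list above, keyed by an illustrative export name.
EXPORT_FLAGS = {
    "protein_faa": ["--get-aa-sequences", "--wrap", "0"],
    "gff_cog": ["--export-gff3", "--annotation-source", "COG20_FUNCTION"],
    "gff_kegg": ["--export-gff3", "--annotation-source", "KOfam"],
    "gff_pfam": ["--export-gff3", "--annotation-source", "Pfam"],
}


def export_cmd(db_path, kind, out_path):
    """Build the export command for one contigs database and one export kind."""
    return ["anvi-get-sequences-for-gene-calls",
            "-c", db_path,    # contigs database (assumed option name)
            "-o", out_path,   # output file (assumed option name)
            *EXPORT_FLAGS[kind]]


# Hypothetical genome ID and file names, printed rather than executed.
genome_id = "EXAMPLE_ID"
for kind in EXPORT_FLAGS:
    suffix = "faa" if kind == "protein_faa" else "gff"
    print(shlex.join(export_cmd(f"{genome_id}.db", kind, f"{genome_id}_{kind}.{suffix}")))
```

Note that the exported protein headers would still lack the GlobDB ID; as stated above, it is added in a separate step after export.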