Making Custom Databases

EsViritu ships with a curated database of human, animal and plant virus genomes, but you can build and use a custom database if you need to target a different set of viruses (e.g. environmental viruses, phages, or a custom subset of NCBI sequences). One use case is improving consensus genome assembly by including many closely related genomes for a species of interest. See this example for RSV.

Two helper commands are provided:

esv_create_taxonomy — Takes a table of accessions and generates a fully annotated EsViritu-style metadata TSV using NCBI taxonomy.
esv_combine_tax — Merges two or more metadata TSVs into a single file, deduplicating records.

A custom database requires three files, all referring to the same set of accessions:

A FASTA file of virus genome sequences (e.g. downloaded from NCBI).
virus_pathogen_database.fna
A minimap2 index (.mmi) of that FASTA file, built with short-read settings.
virus_pathogen_database.mmi
created via: minimap2 -x sr -d virus_pathogen_database.mmi virus_pathogen_database.fna
A metadata TSV produced by esv_create_taxonomy (and optionally combined with esv_combine_tax).
virus_pathogen_database.all_metadata.tsv

The directory containing these three files is then passed to EsViritu via --db.
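A missing index or metadata file is easy to overlook, so a quick check of the directory can help before a run. The sketch below is not part of EsViritu: `missing_db_files` is a hypothetical helper, and it assumes the three file names listed above are the ones expected inside the `--db` directory.

```python
from pathlib import Path

# File names taken from the list above (assumed to be what --db expects).
EXPECTED_FILES = (
    "virus_pathogen_database.fna",
    "virus_pathogen_database.mmi",
    "virus_pathogen_database.all_metadata.tsv",
)


def missing_db_files(db_dir):
    """Return the expected database files that are absent from db_dir."""
    db_path = Path(db_dir)
    return [name for name in EXPECTED_FILES if not (db_path / name).is_file()]
```

Calling `missing_db_files("/path/to/custom_DB")` returns an empty list when all three files are present.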

Making an EsViritu-Style Database with NCBI Records

Note

If you want to use custom records (not in NCBI), you'll have to generate a metadata file with the columns specified in the Output columns section.

Install `pytaxonkit`

esv_create_taxonomy requires pytaxonkit, which is available on Bioconda:

conda install "bioconda::pytaxonkit>=0.10"

Download the NCBI accession-to-TaxID mapping

wget https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz

zcat nucl_gb.accession2taxid.gz | cut -f 2,3 | gzip -c > acc2taxid.tsv.gz

Estimated size: ~900 MB (compressed).
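To confirm that your accessions are present in the reduced mapping before running esv_create_taxonomy, you can scan it directly. This is a minimal sketch, assuming the two-column layout (versioned accession, TaxID) produced by the `cut` command above; `load_taxids` is a hypothetical helper, not an EsViritu function.

```python
import gzip


def load_taxids(acc2taxid_path, wanted):
    """Map each accession in `wanted` to its TaxID, if present in the file."""
    wanted = set(wanted)
    found = {}
    with gzip.open(acc2taxid_path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip malformed lines and any header row carried over by `cut`.
            if len(fields) != 2 or not fields[1].isdigit():
                continue
            accession, taxid = fields
            if accession in wanted:
                found[accession] = int(taxid)
    return found
```

Accessions missing from the returned dictionary have no entry in the mapping. The file is streamed line by line, so memory use depends on the number of requested accessions rather than on the file size.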

Download the NCBI taxonomy dump

mkdir taxdump && cd taxdump

wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz

tar -xvf taxdump.tar.gz

Estimated size: ~3 GB (uncompressed).

Step 1: Prepare your accession table

Create a tab-delimited input TSV with at minimum these required columns:

Column	Required	Description
`Accession`	yes	NCBI accession with version (e.g. `NC_001401.2`)
`Organism_Name`	yes	Virus name (becomes the `Name` field in output)
`Length`	yes	Sequence length in bp
`Assembly`	recommended	Assembly identifier for grouping multi-segment viruses (e.g. `GCF_000862125.1`). If absent, each accession gets its own assembly ID (`set:<Accession>`).
`Segment`	optional	Segment label (e.g. `S`, `M`, `L`).
`description`	optional	Free-text description. Defaults to `Organism_Name` if absent.

This table is typically exported from the NCBI Virus web interface, but it could also be generated with NCBI Entrez.
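Because exported tables vary, it can be worth checking the required columns before Step 2. The following is a sketch only: `check_accession_table` is a hypothetical helper, and the two row-level checks (a `.` in the accession, an integer `Length`) are assumptions about what a well-formed row looks like, not rules enforced by esv_create_taxonomy.

```python
import csv

REQUIRED_INPUT_COLUMNS = ("Accession", "Organism_Name", "Length")


def check_accession_table(handle):
    """Return a list of problems found in a tab-delimited accession table."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    problems = [
        f"missing column: {name}"
        for name in REQUIRED_INPUT_COLUMNS
        if name not in header
    ]
    if problems:
        return problems
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        if "." not in (row["Accession"] or ""):
            problems.append(f"line {line_number}: accession lacks a version suffix")
        if not (row["Length"] or "").isdigit():
            problems.append(f"line {line_number}: Length is not an integer")
    return problems
```

Pass an open file, for example `check_accession_table(open("my_accessions.tsv", newline=""))`; an empty list means no problems were found.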

Tip

For segmented viruses, make sure all segments share the same Assembly value. Asm_length is computed as the sum of Length across all accessions with the same Assembly.
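The grouping rule can be illustrated with a short sketch. It follows the description above (sum `Length` per `Assembly`, falling back to `set:<Accession>` when no assembly is given) but is not the tool's actual code; `add_asm_length` is a hypothetical name.

```python
from collections import defaultdict


def add_asm_length(records):
    """Return copies of records with Assembly filled in and Asm_length added."""
    filled = []
    totals = defaultdict(int)
    for record in records:
        row = dict(record)
        # Fall back to a per-accession assembly ID when none is given.
        if not row.get("Assembly"):
            row["Assembly"] = f"set:{row['Accession']}"
        totals[row["Assembly"]] += int(row["Length"])
        filled.append(row)
    for row in filled:
        row["Asm_length"] = totals[row["Assembly"]]
    return filled
```

With two segments of 1000 and 3000 bp sharing one `Assembly` value, both rows get an `Asm_length` of 4000. If the segments had different (or missing) `Assembly` values, each would be treated as its own assembly.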

Step 2: Run `esv_create_taxonomy`

esv_create_taxonomy \
    -i my_accessions.tsv \
    -o my_virus_metadata.tsv \
    -a /path/to/acc2taxid.tsv.gz \
    -t /path/to/taxdump

Arguments

Flag	Required	Description
`-i` / `--input`	yes	Input accession TSV (see Step 1).
`-o` / `--output`	yes	Output metadata TSV path.
`-a` / `--accfile`	yes	Path to `acc2taxid.tsv.gz` (see Download the NCBI accession-to-TaxID mapping).
`-t` / `--taxonkit_dir`	yes	Path to the taxdump directory (see Download the NCBI taxonomy dump).
`-s` / `--subspecies-label`	no	How to populate the `subspecies` field. See below.

`--subspecies-label` options

By default (-s not specified), esv_create_taxonomy uses pytaxonkit to assign the subspecies/strain rank from NCBI taxonomy. This is the recommended option.

Alternatively, you can override the subspecies field using an existing column from your input:

Value	Source column	Notes
(default)	NCBI taxonomy	Recommended; uses taxonkit `t__` format.
`organism`	`Organism_Name`	Useful when organism names encode strain info.
`genotype`	`Genotype`	Input must have a `Genotype` column.
`subspecies`	`subspecies`	Use a pre-existing subspecies column as-is.
`species`	`Species`	Input must have a `Species` column.
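Each non-default value depends on a specific input column, so it may help to check the header before choosing one. This sketch simply encodes the table above; `column_needed` is a hypothetical helper and does not reflect how esv_create_taxonomy validates its input.

```python
# Source column for each documented --subspecies-label value (from the table above).
SUBSPECIES_SOURCE = {
    "organism": "Organism_Name",
    "genotype": "Genotype",
    "subspecies": "subspecies",
    "species": "Species",
}


def column_needed(label, header):
    """Return the input column that `label` needs but `header` lacks, else None."""
    if label is None:  # default: subspecies is taken from NCBI taxonomy
        return None
    source = SUBSPECIES_SOURCE[label]
    return None if source in header else source
```

For example, `column_needed("genotype", ["Accession", "Organism_Name", "Length"])` returns `"Genotype"`, indicating that the column would have to be added to the input first.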

Output columns

The output TSV contains the following columns (plus any extra columns from the input):

Column	Description
`Accession`	NCBI accession with version
`description`	Sequence description
`Name`	Virus name (from `Organism_Name`)
`Segment`	Segment label (null if not applicable)
`kingdom`	Taxonomic rank
`phylum`	Taxonomic rank
`tclass`	Taxonomic class
`order`	Taxonomic rank
`family`	Taxonomic rank
`genus`	Taxonomic rank
`species`	Taxonomic rank
`subspecies`	Subspecies/strain label
`Length`	Sequence length (bp)
`TaxID`	NCBI Taxonomy ID
`Assembly`	Assembly identifier
`Asm_length`	Total assembly length (sum of `Length` across segments)

NOTE: There will be several extra columns from pytaxonkit that are not removed by default, as they do not affect EsViritu processing.
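If you write a metadata file by hand (for records that are not in NCBI), a header check can catch missing columns early. This is a sketch under the assumption that the columns in the table above are the complete required set; `missing_metadata_columns` is a hypothetical helper.

```python
import csv

# Column names copied from the output table above.
REQUIRED_METADATA_COLUMNS = (
    "Accession", "description", "Name", "Segment", "kingdom", "phylum",
    "tclass", "order", "family", "genus", "species", "subspecies",
    "Length", "TaxID", "Assembly", "Asm_length",
)


def missing_metadata_columns(tsv_path):
    """Return required columns that are absent from the TSV header."""
    with open(tsv_path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in REQUIRED_METADATA_COLUMNS if name not in header]
```

Extra columns are ignored by this check, in line with the note above.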

Step 3 (optional): Combine multiple metadata tables with `esv_combine_tax`

If you have built metadata tables from multiple input TSVs (e.g. different virus groups or separate NCBI Datasets exports), combine them into a single file:

esv_combine_tax standard_DB/virus_pathogen_database.all_metadata.tsv table1.tsv table2.tsv -o custom_DB/virus_pathogen_database.all_metadata.tsv

Duplicate rows (identical across all columns) are removed.
Tables must all have the required columns (Accession, description, Name, Segment, kingdom, phylum, tclass, order, family, genus, species, subspecies, Length, TaxID, Assembly, Asm_length). Tables missing any of these will cause an error.
Extra columns beyond the required set are preserved.
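The merge rules above can be sketched in a few lines. This is an illustration, not the esv_combine_tax implementation: `combine_tables` is a hypothetical function, and filling absent extra columns with empty strings is an assumption about how tables with different extra columns are reconciled.

```python
import csv


def combine_tables(paths, output_path):
    """Merge metadata TSVs, dropping rows identical across all columns."""
    columns, rows = [], []
    for path in paths:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)  # keep extra columns, first-seen order
            rows.extend(reader)
    seen = set()
    kept = 0
    with open(output_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=columns, delimiter="\t",
            restval="", extrasaction="ignore", lineterminator="\n",
        )
        writer.writeheader()
        for row in rows:
            # Assumption: a column missing from a table counts as empty.
            key = tuple(row.get(name) or "" for name in columns)
            if key not in seen:
                seen.add(key)
                writer.writerow(row)
                kept += 1
    return kept
```

The function returns the number of rows written. Rows that share an accession but differ in any other column are both kept, since only fully identical rows are treated as duplicates.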

Arguments

Flag	Default	Description
`tables` (positional)	—	Two or more metadata TSV files to merge.
`-o` / `--output`	`virus_pathogen_database.all_metadata.tsv`	Output file path.

Step 4: Combine FASTA sequences and Index for minimap2

Combine all the FASTA files that correspond to records in your metadata table.

cat standard_DB/virus_pathogen_database.fna custom_seqs.fna | seqkit rmdup | seqkit seq > custom_DB/virus_pathogen_database.fna

minimap2 -x sr -d custom_DB/virus_pathogen_database.mmi custom_DB/virus_pathogen_database.fna

Note

The FASTA record IDs must match the Accession column in your metadata TSV exactly (including version suffix, e.g. NC_001401.2).

A script to check this is not provided, so please do your due diligence.
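One way to do that check yourself is to compare the two ID sets directly. The sketch below is not shipped with EsViritu; the function names are hypothetical, and it assumes that a FASTA record ID is the header text after `>` up to the first whitespace.

```python
import csv


def fasta_ids(fasta_path):
    """Collect record IDs: the header text after '>' up to the first whitespace."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                if parts:
                    ids.add(parts[0])
    return ids


def metadata_accessions(tsv_path):
    """Collect the values of the Accession column from a metadata TSV."""
    with open(tsv_path, newline="") as handle:
        return {row["Accession"] for row in csv.DictReader(handle, delimiter="\t")}


def compare_ids(fasta_path, tsv_path):
    """Return (IDs only in the FASTA, accessions only in the metadata), sorted."""
    in_fasta = fasta_ids(fasta_path)
    in_meta = metadata_accessions(tsv_path)
    return sorted(in_fasta - in_meta), sorted(in_meta - in_fasta)
```

Both returned lists should be empty for a consistent database. A version mismatch (for example an ID without its `.2` suffix) shows up as one entry in each list.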

Step 5: Run EsViritu with the custom database

Pass the directory containing your custom FASTA, minimap2 index, and metadata files via --db:

EsViritu \
    -r /path/to/reads1.fastq /path/to/reads2.fastq \
    -s my_sample \
    -o my_output_dir \
    --db /path/to/custom_DB
