SOURMASH(1) | sourmash | SOURMASH(1) |
NAME¶
sourmash - sourmash Documentation
sourmash is a command-line tool and Python/Rust library for metagenome analysis and genome comparison using k-mers. It supports the compositional analysis of metagenomes, rapid search of large sequence databases, and flexible taxonomic profiling with both NCBI and GTDB taxonomies (see our prepared databases for more information). sourmash works well with sequences 30kb or larger, including bacterial and viral genomes.
You might try sourmash if you want to -
- identify which reference genomes to use for metagenomic read mapping;
- search all Genbank microbial genomes with a sequence query;
- cluster hundreds or thousands of genomes by similarity;
- taxonomically classify genomes or metagenomes against NCBI and/or GTDB;
- search thousands of metagenomes with a query genome or sequence;
New! The sourmash project also supports querying all 1 million publicly available metagenomes in the Sequence Read Archive. Give it a try!
Our vision: sourmash strives to support biologists in analyzing modern sequencing data at high resolution and with full context, including all public reference genomes and metagenomes.
MISSION STATEMENT¶
This project's mission is to provide practical tools and approaches for analyzing extremely large sequencing data sets, with an emphasis on high resolution results. Our designs follow these guiding principles:
- genomic and metagenomic analyses should be able to make use of all available reference genomes.
- metagenomic analyses should support assembly independent approaches, to avoid biases stemming from low coverage or high strain variability.
- private and public databases should be equally well supported.
- a variety of data structures and algorithms are necessary to support a wide set of use cases, including efficient command-line analysis, real-time queries, and massive-scale batch analyses.
- our tools should be well behaved members of the bioinformatics analysis tool ecosystem, and use common installation approaches, standard formats, and semantic versioning.
- our tools should be robustly tested, well documented, and supported.
- we discuss scientific and computational tradeoffs and make specific recommendations where possible, admitting uncertainty as needed.
HOW DOES SOURMASH WORK?¶
Underneath, sourmash uses FracMinHash sketches for fast and lightweight sequence comparison; FracMinHash builds on MinHash sketching to support both Jaccard similarity and containment analyses with k-mers. This significantly expands the range of operations that can be done quickly and in low memory. sourmash also implements a number of new and powerful techniques for analysis, including minimum metagenome covers and alignment-free ANI estimation.
sourmash is inspired by mash, and supports most mash analyses. sourmash also implements an expanded set of functionality for metagenome and taxonomic analysis.
While sourmash is currently single-threaded, the branchwater plugin for sourmash provides faster and lower-memory multithreaded implementations of several important sourmash features - sketching, searching, and gather (metagenome decomposition). It does so by implementing higher-level functions in Rust on top of the core Rust library of sourmash. As a result it provides some of the same functionality as sourmash, but 10-100x faster and in 10x lower memory. Note that this code is functional and tested, but does not have all of the features of sourmash. Code and features will be integrated back into sourmash as they mature.
sourmash development was initiated with a grant from the Moore Foundation under the Data Driven Discovery program, and has been supported by further funding from the NIH and NSF. Please see funding acknowledgements for details!
USING SOURMASH¶
Tutorials and examples¶
These tutorials are command line tutorials that should work on Mac OS X and Linux. They require about 5 GB of disk space and 5 GB of RAM.
- Installing sourmash with conda
- The first sourmash tutorial - making signatures, comparing, and searching
- Using sourmash LCA to do taxonomic classification
- Analyzing the genomic and taxonomic composition of an environmental genome using GTDB and sample-specific MAGs with sourmash
- Some sourmash command line examples!
How-To Guides¶
- Classifying genome and metagenome sketches
- Working with private collections of genome sketches
- Using the LCA_Database API
- Building plots from sourmash compare output.
- A short guide to using sourmash output with R.
FREQUENTLY ASKED QUESTIONS¶
- •
- Frequently asked questions
How sourmash works under the hood¶
- An introduction to k-mers for genome comparison and analysis
- Support, versioning, and migration between versions
Reference material¶
- Full table of contents for all docs
- UNIX command-line documentation
- Genbank and GTDB databases and taxonomy files
- Python examples using the API
- Publications about sourmash
- A guide to the internal design and structure of sourmash
- Funding acknowledgements
DEVELOPING AND EXTENDING SOURMASH¶
- •
- Releasing a new version of sourmash
Tutorials and examples¶
These tutorials are command line tutorials that should work on Mac OS X and Linux. They require about 5 GB of disk space and 5 GB of RAM.
- Installing sourmash with conda
- The first sourmash tutorial - making signatures, comparing, and searching
- Using sourmash LCA to do taxonomic classification
- Analyzing the genomic and taxonomic composition of an environmental genome using GTDB and sample-specific MAGs with sourmash
- Some sourmash command line examples!
How-To Guides¶
- Classifying genome and metagenome sketches
- Working with private collections of genome sketches
- Using the LCA_Database API
- Building plots from sourmash compare output.
- A short guide to using sourmash output with R.
Frequently Asked Questions¶
- •
- Frequently asked questions
How sourmash works under the hood¶
- An introduction to k-mers for genome comparison and analysis
- Support, versioning, and migration between versions
Reference material¶
- UNIX command-line documentation
- Genbank and GTDB databases and taxonomy files
- Python examples using the API
- Publications about sourmash
- A guide to the internal design and structure of sourmash
- Funding acknowledgements
Developing and extending sourmash¶
- Getting started with sourmash development
- Releasing a new version of sourmash
Full table of contents for all docs¶
Using sourmash from the command line¶
Contents¶
- •
- Using sourmash from the command line
- An example
- The sourmash command and its subcommands
- sourmash sketch - make sourmash signatures from sequence data
- sourmash compute - make sourmash signatures from sequence data
- sourmash compare - compare many signatures
- sourmash plot - cluster and visualize comparisons of many signatures
- sourmash search - search for signatures in collections or databases
- sourmash gather - find metagenome members
- sourmash index - build an SBT index of signatures
- sourmash prefetch - select subsets of very large databases for more processing
- sourmash multigather - do gather with many queries
- •
- sourmash tax subcommands for integrating taxonomic information into gather results
- sourmash tax metagenome - summarize metagenome content from gather results
- sourmash tax genome - classify a genome using gather results
- sourmash tax annotate - annotates gather output with taxonomy
- sourmash tax prepare - prepare and/or combine taxonomy files
- sourmash tax grep - subset taxonomies and create picklists based on taxonomy string matches
- sourmash tax summarize - print summary information for lineage spreadsheets or taxonomy databases
- •
- sourmash lca subcommands for in-memory taxonomy integration
- sourmash lca classify - classify a genome using an LCA database
- sourmash lca summarize - summarize a metagenome's contents using an LCA database
- sourmash lca index - build an LCA database
- sourmash lca rankinfo - examine an LCA database
- sourmash lca compare_csv - compare taxonomic spreadsheets
- •
- sourmash signature subcommands for signature manipulation
- sourmash signature cat - combine signatures into one file
- sourmash signature describe - display detailed information about signatures
- sourmash signature fileinfo - display a summary of the contents of a sourmash collection
- sourmash signature grep - extract matching signatures using pattern matching
- sourmash signature split - split signatures into individual files
- sourmash signature merge - merge two or more signatures into one
- sourmash signature rename - rename a signature
- sourmash signature subtract - subtract other signatures from a signature
- sourmash signature intersect - intersect two (or more) signatures
- sourmash signature inflate - transfer abundances from one signature to others
- sourmash signature downsample - decrease the size of a signature
- sourmash signature extract - extract signatures from a collection
- sourmash signature flatten - remove abundance information from signatures
- sourmash signature filter - remove hashes based on abundance
- sourmash signature import - import signatures from mash.
- sourmash signature export - export signatures to mash.
- sourmash signature overlap - detailed comparison of two signatures' overlap
- sourmash signature kmers - extract k-mers and/or sequences that match to signatures
- sourmash signature manifest - output a manifest for a file
- sourmash signature check - compare picklists and manifests
- sourmash signature collect - collect manifests across databases
- •
- Advanced command-line usage
- Loading signatures and databases
- Selecting signatures
- Using picklists to subset large collections of signatures
- Storing (and searching) signatures
- Zip files
- Choosing signature output formats
- Loading many signatures
- Combining search databases on the command line
- Using stdin
- Using standalone manifests to explicitly refer to collections of files
- •
- Using sourmash plugins
- The branchwater plugin - multithreaded and optimized sourmash operations
- The betterplot plugin - improved plotting and visualization
From the command line, sourmash can be used to create FracMinHash sketches from DNA and protein sequences, compare them to each other, and plot the results; these sketches are saved into "signature files". These signatures allow you to estimate sequence similarity and containment quickly and accurately in large collections, among other capabilities.
sourmash also provides a suite of metagenome functionality. This includes genome search in metagenomes, metagenome decomposition into a list of genomes from a database, and taxonomic classification functionality.
Please see the mash software and the mash paper (Ondov et al., 2016) for background information on how and why MinHash sketches work. The FracMinHash preprint (Irber et al, 2022) describes FracMinHash sketches as well as the metagenome-focused features of sourmash.
sourmash uses a subcommand syntax, so all commands start with sourmash followed by a subcommand specifying the action to be taken.
An example¶
Download three bacterial genomes from NCBI:
curl -L -O https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/017/325/GCF_000017325.1_ASM1732v1/GCF_000017325.1_ASM1732v1_genomic.fna.gz curl -L -O https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/021/665/GCF_000021665.1_ASM2166v1/GCF_000021665.1_ASM2166v1_genomic.fna.gz curl -L -O https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
Compute sourmash signatures for them all:
sourmash sketch dna -p k=31 *.fna.gz
This will produce three .sig files containing MinHash signatures using a k-mer size of 31.
Next, compare all the signatures to each other:
sourmash compare *.sig -o cmp.dist
Finally, plot a dendrogram:
sourmash plot cmp.dist --labels
This will output three files, cmp.dist.dendro.png, cmp.dist.matrix.png, and cmp.dist.hist.png, containing a clustering & dendrogram of the sequences, a similarity matrix and heatmap, and a histogram of the pairwise similarities between the three genomes.
Matrix:
[image: Matrix] [image]
Here, the two genomes that cluster together are strains of the same species, while the third is from a completely different genus.
The sourmash command and its subcommands¶
To get a list of subcommands, run sourmash without any arguments.
Please use the command line option --help to get more detailed usage information for each command.
All signature saving commands can save to a variety of formats (we suggest .zip files) and all signature loading commands can load signatures from any of these formats.
There are seven main subcommands: sketch, compare, plot, search, gather, index, and prefetch. See the tutorial for a walkthrough of these commands.
- sketch creates signatures.
- compare compares signatures and builds a similarity matrix.
- plot plots similarity matrices created by compare.
- search finds matches to a query signature in a collection of signatures.
- gather finds the best reference genomes for a metagenome, using the provided collection of signatures.
- index builds a fast index for many (thousands) of signatures.
- prefetch selects signatures of interest from a very large collection of signatures, for later processing.
There are also a number of commands that work with taxonomic information; these are grouped under the sourmash tax and sourmash lca subcommands.
sourmash tax commands:
- tax metagenome - summarize metagenome gather results at each taxonomic rank.
- tax genome - summarize single-genome gather results and report most likely classification.
- tax annotate - annotate gather results with lineage information (no summarization or classification).
- tax prepare - prepare and/or combine taxonomy files.
- tax grep - subset taxonomies and create picklists based on taxonomy string matches.
- tax summarize - print summary information (counts of lineages) for a taxonomy lineages file or database.
sourmash lca commands:
ATTENTION:
- lca classify classifies many signatures against an LCA database.
- lca summarize summarizes the content of metagenomes using an LCA database.
- lca index creates a database for use with LCA subcommands.
- lca rankinfo summarizes the content of a database.
- lca compare_csv compares lineage spreadsheets, e.g. those output by lca classify.
See the LCA tutorial for a walkthrough of some of these commands.
Finally, there are a number of utility and information commands:
- info shows version and software information.
- index indexes many signatures using a Sequence Bloom Tree (SBT).
- sbt_combine combines multiple SBTs.
- categorize is an experimental command to categorize many signatures.
- watch is an experimental command to classify a stream of sequencing data.
- multigather is an experimental command to run multiple gathers against the same collection of databases.
Please use the command line option --help to get more detailed usage information for each command.
sourmash sketch - make sourmash signatures from sequence data¶
Most of the commands in sourmash work with signatures, which contain information about genomic or proteomic sequences. Each signature contains one or more sketches, which are compressed versions of these sequences. Using sourmash, you can search, compare, and analyze these sequences in various ways.
To create a signature with one or more sketches, you use the sourmash sketch command. There are four main commands:
sourmash sketch dna sourmash sketch protein sourmash sketch translate sourmash sketch fromfile
The sketch dna command reads in DNA sequences and outputs DNA sketches.
The sketch protein command reads in protein sequences and outputs protein sketches.
The sketch translate command reads in DNA sequences, translates them in all six frames, and outputs protein sketches.
The sketch fromfile command takes in a CSV file containing the locations of genomes and proteomes, and outputs all of the requested sketches. It is primarily intended for large-scale database construction. (fromfile is a new command as of sourmash v4.4.0.)
All of the sourmash sketch commands take FASTA or FASTQ sequences as input; input data can be uncompressed, compressed with gzip, or compressed with bzip2. The output will be one or more signature files that can be used by other sourmash commands.
Please see the sourmash sketch documentation page for details on sketch, and see Using sourmash: a practical guide for more information on creating signatures.
sourmash compute - make sourmash signatures from sequence data¶
Note: sourmash compute is deprecated in sourmash 4.0 and will be removed in sourmash 5.0; please switch to using sourmash sketch, above.
The compute subcommand computes and saves signatures for each sequence in one or more sequence files. It takes as input FASTA or FASTQ files, and these files can be uncompressed or compressed with gzip or bzip2. The output will be one or more JSON signature files that can be used with sourmash compare.
Please see Using sourmash: a practical guide for more information on computing signatures.
----
Usage:
sourmash compute <filename> [<filename2> ... ]
Optional arguments:
--ksizes K1[,K2,K3] -- one or more k-mer sizes to use; default is 31 --force -- recompute existing signatures; convert non-DNA characters to N --output -- save all the signatures to this file; can be '-' for stdout. --track-abundance -- compute and save k-mer abundances. --name-from-first -- name the signature based on the first sequence in the file --singleton -- instead of computing a single signature for each input file,
compute one for each sequence --merged <name> -- compute a single signature for all of the input files,
naming it <name>
sourmash compare - compare many signatures¶
The compare subcommand compares one or more signatures (created with sketch) using estimated Jaccard index or (if signatures are created with -p abund) the angular similarity.
The default output is a text display of a similarity matrix where each entry [i, j] contains the estimated Jaccard index between input signature i and input signature j. The output matrix can be saved to a numpy binary file with --output <outfile.mat> and used with the sourmash plot subcommand (or loaded with numpy.load(...). Using --csv <outfile.csv> will output a CSV file that can be loaded into other languages than Python, such as R.
As of sourmash 4.4.0, compare also supports Average Nucleotide Identity (ANI) estimates instead of Jaccard or containment index; use --ani to enable this.
Usage:
sourmash compare <sourmash signature file> [ <sourmash signature file> ... ]
Options:
- --output <filename> -- save the output matrix to this file, as a numpy binary matrix.
- --distance-matrix -- create and output a distance matrix, instead of a similarity matrix.
- --ksize <k> -- do the comparisons at this k-mer size.
- --containment -- calculate containment instead of similarity; C(i, j) = size(i intersection j) / size(i)
- --ani -- output estimates of Average Nucleotide Identity (ANI) instead of Jaccard similarity or containment.
- --from-file <filelist.txt> -- append the list of files in this text file to the input signatures.
- --ignore-abundance -- ignore abundances in signatures.
- --picklist <pickfile>:<colname>:<coltype> -- select a subset of signatures with a picklist
- --csv <outfile.csv> -- save the output matrix in CSV format.
- --labels-to <labels.csv> -- create a CSV file (spreadsheet) that can be passed in to sourmash plot with --labels-from in order to customize the labels.
Note: compare by default produces a symmetric similarity matrix that can be used for clustering in downstream tasks. With --containment, however, this matrix is no longer symmetric and cannot formally be used for clustering.
The containment matrix is organized such that the value in row A for column B is the containment of the B'th sketch in the A'th sketch, i.e.
C(A, B) = B.contained_by(A)
Note: The ANI estimate will be calculated based on Jaccard similarity by default; however, if --containment, --max-containment, or --avg-containment is specified, those values will be used instead. With --containment --ani, the ANI output matrix will be asymmetric as discussed above.
sourmash plot - cluster and visualize comparisons of many signatures¶
The plot subcommand produces two plots -- a dendrogram and a dendrogram+matrix -- from a matrix created by sourmash compare --output <matrix>. The default output is two PNG files.
Usage:
sourmash plot <matrix_file>
Options:
- --pdf -- output PDF files. (defaults to PNG)
- --labels -- display the signature names on the plot (default)
- --indices -- turn on index display on the plot.
- --vmax -- maximum value (default 1.0) for heatmap.
- --vmin -- minimum value (default 0.0) for heatmap.
- --subsample=<N> -- plot a maximum of samples, randomly chosen.
- --subsample-seed=<seed> -- seed for pseudorandom number generator.
Example command lines for label and index display -
- --indices will show only numbers;
- --no-labels --no-indices will remove all labels!
Example output:
[image: An E. coli comparison plot] [image]
sourmash search - search for signatures in collections or databases¶
The search subcommand searches a collection of signatures (in any of the formats supported by sourmash) for matches to the query signature. It can search for matches with either high Jaccard similarity or containment; the default is to use Jaccard similarity, unless --containment is specified. -o/--output will create a CSV file containing all of the matches with respective similarity or containment score.
search makes use of indexed databases to decrease search time and memory where possible.
Usage:
sourmash search query.sig <database1> [ <database2> ... ]
Example output:
% sourmash search tests/test-data/47.fa.sig gtdb-rs207.genomic-reps.dna.k31.zip ... -- loaded 65703 total signatures from 1 locations. after selecting signatures compatible with search, 65703 remain. 2 matches above threshold 0.080: similarity match ---------- -----
32.3% GCF_900456975.1 Shewanella baltica strain=NCTC10735, 5088...
14.0% GCF_002838165.1 Shewanella sp. Pdp11 strain=Pdp11, ASM283...
search takes a number of command line options -
- --containment - find matches using the containment index rather than Jaccard similarity;
- --max-containment - find matches using the max containment index rather than Jaccard similarity;
- -t/--threshold - lower threshold for matching; defaults to 0.08;
- --best-only - find and report only the best match;
- -n/--num-results - number of matches to report to stdout; defaults to 3; 0 to report all;
Match information can be saved to a CSV file with -o/--output; with -o, all matches above the threshold will be saved, not just those printed to stdout (which are limited to -n/--num-results).
The --containment flag calculates the containment of the query in database matches; this is an asymmetric order-dependent measure, unlike Jaccard. Here, search --containment Q A B C D will report the containment of Q in each of A, B, C, and D. This is opposite to the order used by prefetch, where the composite sketch (e.g. metagenomes) is the query, and the matches are contained items (e.g. genomes).
As of sourmash 4.2.0, search supports --picklist, to select a subset of signatures to search, based on a CSV file. This can be used to search only a small subset of a large collection, or to exclude a few signatures from a collection, without modifying the collection itself.
sourmash gather - find metagenome members¶
The gather subcommand selects the best reference genomes to use for a metagenome analysis, by finding the smallest set of non-overlapping matches to the query in a database. This is specifically meant for metagenome and genome bin analysis. (See Classifying Signatures for more information on the different approaches that can be used here.)
sourmash gather takes exactly one query and one or more collections of signatures. Please see sourmash multigather if you have multiple queries!
If the input signature was created with -p abund, output will be abundance weighted (unless --ignore-abundances is specified). -o/--output will create a CSV file containing the matches.
gather, like search, works with any of the signature collection formats supported by sourmash and will make use of indexed databases to decrease search time and memory where possible.
Usage:
sourmash gather query.sig <database1> [ <database2> ... ]
Example output:
overlap p_query p_match --------- ------- -------- 1.4 Mbp 11.0% 58.0% JANA01000001.1 Fusobacterium sp. OBRC... 1.0 Mbp 7.7% 25.9% CP001957.1 Haloferax volcanii DS2 pla... 0.9 Mbp 7.4% 11.8% BA000019.2 Nostoc sp. PCC 7120 DNA, c... 0.7 Mbp 5.9% 23.0% FOVK01000036.1 Proteiniclasticum rumi... 0.7 Mbp 5.3% 17.6% AE017285.1 Desulfovibrio vulgaris sub... ... found less than 50.0 kbp in common. => exiting found 64 matches total; the recovered matches hit 94.0% of the abundance-weighted query. the recovered matches hit 45.6% of the query k-mers (unweighted).
For each match,
- 'overlap', the first column, is the estimated number of base pairs shared between the match and the query, based on the number of shared hashes.
- 'p_query' is the percentage of the query that overlaps with the match; it is the amount of the metagenome "explained" by this match. It is typically a lower bound on the percent of metagenomes reads that will map to this genome.
- 'p_match' is the percentage of the match that overlaps with the query; it is the "detection" of the match in the metagenome. It is typically a lower bound on the number of base pairs that will be covered by read mapping.
Quite a bit more information per match row is available in the CSV output saved with -o; for details, see Classifying signatures: how sourmash gather works.
The "recovered matches" lines detail how much of the query is explained by the entire collection of matches. You will get two numbers if your metagenome sketch has been calculated with -p abund, and only one if it does not have abundances. The abundance-weighted number should approximate the fraction of metagenome reads that will map to at least one reference genome, while the unweighted number describes how much of the metagenome itself matches to genomes. Here's another way to put it: if the metagenome could be perfectly assembled into contigs, the unweighted number would approximate the number of bases from the contigs that would match perfectly to at least one genome in the reference database. More practically, the abundance-weighted number is less sensitive to sequencing errors. See classifying signatures or the FAQ for more information!
The command line option --threshold-bp sets the threshold below which matches are no longer reported; by default, this is set to 50kb. See the Appendix in Classifying Signatures for details.
As of sourmash 4.2.0, gather supports --picklist, to select a subset of signatures based on a CSV file. This can be used to search only a small subset of a large collection, or to exclude a few signatures from a collection, without modifying the collection itself.
Note:
Use sourmash gather to analyze a metagenome against a collection of genomes. Then use sourmash tax metagenome to integrate that collection of genomes with taxonomic information.
Alternative search mode for low-memory (but slow) search: --linear¶
By default, sourmash gather uses all information available for faster search. In particular, for SBTs, prefetch will prune the search tree. This can be slow and/or memory intensive for very large databases, and --linear asks sourmash prefetch to instead use a linear search across all leaf nodes in the tree.
The results are the same whether --no-linear or --linear is used.
Alternative search mode: --no-prefetch¶
By default, sourmash gather does a "prefetch" to find all candidate signatures across all databases, before removing overlaps between the candidates. In rare circumstances, depending on the databases and parameters used, this may be slower or more memory intensive than doing iterative overlap removal. Prefetch behavior can be turned off with --no-prefetch.
The results are the same whether --prefetch or --no-prefetch is used. This option can be used with or without --linear (although --no-prefetch --linear will generally be MUCH slower).
sourmash index - build an SBT index of signatures¶
The sourmash index command creates a Zipped SBT database (.sbt.zip) from a collection of signatures. This can be used to create databases from private collections of genomes, and can also be used to create databases for e.g. subsets of GenBank.
These databases support fast search and gather on large collections of signatures in low memory.
All signatures in an SBT must be of compatible types (i.e. the same k-mer size and molecule type). You can specify the usual command line selectors (-k, --scaled, --dna, --protein, etc.) to pick out the types of signatures to include when running index.
Usage:
sourmash index <database_name> <inputfile1> [ <inputfile2> ... ]
This will create a database.sbt.zip file containing the SBT of the input signatures. You can create an "unpacked" version by specifying database.sbt.json and it will create the JSON file as well as a subdirectory of files under .sbt.database.
Note that you can use --from-file to pass index a text file containing a list of file names to index; you can also provide individual signature files, directories full of signatures, or other sourmash databases.
As of sourmash 4.2.0, index supports --picklist, to select a subset of signatures based on a CSV file. This can be used to index a subset of a large collection, or to exclude a few signatures from an index being built from a large collection.
sourmash prefetch - select subsets of very large databases for more processing¶
The prefetch subcommand searches a collection of scaled signatures for matches in a large database, using containment. It is similar to search --containment, while taking a --threshold-bp argument like gather does for thresholding matches (instead of using Jaccard similarity or containment). Note that prefetch uses the composite sketch (e.g. a metagenome) as the query, and finds all matching subjects (e.g. genomes) from the database - the arguments are in the opposite order from search --containment.
sourmash prefetch is intended to select a subset of a large database for further processing. As such, it can search very large collections of signatures (potentially millions or more), operates in very low memory (see --linear option, below), and does no post-processing of signatures.
prefetch has four main output options, which can all be used individually or together:
- -o/--output produces a CSV summary file;
- --save-matches saves all matching signatures;
- -save-matching-hashes saves a single signature containing all of the hashes that matched any signature in the database at or above the specified threshold;
- --save-unmatched-hashes saves a single signature containing the complement of --save-matching-hashes.
Other options include:
- the usual -k/--ksize and --dna/--protein/--dayhoff/--hp signature selectors;
- --threshold-bp to require a minimum estimated bp overlap for output;
- --scaled for downsampling;
- --force to continue past survivable errors;
- --picklist will select a subset of signatures to search, using a picklist
Alternative search mode for low-memory (but slow) search: --linear¶
By default, sourmash prefetch uses all information available for faster search. In particular, for SBTs, prefetch will prune the search tree. This can be slow and/or memory intensive for very large databases, and --linear asks sourmash prefetch to instead use a linear search across all leaf nodes in the tree.
Caveats and comments¶
sourmash prefetch provides no guarantees on output order. It runs in "streaming mode" on its inputs, in that each input file is loaded, searched, and then unloaded. And sourmash prefetch can be run separately on multiple databases, after which the results can be searched in combination with search, gather, compare, etc.
A motivating use case for sourmash prefetch is to run it on multiple large databases with a metagenome query using --threshold-bp=0, --save-matching-hashes matching-hashes.sig, and --save-matches db-matches.sig, and then run sourmash gather matching-hashes.sig db-matches.sig.
This combination of commands ensures that the more time- and memory-intensive gather step is run only on a small set of relevant signatures, rather than all the signatures in the database.
sourmash multigather - do gather with many queries¶
The multigather subcommand runs sourmash gather on multiple queries. (See sourmash gather docs for specifics on what gather does, and how!)
Usage:
sourmash multigather --query <queries ...> --db <collections>
Note that multigather is single threaded, so it offers no substantial efficiency gains over just running gather multiple times! Nonetheless, it is useful for situations where you have many sketches organized in a combined file, e.g. sketches built with sourmash sketch ... --singleton).
multigather output files¶
multigather produces three output files for each query:
- <output_base>.csv - gather CSV output
- <output_base>.matches.sig - all matching outputs
- <output_base>.unassigned.sig - all remaining unassigned hashes
As of sourmash v4.8.7, <output_base> is set as follows:
- the filename attribute of the query sketch, if it is not empty or -;
- the query sketch md5sum, if the query filename is empty or -;
- the query filename + the query sketch md5sum (<query_file>.<md5sum>), if -U/--output-add-query-md5sum is specified;
By default, multigather will complain and exit with an error if the same <output_base> is used repeatedly and an output file is going to be overwritten. With -U/--output-add-query-md5sum this should only happen when identical sketches are present in a query database. Use --force-allow-overwrite-output to allow overwriting of output files without an error.
sourmash tax subcommands for integrating taxonomic information into gather results¶
The sourmash tax subcommands support taxonomic analysis of genomes and taxonomic profiling of metagenomes. See taxonomic profiling with sourmash for more information.
The sourmash tax or taxonomy commands integrate taxonomic information with the results of sourmash gather. All tax commands require one or more properly formatted taxonomy files where the identifiers correspond to those in the database(s) used for gather. Note that if using multiple databases, the gather needs to have been conducted against all desired databases within the same gather command (we cannot combine separate gather runs for the same query). For supported databases (e.g. GTDB, NCBI), we provide taxonomy csv files, but they can also be generated for user-generated databases. As of v4.8 and 4.8.6, respectively, some sourmash taxonomy commands can also use LIN or ICTV lineage information.
tax commands rely upon the fact that gather provides both the total fraction of the query matched to each database matched, as well as a non-overlapping f_unique_to_query, which is the fraction of the query uniquely matched to each reference genome. The f_unique_to_query for any reference match will always be between (0% of query matched) and 1 (100% of query matched), and for a query matched to multiple references, the f_unique_to_query will sum to at most 1 (100% of query matched). We use this property to aggregate gather matches at the desired taxonomic rank. For example, if the gather results for a metagenome include results for 30 different strains of a given species, we can sum the fraction uniquely matched to each strain to obtain the fraction uniquely matched to this species. Alternatively, taxonomic summarization can take into account abundance weighting; see classifying signatures for more information.
As with all reference-based analysis, results can be affected by the completeness of the reference database. However, summarizing taxonomic results from gather minimizes issues associated with increasing size and redundancy of reference databases.
For more details on how gather works and can be used to classify signatures, see Classifying signatures: search, gather, and lca methods.
sourmash tax metagenome - summarize metagenome content from gather results¶
sourmash tax metagenome summarizes gather results for each query metagenome by taxonomic lineage.
Here is an example command to summarize a single gather csv, where the query was gathered against gtdb-rs202 representative species database:
sourmash tax metagenome
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--taxonomy gtdb-rs202.taxonomy.v2.csv
The possible output formats are listed below, followed by the file extension used when writing to a file rather than stdout. When using more than one output format, you must provide an output basename (--output-base) that will be used to name the output files. If an --output-dir is provided, files will output to that directory.
- human: ".human.txt",
- csv_summary: ".summarized.csv",
- lineage_summary: ".lineage_summary.tsv",
- krona: ".krona.tsv",
- kreport: ".kreport.txt",
- lingroup: ".lingroup.tsv",
- bioboxes: ".bioboxes.profile",
csv_summary output format¶
csv_summary is the default output format. This outputs a csv with lineage summarization for each taxonomic rank. This output currently consists of six columns, query_name,rank,fraction,lineage,query_md5,query_filename, where fraction is the fraction of the query matched to the reported rank and lineage.
example csv_summary output from the command above:
query_name,rank,fraction,lineage HSMA33MX,superkingdom,0.131,d__Bacteria HSMA33MX,phylum,0.073,d__Bacteria;p__Bacteroidota HSMA33MX,phylum,0.058,d__Bacteria;p__Proteobacteria . . . HSMA33MX,species,0.058,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria; o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli HSMA33MX,species,0.057,d__Bacteria;p__Bacteroidota;c__Bacteroidia; o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri HSMA33MX,species,0.016,d__Bacteria;p__Bacteroidota;c__Bacteroidia; o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus
The query_md5 and query_filename columns are omitted here for brevity.
Note: When using --lins with a --lingroup file, the csv_summary file will report summarization for each specified lingroup, rather than all possible lin ranks (v4.8.12+).
krona output format¶
krona format is a tab-separated list of these results at a specific rank. The first column, fraction is the fraction of the query matched to the reported rank and lineage. The remaining columns are superkingdom, phylum, ... etc down to the rank used for summarization. This output can be used directly for summary visualization.
To generate krona, we add --output-format krona to the command above, and need to specify a rank to summarize. Here's the command for reporting krona summary at species level:
sourmash tax metagenome
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--taxonomy gtdb-rs202.taxonomy.v2.csv \
--output-format krona --rank species
example krona output from this command:
fraction superkingdom phylum class order family genus species 0.05815279361459521 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli 0.05701254275940707 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae Prevotella Prevotella copri 0.015637726014008795 Bacteria Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides Bacteroides vulgatus
lineage_summary output format¶
The lineage summary format is most useful when comparing across metagenome queries. Each row is a lineage at the desired reporting rank. The columns are each query used for gather, with the fraction match reported for each lineage. This format is commonly used as input for many external multi-sample visualization tools.
To generate lineage_summary, we add --output-format lineage_summary to the summarize command, and need to specify a rank to summarize. Here's the command for reporting lineage_summary for two queries (HSMA33MX, PSM6XBW3) summary at species level.
sourmash tax metagenome
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--gather-csv PSM6XBW3_gather_x_gtdbrs202_k31.csv \
--taxonomy gtdb-rs202.taxonomy.v2.csv \
--output-format lineage_summary --rank species
example lineage_summary:
lineage HSMA33MX PSM6XBW3 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus 0.015637726014008795 0.015642822225843248 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella copri 0.05701254275940707 0.05703112269838684 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli 0.05815279361459521 0.05817174515235457
To produce multiple output types from the same command, add the types into the --output-format argument, e.g. --output-format summary krona lineage_summary
kreport output format¶
The kreport output reports kraken-style kreport output, which may be useful for comparison with other taxonomic profiling methods. While this format typically records the percent of number of reads assigned to taxa, we create ~comparable output by reporting the percent of k-mers matched to each taxon and the estimated number of base pairs that these k-mers represent. To best represent the percent of all reads, we use k-mer abundance information in this output. To generate this properly, query FracMinHash sketches should be generated with abundance information (-p abund) to allow abundance-weighted gather results.
Note: sourmash gather makes all assignments to genomes, and then sourmash tax integrates taxonomy information and uses LCA-style summarization to build assignments. For species-level specificity, our current recommendation is to use use our default k-mer size of 31.
standard kreport columns (read-based tools):
- Percent Reads Contained in Taxon: The cumulative percentage of reads for this taxon and all descendants.
- Number of Reads Contained in Taxon: The cumulative number of reads for this taxon and all descendants.
- Number of Reads Assigned to Taxon: The number of reads assigned directly to this taxon (not a cumulative count of all descendants).
- Rank Code: (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies.
- NCBI Taxon ID: Numerical ID from the NCBI taxonomy database.
- Scientific Name: The scientific name of the taxon.
Example reads-based kreport with all columns:
88.41 2138742 193618 K 2 Bacteria
0.16 3852 818 P 201174 Actinobacteria
0.13 3034 0 C 1760 Actinomycetia
0.13 3034 45 O 85009 Propionibacteriales
0.12 2989 1847 F 31957 Propionibacteriaceae
0.05 1142 352 G 1912216 Cutibacterium
0.03 790 790 S 1747 Cutibacterium acnes
sourmash kreport columns:
- Percent [k-mers] contained in taxon: The cumulative percentage of k-mers for this taxon and all descendants.
- Estimated base pairs contained in taxon: The cumulative estimated base pairs for this taxon and all descendants.
- Estimated base pairs "assigned" (species-level): The estimated base pairs assigned at species-level (cumulative count of base pairs assigned to individual genomes in this species).
- Rank Code: (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies.
- NCBI Taxon ID: Reported (v4.7+) if using NCBI taxonomy. Otherwise blank.
- Scientific Name: The scientific name of the taxon.
notes:
- gather assigns k-mers to specific genomes. To mimic the output of other tools, we report all results as "assigned" to species-level, which summarizes the k-mers matched to each genome within a given species. Hence, column 3 will show all estimated base pairs at this level, and 0 for all other ranks. Column 2 contains the summarized info at the higher ranks.
- Since gather results are non-overlapping and all assignments are done at the genome level, the percent match (first column) will sum to 100% at each rank (aside from rounding issues) when including the unclassified (U) percentage. Higher-rank assignments are generated using LCA-style summarization of genome matches.
- Rows are ordered by rank and then ~percent containment.
example sourmash {output-name}.kreport.txt:
92.73 64060000 D Bacteria 0.44 11299000 D Eukaryota 6.82 284315000 U unclassified 60.23 30398000 P Proteobacteria 21.86 22526000 P Firmicutes 10.41 5250000 P Bacteroidetes . . . 3.94 6710000 S Escherichia coli 4.56 6150000 S Pseudomonas aeruginosa 0.71 5801000 S Clostridium beijerinckii 2.55 5474000 S Bacillus cereus 21.95 4987000 S Escherichia sp. XD7 28.57 4124000 S Cereibacter sphaeroides 0.25 4014000 S Acinetobacter baumannii 7.23 3934000 S Staphylococcus haemolyticus 0.09 3187000 S Phocaeicola vulgatus 0.61 2820000 S Streptococcus agalactiae 0.20 2499000 S Cutibacterium acnes 0.03 2339000 S Deinococcus radiodurans 10.31 2063000 S Porphyromonas gingivalis 9.24 2011000 S Streptococcus mutans
lingroup output format¶
When using LIN taxonomic information, you can optionally also provide a lingroup file with two required columns: name and lin. If provided, we will produce a file, {base}.lingroups.tsv, where {base} is the name provided via the -o, --output-base option. This output will select information from the full summary that match the LIN prefixes provided as groups.
This output format consists of four columns:
- name, lin columns are taken directly from the --lingroup file
- percent_containment, the total percent of the dataset contained in this lingroup and all descendants
- num_bp_contained, the estimated number of base pairs contained in this lingroup and all descendants.
Similar to kreport above, we use the wording "contained" rather than "assigned," because sourmash assigns matches at the genome level, and the tax functions summarize this information.
example output:
name lin percent_containment num_bp_contained lg1 0;0;0 5.82 714000 lg2 1;0;0 5.05 620000 lg3 2;0;0 1.56 192000 lg3 1;0;1 0.65 80000 lg4 1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 0.65 80000
Related lingroup subpaths will be grouped in output, but exact ordering may change between runs.
Note: this output format requires a single sample only. For a similar output with multiple query samples, provide the lingroup file and use the 'csv_summary' output format.
bioboxes output format¶
When using standard taxonomic ranks (not lins), you can choose to output a 'bioboxes' profile, {base}.bioboxes.profile, where {base} is the name provided via the -o/--output-base option. This output is organized according to the bioboxes profile specifications so that this file can be used for CAMI challenges.
This output format starts with some header information:
#CAMI Submission for Taxonomic Profiling @Version:0.9.3 @SampleID:SAMPLEID @Ranks:superkingdom|phylum|class|order|family|genus|species|strain @__program__:sourmash @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE
and then provides taxonomic profiling information in the tab-separated columns described by the last header line:
- TAXID - specifies a unique alphanumeric ID for a node in a reference tree such as the NCBI taxonomy
- RANK - superkingdom --> strain
- TAXPATH - the path from the root of the reference taxonomy to the respective taxon
- TAXPATHSN - scientific names of taxpath
- PERCENTAGE (0-100) - field specifies what percentage of the sample was assigned to the respective TAXID
example output (using small test data):
# Taxonomic Profiling Output @SampleID:test1 @Version:0.10.0 @Ranks:superkingdom|phylum|class|order|family|genus|species @__program__:sourmash @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 13.08 976 phylum 2|976 Bacteria|Bacteroidota 7.27 1224 phylum 2|1224 Bacteria|Pseudomonadota 5.82 200643 class 2|976|200643 Bacteria|Bacteroidota|Bacteroidia 7.27 1236 class 2|1224|1236 Bacteria|Pseudomonadota|Gammaproteobacteria 5.82 171549 order 2|976|200643|171549 Bacteria|Bacteroidota|Bacteroidia|Bacteroidales 7.27 91347 order 2|1224|1236|91347 Bacteria|Pseudomonadota|Gammaproteobacteria|Enterobacterales 5.82 171552 family 2|976|200643|171549|171552 Bacteria|Bacteroidota|Bacteroidia|Bacteroidales|Prevotellaceae 5.70 543 family 2|1224|1236|91347|543 Bacteria|Pseudomonadota|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae 5.82 815 family 2|976|200643|171549|815 Bacteria|Bacteroidota|Bacteroidia|Bacteroidales|Bacteroidaceae 1.56 838 genus 2|976|200643|171549|171552|838 Bacteria|Bacteroidota|Bacteroidia|Bacteroidales|Prevotellaceae|Prevotella 5.70 561 genus 2|1224|1236|91347|543|561 Bacteria|Pseudomonadota|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Escherichia 5.82 909656 genus 2|976|200643|171549|815|909656 Bacteria|Bacteroidota|Bacteroidia|Bacteroidales|Bacteroidaceae|Phocaeicola 1.56 165179 species 2|976|200643|171549|171552|838|165179 Bacteria|Bacteroidota|Bacteroidia|Bacteroidales|Prevotellaceae|Prevotella|Prevotella copri 5.70 562 species 2|1224|1236|91347|543|561|562 Bacteria|Pseudomonadota|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Escherichia|Escherichia coli 5.82 821 species 2|976|200643|171549|815|909656|821 Bacteria|Bacteroidota|Bacteroidia|Bacteroidales|Bacteroidaceae|Phocaeicola|Phocaeicola vulgatus 1.56
lingroup output format¶
When using LIN taxonomic information, you can optionally also provide a lingroup file with two required columns: name and lin. If provided, we will produce a file, {base}.lingroups.tsv, where {base} is the name provided via the -o, --output-base option. This output will select information from the full summary that match the LIN prefixes provided as groups.
This output format consists of four columns:
- name, lin columns are taken directly from the --lingroup file
- percent_containment, the total percent of the dataset contained in this lingroup and all descendants
- num_bp_contained, the estimated number of base pairs contained in this lingroup and all descendants.
Similar to kreport above, we use the wording "contained" rather than "assigned," because sourmash assigns matches at the genome level, and the tax functions summarize this information.
example output:
name lin percent_containment num_bp_contained lg1 0;0;0 5.82 714000 lg2 1;0;0 5.05 620000 lg3 2;0;0 1.56 192000 lg3 1;0;1 0.65 80000 lg4 1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 0.65 80000
Related lingroup subpaths will be grouped in output, but exact ordering may change between runs.
sourmash tax genome - classify a genome using gather results¶
sourmash tax genome reports likely classification for each query, based on gather matches. By default, classification requires at least 10% of the query to be matched. Thus, if 10% of the query was matched to a species, the species-level classification can be reported. However, if 7% of the query was matched to one species, and an additional 5% matched to a different species in the same genus, the genus-level classification will be reported.
sourmash tax genome can use an ANI threshold (--ani-threshold) instead of a containment threshold. This works the same way as the containment threshold (and indeed, is using the same underlying information). Note that for DNA k-mers, k=21 ANI is most similar to alignment-based ANI values, and ANI values should only be compared if they were generated using the same ksize.
Optionally, genome can instead report classifications at a desired rank, regardless of match threshold (--rank argument, e.g. --rank species).
If using --lins taxonomy, you can also provide a --lingroup file containing two columns, name, and lin, which provide a series of lin prefixes of interest. If provided, genome classification will be restricted to provided lingroups only. All other options (--rank, --ani-threshold, etc) should continue to function. If you specify a --rank that does not have an associated lingroup, sourmash will notify you that you eliminated all classification options.
Note that these thresholds and strategies are under active testing.
To illustrate the utility of genome, let's consider a signature consisting of two different Shewanella strains, Shewanella baltica OS185 strain=OS185 and Shewanella baltica OS223 strain=OS223. For simplicity, we gave this query the name "Sb47+63".
When we gather this signature against the gtdb-rs202 representatives database, we see 66% matches to one strain, and 33% to the other:
abbreviated gather_csv:
f_match,f_unique_to_query,name,query_name 0.664,0.664,"GCF_000021665.1 Shewanella baltica OS223 strain=OS223, ASM2166v1",Sb47+63 0.656,0.335,"GCF_000017325.1 Shewanella baltica OS185 strain=OS185, ASM1732v1",Sb47+63
We can use tax genome on this gather csv to classify our "Sb47+63" mixed-strain query:
sourmash tax genome
--gather-csv 47+63_x_gtdb-rs202.gather.csv \
--taxonomy gtdb-rs202.taxonomy.v2.csv
sourmash tax genome can produce the following output formats:
- human: ".human.txt",
- csv_summary: ".classifications.csv",
- krona: ".krona.tsv",
- lineage_summary: ".lineage_summary.tsv",
csv_summary output format¶
csv_summary is the default output format. This outputs a csv with taxonomic classification for each query genome. This output currently consists of six columns, query_name,rank,fraction,lineage,query_md5,query_filename, where fraction is the fraction of the query matched to the reported rank and lineage. The status column provides additional information on the classification:
- match - this query was classified
- nomatch- this query could not be classified
- below_threshold - this query was classified at the specified rank, but the query fraction matched was below the containment threshold
Here is the csv_summary output from classifying this mixed-strain Shewanella query to species level:
query_name,status,rank,fraction,lineage "Sb47+63",match,species,1.000,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Shewanellaceae;g__Shewanella;s__Shewanella baltica
krona output format¶
krona format is a tab-separated list of these results at a specific rank. The first column, fraction is the fraction of the query matched to the reported rank and lineage. The remaining columns are superkingdom, phylum, ... etc down to the rank used for summarization. This output can be used directly for krona visualization.
To generate krona, we must classify by --rank instead of using the classification threshold. For the command, we add --output-format krona and --rank <RANK> to the command above. Here's the command for producing krona output for species-level classifications:
sourmash tax genome
--gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
--taxonomy gtdb-rs202.taxonomy.v2.csv \
--output-format krona --rank species
Here is the krona-formatted output for this command:
fraction superkingdom phylum class order family genus species 1.0 d__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Shewanellaceae g__Shewanella s__Shewanella baltica
To produce multiple output types from the same command, add the types into the --output-format argument, e.g. --output-format csv_summary krona. Note that specifying the classification rank with --rank, (e.g. --rank species), as needed for krona output, forces classification by rank rather than by containment threshold. If the query classification at this rank does not meet the containment threshold (default=0.1), the status column will contain below_threshold.
sourmash tax annotate - annotates gather output with taxonomy¶
sourmash tax annotate adds a column with taxonomic lineage information for each genome match in the gather output, without LCA summarization or classification. This format is not required for either metagenome or genome, but may be helpful for other downstream analyses.
By default, annotate uses the name of each input gather csv to write an updated version with lineages information. For example, annotating sample1.gather.csv would produce sample1.gather.with-lineages.csv.
This will produce an annotated gather CSV, Sb47+63_gather_x_gtdbrs202_k31.with-lineages.csv:
sourmash tax annotate
--gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
--taxonomy gtdb-rs202.taxonomy.v2.csv
The with-lineages output file format can be summarized with sourmash tax summarize and can also be used as an input taxonomy spreadsheet for any of the tax subcommands (new as of v4.6.0).
sourmash tax prepare - prepare and/or combine taxonomy files¶
sourmash tax prepare prepares taxonomy files for other sourmash tax commands.
All sourmash tax commands must be given one or more taxonomy files as parameters to the --taxonomy argument. These files can be either CSV files or (as of sourmash 4.2.1) SQLite databases. SQLite databases are much faster for large taxonomies, while CSV files are easier to view and modify using spreadsheet software.
sourmash tax prepare is a utility function that can ingest and validate multiple CSV files or SQLite databases, and output a CSV file or a SQLite database. It can be used to combine multiple taxonomies into a single file, as well as change formats between CSV and SQLite.
The following command will take in two taxonomy files and combine them into a single taxonomy SQLite database.
sourmash tax prepare --taxonomy file1.csv file2.csv -o tax.db
Input databases formats can be mixed and matched, and the output format can be set to CSV like so:
sourmash tax prepare --taxonomy file1.csv file2.db -o tax.csv -F csv
Note: As of sourmash v4.6.0, the output of sourmash tax annotate can be used as a taxonomy input spreadsheet as well.
sourmash tax grep - subset taxonomies and create picklists based on taxonomy string matches¶
(sourmash tax grep is a new command as of sourmash v4.5.0.)
sourmash tax grep searches taxonomies for matching strings, optionally restricting the string search to a specific taxonomic rank. It creates new files containing matching taxonomic entries; these new files can serve as taxonomies and can also be used as picklists to restrict database matches.
Usage:
sourmash tax grep <pattern> -t <taxonomy-db> [<taxonomy-db> ...]
where pattern is a regular expression; see Python's Regular Expression HOWTO for details on supported regexp features.
For example,
sourmash tax grep Shew -t gtdb-rs207.taxonomy.sqldb -o shew-picklist.csv
will search for a string match to Shew within the entire GTDB RS207 taxonomy, and will output a subset taxonomy in shew-picklist.csv. This picklist can be used with the GTDB RS207 databases like so:
sourmash search query.sig gtdb-rs207.genomic.k31.zip \
--picklist shew-picklist.csv:ident:ident
tax grep can also restrict string matching to a specific taxonomic rank with -r/--rank; for example,
sourmash tax grep Shew -t gtdb-rs207.taxonomy.sqldb \
-o shew-picklist.csv -r genus
will restrict matches to the rank of genus. Available ranks are superkingdom, phylum, class, order, family, genus, and species.
tax grep also takes several standard grep arguments, including -i to ignore case and -v to output only taxonomic lineages that do not match the pattern.
Note: tax grep only searches taxonomic ranks, not identifier strings. Use sig grep to search for identifiers in sketch collections.
Currently only CSV output (optionally gzipped) is supported; use sourmash tax prepare to convert CSV output from tax grep into a SQLite taxonomy database.
sourmash tax summarize - print summary information for lineage spreadsheets or taxonomy databases¶
(sourmash tax summarize is a new command as of sourmash v4.6.0.)
sourmash tax summarize loads in one or more lineage spreadsheets, counts the distinct taxonomic lineages, and outputs a summary. It optionally will output a CSV file with a detailed count of how many identifiers belong to each taxonomic lineage.
For example,
sourmash tax summarize gtdb-rs202.taxonomy.v2.db -o ranks.csv
outputs
number of distinct taxonomic lineages: 258406 rank superkingdom: 2 distinct taxonomic lineages rank phylum: 169 distinct taxonomic lineages rank class: 419 distinct taxonomic lineages rank order: 1312 distinct taxonomic lineages rank family: 3264 distinct taxonomic lineages rank genus: 12888 distinct taxonomic lineages rank species: 47894 distinct taxonomic lineages
and creates a file ranks.csv with the number of distinct identifier counts for each lineage at each rank:
rank,lineage_count,lineage superkingdom,254090,d__Bacteria phylum,120757,d__Bacteria;p__Proteobacteria class,104665,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria order,64157,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales family,55347,d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae ...
That is, there are 254,090 identifiers in GTDB rs202 under d__Bacteria, and 120,757 within the p__Proteobacteria.
tax summarize can also be used to summarize the output of tax annotate.
sourmash lca subcommands for in-memory taxonomy integration¶
These commands use LCA databases (created with lca index, below, or prepared databases such as genbank-k31.lca.json.gz).
sourmash lca classify - classify a genome using an LCA database¶
sourmash lca classify classifies one or more signatures using the given list of LCA DBs. It is meant for classifying metagenome-assembled genome bins (MAGs) and single-cell genomes (SAGs).
ATTENTION:
Usage:
sourmash lca classify --query query.sig [query2.sig ...] --db <lca db> [<lca db2> ...]
For example, the command
sourmash lca classify --query tests/test-data/63.fa.sig \
--db podar-ref.lca.json
will produce the following logging to stderr:
loaded 1 LCA databases. ksize=31, scaled=10000 finding query signatures... outputting classifications to stdout ... classifying NC_011663.1 Shewanella baltica OS223, complete genome classified 1 signatures total
and the example classification output is a CSV file with headers:
ID,status,superkingdom,phylum,class,order,family,genus,species "NC_009665.1 Shewanella baltica OS185, complete genome",found,Bacteria,Proteobacteria,Gammaproteobacteria,Alteromonadales,Shewanellaceae,Shewanella,Shewanella baltica
The status column in the classification output can take three possible values: nomatch, found, and disagree. nomatch means that no match was found for this query, and found means that an unambiguous assignment was found - all k-mers were classified within the same taxonomic hierarchy, and the most detailed lineage available was reported. disagree means that there was a taxonomic disagreement, and the lowest compatible taxonomic node was reported.
To elaborate on this a bit, suppose that all of the k-mers within a signature were classified as family Shewanellaceae, genus Shewanella, or species Shewanella baltica. Then the lowest compatible node (here species Shewanella baltica) would be reported, and the status of the classification would be found. However, if a number of additional k-mers in the input signature were classified as Shewanella oneidensis, sourmash would be unable to resolve the taxonomic assignment below genus Shewanella and it would report a status of disagree with the genus-level assignment of Shewanella; species level assignments would not be reported. Here, the assigned rank is the rank immediately above where there is a taxonomic disagreement, and the taxid & lineage refer to the name at that rank (the lowest common ancestor at which an assignment can be made).
For another example, if you saw this line in the CSV file:
TARA_ASW_MAG_00029,1224,disagree,phylum,Bacteria;Proteobacteria
you would know that TARA_ASW_MAG_00029 has k-mers that are shared between different orders: 'Pseudomonadales' and 'Rhodobacterales'. Therefore, the classifier status is disagree, and the classified taxid is at rank phylum - just above order.
(This is the approach that Kraken and other lowest common ancestor implementations use, we believe.)
Note: you can specify a list of file names to load signatures from in a text file passed to sourmash lca classify with the --query-from-file flag; these files will be appended to the --query input.
sourmash lca summarize - summarize a metagenome's contents using an LCA database¶
sourmash lca summarize produces a Kraken-style summary of the combined contents of the given query signatures. It is meant for exploring metagenomes and metagenome-assembled genome bins.
sourmash lca summarize also weights output with hash abundances, so that output percentages are weighted by the number of times a k-mer is seen; this can be turned off with --ignore-abundance.
ATTENTION:
Usage:
sourmash lca summarize --query query.sig [query2.sig ...]
--db <lca db> [<lca db2> ...]
For example, with the data in tests/test-data/fake-abund, the command line:
sourmash lca summarize --query query.sig.gz --db matches.lca.json.gz
will produce the following log output to stderr:
loaded 1 LCA databases. ksize=31, scaled=10000 finding query signatures... loaded 1 signatures from 1 files total.
and the following example summarize output to stdout:
79.6% 550 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS223 79.6% 550 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica 79.6% 550 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella 79.6% 550 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae 79.6% 550 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales 79.6% 550 Bacteria;Proteobacteria;Gammaproteobacteria 79.6% 550 Bacteria;Proteobacteria 79.6% 550 Bacteria 20.4% 141 Archaea;Euryarchaeota;unassigned;unassigned;unassigned;Aciduliprofundum;Aciduliprofundum boonei;Aciduliprofundum boonei T469 20.4% 141 Archaea;Euryarchaeota;unassigned;unassigned;unassigned;Aciduliprofundum;Aciduliprofundum boonei 20.4% 141 Archaea;Euryarchaeota;unassigned;unassigned;unassigned;Aciduliprofundum 20.4% 141 Archaea;Euryarchaeota;unassigned;unassigned;unassigned 20.4% 141 Archaea;Euryarchaeota;unassigned;unassigned 20.4% 141 Archaea;Euryarchaeota;unassigned 20.4% 141 Archaea;Euryarchaeota 20.4% 141 Archaea
The output is space-separated and consists of three columns: the percentage of total k-mers that have this classification; the number of k-mers that have this classification; and the lineage classification. K-mer classifications are reported hierarchically, so the percentages and totals contain all assignments that are at a lower taxonomic level - e.g. Bacteria, above, contains all the k-mers in Bacteria;Proteobacteria.
The same information is reported in a CSV file if -o/--output is used.
The proportions reflect the query signature construction, where the metagenome contains a 1.5 Mbp Archaeal genome and a 5.4 Mbp Bacterial genome. The Archaeal genome is therefore only ~20% of the distinct k-mers in the metagenome (1.5 Mbp divided by 6.9 Mbp).
If --with-abundance is given, the output changes to reflect the proportions of the query metagenome based on k-mer/read abundances:
56.8% 740 Archaea;Euryarchaeota;unassigned;unassigned;unassigned;Aciduliprofundum;Aciduliprofundum boonei;Aciduliprofundum boonei T469 56.8% 740 Archaea;Euryarchaeota;unassigned;unassigned;unassigned;Aciduliprofundum;Aciduliprofundum boonei 56.8% 740 Archaea;Euryarchaeota;unassigned;unassigned;unassigned;Aciduliprofundum 56.8% 740 Archaea;Euryarchaeota;unassigned;unassigned;unassigned 56.8% 740 Archaea;Euryarchaeota;unassigned;unassigned 56.8% 740 Archaea;Euryarchaeota;unassigned 56.8% 740 Archaea;Euryarchaeota 56.8% 740 Archaea 43.2% 563 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS223 43.2% 563 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica 43.2% 563 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella 43.2% 563 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae 43.2% 563 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales 43.2% 563 Bacteria;Proteobacteria;Gammaproteobacteria 43.2% 563 Bacteria;Proteobacteria 43.2% 563 Bacteria
Here, the changed proportions reflect the query signature abundances, where the 1.5 Mbp Archaeal genome is present 5 times, while the 5.4 Mbp Bacterial genome is present only once; when weighted by abundance, the Bacterial genome is only 41.8% of the metagenome content, while the Archaeal genome is 58.1% of the metagenome content.
Note: you can specify a list of file names to load signatures from in a text file passed to sourmash lca summarize with the --query-from-file flag; these files will be appended to the --query input.
sourmash lca index - build an LCA database¶
The sourmash lca index command creates an LCA database from a lineage spreadsheet and a collection of signatures. This can be used to create LCA databases from private collections of genomes, and can also be used to create databases for e.g. subsets of GenBank.
See the sourmash lca tutorial and the blog post Why are taxonomic assignments so different for Tara bins? for some use cases.
If you are interested in preparing lineage spreadsheets from GenBank genomes (or building off of NCBI taxonomies more generally), please see the NCBI lineage repository.
You can use --from-file to pass lca index a text file containing a list of file names to index.
As of sourmash 4.2.0, lca index supports --picklist, to select a subset of signatures based on a CSV file. This can be used to index a subset of a large collection, or to exclude a few signatures from an index being built from a large collection.
As of sourmash 4.4.0, lca index can produce an on disk LCA database using SQLite. To prepare such a database, use sourmash lca index ... -F sql.
All sourmash commands work with either type of LCA database (the default JSON database, and the SQLite version). SQLite databases are larger than JSON databases on disk but are typically much faster to load and search, and use much less memory.
sourmash lca rankinfo - examine an LCA database¶
The sourmash lca rankinfo command displays k-mer specificity information for one or more LCA databases. See the blog post How specific are k-mers for taxonomic assignment of microbes, anyway? for example output.
sourmash lca compare_csv - compare taxonomic spreadsheets¶
The sourmash lca compare_csv command compares two lineage spreadsheets (such as those output by sourmash lca classify or taken as input by sourmash lca index) and summarizes their agreement/disagreement. Please see the blog post Why are taxonomic assignments so different for Tara bins? for an example use case.
sourmash signature subcommands for signature manipulation¶
These commands manipulate signatures from the command line.
The signature commands that combine or otherwise have multiple signatures interacting (merge, intersect, subtract) work only on compatible signatures, where the k-mer size and nucleotide/protein sequences match each other. If working directly with the hash values (e.g. merge, intersect, subtract) then the scaled values must also match; you can use downsample to convert a bunch of samples to the same scaled value.
If there are multiple signatures in a file with different ksizes and/or from nucleotide and protein sequences, you can choose amongst them with -k/--ksize and --dna or --protein, as with other sourmash commands such as search, gather, and compare.
Note, you can use sourmash sig as shorthand for all of these commands.
Most commands will load signatures automatically from indexed databases (SBT and LCA formats) as well as from signature files, and you can load signatures from stdin using - on the command line.
sourmash signature cat - combine signatures into one file¶
Concatenate signature files.
For example,
sourmash signature cat file1.sig file2.sig -o all.zip
will combine all signatures in file1.sig and file2.sig and put them in the file all.zip.
Using picklists with sourmash sig cat¶
As of sourmash 4.2.0, cat also supports picklists, a feature by which you can select signatures based on values in a CSV file. See Using picklists to subset large collections of signatures, below.
sourmash signature describe - display detailed information about signatures¶
Display signature details.
For example,
sourmash sig describe tests/test-data/track_abund/47.fa.sig
will display:
signature filename: tests/test-data/track_abund/47.fa.sig signature: NC_009665.1 Shewanella baltica OS185, complete genome source file: podar-ref/47.fa md5: 09a08691ce52952152f0e866a59f6261 k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1 size: 5177 sum hashes: 5292 signature license: CC0
Here, the size is the number of distinct hashes in the sketch, and sum_hashes is the total number of hashes in the sketch, with abundances. When track_abundance is 0, size is always the same as sum_hashes.
sourmash signature fileinfo - display a summary of the contents of a sourmash collection¶
Display signature file, database, or collection.
For example,
sourmash sig fileinfo tests/test-data/prot/all.zip
will display:
path filetype: ZipFileLinearIndex location: /Users/t/dev/sourmash/tests/test-data/prot/all.zip is database? yes has manifest? yes is nonempty? yes num signatures: 8 ** examining manifest... 31758 total hashes summary of sketches:
2 sketches with dayhoff, k=19, scaled=100 7945 total hashes
2 sketches with hp, k=19, scaled=100 5184 total hashes
2 sketches with protein, k=19, scaled=100 8214 total hashes
2 sketches with DNA, k=31, scaled=1000 10415 total hashes
sig fileinfo will recognize all accepted sourmash input files, including individual .sig and .sig.gz files, Zip file collections, SBT databases, LCA databases, and directory hierarchies.
sourmash sig fileinfo provides optional JSON and YAML output, and those formats are under semantic versioning.
Note: sourmash signature summarize is an alias for fileinfo; they are the same command.
sourmash signature grep - extract matching signatures using pattern matching¶
Extract matching signatures with substring and regular expression matching on the name, filename, and md5 fields.
For example,
sourmash signature grep -i shewanella tests/test-data/prot/all.zip -o shew.zip
will extract the two signatures in all.zip with 'Shewanella baltica' in their name and save them to shew.zip.
grep will search for substring matches or regular expressions; e.g. sourmash sig grep 'os185|os223' ... will find matches to either of those expressions.
Command line options include -i for case-insensitive matching, and -v for exclusion rather than inclusion.
A CSV file of the matching sketch information can be saved using --csv <outfile>; this file is in the sourmash manifest format and can be used as a picklist with --pickfile <outfile>::manifest.
If --silent is specified, sourmash sig grep will not output matching signatures.
sourmash sig grep also supports a counting mode, -c/--count, in which only the number of matching sketches in files will be displayed; for example,
% sourmash signature grep -ci 'os185|os223' tests/test-data/prot/*.zip
will produce the following output:
2 matches: tests/test-data/prot/all.zip 0 matches: tests/test-data/prot/dayhoff.sbt.zip 0 matches: tests/test-data/prot/dayhoff.zip 0 matches: tests/test-data/prot/hp.sbt.zip 0 matches: tests/test-data/prot/hp.zip 0 matches: tests/test-data/prot/protein.sbt.zip 0 matches: tests/test-data/prot/protein.zip
sourmash signature split - split signatures into individual files¶
Split each signature in the input file(s) into individual files, with standardized names.
For example,
sourmash signature split tests/test-data/2.fa.sig
will create 3 files,
f372e478.k=21.scaled=1000.DNA.dup=0.2.fa.sig, f3a90d4e.k=31.scaled=1000.DNA.dup=0.2.fa.sig, and 43f3b48e.k=51.scaled=1000.DNA.dup=0.2.fa.sig, representing the three different DNA signatures at different ksizes created from the input file 2.fa.
The format of the names of the output files is standardized and stable for major versions of sourmash: currently, they are period-separated with fields:
- md5sum - a unique hash value based on the contents of the signature.
- k=<ksize> - k-mer size.
- scaled=<scaled> or num=<num> - scaled or num value for MinHash.
- <moltype> - the molecule type (DNA, protein, dayhoff, or hp)
- dup=<n> - a non-negative integer that prevents duplicate signatures from colliding.
- basename - basename of first input file used to create signature; if none provided, or stdin, this is none.
If --outdir is specified, all of the signatures are placed in outdir.
Note: split only saves files in the JSON .sig format.
sourmash signature merge - merge two or more signatures into one¶
Merge two (or more) signatures.
For example,
sourmash signature merge file1.sig file2.sig -o merged.sig
will output the union of all the hashes in file1.sig and file2.sig to merged.sig.
All of the signatures passed to merge must either have been created with -p abund, or not. If they have track_abundance on, then the merged signature will have the sum of all abundances across the individual signatures. The --flatten flag will override this behavior and allow merging of mixtures by removing all abundances.
sig merge can only merge compatible sketches - if there are multiple k-mer sizes or molecule types present in any of the signature files, you will need to choose one k-mer size with -k/--ksize, and/or one moltype with --dna/--protein/--hp/--dayhoff.
Use --set-name <name> to set the name of the output sketch.
Note: merge only creates one output file, with one signature in it.
sourmash signature rename - rename a signature¶
Rename the display name for one or more signatures - this is the name output for matches in compare, search, gather, etc.
For example,
sourmash signature rename file1.sig "new name" -o renamed.sig
will place a renamed copy of the hashes in file1.sig in the file renamed.sig. If you provide multiple signatures, all will be renamed to the same name.
sourmash signature subtract - subtract other signatures from a signature¶
Subtract all of the hash values from one signature that are in one or more of the others.
For example,
sourmash signature subtract file1.sig file2.sig file3.sig -o subtracted.sig
will subtract all of the hashes in file2.sig and file3.sig from file1.sig, and save the new signature to subtracted.sig.
To use subtract on signatures calculated with -p abund, you must specify --flatten.
sig subtract can only work with compatible sketches - if there are multiple k-mer sizes or molecule types present in any of the signature files, you will need to choose one k-mer size with -k/--ksize, and/or one moltype with --dna/--protein/--hp/--dayhoff.
Use --set-name <name> to set the name of the output sketch.
Note: subtract only creates one output file, with one signature in it.
sourmash signature intersect - intersect two (or more) signatures¶
Output the intersection of the hash values in multiple signature files.
For example,
sourmash signature intersect file1.sig file2.sig file3.sig -o intersect.sig
will output the intersection of all the hashes in those three files to intersect.sig.
The intersect command flattens all signatures, i.e. the abundances in any signatures will be ignored and the output signature will have track_abundance turned off. The -A/--abundance-from argument will borrow abundances from the specified signature (which will also be added to the intersection).
sig intersect can only work with compatible sketches - if there are multiple k-mer sizes or molecule types present in any of the signature files, you will need to choose one k-mer size with -k/--ksize, and/or one moltype with --dna/--protein/--hp/--dayhoff.
Use --set-name <name> to set the name of the output sketch(es).
sourmash signature inflate - transfer abundances from one signature to others¶
Use abundances from one signature to provide abundances on other signatures.
For example,
sourmash signature inflate file1.sig file2.sig file3.sig -o inflated.sig
will take the abundances from hashes file1.sig and use them to set the abundances on matching hashes in file2.sig and file3.sig. Any hashes that are not present in file1.sig will be removed from file2.sig and file3.sig as they will now have zero abundance.
sig inflate can only work with compatible sketches - if there are multiple k-mer sizes or molecule types present in any of the signature files, you will need to choose one k-mer size with -k/--ksize, and/or one moltype with --dna/--protein/--hp/--dayhoff.
sourmash signature downsample - decrease the size of a signature¶
Downsample one or more signatures.
With downsample, you can --
- increase the scaled value for a signature created with -p scaled=SCALED, shrinking it in size;
- decrease the num value for a traditional num MinHash, shrinking it in size;
- try to convert a scaled signature to a num signature;
- try to convert a num signature to a scaled signature.
For example,
sourmash signature downsample file1.sig file2.sig --scaled 100000 -o downsampled.sig
will output each signature, downsampled to a scaled value of 100000, to downsampled.sig; and
sourmash signature downsample --num 500 scaled_file.sig -o downsampled.sig
will try to convert a scaled MinHash to a num MinHash.
sourmash signature extract - extract signatures from a collection¶
Extract the specified signature(s) from a collection of signatures.
For example,
sourmash signature extract *.sig -k 21 --dna -o extracted.sig
will extract all nucleotide signatures calculated at k=21 from all .sig files in the current directory.
There are currently two other useful selectors for extract: you can specify (part of) an md5sum, as output in the CSVs produced by search and gather; and you can specify (part of) a name.
For example,
sourmash signature extract tests/test-data/*.fa.sig --md5 09a0869
will extract the signature from 47.fa.sig which has an md5sum of 09a08691ce52952152f0e866a59f6261; and
sourmash signature extract tests/test-data/*.fa.sig --name NC_009665
will extract the same signature, which has an accession number of NC_009665.1.
Using picklists with sourmash sig extract¶
As of sourmash 4.2.0, extract also supports picklists, a feature by which you can select signatures based on values in a CSV file. See Using picklists to subset large collections of signatures, below.
sourmash signature flatten - remove abundance information from signatures¶
Flatten the specified signature(s), removing abundances and setting track_abundance to False.
For example,
sourmash signature flatten *.sig -o flattened.sig
will remove all abundances from all of the .sig files in the current directory.
The flatten command accepts the same selectors as extract.
sourmash signature filter - remove hashes based on abundance¶
Filter the hashes in the specified signature(s) by abundance, by either -m/--min-abundance or -M/--max-abundance or both. Abundance selection is inclusive, so -m 2 -M 5 will select hashes with abundance greater than or equal to 2, and less than or equal to 5.
For example,
sourmash signature filter -m 2 *.sig
will output new signatures containing only hashes that occur two or more times in each signature.
The filter command accepts the same selectors as extract.
sourmash signature import - import signatures from mash.¶
Import signatures into sourmash format. Currently only supports mash, and can import mash sketches output by mash info -d <filename.msh>.
For example,
sourmash signature import filename.msh.json -o imported.sig
will import the contents of filename.msh.json into imported.sig.
Note: import only creates one output file, with one signature in it.
Note: ingest is an alias for import.
sourmash signature export - export signatures to mash.¶
Export signatures from sourmash format. Currently only supports mash dump format.
For example,
sourmash signature export filename.sig -o filename.sig.msh.json
sourmash signature overlap - detailed comparison of two signatures' overlap¶
Display a detailed comparison of two signatures. This calculates the Jaccard similarity (as in sourmash compare or sourmash search) and the Jaccard containment in both directions (as with --containment). It also displays the number of hash values in the union and intersection of the two signatures, as well as the number of disjoint hash values in each signature.
This command has two uses - first, it is helpful for understanding how similarity and containment are calculated, and second, it is useful for analyzing signatures with very small overlaps, where the similarity and/or containment might be very close to zero.
For example,
sourmash signature overlap tests/test-data/63.fa.sig \
tests/test-data/47.fa.sig
will display the detailed comparison of the two files like so:
loaded one signature each from tests/test-data/63.fa.sig and tests/test-data/47.fa.sig first signature:
signature filename: tests/test-data/63.fa.sig
signature: NC_011663.1 Shewanella baltica OS223, complete genome
md5: 38729c6374925585db28916b82a6f513
k=31 molecule=DNA num=0 scaled=1000 second signature:
signature filename: tests/test-data/47.fa.sig
signature: NC_009665.1 Shewanella baltica OS185, complete genome
md5: 09a08691ce52952152f0e866a59f6261
k=31 molecule=DNA num=0 scaled=1000 similarity: 0.32069 first contained in second: 0.48282 second contained in first: 0.48851 number of hashes in first: 5238 number of hashes in second: 5177 number of hashes in common: 2529 only in first: 2709 only in second: 2648 total (union): 7886
sig overlap can only work with compatible sketches - if there are multiple k-mer sizes or molecule types present in any of the signature files, you will need to choose one k-mer size with -k/--ksize, and/or one moltype with --dna/--protein/--hp/--dayhoff.
sourmash signature kmers - extract k-mers and/or sequences that match to signatures¶
Given one or more compatible sketches and some sequence files, extract the k-mers and/or sequences corresponding to the hash values in the sketch. Because the sourmash hash function is one-way, this requires FASTA or FASTQ sequence files in addition to the sketch.
For example,
sourmash sig kmers --signatures sig1.sig --sequences seqfile.fasta \
--save-sequences matches.fasta --save-kmers kmer-matches.csv
will search seqfile.fasta for matching sequences and k-mers, and produce two files. The file matches.fasta will contain FASTA sequences that match the hashes in the input signature, while the file kmer-matches.csv provides the matching k-mers and hash values, together with their originating filename and sequence name.
If the sketch is a protein sketch (protein, dayhoff, or hp), then the input sequences are assumed to be protein. To search DNA sequences for translated protein hashes, provide the --translate flag to sig kmers.
--save-sequences and --save-kmers are both optional. If neither are given, basic statistics on k-mer matching are given.
Please note that --save-kmers can be very slow on large files!
The input sketches are the source of the input hashes. So, for example, If --scaled=1 sketches are provided, sig kmers can be used to yield all the k-mers and their matching hashes. Likewise, if the sketch is built from the intersection of two other sketches, only the k-mers and hash values present in both sketches will be used.
Likewise, the input sequences are used for matching; they do not need to be the same sequences that were used to create the sketches. Input sequences can be in FASTA or FASTQ format, and either flat text or compressed with gzip or bzip2; formats are auto-detected.
By default, sig kmers ignores bad k-mers (e.g. non-ACGT characters in DNA). If --check-sequence is provided, sig kmers will error exit on the first bad k-mer. If --check-sequence --force is provided, sig kmers will provide error messages (and skip bad sequences), but will continue processing input sequences.
sourmash signature manifest - output a manifest for a file¶
Output a manifest for a file, database, or collection. Note that these manifests are not usually suitable for use as standalone manifests; the sourmash sig collect and sourmash sig check commands produce standalone manifests.
For example,
sourmash sig manifest tests/test-data/prot/all.zip -o manifest.csv
will create a CSV file, manifest.csv, in the internal sourmash manifest format. The manifest will contain an entry for every signature in the file, database, or collection. This format is largely meant for internal use, but it can serve as a picklist pickfile for subsetting large collections.
By default, sourmash sig manifest will rebuild the manifest by iterating over the signatures in the input file. This can be slow for large collections. Use --no-rebuild-manifest to load an existing manifest if it is available.
As of sourmash 4.4.0, sig manifest can produce a manifest in a fast on-disk format (a SQLite database). SQLite manifests can be much faster when working with very large collections of signatures. To produce a SQLite manifest, use sourmash sig manifest ... -F sql.
All sourmash commands that work with manifests will accept both CSV and SQLite manifest files.
sourmash signature check - compare picklists and manifests¶
Compare picklists and manifests across databases, and optionally output matches and missing items. In particular, sig check can be used to create standalone manifests for a subset of a large collection, using picklists.
For example,
sourmash sig check tests/test-data/gather/GCF*.sig \
--picklist tests/test-data/gather/salmonella-picklist.csv::manifest
will load all of the GCF signatures and compare them to the given picklist. With -o/--output-missing, sig check will save unmatched elements of the picklist CSV. With --save-manifest-matching, sig check will save all of the matched elements to a manifest file, which can then be used as a sourmash database.
sourmash sig check is particularly useful when working with large collections of signatures and identifiers.
With -m/--save-manifest-matching, sig check creates a standalone manifest. In these manifests, sourmash v4 will by default write paths to the matched elements that are relative to the current working directory. In some cases - when the output manifest is in a different directory - this will create manifests that do not work properly with sourmash. The --relpath argument will rewrite the paths to be relative to the manifest, while the --abspath argument will rewrite paths to be absolute. The --relpath behavior will be the default in sourmash v5.
Standalone manifests created with -m/--save-manifest-matching will use the paths given to sig check on the command line; we recommend using zip files and sig files, and avoiding directory hierarchies or path lists. You can use --from-file to pass in long lists of filenames via a text file.
sourmash signature collect - collect manifests across databases¶
Collect manifests from across (many) files and merge into a single standalone manifest. Standalone manifests can be used directly as a sourmash database; they support efficient searching and selection of sketches, as well as lazy loading of individual sketches from large collections. See advanced usage information on sourmash databases for more information.
For example,
sourmash sig collect tests/test-data/gather/GCF*.sig -o mf.sqlmf
will load all of the GCF signatures and build a manifest file mf.sqlmf that contains references to all of the signatures, but not the signatures themselves. This manifest file can be loaded directly from the command line by sourmash.
sourmash sig collect defaults to outputting SQLite manifests. It is particularly useful when working with large collections of signatures and identifiers, and has command line options for merging and updating manifests.
The standalone manifests created by sig collect will reference the paths given on the command line; we recommend using zip files and sig files, and avoiding directory hierarchies or path lists. You can also use --from-file to pass in long lists of filenames.
Standalone manifests produced by sig collect work most efficiently when constructed from many small zip file collections.
As with sig check, the standalone manifests created by sig collect in sourmash v4 will by default write paths to the matched elements relative to the current working directory. When the output manifest is in a different directory, this will create manifests that do not work properly with sourmash. The --relpath argument will rewrite the paths to be relative to the manifest, while the --abspath argument will rewrite paths to be absolute. The --relpath behavior will be the default in sourmash v5.
Advanced command-line usage¶
Loading signatures and databases¶
sourmash uses several different command-line styles. Most sourmash commands can load sketches from any standard collection type; we primarily recommend using zipfiles (but read on!)
Briefly,
- search and gather both take a single query signature and search multiple signatures or databases. In this case, there has to be a single identifiable query for sourmash to use, and if you're using a database or list of signatures as the source of a query, you'll need to provide a selector (ksize with -k, moltype with --dna etc, or md5sum with --query-md5) that picks out a single signature.
- compare takes multiple signatures and can load them from any sourmash collection type.
- the lca classify and lca summarize commands take multiple signatures with --query, and multiple LCA databases, with --db. sourmash multigather also uses this style. This allows these commands to specify multiple queries and multiple databases without (too much) confusion. The database must be LCA databases.
- index and lca index take a few fixed parameters (database name, and for lca index, a taxonomy file) and then an arbitrary number of other files that contain signatures.
None of these commands currently support searching, comparing, or indexing signatures with multiple ksizes or moltypes at the same time; you need to pick the ksize and moltype to use for your query. Where possible, scaled values will be made compatible.
Selecting signatures¶
(sourmash v4.3.0 and later)
sourmash is built to work with very large collections of signatures, and you may want to select (or exclude) specific signatures from search or other operations, based on their name. This can be done without modifying the collections themselves via the --include-db-pattern and --exclude-db-pattern arguments to many sourmash commands, including search, gather, compare, prefetch, and sig extract.
In brief, sourmash search ... --include <pattern> will search only those database signatures that match <pattern> in their name, filename, or md5 strings. Here, <pattern> can be either a substring or a regular expression. Likewise, sourmash search ... --exclude <pattern> will search only those database signatures that don't match pattern in their name, filename, or md5 strings.
Using picklists to subset large collections of signatures¶
(sourmash v4.2.0 and later)
Many commands support picklists, a feature by which you can select or "pick out" signatures based on values in a CSV file. This is typically used to index, extract, or search a subset of a large collection where modifying the collection itself isn't desired.
For example,
sourmash sig extract --picklist list.csv:md5:md5sum <signatures>
will extract only the signatures that have md5sums matching the column md5sum in the CSV file list.csv. The command
sourmash sig extract --picklist list.csv::prefetch <signatures>
will extract only the signatures found in the output of sourmash prefetch ... -o list.csv.
The --picklist argument string must be of the format pickfile:colname:coltype[:pickstyle], where pickfile is the path to a CSV file, colname is the name of the column to select from the CSV file (based on the headers in the first line of the CSV file), and coltype is the type of match. An optional pickstyle argument, :include or :exclude, can be added as a fourth parameter; if omitted, the default is :include.
The following coltypes are currently supported for picklists:
- name - exact match to signature's name
- md5 - exact match to signature's md5sum
- md5prefix8 - match to 8-character prefix of signature's md5sum
- md5short - same as md5prefix8
- ident - exact match to signature's identifier
- identprefix - match to signature's identifier, before '.'
- gather - use the CSV output of sourmash gather as a picklist
- prefetch - use the CSV output of sourmash prefetch as a picklist
- search - use the CSV output of sourmash prefetch as a picklist
- manifest - use CSV manifests produced by sig manifest as a picklist
Identifiers are constructed by using the first space delimited word in the signature name.
One way to build a picklist is to use sourmash sig grep <pattern> <collection> --csv out.csv to construct a CSV file containing a list of all sketches that match the pattern (which can be a string or regexp). The out.csv file can be used as a picklist via the picklist manifest CSV format with --picklist out.csv::manifest.
You can also use sourmash sig describe --csv out.csv <signatures> or sourmash sig manifest -o out.csv <filename_or_db> to construct an initial CSV file that you can then edit further and use as a picklist as above.
The picklist functionality also supports excluding (rather than including) signatures matching the picklist arguments. To specify a picklist for exclusion, add :exclude to the --picklist argument string, e.g. pickfile:colname:coltype:exclude.
For example,
sourmash sig extract --picklist list.csv:md5:md5sum:exclude <signatures>
will extract only the signatures that have md5sums that do not match entries in the column md5sum in the CSV file list.csv.
In addition to sig extract, the following commands support --picklist selection: index, search, gather, prefetch, compare, index, and lca index.
Storing (and searching) signatures¶
Backing up a little, there are many ways to store and search signatures. sourmash supports storing and loading signatures from JSON files, directories, lists of files, Zip files, custom indexed databases, and SQLite databases. These can all be used interchangeably for most sourmash operations.
The simplest is one signature in a single JSON file. You can also put many signatures in a single JSON file, either by building them that way with sourmash sketch or by using sourmash sig cat or other commands. Searching or comparing these files involves loading them sequentially and iterating across all of the signatures - which can be slow, especially for many (100s or 1000s) of signatures.
Zip files¶
All of the sourmash commands support loading collections of signatures from zip files. You can create a compressed collection of signatures using sourmash sig cat *.sig -o collections.zip and then specifying collections.zip on the command line in place of *.sig; you can also sketch FASTA/FASTQ files directly into a zip file with -o collections.zip.
Choosing signature output formats¶
(sourmash v4.1 and later)
All signature saving arguments (--save-matches for search and gather, -o for sourmash sketch, and -o for the sourmash signature commands) support flexible saving of collections of signatures into JSON text, Zip files, and/or directories.
This behavior is triggered by the requested output filename --
- to save to JSON signature files, use .sig; using the filename - will send JSON to stdout.
- to save to gzipped JSON signature files, use .sig.gz;
- to save to a Zip file collection, use .zip;
- to save signature files to a directory, use a name ending in /; the directory will be created if it doesn't exist;
- to save to a SQLite database, use .sqldb (as of sourmash v4.4.0).
If none of these file extensions is detected, output will be written in the JSON .sig format, either to the provided output filename or to stdout.
All of these save formats can be loaded by sourmash commands.
We strongly suggest using .zip files to store signatures: they are fast, small, and fully supported by all the sourmash commands and API.
Note that when outputting large collections of signatures, some save formats require holding all the sketches in memory until they can be written out, and others can save progressively. This can affect memory usage! Currently .sig and .sig.gz formats are held in memory, while .zip, directory outputs, and .sqldb formats write progressively to disk.
For more detailed information on database formats and performance tradeoffs, please see the advanced usage information for databases!
Loading many signatures¶
Indexed databases¶
Indexed databases can make searching signatures much faster. SBT databases are low memory and disk-intensive databases that allow for fast searches using a tree structure, while LCA databases are higher memory and (after a potentially significant load time) are quite fast. SQLite databases (new in sourmash v4.4.0) are typically larger on disk than SBTs and LCAs, but in turn are fast to load and support very low memory search.
Commands that take multiple signatures or collections of signatures will also work with indexed databases.
One limitation of indexed databases is that they are all restricted in to certain kinds of signatures. Both SBT and LCA databases can only contain one "type" of signature (one ksize/one moltype at one scaled value). SQLite databases can contain multiple ksizes and moltypes, but only at one scaled value. If the database signature type is incompatible with the other signatures, sourmash will complain appropriately.
In contrast, signature files and zip collections can contain many different types of signatures, and compatible ones will be selected automatically.
Use the sourmash index command to create an SBT.
Use the sourmash lca index command to create an LCA database; the database can be saved in JSON or SQL format with -F json or -F sql.
Use sourmash sig cat <list of signatures> -o <output>.sqldb to create a SQLite indexed database.
Loading signatures within a directory hierarchy¶
All of the sourmash commands support loading signatures (.sig or .sig.gz files) from within directory hierarchies; you can just provide the paths to the top-level directory on the command line.
However, this is no longer recommended because it can be very inefficient; we instead suggest passing all of the sketch files in the directory into sig collect to build a standalone manifest, or using sig cat on the directory to generate a zip file.
Passing in lists of files¶
sourmash commands support --from-file or --query-from-file, which will take the location of a text file containing a list of file paths. This can be useful for situations where you want to specify thousands of queries, or a subset of signatures produced by some other command.
This is no longer recommended when using large collections; we instead suggest using standalone manifests built with sig collect and sig check, which will include extra metadata that supports fast loading.
Combining search databases on the command line¶
All of the commands in sourmash operate in "online" mode, so you can combine multiple databases and signatures on the command line and get the same answer as if you built a single large database from all of them. The only caveat to this rule is that if you have multiple identical matches present across the databases, the order in which they are used may depend on the order that the files are passed in on the command line.
Using stdin¶
Most commands will take signature JSON data via stdin using the usual UNIX convention, -. Moreover, sourmash sketch and the sourmash sig commands will output to stdout. So, for example,
sourmash sketch ... -o - | sourmash sig describe -
will describe the signatures that were just created.
Using standalone manifests to explicitly refer to collections of files¶
(sourmash v4.4 and later)
Manifests are metadata catalogs of signatures that are used for signature selection and loading. They are used extensively by sourmash internals to speed up signature selection through picklists and pattern matching.
Manifests can also be used externally (via the command-line), and these "standalone manifests" may be useful for organizing large collections of signatures. They can be generated with the sig collect, sig manifest, and sig check subcommands.
Suppose you have a large collection of signatures (.sig or .sig.gz files) in a location (e.g., under a directory, or in a zip file). You can create a manifest file for them like so:
sourmash sig collect <dir> <zipfile> -o manifest.sqlmf
and then use the manifest directly for sourmash operations, for example:
sourmash sig fileinfo manifest.sqlmf
This manifest contains references to the signatures (but not the signatures themselves) and can then be used as a database target for most sourmash operations - search, gather, etc. Manifests support fast selection and lazy loading of sketches in many situations.
The sig check command can also be used to create standalone manifests from collections using a picklist, with the -m/--save-manifest-matching option. This is useful for commands that don't support picklists natively, such as commands in plugins.
Note that sig collect and sig check will generate manifests containing the pathnames given to them - so if you use relative paths, the references will be relative to the working directory in which the command was run. You can use sig collect --abspath to rewrite the paths into absolute paths, or sig collect --relpath to rewrite the paths relative to the manifest file.
Our advice: We suggest using zip file collections for most situations; we strongly recommend using standalone manifests for situations where you have very large sketches or a very large collection of sketches (1000s or more), and don't want to make multiple copies of signatures in the collection (as you would have to, with a zipfile). This is particularly useful if you want to refer to different subsets of the collection without making multiple copies in a zip file.
You can read more about the details of zip files and manifests in the advanced usage information for databases.
Using sourmash plugins¶
As of sourmash v4.7.0, sourmash has an experimental plugins interface! The plugin interface supports extending sourmash to load and save signatures in new ways, and also supports the addition of sourmash subcommands via sourmash scripts.
In order to use a plugin with sourmash, you will need to use pip or conda to install the plugin the same environment that sourmash is installed in.
In the future, we will include a list of available sourmash plugins in the documentation, and also provide a way to list available plugins.
You can list all installed plugins and their versions with sourmash info -v.
Below are some useful plugins that the sourmash team uses regularly and supports!
The branchwater plugin - multithreaded and optimized sourmash operations¶
(Installable via conda and pip as sourmash_plugin_branchwater.)
The branchwater plugin provides faster and lower memory versions of search, gather, and sketch, as well as large-scale metagenome search (used for petabyte-scale sequence search) and large-scale clustering.
Read the branchwater plugin docs for more information, and ask questions on the sourmash issue tracker!
The betterplot plugin - improved plotting and visualization¶
(Installable via pip as sourmash_plugin_betterplot.)
The betterplot plugin provides a variety of new plotting outputs for sourmash, including improved distance matrices, MDS plots, tSNE plots, upset plots, and Venn diagrams. It also supports cluster-cutting and extraction, as well as improved labeling and coloring by category.
Read the betterplot docs for more information, and ask questions on the sourmash issue tracker!.
Prepared databases¶
Contents¶
- •
- Prepared databases
- Types of databases
- Taxonomic Information (for non-LCA databases)
- Downloading and using the databases
- Sketches for human and animal genomes
- Sketches for plant genomes
- GTDB R08-RS214 - DNA databases
- GTDB R08-RS214 genomic representatives (85k)
- GTDB R08-RS214 all genomes (403k)
- •
- Genbank genomes from March 2022
- Genbank viral
- Genbank archaeal
- Genbank protozoa
- Genbank fungi
- Genbank bacterial:
- •
- GTDB R07-RS207 - DNA databases
- GTDB R07-RS207 genomic representatives (66k)
- GTDB R07-RS207 all genomes (318k)
- •
- GTDB R06-RS202 - DNA databases
- GTDB R06-RS202 genomic representatives (47.8k)
- GTDB R06-RS202 all genomes (258k)
- Appendix: database use and construction details
- Appendix: Memory and time requirements
- Appendix: legacy databases
We provide a number of pre-built collections and indexed databases that you can use with sourmash.
Types of databases¶
For each k-mer size, three types of databases may be available: Zipfile (.zip), SBT (.sbt.zip), and LCA (.lca.jzon.gz). The Zipfile and SBT databases are built with scaled=1000, and then LCA databases are built with scaled=10,000. We recommend using the Zipfile databases for sourmash gather and the SBT databases for sourmash search. You must use the LCA databases for sourmash lca operations.
You can read more about the different database and index types here.
Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1 and up.
Taxonomic Information (for non-LCA databases)¶
For each prepared database, we have also made taxonomic information available linking each genome with its assigned lineage (GTDB or NCBI as appropriate). For private databases, users can create their own taxonomy files: the critical columns are ident, containing the genome accession (e.g. GCA_1234567.1) and a column for each taxonomic rank, superkingdom to species. If a strain column is provided, it will also be used. As of v4.8, we can also use LIN taxonomic information in tax commands that accept the --lins flag. If used, sourmash tax commands will require a lin column in the taxonomy file which should contain ;-separated LINs, preferably with a standard number of positions (e.g. all 20 positions in length or all 10 positions in length). Some taxonomy commands also accept a lingroups file, which is a two-column file (name, lin) describing the name and LIN prefix of LINgroups to be used for taxonomic summarization.
Downloading and using the databases¶
All databases below can be downloaded via the command line with curl -JLO <url>, where <url> is the URL below. This will download an appropriately named file; you can name it yourself by specify '-o <output> to specify the local filename.
The databases do not need to be unpacked or prepared in any way after download.
You can verify that they've been successfully downloaded (and view database properties such as ksize and scaled) with sourmash sig summarize <output>.
Sketches for human and animal genomes¶
These sketches are of the latest releases of a number of animal genomes. Among other uses, they can be used to detect host contamination in microbial metagenomes.
Each file includes sketches at k=21, k=31, and k=51, at a scaled of 1000, and is under 50 MB.
- Human (hg38) - hg38.sig.zip
- Cow (bosTau9) - bosTau9.sig.zip
- Dog (canFam6) - canFam6.sig.zip
- Horse (equCab3) - equCab3.sig.zip
- Cat (felCat9) - felCat9.sig.zip
- Chicken (galGAl6) - galGal6.sig.zip
- Mouse (mm39) - mm39.sig.zip
- Goat (oviAri4) - oviAri4.sig.zip
- Pig (susCr11) - susScr11.sig.zip
Sketches for plant genomes¶
These sketches are for the plant genomes available in GenBank as of 2024-07.
K-mer size | Zipfile collection |
k21 | download (7G) |
k31 | download (8.8G) |
k51 | download (11G) |
Lineage spreadsheet for sourmash tax commands: download
GTDB R08-RS214 - DNA databases¶
GTDB R08-RS214 consists of 402,709 genomes organized into 85,205 species clusters.
The lineage spreadsheet (for sourmash tax commands) is available for the genome database (402k).
GTDB R08-RS214 genomic representatives (85k)¶
The GTDB genomic representatives are a low-redundancy subset of Genbank genomes, with 85,205 species-level genomes.
K-mer size | Zipfile collection | SBT | LCA |
21 | download (2.2 GB) | download (4.4 GB) | download (189 MB) |
31 | download (2.2 GB) | download (4.4 GB) | download (221 MB) |
51 | download (2.2 GB) | download (4.4 GB) | download (230 MB) |
GTDB R08-RS214 all genomes (403k)¶
These are databases for the full GTDB release, each containing 402,709 genomes.
K-mer size | Zipfile collection | SBT | LCA |
21 | download (12 GB) | download (23 GB) | download (406 MB) |
31 | download (12 GB) | download (23 GB) | download (438 MB) |
51 | download (12 GB) | download (23 GB) | download (460 MB) |
Genbank genomes from March 2022¶
The below zip files contain different subsets of the signatures for all microbial Genbank genomes. The databases were built in March 2022, and are based on the assembly_summary files provided here.
Since some of the files are extremely large, we only provide them in Zip format (which is our smallest and most flexible format).
Note that all of the sourmash search commands support multiple databases on the command line, so you can search multiple subsets simply by providing them all on the command line, e.g. sourmash search query.sig genbank-2022.03-{viral,protozoa}-k31.zip.
Taxonomic spreadsheets for each domain are provided below as well.
Genbank viral¶
47,952 genomes:
genbank-2022.03-viral-k21.zip
genbank-2022.03-viral-k31.zip
genbank-2022.03-viral-k51.zip
genbank-2022.03-viral.lineages.csv.gz
Genbank archaeal¶
8,750 genomes:
genbank-2022.03-archaea-k21.zip
genbank-2022.03-archaea-k31.zip
genbank-2022.03-archaea-k51.zip
genbank-2022.03-archaea.lineages.csv.gz
Genbank protozoa¶
1193 genomes:
genbank-2022.03-protozoa-k21.zip
genbank-2022.03-protozoa-k31.zip
genbank-2022.03-protozoa-k51.zip
genbank-2022.03-protozoa.lineages.csv.gz
Genbank fungi¶
10,286 genomes:
genbank-2022.03-fungi-k21.zip
genbank-2022.03-fungi-k31.zip
genbank-2022.03-fungi-k51.zip
genbank-2022.03-fungi.lineages.csv.gz
Genbank bacterial:¶
1,148,011 genomes:
genbank-2022.03-bacteria-k21.zip
genbank-2022.03-bacteria-k31.zip
genbank-2022.03-bacteria-k51.zip
genbank-2022.03-bacteria.lineages.csv.gz
GTDB R07-RS207 - DNA databases¶
GTDB R07-RS207 consists of 317,542 genomes organized into 65,703 species clusters.
The lineage spreadsheet (for sourmash tax commands) is available for the species database (65k) and for the genome database (317k).
GTDB R07-RS207 genomic representatives (66k)¶
The GTDB genomic representatives are a low-redundancy subset of Genbank genomes, with 65,703 species-level genomes.
K-mer size | Zipfile collection | SBT | LCA |
21 | download (1.7 GB) | download (3.5 GB) | download (181 MB) |
31 | download (1.7 GB) | download (3.5 GB) | download (181 MB) |
51 | download (1.7 GB) | download (3.5 GB) | download (181 MB) |
GTDB R07-RS207 all genomes (318k)¶
These are databases for the full GTDB release, each containing 317,542 genomes.
K-mer size | Zipfile collection | SBT | LCA |
21 | download (9.4 GB) | download (19 GB) | download (351 MB) |
31 | download (9.4 GB) | download (19 GB) | download (351 MB) |
51 | download (9.4 GB) | download (19 GB) | download (351 MB) |
GTDB R06-RS202 - DNA databases¶
All files below are available under https://osf.io/wxf9z/. The GTDB taxonomy spreadsheet (in a format suitable for sourmash lca index) is available here.
GTDB R06-RS202 genomic representatives (47.8k)¶
The GTDB genomic representatives are a low-redundancy subset of Genbank genomes.
K-mer size | Zipfile collection | SBT | LCA |
21 | download (1.3 GB) | download (2.6 GB) | download (114 MB) |
31 | download (1.3 GB) | download (2.6 GB) | download (131 MB) |
51 | download (1.3 GB) | download (2.6 GB) | download (137 MB) |
GTDB R06-RS202 all genomes (258k)¶
These databases contain the complete GTDB collection of 258,406 genomes.
K-mer size | Zipfile collection | SBT | LCA |
21 | download (7.8 GB) | download (15 GB) | download (266 MB) |
31 | download (7.8 GB) | download (15 GB) | download (286 MB) |
51 | download (7.8 GB) | download (15 GB) | download (299 MB) |
Appendix: database use and construction details¶
Database release workflows are being archived at sourmash-bio/database-releases.
Some more details on database use and construction:
- Zipfile collections can be used for a linear search. The signatures were calculated with a scaled of 1000, which robustly supports searches for ~10kb or larger matches.
- SBT databases are indexed versions of the Zipfile collections that support faster search. They are also indexed with scaled=1000.
- LCA databases are indexed versions of the Zipfile collections that also contain taxonomy information and can be used with regular search as well as with the lca subcommands for taxonomic analysis. They are indexed with scaled=10,000, which robustly supports searches for 100kb or larger matches.
Appendix: Memory and time requirements¶
The detailed memory usage of sourmash depends on the type of search, the query, and the database you're searching, but to help guide you here is a range of numbers:
Search type | Query | Database | Max RAM | Time |
gather | Bacterial genome | GTDB complete (280k) | 1 GB | 6 minutes |
gather | Simple metagenome | GTDB reps .zip (65k) | 2 GB | 6 minutes |
gather | Real metagenome | All Genbank (1.2m) | 100 GB | 3 hours |
lca summarize | Simple metagenome | GTDB reps .sql (65k) | 400 MB | 20 seconds |
lca summarize | Simple metagenome | GTDB reps .json (65k) | 6.2 GB | 1m 20 seconds |
Please see sourmash#1958 for detailed GTDB numbers and gather paper#47 for detailed Genbank numbers.
Appendix: legacy databases¶
Legacy databases are available here.
sourmash Python API examples¶
All of sourmash's functionality is available via its Python API. Below are both basic and advanced examples that use the API to accomplish common tasks.
Contents¶
- •
- sourmash Python API examples
- A first example: two k-mers
- Introduction: k-mers, molecule types, and hashing.
- Set operations on hashes
- Creating MinHash sketches programmatically, from genome files
- Plotting dendrograms and matrices
- Saving and loading signature files
- Going from signatures back to MinHash objects and their hashes -
- Advanced features of sourmash MinHash objects - scaled and num
- Working with indexed collections of signatures
A first example: two k-mers¶
Define two sequences:
>>> seq1 = "ATGGCA" >>> seq2 = "AGAGCA"
Create two MinHashes using 3-mers, and add the sequences:
>>> import sourmash >>> mh1 = sourmash.MinHash(n=0, ksize=3, scaled=1) >>> mh1.add_sequence(seq1)
>>> mh2 = sourmash.MinHash(n=0, ksize=3, scaled=1) >>> mh2.add_sequence(seq2)
One of the 3-mers (out of 7) overlaps, so Jaccard index is 1/7:
>>> round(mh1.jaccard(mh2), 2) 0.14
and of course the MinHashes match themselves:
>>> mh1.jaccard(mh1) 1.0
We can add sequences to the MinHash objects and query at any time --
>>> mh1.add_sequence(seq2) >>> x = mh1.jaccard(mh2) >>> round(x, 3) 0.571
Introduction: k-mers, molecule types, and hashing.¶
DNA k-mers¶
The basis of sourmash is k-mers. Suppose we have a 35 base DNA sequence, and we break it into four 31-mers:
>>> K = 31 >>> dnaseq = "ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGC" >>> for i in range(0, len(dnaseq) - K + 1): ... kmer = dnaseq[i:i+K] ... print(i, kmer) 0 ATGCGAGTGTTGAAGTTCGGCGGTACATCAG 1 TGCGAGTGTTGAAGTTCGGCGGTACATCAGT 2 GCGAGTGTTGAAGTTCGGCGGTACATCAGTG 3 CGAGTGTTGAAGTTCGGCGGTACATCAGTGG 4 GAGTGTTGAAGTTCGGCGGTACATCAGTGGC
sourmash uses a hash function (by default MurmurHash) that converts each k-mer into 64-bit numbers. These numbers form the basis of everything else sourmash does; the k-mer strings are not used internally at all.
The easiest way to access the hash function is via the seq_to_hashes method on MinHash objects, which returns a list:
>>> from sourmash import MinHash >>> mh = MinHash(n=0, ksize=K, scaled=1) >>> for i in range(0, len(dnaseq) - K + 1): ... kmer = dnaseq[i:i+K] ... print(i, kmer, mh.seq_to_hashes(kmer)) 0 ATGCGAGTGTTGAAGTTCGGCGGTACATCAG [7488148386897425535] 1 TGCGAGTGTTGAAGTTCGGCGGTACATCAGT [3674733966066518639] 2 GCGAGTGTTGAAGTTCGGCGGTACATCAGTG [2135725670290847794] 3 CGAGTGTTGAAGTTCGGCGGTACATCAGTGG [14521729668397845245] 4 GAGTGTTGAAGTTCGGCGGTACATCAGTGGC [15919051675656106963]
Note that this is the same as using the MurmurHash hash function with a seed of 42 and taking the first 64 bits.
Because DNA is double-stranded and has no inherent directionality, but computers represent DNA with only one strand, it's important for sourmash to represent both strands. sourmash does this by building a canonical representation for each k-mer so that reverse-complement sequences match to their forward sequence.
Underneath, sourmash DNA hashing does this by taking each k-mer, building the reverse complement, choosing the lexicographically lesser of the two, and then hashes it - for example, for the first and second k-mers above, you get:
>>> from sourmash.minhash import hash_murmur >>> from screed import rc >>> kmer_1 = "ATGCGAGTGTTGAAGTTCGGCGGTACATCAG" >>> hash_murmur(kmer_1) 7488148386897425535 >>> hash_murmur(kmer_1) == mh.seq_to_hashes(kmer_1)[0] True >>> kmer_2 = "TGCGAGTGTTGAAGTTCGGCGGTACATCAGT" >>> hash_murmur(kmer_2) 6388498842523164783 >>> kmer_2_rc = rc(kmer_2) >>> kmer_2_rc 'ACTGATGTACCGCCGAACTTCAACACTCGCA' >>> hash_murmur(kmer_2_rc) == mh.seq_to_hashes(kmer_2)[0] True
where the second k-mer's reverse complement starts with 'A' and is therefore chosen for hashing by sourmash. This method was chosen to be compatible with [mash](https://mash.readthedocs.io/.
Protein-based encodings¶
By default, MinHash objects work with DNA. However, sourmash supports amino acid, Dayhoff, and hydrophobic-polar (hp) encodings as well. The Dayhoff and hp encodings support degenerate matching that is less stringent than exact matches.
The simplest way to use a protein MinHash object is to create one and call add_protein on it --
>>> K = 7 >>> mh = MinHash(0, K, is_protein=True, scaled=1) >>> protseq = "MVKVYAPAS" >>> mh.add_protein(protseq)
This creates three 7-mers from the sequence and hashes them:
>>> list(sorted(mh.hashes)) [5834377656419371297, 8846570680426381265, 10273850291677879123]
As with DNA k-mers, above, you can also use seq_to_hashes to generate the hashes for protein k-mers, if you add the is_protein=True flag:
>>> for i in range(0, len(protseq) - K + 1): ... kmer = protseq[i:i+K] ... print(i, kmer, mh.seq_to_hashes(kmer, is_protein=True)) 0 MVKVYAP [5834377656419371297] 1 VKVYAPA [10273850291677879123] 2 KVYAPAS [8846570680426381265]
In this case, the k-mers are always hashed in the forward direction (because protein sequence always has an orientation, unlike DNA).
sourmash also supports the Dayhoff and hydrophobic-polar encodings; here, amino acids are first mapped to their encodings and then hashed. So, for example, the amino acid sequence CADHIF* is mapped to abcdef* in the Dayhoff encoding:
>>> mh = MinHash(0, K, dayhoff=True, scaled=1) >>> h1 = mh.seq_to_hashes('CADHIF*', is_protein=True)[0] >>> h1 12061624913065022412 >>> h1 == hash_murmur('abcdef*') True
Translating DNA into protein-based encodings.¶
If you use add_sequence(...) to add DNA sequence to a protein encoding, or call seq_to_hashes(...) on a protein encoding without is_protein=True, sourmash will translate the sequences in all possible reading frames and hash the translated amino acids. The k-mer size for the MinHash is used as the k-mer size of the amino acids, i.e. 7 aa is 21 DNA bases.
>>> dnaseq = "ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGC" >>> len(dnaseq) 35 >>> K = 7 >>> mh = MinHash(n=0, ksize=K, is_protein=True, scaled=1) >>> mh.add_sequence(dnaseq) >>> len(mh) 30
Here, 30 hashes are added to the MinHash object because we are starting with a 35 base DNA sequence, and using 21-mers of DNA (7-mer of protein), which give us 15 distinct 21-mers in the three forward frames, and 15 distinct 21-mers in the three reverse-complement frames, for a total of 30.
If a dayhoff or hp MinHash object is used, then add_sequence(...) will first translate each sequence into protein space in all six frames, and then further encode the sequences into Dayhoff or hp encodings.
Invalid characters in DNA and protein sequences¶
sourmash detects invalid DNA characters (non-ACTG) and raises an exception on add_sequence, unless force=True, in which case k-mers containing invalid characters are ignored.
>>> dnaseq = "nTGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGC" >>> K = 31 >>> mh = MinHash(n=0, ksize=K, scaled=1) >>> mh.add_sequence(dnaseq) Traceback (most recent call last):
... ValueError: invalid DNA character in input k-mer: NTGCGAGTGTTGAAGTTCGGCGGTACATCAG >>> mh.add_sequence(dnaseq, force=True) >>> len(mh) 4
For protein sequences, sourmash does not currently do any invalid character detection; k-mers are hashed as they are, and can only be matched by an identical k-mer (with the same invalid character). (Please file an issue if you'd like us to change this!)
>>> K = 7 >>> mh = MinHash(n=0, ksize=K, is_protein=True, scaled=1) >>> protseq = "XVKVYAPAS" >>> mh.add_protein(protseq) >>> len(mh) 3
For the Dayhoff and hp encodings on top of amino acids, invalid amino acids (any letter for which no encoded character exists) are replaced with 'X'.
>>> K = 7 >>> mh = MinHash(n=0, ksize=K, dayhoff=True, scaled=1) >>> protseq1 = ".VKVYAPAS" >>> hashes1 = mh.seq_to_hashes(protseq1, is_protein=True) >>> protseq2 = "XVKVYAPAS" >>> hashes2 = mh.seq_to_hashes(protseq2, is_protein=True) >>> hashes1 == hashes2 True
Extracting both k-mers and hashes for a sequence¶
As of sourmash 4.2.2, MinHash objects provide a method called kmers_and_hashes that will return the k-mers and their corresponding hashes for an input sequence --
>>> mh = MinHash(n=0, ksize=31, scaled=1) >>> dnaseq = "ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGC" >>> for kmer, hashval in mh.kmers_and_hashes(dnaseq): ... print(kmer, hashval) ATGCGAGTGTTGAAGTTCGGCGGTACATCAG 7488148386897425535 TGCGAGTGTTGAAGTTCGGCGGTACATCAGT 3674733966066518639 GCGAGTGTTGAAGTTCGGCGGTACATCAGTG 2135725670290847794 CGAGTGTTGAAGTTCGGCGGTACATCAGTGG 14521729668397845245 GAGTGTTGAAGTTCGGCGGTACATCAGTGGC 15919051675656106963
This works for protein MinHash objects as well, of course, although you have to provide the is_protein flag, since MinHash objects assume input sequence is DNA otherwise --
>>> K = 7 >>> mh = MinHash(n=0, ksize=K, is_protein=True, scaled=1) >>> protseq = "XVKVYAPAS" >>> for (kmer, hashval) in mh.kmers_and_hashes(protseq, is_protein=True): ... print(kmer, hashval) XVKVYAP 3140823561012061964 VKVYAPA 10273850291677879123 KVYAPAS 8846570680426381265
For translated MinHash, the k-mers and hashes corresponding to all six reading frames are returned.
>>> dnaseq = "ATGCGAGTGTTGAAGTTCGGCGGTA" >>> K = 7 >>> mh = MinHash(n=0, ksize=K, is_protein=True, scaled=1) >>> for (kmer, hashval) in mh.kmers_and_hashes(dnaseq): ... print(kmer, hashval) ATGCGAGTGTTGAAGTTCGGC 16652503548557650904 CGAGTGTTGAAGTTCGGCGGT 9978056796243419534 TACCGCCGAACTTCAACACTC 2748622134668949083 CGCCGAACTTCAACACTCGCA 4263227699724621735 TGCGAGTGTTGAAGTTCGGCG 14299765336094039482 GAGTGTTGAAGTTCGGCGGTA 18155608748862746902 ACCGCCGAACTTCAACACTCG 14490181201772650983 GCCGAACTTCAACACTCGCAT 17205086974168937105 GCGAGTGTTGAAGTTCGGCGG 13354527969598897281 CCGCCGAACTTCAACACTCGC 16506504121672505595
In all cases, the k-mers are taken from the sequence itself, so the k-mers will match to the input sequence, even when there are multiple k-mers that hash to the same value (e.g. in the case of reverse complements, or DNA k-mers that are translated to the same amino acid sequence).
Note that sourmash also provides a translate_codon function if you need to get the specific amino acids -
>>> from sourmash.minhash import translate_codon >>> kmer = 'ATGCGAGT' >>> for start in range(0, len(kmer) - 3 + 1, 3): ... codon = kmer[start:start+3] ... print(codon, translate_codon(codon)) ATG M CGA R
Summary¶
In sum,
- MinHash.add_sequence(...) converts DNA sequence into DNA or protein k-mers, and then hashes them and stores them.
- MinHash.add_protein(...) converts protein sequence into protein k-mers, and then hashes them and stores them.
- MinHash.seq_to_hashes(...) will give you the hash values without adding them to the MinHash object.
- MinHash.kmers_and_hashes(...) will provide tuples of (kmer, hashval) for an input sequence.
- The dayhoff and hp encodings can be calculated on amino acid k-mers as well, using MinHash objects.
Note that this is the code that is used by the command-line functionality in sourmash sketch, so the results at the command-line will match the results from the Python API.
Set operations on hashes¶
All of the hashes in a MinHash object are available via the hashes property:
>>> mh1 = sourmash.MinHash(n=0, ksize=3, scaled=1) >>> seq1 = "ATGGCA" >>> mh1.add_sequence(seq1) >>> seq2 = "AGAGCA" >>> mh1.add_sequence(seq2) >>> list(mh1.hashes) [1274996984489324440, 2529443451610975987, 3115010115530738562, 5059920851104263793, 5740495330885152257, 8652222673649005300, 18398176440806921933]
and you can easily do your own set operations with .hashes - e.g. the following calculates the Jaccard similarity (intersection over union) of two
>>> s1 = set(mh1.hashes) >>> s2 = set(mh2.hashes) >>> round(len(s1 & s2) / len(s1 | s2), 3) 0.571
However, the MinHash class also supports a number of basic operations - the following operations work directly on the hashes:
>>> combined = mh1 + mh2 >>> combined += mh1 >>> combined.remove_many(mh1.hashes) >>> combined.add_many(mh2.hashes)
You can create an empty copy of a MinHash object with copy_and_clear:
>>> new_mh = mh1.copy_and_clear()
and you can also access the various parameters of a MinHash object directly as properties --
>>> mh1.ksize 3 >>> mh1.scaled 1 >>> mh1.num 0 >>> mh1.is_dna True >>> mh1.is_protein False >>> mh1.dayhoff False >>> mh1.hp False >>> mh1.moltype 'DNA'
see the "Advanced" section, below, for a more complete discussion of MinHash objects.
Creating MinHash sketches programmatically, from genome files¶
Suppose we want to create MinHash sketches from genomes --
>>> import glob, pprint >>> genomes = glob.glob('data/GCF*.fna.gz') >>> genomes = list(sorted(genomes)) >>> pprint.pprint(genomes) ['data/GCF_000005845.2_ASM584v2_genomic.fna.gz',
'data/GCF_000006945.1_ASM694v1_genomic.fna.gz',
'data/GCF_000783305.1_ASM78330v1_genomic.fna.gz']
We have to read them in (here using screed), but then they can be fed into add_sequence directly; here we set force=True in add_sequence to skip over k-mers containing characters other than ACTG, rather than raising an exception.
(Note, just for speed reasons, we're truncating the sequences to 50kb in length.)
>>> import screed >>> minhashes = [] >>> for g in genomes: ... mh = sourmash.MinHash(n=500, ksize=31) ... for record in screed.open(g): ... mh.add_sequence(record.sequence[:50000], True) ... minhashes.append(mh)
And now the resulting MinHash objects can be compared against each other:
>>> import sys >>> for i, e in enumerate(minhashes): ... _ = sys.stdout.write(genomes[i][:20] + ' ') ... for j, e2 in enumerate(minhashes): ... x = e.jaccard(minhashes[j]) ... _ = sys.stdout.write(str(round(x, 3)) + ' ') ... _= sys.stdout.write('\n') data/GCF_000005845.2 1.0 0.0 0.0 data/GCF_000006945.1 0.0 1.0 0.0 data/GCF_000783305.1 0.0 0.0 1.0
Note that the comparisons are quite quick; most of the time is spent in building the minhashes.
Plotting dendrograms and matrices¶
If you're interested in building comparison matrices and dendrograms, please see the notebook Building plots from sourmash compare output.
Saving and loading signature files¶
Signature files encapsulate MinHashes in JSON, and provide a way to wrap MinHash objects with some metadata (the name and filename). To save signatures, use save_signatures with a list of signatures and a Python file pointer:
>>> from sourmash import SourmashSignature, save_signatures >>> from tempfile import mkdtemp >>> sig1 = SourmashSignature(minhashes[0], name=genomes[0][:20]) >>> tempdir = mkdtemp(suffix = "temp") >>> with open(tempdir + '/genome1.sig', 'wt') as fp: ... save_signatures([sig1], fp)
Here, genome1.sig is a JSON file that can now be loaded and compared -- first, load it using load_one_signature:
>>> from sourmash import load_one_signature >>> loaded_sig = load_one_signature(tempdir + '/genome1.sig')
then compare:
>>> loaded_sig.jaccard(sig1) 1.0 >>> sig1.jaccard(loaded_sig) 1.0
There are two primary signature loading functions - load_one_signature, used above, which loads exactly one signature or else raises an exception; and the powerful and more generic load_file_as_signatures, which takes in a filename or directory containing a collection of signatures and returns the individual signatures -- for example, you can load all of the signatures under the tempdir created above like so,
>>> loaded_sigs = list(sourmash.load_file_as_signatures(tempdir))
Both load_file_as_signatures and load_one_signature take molecule type and k-mer size selectors, e.g.
>>> loaded_sigs = load_one_signature(tempdir + '/genome1.sig', select_moltype='DNA', ksize=31)
will load precisely one signature containing a DNA MinHash created at k-mer size of 31.
Going from signatures back to MinHash objects and their hashes -¶
Once you load a signature, you can go back to its MinHash object with .minhash; e.g.
First, load two signatures:
>>> sigfile1 = 'tests/test-data/genome-s10.fa.gz.sig' >>> sig1 = load_one_signature(sigfile1, ksize=21, select_moltype='DNA') >>> sigfile2 = 'tests/test-data/genome-s11.fa.gz.sig' >>> sig2 = load_one_signature(sigfile2, ksize=21, select_moltype='DNA')
Then, get the hashes, and (e.g.) calculate the union:
>>> hashes1 = set(sig1.minhash.hashes.keys()) >>> hashes2 = set(sig2.minhash.hashes.keys()) >>> hash_union = hashes1.union(hashes2) >>> print(f'{len(hash_union)} hashes in union of {len(hashes1)} and {len(hashes2)}') 1000 hashes in union of 500 and 500
Advanced features of sourmash MinHash objects - scaled and num¶
sourmash supports two basic kinds of signatures, MinHash and modulo hash signatures. MinHash signatures are equivalent to mash signatures; they are limited in size, and very effective for comparing genomes and other data sets that are of similar size. The key parameter for MinHash signatures is num, which specifies the maximum number of hashes to be collected for a given input data set.
>>> signum = sourmash.MinHash(n=500, ksize=31)
Because of this parameter, below we'll call them 'num' signatures.
Modulo hash (or 'scaled') signatures are specific to sourmash and they enable containment operations that are useful for metagenome analyses. The tradeoff is that unlike num MinHashes, they can become arbitrarily large. The key parameter for modulo hash signatures is scaled, which specifies the average sampling rate for hashes for a given input data set. A scaled factor of 1000 means that, on average, 1 in 1000 k-mers will be turned into a hash for later comparisons; this is a sort of compression factor, in that a 5 Mbp genome will yield approximately 5000 hash values with a scaled factor of 1000 (5000 x 1000 = 5,000,000).
>>> sigscaled = sourmash.MinHash(n=0, ksize=31, scaled=1000)
Note also that with a scaled factor of 1, the signature will contain all of the k-mers.
----
You can differentiate between num signatures and scaled signatures by looking at the num and scaled attributes on a MinHash object:
>>> signum.num 500 >>> signum.scaled 0 >>> sigscaled.num 0 >>> sigscaled.scaled 1000
The MinHash class is otherwise identical between the two types of signatures.
You cannot calculate Jaccard similarity or containment for MinHash objects with different num or scaled values (or different ksizes):
>>> signum2 = sourmash.MinHash(n=1000, ksize=31) >>> signum.jaccard(signum2) Traceback (most recent call last):
... TypeError: must have same num: 500 != 1000
However, you can make signatures compatible by downsampling; see the next sections.
A brief introduction to MinHash object methods and attributes¶
MinHash objects have the following methods and attributes:
- ksize, num, and scaled - the basic parameters used to create a MinHash object.
- hashes - retrieve all of the hashes contained in this object.
- add_sequence(seq) - hash sequence and add hash values.
- add(hash) and add_many(hashvals) - add hash values directly.
- similarity(other) - calculate Jaccard similarity with the other MinHash object.
- contained_by(other) - calculate the Jaccard containment of self by other.
- copy_and_clear() - make an empty copy of a MinHash object with the same parameters.
- __len__() - return the number of actual hash values. Note you can also do len(mh), where mh is a MinHash object.
Downsampling signatures¶
Num and scaled signatures can always be downsampled without referring back to the original data.
Let's start by loading 50kb of genomic sequence in to memory:
>>> genomes = glob.glob('data/GCF*.fna.gz') >>> genomes = list(sorted(genomes)) >>> genome = genomes[0] >>> record = next(iter(screed.open(genome))) >>> sequence = record.sequence[:50000]
Now, suppose we make a num signature limited to 1000 hashes:
>>> larger = sourmash.MinHash(n=1000, ksize=31) >>> larger.add_sequence(sequence) >>> len(larger) 1000
We can downsample this to 500 by extracting the hashes and using add_many to add them to a new MinHash like so:
>>> hashvals = larger.hashes.keys() >>> smaller = sourmash.MinHash(n=500, ksize=31) >>> smaller.add_many(hashvals) >>> len(smaller) 500
Also note that there's a convenience function that does the same thing, faster!
>>> smaller2 = larger.downsample(num=500) >>> smaller2 == smaller True
The same can be done with scaled MinHashes:
>>> large_scaled = sourmash.MinHash(n=0, ksize=31, scaled=100) >>> large_scaled.add_sequence(sequence) >>> len(large_scaled) 459 >>> small_scaled = sourmash.MinHash(n=0, ksize=31, scaled=500) >>> small_scaled.add_many(large_scaled.hashes.keys()) >>> len(small_scaled) 69
And, again, there's a convenience function that you can use:
>>> small_scaled2 = large_scaled.downsample(scaled=500) >>> small_scaled == small_scaled2 True
Converting between num and scaled signatures¶
(Beware, these are confusing techniques for working with hashes that are easy to get wrong! We suggest posting questions in the issue tracker as you go, if you are interested in exploring this area!)
The hashing function used is identical between num and scaled signatures, so the hash values themselves are compatible - it's the comparison between collections of them that doesn't work.
But, in some circumstances, num signatures can be extracted from scaled signatures, and vice versa. We haven't yet implemented a Python API for this in sourmash, but you can hack it together yourself quite easily, and a conversion utility is implemented through the command line in sourmash signature downsample.
To extract a num MinHash object from a scaled MinHash, first create or load your MinHash, and then extract the hash values:
>>> num_mh = sourmash.MinHash(n=1000, ksize=31) >>> num_mh.add_sequence(sequence) >>> hashvals = num_mh.hashes.keys()
Now, create the new scaled MinHash object and add the hashes to it:
>>> scaled_mh = sourmash.MinHash(n=0, ksize=31, scaled=10000) >>> scaled_mh.add_many(hashvals)
and you are done!
The same works in reverse, of course:
>>> scaled_mh = sourmash.MinHash(n=0, ksize=31, scaled=50) >>> scaled_mh.add_sequence(sequence) >>> hashvals = scaled_mh.hashes.keys() >>> num_mh = sourmash.MinHash(n=500, ksize=31) >>> num_mh.add_many(hashvals)
So... when can you do this extraction reliably?
You can extract num MinHashes from scaled MinHashes whenever the maximum hash value in the num MinHash is greater than or equal to the max_hash attribute of the scaled MinHash.
You can extract scaled MinHashes to num MinHashes whenever there are more hash values in the scaled MinHash than num.
Yoda sayeth: When understand these two sentences you can, use this code may you.
(You can also take a look at the logic in sourmash signature downsample if you are interested.)
Working with indexed collections of signatures¶
If you want to search large collections of signatures, sourmash provides two different indexing strategies, together with a generic Index class that supports a common API for searching the collections.
The first indexing strategy is a Sequence Bloom Tree, which is designed to support fast and efficient containment operations on large collections of signatures. SBTs are an on disk search structure, so they are a low-memory way to search collections.
To use SBTs from the command line, we first need to create some scaled signatures:
sourmash sketch dna -p scaled=10000 data/GCF*.fna.gz --outdir data/
and then build a Sequence Bloom Tree (SBT) index with sourmash index, like so:
sourmash index foo.sbt.zip data/GCF*.sig -k 31
Here, sourmash is storing the entire SBT in a single portable Zip file.
Creating an on-disk SBT in Python¶
Let's start by using 'glob' to grab some example signatures from the test data in the sourmash repository:
>>> import glob >>> input_filenames = glob.glob('tests/test-data/doctest-data/GCF*.sig')
Now, create an SBT:
>>> import sourmash.sbtmh >>> tree = sourmash.sbtmh.create_sbt_index()
Load each signature, and add it to the tree:
>>> from sourmash.sbtmh import SigLeaf >>> for filename in input_filenames: ... sig = sourmash.load_one_signature(filename, ksize=31) ... leaf = SigLeaf(sig.md5sum(), sig) ... tree.add_node(leaf)
(note, you'll need to make sure that all of the signatures are compatible with each other! The sourmash index command does all of the necessary checks, but the Python API doesn't.)
Now, save the tree:
>>> filename = tree.save(tempdir + '/test.sbt.zip')
Loading and searching SBTs¶
How do we load the SBT and search it with a DNA sequence, from within Python?
The SBT filename is test.sbt.zip, as above:
>>> SBT_filename = tempdir + '/test.sbt.zip'
and with it we can load the SBT:
>>> tree = sourmash.load_file_as_index(SBT_filename)
Now, load a DNA sequence:
>>> filename = 'data/GCF_000005845.2_ASM584v2_genomic.fna.gz' >>> query_seq = next(iter(screed.open(filename))).sequence >>> print(f'got {len(query_seq)} DNA characters to query') got 4641652 DNA characters to query
and create a scaled signature:
>>> minhash = sourmash.MinHash(ksize=31, n=0, scaled=10000) >>> minhash.add_sequence(query_seq) >>> query_sig = sourmash.SourmashSignature(minhash, name='my favorite query')
Now do a search --
>>> for similarity, found_sig, filename in tree.search(query_sig, threshold=0.1): ... print(query_sig) ... print(found_sig) ... print(similarity) my favorite query NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome 1.0
et voila!
In-memory databases: the LCA or "reverse index" database.¶
The LCA database lets you work with large collections of signatures in memory.
The LCA database was initially designed to support individual hash queries for taxonomic operations - hence its name, which stands for "Lowest Common Ancestor." However, it supports all of the standard Index operations, just like the SBT.
First, let's create an LCA database programmatically.
>>> from sourmash.lca import LCA_Database >>> db = LCA_Database(ksize=31, scaled=10000, moltype='DNA')
Now, let's load in all of the signatures from the test directory:
>>> for sig in sourmash.load_file_as_signatures('tests/test-data/doctest-data', ksize=31): ... hashes_inserted = db.insert(sig) ... print(f"Inserted {hashes_inserted} hashes into db.") Inserted 493 hashes into db. Inserted 490 hashes into db. Inserted 525 hashes into db.
and now you have an Index class that supports all the generic index operations (below). You can save an LCA Database to disk with db.save(filename), and load it with sourmash.load_file_as_index, below.
The Index class API.¶
The Index class supports a generic API for SBTs, LCAs, and other collections of signatures.
To load an SBT or an LCA database from a file, use sourmash.load_file_as_index:
>>> sbt_db = sourmash.load_file_as_index('tests/test-data/prot/protein.sbt.zip') >>> lca_db = sourmash.load_file_as_index('tests/test-data/prot/protein.lca.json.gz')
Index objects provide search, insert, load, save, and __len__. The signatures can be accessed directly via the .signatures() method, which returns an iterable. Last but not least, Index.select(ksize=..., moltype=...) will return a view on the Index object that contains only signatures with the desired k-mer size/molecule type.
AUTHOR¶
C. Titus Brown, Luiz Irber, and N. Tessa Pierce-Ward
COPYRIGHT¶
2016-2023, C. Titus Brown, Luiz Irber, and N. Tessa Pierce-Ward
January 20, 2025 | 4.8 |