vcftools(1) | vcftools man page | vcftools(1) |
NAME
vcftools - Utilities for the variant call format (VCF) and binary variant call format (BCF)
SYNOPSIS
vcftools [ --vcf FILE | --gzvcf FILE | --bcf FILE] [ --out OUTPUT PREFIX ] [ FILTERING OPTIONS ] [ OUTPUT OPTIONS ]
DESCRIPTION
vcftools is a suite of functions for use on genetic variation data in the form of VCF and BCF files. The tools provided are used mainly to summarize data, run calculations on data, filter out data, and convert data into other useful file formats.
EXAMPLES
Output allele frequency for all sites in the input vcf file from chromosome 1
Output a new vcf file from the input vcf file that removes any indel sites
Output file comparing the sites in two vcf files
Output a new vcf file to standard out without any sites that have a filter tag, then compress it with gzip
Output a Hardy-Weinberg p-value for every site in the bcf file that does not have any missing genotypes
Output nucleotide diversity at a list of positions
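As a hedged illustration, most of the tasks listed above might be expressed as the command lines below, assembled here in Python so they can be inspected before being run. All input and output file names are hypothetical placeholders, and each flag should be verified against the installed vcftools version. Nothing is executed; the sketch only prints the candidate commands.

```python
import shlex

# Hypothetical file names; each entry pairs a task with a candidate argv list.
EXAMPLE_COMMANDS = {
    "allele frequency on chromosome 1": [
        "vcftools", "--gzvcf", "input_file.vcf.gz",
        "--freq", "--chr", "1", "--out", "chr1_analysis",
    ],
    "new VCF without indel sites": [
        "vcftools", "--vcf", "input_file.vcf",
        "--remove-indels", "--recode", "--recode-INFO-all", "--out", "SNPs_only",
    ],
    "compare sites in two VCF files": [
        "vcftools", "--gzvcf", "input_file1.vcf.gz",
        "--gzdiff", "input_file2.vcf.gz", "--diff-site", "--out", "in1_v_in2",
    ],
    "Hardy-Weinberg p-values, no missing genotypes": [
        "vcftools", "--bcf", "input_file.bcf",
        "--hardy", "--max-missing", "1.0", "--out", "output_noMissing",
    ],
    "nucleotide diversity at listed positions": [
        "vcftools", "--gzvcf", "input_file.vcf.gz",
        "--site-pi", "--positions", "SNP_list.txt", "--out", "nucleotide_diversity",
    ],
}


def as_shell(argv):
    """Render an argv list as a single shell-quoted command line."""
    return shlex.join(argv)


for task, argv in EXAMPLE_COMMANDS.items():
    print(f"# {task}")
    print(as_shell(argv))
```

Keeping each command as an argument list rather than a single string means it could later be handed to subprocess.run without extra quoting.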
BASIC OPTIONS
These options are used to specify the input and output files.
INPUT FILE OPTIONS
--vcf <input_filename>
--gzvcf <input_filename>
--bcf <input_filename>
OUTPUT FILE OPTIONS
--out <output_prefix>
--stdout
-c
--temp <temporary_directory>
SITE FILTERING OPTIONS
These options are used to include or exclude certain sites from any analysis being performed by the program.
POSITION FILTERING
--chr <chromosome>
--not-chr <chromosome>
--from-bp <integer>
--to-bp <integer>
--positions <filename>
--exclude-positions <filename>
--positions-overlap <filename>
--exclude-positions-overlap <filename>
--bed <filename>
--exclude-bed <filename>
--thin <integer>
--mask <filename>
--invert-mask <filename>
--mask-min <integer>
An example mask file would look like:
>1
0000011111222...
>2
2222211111000...
The "--invert-mask" option takes a mask file in the same format as the "--mask" option, but inverts the mask before filtering with it.
The "--mask-min" option specifies a threshold mask value between 0 and 9 to filter positions by. The default threshold is 0, meaning only sites with that value or lower will be kept.
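The threshold rule can be made concrete with a minimal sketch, which is not vcftools code. It assumes a FASTA-like mask in which each ">" line names a chromosome and the digit lines that follow give one value per base, indexed from position 1. The helper names parse_mask and site_is_kept are hypothetical, and dropping positions beyond the end of the mask is an assumption of this sketch. Mask inversion is deliberately left out.

```python
def parse_mask(text):
    """Parse FASTA-like mask text into {chromosome: digit string}."""
    masks = {}
    chrom = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            chrom = line[1:]
            masks[chrom] = ""
        elif chrom is not None:
            masks[chrom] += line
    return masks


def site_is_kept(masks, chrom, pos, mask_min=0):
    """Keep a 1-based position when its mask digit is at or below mask_min."""
    digits = masks.get(chrom, "")
    if pos < 1 or pos > len(digits):
        return False  # assumption: positions not covered by the mask are dropped
    return int(digits[pos - 1]) <= mask_min


# A short, complete mask (no elided bases) for two chromosomes.
example = ">1\n0000011111222\n>2\n2222211111000\n"
masks = parse_mask(example)

print(site_is_kept(masks, "1", 3))               # mask digit 0, threshold 0
print(site_is_kept(masks, "1", 7))               # mask digit 1, threshold 0
print(site_is_kept(masks, "1", 7, mask_min=1))   # mask digit 1, threshold 1
print(site_is_kept(masks, "2", 12))              # mask digit 0, threshold 0
```

With the default threshold of 0 only bases marked 0 survive; raising the threshold to 1 also admits bases marked 1.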
SITE ID FILTERING
--snps <filename>
--exclude <filename>
VARIANT TYPE FILTERING
--remove-indels
FILTER FLAG FILTERING
--keep-filtered <string>
--remove-filtered <string>
INFO FIELD FILTERING
--remove-INFO <string>
ALLELE FILTERING
--max-maf <float>
--non-ref-af <float>
--max-non-ref-af <float>
--non-ref-ac <integer>
--max-non-ref-ac <integer>
--non-ref-af-any <float>
--max-non-ref-af-any <float>
--non-ref-ac-any <integer>
--max-non-ref-ac-any <integer>
--mac <integer>
--max-mac <integer>
--min-alleles <integer>
--max-alleles <integer>
For example, to include only bi-allelic sites, one could set both --min-alleles and --max-alleles to 2.
GENOTYPE VALUE FILTERING
--max-meanDP <float>
--hwe <float>
--max-missing <float>
--max-missing-count <integer>
--phased
MISCELLANEOUS FILTERING
INDIVIDUAL FILTERING OPTIONS
These options are used to include or exclude certain individuals from any analysis being performed by the program.
--remove-indv <string>
--keep <filename>
--remove <filename>
--max-indv <integer>
GENOTYPE FILTERING OPTIONS
These options are used to exclude genotypes from any analysis being performed by the program. If excluded, these values will be treated as missing.
--remove-filtered-geno <string>
--minGQ <float>
--minDP <float>
--maxDP <float>
OUTPUT OPTIONS
These options specify which analyses or conversions to perform on the data that passed through all specified filters.
OUTPUT ALLELE STATISTICS
--freq2
--counts
--counts2
--derived
OUTPUT DEPTH STATISTICS
--site-depth
--site-mean-depth
--geno-depth
OUTPUT LD STATISTICS
--geno-r2
--geno-chisq
--hap-r2-positions <positions list file>
--geno-r2-positions <positions list file>
--ld-window <integer>
--ld-window-bp <integer>
--ld-window-min <integer>
--ld-window-bp-min <integer>
--min-r2 <float>
--interchrom-hap-r2
--interchrom-geno-r2
OUTPUT TRANSITION/TRANSVERSION STATISTICS
--TsTv-summary
--TsTv-by-count
--TsTv-by-qual
--FILTER-summary
OUTPUT NUCLEOTIDE DIVERGENCE STATISTICS
--window-pi <integer>
--window-pi-step <integer>
OUTPUT FST STATISTICS
--fst-window-size <integer>
--fst-window-step <integer>
OUTPUT OTHER STATISTICS
--hardy
--TajimaD <integer>
--indv-freq-burden
--LROH
--relatedness
--relatedness2
--site-quality
--missing-indv
--missing-site
--SNPdensity <integer>
--kept-sites
--removed-sites
--singletons
--hist-indel-len
--hapcount <BED file>
--mendel <PED file>
--extract-FORMAT-info <string>
--get-INFO <string>
OUTPUT VCF FORMAT
--recode-bcf
--recode-INFO <string>
--recode-INFO-all
--contigs <string>
OUTPUT OTHER FORMATS
--IMPUTE
--ldhat
--ldhelmet
--ldhat-geno
--BEAGLE-GL
--BEAGLE-PL
--plink
--plink-tped
--chrom-map
Note: The --plink option can be very slow on large datasets. Dividing up the dataset with the --chr option is advised; alternatively, use the --plink-tped option, which outputs the files in the PLINK transposed format with suffixes ".tped" and ".tfam".
For usage with variant sites in species other than humans, the --chrom-map option may be used to specify the name of a file containing a tab-delimited mapping of chromosome name to a desired integer value, with one line per chromosome. This file must contain a mapping for every chromosome value found in the input variant file.
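A chromosome map of this kind could be generated with a short script. The sketch below is only an illustration: the chromosome names are hypothetical, the helper names are not part of vcftools, and numbering chromosomes consecutively from 1 in first-seen order is an arbitrary choice made here.

```python
import os
import tempfile


def build_chrom_map(chromosomes):
    """Assign consecutive integers, starting at 1, in first-seen order."""
    mapping = {}
    for name in chromosomes:
        if name not in mapping:
            mapping[name] = len(mapping) + 1
    return mapping


def write_chrom_map(mapping, path):
    """Write one 'name<TAB>integer' line per chromosome."""
    with open(path, "w") as handle:
        for name, number in mapping.items():
            handle.write(f"{name}\t{number}\n")


def unmapped(chromosomes, mapping):
    """Return chromosome names that have no entry in the mapping."""
    return sorted(set(chromosomes) - set(mapping))


# Hypothetical chromosome names from a non-human dataset.
names = ["scaffold_1", "scaffold_2", "scaffold_1", "chrZ"]
mapping = build_chrom_map(names)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chrom_map.txt")
    write_chrom_map(mapping, path)
    with open(path) as handle:
        written = handle.read()

print(written, end="")
```

The unmapped helper gives a quick check that every chromosome in the variant file has an entry before the map is passed to --chrom-map.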
COMPARISON OPTIONS
These options are used to compare the original variant file to another variant file and output the results. All of the diff functions require that both files contain the same chromosomes and that the files be sorted in the same order. If one of the files contains chromosomes that the other file does not, use the --not-chr filter to remove them from the analysis.
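Before running a comparison, it can help to work out which chromosomes appear in only one of the two files. The following sketch assumes uncompressed VCF text, where header lines start with "#" and the first tab-separated column of each record is the chromosome. The function names are hypothetical, and the result is a list of --not-chr arguments that could be appended to a vcftools command line.

```python
def chromosomes_in_order(vcf_lines):
    """Return chromosome names in first-seen order, skipping header lines."""
    seen = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom = line.split("\t", 1)[0]
        if chrom not in seen:
            seen.append(chrom)
    return seen


def not_chr_arguments(first_vcf_lines, second_vcf_lines):
    """Build --not-chr arguments for chromosomes found in only one file."""
    first = chromosomes_in_order(first_vcf_lines)
    second = chromosomes_in_order(second_vcf_lines)
    unshared = [c for c in first if c not in second]
    unshared += [c for c in second if c not in first]
    args = []
    for chrom in unshared:
        args += ["--not-chr", chrom]
    return args


# Two tiny, made-up VCF bodies; only the CHROM column matters here.
vcf_a = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "2\t200\t.\tC\tT\t.\tPASS\t.",
    "X\t300\t.\tG\tA\t.\tPASS\t.",
]
vcf_b = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "2\t250\t.\tC\tT\t.\tPASS\t.",
    "MT\t50\t.\tT\tC\t.\tPASS\t.",
]

print(not_chr_arguments(vcf_a, vcf_b))
```

This only addresses the shared-chromosome requirement; whether both files are sorted in the same order still has to be checked separately.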
DIFF VCF FILE
--gzdiff <filename>
--diff-bcf <filename>
DIFF OPTIONS
--diff-indv
--diff-site-discordance
--diff-indv-discordance
--diff-indv-map <filename>
--diff-discordance-matrix
--diff-switch-error
AUTHORS
Adam Auton
Anthony Marcketta
2 August 2018 | 0.1.16 |