NAME
vcftools - analyse VCF files
SYNOPSIS
vcftools [OPTIONS]
DESCRIPTION
The vcftools program is run from the command line. The interface is inspired by
PLINK, and so should be largely familiar to users of that package. Commands
take the following form:
vcftools --vcf file1.vcf --chr 20 --freq
The above command tells vcftools to read in the file file1.vcf, extract sites on
chromosome 20, and calculate the allele frequency at each site. The resulting
allele frequency estimates are stored in the output file, out.frq. As in the
above example, output from vcftools is mainly sent to output files, as opposed
to being shown on the screen.
Note that some commands may only be available in the latest version of vcftools.
To obtain the latest version, you should use SVN to check out the latest code,
as described on the home page.
Also note that polyploid genotypes are not currently supported.
Basic Options
- --vcf <filename>
- This option defines the VCF file to be processed. The files
need to be decompressed prior to use with vcftools. vcftools expects files
in VCF format v4.0, a specification of which can be found here.
- --gzvcf <filename>
- This option can be used in place of the --vcf option to
read compressed (gzipped) VCF files directly. Note that this option can be
quite slow when used with large files.
- --out <prefix>
- This option defines the output filename prefix for all
files generated by vcftools. For example, if <prefix> is set to
output_filename, then all output files will be of the form
output_filename.*** . If this option is omitted, all output files will
have the prefix 'out.'.
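As a hedged illustration of how the basic options combine, the sketch below reads a gzipped VCF file and writes allele frequencies under a custom prefix. The names file1.vcf.gz and chr20_freq are placeholders chosen for this example, and the command only runs if vcftools and the input file are actually present.

```shell
# Placeholder names (assumptions made for this example).
input="file1.vcf.gz"
prefix="chr20_freq"

# --freq writes its results to <prefix>.frq.
expected_output="${prefix}.frq"

if command -v vcftools >/dev/null 2>&1 && [ -f "$input" ]; then
    vcftools --gzvcf "$input" --chr 20 --freq --out "$prefix"
    echo "Allele frequencies written to ${expected_output}"
else
    echo "vcftools or ${input} not found; would have written ${expected_output}"
fi
```

Giving each run its own prefix keeps the results of different filter settings from overwriting one another.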
Site Filter Options
- --chr <chromosome>
- Only process sites with a chromosome identifier matching
<chromosome>
- --from-bp <integer>
- --to-bp <integer>
- These options define the physical range of sites that will be
processed. Sites outside of this range will be excluded. These options can
only be used in conjunction with --chr.
- --snp <string>
- Include SNP(s) with matching ID. This command can be used
multiple times in order to include more than one SNP.
- --snps <filename>
- Include a list of SNPs given in a file. The file should
contain a list of SNP IDs, with one ID per line.
- --exclude <filename>
- Exclude a list of SNPs given in a file. The file should
contain a list of SNP IDs, with one ID per line.
- --positions <filename>
- Include a set of sites on the basis of a list of positions.
Each line of the input file should contain a (tab-separated) chromosome
and position. The file should have a header line. Sites not included in
the list are excluded.
- --bed <filename>
- --exclude-bed <filename>
- Include or exclude a set of sites on the basis of a BED
file. Only the first three columns (chrom, chromStart and chromEnd) are
required. The BED file should have a header line.
- --remove-filtered-all
- --remove-filtered <string>
- --keep-filtered <string>
- These options are used to filter sites on the basis of
their FILTER flag. The first option removes all sites with a FILTER flag.
The second option can be used to exclude sites with a specific filter
flag. The third option can be used to select sites on the basis of
specific filter flags. The second and third options can be used multiple
times to specify multiple FILTERs. The --keep-filtered option is applied
before the --remove-filtered option.
- --minQ <float>
- Include only sites with Quality above this threshold.
- --min-meanDP <float>
- --max-meanDP <float>
- Include sites with mean Depth within the thresholds defined
by these options.
- --maf <float>
- --max-maf <float>
- Include only sites with Minor Allele Frequency within the
specified range.
- --non-ref-af <float>
- --max-non-ref-af <float>
- Include only sites with Non-Reference Allele Frequency
within the specified range.
- --hwe <float>
- Assesses sites for Hardy-Weinberg Equilibrium using an
exact test, as defined by Wigginton, Cutler and Abecasis (2005). Sites
with a p-value below the threshold defined by this option are taken to be
out of HWE, and therefore excluded.
- --geno <float>
- Exclude sites on the basis of the proportion of missing
data (defined to be between 0 and 1).
- --min-alleles <int>
- --max-alleles <int>
- Include only sites with a number of alleles within the
specified range. For example, to include only bi-allelic sites, one could
use:
vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2
- --mask <filename>
- --invert-mask <filename>
- --mask-min <integer>
- Include sites on the basis of a FASTA-like file. The
provided file contains a sequence of integer digits (between 0 and 9) for
each position on a chromosome that specify if a site at that position
should be filtered or not. An example mask file would look like:
>1
0000011111222...
In this example, sites in the VCF file located within the first 5 bases of
the start of chromosome 1 would be kept, whereas sites at position 6
onwards would be filtered out. The threshold integer that determines if
sites are filtered or not is set using the --mask-min option, which
defaults to 0. The chromosomes contained in the mask file must be sorted
in the same order as the VCF file. The --mask option is used to specify
the mask file to be used, whereas the --invert-mask option can be used to
specify a mask file that will be inverted before being applied.
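The site filters above can be combined in a single command. The following is a minimal sketch, assuming a hypothetical site list called keep_positions.txt and an input file called file1.vcf. It builds the positions file in the layout described for --positions (a header line, then tab-separated chromosome and position) and applies it together with the allele-count and quality filters. The positions and the quality threshold are illustrative values only.

```shell
# Hypothetical site list: a header line, then tab-separated chromosome and position.
printf 'CHROM\tPOS\n20\t14370\n20\t17330\n20\t1110696\n' > keep_positions.txt

if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    # Keep only listed sites that are bi-allelic and have Quality above 30,
    # and record which sites were kept or removed.
    vcftools --vcf file1.vcf --positions keep_positions.txt \
        --min-alleles 2 --max-alleles 2 --minQ 30 \
        --filtered-sites --out listed_sites
fi
```

With --filtered-sites, the kept and removed sites should appear in listed_sites.kept.sites and listed_sites.removed.sites, which makes it easy to check that the filters did what was intended.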
Individual Filters
- --indv <string>
- Specify an individual to be kept in the analysis. This
option can be used multiple times to specify multiple individuals.
- --keep <filename>
- Provide a file containing a list of individuals to include
in subsequent analysis. Each individual ID (as defined in the VCF
header line) should be included on a separate line.
- --remove-indv <string>
- Specify an individual to be removed from the analysis. This
option can be used multiple times to specify multiple individuals. If the
--indv option is also specified, then the --indv option is executed before
the --remove-indv option.
- --remove <filename>
- Provide a file containing a list of individuals to exclude
in subsequent analysis. Each individual ID (as defined in the VCF
header line) should be included on a separate line. If both the --keep and
the --remove options are used, then the --keep option is executed before
the --remove option.
- --min-indv-meanDP <float>
- --max-indv-meanDP <float>
- Calculate the mean coverage on a per-individual basis. Only
individuals with coverage within the range specified by these options are
included in subsequent analyses.
- --mind <float>
- Specify the minimum call rate threshold for each
individual.
- --phased
- First excludes all individuals having all genotypes
unphased, and subsequently excludes all sites with unphased genotypes. The
remaining data therefore consists of phased data only.
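The ordering of --keep and --remove described above can be sketched as follows. The sample IDs and file names are hypothetical. The grep step only mimics the documented order (keep first, then remove) to preview which individuals should survive; it is not how vcftools itself is implemented.

```shell
# Hypothetical sample IDs; real IDs must match those in the VCF header line.
printf '%s\n' NA00001 NA00002 NA00003 > keep_samples.txt
printf '%s\n' NA00003 > remove_samples.txt

# --keep is executed before --remove, so the expected survivors are the
# individuals in the keep list that are not in the remove list.
grep -v -x -F -f remove_samples.txt keep_samples.txt > expected_samples.txt
cat expected_samples.txt

if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    # Report mean depth for the retained individuals only.
    vcftools --vcf file1.vcf --keep keep_samples.txt --remove remove_samples.txt \
        --depth --out kept_samples
fi
```

Comparing expected_samples.txt with the individuals listed in the resulting '.idepth' file is a quick way to confirm that the intended subset was analysed.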
Genotype Filters
- --remove-filtered-geno-all
- --remove-filtered-geno <string>
- The first option removes all genotypes with a FILTER flag.
The second option can be used to exclude genotypes with a specific filter
flag.
- --minGQ <float>
- Exclude all genotypes with a quality below the threshold
specified by this option (GQ).
- --minDP <float>
- Exclude all genotypes with a sequencing depth below that
specified by this option (DP).
Output Statistics
- --freq
- --counts
- --freq2
- --counts2
- Output per-site frequency information. The --freq option outputs
the allele frequency in a file with the suffix '.frq'. The --counts option
outputs a similar file with the suffix '.frq.count', that contains the raw
allele counts at each site. The --freq2 and --counts2 options are used to
suppress allele information in the output file. In this case, the order of
the freqs/counts depends on the numbering in the VCF file.
- --depth
- Generates a file containing the mean depth per individual.
This file has the suffix '.idepth'.
- --site-depth
- --site-mean-depth
- Generates a file containing the depth per site. The
--site-depth option outputs the depth for each site summed across
individuals. This file has the suffix '.ldepth'. Likewise, the
--site-mean-depth option outputs the mean depth for each site, and the output
file has the suffix '.ldepth.mean'.
- --geno-depth
- Generates a (possibly very large) file containing the depth
for each genotype in the VCF file. Missing entries are given the value -1.
The file has the suffix '.gdepth'.
- --site-quality
- Generates a file containing the per-site SNP quality, as
found in the QUAL column of the VCF file. This file has the suffix
'.lqual'.
- --het
- Calculates a measure of heterozygosity on a per-individual
basis. Specifically, the inbreeding coefficient, F, is estimated for each
individual using a method of moments. The resulting file has the suffix
'.het'.
- --hardy
- Reports a p-value for each site from a Hardy-Weinberg
Equilibrium test (as defined by Wigginton, Cutler and Abecasis (2005)).
The resulting file (with suffix '.hwe') also contains the Observed numbers
of Homozygotes and Heterozygotes and the corresponding Expected numbers
under HWE.
- --missing
- Generates two files reporting the missingness on a
per-individual and per-site basis. The two files have suffixes '.imiss'
and '.lmiss' respectively.
- --hap-r2
- --geno-r2
- --ld-window <int>
- --ld-window-bp <int>
- --min-r2 <float>
- These options are used to report Linkage Disequilibrium
(LD) statistics as summarised by the r2 statistic. The --hap-r2 option
informs vcftools to output a file reporting the r2 statistic using phased
haplotypes. This is the traditional measure of LD often reported in the
population genetics literature. If phased haplotypes are unavailable then
the --geno-r2 option may be used, which calculates the squared correlation
coefficient between genotypes encoded as 0, 1 and 2 to represent the
number of non-reference alleles in each individual. This is the same as
the LD measure reported by PLINK. The haplotype version outputs a file
with the suffix '.hap.ld', whereas the genotype version outputs a file
with the suffix '.geno.ld'. The haplotype version implies the option
--phased.
The --ld-window option defines the maximum SNP separation for the
calculation of LD. Likewise, the --ld-window-bp option can be used to
define the maximum physical separation of SNPs included in the LD
calculation. Finally, the --min-r2 option sets a minimum value for r2 below which
the LD statistic is not reported.
- --SNPdensity <int>
- Calculates the number and density of SNPs in bins of size
defined by this option. The resulting output file has the suffix
'.snpden'.
- --TsTv <int>
- Calculates the Transition / Transversion ratio in bins of
size defined by this option. The resulting output file has the suffix
'.TsTv'. A summary is also supplied in a file with the suffix
'.TsTv.summary'.
- --FILTER-summary
- Generates a summary of the number of SNPs and Ts/Tv ratio
for each FILTER category. The output file has the suffix
'.FILTER.summary'.
- --filtered-sites
- Creates two files listing sites that have been kept or
removed after filtering. The first file, with suffix '.kept.sites', lists
sites kept by vcftools after filters have been applied. The second file,
with the suffix '.removed.sites', lists sites removed by the applied
filters.
- --singletons
- This option will generate a file detailing the location of
singletons, and the individual they occur in. The file reports both true
singletons, and private doubletons (i.e. SNPs where the minor allele only
occurs in a single individual and that individual is homozygous for that
allele). The output file has the suffix '.singletons'.
- --site-pi
- --window-pi <int>
- These options are used to estimate levels of nucleotide
diversity. The first option does this on a per-site basis, and the output
file has the suffix '.sites.pi'. The second option calculates the
nucleotide diversity in windows, with the window size defined in the
option argument. Output for this option has the suffix '.windowed.pi'. The
windowed version requires phased data, and hence use of this option
implies the --phased option.
- --012
- This option outputs the genotypes as a large matrix. Three
files are produced. The first, with suffix '.012', contains the genotypes
of each individual on a separate line. Genotypes are represented as 0, 1
and 2, where the number represents the number of non-reference alleles.
Missing genotypes are represented by -1. The second file, with suffix
'.012.indv' details the individuals included in the main file. The third
file, with suffix '.012.pos' details the site locations included in the
main file.
- --IMPUTE
- This option outputs phased haplotypes in IMPUTE
reference-panel format. As IMPUTE requires phased data, using this option
also implies --phased. Unphased individuals and genotypes are therefore
excluded. Only bi-allelic sites are included in the output. Using this
option generates three files. The IMPUTE haplotype file has the suffix
'.impute.hap', and the IMPUTE legend file has the suffix
'.impute.hap.legend'. The third file, with suffix '.impute.hap.indv',
details the individuals included in the haplotype file, although this file
is not needed by IMPUTE.
- --ldhat
- --ldhat-geno
- These options output data in LDhat format. Use of these
options also requires the --chr option to be used. The --ldhat option
outputs phased data only, and therefore also implies --phased, leading to
unphased individuals and genotypes being excluded. Alternatively, the
--ldhat-geno option treats all of the data as unphased, and therefore
outputs LDhat files in genotype/unphased format. In either case, two files
are generated with the suffixes '.ldhat.sites' and '.ldhat.locs', which
correspond to the LDhat 'sites' and 'locs' input files respectively.
- --BEAGLE-GL
- This option outputs genotype likelihood information for
input into the BEAGLE program. This option requires the VCF file to
contain the FORMAT GL tag, which can generally be output by SNP callers
such as the GATK. Use of this option requires a chromosome to be specified
via the --chr option. The resulting output file (with the suffix
'.BEAGLE.GL') contains genotype likelihoods for biallelic sites, and is
suitable for input into BEAGLE via the 'like=' argument.
- --plink
- This option outputs the genotype data in PLINK PED format.
Two files are generated, with suffixes '.ped' and '.map'. Note that only
bi-allelic loci will be output. Further details of these files can be
found in the PLINK documentation.
Note: This option can be very slow on large datasets. Using the --chr option
to divide up the dataset is advised.
- --plink-tped
- The --plink option above can be extremely slow on large
datasets. An alternative that might be considerably quicker is to output
in the PLINK transposed format. This can be achieved using the
--plink-tped option, which produces two files with suffixes '.tped' and
'.tfam'.
- --recode
- The --recode option is used to generate a VCF file from the
input VCF file having applied the options specified by the user. The
output file has the suffix '.recode.vcf'.
By default, the INFO fields are removed from the output file, as the INFO
values may be invalidated by the recoding (e.g. the total depth may need
to be recalculated if individuals are removed). This default functionality
can be overridden by using the --keep-INFO <string> option, where
<string> defines the INFO key to keep in the output file. The
--keep-INFO flag can be used multiple times. Alternatively, the option
--keep-INFO-all can be used to retain all INFO fields.
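To make the --geno-r2 statistic described earlier in this section more concrete, the following sketch computes the squared correlation coefficient between two genotype vectors coded as 0, 1 and 2. The genotypes are made-up values for five individuals, and the awk code is an independent illustration of the formula, not code taken from vcftools.

```shell
# Hypothetical genotypes for five individuals at two sites, coded as the
# number of non-reference alleles (0, 1 or 2).
site_a="0 1 2 1 0"
site_b="0 1 2 2 0"

# Squared Pearson correlation between the two genotype vectors.
# Note: the denominator is zero if either site shows no variation.
r2=$(awk -v a="$site_a" -v b="$site_b" 'BEGIN {
    n = split(a, x, " "); split(b, y, " ")
    for (i = 1; i <= n; i++) {
        sx += x[i]; sy += y[i]
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i]
    }
    num = n * sxy - sx * sy
    den = (n * sxx - sx * sx) * (n * syy - sy * sy)
    printf "%.4f", (num * num) / den
}')
echo "r2 = ${r2}"   # prints: r2 = 0.8036
```

For these values the numerator is 15 squared (225) and the denominator is 14 times 20 (280), giving 225 / 280, or roughly 0.80. A pair of sites with this value would be reported under a setting such as --min-r2 0.5 but suppressed under --min-r2 0.9.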
Miscellaneous
- --extract-FORMAT-info <string>
- Extract information from the genotype fields in the VCF
file relating to a specified FORMAT identifier. For example, using the
option '--extract-FORMAT-info GT' would extract all of the GT (i.e.
Genotype) entries. The resulting output file has the suffix
'.<FORMAT_ID>.FORMAT'.
- --get-INFO <string>
- This option is used to extract information from the INFO
field in the VCF file. The <string> argument specifies the INFO tag
to be extracted, and the option can be used multiple times in order to
extract multiple INFO entries. The resulting file, with suffix '.INFO',
contains the required INFO information in a tab-separated table. For
example, to extract the NS and DB flags, one would use the command:
vcftools --vcf file1.vcf --get-INFO NS --get-INFO DB
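A comparable sketch for --extract-FORMAT-info is given below. It assumes an input file called file1.vcf and no --out option, so the default prefix applies; the variable names are introduced only for this example.

```shell
prefix="out"      # default prefix used when --out is omitted
format_id="GT"    # FORMAT identifier to extract

# The output suffix is '.<FORMAT_ID>.FORMAT'.
expected_output="${prefix}.${format_id}.FORMAT"

if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --extract-FORMAT-info "$format_id"
fi
echo "Genotype entries are expected in ${expected_output}"
```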
VCF File Comparison Options
The file comparison options are currently in a state of flux and likely buggy.
If you find a bug, please report it. Note that genotype-level filters are not
supported in these options.
- --diff <filename>
- --gzdiff <filename>
- Select a VCF file for comparison with the file specified by
the --vcf option. Outputs two files describing the sites and individuals
common / unique to each file. These files have the suffixes
'.diff.sites_in_files' and '.diff.indv_in_files' respectively. The
--gzdiff version can be used to read compressed VCF files.
- --diff-site-discordance
- Used in conjunction with the --diff option to calculate
discordance on a site by site basis. The resulting output file has the
suffix '.diff.sites'.
- --diff-indv-discordance
- Used in conjunction with the --diff option to calculate
discordance on a per-individual basis. The resulting output file has the
suffix '.diff.indv'.
- --diff-discordance-matrix
- Used in conjunction with the --diff option to calculate a
discordance matrix. This option only works with bi-allelic loci with
matching alleles that are present in both files. The resulting output file
has the suffix '.diff.discordance.matrix'.
- --diff-switch-error
- Used in conjunction with the --diff option to calculate
phasing errors (specifically 'switch errors'). This option generates two
output files describing switch errors found between sites, and the average
switch error per individual. These two files have the suffixes
'.diff.switch' and '.diff.indv.switch' respectively.
Options still in development
The following options are yet to be finalised, are likely to contain bugs, and
are likely to change in the future.
- --fst <filename>
- --gzfst <filename>
- Calculate FST for a pair of VCF files, with the second file
being specified by this option. FST is currently calculated using the
formula described in the supplementary material of the Phase I HapMap
paper. Currently, only pairwise FST calculations are supported, although
this will likely change in the future. The --gzfst option can be used to
read compressed VCF files.
- --LROH
- Identify Long Runs of Homozygosity.
- --relatedness
- Output Individual Relatedness Statistics.