- stretch 1.3.1-1+b1
- testing 1.9-1
- stretch-backports 1.8-1~bpo9+1
- unstable 1.9-1
BCFTOOLS(1) | BCFTOOLS(1) |
NAME¶
bcftools - utilities for variant calling and manipulating VCFs and BCFs.SYNOPSIS¶
bcftools [--version|--version-only] [--help] [COMMAND] [OPTIONS]DESCRIPTION¶
BCFtools is a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF. All commands work transparently with both VCFs and BCFs, both uncompressed and BGZF-compressed.Most commands accept VCF, bgzipped VCF and BCF with filetype detected automatically even when streaming from a pipe. Indexed VCF and BCF will work in all situations. Un-indexed VCF and BCF and streams will work in most, but not all situations. In general, whenever multiple VCFs are read simultaneously, they must be indexed and therefore also compressed.
BCFtools is designed to work on a stream. It regards an input file "-" as the standard input (stdin) and outputs to the standard output (stdout). Several commands can thus be combined with Unix pipes.
VERSION¶
This manual page was last updated 2018-04-03 and refers to bcftools git version 1.8.BCF1¶
The BCF1 format output by versions of samtools <= 0.1.19 is not compatible with this version of bcftools. To read BCF1 files one can use the view command from old versions of bcftools packaged with samtools versions <= 0.1.19 to convert to VCF, which can then be read by this version of bcftools.samtools-0.1.19/bcftools/bcftools view file.bcf1 | bcftools view
VARIANT CALLING¶
See bcftools call for variant calling from the output of the samtools mpileup command. In versions of samtools <= 0.1.19 calling was done with bcftools view. Users are now required to choose between the old samtools calling model (-c/--consensus-caller) and the new multiallelic calling model (-m/--multiallelic-caller). The multiallelic calling model is recommended for most tasks.LIST OF COMMANDS¶
For a full list of available commands, run bcftools without arguments. For a full list of available options, run bcftools COMMAND without arguments.LIST OF SCRIPTS¶
Some helper scripts are bundled with the bcftools code.COMMANDS AND OPTIONS¶
Common Options¶
The following options are common to many bcftools commands. See usage for specific commands to see if they apply.FILE
-c, --collapse snps|indels|both|all|some|none|id
none
some
all
snps
indels
both
id
-f, --apply-filters LIST
--no-version
-o, --output FILE
-O, --output-type b|u|z|v
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file FILE
-s, --samples [^]LIST
bcftools view -Ou -s sample1,sample2 file.vcf | bcftools query -f %INFO/AC\t%INFO/AN\n
-S, --samples-file FILE
sample1 1 sample2 2 sample3 2 or sample1 M sample2 F sample3 F or a .ped file (here is shown a minimum working example, the first column is ignored and the last indicates sex: 1=male, 2=female) ignored daughterA fatherA motherA 2 ignored sonB fatherB motherB 1
-t, --targets [^]chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file [^]FILE
bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' file.vcf | bgzip -c > als.tsv.gz && tabix -s1 -b2 -e2 als.tsv.gz
--threads INT
bcftools annotate [OPTIONS] FILE¶
Add or remove annotations.-a, --annotations file
# Sample annotation file with columns CHROM, POS, STRING_TAG, NUMERIC_TAG 1 752566 SomeString 5 1 798959 SomeOtherString 6 # etc.
--collapse snps|indels|both|all|some|none
-c, --columns list
If the annotation file is a VCF/BCF, only the edited columns/tags must be present and their order does not matter. The columns ID, QUAL, FILTER, INFO and FORMAT can be edited, where INFO tags can be written both as "INFO/TAG" or simply "TAG", and FORMAT tags can be written as "FORMAT/TAG" or "FMT/TAG". The imported VCF annotations can be renamed as "DST_TAG:=SRC_TAG" or "FMT/DST_TAG:=FMT/SRC_TAG".
To carry over all INFO annotations, use "INFO". To add all INFO annotations except "TAG", use "^INFO/TAG". By default, existing values are replaced.
To add annotations without overwriting existing values (that is, to add missing tags or add values to existing tags with missing values), use "+TAG" instead of "TAG". To append to existing values (rather than replacing or leaving untouched), use "=TAG" (instead of "TAG" or "+TAG"). To replace only existing values without modifying missing annotations, use "-TAG".
If the annotation file is not a VCF/BCF, all new annotations must be defined via -h, --header-lines.
-e, --exclude EXPRESSION
-h, --header-lines file
##INFO=<ID=NUMERIC_TAG,Number=1,Type=Integer,Description="Example header line"> ##INFO=<ID=STRING_TAG,Number=1,Type=String,Description="Yet another header line">
-I, --set-id [+]FORMAT
bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' file.vcf
-i, --include EXPRESSION
-k, --keep-sites
-m, --mark-sites TAG
--no-version
-o, --output FILE
-O, --output-type b|u|z|v
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--rename-chrs file
-s, --samples [^]LIST
-S, --samples-file FILE
--threads INT
-x, --remove list
Examples:
# Remove three fields bcftools annotate -x ID,INFO/DP,FORMAT/DP file.vcf.gz # Remove all INFO fields and all FORMAT fields except for GT and PL bcftools annotate -x INFO,^FORMAT/GT,FORMAT/PL file.vcf # Add ID, QUAL and INFO/TAG, not replacing TAG if already present bcftools annotate -a src.bcf -c ID,QUAL,+TAG dst.bcf # Carry over all INFO and FORMAT annotations except FORMAT/GT bcftools annotate -a src.bcf -c INFO,^FORMAT/GT dst.bcf # Annotate from a tab-delimited file with six columns (the fifth is ignored), # first indexing with tabix. The coordinates are 1-based. tabix -s1 -b2 -e2 annots.tab.gz bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,POS,REF,ALT,-,TAG file.vcf # Annotate from a tab-delimited file with regions (1-based coordinates, inclusive) tabix -s1 -b2 -e3 annots.tab.gz bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,TAG inut.vcf # Annotate from a bed file (0-based coordinates, half-closed, half-open intervals) bcftools annotate -a annots.bed.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf
bcftools call [OPTIONS] FILE¶
This command replaces the former bcftools view caller. Some of the original functionality has been temporarily lost in the process of transition under htslib, but will be added back on popular demand. The original calling model can be invoked with the -c option.File format options:
--no-version
-o, --output FILE
-O, --output-type b|u|z|v
--ploidy ASSEMBLY[?]
--ploidy-file FILE
X 1 60000 M 1 X 2699521 154931043 M 1 Y 1 59373566 M 1 Y 1 59373566 F 0 MT 1 16569 M 1 MT 1 16569 F 1 * * * M 2 * * * F 2
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
-s, --samples LIST
-S, --samples-file FILE
-t, --targets LIST
-T, --targets-file FILE
--threads INT
Input/output options:
-A, --keep-alts
-f, --format-fields list
-F, --prior-freqs AN,AC
# Extract AN,AC values from an existing VCF, such 1000Genomes bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' 1000Genomes.bcf | bgzip -c > AFs.tab.gz # If the tags AN,AC are not already present, use the +fill-AN-AC plugin bcftools +fill-AN-AC 1000Genomes.bcf | bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' | bgzip -c > AFs.tab.gz tabix -s1 -b2 -e2 AFs.tab.gz # Create a VCF header description, here we name the tags REF_AN,REF_AC cat AFs.hdr ##INFO=<ID=REF_AN,Number=1,Type=Integer,Description="Total number of alleles in reference genotypes"> ##INFO=<ID=REF_AC,Number=A,Type=Integer,Description="Allele count in reference genotypes for each ALT allele"> # Now before calling, stream the raw mpileup output through `bcftools annotate` to add the frequencies bcftools mpileup [...] -Ou | bcftools annotate -a AFs.tab.gz -h AFs.hdr -c CHROM,POS,REF,ALT,REF_AN,REF_AC -Ou | bcftools call -mv -F REF_AN,REF_AC [...]
-g, --gvcf INT
-i, --insert-missed INT
-M, --keep-masked-ref
-V, --skip-variants snps|indels
-v, --variants-only
Consensus/variant calling options:
-c, --consensus-caller
-C, --constrain alleles|trio
alleles
trio
-m, --multiallelic-caller
-n, --novel-rate float[,...]
-p, --pval-threshold float
-P, --prior float
-t, --targets file|chr|chr:pos|chr:from-to|chr:from-[,...]
-X, --chromosome-X
-Y, --chromosome-Y
bcftools cnv [OPTIONS] FILE¶
Copy number variation caller, requires a VCF annotated with the Illumina’s B-allele frequency (BAF) and Log R Ratio intensity (LRR) values. The HMM considers the following copy number states: CN 2 (normal), 1 (single-copy loss), 0 (complete loss), 3 (single-copy gain).General Options:
-c, --control-sample string
-f, --AF-file file
-o, --output-dir path
-p, --plot-threshold float
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
-s, --query-sample string
-t, --targets LIST
-T, --targets-file FILE
HMM Options:
-a, --aberrant float[,float]
-b, --BAF-weight float
-d, --BAF-dev float[,float]
-e, --err-prob float
-l, --LRR-weight float
-L, --LRR-smooth-win int
-O, --optimize float
-P, --same-prob float
-x, --xy-prob float
bcftools concat [OPTIONS] FILE1 FILE2 [...]¶
Concatenate or combine VCF/BCF files. All source files must have the same sample columns appearing in the same order. Can be used, for example, to concatenate chromosome VCFs into one VCF, or combine a SNP VCF and an indel VCF into one. The input files must be sorted by chr and position. The files must be given in the correct order to produce sorted VCF on output unless the -a, --allow-overlaps option is specified. With the --naive option, the files are concatenated without being recompressed, which is very fast but dangerous if the BCF headers differ.-a, --allow-overlaps
-c, --compact-PS
-d, --rm-dups snps|indels|both|all|none
-D, --remove-duplicates
-f, --file-list FILE
-l, --ligate
--no-version
-n, --naive
-o, --output FILE
-O, --output-type b|u|z|v
-q, --min-PQ INT
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file FILE
--threads INT
bcftools consensus [OPTIONS] FILE¶
Create consensus sequence by applying VCF variants to a reference fasta file. By default, the program will apply all ALT variants to the reference fasta to obtain the consensus sequence. Using the --sample (and, optionally, --haplotype) option will apply genotype (haplotype) calls from FORMAT/GT. Note that the program does not act as a primitive variant caller and ignores allelic depth information, such as INFO/AD or FORMAT/AD. For that, consider using the setGT plugin.-c, --chain FILE
-e, --exclude EXPRESSION
-f, --fasta-ref FILE
-H, --haplotype 1|2|R|A|LR|LA|SR|SA
1
2
R
A
LR, LA
SR, SA
This option requires *-s*, unless exactly one sample is present in the VCF
-i, --include EXPRESSION
-I, --iupac-codes
-m, --mask FILE
-o, --output FILE
-s, --sample NAME
Examples:
# Apply variants present in sample "NA001", output IUPAC codes for hets bcftools consensus -i -s NA001 -f in.fa in.vcf.gz > out.fa # Create consensus for one region. The fasta header lines are then expected # in the form ">chr:from-to". samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz -o out.fa
bcftools convert [OPTIONS] FILE¶
VCF input options:-e, --exclude EXPRESSION
-i, --include EXPRESSION
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file FILE
-s, --samples LIST
-S, --samples-file FILE
-t, --targets LIST
-T, --targets-file FILE
VCF output options:
--no-version
-o, --output FILE
-O, --output-type b|u|z|v
--threads INT
GEN/SAMPLE conversion:
-G, --gensample2vcf prefix or gen-file,sample-file
-g, --gensample prefix or gen-file,sample-file
.gen ---- 1:111485207_G_A 1:111485207_G_A 111485207 G A 0 1 0 0 1 0 1:111494194_C_T 1:111494194_C_T 111494194 C T 0 1 0 0 0 1 .sample ------- ID_1 ID_2 missing 0 0 0 sample1 sample1 0 sample2 sample2 0
--tag STRING
--chrom
--sex FILE
MaleSample M FemaleSample F
--vcf-ids
gVCF conversion:
--gvcf2vcf
-f, --fasta-ref file
HAP/SAMPLE conversion:
--hapsample2vcf prefix or hap-file,sample-file
.hap ---- 1:111485207_G_A rsID1 111485207 G A 0 1 0 0 1:111494194_C_T rsID2 111494194 C T 0 1 0 0 1:111495231_A_<DEL>_111495784 rsID3 111495231 A <DEL> 0 0 1 0
--hapsample prefix or hap-file,sample-file
--haploid2diploid
--sex FILE
MaleSample M FemaleSample F
--vcf-ids
HAP/LEGEND/SAMPLE conversion:
-H, --haplegendsample2vcf prefix or hap-file,legend-file,sample-file
-h, --haplegendsample prefix or hap-file,legend-file,sample-file
.hap ----- 0 1 0 0 1 0 0 1 0 0 0 1 .legend ------- id position a0 a1 1:111485207_G_A 111485207 G A 1:111494194_C_T 111494194 C T .sample ------- sample population group sex sample1 sample1 sample1 2 sample2 sample2 sample2 2
--haploid2diploid
--sex FILE
MaleSample M FemaleSample F
--vcf-ids
TSV conversion:
--tsv2vcf file
-c, --columns list
-f, --fasta-ref file
-s, --samples LIST
-S, --samples-file FILE
Example:
# Convert 23andme results into VCF bcftools convert -c ID,CHROM,POS,AA -s SampleName -f 23andme-ref.fa --tsv2vcf 23andme.txt -Oz -o out.vcf.gz
bcftools csq [OPTIONS] FILE¶
Haplotype aware consequence predictor which correctly handles combined variants such as MNPs split over multiple VCF records, SNPs separated by an intron (but adjacent in the spliced transcript) or nearby frame-shifting indels which in combination in fact are not frame-shifting.The output VCF is annotated with INFO/BCSQ and FORMAT/BCSQ tag (configurable with the -c option). The latter is a bitmask of indexes to INFO/BCSQ, with interleaved haplotypes. See the usage examples below for using the %TBCSQ converter in query for extracting a more human readable form from this bitmask. The contruction of the bitmask limits the number of consequences that can be referenced in the FORMAT/BCSQ tags. By default this is 16, but if more are required, see the --ncsq option.
The program requires on input a VCF/BCF file, the reference genome in fasta format (--fasta-ref) and genomic features in the GFF3 format downloadable from the Ensembl website (--gff-annot), and outputs an annotated VCF/BCF file. Currently, only Ensembl GFF3 files are supported.
By default, the input VCF should be phased. If phase is unknown, or only partially known, the --phase option can be used to indicate how to handle unphased data. Alternatively, haplotype aware calling can be turned off with the --local-csq option.
If conflicting (overlapping) variants within one haplotype are detected, a warning will be emitted and predictions will be based on only the first variant in the analysis.
Symbolic alleles are not supported. They will remain unannotated in the output VCF and are ignored for the prediction analysis.
-c, --custom-tag STRING
-e, --exclude EXPRESSION
-f, --fasta-ref FILE
-g, --gff-annot FILE
# The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines, # looks up their parent transcript (determined from the "Parent=transcript:" attribute), # the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype # (the most interesting is "protein_coding"). # # Attributes required for # gene lines: # - ID=gene:<gene_id> # - biotype=<biotype> # - Name=<gene_name> [optional] # # transcript lines: # - ID=transcript:<transcript_id> # - Parent=gene:<gene_id> # - biotype=<biotype> # # other lines (CDS, exon, five_prime_UTR, three_prime_UTR): # - Parent=transcript:<transcript_id> # # Supported biotypes: # - see the function gff_parse_biotype() in bcftools/csq.c 1 ignored_field gene 21 2148 . - . ID=gene:GeneId;biotype=protein_coding;Name=GeneName 1 ignored_field transcript 21 2148 . - . ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding 1 ignored_field three_prime_UTR 21 2054 . - . Parent=transcript:TranscriptId 1 ignored_field exon 21 2148 . - . Parent=transcript:TranscriptId 1 ignored_field CDS 21 2148 . - 1 Parent=transcript:TranscriptId 1 ignored_field five_prime_UTR 210 2148 . - . Parent=transcript:TranscriptId
-i, --include EXPRESSION
-l, --local-csq
-n, --ncsq INT
-o, --output FILE
-O, --output-type b|t|u|z|v
-p, --phase a|m|r|R|s
a
m
r
R
s
-q, --quiet
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file FILE
-s, --samples LIST
-S, --samples-file FILE
-t, --targets LIST
-T, --targets-file FILE
Examples:
# Basic usage bcftools csq -f hs37d5.fa -g Homo_sapiens.GRCh37.82.gff3.gz in.vcf -Ob -o out.bcf # Extract the translated haplotype consequences. The following TBCSQ variations # are recognised: # %TBCSQ .. print consequences in all haplotypes in separate columns # %TBCSQ{0} .. print the first haplotype only # %TBCSQ{1} .. print the second haplotype only # %TBCSQ{*} .. print a list of unique consquences present in either haplotype bcftools query -f'[%CHROM\t%POS\t%SAMPLE\t%TBCSQ\n]' out.bcf
Examples of BCSQ annotation:
# Two separate VCF records at positions 2:122106101 and 2:122106102 # change the same codon. This UV-induced C>T dinucleotide mutation # has been annotated fully at the position 2:122106101 with # - consequence type # - gene name # - ensembl transcript ID # - coding strand (+ fwd, - rev) # - amino acid position (in the coding strand orientation) # - list of corresponding VCF variants # The annotation at the second position gives the position of the full # annotation BCSQ=missense|CLASP1|ENST00000545861|-|1174P>1174L|122106101G>A+122106102G>A BCSQ=@122106101 # A frame-restoring combination of two frameshift insertions C>CG and T>TGG BCSQ=@46115084 BCSQ=inframe_insertion|COPZ2|ENST00000006101|-|18AGRGP>18AQAGGP|46115072C>CG+46115084T>TGG # Stop gained variant BCSQ=stop_gained|C2orf83|ENST00000264387|-|141W>141*|228476140C>T # The consequence type of a variant downstream from a stop are prefixed with * BCSQ=*missense|PER3|ENST00000361923|+|1028M>1028T|7890117T>C
bcftools filter [OPTIONS] FILE¶
Apply fixed-threshold filters.-e, --exclude EXPRESSION
-g, --SnpGap INT
The SNPs at positions 1 and 7 are filtered, positions 0 and 8 are not: 0123456789 ref .G.GT..G.. del .A.G-..A.. Here the positions 1 and 6 are filtered, 0 and 7 are not: 0123-456789 ref .G.G-..G.. ins .A.GT..A..
-G, --IndelGap INT
The second indel is filtered: 012345678901 ref .GT.GT..GT.. del .G-.G-..G-.. And similarly here, the second is filtered: 01 23 456 78 ref .A-.A-..A-.. ins .AT.AT..AT..
-i, --include EXPRESSION
-m, --mode [+x]
--no-version
-o, --output FILE
-O, --output-type b|u|z|v
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
-s, --soft-filter STRING|+
-S, --set-GTs .|0
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
--threads INT
bcftools gtcheck [OPTIONS] [-g genotypes.vcf.gz] query.vcf.gz¶
Checks sample identity. The program can operate in two modes. If the -g option is given, the identity of the -s sample from query.vcf.gz is checked against the samples in the -g file. Without the -g option, multi-sample cross-check of samples in query.vcf.gz is performed.-a, --all-sites
-c, --cluster FLOAT,FLOAT
The first "min" argument controls the typical error rate in multiplexed runs ("lanelets") from the same sample. Lanelets with error rate less than this will always be considered as coming from the same sample. The second "max" argument is the reverse: lanelets with error rate greater than the absolute value of this parameter will always be considered as different samples. When the value is negative, the cutoff may be heuristically lowered by the clustering engine. If positive, the value is interpreted as a fixed cutoff.
-g, --genotypes genotypes.vcf.gz
-G, --GTs-only INT
-H, --homs-only
-p, --plot PREFIX
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
-s, --query-sample STRING
-S, --target-sample STRING
-t, --targets file
-T, --targets-file file
Output files format:
CN, Discordance
\sum_s { min_G { PL_a(G) + PL_b(G) } },
ERR, error rate
CLUSTER, TH, DOT
bcftools index [OPTIONS] in.bcf|in.vcf.gz¶
Creates index for bgzip compressed VCF/BCF files for random access. CSI (coordinate-sorted index) is created by default. The CSI format supports indexing of chromosomes up to length 2^31. TBI (tabix index) index files, which support chromosome lengths up to 2^29, can be created by using the -t/--tbi option or using the tabix program packaged with htslib. When loading an index file, bcftools will try the CSI first and then the TBI.Indexing options:
-c, --csi
-f, --force
-m, --min-shift INT
-o, --output-file FILE
-t, --tbi
--threads INT
Stats options:
-n, --nrecords
-s, --stats
bcftools isec [OPTIONS] A.vcf.gz B.vcf.gz [...]¶
Creates intersections, unions and complements of VCF files. Depending on the options, the program can output records from one (or more) files which have (or do not have) corresponding records with the same position in the other files.-c, --collapse snps|indels|both|all|some|none
-C, --complement
-e, --exclude -|EXPRESSION
-f, --apply-filters LIST
-i, --include EXPRESSION
-n, --nfiles [+-=]INT|~BITMAP
-o, --output FILE
-O, --output-type b|u|z|v
-p, --prefix DIR
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
-w, --write LIST
Examples:
Create intersection and complements of two sets saving the output in dir/*
bcftools isec -p dir A.vcf.gz B.vcf.gz
Filter sites in A and B (but not in C) and create intersection
bcftools isec -e'MAF<0.01' -i'dbSNP=1' -e- A.vcf.gz B.vcf.gz C.vcf.gz -p dir
Extract and write records from A shared by both A and B using exact allele match
bcftools isec -p dir -n=2 -w1 A.vcf.gz B.vcf.gz
Extract records private to A or B comparing by position only
bcftools isec -p dir -n-1 -c all A.vcf.gz B.vcf.gz
Print a list of records which are present in A and B but not in C and D
bcftools isec -n~1100 -c all A.vcf.gz B.vcf.gz C.vcf.gz D.vcf.gz
bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz [...]¶
Merge multiple VCF/BCF files from non-overlapping sample sets to create one multi-sample file. For example, when merging file A.vcf.gz containing samples S1, S2 and S3 and file B.vcf.gz containing samples S3 and S4, the output file will contain four samples named S1, S2, S3, 2:S3 and S4.Note that it is responsibility of the user to ensure that the sample names are unique across all files. If they are not, the program will exit with an error unless the option --force-samples is given. The sample names can be also given explicitly using the --print-header and --use-header options.
Note that only records from different files can be merged, never from the same file. For "vertical" merge take a look at bcftools norm instead.
--force-samples
--print-header
--use-header FILE
-0 --missing-to-ref
-f, --apply-filters LIST
-F, --filter-logic x|+
-g, --gvcf -|FILE
-i, --info-rules -|TAG:METHOD[,...]
-l, --file-list FILE
-m, --merge snps|indels|both|all|none|id
-m none .. no new multiallelics, output multiple records instead -m snps .. allow multiallelic SNP records -m indels .. allow multiallelic indel records -m both .. both SNP and indel records can be multiallelic -m all .. SNP records can be merged with indel records -m id .. merge by ID
--no-version
-o, --output FILE
-O, --output-type b|u|z|v
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--threads INT
bcftools mpileup [OPTIONS] -f ref.fa in.bam [in2.bam [...]]¶
Generate VCF or BCF containing genotype likelihoods for one or multiple alignment (BAM or CRAM) files. This is based on the original samtools mpileup command (with the -v or -g options) producing genotype likelihoods in VCF or BCF format, but not the textual pileup output. The mpileup command was transferred to bcftools in order to avoid errors resulting from use of incompatible versions of samtools and bcftools when using in the mpileup+bcftools call pipeline.Individuals are identified from the SM tags in the @RG header lines. Multiple individuals can be pooled in one alignment file, also one individual can be separated into multiple files. If sample identifiers are absent, each input file is regarded as one sample.
Note that there are two orthogonal ways to specify locations in the input file; via -r region and -t positions. The former uses (and requires) an index to do random access while the latter streams through the file contents filtering out the specified regions, requiring no index. The two may be used in conjunction. For example a BED file containing locations of genes in chromosome 20 could be specified using -r 20 -t chr20.bed, meaning that the index is used to find chromosome 20 and then it is filtered for the regions listed in the BED file. Also note that the -r option can be much slower than -t with many regions and can require more memory when multiple regions and many alignment files are processed.
Input options
-6, --illumina1.3+
-A, --count-orphans
-b, --bam-list FILE
-B, --no-BAQ
-C, --adjust-MQ INT
-d, --max-depth INT
-E, --redo-BAQ
-f, --fasta-ref FILE
--no-reference
-G, --read-groups FILE
RG_ID_1 RG_ID_2 SAMPLE_A RG_ID_3 SAMPLE_A RG_ID_4 SAMPLE_B RG_ID_5 FILE_1.bam SAMPLE_A RG_ID_6 FILE_2.bam SAMPLE_A * FILE_3.bam SAMPLE_C ? FILE_3.bam SAMPLE_D
-q, -min-MQ INT
-Q, --min-BQ INT
-r, --regions CHR|CHR:POS|CHR:FROM-TO|CHR:FROM-[,...]
-R, --regions-file FILE
--ignore-RG
--rf, --incl-flags STR|INT
--ff, --excl-flags STR|INT
-s, --samples LIST
-S, --samples-file FILE
-t, --targets LIST
-T, --targets-file FILE
-x, --ignore-overlaps
Output options
-a, --annotate LIST
*FORMAT/AD* .. Allelic depth (Number=R,Type=Integer) *FORMAT/ADF* .. Allelic depths on the forward strand (Number=R,Type=Integer) *FORMAT/ADR* .. Allelic depths on the reverse strand (Number=R,Type=Integer) *FORMAT/DP* .. Number of high-quality bases (Number=1,Type=Integer) *FORMAT/SP* .. Phred-scaled strand bias P-value (Number=1,Type=Integer)
*INFO/AD* .. Total allelic depth (Number=R,Type=Integer) *INFO/ADF* .. Total allelic depths on the forward strand (Number=R,Type=Integer) *INFO/ADR* .. Total allelic depths on the reverse strand (Number=R,Type=Integer)
*FORMAT/DV* .. Deprecated in favor of FORMAT/AD; Number of high-quality non-reference bases, (Number=1,Type=Integer) *FORMAT/DP4* .. Deprecated in favor of FORMAT/ADF and FORMAT/ADR; Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases (Number=4,Type=Integer) *FORMAT/DPR* .. Deprecated in favor of FORMAT/AD; Number of high-quality bases for each observed allele (Number=R,Type=Integer) *INFO/DPR* .. Deprecated in favor of INFO/AD; Number of high-quality bases for each observed allele (Number=R,Type=Integer)
-g, --gvcf INT[,...]
--no-version
-o, --output FILE
-O, --output-type b|u|z|v
--threads INT
Options for SNP/INDEL genotype likelihood computation
-e, --ext-prob INT
-F, --gap-frac FLOAT
-h, --tandem-qual INT
-I, --skip-indels
-L, --max-idepth INT
-m, --min-ireads INT
-o, --open-prob INT
-p, --per-sample-mF
-P, --platforms STR
Examples:
Call SNPs and short INDELs, then mark low quality sites and sites with the read depth exceeding a limit. (The read depth should be adjusted to about twice the average read depth as higher read depths usually indicate problematic regions which are often enriched for artefacts.) One may consider to add -C50 to mpileup if mapping quality is overestimated for reads containing excessive mismatches. Applying this option usually helps for BWA-backtrack alignments, but may not other aligners.
bcftools mpileup -Ou -f ref.fa aln.bam | \ bcftools call -Ou -mv | \ bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > var.flt.vcf
bcftools norm [OPTIONS] file.vcf.gz¶
Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows. Left-alignment and normalization will only be applied if the --fasta-ref option is supplied.-c, --check-ref e|w|x|s
-d, --rm-dup snps|indels|both|all|none
-D, --remove-duplicates
-f, --fasta-ref FILE
-m, --multiallelics -|+[snps|indels|both|any]
--no-version
-N, --do-not-normalize
-o, --output FILE
-O, --output-type b|u|z|v
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
-s, --strict-filter
-t, --targets LIST
-T, --targets-file FILE
--threads INT
-w, --site-win INT
bcftools [plugin NAME|+NAME] [OPTIONS] FILE — [PLUGIN OPTIONS]¶
A common framework for various utilities. The plugins can be used the same way as normal commands only their name is prefixed with "+". Most plugins accept two types of parameters: general options shared by all plugins followed by a separator, and a list of plugin-specific options. There are some exceptions to this rule, some plugins do not accept the common options and implement their own parameters. Therefore please pay attention to the usage examples that each plugin comes with.VCF input options:
-e, --exclude EXPRESSION
-i, --include EXPRESSION
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
VCF output options:
--no-version
-o, --output FILE
-O, --output-type b|u|z|v
--threads INT
Plugin options:
-h, --help
-l, --list-plugins
By default, appropriate system directories are searched for installed plugins. You can override this by setting the BCFTOOLS_PLUGINS environment variable to a colon-separated list of directories to search. If BCFTOOLS_PLUGINS begins with a colon, ends with a colon, or contains adjacent colons, the system directories are also searched at that position in the list of directories.
-v, --verbose
-V, --version
List of plugins coming with the distribution:
GTisec
GTsubset
ad-bias
af-dist
check-ploidy
check-sparsity
color-chrs
counts
dosage
fill-AN-AC
fill-from-fasta
fill-tags
fix-ploidy
fixref
frameshifts
guess-ploidy
impute-info
isecGT
mendelian
missing2ref
prune
setGT
tag2tag
trio-switch-rate
Examples:
# List options common to all plugins bcftools plugin # List available plugins bcftools plugin -l # Run a plugin bcftools plugin counts in.vcf # Run a plugin using the abbreviated "+" notation bcftools +counts in.vcf # The input VCF can be streamed just like in other commands cat in.vcf | bcftools +counts # Print usage information of plugin "dosage" bcftools +dosage -h # Replace missing genotypes with 0/0 bcftools +missing2ref in.vcf # Replace missing genotypes with 0|0 bcftools +missing2ref in.vcf -- -p
Plugins troubleshooting:
Things to check if your plugin does not show up in the bcftools plugin -l output:
Plugins API:
// Short description used by 'bcftools plugin -l' const char *about(void); // Longer description used by 'bcftools +name -h' const char *usage(void); // Called once at startup, allows initialization of local variables. // Return 1 to suppress normal VCF/BCF header output, -1 on critical // errors, 0 otherwise. int init(int argc, char **argv, bcf_hdr_t *in_hdr, bcf_hdr_t *out_hdr); // Called for each VCF record, return NULL to suppress the output bcf1_t *process(bcf1_t *rec); // Called after all lines have been processed to clean up void destroy(void);
bcftools polysomy [OPTIONS] file.vcf.gz¶
Detect number of chromosomal copies in VCFs annotates with the Illumina’s B-allele frequency (BAF) values. Note that this command is not compiled in by default, see the section Optional Compilation with GSL in the INSTALL file for help.General options:
-o, --output-dir path
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
-s, --sample string
-t, --targets LIST
-T, --targets-file FILE
-v, --verbose
Algorithm options:
-b, --peak-size float
-c, --cn-penalty float
-f, --fit-th float
-i, --include-aa
-m, --min-fraction float
-p, --peak-symmetry float
bcftools query [OPTIONS] file.vcf.gz [file.vcf.gz [...]]¶
Extracts fields from VCF or BCF files and outputs them in user-defined format.-e, --exclude EXPRESSION
-f, --format FORMAT
-H, --print-header
-i, --include EXPRESSION
-l, --list-samples
-o, --output FILE
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
-s, --samples LIST
-S, --samples-file FILE
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
-u, --allow-undef-tags
-v, --vcf-list FILE
Format:
%CHROM The CHROM column (similarly also other columns: POS, ID, REF, ALT, QUAL, FILTER) %INFO/TAG Any tag in the INFO column %TYPE Variant type (REF, SNP, MNP, INDEL, BND, OTHER) %MASK Indicates presence of the site in other files (with multiple files) %TAG{INT} Curly brackets to subscript vectors (0-based) %FIRST_ALT Alias for %ALT{0} [] Format fields must be enclosed in brackets to loop over all samples %GT Genotype (e.g. 0/1) %TBCSQ Translated FORMAT/BCSQ. See the csq command above for explanation and examples. %TGT Translated genotype (e.g. C/A) %IUPACGT Genotype translated to IUPAC ambiguity codes (e.g. M instead of C/A) %LINE Prints the whole line %SAMPLE Sample name %POS0 POS in 0-based coordinates %END End position of the REF allele %END0 End position of the REF allele in 0-based cordinates \n new line \t tab character
Everything else is printed verbatim.
Examples:
# Print chromosome, position, ref allele and the first alternate allele bcftools query -f '%CHROM %POS %REF %ALT{0}\n' file.vcf.gz
# Similar to above, but use tabs instead of spaces, add sample name and genotype bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz
# Print FORMAT/GT fields followed by FORMAT/GT fields bcftools query -f 'GQ:[ %GQ] \t GT:[ %GT]\n' file.vcf
# Make a BED file: chr, pos (0-based), end pos (1-based), id bcftools query -f'%CHROM\t%POS0\t%END\t%ID\n' file.bcf
# Print only samples with alternate (non-reference) genotypes bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]' -i'GT="alt"' file.bcf
# Print all samples at sites with at least one alternate genotype bcftools view -i'GT="alt"' file.bcf -Ou | bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]'
bcftools reheader [OPTIONS] file.vcf.gz¶
Modify header of VCF/BCF files, change sample names.-h, --header FILE
-o, --output FILE
-s, --samples FILE
bcftools roh [OPTIONS] file.vcf.gz¶
A program for detecting runs of homo/autozygosity. Only bi-allelic sites are considered.The HMM model:
Notation: D = Data, AZ = autozygosity, HW = Hardy-Weinberg (non-autozygosity), f = non-ref allele frequency Emission probabilities: oAZ = P_i(D|AZ) = (1-f)*P(D|RR) + f*P(D|AA) oHW = P_i(D|HW) = (1-f)^2 * P(D|RR) + f^2 * P(D|AA) + 2*f*(1-f)*P(D|RA) Transition probabilities: tAZ = P(AZ|HW) .. from HW to AZ, the -a parameter tHW = P(HW|AZ) .. from AZ to HW, the -H parameter ci = P_i(C) .. probability of cross-over at site i, from genetic map AZi = P_i(AZ) .. probability of site i being AZ/non-AZ, scaled so that AZi+HWi = 1 HWi = P_i(HW) P_{i+1}(AZ) = oAZ * max[(1 - tAZ * ci) * AZ{i-1} , tAZ * ci * (1-AZ{i-1})] P_{i+1}(HW) = oHW * max[(1 - tHW * ci) * (1-AZ{i-1}) , tHW * ci * AZ{i-1}]
General Options:
--AF-dflt FLOAT
--AF-tag TAG
--AF-file FILE
bcftools query -f'%CHROM\t%POS\t%REF,%ALT\t%INFO/TAG\n' file.vcf | bgzip -c > freqs.tab.gz
-b, --buffer-size INT[,INT]
-e, --estimate-AF FILE
--exclude EXPRESSION
-G, --GTs-only FLOAT
--include EXPRESSION
-I, --skip-indels
-m, --genetic-map FILE
-M, --rec-rate FLOAT
-o, --output FILE
-O, --output-type s|r[z]
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
-s, --samples LIST
-S, --samples-file FILE
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
HMM Options:
-a, --hw-to-az FLOAT
-H, --az-to-hw FLOAT
-V, --viterbi-training FLOAT
bcftools sort [OPTIONS] file.bcf¶
-m, --max-mem FLOAT[kMG]-o, --output FILE
-O, --output-type b|u|z|v
-T, --temp-dir DIR
bcftools stats [OPTIONS] A.vcf.gz [B.vcf.gz]¶
Parses VCF or BCF and produces text file stats which is suitable for machine processing and can be plotted using plot-vcfstats. When two files are given, the program generates separate stats for intersection and the complements. By default only sites are compared, -s/-S must given to include also sample columns. When one VCF file is specified on the command line, then stats by non-reference allele frequency, depth distribution, stats by quality and per-sample counts, singleton stats, etc. are printed. When two VCF files are given, then stats such as concordance (Genotype concordance by non-reference allele frequency, Genotype concordance by sample, Non-Reference Discordance) and correlation are also printed. Per-site discordance (PSD) is also printed in --verbose mode.--af-bins LIST|FILE
--af-tag TAG
-1, --1st-allele-only
-c, --collapse snps|indels|both|all|some|none
-d, --depth INT,INT,INT
--debug
-e, --exclude EXPRESSION
-E, --exons file.gz
tabix -s1 -b2 -e3 file.gz
-f, --apply-filters LIST
-F, --fasta-ref ref.fa
-i, --include EXPRESSION
-I, --split-by-ID
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
-s, --samples LIST
-S, --samples-file FILE
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
-u, --user-tstv <TAG[:min:max:n]>
-v, --verbose
bcftools view [OPTIONS] file.vcf.gz [REGION [...]]¶
View, subset and filter VCF or BCF files by position and filtering expression. Convert between VCF and BCF. Former bcftools subset.Output options
-G, --drop-genotypes
-h, --header-only
-H, --no-header
-l, --compression-level [0-9]
--no-version
-O, --output-type b|u|z|v
-o, --output-file FILE: output file name. If not present, the default is to print to standard output (stdout).
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
--threads INT
Subset options:
-a, --trim-alt-alleles
--force-samples
-I, --no-update
-s, --samples LIST
-S, --samples-file FILE
Filter options:
Note that filter options below dealing with counting the number of alleles will, for speed, first check for the values of AC and AN in the INFO column to avoid parsing all the genotype (FORMAT/GT) fields in the VCF. This means that a filter like --min-af 0.1 will be based ‘AC/AN’ where AC and AN come from either INFO/AC and INFO/AN if available or FORMAT/GT if not. It will not filter on another field like INFO/AF. The --include and --exclude filter expressions should instead be used to explicitly filter based on fields in the INFO column, e.g. --exclude AF<0.1.
-c, --min-ac INT[:nref|:alt1|:minor|:major|:'nonmajor']
-C, --max-ac INT[:nref|:alt1|:minor|:'major'|:'nonmajor']
-e, --exclude EXPRESSION
-f, --apply-filters LIST
-g, --genotype [^][hom|het|miss]
-i, --include EXPRESSION
-k, --known
-m, --min-alleles INT
-M, --max-alleles INT
-n, --novel
-p, --phased
-P, --exclude-phased
-q, --min-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
-Q, --max-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
-u, --uncalled
-U, --exclude-uncalled
-v, --types snps|indels|mnps|other
-V, --exclude-types snps|indels|mnps|ref|bnd|other
-x, --private
-X, --exclude-private
bcftools help [COMMAND] | bcftools --help [COMMAND]¶
Display a brief usage message listing the bcftools commands available. If the name of a command is also given, e.g., bcftools help view, the detailed usage message for that particular command is displayed.bcftools [--version|-v]¶
Display the version numbers and copyright information for bcftools and the important libraries used by bcftools.bcftools [--version-only]¶
Display the full bcftools version number in a machine-readable format.EXPRESSIONS¶
These filtering expressions are accepted by most of the commands.Valid expressions may contain:
1, 1.0, 1e-4 "String" @file_name
+,*,-,/
== (same as =), >, >=, <=, <, !=
INFO/HAYSTACK ~ "needle" INFO/HAYSTACK ~ "NEEDless/i"
(, )
&& (same as &), ||, |
INFO/DP or DP FORMAT/DV, FMT/DV, or DV FILTER, QUAL, ID, POS, REF, ALT[0]
FlagA=1 && FlagB=0
DP=".", DP!=".", ALT="."
GT~"\.", GT!~"\."
GT=".|.", GT="./.", GT="."
GT="ref" GT="alt" GT="mis" GT="hom" GT="het" GT="hap" GT="RR" GT="AA" GT="RA" or GT="AR" GT="Aa" or GT="aA" GT="R" GT="A"
TYPE="snp" TYPE~"snp" TYPE!="snp" TYPE!~"snp"
INFO/AF[0] > 0.3 .. first AF value bigger than 0.3 FORMAT/AD[0:0] > 30 .. first AD value of the first sample bigger than 30 FORMAT/AD[0:1] .. first sample, second AD value FORMAT/AD[1:0] .. second sample, first AD value DP4[*] == 0 .. any DP4 value FORMAT/DP[0] > 30 .. DP of the first sample bigger than 30 FORMAT/DP[1-3] > 10 .. samples 2-4 FORMAT/DP[1-] < 7 .. all samples but the first FORMAT/DP[0,2-4] > 20 .. samples 1, 3-5 FORMAT/AD[0:1] .. first sample, second AD field FORMAT/AD[0:*], AD[0:] or AD[0] .. first sample, any AD field FORMAT/AD[*:1] or AD[:1] .. any sample, second AD field (DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3 CSQ[*] ~ "missense_variant.*deleterious"
MAX, MIN, AVG, SUM, STRLEN, ABS, COUNT
N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING
perl:path/to/script.pl; perl.severity(INFO/CSQ) > 3
Notes:
-i 'TAG[*]=1' .. true, the record will be printed -i 'TAG[*]!=1' .. true -e 'TAG[*]=1' .. false, the record will be discarded -e 'TAG[*]!=1' .. false -i 'TAG[0]=1' .. true -i 'TAG[0]!=1' .. false -e 'TAG[0]=1' .. false -e 'TAG[0]!=1' .. true
Examples:
MIN(DV)>5
MIN(DV/DP)>0.3
MIN(DP)>10 & MIN(DV)>3
FMT/DP>10 & FMT/GQ>10 .. both conditions must be satisfied within one sample
FMT/DP>10 && FMT/GQ>10 .. the conditions can be satisfied in different samples
QUAL>10 | FMT/GQ>10 .. true for sites with QUAL>10 or a sample with GQ>10, but selects only samples with GQ>10
QUAL>10 || FMT/GQ>10 .. true for sites with QUAL>10 or a sample with GQ>10, plus selects all samples at such sites
TYPE="snp" && QUAL>=10 && (DP4[2]+DP4[3] > 2)
COUNT(GT="hom")=0
MIN(DP)>35 && AVG(GQ)>50
ID=@file .. selects lines with ID present in the file
ID!=@~/file .. skip lines with ID present in the ~/file
MAF[0]<0.05 .. select rare variants at 5% cutoff
POS>=100 .. restrict your range query, e.g. 20:100-200 to strictly sites with POS in that range.
Shell expansion:
Note that expressions must often be quoted because some characters have special meaning in the shell. An example of expression enclosed in single quotes which cause that the whole expression is passed to the program as intended:
bcftools view -i '%ID!="." & MAF[0]<0.01'
Please refer to the documentation of your shell for details.
SCRIPTS AND OPTIONS¶
plot-vcfstats [OPTIONS] file.vchk [...]¶
Script for processing output of bcftools stats. It can merge results from multiple outputs (useful when running the stats for each chromosome separately), plots graphs and creates a PDF presentation.-m, --merge
-p, --prefix DIR
-P, --no-PDF
-r, --rasterize
-s, --sample-names
-t, --title STRING
-T, --main-title STRING
PERFORMANCE¶
HTSlib was designed with BCF format in mind. When parsing VCF files, all records are internally converted into BCF representation. Simple operations, like removing a single column from a VCF file, can be therefore done much faster with standard UNIX commands, such as awk or cut. Therefore it is recommended to use BCF as input/output format whenever possible to avoid large overhead of the VCF → BCF → VCF conversion.BUGS¶
Please report any bugs you encounter on the github website: http://github.com/samtools/bcftoolsAUTHORS¶
Heng Li from the Sanger Institute wrote the original C version of htslib, samtools and bcftools. Bob Handsaker from the Broad Institute implemented the BGZF library. Petr Danecek, Shane McCarthy and John Marshall are maintaining and further developing bcftools. Many other people contributed to the program and to the file format specifications, both directly and indirectly by providing patches, testing and reporting bugs. We thank them all.RESOURCES¶
BCFtools GitHub website: http://github.com/samtools/bcftoolsSamtools GitHub website: http://github.com/samtools/samtools
HTSlib GitHub website: http://github.com/samtools/htslib
File format specifications: http://samtools.github.io/hts-specs
BCFtools documentation: http://samtools.github.io/bcftools
BCFtools wiki page: https://github.com/samtools/bcftools/wiki
COPYING¶
The MIT/Expat License or GPL License, see the LICENSE document for details. Copyright (c) Genome Research Ltd.2018-04-03 |