.\" DO NOT MODIFY THIS FILE!  It was generated by help2man 1.40.9.
.TH FASTX_BARCODE_SPLITTER.PL "1" "May 2012" "fastx_barcode_splitter.pl 0.0.13.2" "User Commands"
.SH NAME
fastx_barcode_splitter.pl \- FASTX Barcode Splitter
.SH DESCRIPTION
Barcode Splitter, by Assaf Gordon (gordon@cshl.edu), 11sep2008
.PP
This program reads a FASTA/FASTQ file and splits it into several smaller files,
based on barcode matching.
FASTA/FASTQ data is read from STDIN (format is auto\-detected.)
Output files will be written to disk.
Summary will be printed to STDOUT.
.PP
usage: fastx_barcode_splitter.pl \fB\-\-bcfile\fR FILE \fB\-\-prefix\fR PREFIX [\-\-suffix SUFFIX] [\-\-bol|\-\-eol]
.IP
[\-\-mismatches N] [\-\-exact] [\-\-partial N] [\-\-help] [\-\-quiet] [\-\-debug]
.PP
Arguments:
.PP
\fB\-\-bcfile\fR FILE   \- Barcodes file name. (See explanation below.)
.PP
\fB\-\-prefix\fR PREFIX \- File prefix. Will be added to the output file names. Can be used
.IP
to specify output directories.
.PP
\fB\-\-suffix\fR SUFFIX \- File suffix (optional). Can be used to specify file
.IP
extensions.
.PP
\fB\-\-bol\fR           \- Try to match barcodes at the BEGINNING of sequences.
.IP
(What biologists would call the 5' end, and programmers
would call index 0.)
.PP
\fB\-\-eol\fR           \- Try to match barcodes at the END of sequences.
.IP
(What biologists would call the 3' end, and programmers
would call the end of the string.)
.PP
NOTE: exactly one of \fB\-\-bol\fR and \fB\-\-eol\fR must be specified.
.PP
\fB\-\-mismatches\fR N  \- Maximum number of mismatches allowed. Default is 1.
.PP
\fB\-\-exact\fR         \- Same as '\-\-mismatches 0'. If both \fB\-\-exact\fR and \fB\-\-mismatches\fR
.IP
are specified, '\-\-exact' takes precedence.
.PP
\fB\-\-partial\fR N     \- Allow partial overlap of barcodes. (See explanation below.)
.IP
(Default is no partial matching.)
.PP
\fB\-\-quiet\fR         \- Don't print counts and summary at the end of the run.
.IP
(Default is to print.)
.PP
\fB\-\-debug\fR         \- Print lots of useless debug information to STDERR.
.PP
\fB\-\-help\fR          \- This helpful help screen.
.PP
Example (Assuming 's_2_100.txt' is a FASTQ file, 'mybarcodes.txt' is
the barcodes file):
.IP
\f(CW$ cat s_2_100.txt | fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 2 --prefix /tmp/bla_ --suffix ".txt"\fR
.PP
.SS "Barcode file format"
Barcode files are simple text files. Each line should contain an identifier
(descriptive name for the barcode), and the barcode itself (A/C/G/T),
separated by a TAB character. Example:
.IP
#This line is a comment (starts with a 'number' sign)
BC1 GATCT
BC2 ATCGT
BC3 GTGAT
BC4 TGTCT
.PP
For each barcode, a new FASTQ file will be created (with the barcode's
identifier as part of the file name). Sequences matching the barcode
will be stored in the appropriate file.
.PP
Running the above example (assuming "mybarcodes.txt" contains the above
barcodes) will create the following files:
.IP
/tmp/bla_BC1.txt
/tmp/bla_BC2.txt
/tmp/bla_BC3.txt
/tmp/bla_BC4.txt
/tmp/bla_unmatched.txt
.PP
The 'unmatched' file will contain all sequences that didn't match any barcode.
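.PP
The file format and output naming described above can be sketched in a few
lines of Python. This is an illustrative sketch only, not the tool's actual
Perl code: the helper names read_barcodes and output_names are hypothetical,
and both the skipping of blank lines and the rejection of lines that do not
have exactly two TAB\-separated fields are assumptions.
.PP
.nf
```python
TAB = chr(9)  # the field separator named in the format description


def read_barcodes(lines):
    """Return (identifier, barcode) pairs, skipping blank and comment lines."""
    barcodes = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(TAB)
        if len(fields) != 2:
            # Assumption: malformed lines are rejected outright.
            raise ValueError("expected IDENTIFIER<TAB>BARCODE: " + line)
        barcodes.append((fields[0], fields[1]))
    return barcodes


def output_names(barcodes, prefix, suffix=""):
    """Build one output file name per barcode, plus the unmatched file."""
    names = [prefix + ident + suffix for ident, _ in barcodes]
    names.append(prefix + "unmatched" + suffix)
    return names


EXAMPLE = [
    "#This line is a comment (starts with a number sign)",
    "BC1" + TAB + "GATCT",
    "BC2" + TAB + "ATCGT",
    "BC3" + TAB + "GTGAT",
    "BC4" + TAB + "TGTCT",
]

for name in output_names(read_barcodes(EXAMPLE), "/tmp/bla_", ".txt"):
    print(name)
```
.fi
.PP
With the example barcodes, the sketch prints the five file names listed above,
ending with the 'unmatched' file.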
.PP
.SS "Barcode matching"
.PP
** Without partial matching:
.PP
Count mismatches between the FASTA/Q sequences and the barcodes.
The barcode that matches with the lowest mismatch count (provided the
count is smaller than or equal to '\-\-mismatches N') 'gets' the sequence.
.PP
Example (using the above barcodes):
Input Sequence:
.IP
GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
.SS "Matching with '--bol --mismatches 1':"
.IP
GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
GATCT (1 mismatch, BC1)
ATCGT (4 mismatches, BC2)
GTGAT (3 mismatches, BC3)
TGTCT (3 mismatches, BC4)
.PP
This sequence will be classified as 'BC1' (it has the lowest mismatch count).
If '\-\-exact' or '\-\-mismatches 0' were specified, this sequence would be
classified as 'unmatched' (because, although BC1 had the lowest mismatch count,
it is above the maximum allowed mismatches).
.PP
Matching with '\-\-eol' (end of line) does the same, but from the other side
of the sequence.
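.PP
The scoring described above (without partial matching) can be sketched in
Python as follows. This is a minimal illustration, not the tool's actual Perl
code: the names count_mismatches and classify are hypothetical, and the way
ties between equally good barcodes are resolved (here, the first barcode
listed wins) is an assumption.
.PP
.nf
```python
def count_mismatches(fragment, barcode):
    """Count differing positions; a length difference counts as mismatches."""
    differing = sum(1 for x, y in zip(fragment, barcode) if x != y)
    return differing + abs(len(fragment) - len(barcode))


def classify(sequence, barcodes, max_mismatches=1, at_bol=True):
    """Return the identifier of the best barcode, or "unmatched"."""
    best_id, best_count = "unmatched", None
    for ident, barcode in barcodes:
        n = len(barcode)
        # Compare against the 5-prime end (--bol) or the 3-prime end (--eol).
        fragment = sequence[:n] if at_bol else sequence[-n:]
        count = count_mismatches(fragment, barcode)
        # Assumption: on a tie, the barcode listed first is kept.
        if best_count is None or count < best_count:
            best_id, best_count = ident, count
    if best_count is None or best_count > max_mismatches:
        return "unmatched"
    return best_id


BARCODES = [("BC1", "GATCT"), ("BC2", "ATCGT"), ("BC3", "GTGAT"), ("BC4", "TGTCT")]
SEQ = "GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG"

print(classify(SEQ, BARCODES, max_mismatches=1))  # prints BC1
print(classify(SEQ, BARCODES, max_mismatches=0))  # prints unmatched
print(classify(SEQ, BARCODES, max_mismatches=3, at_bol=False))  # prints BC4
```
.fi
.PP
The last call matches at the end of the sequence: the best candidate there is
BC4 with 3 mismatches, so it is accepted only when 3 mismatches are allowed.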
.PP
** With partial matching (very similar to indels):
.PP
Same as above, with the following addition: barcodes are also checked for
partial overlap (number of allowed non\-overlapping bases is '\-\-partial N').
.PP
Example:
Input sequence is ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
(Same as above, but note the missing 'G' at the beginning.)
.SS "Matching (without partial overlapping) against BC1 yields 4 mismatches:"
.IP
ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
GATCT (4 mismatches)
.SS "Partial overlapping would also try the following match:"
.IP
\-ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
GATCT (1 mismatch)
.PP
Note: scoring counts a missing base as a mismatch, so the final
mismatch count is 2 (1 'real' mismatch, 1 'missing base' mismatch).
If running with '\-\-mismatches 2' (that is, allowing up to 2 mismatches) \- this
sequence will be classified as BC1.
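.PP
The partial\-overlap scoring in this example can be sketched in Python as
follows. This is a minimal illustration for the beginning\-of\-sequence case
only, not the tool's actual Perl code: the name best_partial_score is
hypothetical, and trying every shift from 0 up to N and keeping the lowest
score is an assumption based on the description above.
.PP
.nf
```python
def count_mismatches(fragment, barcode):
    """Count differing positions; a length difference counts as mismatches."""
    differing = sum(1 for x, y in zip(fragment, barcode) if x != y)
    return differing + abs(len(fragment) - len(barcode))


def best_partial_score(sequence, barcode, partial=0):
    """Lowest score over shifts 0..partial at the start of the sequence."""
    scores = []
    # Never shift the whole barcode away: keep at least one overlapping base.
    max_shift = min(partial, len(barcode) - 1)
    for shift in range(max_shift + 1):
        overlap = barcode[shift:]            # barcode bases that still overlap
        fragment = sequence[:len(overlap)]
        # Each missing (non-overlapping) base is scored as one mismatch.
        scores.append(count_mismatches(fragment, overlap) + shift)
    return min(scores)


SEQ = "ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG"

print(best_partial_score(SEQ, "GATCT", partial=0))  # prints 4
print(best_partial_score(SEQ, "GATCT", partial=1))  # prints 2
```
.fi
.PP
With one non\-overlapping base allowed, the score for BC1 drops from 4 to 2,
which is within the limit when 2 mismatches are allowed.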
.SH SEE ALSO
The quality of this automatically generated manpage might be
insufficient.  It is suggested to visit
.IP
http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
.P
to get a better layout as well as an overview about connected
FASTX tools.