PSI-CD-HIT-2D-G1.PL(1) | User Commands | PSI-CD-HIT-2D-G1.PL(1) |
NAME
psi-cd-hit-2d-g1.pl - runs an algorithm similar to CD-HIT, but using BLAST to calculate similarities in db1 or db2 format
DESCRIPTION
Usage: psi-cd-hit-2d [Options]
Options
- -i
- in_dbname, required
- -o
- out_dbname, required
- -c
- clustering threshold (sequence identity), default 0.3
- -ce
- clustering threshold (BLAST expect), default -1
- by default the expect threshold is not used; with a positive value, the program clusters sequences if their similarity meets either the identity threshold or the expect threshold
- -L
- coverage of shorter sequence (aligned / full), default 0.0
- -M
- coverage of longer sequence (aligned / full), default 0.0
- -R
- (1/0) use psi-blast profile? default 0; 1 performs a psi-blast / pdb-blast type search
- -G
- (1/0) use global identity? default 1; sequence identity is calculated as
- total identical residues of local alignments / length of shorter sequence
- if you prefer to use -G 0, it is suggested that you also use -L, such as -L 0.8, to prevent very short matches
- -d
- length of description line in the .clstr file, default 30; if set to 0, it takes the FASTA defline and stops at the first space
- -l
- length_of_throw_away_sequences, default 10
- -p
- profile search parameters, default
- "-a 2 -d nr80 -j 3 -F F -e 0.001 -b 500 -v 500"
- -bfdb
- profile database, default nr80
- -s
- BLAST search parameters, default
- "-F F -e 0.000001 -b 100000 -v 100000"
- -be
- BLAST expect cutoff, default 0.000001
- -b
- file name of a list of hosts on which to run this program in parallel via ssh calls; you need to provide the list of hosts
- -pbs
- number of jobs to send each time through the PBS queuing system
- you cannot use both ssh and PBS at the same time
- -k
- (1/0) keep BLAST raw output file, default 1
- -rs
- interval, in sequences, at which the restart file and clustering output are saved, default 5000
- each time another 5000 sequences have been processed, the program writes a restart file and the current clustering information
- -restart
- restart file; reads in a restart file
- if the program crashes, is stopped, or is terminated, you can restart it by adding the option "-restart sth.restart"
- -rf
- interval, in clustered sequences, at which the BLAST database is reformatted, default 200,000
- once the program has clustered 200,000 sequences, it removes them from the sequence pool and reformats the BLAST db to save time
- -local
- directory of local BLAST db
- when run in parallel with ssh (not PBS), the program can copy BLAST dbs to local drives on each node to save BLAST db reading time, BUT IT MAY NOT BE FASTER
- -J
- job, job_file: execute specific jobs, such as parsing BLAST output only; DON'T use it, it is only used by this program itself
- -single
- file of IDs of sequences that you know are singletons
- the program won't run them as queries
- -i2
- second input database
- -blastn
- run blastn, default 0
- -lo
- how much longer a sequence in db2 can be than the db1 sequence in its cluster, default 0
- the default means that a sequence in db2 should be no longer than the db1 sequences in its cluster
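The way the thresholds -c, -ce, -L, -M and -G combine can be illustrated with a minimal Python sketch. It is not taken from the program itself: the Hit structure, the function names, the local identity formula used for -G 0, the reading of "meets the expect threshold" as expect <= ce, and the choice to apply the coverage filters before both criteria are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Summary of the BLAST local alignments between two sequences (hypothetical structure)."""
    identical: int      # total identical residues over all local alignments
    aligned_short: int  # aligned residues on the shorter sequence
    aligned_long: int   # aligned residues on the longer sequence
    aligned_cols: int   # total alignment length, used only for -G 0
    len_short: int      # full length of the shorter sequence
    len_long: int       # full length of the longer sequence
    expect: float       # best BLAST expect value


def identity(hit, use_global=True):
    # -G 1: identical residues / length of the shorter sequence (as documented).
    # -G 0: ASSUMPTION, identical residues / alignment length (local identity).
    denom = hit.len_short if use_global else hit.aligned_cols
    return hit.identical / denom


def should_cluster(hit, c=0.3, ce=-1.0, L=0.0, M=0.0, G=1):
    # Coverage filters (-L, -M): aligned / full length.
    # ASSUMPTION: they are applied before either similarity criterion.
    if hit.aligned_short / hit.len_short < L:
        return False
    if hit.aligned_long / hit.len_long < M:
        return False
    meets_identity = identity(hit, use_global=bool(G)) >= c
    # ASSUMPTION: the expect criterion is active only when ce > 0 and is met
    # when the hit's expect value is at most ce.
    meets_expect = ce > 0 and hit.expect <= ce
    return meets_identity or meets_expect


hit = Hit(identical=40, aligned_short=90, aligned_long=90, aligned_cols=90,
          len_short=100, len_long=200, expect=1e-10)
print(should_cluster(hit))                    # identity 0.4 against default -c 0.3
print(should_cluster(hit, c=0.5))             # identity alone is too low
print(should_cluster(hit, c=0.5, ce=1e-6))    # expect criterion rescues the hit
print(should_cluster(hit, L=0.95))            # rejected by shorter-sequence coverage
```

With -G 0 the denominator is the alignment length rather than the full shorter sequence, so a very short but near-perfect match can reach a high identity; this is why combining -G 0 with a coverage cutoff such as -L 0.8 is suggested.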
- ============================== by Weizhong Li, liwz@sdsc.edu ==============================
- If you find cd-hit useful, please kindly cite:
- "Clustering of highly homologous sequences to reduce the size of large protein database", Weizhong Li, Lukasz Jaroszewski & Adam Godzik, Bioinformatics, (2001) 17:282-283
- "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik, Bioinformatics, (2006) 22:1658-1659
April 2012 | psi-cd-hit-2d-g1.pl 4.6-2012-04-25 |