NAME

psi-cd-hit-2d-g1.pl - runs an algorithm similar to CD-HIT but uses BLAST to calculate similarities in db1 or db2 format

DESCRIPTION

Usage psi-cd-hit-2d [Options]

Options

-ce clustering threshold (blast expect), default -1

: by default the expect threshold is not used; with a positive value, the program clusters seqs if their similarity meets either the identity threshold or the expect threshold
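The either/or rule described for -ce can be sketched as a small predicate. This is only an illustration, not the script's actual code: the function name, the parameter names, and the comparison directions (identity at or above its cutoff, expect at or below its cutoff) are assumptions.

```python
def meets_threshold(identity, expect, identity_cutoff, expect_cutoff=-1.0):
    """Return True if a hit is similar enough to be clustered (hypothetical helper)."""
    # Identity test: assumed to pass when identity reaches the cutoff.
    if identity >= identity_cutoff:
        return True
    # A non-positive expect cutoff (the default -1) disables the expect test.
    if expect_cutoff > 0 and expect <= expect_cutoff:
        return True
    return False
```

With the default expect cutoff, only the identity test can succeed; a positive cutoff lets a low-identity hit with a strong expect value be clustered as well.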

-R: (1/0) use psi-blast profile? default 0; with 1, perform a psi-blast / pdb-blast type search

: if you prefer to use -G 0, it is suggested that you also use -L, such as -L 0.8, to prevent very short matches.

-d: length of description line in the .clstr file, default 30; if set to 0, it takes the fasta defline and stops at the first space
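The -d behaviour can be illustrated with a short sketch. The helper name is hypothetical, and two details are assumptions rather than documented behaviour: that the leading ">" is dropped before measuring, and that the length counts characters.

```python
def clstr_description(defline, length=30):
    """Shorten a fasta defline for a .clstr file (hypothetical helper)."""
    text = defline.lstrip(">")  # assumption: the ">" marker is not counted
    if length == 0:
        # Length 0: keep everything up to the first space.
        return text.split(" ", 1)[0]
    # Otherwise keep at most `length` characters.
    return text[:length]
```

For a defline such as ">sp|P12345|ABC_HUMAN some protein", a length of 0 keeps only the identifier before the first space.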

-bfdb profile database, default nr80

-be blast expect cutoff, default 0.000001

-b: filename of list of hosts, to run this program in parallel with ssh calls; you need to provide a list of hosts

-pbs number of jobs to send each time through the PBS queuing system

-k (1/0) keep blast raw output file, default 1

-rs steps between saving the restart file and clustering output, default 5000

: each time 5000 sequences have been processed, the program writes a restart file and the current clustering information

-restart restart file, read in a restart file

: if the program crashed, stopped, or was terminated, you can restart it by adding the option "-restart sth.restart"
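The checkpoint cadence set by -rs can be sketched as a counting loop. This is a minimal illustration under assumptions: `process` and `save` are hypothetical callbacks standing in for the clustering step and the restart-file writer, and the script's real bookkeeping may differ.

```python
def run_with_checkpoints(sequences, process, save, step=5000):
    """Process sequences, calling save(count) after every `step` of them."""
    done = 0
    for seq in sequences:
        process(seq)
        done += 1
        if done % step == 0:
            # Here the restart file and current clustering would be written.
            save(done)
    return done
```

A smaller step loses less work after a crash but writes the restart file more often.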

-rf steps between reformatting the blast database, default 200,000

: once the program has clustered 200,000 seqs, it removes them from the seq pool and reformats the blast db to save time
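The pool-shrinking idea behind -rf can be sketched with sets. All names here are hypothetical, and the sketch assumes the 200,000 count is measured since the last reformat, which the option text does not state.

```python
def maybe_reformat(pool, clustered, step=200_000):
    """Drop clustered ids from the pool once enough have accumulated.

    pool: ids still in the blast db; clustered: ids clustered since the
    last reformat. Returns (new_pool, new_clustered, reformatted).
    """
    if len(clustered) < step:
        return pool, clustered, False
    # Enough seqs clustered: shrink the pool so later searches scan less.
    return pool - clustered, set(), True
```

The trade-off is that rebuilding the database costs time up front, which is recovered only if later searches run against a noticeably smaller pool.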

-local dir of local blast db

: when run in parallel with ssh (not pbs), the program can copy blast dbs to local drives on each node to save blast db reading time, BUT IT MAY NOT BE FASTER

-J: job, job_file, execute specific jobs like parsing blast output only. DON'T use it; it is only used by this program itself

-single file of ids that you know are singletons

-i2 second input database

-blastn run blastn, default 0

-lo how much longer a seq in db2 can be than a seq in db1 in a cluster, default 0

: by Weizhong Li, liwz@sdsc.edu

: "Clustering of highly homologous sequences to reduce the size of large protein database", Weizhong Li, Lukasz Jaroszewski & Adam Godzik, Bioinformatics, (2001) 17:282-283

: "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik, Bioinformatics, (2006) 22:1658-1659

April 2012

psi-cd-hit-2d-g1.pl 4.6-2012-04-25