EMBOSS-2.6.0-1mdk.ppc.rpm

Version 2.6.0

	Added twofeat - Finds neighbouring pairs of features in sequences

	Extractfeat - added option (-featinname) to include the name of
	the feature as part of the ID name of the sequence that is
	written out. 

	Added sirna - designs siRNA probes in mRNA

	Sigcleave sorts results highest score first

	Helixturnhelix sorts results highest score first and reports the
	score position as an integer

	Added pestfind

	Moved the following programs into the "domainatrix" embassy
	package:
	 contacts, domainer, fraggle, hetparse, hmmgen, interface,
	 pdbparse, pdbtosp, profgen, scopalign, scopnr, scopparse,
	 scoprep, scopreso, scopseqs, seqalign, seqnr, seqsearch,
	 seqsort, seqwords, siggen, sigplot, sigscan
	
	Palindrome no longer reports palindromes that are only composed
	of N's

	Msbar can now check that the result doesn't match any of a set
	of other input sequences.  For example, you could specify that it
	doesn't match the input sequence or a set of previously produced
	mutation results.

        Getorf reporting of circular genome positions tidied up - it now
	reports positions starting in the range 1 to the sequence length
	and indicates if the ORF goes through the breakpoint.  A clear
	indication of when ORFs are in the reverse sense has been added.
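
	The wrapping described for getorf can be sketched as follows. This
	is only an illustration of the arithmetic, not the getorf source;
	the function name and its arguments are hypothetical, and it
	assumes 1-based coordinates with end >= start.

```python
def circular_orf_position(start, end, seq_len):
    """Wrap raw 1-based ORF coordinates onto a circular sequence.

    start and end may run past seq_len when the ORF continues through
    the origin. Returns the wrapped start, the wrapped end, and a flag
    saying whether the ORF passes through the breakpoint.
    """
    orf_span = end - start
    wrapped_start = (start - 1) % seq_len + 1
    wrapped_end = (end - 1) % seq_len + 1
    # The ORF crosses the breakpoint if it runs past the last base.
    crosses_breakpoint = wrapped_start + orf_span > seq_len
    return wrapped_start, wrapped_end, crosses_breakpoint


print(circular_orf_position(95, 110, 100))
```

	Here an ORF from 95 to 110 on a 100-base circle starts at 95, ends
	at 10 and is flagged as crossing the breakpoint.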

	Pasteseq now behaves correctly when -sask2, -sbegin2 or -send2
	are used. 

Version 2.5.1

	Whichdb new option -showall to see which databases are being
	searched for use where searches hang. The order of searching is
	undefined - it depends on the order in which databases are
	returned from the internal table, which is unrelated to the order
	in which they were defined.

	Wordmatch alignments save the entire sequence but use only part
	of it. Fixed all alignment formats to work with these by adding a
	SubOffset attribute.
	
	Duplicate IDs fix. The database indexing programs skipped
	duplicate IDs but did not reset the size of the entryname index
	file so some queries could fail to find the later IDs in the
	databases. Duplicate IDs are illegal for -nosystemsort (no easy
	way to correct because entry numbers are stored internally). For
	the default case duplicate IDs are merged even if they are
	different. REFSEQ is the main problem area.
	
	Writing data files used EMBOSS_DATA, or by default the install
	directory. Earlier versions, if not installed, could write to the
	source tree emboss/data directory. Fixed to continue if there is
	no install data directory, and to check EMBOSS_DATA (if defined) is
	a real directory.
	
	sigcleave options pval and nval hardcoded. They depend on the
	weight matrix size - which is hardcoded as 15 in the ACD file and
	is not checked in the program. They were introduced in EGCG in
	1988 but never used because no other weight matrix length was
	tried.

Version 2.5.0

	"fasta" format now uses the "ncbi" parser, so both formats report
	"fasta" as the format. "pearson" is the old "fasta" format for a few
	cases (empty IDs for example) where ncbi parsing fails completely.

	SPLITTER changed to match documentation. Old behaviour is
	now selectable by using the -addoverlap command line
	option.

	Configuration modifications. --without-x works. Removed odd
	but harmless -I definitions. PNG detection improved.

	Corrected EMBLCD index searching for queries that start with a
	wildcard. For example, tembl-key:?* should search for all entries
	that have a keyword (key:* is regarded as 'all entries'). Entries
	with no keyword (in PIR's pir4.ref file for example) will be
	ignored.
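
	The difference between key:?* and key:* can be illustrated with
	shell-style wildcards. This is a sketch using Python's fnmatch on a
	hypothetical in-memory table of entries, not the EMBLCD index code.

```python
from fnmatch import fnmatchcase


def entries_matching_keyword(entries, pattern):
    """Return the sorted entry names selected by a keyword pattern.

    entries maps an entry name to its list of keywords. A bare '*' is
    treated as 'all entries'; any other pattern must match at least
    one keyword, so entries without keywords are skipped.
    """
    if pattern == "*":
        return sorted(entries)
    return sorted(
        name
        for name, keywords in entries.items()
        if any(fnmatchcase(keyword, pattern) for keyword in keywords)
    )


# Hypothetical entries; "B" has no keyword at all.
table = {"A": ["kinase"], "B": [], "C": ["lac operon"]}
print(entries_matching_keyword(table, "?*"))
print(entries_matching_keyword(table, "*"))
```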
	
	Updated source code docs for EFUNC and EDATA. Corrected all bad
	headers. efunc.out has no errors. efunc.check only reports
	'missing headers' for duplicated function names (#ifdef code)
	which is a known 'feature'.

	Updated source code to fix most lines over 80 bytes.
	
	Calculated ACD attributes now QA tested. Feature attributes will
	be correctly set, although none are used in the ACD files at present.
	
	purify.pl has a new option -block=n where n is a number from 1
	upwards.  1 runs the first 10 tests, 2 runs the next 10
	(blocksize=10 is hardcoded for now).
	
	Cleaned up string position code. Inspections showed ajStrPos and
	related functions gave results from 0 to length of a string. This
	caused confusion in many other functions and applications. These
	functions are now static strPos functions because only ajstr.c had
	calls to them (though the ajStrPos versions are still available).
	All calls were checked for positions out of range. As a result,
	many calls to ajStrAssSub and ajStrCut were fixed. ajStrInsertC
	requires a value from 0 to length (start position to insert can be
	before or after the string, or any position in between). Fixed by
	passing length+1 to strPosII.

	Added a function ajUtilCatch for use in debugging with gdb. When
	a nasty special case occurs, call ajUtilCatch and make it a
	breakpoint in gdb. The resulting backtrace will give the call stack
	and all variable values.
	
	Cleaned up code for chunked HTTP input. Added a new variable
	EMBOSS_HTTPVERSION which defaults to 1.0 (so HTTP is not chunked)
	and a DB attribute httpversion. This must be a floating point
	number, and is included in the HTTP header to specify the HTTP
	protocol version to be used. There is no check in the code to
	change behaviour for different versions. This is used in the
	SRSWWW and URL access methods.
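
	As an illustration of how a floating point version setting ends up
	in an HTTP header, here is a minimal sketch. The function and the
	exact header lines are assumptions; the header EMBOSS actually
	sends is not shown in these notes.

```python
def build_request(host, path, http_version="1.0"):
    """Build a minimal HTTP GET request for the given protocol version.

    http_version must parse as a floating point number; float() raises
    ValueError otherwise. HTTP/1.0 responses are not chunked.
    """
    version = float(http_version)
    lines = [
        f"GET {path} HTTP/{version:.1f}",
        f"Host: {host}",
        "",
        "",
    ]
    return "\r\n".join(lines)


print(build_request("example.org", "/entry", "1.0"))
```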

	Added check to qatest.pl to report any EMBOSS (rather than
	EMBASSY) applications for which there is no defined test. The
	EMBASSY test uses wossname results, checked against the names of
	ACD files in the source tree, as qatest always runs in the test/qa
	directory.
	
	Allowed sequences as values for EMBL rpt_unit feature qualifiers
	because so many entries have them. They are illegal according to
	the Version 4.0 (current) feature table document.

	Allow ? before from and to feature locations in SwissProt. For
	now, these are ignored, though we could add something to hold them
	for accurate output.
	
	Added modified Harrison solubility probability to PEPSTATS
	
	ACD attributes now have descriptions in the ajacd.c code which are
	reported by 'entrails'. All ACD attributes have been checked by
	inspection of the code to note those which are used/unused by ACD.
	The ACD "type" attribute for files is renamed "standardtype" to
	reflect its intended use to note standard file types for linking
	applications. Sequences and alignments still have a "type"
	attribute for protein or dna sequence types.
	
	aaindexextract (new) reads the AAINDEX database and writes each
	entry to data/AAINDEX directory. New function ajFileDataDirNew to
	read data files from a named directory. New ACD datafile attribute
	'directory' passed to ajFileDataDirNew. AAINDEX directory defined
	for pepwindow and pepwindowall.
	
	palindrome can now read in multiple sequences

        palindrome now does not print a '|' in an alignment where there
        is a mismatched pair of bases.

	Added filelist datatype to ACD

	mwcontam program added. Displays molecular weights that are common
	across a set of files.

	showfeat - added '-sort join' to display joined features on one line.

	diffseq - don't give summary of SNPs if the sequences are proteins.

	Inclusion of stat64 and readdir64 for offsetbits=64 (ajfile.c
	and ajsys.c)

	Workaround for broken Solaris readdir64_r (jembossctl)
	
        infoseq can now optionally display GI and Sequence Version numbers.

	notseq can now read in a file of sequence names.

	Added '-alternative' qualifier to transeq to allow reverse frame
	translations to be done using the codons counted from the start
	of the reversed sequence, rather than, by default, using the
	codons of the corresponding forward frame. 

	Added the qualifier '-join' to the program extractfeat.
	If '-join' is set then joined features, such as 'CDS' and 'mRNA'
	are output as a single concatenated sequence.
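
	The effect of '-join' can be sketched as below. This is not the
	extractfeat code; the function and the (start, end) segment list
	are hypothetical, and writing each segment separately when the
	option is unset is an assumption.

```python
def extract_joined(sequence, segments, join=False):
    """Extract feature segments given as 1-based inclusive (start, end).

    With join=True the segments are concatenated into one sequence;
    otherwise each segment is returned on its own (assumed behaviour).
    """
    pieces = [sequence[start - 1:end] for start, end in segments]
    if join:
        return ["".join(pieces)]
    return pieces


seq = "AAACCCGGGTTT"
cds = [(1, 3), (7, 9)]  # a feature located at join(1..3,7..9)
print(extract_joined(seq, cds, join=True))
```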

	Changed the default output filename from 'stdout' to a file for
	the following:
	    infoalign
	    megamerger
	    merger
	    showalign
	    showfeat
	    showseq
	    textsearch

	Lindna/cirdna can now draw filled boxes and the user can change the
 	text size on the command-line. They can also read and display
 	complete genomic sequences.

	Major new revision of protein structure applications, without full
	documentation.

	New applications have been added:  
	     pdbparse.c / acd
	     scopseqs.c / acd
	     scopnr.c / acd
	     seqsearch.c / acd 
	     seqwords.c / acd
	     seqalign.c / acd 
	     hetparse.c / acd
	     scopreso.c / acd
	     scoprep.c / acd
	     profgen.c / acd
	     funky.c / acd
	     hmmgen.c / acd
	     fraggle.c / acd
	     
	Some applications have been deleted: 
	     scope.c / acd
	     nrscope.c / acd
	     psiblasts.c / acd
	     swissparse.c / acd
	     alignwrap.c / acd
	     dichet.c / acd

	The deleted applications have been replaced as follows: 
	     coordenew  --> pdbparse (coordnew was deleted a while back)
	     scope --> scopparse
	     nrscope --> scopnr
	     psiblasts --> seqsearch
	     swissparse --> seqwords
	     alignwrap  --> seqalign

	New versions of code have been committed: 
	     pdbparse.c / acd
	     domainer.c / acd
	     contacts.c / acd
	     interface.c / acd
	     pdbtosp.c / acd
	     scopparse.c / acd
	     scopreso.c / acd
	     scopseqs.c / acd
	     scopnr.c / acd
	     scoprep.c / acd
	     scopalign.c / acd
	     seqsearch.c / acd
	     seqwords.c / acd
	     seqsort.c / acd
	     seqnr.c / acd
	     seqalign.c / acd
	     siggen.c / acd
	     sigscan.c / acd
	     sigplot.c / acd
	     hetparse.c / acd
	     profgen.c / acd
	     funky.c / acd
	     hmmgen.c / acd
	Plus     
	     ajxyz.c / ajxyz.h

	Short summaries of the applications are as follows:
	     pdbparse - Parses pdb files and writes cleaned-up protein 
	                coordinate files.
	     domainer - Reads protein coordinate files and writes 
	                domains coordinate files.
	     contacts - Reads coordinate files and writes files of 
	                intra-chain residue-residue contact data.
	     interface- Reads coordinate files and writes files of 
	                inter-chain residue-residue contact data.
	     pdbtosp  - Convert raw swissprot:pdb equivalence file to 
		        embl-like format.
	     scopparse- Converts raw scop classification files to a 
	                file in embl-like format.
	     scopreso - Removes low resolution domains from a scop 
	                classification file.
	     scopseqs - Adds pdb and swissprot sequence records to a 
	                scop classification file.
	     scopnr   - Removes redundant domains from a scop 
	                classification file.
	     scoprep  - Reorder scop classification file so that the 
	                representative structure of each family is 
			given first.
	     scopalign- Generate alignments for families in a scop 
	                classification file by using STAMP.
	     seqsearch- Generate files of hits for families in a scop
	                classification file by using PSI-BLAST with 
			seed alignments.
	     seqwords - Generate files of hits for scop families by 
	                searching swissprot with keywords.
	     seqsort  - Reads multiple files of hits and writes a 
	                non-ambiguous file of hits (scop families file) 
			plus a validation file.
	     seqnr    - Removes redundant hits from a scop families file.
	     seqalign - Generate extended alignments for families in 
	                a scop families file by using CLUSTALW with seed 
			alignments.
	     siggen   - Generates a sparse protein signature from an 
	                alignment and residue contact data.
	     sigscan  - Scans a signature against swissprot and writes 
	                a signature hits file.
	     sigplot  - Reads a signature hits file and validation file 
	                and generates gnuplot data files of signature 
			performance.
	     profgen  - Generates various profiles for each alignment 
	                in a directory.
	     hmmgen   - Generates a hidden Markov model for each alignment 
	                in a directory.
	     hetparse - Converts raw dictionary of heterogen groups to 
	                a file in embl-like format.
	     funky    -	Reads clean coordinate files and writes file 
	                of protein-heterogen contact data.

	Updated "make check" program entrails. Corrected sequence format
	reports, added report and alignment formats and database access
	methods.
	
	Added scripts/logreport1.pl to report EMBOSS usage from the
	logfile. Takes the logfile name on the command line. Reports
	total use, most active user, and total user count.

	extractseq now only reads one sequence as input. 

Version 2.4.1
	Fixed error reading multiple databases
	Fixed MacOSX reading of incomplete sequence files
	Fixed indexing of REFSEQ

Version 2.4.0
        New Jemboss authorising server code. This uses a new set-uid
	program (jembossctl) to perform tasks as the user.

	New alignment output format "match" for wordmatch, reports the
	length, sequence names, and range in each sequence.

	emboss.default.template has been changed to include the new SRSWWW
	access method and the fields definitions for the test databases.

	In dbiblast, renamed the -filename option -filenames to match the
	other dbi indexing programs, and because wildcard filenames are
	supported.
	
	Removed the -staden option for the dbi indexing programs. This had
	no effect (it was originally included to rename files as
	division.lookup for use by internal utilities at the Sanger
	Centre).
	
	In qatest.pl test script, added test for missing expected file.
	Only seen for obsolete secondary output files, no tests were passing
	that should have failed.

	Script (scripts/dbilist.pl) to report the contents of EMBLCD
	database indices created by dbiflat, dbigcg, dbifasta or dbiblast.

	Proxy HTTP access for remote servers. Define EMBOSS_PROXY as an
	environment variable, or in emboss.defaults. Can also be set for
	any database as proxy: "hostname:port" or overridden with
	proxy: ":" to use a local server for a database. This is used by
	both the URL and SRSWWW access methods.
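
	The precedence between the global proxy and a per-database proxy
	can be sketched like this. The function names are hypothetical and
	the parsing is only a guess at a reasonable reading of
	"hostname:port"; it is not taken from the EMBOSS source.

```python
def parse_proxy(value):
    """Parse 'hostname:port'; a bare ':' means no proxy (direct access)."""
    host, separator, port = value.partition(":")
    if not separator:
        raise ValueError("proxy must look like hostname:port")
    if host == "" and port == "":
        return None
    return host, int(port)


def effective_proxy(global_proxy, db_proxy=None):
    """Pick the proxy for one database: its own setting wins if present."""
    chosen = db_proxy if db_proxy is not None else global_proxy
    if chosen is None:
        return None
    return parse_proxy(chosen)


print(effective_proxy("proxy.example.org:8080"))
print(effective_proxy("proxy.example.org:8080", db_proxy=":"))
```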

	New ajListUnique function to remove duplicate nodes in a list.

	New embxyz.c / .h embXyzSeqsetNRRange functions added

	Report format "table" is the default for several applications. In
	this format, the sequence USA has been removed because it already
	appears in the sequence header part of the report. A new format
	"-rformat nametable" will produce the previous report output for
	users who are relying on parsing it.

	Output files defined with the "nullok" attribute in ACD are not
	created unless requested. The file name and extension are ignored.
	It is possible to add a new associated qualifier to control this
	behaviour, but its use may be confusing with more than one output
	file.
	
	Precision attribute for report score (default is 3). Other
	floating point report values are written as strings by the
	original application so their precision is defined in the
	code. The score is a float, as part of the internal (GFF) feature
	structure.  A zero value produces an integer score (strictly, it
	uses %.0f as the format). Set precision for etandem, fuzznuc,
	fuzzpro, fuzztran, patmatdb, patmatmotifs (integer scores) and
	restrict (no score)
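
	The effect of the precision value on the printed score can be
	shown with printf-style formatting. This is a sketch in Python
	rather than the C report code; format_score is a made-up name.

```python
def format_score(score, precision=3):
    """Format a float score with the given number of decimal places.

    A precision of 0 gives the equivalent of the %.0f format.
    """
    return "%.*f" % (precision, score)


print(format_score(12.5))      # default precision of 3
print(format_score(12.0, 0))   # integer-looking score
```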
	
	Report output for equicktandem and etandem, with -origfile to
	write the original output format for sites (Sanger for example)
	who still require it. By default, the origfile output file is
	not created.

	Report output for patmatdb and patmatmotifs. For patmatmotifs the
	prosite documentation appears in the report footer, with the
	addition of the motif name and the number of matches in the
	sequence.

	Report headers and footers automatically trim last newline.
	
	Reports in -rformat SeqTable right-align numbers.

	Report output for marscan (-rformat GFF by default)

	Report output for fuzztran (-rformat table with the translation
	included as a report field). Using -rformat seqtable with fuzztran
	now also shows the original DNA sequence.
	
	Report output for fuzznuc and fuzzpro (-rformat SeqTable by default)

	New report qualifiers -raccshow to include accession in header
	and -rdesshow to include description in header
	
	Two access methods "file" and "offset" were defined as valid in
	database definitions, but are really reserved for simple file reading.
	They are removed from the database access methods list.

	Two access methods "cmd" and "nbrf" are obsolete (cmd was never
	implemented, nbrf is replaced by gcg which includes a query
	mechanism). Both are removed from the database access methods list,
	and the source code is commented out.
	
	SRS, SRSFASTA and SRSWWW database access can read all entries. This
	is not recommended for SRSWWW access because it will read
	everything into memory - all of EMBL for example - then strip out
	HTML tags before reading. For SRS it is not recommended because
	"methodall: direct" is faster. For SRSFASTA it is necessary
	because using SRSFASTA implies EMBOSS does not read the original
	data format. However, not implementing an "all" search left a gap
	in the SRS access methods which would generate a bad SRS command
	line or URL.
	
	NBRF sequence reading trims last character only if it is '*'
	to catch cases where SRS reports the sequence as 'plain'
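
	The trimming rule amounts to the following sketch (illustrative
	Python, not the ajseqread.c code; the function name is made up).

```python
def trim_nbrf_terminator(sequence):
    """Drop the final character only when it is the NBRF '*' terminator."""
    if sequence.endswith("*"):
        return sequence[:-1]
    return sequence


print(trim_nbrf_terminator("MKVLA*"))
print(trim_nbrf_terminator("MKVLA"))
```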

	GCG database text has the spaces in ". ." strings removed.

	Database entry text and sequence saved for binary formats (GCG, BLAST)
	for use by entret and other applications

	dbiblast indices with split databases (formatdb -v) fixed for reading
	all entries (was only reading the first file)

	dbiblast and dbigcg indices support exclude and file definitions
	to create database subsets

	Database include and file definitions can use the simple filename.
	In some cases the full path was used. Database files are checked
	both with and without the directory path for back-compatibility.

	srswww access method created to query a remote web server.
	Preferred to using URL access as SRS queries can be built

	Sequence objects include the SeqVersion, Keyword list and Taxonomy
	list.

	The GI number is read as an alternative SeqVersion where it is
	available (GenBank and some NCBI formats). The GI number is
	reported in GenBank format if available, but the GenBank VERSION
	line may have only the SeqVersion if, for example, the sequence
	was read from an EMBL entry. "sv" queries check both the
	SeqVersion and GI number.

	Accession numbers have a strict definition, which covers the old
	and new EMBL/GenBank format, SwissProt, PIR, and REFSEQ
	(NM_nnnnnn). Earlier versions would accept any "accession number"
	in some sequence formats, especially NCBI format.

	SeqVersion (EMBL SV line, GenBank VERSION line) is used in preference
	to accession number where available. Can also be read in FASTA
	and NCBI formats. Where only the SeqVersion is available, the
	accession number is generated.
	
	USA queries implement searches by SV, DES, ORG and KEY. These work
	with SRS access methods (SRS, SRSFASTA, SRSWWW) by building SRS
	queries, and with direct access (simple file reading) by
	testing the sequence object.

	Key and Org queries are for full keywords (including spaces) and
	for each level of the taxonomy.

	Des queries, if the access method does not provide a mechanism
	(that is, if it does not have its own index), are applied to
	words within the description. Words start with a letter or number,
	and end with a letter or number. SRS typically does the same, but
	allows a single quote at the end. This catches words such as 3'
	and 5' but is a problem with some quoted text.
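
	The word rule for Des queries can be sketched with a regular
	expression. This assumes words are first split on whitespace and
	that only ASCII letters and digits count, neither of which is
	stated above; the function and its allow_trailing_quote flag are
	hypothetical.

```python
import re

# From the first letter or digit to the last letter or digit.
WORD = re.compile(r"[A-Za-z0-9](?:.*[A-Za-z0-9])?")
# Same, but a single quote directly after the word is kept (SRS style).
WORD_WITH_QUOTE = re.compile(r"[A-Za-z0-9](?:.*[A-Za-z0-9])?'?")


def description_words(description, allow_trailing_quote=False):
    """Split a description into query words, trimming outer punctuation."""
    pattern = WORD_WITH_QUOTE if allow_trailing_quote else WORD
    words = []
    for token in description.split():
        match = pattern.search(token)
        if match:
            words.append(match.group(0))
    return words


text = "E.coli 5' region (lacZ)"
print(description_words(text))
print(description_words(text, allow_trailing_quote=True))
```

	With the trailing quote allowed, 5' survives as a word, but a
	token such as 'quoted' would come out as quoted' which is the
	problem with quoted text mentioned above.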

	Queries for ID ACC SV DES ORG and KEY are valid for all file
	access methods, including URL, external, cmd, app, file and by
	default any new method added. If the internal query data is not
	flagged by the access method (to show the database has been
	queried) the sequence object is automatically tested.

	Missing description, keyword, organism, or seqversion fields cause
	queries to fail if they are used on inappropriate data.

	dbiflat, dbigcg, dbifasta and dbiblast can index the new
	fields. All fields are available in dbiflat and dbigcg. The sv and
	des fields are available in dbifasta and dbiblast. If any specific
	formats make it possible to parse the org (or key) field they can
	be added as new formats.

	The new EMBLCD index files are named as follows: des for the
	descriptions (no obvious standard name), seqvn for the seqversion
	(no obvious standard name), keyword for keywords (EMBLCD
	distribution name) and taxon for organism (EMBLCD distribution
	name). The EMBLCD distribution also included a freetext index
	which is similar to the SRS alltext search so we did not use the
	name for the description index.

	We are working through the EMBLCD format documentation to make
	EMBOSS indices more compatible. For example, all tokens in the TRG
	index files should have trailing spaces. We use a NULL to mark the
	end of the string.

	EMBLCD index files now expand to fit the longest token, including
	the entryname index which was limited to 12 characters (only one
	site reported a problem with this in dbifasta with long ID names).

	A new qualifier -maxindex sets an upper limit (25 is recommended)
	to limit the size of all index files. Currently this applies to
	all indices. We can add separate maxima for each field if
	needed. We expect very few sites to use the extra index fields
	as SRS is a simpler alternative.
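
	A sketch of fixed-width index records that grow to fit the longest
	token, with an optional upper limit. This is not the EMBLCD file
	layout; the function is hypothetical, and truncating tokens at the
	limit is an assumption about how a maximum could be applied. The
	pad byte can be a NULL or a space.

```python
def build_index_records(tokens, maxindex=None, pad=b"\0"):
    """Return (width, records) with every token padded to one width.

    The width is the longest token length, capped at maxindex when
    given. Longer tokens are cut to the width (assumed behaviour).
    """
    width = max(len(token) for token in tokens)
    if maxindex is not None:
        width = min(width, maxindex)
    records = [
        token[:width].encode("ascii").ljust(width, pad) for token in tokens
    ]
    return width, records


names = ["ECLACI", "A_VERY_LONG_ENTRY_NAME"]
print(build_index_records(names))
print(build_index_records(names, maxindex=12, pad=b" "))
```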
	
	New database definition token 'fields' with a list of indexed fields
	can be set to 'sv des org key' for SRS databases. 

	USAs check the query field against the database 'fields'
	definition. ID and ACC are always allowed. dbname:name still
	searches ID and ACC (no change from previous version)
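
	The field check can be sketched as below, assuming the 'fields'
	definition is a space-separated list and that field names are
	compared without regard to case (both assumptions; the function
	name is made up).

```python
ALWAYS_ALLOWED = {"id", "acc"}


def field_allowed(query_field, db_fields):
    """Check a query field against a database 'fields' definition."""
    field = query_field.lower()
    return field in ALWAYS_ALLOWED or field in db_fields.lower().split()


print(field_allowed("des", "sv des org key"))
print(field_allowed("key", "sv des"))
```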

	USAs with a filename can include the new query fields. The syntax is
	filename:field:query for example empro.dat:id:eclaci (the extended
	syntax is because empro.dat-id:eclaci looks like a filename ending
	in -id)
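
	Splitting the filename:field:query form can be sketched as
	follows. This is only an illustration of the syntax, not the USA
	parser; it ignores file names that themselves contain a colon and
	the other USA forms.

```python
def parse_file_usa(usa):
    """Split 'filename' or 'filename:field:query' into its parts."""
    parts = usa.split(":")
    if len(parts) == 1:
        return {"filename": parts[0], "field": None, "query": None}
    if len(parts) == 3:
        filename, field, query = parts
        return {"filename": filename, "field": field, "query": query}
    raise ValueError("expected filename or filename:field:query")


print(parse_file_usa("empro.dat:id:eclaci"))
```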

	Application 'tranalign' added. 
	This aligns nucleic coding regions based on a set of aligned proteins.

	
Version 2.3.1

	est2genome fixed for large alignments (over 40Mbase for
	est * genomic sequence length)
	sequence reading for ABI files fixed (and selex files tested)
	genbank feature input working
	pepinfo PNG output larger to make the text readable (only affects
	PNG output)
	empty sequence file input fails gracefully
	empty sequence input fails gracefully (and only needs one
	^D from stdin)
	
Version 2.3.0

	Seqretall, seqretallfeat and seqretset moved to 'make check'.
	Seqret has all the functionality of the above.
	Fix for NBRF accession number reading (ajseqread.c)
	Whichdb program added.
	Fix for dbifasta and wormpep
	Fix for problem reading plain format sequences by primer3.
	Primer3 renamed eprimer3 to avoid conflicts with the Whitehead's
	primer3 version 3.0.6.
	transeq's '-frame' can have a list of values, as: '-frame=1,2,3'
	Non-existent files in lists are again ignored
	Various wildcard database search fixes
	ESIM4 added as an embassy package