Version 2.6.0

Added twofeat - Finds neighbouring pairs of features in sequences

Extractfeat - added option (-featinname) to include the name of the feature as part of the ID name of the sequence that is written out.

Added sirna - designs siRNA probes in mRNA

Sigcleave sorts results highest score first

Helixturnhelix sorts results highest score first and reports the score position as an integer

Added pestfind

Moved the following programs into the "domainatrix" embassy package: contacts, domainer, fraggle, hetparse, hmmgen, interface, pdbparse, pdbtosp, profgen, scopalign, scopnr, scopparse, scoprep, scopreso, scopseqs, seqalign, seqnr, seqsearch, seqsort, seqwords, siggen, sigplot, sigscan

Palindrome no longer reports palindromes that are only composed of N's

Msbar can now check that the result doesn't match a set of other input sequences. For example you could specify that it doesn't match the input sequence or a set of previously produced mutation results.

Getorf reporting of circular genome positions tidied up - it now reports positions starting in the range 1 to the sequence length and indicates if the ORF goes through the breakpoint. A clear indication of when ORFs are in the reverse sense has been added.

Pasteseq now behaves correctly when -sask2, -sbegin2 or -send2 are used.

Version 2.5.1

Whichdb new option -showall to see which databases are being searched, for use where searches hang. The order of searching is undefined - it depends on the order in which databases are returned from the internal table, which is unrelated to the order in which they were defined.

Wordmatch alignments save the entire sequence but use part only. Fixed all alignment formats to work with these by adding a SubOffset attribute.

Duplicate IDs fix. The database indexing programs skipped duplicate IDs but did not reset the size of the entryname index file so some queries could fail to find the later IDs in the databases.
Duplicate IDs are illegal for -nosystemsort (no easy way to correct because entry numbers are stored internally). For the default case duplicate IDs are merged even if they are different. REFSEQ is the main problem area.

Writing data files used EMBOSS_DATA, or by default the install directory. Earlier versions, if not installed, could write to the source tree emboss/data directory. Fixed to continue if there is no install data directory, and to check EMBOSS_DATA (if defined) is a real directory.

sigcleave options pval and nval hardcoded. They depend on the weight matrix size - which is hardcoded as 15 in the ACD file and is not checked in the program. They were introduced in EGCG in 1988 but never used because no other weight matrix length was tried.

Version 2.5.0

"fasta" format now uses the "ncbi" parser, so both formats report "fasta" as the format. "pearson" is the old "fasta" format for a few cases (empty IDs for example) where ncbi parsing fails completely.

SPLITTER changed to match documentation. Old behaviour is now selectable by using the -addoverlap command line option.

Configuration modifications. --without-x works. Removed odd but harmless -I definitions. PNG detection improved.

Corrected EMBLCD index searching for queries that start with a wildcard. For example, tembl-key:?* should search for all entries that have a keyword (key:* is regarded as 'all entries'). Entries with no keyword (in PIR's pir4.ref file for example) will be ignored.

Updated source code docs for EFUNC and EDATA. Corrected all bad headers. efunc.out has no errors. efunc.check only reports 'missing headers' for duplicated function names (#ifdef code) which is a known 'feature'.

Updated source code to fix most lines over 80 bytes.

Calculated ACD attributes now QA tested. Feature attributes will be correctly set, although none are used in the ACD files at present.

purify.pl has a new option -block=n where n is a number from 1 upwards.
1 runs the first 10 tests, 2 runs the next 10 (blocksize=10 is hardcoded for now).

Cleaned up string position code. Inspections showed ajStrPos and related functions gave results from 0 to length of a string. This caused confusion in many other functions and applications. These functions are now static strPos functions because only ajstr.c had calls to them (though the ajStrPos versions are still available). All calls were checked for positions out of range. As a result, many calls to ajStrAssSub and ajStrCut were fixed. ajStrInsertC requires a value from 0 to length (the start position to insert can be before or after the string, or any position in between). Fixed by passing length+1 to strPosII.

Added a function ajUtilCatch for use in debugging with gdb. When a nasty special case occurs, call ajUtilCatch and make it a breakpoint in gdb. The resulting backtrace will give the call stack and all variable values.

Cleaned up code for chunk HTML input. Added a new variable EMBOSS_HTTPVERSION which defaults to 1.0 (so HTTP is not chunked) and a DB attribute httpversion. This must be a floating point number, and is included in the HTTP header to specify the HTTP protocol version to be used. There is no check in the code to change behaviour for different versions. This is used in the SRSWWW and URL access methods.

Added check to qatest.pl to report any EMBOSS (rather than EMBASSY) applications for which there is no defined test. The EMBASSY test uses wossname results, checked against the names of ACD files in the source tree, as qatest always runs in the test/qa directory.

Allowed sequences as values for EMBL rpt_unit feature qualifiers because so many entries have them. They are illegal according to the Version 4.0 (current) feature table document.

Allow ? before from and to feature locations in SwissProt. For now, these are ignored, though we could add something to hold them for accurate output.
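As a rough illustration of the EMBOSS_HTTPVERSION entry above, the Python sketch below checks that a version value is a floating point number and places it in an HTTP request line. The function name and the request layout are assumptions made for illustration; this is not the EMBOSS C code.

```python
def build_request_line(path, http_version="1.0"):
    """Return an HTTP GET request line (hypothetical helper, not an EMBOSS function).

    http_version plays the role of EMBOSS_HTTPVERSION or the httpversion
    database attribute; the default of "1.0" follows the ChangeLog entry.
    """
    # The value must be a floating point number: float() raises
    # ValueError for anything else, e.g. "abc".
    float(http_version)
    # The version is only written into the header; as in the entry above,
    # nothing else changes behaviour for different versions.
    return "GET %s HTTP/%s\r\n" % (path, http_version)


if __name__ == "__main__":
    print(repr(build_request_line("/query")))
    print(repr(build_request_line("/query", "1.1")))
```

With the default of 1.0 the server should not reply with chunked transfer encoding, which is the reason the entry gives for that default.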
Added modified Harrison solubility probability to PEPSTATS

ACD attributes now have descriptions in the ajacd.c code which are reported by 'entrails'. All ACD attributes have been checked by inspection of the code to note those which are used/unused by ACD.

The ACD "type" attribute for files is renamed "standardtype" to reflect its intended use to note standard file types for linking applications. Sequences and alignments still have a "type" attribute for protein or dna sequence types.

aaindexextract (new) reads the AAINDEX database and writes each entry to data/AAINDEX directory.

New function ajFileDataDirNew to read data files from a named directory. New ACD datafile attribute 'directory' passed to ajFileDataDirNew. AAINDEX directory defined for pepwindow and pepwindowall.

palindrome can now read in multiple sequences

palindrome now does not print a '|' in an alignment where there is a mismatched pair of bases.

Added filelist datatype to ACD

mwcontam program added. Displays molecular weights that are common across a set of files.

showfeat - added '-sort join' to display joined features on one line.

diffseq - don't give summary of SNPs if the sequences are proteins.

Inclusion of stat64 and readdir64 for offsetbits=64 (ajfile.c and ajsys.c)

Workaround for broken Solaris readdir64_r (jembossctl)

infoseq can now optionally display GI and Sequence Version numbers.

notseq can now read in a file of sequence names.

Added '-alternative' qualifier to transeq to allow reverse frame translations to be done using the codons counted from the start of the reversed sequence, rather than, by default, using the codons of the corresponding forward frame.

Added the qualifier '-join' to the program extractfeat. If '-join' is set then joined features, such as 'CDS' and 'mRNA' are output as a single concatenated sequence.
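The difference between the two reverse-frame conventions in the transeq '-alternative' entry above can be sketched in Python as follows. This is a minimal illustration based only on the wording of that entry; the function names are hypothetical, only the codon sets are computed (no translation), and transeq itself may differ in detail.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse-complement a DNA string (plain A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def codons(seq, offset=0):
    """Split seq into complete codons, starting at a 0-based offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def reverse_frame_default(seq, frame):
    """Default convention: reverse frame -N reuses the codons of forward
    frame N, each reverse-complemented and read in the opposite order."""
    return [revcomp(c) for c in reversed(codons(seq, frame - 1))]


def reverse_frame_alternative(seq, frame):
    """'-alternative' convention: codons are counted from the start of
    the reversed (reverse-complemented) sequence."""
    return codons(revcomp(seq), frame - 1)


if __name__ == "__main__":
    seq = "ATGCCGTA"  # length 8, not a multiple of 3
    print(reverse_frame_default(seq, 1))      # ['CGG', 'CAT']
    print(reverse_frame_alternative(seq, 1))  # ['TAC', 'GGC']
```

The two conventions give the same codons when the sequence length is a multiple of three, and differ otherwise because the leftover bases fall at opposite ends.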
Changed the default output filename from 'stdout' to a file for the following: infoalign megamerger merger showalign showfeat showseq textsearch

Lindna/cirdna can now draw filled boxes and the user can change the text size on the command-line. They can also read and display complete genomic sequences.

Major new revision of protein structure applications - w/o full documentation.

New applications have been added:
  pdbparse.c / acd
  scopseqs.c / acd
  scopnr.c / acd
  seqsearch.c / acd
  seqwords.c / acd
  seqalign.c / acd
  hetparse.c / acd
  scopreso.c / acd
  scoprep.c / acd
  profgen.c / acd
  funky.c / acd
  hmmgen.c / acd
  fraggle.c / acd

Some applications have been deleted:
  scope.c / acd
  nrscope.c / acd
  psiblasts.c / acd
  swissparse.c / acd
  alignwrap.c / acd
  dichet.c / acd

The deleted applications have been replaced as follows:
  coordenew --> pdbparse (coordnew was deleted a while back)
  scope --> scopparse
  nrscope --> scopnr
  psiblasts --> seqsearch
  swissparse --> seqwords
  alignwrap --> seqalign

New versions of code have been committed:
  pdbparse.c / acd
  domainer.c / acd
  contacts.c / acd
  interface.c / acd
  pdbtosp.c / acd
  scopparse.c / acd
  scopreso.c / acd
  scopseqs.c / acd
  scopnr.c / acd
  scoprep.c / acd
  scopalign.c / acd
  seqsearch.c / acd
  seqwords.c / acd
  seqsort.c / acd
  seqnr.c / acd
  seqalign.c / acd
  siggen.c / acd
  sigscan.c / acd
  sigplot.c / acd
  hetparse.c / acd
  profgen.c / acd
  funky.c / acd
  hmmgen.c / acd
  Plus ajxyz.c / ajxyz.h

Short summaries of the applications are as follows:
  pdbparse - Parses pdb files and writes cleaned-up protein coordinate files.
  domainer - Reads protein coordinate files and writes domains coordinate files.
  contacts - Reads coordinate files and writes files of intra-chain residue-residue contact data.
  interface - Reads coordinate files and writes files of inter-chain residue-residue contact data.
  pdbtosp - Convert raw swissprot:pdb equivalence file to embl-like format.
  scopparse - Converts raw scop classification files to a file in embl-like format.
  scopreso - Removes low resolution domains from a scop classification file.
  scopseqs - Adds pdb and swissprot sequence records to a scop classification file.
  scopnr - Removes redundant domains from a scop classification file.
  scoprep - Reorder scop classification file so that the representative structure of each family is given first.
  scopalign - Generate alignments for families in a scop classification file by using STAMP.
  seqsearch - Generate files of hits for families in a scop classification file by using PSI-BLAST with seed alignments.
  seqwords - Generate files of hits for scop families by searching swissprot with keywords.
  seqsort - Reads multiple files of hits and writes a non-ambiguous file of hits (scop families file) plus a validation file.
  seqnr - Removes redundant hits from a scop families file.
  seqalign - Generate extended alignments for families in a scop families file by using CLUSTALW with seed alignments.
  siggen - Generates a sparse protein signature from an alignment and residue contact data.
  sigscan - Scans a signature against swissprot and writes a signature hits file.
  sigplot - Reads a signature hits file and validation file and generates gnuplot data files of signature performance.
  profgen - Generates various profiles for each alignment in a directory.
  hmmgen - Generates a hidden Markov model for each alignment in a directory.
  hetparse - Converts raw dictionary of heterogen groups to a file in embl-like format.
  funky - Reads clean coordinate files and writes file of protein-heterogen contact data.

Updated "make check" program entrails. Corrected sequence format reports, added report and alignment formats and database access methods.

Added scripts/logreport1.pl to report EMBOSS usage from the logfile. Takes the logfile name on the command line. Reports total use, most active user, and total user count.

extractseq now only reads one sequence as input.
Version 2.4.1

Fixed error reading multiple databases

Fixed MacOSX reading of incomplete sequence files

Fixed indexing of REFSEQ

Version 2.4.0

New Jemboss authorising server code. This uses a new set-uid program (jembossctl) to perform tasks as the user.

New alignment output format "match" for wordmatch, reports the length, sequence names, and range in each sequence.

emboss.default.template has been changed to include the new SRSWWW access method and the fields definitions for the test databases.

In dbiblast, renamed the -filename option -filenames to match the other dbi indexing programs, and because wildcard filenames are supported.

Removed the -staden option for the dbi indexing programs. This had no effect (it was originally included to rename files as division.lookup for use by internal utilities at the Sanger Centre).

In qatest.pl test script, added test for missing expected file. Only seen for obsolete secondary output files, no tests were passing that should have failed.

Script (scripts/dbilist.pl) to report the contents of EMBLCD database indices created by dbiflat, dbigcg, dbifasta or dbiblast.

Proxy HTTP access for remote servers. Define EMBOSS_PROXY as an environment variable, or in emboss.defaults. Can also be set for any database as proxy: "hostname:port" or overridden with proxy: ":" to use a local server for a database. This is used by both the URL and SRSWWW access methods.

New ajListUnique function to remove duplicate nodes in a list.

New embxyz.c / .h embXyzSeqsetNRRange functions added

Report format "table" is the default for several applications. In this format, the sequence USA has been removed because it already appears in the sequence header part of the report. A new format "-rformat nametable" will produce the previous report output for users who are relying on parsing it.

Output files defined with the "nullok" attribute in ACD are not created unless requested. The file name and extension are ignored.
It is possible to add a new associated qualifier to control this behaviour, but its use may be confusing with more than one output file.

Precision attribute for report score (default is 3). Other floating point report values are written as strings by the original application so their precision is defined in the code. The score is a float, as part of the internal (GFF) feature structure. A zero value produces an integer score (strictly, it uses %.0f as the format).

Set precision for etandem, fuzznuc, fuzzpro, fuzztran, patmatdb, patmatmotifs (integer scores) and restrict (no score)

Report output for equicktandem and etandem, with -origfile to write the original output format for sites (Sanger for example) who still require it. By default, the origfile output file is not created.

Report output for patmatdb and patmatmotifs. For patmatmotifs the prosite documentation appears in the report footer, with the addition of the motif name and the number of matches in the sequence.

Report headers and footers automatically trim last newline.

Reports in -rformat SeqTable right-align numbers.

Report output for marscan (-rformat GFF by default)

Report output for fuzztran (-rformat table with the translation included as a report field). Using -rformat seqtable with fuzztran now also shows the original DNA sequence.

Report output for fuzznuc and fuzzpro (-rformat SeqTable by default)

New report qualifiers -raccshow to include accession in header and -rdesshow to include description in header

Two access methods "file" and "offset" were defined as valid in database definitions, but are really reserved for simple file reading. They are removed from the database access methods list.

Two access methods "cmd" and "nbrf" are obsolete (cmd was never implemented, nbrf is replaced by gcg which includes a query mechanism). Both are removed from the database access methods list, and the source code is commented out.
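The effect of the report score precision attribute described above (default 3, with zero giving an integer-looking score through %.0f) can be illustrated with printf-style formatting in Python. The function name is hypothetical and this is only a sketch of the formatting rule, not the EMBOSS report code.

```python
def format_score(score, precision=3):
    """Format a report score with the given number of decimal places.

    A precision of 0 uses the equivalent of %.0f, so the score is
    printed without a decimal point.
    """
    # "%.*f" takes the precision from the argument tuple, as in C printf.
    return "%.*f" % (precision, score)


if __name__ == "__main__":
    print(format_score(12.3456))   # 12.346
    print(format_score(42.7, 0))   # 43
```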
SRS, SRSFASTA and SRSWWW database access can read all entries. This is not recommended for SRSWWW access because it will read everything into memory - all of EMBL for example - then strip out HTML tags before reading. For SRS it is not recommended because "methodall: direct" is faster. For SRSFASTA it is necessary because using SRSFASTA implies EMBOSS does not read the original data format. However, not implementing an "all" search left a gap in the SRS access methods which would generate a bad SRS command line or URL.

NBRF sequence reading trims last character only if it is '*' to catch cases where SRS reports the sequence as 'plain'

GCG database text has the spaces in ". ." strings removed.

Database entry text and sequence saved for binary formats (GCG, BLAST) for use by entret and other applications

dbiblast indices with split databases (formatdb -v) fixed for reading all entries (was only reading the first file)

dbiblast and dbigcg indices support exclude and file definitions to create database subsets

Database include and file definitions can use the simple filename. In some cases the full path was used. Database files are checked both with and without the directory path for back-compatibility.

srswww access method created to query a remote web server. Preferred to using URL access as SRS queries can be built.

Sequence objects include the SeqVersion, Keyword list and Taxonomy list. The GI number is read as an alternative SeqVersion where it is available (GenBank and some NCBI formats). The GI number is reported in GenBank format if available, but the GenBank VERSION line may have only the SeqVersion if, for example, the sequence was read from an EMBL entry. "sv" queries check both the SeqVersion and GI number.

Accession numbers have a strict definition, which covers the old and new EMBL/GenBank format, SwissProt, PIR, and REFSEQ (NM_nnnnnn). Earlier versions would accept any "accession number" in some sequence formats, especially NCBI format.
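The rule above that "sv" queries check both the SeqVersion and the GI number can be sketched as a simple test on a sequence record. The function name, the case-insensitive comparison and the example values are assumptions for illustration; wildcard queries are not handled, and this is not the EMBOSS implementation.

```python
def matches_sv_query(query, seqversion=None, gi=None):
    """Return True if an "sv" query matches the SeqVersion or the GI number.

    seqversion and gi are the values held on the sequence (either may be
    missing). Comparison is case-insensitive here, which is an assumption.
    """
    wanted = query.strip().lower()
    for value in (seqversion, gi):
        if value is not None and str(value).lower() == wanted:
            return True
    # Neither field is present or neither matches: the query fails.
    return False


if __name__ == "__main__":
    # Made-up identifiers, used only to exercise the function.
    print(matches_sv_query("X00001.1", seqversion="X00001.1", gi="12345"))
    print(matches_sv_query("12345", seqversion="X00001.1", gi="12345"))
    print(matches_sv_query("12345", seqversion="X00001.1"))
```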
SeqVersion (EMBL SV line, GenBank VERSION line) is used in preference to accession number where available. Can also be read in FASTA and NCBI formats. Where only the SeqVersion is available, the accession number is generated.

USA queries implement searches by SV, DES, ORG and KEY. These work with SRS access methods (SRS, SRSFASTA, SRSWWW) by building SRS queries, and with direct access (simple file reading) by testing the sequence object.

Key and Org queries are for full keywords (including spaces) and for each level of the taxonomy.

Des queries, if the access method does not provide a mechanism (that is, if it does not have its own index), are applied to words within the description. Words start with a letter or number, and end with a letter or number. SRS typically does the same, but allows a single quote at the end. This catches words such as 3' and 5' but is a problem with some quoted text.

Queries for ID ACC SV DES ORG and KEY are valid for all file access methods, including URL, external, cmd, app, file and by default any new method added. If the internal query data is not flagged by the access method (to show the database has been queried) the sequence object is automatically tested. Missing description, keyword, organism, or seqversion fields cause queries to fail if they are used on inappropriate data.

dbiflat, dbigcg, dbifasta and dbiblast can index the new fields. All fields are available in dbiflat and dbigcg. The sv and des fields are available in dbifasta and dbiblast. If any specific formats make it possible to parse the org (or key) field they can be added as new formats.

The new EMBLCD index files are named as follows: des for the descriptions (no obvious standard name), seqvn for the seqversion (no obvious standard name), keyword for keywords (EMBLCD distribution name) and taxon for organism (EMBLCD distribution name).
The EMBLCD distribution also included a freetext index which is similar to the SRS alltext search so we did not use the name for the description index.

We are working through the EMBLCD format documentation to make EMBOSS indices more compatible. For example, all tokens in the TRG index files should have trailing spaces. We use a NULL to mark the end of the string.

EMBLCD index files now expand to fit the longest token, including the entryname index which was limited to 12 characters (only one site reported a problem with this in dbifasta with long ID names). A new qualifier -maxindex sets an upper limit (25 is recommended) to limit the size of all index files. Currently this applies to all indices. We can add separate maxima for each field if needed. We expect very few sites to use the extra index fields as SRS is a simpler alternative.

New database definition token 'fields' with a list of indexed fields can be set to 'sv des org key' for SRS databases.

USAs check the query field against the database 'fields' definition. ID and ACC are always allowed. dbname:name still searches ID and ACC (no change from previous version)

USAs with a filename can include the new query fields. The syntax is filename:field:query, for example empro.dat:id:eclaci (the extended syntax is because empro.dat-id:eclaci looks like a filename ending in -id)

Application 'tranalign' added. This aligns nucleic coding regions based on a set of aligned proteins.

Version 2.3.1

est2genome fixed for large alignments (over 40Mbase for est * genomic sequence length)

sequence reading for ABI files fixed (and selex files tested)

genbank feature input working

pepinfo PNG output larger to make the text readable (only affects PNG output)

empty sequence file input fails gracefully

empty sequence input fails gracefully (and only needs one ^D from stdin)

Version 2.3.0

Seqretall, seqretallfeat and seqretset moved to 'make check'. Seqret has all the functionality of the above.
Fix for NBRF accession number reading (ajseqread.c)

Whichdb program added.

Fix for dbifasta and wormpep

Fix for problem reading plain format sequences by primer3.

Primer3 renamed eprimer3 to avoid conflicts with the Whitehead's primer3 version 3.0.6.

transeq's '-frame' can have a list of values, as: '-frame=1,2,3'

Non-existent files in lists are again ignored

Various wildcard database search fixes

ESIM4 added as an embassy package