ExtractProp 1.0

ExtractProp is a Java-based application for processing the output of programs that computationally predict biological properties. It has been designed for high-throughput processing, with an auto-sense capability that detects the output type from a number of prediction programs and formats and processes each appropriately. The result is a tagged, near-XML document containing many properties of interest.

Note: In some cases, the output of a prediction program does not contain enough information to assign property values to a specific sequence. When the prediction program does not generate sequence information, that information must be added to the output file before it is processed. Examples are included below for Mitoprot and Predotar.

ExtractProp has been developed at the Ohio Supercomputer Center with support from the National Science Foundation through a project 2010 award to Drs. Meier and Rose of the Ohio State University. Correspondence regarding the program, including feature suggestions and requests, should be directed to the author at eas@osc.edu.

System Prerequisites:
A Java runtime environment version 1.4 or later is required on the target machine.

Accessing the application:
The ExtractProp Java jar file can be downloaded from here.

Application Installation:
Installation of the ExtractProp application is simple: place the ExtractProp.jar file in a known location and update the CLASSPATH variable for your environment to include this location. In many cases, it is sufficient to copy the ExtractProp.jar file into the working directory of interest.

Running the application:
ExtractProp processes named files. The application has been designed to support remote file access (provided system access permissions have been established), and consequently uses a full path specification and URI to identify an input file. Because ExtractProp automatically scans for known file types, no additional command line options are required. Output is written to the standard output stream.

1) For Unix/Linux:

Assuming the CLASSPATH variable has been set and exported, the following command processes the file at /usr/tmp/file.text and writes the output to standard out.

% java -jar ExtractProp.jar //usr/tmp/file.text

If the CLASSPATH variable has not been set and exported, the following command processes the same file. Here the ExtractProp.jar file is located at /var/Extract/ExtractProp.jar.

% java -jar /var/Extract/ExtractProp.jar //usr/tmp/file.text

2) For Windows:

Assuming the CLASSPATH variable has been set, the following command processes the file at c:\usr\tmp\file.text and writes the output to standard out.

C:\> java -jar ExtractProp.jar //usr/tmp/file.text

If the CLASSPATH variable has not been set, the following command processes the same file. Here the ExtractProp.jar file is located at c:\var\Extract\ExtractProp.jar.

C:\> java -jar /var/Extract/ExtractProp.jar //usr/tmp/file.text

ExtractProp Output Format:

The following tagged fields are generated by the program:

<seqprop></seqprop> : the enclosing tags for an ExtractProp record
<key> </key> : sequence identifier
<application></application> : the application ExtractProp has associated with the examined output file
<name> </name> : the name of the property (no whitespace)
<value> </value> : the value for the property
<start> </start> : the starting domain for the property (use 'start' for global properties)
<end> </end> : the ending domain for the property (use 'end' for global properties)
<description> </description> : the text description to associate with the named property
<date> </date> : the date ExtractProp was run to create the record

Example:

<seqprop> <key> Q9SU47</key>
<application> Predotar</application>
<name> CHLOROPLAST_VALUE</name>
<description> Predotar predicted chloroplast value</description>
<value> 0.3504</value>
<start> start</start><end> end</end>
<date> 2004-9-9</date> </seqprop>

By design, the output file is not fully XML compliant in version 1.0. To turn the current output file into an XML compliant document, add a start and end tag of your choice at the beginning and end of the file.
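As an illustration of the wrapping step described above, the following Python sketch encloses ExtractProp output in a single root element and reads the records with the standard library XML parser. The root tag name seqprops and both function names are assumptions made for this example, not part of ExtractProp. The sketch also assumes that property values contain no unescaped characters such as & or <.

```python
import xml.etree.ElementTree as ET


def wrap_records(extractprop_output, root_tag="seqprops"):
    """Enclose ExtractProp records in one root element (tag name is arbitrary)."""
    return "<{0}>\n{1}\n</{0}>".format(root_tag, extractprop_output.strip())


def parse_records(extractprop_output):
    """Return one dict per <seqprop> record, with surrounding whitespace trimmed."""
    root = ET.fromstring(wrap_records(extractprop_output))
    return [
        {field.tag: (field.text or "").strip() for field in record}
        for record in root.iter("seqprop")
    ]


# The example record shown earlier in this document.
sample = """<seqprop> <key> Q9SU47</key>
<application> Predotar</application>
<name> CHLOROPLAST_VALUE</name>
<description> Predotar predicted chloroplast value</description>
<value> 0.3504</value>
<start> start</start><end> end</end>
<date> 2004-9-9</date> </seqprop>"""

records = parse_records(sample)
print(records[0]["key"], records[0]["name"], records[0]["value"])
# prints: Q9SU47 CHLOROPLAST_VALUE 0.3504
```

Trimming the values is a choice made here because the example record carries a leading space inside each tag.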

Special processing notes:

MULTICOIL Analysis
The full residue detail output file from a MULTICOIL run is required for processing by ExtractProp. The program extracts information about coiled-coils, dimers and trimers into the resulting XML file. Overall percentage coverage is also calculated for each protein sequence. Inter-coil gaps of 20 or fewer residues are treated as contiguous with a minor gap.

Input: The detailed output file is used as input for extracting coiled-coil coverage information from the MULTICOIL program. It is very important that the FASTA file used as input to the MULTICOIL analysis have the sequence identifier as the second token in the header description. This ensures that an identifier is available for use in the detailed output. This second field in the FASTA header is used as the sequence identifier in the resulting XML output file.
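The gap rule mentioned above can be pictured with a short Python sketch. It is not ExtractProp's implementation. It assumes that coiled-coil segments are given as inclusive (start, end) residue positions, that a gap is the number of residues strictly between two segments, and that coverage counts only residues inside predicted segments. The function names are hypothetical.

```python
def merge_segments(segments, max_gap=20):
    """Join segments separated by max_gap or fewer residues (assumed gap definition)."""
    merged = []
    for start, end in sorted(segments):
        # Residues strictly between the previous segment and this one.
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(segment) for segment in merged]


def percent_coverage(segments, sequence_length):
    """Percentage of residues inside predicted segments (gaps are not counted)."""
    # max_gap=0 only unions overlapping or directly adjacent segments.
    covered = sum(end - start + 1 for start, end in merge_segments(segments, 0))
    return 100.0 * covered / sequence_length


segments = [(10, 40), (55, 90), (150, 180)]
print(merge_segments(segments))         # [(10, 90), (150, 180)]
print(percent_coverage(segments, 200))  # 49.0
```

In this example the 14-residue gap between positions 40 and 55 is bridged, while the 59-residue gap between positions 90 and 150 is not.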

HMMTOP Analysis
Input: No special detail is required for processing HMMTOP generated output. Sequence identifiers should be the first token in the header of the FASTA file.

Mitoprot Analysis
Mitoprot requires sequences to be run individually. Additionally, the Mitoprot program does not include sequence identifier information in its output. Consequently, the Mitoprot output file must be enhanced before ExtractProp can process it to create the XML record.

These three enhancements are required:

The string [BEGIN MITOPROT] should be on a separate line at the beginning of the Mitoprot output for each sequence.

The string [SEQUENCE KEY] should be placed after the [BEGIN MITOPROT] line and must be followed by the sequence identifier chosen for the sequence processed.

The string [END MITOPROT] signals the end of an individual Mitoprot output file.

An example enhanced output file is shown below. This output was generated for the sequence with the identifier Q9ZWA5:

[BEGIN MITOPROT]
[SEQUENCE KEY] (Q9ZWA5)
Complete results are written in mitoprot.mitoprot

  MitoProt II 1.0a4
File : /tmp/mitoprot.16274
Sequence name : mitoprot
Sequence length : 991

VALUES OF COMPUTED PARAMETERS
Coef20    :   3.434
CoefTot   :  -0.252
ChDiff    : -18
ZoneTo    :   5
KR        :   1
DE        :   0
CleavSite :   0

          HYDROPHOBIC SCALE USED
              GES       KD     GVH1      ECS
H17      :   1.141    1.200   -0.014    0.468
MesoH    :  -0.837    0.175   -0.477    0.177
MuHd_075 :  14.329    6.502    3.897    1.390
MuHd_095 :  46.064   15.577   10.343    7.407
MuHd_100 :  38.926   16.369    9.148    6.663
MuHd_105 :  29.202   16.845    7.328    5.555
Hmax_075 :  -1.517    5.017   -1.391    0.898
Hmax_095 :   9.900    8.400    0.889    3.660
Hmax_100 :  11.800    9.000    1.192    3.350
Hmax_105 :  -0.900    3.900   -1.213    1.940

CLASS NOT-MITO MITO(/CHLORO)
DFM : 0.8981 0.1019
DFMC : 0.7663 0.2337

[SEQUENCE KEY] (Q9ZWA5)
[END MITOPROT]
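Because Mitoprot is run once per sequence, the marker lines are usually added by a script. The Python sketch below wraps raw Mitoprot output in the markers shown in the example above. The function name is hypothetical. Writing the key in parentheses and repeating the [SEQUENCE KEY] line before [END MITOPROT] simply mirror the example; whether ExtractProp requires either detail is an assumption.

```python
def enhance_mitoprot_output(raw_output, sequence_key):
    """Wrap one raw Mitoprot result in the marker lines ExtractProp looks for."""
    # Layout copied from the documented example: key in parentheses,
    # key line repeated just before the end marker (assumed, not confirmed).
    key_line = "[SEQUENCE KEY] ({})".format(sequence_key)
    return "\n".join([
        "[BEGIN MITOPROT]",
        key_line,
        raw_output.strip("\n"),
        key_line,
        "[END MITOPROT]",
    ]) + "\n"


# Abbreviated stand-in for real Mitoprot output.
raw = "Sequence name : mitoprot\nSequence length : 991\n"
enhanced = enhance_mitoprot_output(raw, "Q9ZWA5")
print(enhanced)
```

Results for several sequences can then be concatenated into one input file, since each block carries its own begin and end markers.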

Predotar Analysis
The Predotar output does not contain sufficient information to identify it as Predotar output. As a result, a minor enhancement must be added to the output file.

The string [BEGIN PREDOTAR] is added at the beginning of the file, and the string [END PREDOTAR] is added after the final output line generated by the application.

The additional line [SEQUENCE ID] [MITOCHONDRIAL] [CHLOROPLAST] is added as column headers to help ExtractProp confirm that the correct values are selected.

The following is an example Predotar file which has been enhanced.

[BEGIN PREDOTAR]
[SEQUENCE ID] [MITOCHONDRIAL] [CHLOROPLAST]

Q9FJ79 9.894315198356839E-7 0.0010273272490437212
Q9FGD8 9.787792254608695E-7 0.004473320922528689
Q9SH47 9.743161706275187E-6 6.890358235439495E-4
Q9CAC9 9.743161706275187E-6 6.890358235439495E-4
Q9C9N6 9.720276978840923E-4 0.0037400932138684615
Q9SX40 9.411239481288856E-4 0.022178452291817547
Q9LIQ9 9.023725788156347E-5 0.10612034625623434
Q9SUE1 8.660733106178467E-7 0.001367730946961291
O23064 8.587771705289504E-7 0.004845230881323344
Q9MAA6 8.578129642935098E-4 0.2612661772272155
Q9SZK7 8.564411744542646E-4 0.03359518642675987

[END PREDOTAR]
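The same kind of wrapping can be scripted for Predotar. The Python sketch below adds the begin marker, the column header line and the end marker around raw Predotar output. The function name is hypothetical, and the blank lines around the data rows only mirror the example above; whether ExtractProp needs them is an assumption.

```python
def enhance_predotar_output(raw_output):
    """Add the Predotar marker and column header lines around raw output."""
    return "\n".join([
        "[BEGIN PREDOTAR]",
        "[SEQUENCE ID] [MITOCHONDRIAL] [CHLOROPLAST]",
        "",  # blank line as in the documented example (assumed optional)
        raw_output.strip("\n"),
        "",
        "[END PREDOTAR]",
    ]) + "\n"


# Two rows taken from the example file above.
raw = (
    "Q9FJ79 9.894315198356839E-7 0.0010273272490437212\n"
    "Q9FGD8 9.787792254608695E-7 0.004473320922528689\n"
)
enhanced = enhance_predotar_output(raw)
print(enhanced)
```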

General Format Analysis:

The application also supports a general format, which looks like the following:

EXTRACTPROP GENERAL
sequenceid start end property_name property_value property_description
.
(additional property descriptions)
.

Fields are whitespace separated. The sequenceid, start, end, property_name and property_value fields are each a single token. The property_description field consists of all remaining tokens on the line. Multiple property lines may be present in a given file.

For example:

EXTRACTPROP GENERAL
Q9AW89 56 88 max_multicoil_prob 0.9988

This line defines a property named 'max_multicoil_prob' with a value of 0.9988 over the range 56 to 88 (the maximum multicoil probability in that domain) for the sequence with identifier Q9AW89.
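General format files are easy to generate from other tools. The following Python sketch writes property lines in the layout described above and rejects any single-token field that contains whitespace. The function names are hypothetical, and the description text passed in the example is illustrative only.

```python
def general_format_line(sequence_id, start, end, name, value, description=""):
    """Build one property line; the first five fields must be single tokens."""
    tokens = [str(sequence_id), str(start), str(end), str(name), str(value)]
    for token in tokens:
        # split() returns [token] only for a non-empty token without whitespace.
        if token.split() != [token]:
            raise ValueError("field must be a single token: {!r}".format(token))
    line = " ".join(tokens)
    if description:
        # The description takes all remaining tokens on the line.
        line += " " + " ".join(description.split())
    return line


def general_format_file(rows):
    """Build a complete general format file from an iterable of field tuples."""
    lines = ["EXTRACTPROP GENERAL"]
    lines += [general_format_line(*row) for row in rows]
    return "\n".join(lines) + "\n"


rows = [
    ("Q9AW89", 56, 88, "max_multicoil_prob", 0.9988,
     "The maximum multicoil probability in the domain"),
]
print(general_format_file(rows))
```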

Multicolumn Format Analysis:

The general format is flexible, but files in it can be verbose and lengthy to create. Consequently, a condensed multicolumn format has been developed to support input from a more compact tabular property representation. Note that this format does not retain the property description in the current release.

An optional delimiter may be specified as the field separator. This is particularly useful when field values contain whitespace.

An example is shown:

EXTRACTPROP MULTICOLUMN
DELIMITER %
KEY % PROPNAME_A % PROPNAME_B
QSDV245 % high % 0.4567

Here the application creates two property records for identifier QSDV245, one for PROPNAME_A and one for PROPNAME_B. Multiple sequence and property combinations can be placed in the same file.
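To show how one multicolumn row expands into several property records, here is a small Python sketch that reads a multicolumn file into (key, property name, value) tuples. It is an illustration, not ExtractProp's parser. It assumes that the character on the DELIMITER line separates every field in the header and data rows, that fields are split on whitespace when no DELIMITER line is present, and that surrounding spaces are not part of a value. The function name is hypothetical.

```python
def read_multicolumn(text):
    """Expand a multicolumn file into (key, property_name, value) tuples."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or lines[0] != "EXTRACTPROP MULTICOLUMN":
        raise ValueError("missing EXTRACTPROP MULTICOLUMN header")
    body = lines[1:]

    delimiter = None  # None makes str.split() split on runs of whitespace
    if body and body[0].startswith("DELIMITER "):
        delimiter = body[0].split(None, 1)[1]
        body = body[1:]
    if not body:
        raise ValueError("missing column header row")

    def split_fields(line):
        return [field.strip() for field in line.split(delimiter)]

    header = split_fields(body[0])
    names = header[1:]  # first column holds the sequence key
    records = []
    for line in body[1:]:
        fields = split_fields(line)
        if len(fields) != len(header):
            raise ValueError("wrong number of fields: {!r}".format(line))
        records.extend((fields[0], name, value)
                       for name, value in zip(names, fields[1:]))
    return records


sample = """EXTRACTPROP MULTICOLUMN
DELIMITER %
KEY % PROPNAME_A % PROPNAME_B
QSDV245 % high % 0.4567
"""
print(read_multicolumn(sample))
# [('QSDV245', 'PROPNAME_A', 'high'), ('QSDV245', 'PROPNAME_B', '0.4567')]
```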

Additional capabilities and formats
Additional capabilities are provided for Smith-Waterman comparisons generated using the TimeLogic system. These are not fully supported in the present release.

Documentation Last Updated – September 10, 2004

Citation

Annkatrin Rose, Sankaraganesh Manikantan, Shannon J. Schraegle, Michael A. Maloy, Eric A. Stahlberg, and Iris Meier (2004). Genome-wide Identification of Arabidopsis Coiled-Coil Proteins and the Establishment of the ARABI-COIL Database. Plant Physiology, 134, 927-939.