DDF2GO

DDF2GO uses high-throughput differential detergent fractionation (DDF) proteomics data to assign experimentally based Gene Ontology (GO) cellular component (CC) annotations. For more information, see the paper “A high-throughput experiment-based method to identify subcellular localization,” Lakshmi R Pillai, et al.

For more information about DDF, see McCarthy FM, et al., “Differential Detergent Fractionation for Non-electrophoretic Eukaryote Cell Proteomics,” Journal of Proteome Research 2005, 4:316-324.

DOWNLOAD TOOL

DDF2GO.zip

INSTALLATION

Requirements

  1. Perl v5.10.1 or greater
  2. LWP v6.0 or greater
  3. TMPred v1.0
  4. PSORT II v981201

Configuration

DDF2GO can be configured in two ways:

  1. Run DDF2GO with the --configure option. It displays a series of questions and configures itself from your answers.
  2. Edit the configuration section of the DDF2GO file directly. Only do this if you have experience with Perl. The configuration section is well documented and should be easy to decipher.

Testing

Run the following command and check if the file mapped.test is identical to t/g_gallus.stroma.mapped.txt:
./DDF2GO.pl  --frac 'cyt|CYT' 'mem|MEM' 'nuc|NUC' 'csm|CSM' \
              -i t/Stroma.filtered -d t/G_gallus.fasta.stroma \
              -p mapped.test \
              -r margin.test \
              --id_mod '[^|]*$'

If any error messages are displayed or the files are not identical, then either DDF2GO is not configured properly, or the TMD prediction tools are not the correct versions.
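
The comparison in this test can be done with any diff tool. As a minimal sketch, assuming Python is available, the helper below (the name outputs_match is hypothetical and not part of DDF2GO) reports whether two files are byte-identical:

```python
import filecmp


def outputs_match(produced: str, expected: str) -> bool:
    """Return True when the two files have byte-identical contents."""
    # shallow=False forces a comparison of file contents rather than
    # relying only on size and modification time.
    return filecmp.cmp(produced, expected, shallow=False)
```

For the test above it would be called as outputs_match("mapped.test", "t/g_gallus.stroma.mapped.txt"), run from the DDF2GO directory after the test command has finished.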

RUNNING DDF2GO

Inputs

DDF2GO needs at least one protein file (the output file from PepMOD) and a fasta file containing the amino acid sequences for the proteins in the protein file.

Protein File -i

PepMOD is a data filtering tool designed to filter X!Tandem search engine results based on p-values calculated from a random score distribution. It can be found on the iPlant Collaborative Discovery Environment. If you don't wish to use PepMOD, then the protein file needs to meet the following format specifications:

  1. A line that begins with "Description"; anything above this line will be ignored
  2. At least one protein/peptide section including the following lines:
    1. A line with the name of the protein as the first column
    2. A line containing the peptide headers. The first column must be "Scan", the column containing the test statistic must be named "HyperScore", and the last column is used to get the DDF Fraction
    3. At least one line containing the peptide information outlined in the peptide headers. "Scan", the first column, must be an integer, "HyperScore" must be a number, and the last column must give the DDF Fraction
Example:
Description
protein1
Scan	HyperScore	Fraction
1	2639.333	4
2	1568.778	4
3	2639.333	1
4	1465.831	2
5	2639.333	2
6	1759.877	2
7	1335.717	2
8	2639.333	3
9	1759.877	3
protein2
Scan	HyperScore	Fraction
10	3059.875	4
11	1148.622	3
protein3
Scan	HyperScore	Fraction
12	1880.937	2
13	1249.633	3
14	1880.937	3

To specify more than one protein file, either list them after a single -i (e.g. -i file1 file2 ...) or repeat the -i option (e.g. -i file1 -i file2 ...).
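
To check a hand-made protein file against the format rules above before running DDF2GO, a small reader can help. The following is only a sketch of those rules in Python, not DDF2GO's own parser. The function name parse_protein_file is hypothetical, and it assumes tab-separated columns and protein names that are not purely numeric.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Peptide = Tuple[int, float, str]  # (scan, hyperscore, fraction label)


def parse_protein_file(lines: Iterable[str]) -> Dict[str, List[Peptide]]:
    """Group peptide rows by protein name, following the rules listed above."""
    proteins: Dict[str, List[Peptide]] = {}
    started = False                    # True once the "Description" line is seen
    current: Optional[str] = None      # protein currently being read
    score_idx: Optional[int] = None    # column index of "HyperScore"
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not started:
            # Everything above the "Description" line is ignored.
            started = line.startswith("Description")
            continue
        if not line.strip():
            continue
        cols = line.split("\t")
        if cols[0] == "Scan":
            # Peptide header line; raises ValueError if "HyperScore" is absent.
            score_idx = cols.index("HyperScore")
        elif cols[0].isdigit() and current is not None and score_idx is not None:
            # Peptide row: integer scan, numeric score, fraction in last column.
            proteins[current].append(
                (int(cols[0]), float(cols[score_idx]), cols[-1])
            )
        else:
            # Any other line starts a new protein section.
            current = cols[0]
            proteins.setdefault(current, [])
            score_idx = None
    return proteins
```

Calling parse_protein_file(open("example.protein.txt")) on the example above would give one entry per protein, each holding its list of (scan, score, fraction) rows.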

Fasta File -d

The Fasta file needs to contain an amino acid sequence for every protein in the protein file. If a fasta file is not given, then DDF2GO tries to parse the GI numbers from the protein names in the protein file and download the sequences from NCBI using EUtils. For DDF2GO to parse the GI numbers, the protein names need to be in the NCBI header format (gi|{gi_num}|{db}|{db_acc}). Otherwise, this process will fail and no sequences will be found. A warning will be printed for every protein in the protein file that couldn't be found in the fasta file or couldn't be downloaded from NCBI.
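
To see in advance whether protein names are in the NCBI header format, the GI number can be extracted with a regular expression. This is an illustrative Python sketch. The pattern and the function name parse_gi_number are assumptions, and DDF2GO's own parsing may differ.

```python
import re
from typing import Optional

# Assumed pattern: the name starts with "gi|", then digits, then another "|".
GI_HEADER = re.compile(r"^gi\|(\d+)\|")


def parse_gi_number(protein_name: str) -> Optional[str]:
    """Return the GI number as a string, or None if the name is not gi|...|."""
    match = GI_HEADER.match(protein_name)
    return match.group(1) if match else None
```

A name such as protein1 from the example protein file yields None, which corresponds to the case where no sequence can be downloaded.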

Example:
>protein1
ASASPSPPRLFPLVLCSPSDSVYTVGCAAFDFQPSSIAFTWFDSNNSSVSGMDVIPKVISGPPYRAVSRI
QMNQSEGKEKQPFRCRAAHPRGNVEVSVMNPGPIPTPNGIPLFVTMHPPSREDFEGPFRNASILCQTRGR
RRPTEVTWYKNGSPVAAAATTATTVGPEVVAESRISVTESEWDTGATFSCVVEGEMRNTSKRMECGLEPV
VQQDIAIRVITPSFVDIFISKSATLTCRVSNMVNADGLEVSWWKEKGGKLETALGKRVLQSNGLYTVDGV
ATVCASEWDGGDGYVCKVNHPDLLFPMEEKMRKTKASNARPPSVYVFPPPTEQLNGNQRLSVTCMAQGFN
PPHLFVRWMRNGEPLPQSQSVTSAPMAENPENESYVAYSVLGVGAEEWGAGNVYTCLVGHEALPLQLAQK
SVDRASGKASAVNVSLVLADSAAACY

>protein2
MAPNIRKSHPLLKMINNSLIDLPAPSNISAWWNFGSLLAVCLMTQILTGLLLAMHYTADTSLAFSSVAHT
CRNVQYGWLIRNLHANGASFFFICIFLHIGRGLYYGSYLYKETWNTGVILLLTLMATAFVGYVLPWGQMS
FWGATVITNLFSAIPYIGHTLVEWAWGGFSVDNPTLTRFFALHFLLPFAIAGITIIHLTFLHESGSNNPL
GISSDSDKIPFHPYYSFKDILGLTLMLTPFLTLALFSPNLLGDPENFTPANPLVTPPHIKPEWYFLFAYA
ILRSIPNKLGGVLALAASVLILFLIPFLHKSKQRTMTFRPLSQTLFWLLVANLLILTWIGSQPVEHPFII
IGQMASLSYFTILLILFPTIGTLENKMLNY

>protein3
MSWSRSILCLLGAFANARSIPYYPPLSSDLVNHINKLNTTGRAGHNFHNTDMSYVKKLCGTFLGGPKAPE
RVDFAEDMDLPDTFDTRKQWPNCPTISEIRDQGSCGSCWAFGAVEAISDRICVHTNAKVSVEVSAEDLLS
CCGFECGMGCNGGYPSGAWRYWTERGLVSGGLYDSHVGCRAYTIPPCEHHVNGSRPPCTGEGGETPRCSR
HCEPGYSPSYKEDKHYGITSYGVPRSEKEIMAEIYKNGPVEGAFIVYEDFLMYKSGVYQHVSGEQVGGHA
IRILGWGVENGTPYWLAANSWNTDWGITGFFKILRGEDHCGIESEIVAGVPRMEQYWTRV
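
Because every protein in the protein file needs a sequence, it can be useful to list the missing ones before a run. The following Python sketch does that under the assumption that a protein name must equal the full FASTA header text after the ">". Both function names are hypothetical.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without the leading '>') to its joined sequence."""
    parts: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines between records are skipped
        if line.startswith(">"):
            header = line[1:]
            parts[header] = []
        elif header is not None:
            parts[header].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


def missing_sequences(protein_names: Iterable[str], fasta: Dict[str, str]) -> List[str]:
    """Return the protein names that have no (or an empty) sequence."""
    return [name for name in protein_names if not fasta.get(name)]
```

With the example files above, missing_sequences(["protein1", "protein2", "protein3"], read_fasta(open("example.fasta"))) should return an empty list.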

Execution

To run the above examples through DDF2GO, the following command was run:
DDF2GO.pl -i example.protein.txt -d example.fasta -p output.mapped -r output.margin
-i is the protein file, -d is the fasta file, -p is the output mapped file, and -r is the output marginal file.

If the fraction names don't follow the above example, then the --frac parameter needs to be used. It takes four regular expressions that correspond to the four DDF Fractions. Make sure that these expressions do not overlap; overlapping expressions will cause fractions to be assigned incorrectly. Using the command in the "Testing" sub-section above as an example, the --frac parameter is set to 'cyt|CYT' 'mem|MEM' 'nuc|NUC' 'csm|CSM'. This will match anything with 'cyt' or 'CYT' to the first DDF Fraction, anything with 'mem' or 'MEM' to the second DDF Fraction, and so on.
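
A quick way to check a set of --frac expressions for overlap is to try them against the fraction labels in your data. The Python sketch below assumes the expressions are matched anywhere in the label (an unanchored search). The function assign_fraction is hypothetical and is not how DDF2GO is necessarily implemented.

```python
import re
from typing import Optional, Sequence


def assign_fraction(label: str, patterns: Sequence[str]) -> Optional[int]:
    """Return the 1-based DDF Fraction whose pattern matches the label.

    Returns None if no pattern matches and raises ValueError if more than
    one pattern matches, which indicates overlapping expressions.
    """
    hits = [i + 1 for i, pattern in enumerate(patterns) if re.search(pattern, label)]
    if len(hits) > 1:
        raise ValueError(f"label {label!r} matches several fractions: {hits}")
    return hits[0] if hits else None
```

Under this assumption, the default expressions 1 2 3 4 would be ambiguous for a label such as "12", which is the kind of overlap to avoid.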

If the protein names don't match exactly in both the protein file and the fasta file, then use the --id_mod parameter. It takes a regular expression and truncates the protein names to match them. Anything that matches the regular expression is removed. Using the same example, the --id_mod parameter is set to '[^|]*$'. This will remove everything after the last '|' to the end of the string (e.g. gi|127513|sp|P01875.2|IGHM_CHICK RecName: Full=Ig mu chain C region becomes gi|127513|sp|P01875.2|).
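
The effect of an --id_mod expression can be previewed on a few names before a full run. This Python sketch assumes that only the first match is removed. The function name apply_id_mod is hypothetical.

```python
import re


def apply_id_mod(protein_name: str, id_mod: str) -> str:
    """Remove the first substring of protein_name matched by id_mod."""
    # count=1 limits the substitution to the first (leftmost) match.
    return re.sub(id_mod, "", protein_name, count=1)


name = "gi|127513|sp|P01875.2|IGHM_CHICK RecName: Full=Ig mu chain C region"
print(apply_id_mod(name, r"[^|]*$"))  # prints gi|127513|sp|P01875.2|
```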

To get a full list of parameters, run DDF2GO.pl --help

Outputs

Mapped File

The mapped file is a tab-delimited text file that contains the proteins that were mapped to GO CC terms.

Column descriptions:

ID
The protein ID that is matched to fasta sequence headers
Protein
The protein name from the fasta file
D|T|TD|S
The DDF Profile
Average TMD
The average predicted TMD from PSORT II and TMPred
GO CC
The mapped GO CC terms, separated by a '|'

Example:
ID	Protein	D|T|TD|S	Average TMD	GO CC
protein2	protein2	0|0|49|51	8.5	GO:0016021 integral to membrane|GO:0043229 intracellular organelle
protein3	protein3	0|34|66|0	0.5	GO:0031224 intrinsic to membrane|GO:0005783 endoplasmic reticulum
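
For downstream processing, each data row of the mapped file can be split into its five columns. This is a minimal Python sketch based only on the column descriptions above. The function name parse_mapped_line is hypothetical, and it assumes GO term names contain no '|' character.

```python
from typing import Dict, List, Tuple, Union

Row = Dict[str, Union[str, float, List[float], List[Tuple[str, str]]]]


def parse_mapped_line(line: str) -> Row:
    """Split one data row (not the header row) of the mapped file."""
    ident, protein, profile, tmd, go_cc = line.rstrip("\r\n").split("\t")
    terms = []
    for term in go_cc.split("|"):
        # Each term is "GO:nnnnnnn name"; split at the first space only.
        go_id, go_name = term.split(" ", 1)
        terms.append((go_id, go_name))
    return {
        "id": ident,
        "protein": protein,
        "profile": [float(x) for x in profile.split("|")],  # D, T, TD, S
        "average_tmd": float(tmd),
        "go_cc": terms,
    }
```

The header row must be skipped before calling this function, because its D|T|TD|S column is not numeric.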
        

Marginal File

The marginal file is a tab-delimited text file that contains the proteins that could not be mapped to a GO CC term with certainty. A protein can't be mapped if its DDF Profile is within 2 percentage points of a mapping boundary. These proteins need to be manually annotated.
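
The 2-percentage-point rule can be illustrated as follows. The actual mapping boundaries are not listed in this document, so the boundaries argument is a placeholder that the caller must supply. The function name is_marginal and the inclusive comparison are assumptions for the purpose of this Python sketch.

```python
from typing import Iterable


def is_marginal(value: float, boundaries: Iterable[float], margin: float = 2.0) -> bool:
    """Return True if value lies within `margin` points of any boundary.

    `boundaries` is a hypothetical list of category boundaries in percent.
    The comparison is inclusive, which is an assumption.
    """
    return any(abs(value - boundary) <= margin for boundary in boundaries)
```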

Column descriptions:

ID
The protein ID that is matched to fasta sequence headers
Protein
The protein name from the fasta file
D|T|TD|S
The DDF Profile
Average TMD
The average predicted TMD from PSORT II and TMPred

Example:
ID	Protein	D|T|TD|S	Average TMD	
protein1	protein1	10|43|24|23	0.5	
        

SAMPLE

The sample subdirectory contains the cow data used in the manuscript mentioned above. The following is a description of the directory contents:
Cow_proteome_OVA.filtered: PepMOD output file
bos_taurus.proteome.fa: FASTA database
cow.ova.mapped.txt: DDF2GO mapped output file
cow.ova.margin.txt: DDF2GO marginal output file

The following command was used to get the DDF2GO output files above:
./DDF2GO.pl  --frac '[OC]1' '[OC]2' '[OC]3' '[OC]4' \
              -i sample/Cow_proteome_OVA.filtered \
              -d sample/bos_taurus.proteome.fa \
              -p sample/cow.ova.mapped.txt \
              -r sample/cow.ova.margin.txt

HELP

To see a list of parameters and their description, run DDF2GO.pl --help

MANDATORY PARAMETERS:
    -i, --input_files FILE...       A list of files with protein/peptide 
                                        information in the format of PepMOD 
                                        output

DATABASE PARAMETERS:
    -d, --fasta FILE                FASTA database file containing the proteins 
                                        used in search. The headers must match 
                                        the input file protein names. If a FASTA
                                        database is not given, DDF2GO will try to 
                                        parse GI numbers from the protein names 
                                        and download the sequences from NCBI.
                                        (Skipping is NOT recommended)
    --save_fasta FILE               If sequences are downloaded from NCBI, this 
                                        option saves them in a file for later use.

OUTPUT PARAMETERS:
    -p, --map, --mapped_file FILE   Output file where all the clearly mapped GO 
                                        CC terms are written. (Default: STDOUT)
    -r, --mar, --margin_file FILE   Output file where all the marginal DDF 
                                        Profiles/TMD are written. A DDF 
                                        Profile/TMD is marginal if it is within 
                                        2 percentage points of a category boundary 
                                        (Default: STDERR)

MISCELLANEOUS PARAMETERS:
    --frac REGEX REGEX REGEX REGEX  4 regular expressions that are used to map the 
                                        sample names to the DDF fractions 
                                        (Default: 1 2 3 4)
    --id_mod                        Regular expression used to truncate the 
                                        protein ids to match the fasta headers. 
                                        The sub-string matched by the regular 
                                        expression is removed.
    --debug                         Keeps the program from deleting the temporary
                                        files on exit
    --phosphorylation               Marks DDF Fractions that contain phosphorylated
                                        amino acids 
                                        (Y +79.9663, T +79.9663, S +79.9663)
    --configure                     Reconfigure DDF2GO