DDF2GO uses high-throughput differential detergent fractionation (DDF) proteomics data to assign experimentally based Gene Ontology (GO) cellular component (CC) annotations. For more information, see the paper “A high-throughput experiment-based method to identify subcellular localization,” Lakshmi R Pillai, et al.
For more information about DDF, see McCarthy FM, et al., “Differential Detergent Fractionation for Non-electrophoretic Eukaryote Cell Proteomics,” Journal of Proteome Research 2005, 4:316-324.
DDF2GO.zip
Requirements
Configuration
DDF2GO can be configured in two ways:
- Running DDF2GO --configure
- Modifying the DDF2GO file
Using the --configure option will display a series of questions; answer each one and the tool will configure itself. Only modify the DDF2GO file if you have experience with Perl. The configuration section is well documented and should be easy to decipher.
Testing
Run the following command and check that the file mapped.test is identical to t/g_gallus.stroma.mapped.txt:

    ./DDF2GO.pl --frac 'cyt|CYT' 'mem|MEM' 'nuc|NUC' 'csm|CSM' \
        -i t/Stroma.filtered -d t/G_gallus.fasta.stroma \
        -p mapped.test \
        -r margin.test \
        --id_mod '[^|]*$'

If any error messages are displayed or the files are not identical, then either DDF2GO is not configured properly or the TMD prediction tools are not the correct versions.
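Any diff tool can be used for the comparison. As one option, the following is a minimal Python sketch that lists the lines that differ between the produced file and the reference file. It is not part of DDF2GO, and the helper name diff_files is hypothetical. It compares line by line, so a difference in line endings alone is not reported.

```python
import difflib
from pathlib import Path


def diff_files(produced, expected):
    """Return unified-diff lines; an empty list means the contents match."""
    expected_lines = Path(expected).read_text().splitlines()
    produced_lines = Path(produced).read_text().splitlines()
    return list(
        difflib.unified_diff(
            expected_lines,
            produced_lines,
            fromfile=str(expected),
            tofile=str(produced),
            lineterm="",
        )
    )


# Example usage after running the test command (paths taken from the text above):
#   differences = diff_files("mapped.test", "t/g_gallus.stroma.mapped.txt")
#   print("identical" if not differences else "\n".join(differences))
```

An empty result means the test passed; otherwise the printed lines show which proteins were mapped differently.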
Inputs
DDF2GO needs at least one protein file (the output file from PepMOD) and a fasta file containing the amino acid sequences for the proteins in the protein file.
Protein File (-i)
PepMOD is a data filtering tool designed to filter X!Tandem search engine results based on p-values calculated from a random score distribution. It can be found on the iPlant Collaborative Discovery Environment. If you don't wish to use PepMOD, then the protein file needs to meet the following format specifications:
- A line that begins with "Description"; anything above this line will be ignored
- At least one protein/peptide section including the following lines:
  - A line with the name of the protein as the first column
  - A line containing the peptide headers, beginning with "Scan", with "HyperScore" as the name of the column containing the test statistic, and with the last column used to get the DDF Fraction
  - At least one line containing the peptide information outlined in the peptide headers. "Scan", the first column, must be an integer, "HyperScore" must be a number, and the last column must give the DDF Fraction
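The format rules above can be sketched as a small Python parser. This is an illustration only, not DDF2GO's own parsing code: the function names are hypothetical, and it assumes that columns are tab-separated (falling back to whitespace), which the specification above does not state.

```python
def split_columns(line):
    # Assumption: columns are tab-separated; fall back to any whitespace.
    return line.split("\t") if "\t" in line else line.split()


def parse_protein_file(lines):
    """Return {protein_name: [(scan, hyperscore, fraction), ...]}."""
    proteins = {}
    started = False      # becomes True once the "Description" line is seen
    pending_name = None  # most recent protein name line
    current = None       # peptide list of the section being read
    score_index = None   # position of the "HyperScore" column
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not started:
            # Everything above the "Description" line is ignored.
            started = line.startswith("Description")
            continue
        if not line.strip():
            continue
        fields = split_columns(line)
        if fields[0] == "Scan":
            # Peptide header line: remember where the test statistic lives.
            if pending_name is None:
                raise ValueError("peptide header found before a protein name")
            score_index = fields.index("HyperScore")
            current = proteins.setdefault(pending_name, [])
        elif current is not None and fields[0].isdigit():
            # Peptide line: integer scan, numeric score, fraction in last column.
            current.append((int(fields[0]), float(fields[score_index]), fields[-1]))
        else:
            # Anything else starts a new protein section.
            pending_name = fields[0]
            current = None
    return proteins
```

The fraction is kept as a string because it is later matched against the DDF Fraction expressions rather than used as a number.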
Example:

    Description
    protein1
    Scan    HyperScore    Fraction
    1       2639.333      4
    2       1568.778      4
    3       2639.333      1
    4       1465.831      2
    5       2639.333      2
    6       1759.877      2
    7       1335.717      2
    8       2639.333      3
    9       1759.877      3
    protein2
    Scan    HyperScore    Fraction
    10      3059.875      4
    11      1148.622      3
    protein3
    Scan    HyperScore    Fraction
    12      1880.937      2
    13      1249.633      3
    14      1880.937      3

To specify more than one protein file, either enter them as a list (i.e. -i file1 file2 ...) or use multiple -i options (i.e. -i file1 -i file2 ...).
Fasta File (-d)
The fasta file needs to contain an amino acid sequence for every protein in the protein file. If a fasta file is not given, then DDF2GO tries to parse the GI numbers from the protein names in the protein file and download the sequences from NCBI using EUtils. For DDF2GO to parse the GI numbers, the protein names need to be in the NCBI header format (gi|{gi_num}|{db}|{db_acc}). Otherwise, this process will fail and no sequences will be found. A warning will be printed for every protein in the protein file that couldn't be found in the fasta file or couldn't be downloaded from NCBI.
Example:

    >protein1
    ASASPSPPRLFPLVLCSPSDSVYTVGCAAFDFQPSSIAFTWFDSNNSSVSGMDVIPKVISGPPYRAVSRI
    QMNQSEGKEKQPFRCRAAHPRGNVEVSVMNPGPIPTPNGIPLFVTMHPPSREDFEGPFRNASILCQTRGR
    RRPTEVTWYKNGSPVAAAATTATTVGPEVVAESRISVTESEWDTGATFSCVVEGEMRNTSKRMECGLEPV
    VQQDIAIRVITPSFVDIFISKSATLTCRVSNMVNADGLEVSWWKEKGGKLETALGKRVLQSNGLYTVDGV
    ATVCASEWDGGDGYVCKVNHPDLLFPMEEKMRKTKASNARPPSVYVFPPPTEQLNGNQRLSVTCMAQGFN
    PPHLFVRWMRNGEPLPQSQSVTSAPMAENPENESYVAYSVLGVGAEEWGAGNVYTCLVGHEALPLQLAQK
    SVDRASGKASAVNVSLVLADSAAACY
    >protein2
    MAPNIRKSHPLLKMINNSLIDLPAPSNISAWWNFGSLLAVCLMTQILTGLLLAMHYTADTSLAFSSVAHT
    CRNVQYGWLIRNLHANGASFFFICIFLHIGRGLYYGSYLYKETWNTGVILLLTLMATAFVGYVLPWGQMS
    FWGATVITNLFSAIPYIGHTLVEWAWGGFSVDNPTLTRFFALHFLLPFAIAGITIIHLTFLHESGSNNPL
    GISSDSDKIPFHPYYSFKDILGLTLMLTPFLTLALFSPNLLGDPENFTPANPLVTPPHIKPEWYFLFAYA
    ILRSIPNKLGGVLALAASVLILFLIPFLHKSKQRTMTFRPLSQTLFWLLVANLLILTWIGSQPVEHPFII
    IGQMASLSYFTILLILFPTIGTLENKMLNY
    >protein3
    MSWSRSILCLLGAFANARSIPYYPPLSSDLVNHINKLNTTGRAGHNFHNTDMSYVKKLCGTFLGGPKAPE
    RVDFAEDMDLPDTFDTRKQWPNCPTISEIRDQGSCGSCWAFGAVEAISDRICVHTNAKVSVEVSAEDLLS
    CCGFECGMGCNGGYPSGAWRYWTERGLVSGGLYDSHVGCRAYTIPPCEHHVNGSRPPCTGEGGETPRCSR
    HCEPGYSPSYKEDKHYGITSYGVPRSEKEIMAEIYKNGPVEGAFIVYEDFLMYKSGVYQHVSGEQVGGHA
    IRILGWGVENGTPYWLAANSWNTDWGITGFFKILRGEDHCGIESEIVAGVPRMEQYWTRV

Execution
To run the above examples through DDF2GO, the following command was run:

    DDF2GO.pl -i example.protein.txt -d example.fasta -p output.mapped -r output.margin

-i is the protein file, -d is the fasta file, -p is the output mapped file, and -r is the output marginal file.
If the fraction names don't follow the above example, then the --frac parameter needs to be used. It takes four regular expressions that correspond to the four DDF Fractions. Make sure that these expressions do not overlap, because overlapping expressions cause fractions to be assigned incorrectly. Using the command in the "Testing" sub-section above as an example, the --frac parameter is set to 'cyt|CYT' 'mem|MEM' 'nuc|NUC' 'csm|CSM'. This will match anything with 'cyt' or 'CYT' to the first DDF Fraction, anything with 'mem' or 'MEM' to the second DDF Fraction, and so on.
If the protein names don't match exactly in both the protein file and the fasta file, then use the --id_mod parameter. It takes a regular expression and truncates the protein names to match them. Anything that matches the regular expression is removed. Using the same example, the --id_mod parameter is set to '[^|]*$'. This will remove everything after the last '|' up to the end of the string (e.g. "gi|127513|sp|P01875.2|IGHM_CHICK RecName: Full=Ig mu chain C region" becomes "gi|127513|sp|P01875.2|").
To get a full list of parameters, run

    DDF2GO.pl --help
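To preview what a set of --frac expressions or an --id_mod expression will do before running DDF2GO, the patterns can be tried out separately. The following is a rough Python sketch, not DDF2GO's own code: the function names and the sample name "stroma_MEM_rep1" are hypothetical, it assumes each --frac expression is applied as an unanchored search against the fraction name, and it relies on Python's re module behaving like Perl for patterns this simple.

```python
import re

# The four expressions from the "Testing" command, in DDF Fraction order.
FRAC_PATTERNS = ["cyt|CYT", "mem|MEM", "nuc|NUC", "csm|CSM"]


def matching_fractions(name, patterns=FRAC_PATTERNS):
    """Return the 1-based numbers of all DDF Fractions whose pattern matches."""
    return [i for i, pattern in enumerate(patterns, start=1) if re.search(pattern, name)]


def apply_id_mod(protein_name, id_mod=r"[^|]*$"):
    """Remove whatever the --id_mod expression matches."""
    return re.sub(id_mod, "", protein_name)


header = "gi|127513|sp|P01875.2|IGHM_CHICK RecName: Full=Ig mu chain C region"
print(matching_fractions("stroma_MEM_rep1"))            # [2]
print(matching_fractions("C12", ["1", "2", "3", "4"]))  # [1, 2]: the patterns overlap
print(apply_id_mod(header))                             # gi|127513|sp|P01875.2|
```

A name that returns more than one fraction number indicates overlapping expressions, which is the situation the warning above is about.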
Outputs
Mapped File
The mapped file is a tab-delimited text file that contains the proteins that were mapped to GO CC terms.
Column descriptions:
- ID: The protein ID that is matched to fasta sequence headers
- Protein: The protein name from the fasta file
- D|T|TD|S: The DDF Profile
- Average TMD: The average predicted TMD from PSORT II and TMPred
- GO CC: The mapped GO CC terms, separated by a '|'
Example:

    ID          Protein     D|T|TD|S     Average TMD    GO CC
    protein2    protein2    0|0|49|51    8.5            GO:0016021 integral to membrane|GO:0043229 intracellular organelle
    protein3    protein3    0|34|66|0    0.5            GO:0031224 intrinsic to membrane|GO:0005783 endoplasmic reticulum

Marginal File
The marginal file is a tab-delimited text file that contains the proteins that could not be mapped to a GO CC term with certainty. A protein can't be mapped if its DDF Profile is within 2 percentage points of a mapping boundary. These proteins need to be manually annotated.
Column descriptions:
- ID: The protein ID that is matched to fasta sequence headers
- Protein: The protein name from the fasta file
- D|T|TD|S: The DDF Profile
- Average TMD: The average predicted TMD from PSORT II and TMPred
Example:

    ID          Protein     D|T|TD|S       Average TMD
    protein1    protein1    10|43|24|23    0.5
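Because both output files are tab-delimited, they are easy to load for downstream processing. The following is a minimal Python sketch of such a reader, assuming the first line of each file is a header row with exactly the column names shown in the examples above. The function name and the returned dictionary keys are hypothetical.

```python
import csv

PROFILE_KEYS = ("D", "T", "TD", "S")


def read_ddf2go_table(handle):
    """Read a mapped or marginal file into a list of dictionaries."""
    rows = []
    # QUOTE_NONE keeps quote characters in protein names as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Split the DDF Profile column into its four percentages.
        values = (float(x) for x in row["D|T|TD|S"].split("|"))
        go_terms = row.get("GO CC")  # absent in the marginal file
        rows.append(
            {
                "id": row["ID"],
                "protein": row["Protein"],
                "profile": dict(zip(PROFILE_KEYS, values)),
                "average_tmd": float(row["Average TMD"]),
                "go_cc": go_terms.split("|") if go_terms else [],
            }
        )
    return rows


# Example usage (file name taken from the "Execution" example):
#   with open("output.mapped", newline="") as handle:
#       for entry in read_ddf2go_table(handle):
#           print(entry["id"], entry["profile"], entry["go_cc"])
```

Rows read from the marginal file come back with an empty go_cc list, which makes it easy to collect the proteins that still need manual annotation.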
The sample subdirectory contains the cow data used in the manuscript mentioned above. The following is a description of the directory contents:

    Cow_proteome_OVA.filtered: PepMOD output file
    bos_taurus.proteome.fa: FASTA database
    cow.ova.mapped.txt: DDF2GO mapped output file
    cow.ova.margin.txt: DDF2GO marginal output file

The following command was used to get the DDF2GO output files above:

    ./DDF2GO.pl --frac '[OC]1' '[OC]2' '[OC]3' '[OC]4' \
        -i sample/Cow_proteome_OVA.filtered \
        -d sample/bos_taurus.proteome.fa \
        -p sample/cow.ova.mapped.txt \
        -r sample/cow.ova.margin.txt
To see a list of parameters and their descriptions, run

    DDF2GO.pl --help

    MANDATORY PARAMETERS:
      -i, --input_files FILE...
          A list of files with protein/peptide information in the
          format of PepMOD output

    DATABASE PARAMETERS:
      -d, --fasta FILE
          FASTA database file containing the proteins used in search.
          The headers must match the input file protein names. If a
          FASTA database is not given, DDF2GO will try to parse GI
          numbers from the protein names and download the sequences
          from NCBI. (Skipping is NOT recommended)
      --save_fasta FILE
          If sequences are downloaded from NCBI, this option save them
          in a file for later use.

    OUTPUT PARAMETERS:
      -p, --map, --mapped_file FILE
          Output file where all the clearly mapped GO CC terms are
          written. (Default: STDOUT)
      -r, --mar, --margin_file FILE
          Output file where all the marginal DDF Profiles/TMD are
          written. A DDF Profile/TMD is marginal if it is within 2
          percentage points of a category boundary (Default: STDERR)

    MISCELLANEOUS PARAMETERS:
      --frac REGEX REGEX REGEX REGEX
          4 regular expressions that are used to map the sample names
          to the DDF fractions (Default: 1 2 3 4)
      --id_mod
          Regular expression used to truncate the protein ids to match
          the fasta headers. The sub-string matched by the regular
          expression is removed.
      --debug
          Keeps the program from deleting the temporary files on exit
      --phosphorylation
          Marks DDF Fraction that contain phosphorylated amino acids
          (Y +79.9663, T +79.9663, S +79.9663)
      --configure
          Reconfigure DDF2GO