The Proteomic-Genomic Nexus is a software package that is designed to integrate genomic and transcriptomic data generated from next-generation sequencing with proteomic data generated from protein mass spectrometry.
The primary users of this product will be biologists who would like to integrate their genomics and proteomics data, and be able to visualize them. The tool will be deployed in a number of pilot projects in collaboration with several research groups in Australia and internationally. The research domains span across basic science, primary industry, and medical research.
By using the Proteomic-Genomic Nexus, users will be able to co-visualise genomic, transcriptomic and proteomic data using the Integrative Genomics Viewer. We will be able to validate the existence of genes using peptides identified from mass spectrometry experiments. We will also use this to verify alternatively spliced genes by searching for peptides that span exon-exon junctions.
The tools developed as part of the Proteomic-Genomic Nexus project allowed us to visualize all the peptides from the proteome of an organism and to analyze the data efficiently.
Using the Samifier tool, we co-visualized the genomics, transcriptomics, and proteomics data of Saccharomyces cerevisiae using the Integrative Genomics Viewer (IGV). An example of peptides visualized using IGV is shown in the figure below. The Integrative Genomics Viewer was used to visualize experimental peptides for the yeast 40S ribosomal protein S7-B (YNL096C). A peptide that spans an exon-exon junction is highlighted in the red box. This analysis has also been done on a genome / proteome scale.
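To make the idea of a junction-spanning peptide concrete, here is a minimal sketch in Java. It is not the Samifier's code. It assumes a forward-strand gene whose exons are described only by the length of their coding portion, and a peptide located by its 0-based start residue and its length within the protein. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and input conventions are assumptions.
class JunctionSketch {

    /**
     * Returns the coding-sequence offsets of the exon-exon junctions that fall
     * strictly inside the stretch of coding sequence covered by a peptide.
     *
     * exonCdsLengths  coding length (in bases) of each exon, in transcript order
     * peptideStartAa  0-based index of the peptide's first residue in the protein
     * peptideLengthAa number of residues in the peptide
     */
    static List<Integer> junctionsSpanned(int[] exonCdsLengths,
                                          int peptideStartAa,
                                          int peptideLengthAa) {
        int spanStart = 3 * peptideStartAa;             // first base covered
        int spanEnd = spanStart + 3 * peptideLengthAa;  // one past the last base
        List<Integer> spanned = new ArrayList<>();
        int boundary = 0;
        // The end of the last exon is not a junction, so stop one exon early.
        for (int i = 0; i < exonCdsLengths.length - 1; i++) {
            boundary += exonCdsLengths[i];
            if (boundary > spanStart && boundary < spanEnd) {
                spanned.add(boundary);
            }
        }
        return spanned;
    }

    public static void main(String[] args) {
        int[] exons = {30, 60}; // two exons, junction after coding base 30
        System.out.println(junctionsSpanned(exons, 8, 5));  // peptide covers bases 24..38
        System.out.println(junctionsSpanned(exons, 10, 5)); // peptide starts exactly at exon 2
    }
}
```

A peptide that starts exactly at the first base of an exon is not counted as spanning the junction, because none of its codons draws bases from the preceding exon.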
The Results Analyser was used to verify proteins encoded in the genomes of Campylobacter concisus (an emergent gut pathogen) and Saccharomyces cerevisiae (baker’s yeast). A protein was considered verified when it had two or more peptide ‘hits’ with Mascot scores exceeding an identity threshold. For Campylobacter concisus, 66% (1320/2002) of previously known proteins in UniProt were verified with peptides identified from mass spectrometry experiments. For Saccharomyces cerevisiae, 61% (4046/6621) of the proteins as well as 25% (78/284) of all splice junctions in the yeast proteome were verified from a comprehensive proteome analysis (de Godoy et al., 2008, Nature 455:1251-1254). We are in the process of analysing human proteomics data to find evidence for peptides that span exon-exon junctions. This will validate the existence of known and novel alternatively spliced transcripts.
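The verification rule described above (two or more peptide hits scoring above an identity threshold) can be sketched as follows. This is an illustration, not the Results Analyser's implementation. It assumes that "two or more hits" means two or more distinct peptide sequences, and the `Hit` class, method name and threshold value are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; the data model is an assumption.
class VerificationSketch {

    /** One peptide-to-protein match with its search-engine score. */
    static final class Hit {
        final String protein;
        final String peptide;
        final double score;

        Hit(String protein, String peptide, double score) {
            this.protein = protein;
            this.peptide = peptide;
            this.score = score;
        }
    }

    /**
     * Returns the proteins supported by at least minPeptides distinct peptides
     * whose score is strictly greater than identityThreshold.
     */
    static Set<String> verifiedProteins(List<Hit> hits,
                                        double identityThreshold,
                                        int minPeptides) {
        Map<String, Set<String>> peptidesByProtein = new HashMap<>();
        for (Hit hit : hits) {
            if (hit.score > identityThreshold) {
                peptidesByProtein
                        .computeIfAbsent(hit.protein, key -> new HashSet<>())
                        .add(hit.peptide);
            }
        }
        Set<String> verified = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : peptidesByProtein.entrySet()) {
            if (entry.getValue().size() >= minPeptides) {
                verified.add(entry.getKey());
            }
        }
        return verified;
    }
}
```

Counting distinct peptide sequences rather than raw hits means that a single peptide identified many times does not verify a protein on its own. Whether the original analysis made the same choice is not stated in the text.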
Please refer to the deployment guide for instructions on how to download or build the tools.
A number of manuals and documentation resources are available:
Github repositories:
Technical documentation:
– AP11 tools - includes links to deployment and developer guides
– Web application - includes links to deployment and developer guides
Overview diagram
The following diagram (from the wiki) outlines where the different tools fit within the genomics/proteomics research paths.
A number of Java classes were implemented to parse the specific elements of information required by the tools. Their implementation may be of interest to the genomics and/or proteomics community. They are found here and include:
A FASTA file reader with caching that can read contig files (see fastaParserImpl.java)
A GFF parser (see GenomeParserImpl.java)
An mzIdentML reader (see the mzidentml package) that, although specific (it extracts only the peptide results required by other parts of the system), is flexible enough to work with versions 1.0 and 1.1 of the standard.
There are also a couple of command-line utilities:
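As an illustration of what the simplest of these parsers has to do, here is a minimal FASTA reading sketch. It is not the project's FASTA reader: it has no caching, it keeps every sequence in memory, and it treats the first whitespace-delimited token of each header as the record identifier, which is an assumption. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not the project's FASTA parser.
class FastaSketch {

    /**
     * Parses FASTA-formatted lines into a map from record identifier to
     * sequence, preserving the order in which records appear.
     */
    static Map<String, String> parse(Iterable<String> lines) {
        Map<String, StringBuilder> builders = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String rawLine : lines) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // tolerate blank lines between records
            }
            if (line.startsWith(">")) {
                // Header line: the identifier is the first token after '>'.
                String id = line.substring(1).trim().split("\\s+")[0];
                current = new StringBuilder();
                builders.put(id, current);
            } else if (current != null) {
                current.append(line); // sequence lines are concatenated
            }
        }
        Map<String, String> sequences = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> entry : builders.entrySet()) {
            sequences.put(entry.getKey(), entry.getValue().toString());
        }
        return sequences;
    }
}
```

For a file on disk, the lines could be supplied with `java.nio.file.Files.readAllLines(path)`. A multi-contig genome then becomes one map entry per contig.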
CodonFinder <FASTA-file> <Translation-table> {+|-} [<start>] [<end>]
Prints the codons in the whole FASTA file, or only those between the optional <start> and <end> positions. The + or - parameter selects forward or reverse strand translation.
NucleotideToAminoacid <Translation-table> <nucleotide-sequence> [<frame>]
Produces the amino acid sequence for a given nucleotide sequence, optionally applying a <frame> before translation.
All code is licensed under the GNU GPL v3 - see LICENSE.txt in each code repository for the license text. Documentation (contained in the Github wiki) is licensed under Creative Commons Attribution-Share Alike.
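The following sketch shows the kind of work these two utilities perform: splitting a sequence into codons on either strand and translating them. It is not the utilities' source code. It hard-codes the standard genetic code (NCBI translation table 1) instead of reading a <Translation-table> file, and it assumes that <frame> is a 0-based number of bases to skip before the first codon. Both are assumptions, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the CodonFinder or NucleotideToAminoacid code.
class TranslationSketch {

    // Standard genetic code, with codons enumerated in T, C, A, G base order.
    static final String BASES = "TCAG";
    static final String STANDARD_CODE =
            "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    /** Reverse complement of a DNA sequence; unknown symbols become N. */
    static String reverseComplement(String dna) {
        StringBuilder result = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            switch (Character.toUpperCase(dna.charAt(i))) {
                case 'A': result.append('T'); break;
                case 'T': result.append('A'); break;
                case 'C': result.append('G'); break;
                case 'G': result.append('C'); break;
                default:  result.append('N');
            }
        }
        return result.toString();
    }

    /** Complete codons read from the given strand, skipping 'frame' bases first. */
    static List<String> codons(String dna, char strand, int frame) {
        String sequence = (strand == '-') ? reverseComplement(dna) : dna.toUpperCase();
        List<String> result = new ArrayList<>();
        for (int i = frame; i + 3 <= sequence.length(); i += 3) {
            result.add(sequence.substring(i, i + 3));
        }
        return result;
    }

    /** Translates codon by codon; '*' marks a stop, 'X' an ambiguous codon. */
    static String translate(String dna, char strand, int frame) {
        StringBuilder protein = new StringBuilder();
        for (String codon : codons(dna, strand, frame)) {
            int first = BASES.indexOf(codon.charAt(0));
            int second = BASES.indexOf(codon.charAt(1));
            int third = BASES.indexOf(codon.charAt(2));
            boolean ambiguous = first < 0 || second < 0 || third < 0;
            protein.append(ambiguous ? 'X'
                    : STANDARD_CODE.charAt(16 * first + 4 * second + third));
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ATGGCCTAA", '+', 0)); // prints MA*
        System.out.println(translate("TTAGGCCAT", '-', 0)); // prints MA*
    }
}
```

The two calls in `main` translate the same gene, first as given on the forward strand and then from its reverse complement, so both produce the same protein.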
Due to the extensive testing by the research stakeholders, the software is now robust, mature and fit for purpose. The ongoing maintenance of the software may occur in a number of ways, as appropriate:
Further enhancements and fixes may be done by Intersect, under the support and maintenance agreement between UNSW and Intersect. The sustainability of the product has been considered throughout the project and the software has been designed to maximise future maintainability:
The software has an extensive suite of automated unit tests that clearly describe the expected behaviour of the code.
An overview of the application can be found on our wiki at Application Description.
Source code can be found at:
Samifier Source Code - Github - these are the Java command line components described above
and
Webapp Source Code - Github - this is the webapp that’s used to record experimental details and provide the feed to RDA.
We also have a User Manual, and Developers Guide. Check out the Wiki Home Page for further useful guides and documents.
Ignatius Pang and Aidan Tay from UNSW have written up a fantastic account of the testing they performed - this can be found at User assisted testing report
This is a great success for the project and we look forward to hearing more as people start trying out the software.
We installed the Java command line tools and here you can see their output when no parameters are given:
The samifier
and the protein generator
The option -i is still pending; it is meant for virtual proteins and is planned for the next sprint.
The samifier code is here and the protein generator code is here.
We also demoed the web application's new look and feel, based on the SBI website. It’s still a work in progress, as we need to define how the menus are going to change for this application. That will be a discussion for our next planning meeting. We also showed how to enter an input collection in the system, although there are still questions pending about the metadata model.
For this sprint we want to focus on working with the samifier (reverse strand proteins, the mzIdentML format) and the protein generator.
First, given an accession table and Mascot results (in DAT format), the following bash command outputs the proteins of interest in the Mascot search
We get lines like the following:
Let’s pick a protein match from there: ARB1_YEAST. We search for that protein in UniProtKB and find it, as shown in the image below
There we see the locus name YER036C, which we would also find in accession.txt if we looked for our protein of interest. In UniProtKB there is a cross-refs link, which we click. It just points to one of the sections at the bottom of the page, as shown here
We follow the link to the EMBL sequence database, specifically the translation code shown. Open it in a new window, as it leads to another website, EMBL-EBI, and in particular to the page describing the sequence for that protein,
as the preview above should confirm. There we just go to sequence (see red arrow), a link to the bottom of the page, where we see the sequence as displayed here
So, just click the “Show full sequence” link (JavaScript) to reveal the full sequence if it is too long. Voila, the amino acid sequence!
Bear in mind that this whole exercise assumes a reference genome. You can see it just below the locus name in the first screenshot above. We hope these instructions help you navigate the complex web of proteomic/genomic data.
This is a screenshot of the login page (click to see larger picture)
and the self-registration page
and the experiments page
Then we discussed the virtual protein generator. We realised it should be called Protein Generator, as it is meant to generate two kinds of protein DB files: one for predicted proteins, using gene-scanning software such as Glimmer; and one for virtual proteins, generated simply by splitting the genome into contiguous blocks of a certain size. We are aiming to have a first go with this product this sprint.
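The block-splitting idea behind virtual proteins can be sketched in a few lines. This is only an illustration of cutting a sequence into contiguous fixed-size blocks; it is not the Protein Generator's code, the names are hypothetical, and translating each block into amino acids is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the Protein Generator implementation.
class VirtualProteinSketch {

    /**
     * Cuts a genome sequence into contiguous, non-overlapping blocks of
     * blockSize bases. The last block may be shorter than the others.
     */
    static List<String> splitIntoBlocks(String genome, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<String> blocks = new ArrayList<>();
        for (int start = 0; start < genome.length(); start += blockSize) {
            int end = Math.min(genome.length(), start + blockSize);
            blocks.add(genome.substring(start, end));
        }
        return blocks;
    }

    public static void main(String[] args) {
        System.out.println(splitIntoBlocks("ACGTACGTAC", 4)); // prints [ACGT, ACGT, AC]
    }
}
```

Choosing a block size that is a multiple of three keeps every block in the same reading frame, which matters if the blocks are translated afterwards.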
The samifier was presented at the end. We had issues with the genome being split across several files, but in the end we showed the prototype, which was written in Ruby. We are aiming to produce a Java version by the end of this week to test it with researchers on other kinds of datasets.
Researchers were happy with the demonstration despite a few small issues, and are looking forward to the evolution of these initial prototypes.
On the other hand, development has just started on the samifier and the team is getting closer to a first cut. It is a prototype written in Ruby. It won’t have reverse strand handling, but will be able to generate fragments of forward encoded proteins, even if they are split by introns.
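To illustrate what handling a forward-strand peptide split by an intron involves, here is a small sketch. It assumes the output is a SAM-style alignment, where `M` marks aligned bases and `N` marks a skipped region such as an intron. It also assumes exons are given as 1-based inclusive genomic coordinates of their coding portions, sorted by position. None of this is taken from the samifier's source, and the names are hypothetical.

```java
// Illustrative sketch only; not the samifier implementation.
class ForwardMappingSketch {

    /**
     * Maps a peptide onto the genome and returns "POS CIGAR", where POS is the
     * 1-based genomic position of the peptide's first base.
     *
     * exons           {start, end} pairs, 1-based inclusive, forward strand
     * peptideStartAa  0-based index of the peptide's first residue in the protein
     * peptideLengthAa number of residues in the peptide
     */
    static String positionAndCigar(int[][] exons, int peptideStartAa, int peptideLengthAa) {
        int from = 3 * peptideStartAa;        // first coding base covered, 0-based
        int to = from + 3 * peptideLengthAa;  // one past the last coding base
        StringBuilder cigar = new StringBuilder();
        int codingOffset = 0;   // coding bases contributed by earlier exons
        int firstStart = -1;    // genomic start of the first aligned segment
        int previousEnd = -1;   // genomic end of the last aligned segment
        for (int[] exon : exons) {
            int exonLength = exon[1] - exon[0] + 1;
            int segmentFrom = Math.max(from, codingOffset);
            int segmentTo = Math.min(to, codingOffset + exonLength);
            if (segmentFrom < segmentTo) { // the peptide overlaps this exon
                int genomicStart = exon[0] + (segmentFrom - codingOffset);
                int genomicEnd = genomicStart + (segmentTo - segmentFrom) - 1;
                if (firstStart < 0) {
                    firstStart = genomicStart;
                } else {
                    // Gap between aligned segments: the skipped intron.
                    cigar.append(genomicStart - previousEnd - 1).append('N');
                }
                cigar.append(segmentTo - segmentFrom).append('M');
                previousEnd = genomicEnd;
            }
            codingOffset += exonLength;
        }
        if (firstStart < 0) {
            throw new IllegalArgumentException("peptide lies outside the coding sequence");
        }
        return firstStart + " " + cigar;
    }

    public static void main(String[] args) {
        int[][] exons = {{101, 130}, {231, 290}}; // 100-base intron between them
        System.out.println(positionAndCigar(exons, 8, 5));  // prints 125 6M100N9M
        System.out.println(positionAndCigar(exons, 10, 5)); // prints 231 15M
    }
}
```

Reverse-strand genes would need the exon order and coordinates flipped, which this sketch leaves out, in line with the forward-only scope described above.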
Here they are:
We will be updating these entries as we go, and they are a key reference in the evolution of the project.