Tuesday, April 11, 2017

Curating the C. elegans kinome in UniProt

One of the key strengths of the UniProt Knowledgebase is the expert curation that goes into the entries in the reviewed Swiss-Prot section. Here we give you an insight into a recent curation project to review and annotate the kinome of the nematode worm Caenorhabditis elegans. The image below shows an overview of the project, including the proportion of Swiss-Prot entries for C. elegans kinases at the start and end of the project, the breakdown of the kinome into kinase families and a word cloud of the most prevalent GO terms found in the C. elegans kinome. 

This project builds on previous work in the group to curate the human and mouse kinomes, which was completed and published in 2008. In addition to the ongoing update of the human and mouse kinomes as new information becomes available, we decided to extend curation efforts to the C. elegans kinome. C. elegans contains 438 kinases and almost half have been functionally characterized, highlighting that C. elegans is a valuable model organism for understanding the role of kinases in biological processes. In addition, studies in C. elegans can shed light on human biology and disease. For example, genetic studies of C. elegans lrk-1, a homolog of the human kinase LRRK2, which is involved in Parkinson’s disease, have helped to clarify its role in the development of the nervous system and provided some clues to the progressive neurodegeneration caused by LRRK2 mutations.

Some key characteristics of the C. elegans kinome are:

  • It contains the same proportion of kinases as the human proteome (approximately 2% of both proteomes)
  • It contains members from all 10 kinase groups
  • Kinase domains are found not only in cytoplasmic proteins but also in transmembrane proteins; one C. elegans kinase, H03A11.1, is thought to be secreted, based on its similarity to human FAM20C, which has been shown experimentally to be secreted
  • Pseudokinases represent 9% of the C. elegans kinome
  • C. elegans has many unique kinases, including members of the CK1 group, that have not been studied experimentally; understanding the function of these kinases could provide valuable information for developing strategies to eliminate parasitic worms

We recently published an article describing this effort in the Biochemical Journal, so you can read the results of the project here: http://www.biochemj.org/content/474/4/493

Thursday, November 17, 2016

Being FAIR at UniProt

We are living in the times of Big Data, with high-throughput genomics leading to massive biological data sets. While this data presents opportunities for innovation and discovery, it also creates immense challenges for open access, data handling, processing and analysis. One of the ways to ensure that the scientific community can get the most out of the data available is to ensure our data is FAIR.

What is FAIR?

Good data management is essential to facilitate knowledge discovery, innovation, integration and reuse by the community after the data publication process. The FAIR Data Principles present a guideline to standardise and improve data management with four foundational principles: Findability, Accessibility, Interoperability, and Reusability. The FAIR Guiding Principles were originally described in full in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175/. The FAIR guidelines have been developed keeping in mind the requirements of data use for human readers as well as for machine access.

A FAIR UniProt

As one of the world's largest freely available biological data resources, providing key life science data in the most open and accessible manner to the scientific community is at the heart of our mission. Good data management is essential for us to continue to support cutting-edge research in a sustainable and reliable manner. We see first hand the challenges of data management and dissemination and welcome the FAIR guiding principles for data resources. 

UniProt was one of the case studies presented in the original FAIR publication. What makes UniProt FAIR?

All entries are uniquely identified by a stable URL, that provides access to the entry in a variety of formats including a web page, XML, plain-text, RDF and REST services (‘F’ and ‘A’).
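As a rough illustration of what "one stable address, several formats" can look like for a script, here is a minimal Python sketch. The base address, the file extensions and the helper name entry_url are assumptions made for this example only and may not match the live website, so please check the current UniProt help pages before relying on them.

```python
# Hypothetical helper: builds one address per format for a single entry.
# BASE_URL and the extensions below are assumptions for illustration only.
BASE_URL = "https://www.uniprot.org/uniprot"

FORMAT_EXTENSIONS = {
    "html": "",      # the web page
    "xml": ".xml",
    "text": ".txt",  # plain-text flat file
    "rdf": ".rdf",
}


def entry_url(accession: str, fmt: str = "html") -> str:
    """Return the assumed address of an entry in the requested format."""
    if fmt not in FORMAT_EXTENSIONS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"{BASE_URL}/{accession}{FORMAT_EXTENSIONS[fmt]}"


if __name__ == "__main__":
    for fmt in FORMAT_EXTENSIONS:
        print(fmt, entry_url("P12346", fmt))
```

The point of the sketch is that the accession is the only part that changes between entries, and the format is chosen by a suffix, so the same identifier serves both human and machine readers.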

Interlinking with more than 150 different databases, every UniProt entry has extensive links into, for example, PubMed, enabling rich citation. These links are key to our user experience in human and machine readable formats ('I').

The entries contain rich metadata (‘F’) that is both human-readable (HTML, text format) and machine-readable (XML and RDF). All our representations use shared vocabularies and ontologies such as GO and ECO (‘I’). Our RDF representation additionally uses the UniProt RDF Schema Ontology and FALDO ('I'&'R').

UniProt strives for interoperability by representing data that is common with another database in exactly the same way (‘R’). For example, the information about GO terms in UniProt RDF is a pure subset of the information about GO terms in the GO consortium's database. This kind of common data representation allows FAIR RDF databases to fit together as puzzle pieces in the larger life science data world.

Being FAIR

We're not the only fans of FAIR data management. FAIR principles have been adopted as a touchstone for funders and policy groups including the NIH Data Commons, the G20 Hangzhou Consensus, the Amsterdam Call for Action on Open Science and the European Open Science Cloud.

Challenges ahead

Being FAIR is not without its challenges. Not all formats of data might be FAIR for humans and machine readers alike. At UniProt, we handle this issue by ensuring that we also provide all our data in formats that are FAIR to complement any that might not be. This is an on-going effort with every new data type and data service we provide but is essential to make sure our data is valuable and actionable so that the community can make the most of it. 

Monday, October 10, 2016

Automatic learning based annotation in UniProt

Have you ever wondered how data mining and machine learning techniques might help in knowledge curation? Let us introduce you to the Statistical Automatic Annotation System (SAAS) in UniProt!

UniProt has an automatic annotation project that enhances unreviewed TrEMBL entries in the UniProt Knowledgebase (UniProtKB) by enriching them with automatically predicted annotations. SAAS is one of the systems that contribute to this project.

SAAS is an automatic system with quality validation input from curators, such as exclusion of some data types as not appropriate for propagation. It learns on the properties present in the reviewed UniProtKB (Swiss-Prot) entries and uses the following attribute types to define the learning entries: InterPro protein family, taxonomy and sequence length. This combination allows SAAS to generate rules to annotate protein properties such as function, catalytic activity, pathway membership, subcellular location, protein names and feature predictions.

SAAS based evidence for UniProtKB annotation
When an annotation is added to an entry based on an automatic annotation from a SAAS rule, the evidence tag indicates this along with a link to the rule itself.

Browsing SAAS rules
In order to browse the dataset to view rules of interest, click on the dropdown next to the search box in the UniProt website and select ‘SAAS’. Now enter a query and hit the search button.

Exploring SAAS rule pages
Conditions are listed on the left hand side of the rule page and annotations are on the right hand side. If a condition holds true then the corresponding annotation is applied. 
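To make the "if the conditions hold, apply the annotation" idea concrete, here is a minimal Python sketch. It is not the SAAS implementation: the Rule class, the apply_rules function, the entry layout and all identifiers in the example are hypothetical, and the conditions are limited to the three attribute types mentioned above (InterPro family, taxonomy and sequence length).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: conditions on the left, annotation on the right."""
    interpro: frozenset  # InterPro identifiers that must all be present
    taxon: str           # taxon that must appear in the entry's lineage
    min_length: int      # inclusive sequence length bounds
    max_length: int
    annotation: str      # annotation applied when every condition holds


def apply_rules(entry: dict, rules: list) -> list:
    """Return the annotations of every rule whose conditions all hold."""
    annotations = []
    for rule in rules:
        holds = (
            rule.interpro <= set(entry["interpro"])
            and rule.taxon in entry["lineage"]
            and rule.min_length <= entry["length"] <= rule.max_length
        )
        if holds:
            annotations.append(rule.annotation)
    return annotations


if __name__ == "__main__":
    # Placeholder identifiers, not real InterPro entries or real rules.
    rule = Rule(frozenset({"IPR_EXAMPLE_1"}), "Nematoda", 200, 600,
                "Subcellular location: Cytoplasm")
    entry = {"interpro": ["IPR_EXAMPLE_1", "IPR_EXAMPLE_2"],
             "lineage": ["Eukaryota", "Metazoa", "Nematoda"],
             "length": 350}
    print(apply_rules(entry, [rule]))
```

An entry that fails any one condition, for example a sequence outside the length range, receives no annotation from that rule.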

SAAS annotation data is recalculated for every UniProt release to ensure that the annotations are accurate and up-to-date. 

Tuesday, September 27, 2016

New and improved publications view in UniProtKB!

UniProt Knowledgebase protein entries are a treasure trove of knowledge mined from the most relevant publications. Entries provide not only a quick evidence tag for statements based on publications but also a list of all publications used for curating an entry. In addition to this, entries used to have a link to computationally mapped publications, giving an expanded list if you are looking to review as many papers relevant to your protein as possible.

Now presenting a new and improved view of publications! From being a section within the Entry document, publications now have their own tab within the entries (see snapshot above). When you land on a UniProtKB entry, you can see some alternative view options under 'Display' on the left hand side including the 'Publications' view. This expanded space now allows us to provide all relevant publications in one integrated view, the ones used by UniProtKB for curating the entry as well as additional relevant publications that have been computationally mapped.

Filters on the left hand side allow you to quickly narrow down the list to just the UniProtKB reviewed or the computationally mapped ones. You can also filter the publications based on the category of data that they have been cited or mapped for, so you can quickly find reviewed and mapped publications for the category you're interested in (such as Function, Sequencing, and Structure). Additionally, you can filter to see publications tagged as being from large-scale or small-scale experiments.

You can switch back to the main Entry using the link under the Display section again, or explore the other alternative views of the Feature viewer and the Feature table.

We hope you enjoy this new feature. If you have any questions or feedback for us, write to us at help@uniprot.org.

Tuesday, September 20, 2016

A new Peptide Search tool now in UniProt

Finding matches for a specific peptide sequence is important for protein identification in proteomics studies. It is also important for sequence-based protein retrieval, as peptide sequences can often be too short for an effective BLAST search. The Protein Information Resource, a UniProt consortium partner, has hosted a Peptide match service for over a decade. This popular tool is now also available directly through the UniProt interface! It can be accessed from the 'Peptide search' link in the UniProt website header.

All you need to provide as input is a peptide sequence at least three amino acids long. Optionally, you can choose to restrict the search to a specific taxonomic group and/or you can specify whether you would like the tool to treat Isoleucine and Leucine as equivalent.

The Peptide search results display proteins that have an exact match to the query sequence in a table with filters on the left hand side. The table contains an additional column titled 'Peptide search:...', which shows the sequence matched and the sequence position where the match begins in the target sequence. Each peptide search job is assigned a unique identifier that can be used to retrieve the results from UniProt for up to seven days, either by simply pasting it back into the main search box or by bookmarking the peptide search results URL.
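For readers who want to see the matching idea in code, here is a minimal Python sketch of exact peptide matching with optional Isoleucine/Leucine equivalence. It is only an illustration of the concept, not the algorithm behind the Peptide search tool; the function name find_peptide and the example sequences are made up for this post.

```python
def find_peptide(peptide: str, protein: str, il_equivalent: bool = False) -> list:
    """Return the 1-based start positions of exact matches of peptide in protein."""
    if len(peptide) < 3:
        raise ValueError("peptide must be at least three amino acids long")
    query, target = peptide.upper(), protein.upper()
    if il_equivalent:
        # Treat Isoleucine (I) and Leucine (L) as the same residue.
        query = query.replace("I", "L")
        target = target.replace("I", "L")
    positions = []
    start = target.find(query)
    while start != -1:
        positions.append(start + 1)            # report 1-based positions
        start = target.find(query, start + 1)  # allow overlapping matches
    return positions


if __name__ == "__main__":
    sequence = "MAIKALKAIK"  # made-up sequence
    print(find_peptide("ALK", sequence))                      # [5]
    print(find_peptide("ALK", sequence, il_equivalent=True))  # [2, 5, 8]
```

With the equivalence switched on, "AIK" in the target is also counted as a match for "ALK", which is why the second call reports three positions instead of one.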

Try out this new tool now and let us know what you think! 

Friday, September 9, 2016

How can you increase the impact of your research papers and contribute to UniProt?

Have you ever wondered how to get your life science data into public resources like UniProt with due credit and citation of your papers? Would you like to broaden the reach and impact of your research papers?

Biomedical literature is vast, with over one million papers being added to PubMed every year.  Our team of curators triages these and selects relevant papers to create and update our protein entries. This herculean task is sometimes made more difficult when we cannot easily identify exactly which protein(s) a paper is about due to the lack of species, strain and even sequence information! The simple addition of a UniProt accession number in a paper could go a long way to helping both UniProt and other resources to use your work for adding knowledge into our databases and giving you due credit.

If the protein you are writing about is not in UniProt, you can get an accession number for it by submitting it to us through http://www.uniprot.org/help/submissions. Accession numbers are the alphanumeric identifiers that typically look like P12346 or A0A167SS16. Here are a couple of examples of a protein being referenced in this way within the text of a paper:

Remember, accession numbers are different to Entry names and are stable from release to release. Hence, accession numbers are the best identifier for referring to your protein in a manuscript. 

We recommend the format “UniProtKB P12346” to be used in the body of a manuscript, while the UniProtKB accession numbers would also be a suitable column title for a table. We do have procedures in place to identify the accession numbers alone in the text but elucidating that these are from UniProt would help us and readers of your papers to understand and use your work.  
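To show why the “UniProtKB P12346” format is easy for software to pick up, here is a minimal Python sketch that finds such mentions in a piece of text. It is not our actual literature-processing procedure; the pattern is our understanding of the documented accession number format, and the function name find_accessions and the sample sentence are invented for this example.

```python
import re

# Accession pattern as we understand the documented format (6 or 10 characters).
ACCESSION = (
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)

# Only accept accessions that are explicitly labelled with "UniProtKB".
MENTION = re.compile(r"\bUniProtKB\s+(" + ACCESSION + r")\b")


def find_accessions(text: str) -> list:
    """Return accession numbers cited in the form 'UniProtKB <accession>'."""
    return MENTION.findall(text)


if __name__ == "__main__":
    sample = "The kinase (UniProtKB P12346) binds UniProtKB A0A167SS16."
    print(find_accessions(sample))  # ['P12346', 'A0A167SS16']
```

Requiring the "UniProtKB" label is a deliberate choice in this sketch: a bare string such as P12346 could belong to another resource, while the labelled form is unambiguous for both readers and text-mining tools.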

Although we have discussed UniProt accessions here, adding accession numbers for other resources will also help to link up the research literature helping researchers search for and disseminate your work and hence increase its impact further. We are looking forward to working with you to achieve that goal!

Tuesday, May 24, 2016

Proteomics peptide data in the Feature Viewer

The Protein Feature Viewer in UniProt is an interactive representation of all protein sequence features in one compact view. It now provides a new track displaying proteomics peptide identification data for UniProt Knowledgebase entries.

UniProt runs an analysis pipeline to map data from publicly available proteomics resources to UniProtKB sequences. Until now, these data were available as a download from the UniProt FTP site. Now these data are also displayed in the Protein Feature Viewer, which can be accessed through the ‘Feature viewer’ link on the left hand side of the entry view page.

The proteomics track currently displays mass spectrometry peptide data mapped from PeptideAtlas, EPD and MaxQB to UniProtKB protein sequences and more mass spectrometry proteomics resources will be added in the future. The track can be further expanded to see unique and non-unique mapped peptides, as shown in the example screenshot below.

Unicity of peptides is evaluated according to the gene groups underlying the UniProtKB reference proteomes, where we group protein sequences based on the gene(s) encoding them. Each gene group is made up of one or more UniProtKB protein isoform sequences. A peptide is considered unique if it belongs to only one gene group. Two types of peptides are therefore identified: unique and non-unique.
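The unique versus non-unique rule can be sketched in a few lines of Python. This is only an illustration of the definition above, not our mapping pipeline; the function name classify_peptides, the two input dictionaries and every identifier in the example are hypothetical, and each peptide is assumed to map to at least one isoform.

```python
def classify_peptides(peptide_to_isoforms: dict, isoform_to_gene_group: dict) -> dict:
    """Label each peptide 'unique' if all its isoforms fall in one gene group."""
    labels = {}
    for peptide, isoforms in peptide_to_isoforms.items():
        # Collect the distinct gene groups of the isoforms the peptide maps to.
        groups = {isoform_to_gene_group[isoform] for isoform in isoforms}
        labels[peptide] = "unique" if len(groups) == 1 else "non-unique"
    return labels


if __name__ == "__main__":
    # Made-up isoform and gene group identifiers.
    isoform_to_gene_group = {"ISO-A1": "geneA", "ISO-A2": "geneA", "ISO-B1": "geneB"}
    peptide_to_isoforms = {
        "PEPTIDEONE": ["ISO-A1", "ISO-A2"],  # two isoforms, same gene group
        "PEPTIDETWO": ["ISO-A1", "ISO-B1"],  # isoforms from two gene groups
    }
    print(classify_peptides(peptide_to_isoforms, isoform_to_gene_group))
```

Note that a peptide shared by several isoforms is still unique as long as all of those isoforms belong to the same gene group.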

In the future, we also plan to add post-translational modification-specific proteomics data sets (initially phosphorylation sites) to the mappings to UniProtKB sequences.