EMBL-EBI at the Wellcome Genome Campus, Hinxton, Cambridge. (Credit: EMBL-EBI)

News & Views

Intricate Structural Information revealed by Deep Learning

Mar 23 2022

EMBL's European Bioinformatics Institute (EMBL-EBI) has expanded its open-access protein family database, Pfam, with the help of deep learning models. Pfam gives biologists insight into protein characteristics, including protein annotations, structures and multiple sequence alignments. It is widely used to classify protein sequences into phylogenies and to identify domains that provide insights into protein activity.

The increase in knowledge content (1) was achieved using deep learning methods developed by Google Research. The models were trained on data from Pfam to annotate previously undescribed protein domains, shedding light on potential protein function.

“Initially I was rather sceptical about using deep learning to reproduce the protein families within Pfam. Then I started collaborating more closely with Lucy Colwell and her team at Google Research and my scepticism quickly changed to excitement for the potential of these methods to improve our ability to classify sequences into domains and families,” said Alex Bateman, Senior Team Leader of Protein Sequence Resources at EMBL-EBI.

“These models exceed my expectations. They’re not just copying the data already in Pfam, they’re able to learn from the data and find new information that is yet to be discovered. What this gives us is the ability to expand the Pfam collection and potentially that of other resources using these same deep learning methods.”

Exceeding previous expansion efforts

The project expanded the Pfam database by almost 10%, exceeding the expansion efforts made over the last decade. The deep learning methods were also able to predict the function of 360 human proteins that had no previous annotation data available in Pfam.

Additional protein family predictions generated by the Google Research team’s neural networks were used to create a supplement to Pfam called Pfam-N (N for network), which added a further 6.8 million protein sequences to the Pfam database.

“We’re also now building on these established deep learning methods to expand the information in the database even further,” said Bateman. “We’re changing the way the existing deep learning model works so that we can call multiple protein domains at once. This new update to the database should be ready very soon.”

“My personal view is that there’s still a lot of scope to improve the deep learning models we’re currently using,” Bateman added. “We’re in the early days of this and I’m very hopeful for what it will mean for the future classification of protein families. This may even be something that will get solved in the next five years.”

This work is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.
