Antibodies, which are tiny proteins produced by the immune system, can bind to specific regions of viruses and neutralise them. One promising weapon in the fight against SARS-CoV-2, the virus that causes Covid-19, is a synthetic antibody that attaches to the virus' spike proteins and prevents it from entering a human cell.

To create a successful synthetic antibody, scientists must first figure out how that attachment will take place. Proteins have lumpy 3D structures with numerous folds, so they can stick together in millions of different ways. Picking the right protein complex out of that nearly countless set of candidates is extremely time-consuming.

To speed up the process, MIT researchers developed a machine-learning model that can predict the complex formed when two proteins join together.

Their method is between 80 and 500 times faster than current software methods, and it frequently predicts protein structures that are closer to experimentally observed structures.

This method could aid scientists in better understanding some biological processes involving protein interactions, such as DNA replication and repair, as well as accelerate the development of new treatments.

Equidock, the model built by the researchers, focuses on rigid body docking, in which two proteins attach by rotating or translating in 3D space while their shapes do not squeeze or bend.
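
To make "rigid body" concrete, here is a minimal Python sketch, not the researchers' code. It moves a made-up four-point "protein" by a rotation about one axis plus a translation, then checks that every distance between its points is unchanged. The names toy_protein, rigid_move and pairwise_distances are hypothetical, and the coordinates are invented for illustration.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def rigid_move(points, angle, shift):
    """Apply a rotation followed by a translation to every point."""
    moved = []
    for p in points:
        x, y, z = rotate_z(p, angle)
        moved.append((x + shift[0], y + shift[1], z + shift[2]))
    return moved

def pairwise_distances(points):
    """Distances between every unordered pair of points."""
    return [math.dist(a, b)
            for i, a in enumerate(points)
            for b in points[i + 1:]]

# Invented coordinates standing in for a protein (assumption, not real data).
toy_protein = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.0, 0.0), (0.0, 2.0, 1.0)]
docked_pose = rigid_move(toy_protein, math.pi / 3, (10.0, -4.0, 2.5))

before = pairwise_distances(toy_protein)
after = pairwise_distances(docked_pose)

# A rigid move changes where the protein is, not its shape.
shape_preserved = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(before, after))
```

Because the internal distances survive the move, a rigid docking method only has to find one rotation and one translation per protein pair rather than a new position for every atom.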

The model takes the three-dimensional structures of two proteins and turns them into three-dimensional graphs that the neural network can process. Proteins are made up of chains of amino acids, each of which is represented in the network by a node.
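
As a rough illustration of that graph representation, the Python sketch below builds one node per amino acid and links residues that sit close together in space. The article only says that each amino acid becomes a node; the distance cutoff used to create edges, the function name build_residue_graph and the toy coordinates are all assumptions made for this example.

```python
import math

def build_residue_graph(residues, cutoff=6.0):
    """Turn a list of (name, (x, y, z)) residues into nodes and edges.

    Assumption: two residues are linked when they lie within `cutoff`
    distance units of each other. The article does not say how edges
    are chosen.
    """
    nodes = [name for name, _ in residues]
    edges = []
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            distance = math.dist(residues[i][1], residues[j][1])
            if distance <= cutoff:
                edges.append((i, j, distance))
    return nodes, edges

# Invented four-residue chain (not real structural data).
toy_chain = [
    ("MET", (0.0, 0.0, 0.0)),
    ("ALA", (3.8, 0.0, 0.0)),
    ("GLY", (7.6, 0.0, 0.0)),
    ("LYS", (7.6, 3.8, 0.0)),
]

nodes, edges = build_residue_graph(toy_chain)
edge_pairs = [(i, j) for i, j, _ in edges]
```

Each edge stores the distance between its two residues, which is the kind of geometric detail a graph neural network can use as input.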

The researchers built geometric knowledge into the model so that it understands how objects change when they are rotated or translated in 3D space. The model also has mathematical knowledge built in that ensures the proteins always attach in the same way, regardless of where they sit in 3D space. This is how proteins interact with one another in the human body.

The machine-learning algorithm uses this knowledge to identify binding-pocket locations, or atoms of the two proteins that are most likely to interact and create chemical reactions. The points are then used to join the two proteins into a complex.
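
The last step, turning matched points into a docked pose, can be sketched in a few lines of Python. This is a simplified illustration, not the method described by the researchers: it assumes exactly three noise-free, non-collinear matched points and rebuilds the rotation and translation from them. For larger or noisy point sets, a least-squares fit such as the Kabsch algorithm is the usual choice. All function names and coordinates here are hypothetical.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    norm = math.sqrt(dot(v, v))
    return tuple(x / norm for x in v)

def frame(p0, p1, p2):
    """Right-handed orthonormal axes built from three non-collinear points."""
    e1 = unit(sub(p1, p0))
    e3 = unit(cross(e1, sub(p2, p0)))
    e2 = cross(e3, e1)
    return (e1, e2, e3)

def fit_rigid_transform(source_pts, target_pts):
    """Return a function that moves any point with the rigid motion
    carrying the three source points onto the three target points."""
    source_axes = frame(*source_pts)
    target_axes = frame(*target_pts)

    def transform(p):
        local = sub(p, source_pts[0])
        # Coordinates of p in the source frame...
        coords = [dot(local, axis) for axis in source_axes]
        # ...re-expressed along the target frame's axes.
        moved = target_pts[0]
        for c, axis in zip(coords, target_axes):
            moved = add(moved, tuple(c * x for x in axis))
        return moved

    return transform

# Invented pocket points on the moving protein and where they should land.
pocket_on_ligand = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pocket_on_receptor = [(5.0, 5.0, 5.0), (5.0, 6.0, 5.0), (4.0, 5.0, 5.0)]

dock = fit_rigid_transform(pocket_on_ligand, pocket_on_receptor)
other_atom_docked = dock((1.0, 1.0, 1.0))
```

Once the transform is known, it is applied to every atom of the moving protein, so a handful of matched points is enough to place the whole structure.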

Overcoming the shortage of training data was one of the most difficult aspects of developing this model. Because so little experimental 3D data exists for proteins, Ganea, one of the researchers, emphasises the importance of building geometric knowledge into Equidock. Without those geometric constraints, the model might pick up spurious correlations in the dataset.

Hours vs. seconds

Once the model was trained, the researchers compared it with four software methods. Equidock can predict the final protein complex in only one to five seconds. All of the baselines took much longer, from ten minutes to an hour or more.

Equidock was typically comparable to the baselines in quality metrics, which calculate how well the predicted protein complex matches the real protein complex, but it occasionally underperformed them.
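
One common way to score how closely a predicted structure matches a real one is the root-mean-square deviation (RMSD) between corresponding atoms. The article does not name the quality metrics that were used, so the choice of RMSD here is an assumption, and the coordinates in this Python sketch are invented.

```python
import math

def rmsd(predicted, reference):
    """Root-mean-square deviation between two equally long lists of 3D points."""
    if len(predicted) != len(reference):
        raise ValueError("structures must have the same number of points")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))

# Invented coordinates: the prediction is shifted by one unit along z.
reference_complex = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted_complex = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]

score = rmsd(predicted_complex, reference_complex)
perfect = rmsd(reference_complex, reference_complex)
```

A lower score means the prediction sits closer to the reference structure, and a score of zero means the two coincide exactly.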

In the future, the researchers plan to improve Equidock so that it can predict flexible protein docking. The biggest obstacle is the shortage of training data, so they are working to generate synthetic data that they can use to improve the model.