Usage Guide

SPURS is available on GitHub at https://github.com/luo-group/SPURS.

This guide will help you get started with using SPURS for protein stability prediction.

Basic Usage

First, download the example PDB file from the SPURS repository:

wget https://raw.githubusercontent.com/luo-group/SPURS/dev/data/inference_example/DOCK1_MOUSE.pdb

You can place this file in a data/inference_example/ directory in your project.

Single Mutation Prediction

from spurs.inference import get_SPURS, parse_pdb, get_SPURS_from_hub

# Load the model
model, cfg = get_SPURS_from_hub()

# Prepare your protein data
pdb_name = 'DOCK1_MOUSE'
pdb_path = './data/inference_example/' + pdb_name + '.pdb'
chain = 'A'
pdb = parse_pdb(pdb_path, pdb_name, chain, cfg)

# Make predictions
ddg = model(pdb, return_logist=True)

The model returns a tensor ddg containing stability predictions for all possible amino acid substitutions at each position. The values are normalized so that wild-type amino acids have a score of 0, while destabilizing mutations have positive scores and stabilizing mutations have negative scores.

To get the prediction for the wild-type amino acid at a specific position:

# wild-type amino acid at position 1
wt_aa = pdb['seq'][0]
ALPHABET = 'ACDEFGHIKLMNPQRSTVWY'
ddg_wt = ddg[0,ALPHABET.index(wt_aa)]
ddg_wt  # should be 0

For a specific mutation, like changing tryptophan at position 1 to alanine (W1A):

mt_aa = 'A'
ALPHABET = 'ACDEFGHIKLMNPQRSTVWY'
ddg_mt = ddg[0,ALPHABET.index(mt_aa)]
ddg_mt  # ddg for W1A mutation

Multi-mutation Prediction

For predicting the effects of multiple mutations:

The ddg tensor will contain stability predictions for each set of mutations in mut_info_list. For example, ddg[0] corresponds to the combined effect of mutations V2C and P3T, while ddg[1] corresponds to W1A and V2Y mutations.

Functional Site Identification

SPURS can also be used to identify functional sites in proteins.

First, predict the stability of the mutations:

from spurs.inference import get_SPURS, parse_pdb, get_SPURS_from_hub
# ~ 10s
model, cfg = get_SPURS_from_hub()
pdb_name = '1qlh'
pdb_path = '../data/enzyme/1qlh.pdb'
chain = 'A'
pdb = parse_pdb(pdb_path, pdb_name, chain, cfg)
# ~ 1s
ddg = model(pdb,return_logist=True).cpu().detach()

Then, load esm and get the logit differences:

import esm
import torch
from spurs.functional_site_annotation import get_wt_aa_logit_differences

ckpt = '../data/checkpoints/esm1v_t33_650M_UR90S_1/esm1v_t33_650M_UR90S_1.pt'
model, alphabet = esm.pretrained.load_model_and_alphabet_local(ckpt)
batch_converter = alphabet.get_batch_converter()
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

mut_index = list(range(2,376))
'''
mut_idx here is how the original_sequence aligned with the pdb['seq]
for example, the original_sequence here is 'MSTAGKVIK...'
and the pdb['seq'] is 'STAGKVIKCK...'
so here original_sequence[2-1:376-1] shoudl align with pdb['seq']
'''


original_sequence =  'MSTAGKVIKCKAAVLWEEKKPFSIEEVEVAPPKAHEVRIKMVATGICRSDDHVVSGTLVTPLPVIAGHEAAGIVESIGEGVTTVRPGDKVIPLFTPQCGKCRVCKHPEGNFCLKNDLSMPRGTMQDGTSRFTCRGKPIHHFLGTSTFSQYTVVDEISVAKIDAASPLEKVCLIGCGFSTGYGSAVKVAKVTQGSTCAVFGLGGVGLSVIMGCKAAGAARIIGVDINKDKFAKAKEVGATECVNPQDYKKPIQEVLTEMSNGGVDFSFEVIGRLDTMVTALSCCQEAYGVSVIVGVPPDSQNLSMNPMLLLSGRTWKGAIFGGFKSKDSVPKLVADFMAKKFALDPLITHVLPFEKINEGFDLLRSGESIRTILTF'
mask_results = get_wt_aa_logit_differences(original_sequence,mut_index,batch_converter,model,device,alphabet).cpu().detach()

Regression to Sigmoid and Plotting:

from spurs.functional_site_annotation import get_sigmoid_results

result = get_sigmoid_results(mask_results,ddg)

from spurs.functional_site_annotation import plot_sigmoid_results
shift = 2
vcenter = 0
# ground truth label
highlight_positions =[49] +[47,68,175]
plot_sigmoid_results(result,shift,vcenter,highlight_positions)

Reproducing Results

To reproduce the evaluation results from the paper:

# For SPURS on Megascale and ten test sets
python ./test.py experiment_path=data/checkpoints/spurs datamodule._target_=megascale data_split=test ckpt_path=best.ckpt mode=predict

# For ThermoMPNN on Domainome
python ./test.py experiment_path=data/checkpoints/ThermoMPNN datamodule._target_=domainome data_split=test ckpt_path=best.ckpt mode=predict