Usage Guide
===========

SPURS is available on GitHub at https://github.com/luo-group/SPURS.

This guide will help you get started with using SPURS for protein stability prediction.

Basic Usage
-----------
First, download the example PDB file from the SPURS repository:

.. code-block:: bash

   wget https://raw.githubusercontent.com/luo-group/SPURS/dev/data/inference_example/DOCK1_MOUSE.pdb

Place this file in a ``data/inference_example/`` directory under your project root; the examples below load it from ``./data/inference_example/DOCK1_MOUSE.pdb``.

Single Mutation Prediction
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from spurs.inference import get_SPURS, parse_pdb, get_SPURS_from_hub
   
   # Load the model
   model, cfg = get_SPURS_from_hub()
   
   # Prepare your protein data
   pdb_name = 'DOCK1_MOUSE'
   pdb_path = './data/inference_example/' + pdb_name + '.pdb'
   chain = 'A'
   pdb = parse_pdb(pdb_path, pdb_name, chain, cfg)
   
   # Make predictions
   ddg = model(pdb, return_logist=True)

The model returns a tensor ``ddg`` containing stability predictions for all possible amino acid substitutions at each position. The values are normalized so that wild-type amino acids have a score of 0, while destabilizing mutations have positive scores and stabilizing mutations have negative scores.

To get the prediction for the wild-type amino acid at a specific position:

.. code-block:: python

   # Wild-type amino acid at position 1 (index 0 in the tensor)
   wt_aa = pdb['seq'][0]
   ALPHABET = 'ACDEFGHIKLMNPQRSTVWY'
   ddg_wt = ddg[0, ALPHABET.index(wt_aa)]
   ddg_wt  # should be 0

For a specific mutation, like changing tryptophan at position 1 to alanine (W1A):

.. code-block:: python

   mt_aa = 'A'
   ALPHABET = 'ACDEFGHIKLMNPQRSTVWY'
   ddg_mt = ddg[0,ALPHABET.index(mt_aa)]
   ddg_mt  # ddg for W1A mutation
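
When many mutations need to be looked up, a small helper that parses the ``W1A`` notation can reduce indexing mistakes. The sketch below is not part of SPURS: ``lookup_ddg`` is a hypothetical helper, and it assumes that ``ddg`` can be indexed as ``ddg[position][amino_acid]`` with columns in the ``ALPHABET`` order used above. A nested list with made-up values stands in for the model output so that the snippet runs on its own.

.. code-block:: python

   ALPHABET = 'ACDEFGHIKLMNPQRSTVWY'

   def lookup_ddg(ddg, seq, mutation):
       """Return the score for a mutation written like 'W1A' (1-based position).

       Hypothetical helper, not part of the SPURS API.
       """
       wt_aa, pos, mt_aa = mutation[0], int(mutation[1:-1]), mutation[-1]
       if not 1 <= pos <= len(seq):
           raise ValueError(f'position {pos} is outside the sequence')
       if seq[pos - 1] != wt_aa:
           raise ValueError(
               f'expected {seq[pos - 1]} at position {pos}, got {wt_aa}'
           )
       # Convert the 1-based position to a 0-based row index
       return ddg[pos - 1][ALPHABET.index(mt_aa)]

   # Stand-in for the model output: 2 positions x 20 amino acids (made-up values)
   toy_seq = 'WV'
   toy_ddg = [[0.0] * len(ALPHABET) for _ in toy_seq]
   toy_ddg[0][ALPHABET.index('A')] = 1.5

   score = lookup_ddg(toy_ddg, toy_seq, 'W1A')

With real model output, the call would presumably look like ``lookup_ddg(ddg, pdb['seq'], 'W1A')``. The wild-type check catches off-by-one errors between PDB numbering and tensor indices early.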

Multi-mutation Prediction
~~~~~~~~~~~~~~~~~~~~~~~~~

To predict the combined effect of several mutations applied together:

.. code-block:: python

   from spurs.inference import parse_pdb, get_SPURS_multi_from_hub, parse_pdb_for_mutation
   import torch

   # Define multiple mutations to analyze
   mut_info_list = [
       ['V2C','P3T'],  # First set of mutations
       ['W1A','V2Y'],  # Second set of mutations
   ]

   # Prepare protein data
   pdb_name = 'DOCK1_MOUSE'
   pdb_path = './data/inference_example/' + pdb_name + '.pdb'
   chain = 'A'
   device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

   # Load the multi-mutation model
   model, cfg = get_SPURS_multi_from_hub()

   # Parse PDB and prepare mutation data
   pdb = parse_pdb(pdb_path, pdb_name, chain, cfg)
   mut_ids, append_tensors = parse_pdb_for_mutation(mut_info_list)
   pdb['mut_ids'] = mut_ids
   pdb['append_tensors'] = append_tensors.to(device)

   # Make predictions
   ddg = model(pdb)
   # ddg[i] contains the prediction for mut_info_list[i]

The ``ddg`` tensor contains one stability prediction per mutation set in ``mut_info_list``. For example, ``ddg[0]`` is the combined effect of V2C and P3T, and ``ddg[1]`` is the combined effect of W1A and V2Y.
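
Because the predictions are matched to mutation sets only by index, it can help to pair them with readable labels right away. The following is a minimal sketch: ``label_predictions`` is a hypothetical helper, not a SPURS function, and it assumes that each ``ddg[i]`` is a single scalar that ``float()`` can convert. A plain list of made-up values stands in for the model output.

.. code-block:: python

   def label_predictions(mut_info_list, ddg):
       """Map each mutation set, e.g. 'V2C+P3T', to its predicted value.

       Hypothetical helper, not part of the SPURS API.
       """
       if len(mut_info_list) != len(ddg):
           raise ValueError('expected one prediction per mutation set')
       return {
           '+'.join(muts): float(value)
           for muts, value in zip(mut_info_list, ddg)
       }

   mut_info_list = [
       ['V2C', 'P3T'],
       ['W1A', 'V2Y'],
   ]
   toy_ddg = [0.8, -0.3]  # made-up values standing in for model(pdb)

   labeled = label_predictions(mut_info_list, toy_ddg)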


Functional Site Identification
------------------------------

SPURS can also be used to identify functional sites in proteins.

First, predict the stability effects of all single mutations with SPURS:

.. code-block:: python

    from spurs.inference import get_SPURS, parse_pdb, get_SPURS_from_hub

    # Load the model (takes about 10 s)
    model, cfg = get_SPURS_from_hub()

    pdb_name = '1qlh'
    pdb_path = '../data/enzyme/1qlh.pdb'
    chain = 'A'
    pdb = parse_pdb(pdb_path, pdb_name, chain, cfg)

    # Predict ddG for all substitutions (takes about 1 s)
    ddg = model(pdb, return_logist=True).cpu().detach()

Then load the ESM-1v model and compute the wild-type logit differences:

.. code-block:: python

   import esm
   import torch
   from spurs.functional_site_annotation import get_wt_aa_logit_differences

   ckpt = '../data/checkpoints/esm1v_t33_650M_UR90S_1/esm1v_t33_650M_UR90S_1.pt'
   model, alphabet = esm.pretrained.load_model_and_alphabet_local(ckpt)
   batch_converter = alphabet.get_batch_converter()
   model.eval()  
   device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
   model = model.to(device)

   # mut_index lists the 1-based positions in original_sequence (defined
   # below) that align with pdb['seq']. Here original_sequence starts with
   # 'MSTAGKVIK...' and pdb['seq'] starts with 'STAGKVIKCK...', so
   # original_sequence[2-1:376-1] should align with pdb['seq'].
   mut_index = list(range(2, 376))


   original_sequence =  'MSTAGKVIKCKAAVLWEEKKPFSIEEVEVAPPKAHEVRIKMVATGICRSDDHVVSGTLVTPLPVIAGHEAAGIVESIGEGVTTVRPGDKVIPLFTPQCGKCRVCKHPEGNFCLKNDLSMPRGTMQDGTSRFTCRGKPIHHFLGTSTFSQYTVVDEISVAKIDAASPLEKVCLIGCGFSTGYGSAVKVAKVTQGSTCAVFGLGGVGLSVIMGCKAAGAARIIGVDINKDKFAKAKEVGATECVNPQDYKKPIQEVLTEMSNGGVDFSFEVIGRLDTMVTALSCCQEAYGVSVIVGVPPDSQNLSMNPMLLLSGRTWKGAIFGGFKSKDSVPKLVADFMAKKFALDPLITHVLPFEKINEGFDLLRSGESIRTILTF'
   mask_results = get_wt_aa_logit_differences(original_sequence,mut_index,batch_converter,model,device,alphabet).cpu().detach()
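
An incorrect offset between the full-length sequence and the PDB sequence silently shifts every position, so it is worth checking the alignment before running ESM. The sketch below assumes, as the comment in the block above describes, that ``mut_index`` holds 1-based positions in ``original_sequence``. The helper ``check_alignment`` is hypothetical and the short sequences are toy stand-ins.

.. code-block:: python

   def check_alignment(original_sequence, pdb_seq, mut_index):
       """Return True if the residues selected by mut_index spell out pdb_seq.

       Hypothetical helper, not part of the SPURS API.
       mut_index is assumed to hold 1-based positions in original_sequence.
       """
       if any(i < 1 or i > len(original_sequence) for i in mut_index):
           return False
       segment = ''.join(original_sequence[i - 1] for i in mut_index)
       return segment == pdb_seq

   # Toy stand-ins: the structure lacks the leading methionine
   toy_original = 'MSTAGKVIK'
   toy_pdb_seq = 'STAGKVIK'

   aligned = check_alignment(toy_original, toy_pdb_seq, list(range(2, 10)))
   misaligned = check_alignment(toy_original, toy_pdb_seq, list(range(1, 9)))

With the real data, the equivalent call would be ``check_alignment(original_sequence, pdb['seq'], mut_index)``, assuming ``pdb['seq']`` is a plain string.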

Finally, fit the sigmoid regression and plot the results:

.. code-block:: python
   
   from spurs.functional_site_annotation import get_sigmoid_results, plot_sigmoid_results

   result = get_sigmoid_results(mask_results, ddg)

   shift = 2
   vcenter = 0
   # Ground-truth functional-site positions to highlight
   highlight_positions = [49] + [47, 68, 175]
   plot_sigmoid_results(result, shift, vcenter, highlight_positions)


Reproducing Results
-------------------

To reproduce the evaluation results from the paper:

.. code-block:: bash

   # For SPURS on Megascale and ten test sets
   python ./test.py experiment_path=data/checkpoints/spurs datamodule._target_=megascale data_split=test ckpt_path=best.ckpt mode=predict

   # For ThermoMPNN on Domainome
   python ./test.py experiment_path=data/checkpoints/ThermoMPNN datamodule._target_=domainome data_split=test ckpt_path=best.ckpt mode=predict

See Also
--------

- :doc:`api` for detailed API documentation