PREreview of Prediction of Ca ²⁺ binding site in proteins with a fast and accurate method based on statistical mechanics and analysis of crystal structures

Published: April 23, 2024
DOI: 10.5281/zenodo.11048024
License: CC BY 4.0

Summary

This paper examines the placement of Ca2+ binding sites in high-resolution X-ray crystallography structures. The authors refer to a new method for doing this, but it is unclear what the method is or what it evaluates. I believe it uses 3D-RISM to extract high-likelihood placements of Ca2+ ions. However, the manuscript does not state cut-off values, how sites are classified, or how a probability of Ca2+ occupancy is assigned. If this is a new method, more detail on it needs to be included. Without this information, it is impossible to evaluate the rest of the paper.

Additionally, the authors examine Ca2+ binding sites bioinformatically. It is unclear how they generated the different datasets. The analysis is also difficult to evaluate because the method is not fully described and is not benchmarked against known true positives and true negatives or against existing methods.

Major Revisions

1) Overall, more information on the method must be provided. There needs to be very detailed information in the methods section and an overview in the results section. I understand that you use 3D-RISM, but there is no information on cut-off values, number of predicted sites, etc. from the output of 3D-RISM.

a. For example, in Figure 2: how do you determine the most probable atom?

b. How do you determine if there are more probable sites?

c. In Figure 3, there are many more sites (yellow) than Ca2+. How did you determine which blob is the Ca2+ binding site?

2) Please provide more justification for the creation of Dataset B. It is unclear which metrics (other than resolution and R-value) were used to remove structures. What does ‘electron density’ or B-factor mean as a criterion for removing structures? Removing 125 of the Ca2+ structures and leaving only 9 makes it difficult to evaluate the method.

3) It is difficult to compare the results of your new method to that of the FEATURE method as they are not done on the same PDB dataset. Please run both methods on the same dataset.

4) Please justify why you are using 3.5 Angstroms as a cut-off. This seems very high.

5) Please justify why you removed the PDBs with other ligands in the Ca2+ binding site. What is the distance cut-off value (from other ligands to Ca2+)? How many PDBs have other ligands within X angstroms of the Ca2+ binding site? This is important for understanding how to apply your method.

6) Please clarify what you mean by ‘electron density maps’ for removing structures. Is this done manually? Is there a Sigma cutoff? How are you evaluating this? I do not think it is valid to manually remove structures by examining their electron density without providing rationale or data. You also cite different metrics, but they are more concerned with geometry, and I don’t think that CheckMyBlob works for ions.

7) How were binding sites in Dataset-C rebuilt? If they were re-refined, please provide information on that.

8) In addition to (or instead of) explaining R-factors and B-factors, please describe how you are using each of these metrics in your analysis.

a. For example, you state that you removed the standard deviation of the B-factors if it exceeded two times the SD from the mean. Please clarify this statement.

9) What does success rate mean? Precision or recall or both? Please define.

10) To define false positives, it would be helpful to run your method on a set of structures without Ca2+.

11) Please clarify what g(r1) and g(r2) mean. How is the protein binding site determined?

12) Please provide information on the restraints, software, and force field used for minimization.

13) The method should be publicly available. Please provide a GitHub or other software repository link.
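To make concrete what I am asking for in the questions above on the distance cut-off, the success rate, and false positives, here is a minimal sketch of one way to score predictions. It is not the authors' method. The function names, the greedy one-to-one matching, and the example coordinates are all my own assumptions for illustration; only the 3.5 Angstrom cut-off comes from the manuscript.

```python
import math


def match_sites(predicted, observed, cutoff=3.5):
    """Greedily pair predicted sites with observed Ca2+ positions.

    Assumption: a prediction counts as correct if it lies within
    `cutoff` Angstroms of an observed ion, and each ion or prediction
    can be used in at most one pair. Returns (tp, fp, fn).
    """
    # All predicted/observed distances, shortest first.
    pairs = sorted(
        (math.dist(p, o), i, j)
        for i, p in enumerate(predicted)
        for j, o in enumerate(observed)
    )
    used_pred, used_obs = set(), set()
    for distance, i, j in pairs:
        if distance > cutoff:
            break  # every remaining pair is farther apart than the cut-off
        if i in used_pred or j in used_obs:
            continue
        used_pred.add(i)
        used_obs.add(j)
    tp = len(used_pred)
    fp = len(predicted) - tp  # predictions with no ion nearby
    fn = len(observed) - tp   # ions that no prediction recovered
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical coordinates (Angstroms), purely illustrative.
predicted_sites = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (50.0, 50.0, 50.0)]
observed_ions = [(1.0, 0.0, 0.0), (10.0, 0.0, 2.0), (30.0, 30.0, 30.0)]

tp, fp, fn = match_sites(predicted_sites, observed_ions)
precision, recall = precision_recall(tp, fp, fn)
```

Reporting both numbers, together with the cut-off used, would make clear whether "success rate" refers to precision, recall, or both. Running the same scoring on structures without Ca2+ (where every prediction is a false positive) would give the false positive estimate requested above.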

Minor Revisions

1) In the introduction, please be more specific about what you mean by data-driven approaches by naming them. Are the Altman and Yang approaches data-driven?

2) Please review the manuscript to input references. I think references are needed for the following statements:

a. “As the coordination geometry of metal ions varies widely, from simple to complex arrangements, making the exact geometry prediction a non-trivial task.”

b. “Diamagnetic metal ions, being NMR silent, do not provide insights into the geometric structure of the metal-binding sites. Paramagnetic metal ions within proteins, characterized by unpaired electrons, influence protein nuclei’s chemical shifts and relaxation rates. The metal may affect signals from nuclei in proximity to the metal ion even beyond detection, making it challenging to detect information about the geometric structure of the protein near the metal site.”

c. “In statistical mechanics, the pair-correlation function (PCF) between two particles originates from the two-body reduced distribution function. It is defined as the ratio of the probability of finding two particles at a certain separation (for non-spherical particles, orientation between the two particles is also needed for a complete description) to that in the bulk.”

3) Please clarify what ‘protein ligands’ means in the first paragraph of the introduction.

4) It would be helpful to know how well RosettaAllAtom and AlphaFill perform (precision and/or recall) at predicting Ca2+ sites.

5) Please be more specific about the issues with X-ray structures that make it difficult to predict Ca2+ binding sites (as there are many issues that could be applicable).

6) Please clarify what d stands for in equation #1.

7) It would be simpler in the PCF section if you focus on what is used in RISM.

8) Please describe the previously used datasets that make up Dataset #1. Why were these PDBs gathered in previous datasets?

9) For Figure 1, please state what the starting dataset is.

10) For CBVS, what was the cutoff that you used?

11) If this is a new method, it would be helpful to know about run time and compute infrastructure needed.

12) Supplementary Figures 1 and 2 are not included.

13) In section 3.4, state that resolution is >= 2.0 Angstroms.

14) In section 3.4, what software are you using to determine protein structure metrics?

15) How did you identify 3PLF? Did your method predict a Ca2+ site there?

16) In section 3.4, please clarify the ‘two remaining cases’ and how you identified them.

17) Is there any overlap between Dataset A and Dataset C?

18) For each dataset, please provide information on the distribution of structures, including SCOP or sequence diversity (plus R-factors, average B-factors, and MolProbity information).

19) Comparison to the FEATURE dataset belongs in the results section.

20) I think the water results belong in the results section, not the discussion.

21) The discussion would be strengthened by including information on where you see this applied.
