ClusDOCK Documentation

A tool for cluster analysis of docked poses

ClusDOCK clusters docking results by building a pairwise RMSD matrix and identifying a limited number of clusters from a large number of docking poses.


Input
ClusDOCK requires a single file (maximum file size: 50 MB), which is the output of one of the following docking programs:
Program              Format
AutoDock Vina        PDBQT
FRED (OpenEye)       SDF
HYBRID (OpenEye)     SDF
The user must select the docking program used, and the selection must match the format of the uploaded file.
The user must choose:
  1. The clustering algorithm
  2. The clustering cut-off (default cut-off: 1 Å)
  3. The clustering minimum size (default minimum size: 4)
1. Clustering algorithms
ClusDOCK can perform the cluster analysis with three different algorithms.
Each method is based on an N × N matrix in which element (i, j) contains the RMSD between the i-th and the j-th binding pose.
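As an illustration of the pairwise matrix described above, here is a minimal Python sketch. It is not ClusDOCK's own code: the function names `rmsd` and `rmsd_matrix` are hypothetical, and the sketch assumes that all poses list the same atoms in the same order, that no superposition is applied (docked poses already share the receptor frame), and that ligand symmetry is ignored.

```python
import math

def rmsd(pose_a, pose_b):
    """RMSD between two poses given as equal-length lists of (x, y, z) tuples.

    Assumes identical atom ordering and no superposition (illustrative only).
    """
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must have the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))

def rmsd_matrix(poses):
    """Symmetric N x N matrix; element [i][j] is the RMSD between poses i and j."""
    n = len(poses)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = rmsd(poses[i], poses[j])
    return matrix

# Three hypothetical two-atom poses (coordinates in Å).
poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
    [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
]
matrix = rmsd_matrix(poses)
```

Because the matrix is symmetric with a zero diagonal, only the upper triangle needs to be computed.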
  • Gromos algorithm (described by Daura et al. (1999), Angew. Chem. Int. Ed. 38:236-240)
    For each structure, the number of other structures whose RMSD is under the cut-off (its neighbour conformations) is counted. The structure with the highest number of neighbours becomes the centre of a cluster, which consists of this structure and all its neighbours. The structures of this cluster are then removed from the pool of structures, and the process is repeated until the pool is empty.
  • Agglomerative hierarchical algorithms
    At each stage, the structures or clusters of structures that have the lowest RMSD are fused. Differences between the methods arise because of the different ways of defining the RMSD between two clusters or a structure and a cluster. To form a cluster, this RMSD value must be lower than the cut-off.
    • Single-linkage clustering (or nearest-neighbour method)
      When using this algorithm, the RMSD between two clusters is equal to the lowest RMSD between two structures, one in one cluster, one in the other.
    • Complete-linkage clustering (or farthest-neighbour method)
      When using this algorithm, the RMSD between two clusters is equal to the highest RMSD between two structures, one in one cluster, one in the other.
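The Gromos procedure described above can be sketched in a few lines of Python. This is an illustrative reading of the description, not ClusDOCK's implementation: the function name `gromos_clusters` is hypothetical, the RMSD values are made up, and breaking ties by the lowest pose index is an assumption.

```python
def gromos_clusters(matrix, cutoff):
    """Cluster pose indices by repeated neighbour counting.

    Returns a list of clusters; each cluster is a list of pose indices
    whose first element is the cluster centre.
    """
    pool = set(range(len(matrix)))
    clusters = []
    while pool:
        # For every remaining pose, list its neighbours under the cut-off.
        neighbours = {
            i: [j for j in pool if j != i and matrix[i][j] < cutoff]
            for i in pool
        }
        # Centre = pose with the most neighbours; ties go to the lowest
        # index (an assumption made for this sketch).
        centre = min(pool, key=lambda i: (-len(neighbours[i]), i))
        members = [centre] + sorted(neighbours[centre])
        clusters.append(members)
        # Remove the whole cluster from the pool before the next round.
        pool -= set(members)
    return clusters

# Hypothetical RMSD matrix (Å) for five poses.
matrix = [
    [0.0, 0.5, 0.8, 3.0, 3.2],
    [0.5, 0.0, 1.2, 3.1, 3.3],
    [0.8, 1.2, 0.0, 2.9, 3.0],
    [3.0, 3.1, 2.9, 0.0, 0.6],
    [3.2, 3.3, 3.0, 0.6, 0.0],
]
clusters = gromos_clusters(matrix, cutoff=1.0)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

Pose 0 has two neighbours under 1 Å and becomes the first centre. Poses 1 and 2 join its cluster even though their mutual RMSD is 1.2 Å, because membership depends only on the distance to the centre.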
How to choose the clustering algorithm
No single method can be recommended above all others, and different clustering methods may give very different results on the same data. It is generally recommended to run the analysis with several settings to check that the results are robust.
The Gromos algorithm defines a series of non-overlapping clusters of structures with a simple and fast procedure. It is the algorithm we recommend as the default choice.
A potential benefit of applying single-linkage is that it can be used to identify outliers since these structures are left as singletons (they are not included in any of the clusters).
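Complete-linkage is the opposite of single-linkage: because the RMSD between two clusters is the highest pairwise RMSD, every pair of structures in a cluster lies under the cut-off. This yields more compact clusters, whereas single-linkage can chain together structures that differ considerably from one another.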
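To see how the two linkage definitions can produce different clusters from the same RMSD matrix, consider the following minimal sketch. It is a simplified illustration, not ClusDOCK's code: the function name `agglomerative_clusters` and the matrix values are hypothetical, and merging stops as soon as the closest pair of clusters is no longer under the cut-off.

```python
def agglomerative_clusters(matrix, cutoff, linkage="single"):
    """Merge the closest pair of clusters until none is under the cut-off."""
    # Single linkage uses the lowest pairwise RMSD, complete the highest.
    pick = {"single": min, "complete": max}[linkage]
    clusters = [[i] for i in range(len(matrix))]

    def distance(a, b):
        return pick(matrix[i][j] for i in a for j in b)

    while len(clusters) > 1:
        pairs = [
            (distance(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        best, p, q = min(pairs)
        if best >= cutoff:
            break  # no remaining pair is close enough to fuse
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Hypothetical RMSD matrix (Å): poses 0-1 and 1-2 are close, 0-2 are not.
matrix = [
    [0.0, 0.8, 1.5, 4.0],
    [0.8, 0.0, 0.8, 4.0],
    [1.5, 0.8, 0.0, 4.0],
    [4.0, 4.0, 4.0, 0.0],
]
single = agglomerative_clusters(matrix, cutoff=1.0, linkage="single")
complete = agglomerative_clusters(matrix, cutoff=1.0, linkage="complete")
print(single)    # [[0, 1, 2], [3]]
print(complete)  # [[0, 1], [2], [3]]
```

With single linkage, poses 0 and 2 end up in the same cluster although their RMSD (1.5 Å) exceeds the cut-off, because both are close to pose 1. With complete linkage, pose 2 stays separate. In both cases pose 3 remains a singleton.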
2. Clustering cut-off
When using the Gromos algorithm, the cut-off determines which conformations count as neighbours, and therefore how the clusters are defined.
When using agglomerative hierarchical algorithms, the cut-off determines when the algorithm stops merging, and therefore the final number of clusters.
How to choose the clustering cut-off
The default cut-off is 1 Å.
The value depends on the system:
  • for protein-protein complexes, the clustering cut-off should be higher (usually between 4 and 9 Å)
  • for protein-small molecule docking, the cut-off should be lower (usually between 1 and 2 Å)
In any case, the user is encouraged to try different cut-offs and check the number of structures in every cluster. If the clusters contain too few structures, the cut-off should be increased.
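A quick way to see the effect of the cut-off is to scan a few values and look at how large the best-populated group becomes. The sketch below is only an illustration under stated assumptions: the helper `largest_cluster_size` is hypothetical, the RMSD values are made up, and the size is computed as in the first step of a Gromos-style clustering (the best-connected pose plus its neighbours).

```python
def largest_cluster_size(matrix, cutoff):
    """Size of the best-connected pose's neighbourhood, including the pose itself."""
    return max(
        1 + sum(1 for j, d in enumerate(row) if j != i and d < cutoff)
        for i, row in enumerate(matrix)
    )

# Hypothetical RMSD matrix (Å) for five poses.
matrix = [
    [0.0, 0.5, 0.8, 3.0, 3.2],
    [0.5, 0.0, 1.2, 3.1, 3.3],
    [0.8, 1.2, 0.0, 2.9, 3.0],
    [3.0, 3.1, 2.9, 0.0, 0.6],
    [3.2, 3.3, 3.0, 0.6, 0.0],
]
sizes = {c: largest_cluster_size(matrix, c) for c in (0.5, 1.0, 2.0, 3.5)}
print(sizes)  # {0.5: 1, 1.0: 3, 2.0: 3, 3.5: 5}
```

In this example a cut-off of 0.5 Å leaves every pose on its own, while 3.5 Å lumps all poses together. The plateau between 1 and 2 Å suggests that the grouping is stable in that range.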
3. Clustering minimum size
The cluster minimum size is the number of structures that a group must contain to be considered a cluster.
How to choose the minimum size
The default minimum size of a cluster is 4.
The value depends on the number of compared poses:
  • when only a few poses are clustered, this value should not be very high
  • when many poses are clustered, it can be higher
A suitable minimum size avoids producing a very large number of clusters that each contain only a few structures.
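The minimum-size criterion amounts to a simple filter on the groups produced by the clustering step. The following sketch assumes clusters are represented as lists of pose indices. The function name `apply_minimum_size` is hypothetical, and how ClusDOCK itself reports the groups that fall below the threshold is not specified here.

```python
def apply_minimum_size(clusters, min_size=4):
    """Split groups into those that reach the minimum size and those that do not."""
    kept = [c for c in clusters if len(c) >= min_size]
    discarded = [c for c in clusters if len(c) < min_size]
    return kept, discarded

# Hypothetical clustering result for twelve poses.
groups = [[0, 3, 5, 7, 9], [1, 2, 4, 8], [6, 10], [11]]
kept, discarded = apply_minimum_size(groups, min_size=4)
print(kept)       # [[0, 3, 5, 7, 9], [1, 2, 4, 8]]
print(discarded)  # [[6, 10], [11]]
```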
Outputs
ClusDOCK generates two different outputs:
  1. a table that lists, for each cluster, the number of conformations found in the cluster and their indexes (referring to the input submitted by the user), where the first index is the conformation most representative of the cluster, together with its score
  2. a bar plot that shows on the y-axis the number of conformations in each cluster and on the x-axis the score (as determined by the docking program) of each cluster
How to interpret the results
Ideally, the largest cluster also has the best score and represents the optimal pose. However, this does not always happen.
If the scoring function were perfect, the docked conformation with the lowest energy would always correspond to the optimal pose. This is not always the case, and sometimes a different pose is observed significantly more often than the lowest energy binding mode.
As a result, the most populated cluster is not always the one with the best score.
The user should examine multiple clusters based on their population and score, especially if their scores are similar.
Additionally, the user should not consider only the representative conformation of a cluster, but should compare the conformations within the group to judge the quality of the clustering.
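The comparison between population and score can be made explicit with a small summary. The sketch below is illustrative only: the cluster lists and scores are made up, the function name `summarize` is hypothetical, and it assumes that a lower score is better (as with binding energies in kcal/mol) and that the first index of each cluster is its representative.

```python
def summarize(clusters, scores):
    """One row per cluster: representative index, population, representative score."""
    return [
        {"representative": c[0], "size": len(c), "score": scores[c[0]]}
        for c in clusters
    ]

# Hypothetical clusters (representative first) and per-pose scores.
clusters = [[0, 1, 2], [3, 4]]
scores = [-7.1, -7.0, -6.8, -8.2, -8.0]

rows = summarize(clusters, scores)
most_populated = max(rows, key=lambda r: r["size"])
best_scoring = min(rows, key=lambda r: r["score"])  # assumes lower is better
agree = most_populated is best_scoring
print(most_populated["representative"], best_scoring["representative"], agree)
# 0 3 False
```

Here the largest cluster (represented by pose 0) and the best-scoring cluster (represented by pose 3) differ, so both would deserve visual inspection.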
Tips for beginners
It is generally recommended to repeat the analysis with different algorithms, cut-offs, and minimum sizes to check that the results are robust.
If, with a cut-off between 2 and 3 Å, the docking results do not yield at least one significantly populated cluster, the user should re-run the docking with different parameters.
If the docking uses a stochastic method and the experiment has been run multiple times, the similarity of the predicted binding poses can be assessed by clustering all the docking conformations. If all of the dockings cluster into one family, this indicates that the search parameters were sufficient for each docking to converge. If there is no clustering at all, then the dockings should be repeated but with increased sampling.
The user should remember to compare scores only between poses of the same receptor-ligand complex.
