ClusDOCK Documentation
a tool for cluster analysis of docked poses
ClusDOCK clusters docking results by building a pairwise RMSD matrix and reducing a large number of docking poses to a limited number of clusters.
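The user must select the docking program that produced the input file; the selection must match the file format. The user must also choose the clustering algorithm, the clustering cut-off and the clustering minimum size.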
How to choose the clustering algorithm
No single method can be recommended above all others, and different clustering methods may give very different results on the same data. It is generally advisable to repeat the analysis with different choices to check that the clusters are robust.
The gromos algorithm defines a series of non-overlapping clusters of structures with an easy and fast procedure. This is the standard algorithm we recommend.
A potential benefit of applying single-linkage is that it can be used to identify outliers since these structures are left as singletons (they are not included in any of the clusters).
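Complete-linkage is the opposite of single-linkage: two clusters are fused only if the highest RMSD between their structures is below the cut-off, so it generates compact clusters in which no two structures differ by more than the cut-off, whereas single-linkage can chain together structures that differ more.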
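To make the difference between the two linkage rules concrete, here is a minimal sketch in plain Python. It is not ClusDOCK's implementation: the function name `agglomerative`, the toy matrix `toy_rmsd` and the strict "lower than the cut-off" test are assumptions made for illustration.

```python
def agglomerative(rmsd, cutoff, linkage):
    """Fuse the two closest clusters until their RMSD reaches the cut-off.

    linkage is the built-in min for single-linkage and max for
    complete-linkage (hypothetical interface, for illustration only).
    """
    clusters = [[i] for i in range(len(rmsd))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # RMSD between two clusters under the chosen linkage rule
                d = linkage(rmsd[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d >= cutoff:  # assumption: fusion requires RMSD strictly below the cut-off
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Toy symmetric RMSD matrix (in Å): poses 0-1-2 form a chain, pose 3 is far away.
toy_rmsd = [
    [0.0, 0.8, 1.5, 5.0],
    [0.8, 0.0, 0.8, 5.2],
    [1.5, 0.8, 0.0, 5.1],
    [5.0, 5.2, 5.1, 0.0],
]

single = agglomerative(toy_rmsd, 1.0, min)
complete = agglomerative(toy_rmsd, 1.0, max)
print("single-linkage:  ", single)
print("complete-linkage:", complete)
```

On this toy matrix, pose 3 stays a singleton under both rules. Single-linkage fuses poses 0, 1 and 2 through the 0-1 and 1-2 links, although poses 0 and 2 are 1.5 Å apart; complete-linkage stops after fusing poses 0 and 1.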
Input
ClusDOCK requires a single file (maximum file size: 50 MB), which is the output of one of the following docking programs:

Program | Format |
---|---|
AutoDock Vina | PDBQT |
FRED (OpenEye) | SDF |
HYBRID (OpenEye) | SDF |
The three clustering parameters are:
- The clustering algorithm
- The clustering cut-off (default cut-off: 1 Å)
- The clustering minimum size (default minimum size: 4)
1. Clustering algorithms
ClusDOCK can execute the cluster analysis with 3 different algorithms. Each method is based on the calculation of an N × N matrix in which the element (i, j) contains the RMSD of the ith binding pose calculated with respect to the jth binding pose.
- Gromos algorithm (described by Daura et al. (1999), Angew. Chem. Int. Ed. 38:236-240)
For each structure, the number of other structures for which the RMSD is under the cut-off (neighbour conformations) is calculated. The structure with the highest number of neighbours represents the centre of a cluster, and forms a cluster together with all its neighbours. The structures of this cluster are then eliminated from the pool of structures. The process is repeated until the pool of structures is empty.
- Agglomerative hierarchical algorithms
At each stage, the structures or clusters of structures that have the lowest RMSD are fused. Differences between the methods arise because of the different ways of defining the RMSD between two clusters or a structure and a cluster. To form a cluster, this RMSD value must be lower than the cut-off.
- Single-linkage clustering (or nearest-neighbour method) When using this algorithm, the RMSD between two clusters is equal to the lowest RMSD between two structures, one in one cluster, one in the other.
- Complete-linkage clustering (or farthest-neighbour method) When using this algorithm, the RMSD between two clusters is equal to the highest RMSD between two structures, one in one cluster, one in the other.
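The gromos procedure described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not ClusDOCK's code: the function name `gromos`, the toy matrix `toy_matrix`, the recounting of neighbours within the remaining pool at each round, and the tie-break on the lowest index are all choices made here.

```python
def gromos(rmsd, cutoff):
    """Return clusters as lists of pose indexes, centre first."""
    pool = set(range(len(rmsd)))
    clusters = []
    while pool:
        # Neighbours of each remaining pose: other remaining poses with
        # an RMSD under the cut-off (assumption: recounted every round).
        neighbours = {
            i: [j for j in pool if j != i and rmsd[i][j] < cutoff]
            for i in pool
        }
        # Centre: pose with the most neighbours; ties go to the lowest
        # index (an arbitrary choice for this sketch).
        centre = min(pool, key=lambda i: (-len(neighbours[i]), i))
        members = [centre] + sorted(neighbours[centre])
        clusters.append(members)
        # Remove the whole cluster from the pool before the next round.
        pool -= set(members)
    return clusters


# Toy symmetric RMSD matrix (in Å) for five poses.
toy_matrix = [
    [0.0, 0.6, 1.4, 3.0, 3.2],
    [0.6, 0.0, 0.7, 3.1, 3.3],
    [1.4, 0.7, 0.0, 2.9, 3.0],
    [3.0, 3.1, 2.9, 0.0, 0.5],
    [3.2, 3.3, 3.0, 0.5, 0.0],
]

gromos_clusters = gromos(toy_matrix, 1.0)
print(gromos_clusters)
```

With a 1 Å cut-off, pose 1 has two neighbours (poses 0 and 2) and becomes the centre of the first cluster; poses 3 and 4 then form the second cluster.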
2. Clustering cut-off
When using the gromos algorithm, the cut-off affects the number of neighbour conformations, thus the definition of the clusters. When using agglomerative hierarchical algorithms, the cut-off defines when the algorithm stops, thus the “optimal” number of clusters.
How to choose the clustering cut-off
The default cut-off is 1 Å. The value depends on the system:
- the clustering cut-off for protein-protein complexes should be higher (usually between 4 and 9 Å)
- for protein-small molecule docking, the cut-off is lower (usually between 1 and 2 Å)
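The effect of the cut-off on the neighbour counts can be illustrated with a small sketch. It assumes that poses are given as lists of (x, y, z) atom coordinates in the same atom order and in the same receptor frame, so that the RMSD is computed without superposition; how ClusDOCK handles atom matching or symmetry is not covered here. The names `pose_rmsd`, `rmsd_matrix` and `neighbour_counts` are hypothetical.

```python
import math


def pose_rmsd(a, b):
    """RMSD (no superposition) between two poses with identical atom order."""
    squared = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(squared / len(a))


def rmsd_matrix(poses):
    """N x N matrix whose element (i, j) is the RMSD between poses i and j."""
    n = len(poses)
    return [[pose_rmsd(poses[i], poses[j]) for j in range(n)] for i in range(n)]


def neighbour_counts(matrix, cutoff):
    """Number of other poses under the cut-off, for each pose."""
    return [
        sum(1 for j, d in enumerate(row) if j != i and d < cutoff)
        for i, row in enumerate(matrix)
    ]


# Toy two-atom "ligand" translated along x by 0, 0.5, 1.6 and 6.0 Å.
toy_poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(1.6, 0.0, 0.0), (2.6, 0.0, 0.0)],
    [(6.0, 0.0, 0.0), (7.0, 0.0, 0.0)],
]

matrix = rmsd_matrix(toy_poses)
for cutoff in (1.0, 2.0):
    print(cutoff, neighbour_counts(matrix, cutoff))
```

Raising the cut-off from 1 Å to 2 Å turns pose 2 from an isolated pose into a neighbour of poses 0 and 1, which changes both the cluster centre and the cluster membership under the gromos algorithm.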
3. Clustering minimum size
The cluster minimum size is the number of structures that a group must contain to be considered a cluster.
How to choose the minimum size
The default minimum size of a cluster is 4. The value depends on the number of compared poses:
- when only a few poses are clustered, this value should not be very high
- when a lot of poses are clustered, it can be higher
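As a minimal sketch of how a minimum size acts on a clustering result, the hypothetical helper below separates the groups that reach the threshold from those that do not. It assumes that a group counts as a cluster when it contains at least `min_size` structures; the helper name and the toy data are illustrative only.

```python
def split_by_min_size(clusters, min_size=4):
    """Separate groups that reach the minimum size from those that do not."""
    kept = [c for c in clusters if len(c) >= min_size]
    discarded = [c for c in clusters if len(c) < min_size]
    return kept, discarded


# Toy clustering result: lists of pose indexes.
toy_clusters = [[1, 0, 2, 5, 7], [3, 4, 6, 8], [9, 10], [11]]

kept_default, dropped_default = split_by_min_size(toy_clusters)       # default: 4
kept_small, dropped_small = split_by_min_size(toy_clusters, min_size=2)
print(len(kept_default), "clusters kept with minimum size 4")
print(len(kept_small), "clusters kept with minimum size 2")
```

With only twelve poses, the default of 4 keeps two groups, while a minimum size of 2 also keeps the pair of poses 9 and 10.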
Outputs
ClusDOCK generates two different outputs:
- a table that contains, for each cluster, the number of conformations found in the cluster and their specific indexes (referring to the input submitted by the user), where the first one is the most representative of the cluster, as well as its scoring function
- a bar plot that shows on the y-axis the number of conformations for each cluster and on the x-axis the scoring function (determined by the docking program) for each cluster
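The content of the table can be pictured with the following sketch, which builds one row per cluster from a clustering result and a list of docking scores. The row keys (`cluster`, `size`, `indexes`, `score`), the 0-based indexes and the toy scores are assumptions for illustration; the actual layout of the ClusDOCK table may differ.

```python
def summarise(clusters, scores):
    """One row per cluster: size, member indexes and representative score.

    Assumes the first index of each cluster is its representative pose.
    """
    rows = []
    for number, members in enumerate(clusters, start=1):
        representative = members[0]
        rows.append({
            "cluster": number,
            "size": len(members),
            "indexes": members,
            "score": scores[representative],
        })
    return rows


# Toy data: two clusters and one docking score per pose (made-up values).
example_clusters = [[1, 0, 2], [3, 4]]
example_scores = [-7.9, -8.4, -7.5, -6.8, -6.1]

rows = summarise(example_clusters, example_scores)
for row in rows:
    print(row["cluster"], row["size"], row["indexes"], row["score"])
```

The same rows carry what the bar plot needs: the cluster size for the y-axis and the representative score for the x-axis.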