2
EVA is a multivariate molecular descriptor for use in QSAR studies.
				Type: Object | 
				Advantage: None | 
				Novelty: New | 
				ConceptID: Obj1
3
It is constructed from vibrational eigenvalues derived from either a quantum theoretical or molecular mechanical treatment of molecular structure.
				Type: Object | 
				Advantage: None | 
				Novelty: New | 
				ConceptID: Obj1
4
This paper applies the method to biological-activity data using measures of the inotropic potential of a range of Calcium channel agonists.
				Type: Goal | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Goa1
5
The performance of the descriptor, as both an explanatory and a predictive tool, is analysed in relation to the way in which it is constructed using a rigorous statistical treatment.
				Type: Goal | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Goa1
6
Its capabilities are examined in relation to those of previously published methodology which used a composite descriptor.
				Type: Goal | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Goa1
7
It is shown to have improved performance and several procedural advantages, such as ease of calculation and operation.
				Type: Conclusion | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Con1
8
It is a 3-D structural descriptor which does not require prior co-alignment of structures for a QSAR study.
				Type: Object | 
				Advantage: Yes | 
				Novelty: New | 
				ConceptID: Obj1
Introduction
9
A Quantitative Structure-Activity Relationship (QSAR) is a statistical model relating a common property of a series of chemicals (such as their responses in an assay of a particular type of biological activity) to their molecular structure.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met1
10
The major goal is to use this relationship to discover new chemicals with properties optimally suited to a defined application in an area such as pharmaceuticals, agrochemicals or catalysts.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met1
11
The process entails relating structurally significant features of a set of chemicals to activity measurements (responses) that are the complex result of integration of many physical, chemical and metabolic processes.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met1
12
The structural features are summarised numerically in a molecular descriptor, but it is often difficult to do so efficiently and effectively since these features may also be very complex.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met1
13
Furthermore, many studies are performed on data sets comprising limited numbers of compounds and associated responses and this restricts the number of variables that can be used to describe structure in the QSAR model.
				Type: Motivation | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mot1
14
In order to maximise the value of this small number of variables, molecular descriptors that summarise efficiently a large amount of diverse chemical information are required.
				Type: Motivation | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mot1
15
One recent example is the theoretically-based multivariate descriptor, EVA.1–3
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
The EVA (EigenVAlue) descriptor
16
The premise on which the EVA approach is based is that a molecule's normal modes of vibration encode a suitable description of chemical structure, summarising the features necessary for the production of robust, predictive, QSARs.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
17
The normal modes are readily derived, using computational techniques based on either Quantum Theoretical or Molecular Mechanical methods.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp1
18
When suitably encoded they may be used as robust structural descriptors in QSAR lead-optimisation studies.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp1
19
In the following section, an overview of the approach is provided; for a more complete discussion, the reader is referred to reference .1
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac1
20
QSAR studies demand as complete a structural description of a molecule as possible, in numerical terms.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met1
21
Using Quantum Mechanics, the molecular wavefunction, Ψ, provides a thorough characterisation of the nuclear and electronic properties of a molecule (and in principle should also allow for the estimation of properties such as biological activity).
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met1
22
However, extraction of useful information for a QSAR study is difficult.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met1
23
Following studies with experimentally-based Infra-red Spectra, theoretically-derived normal modes of vibration were identified as a suitable molecular descriptor encoding information on the identity of the constituent atoms, their bonding and spatial relationships (hence molecular shape and size), their vibrational modes and molecular electronic structure.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
24
Intuitively, such a descriptor should constitute a robust, predictive, structural representation.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp1
25
Calculation of a molecule's normal modes is achieved using the molecular electronic potential energy function, V, which is derived from the molecular wavefunction.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod2
26
Normal Coordinate Analysis (NCA) is performed to characterise the molecular vibrational properties.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met2
27
This approach involves the pointwise estimation of the potential function to determine the equations of motion, yielding both normal coordinate EigenVAlues (the normal mode frequencies of vibration) and their associated eigenvectors (the atom vibration vectors).
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met2
28
The EVA procedure makes use of the eigenvalues, the eigenvectors being retained for final back-transformation to the molecular structures.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
29
The normal modes for a set of molecules cannot be used directly in a multivariate statistical analysis because current methods demand that the descriptor for each molecule in the training set is comprised of the same number of well-defined components, (i.e., the molecular descriptor must be a vector of fixed dimension).
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met2
30
Since the number of normal modes varies with the number of atoms N in a molecule (actually 3N-6 for a molecule without axial symmetry) each descriptor within a series will generally be of different dimension.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met2
31
Data transformation is therefore required to convert, without loss of information, a varying number of vibrational eigenvalues into the descriptor EVA, which is of fixed dimension.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
32
This is done by projecting the eigenvalues (vibrational frequencies) onto a bounded frequency scale (BFS) with individual vibrations represented by points along this axis.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
33
The frequency range is chosen to be 0 cm−1 to 4000 cm−1 to encompass all fundamental molecular vibrations.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
34
Each of the 3N-6 vibrations is represented by an equivalent Gaussian curve, G{f(μ), σ2}, in which μ is the vibrational frequency; the area under each curve is assigned as unity.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
35
Proximate or coincident Gaussian functions are permitted to overlap and the “intensity” is summed.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
36
The choice of σ for each function defines the degree to which the vibrations overlap and is typically 10 cm−1 to 20 cm−1.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
37
This smoothing operation introduces, deliberately, a “fuzziness” to the spectrum of vibrations and is a key step in the definition of EVA; a value that is too small will fail to detect similarities where they exist, and one that is too large will result in overlaps that obscure significant proportions of the variation in the descriptor.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
38
The process results in a degree of serial correlation in the descriptor that causes inevitably some redundancy in the descriptor variables.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
39
However, it also enables the significance of the presence or absence of peaks to be assessed in the subsequent analysis, together with the monitoring of changes in peak position.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
40
Once a Gaussian smoothing function has been applied, the resultant spectrum is sampled across its whole width in fixed increments (L cm−1).
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met3
41
The choice of increment determines the number of variables in the EVA descriptor; thus, for example, a 2 cm−1 increment results in a descriptor string of 2000 variables.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met3
42
Molecular vibrations are therefore depicted as “peaks” of intensity on a scale of frequency versus height.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod1
43
This produces the molecular descriptor, EVA, which retains the integrity of its constituent frequency data.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met3
44
The choice of the variables σ and L can have a marked effect on the quality of the subsequently derived QSAR/QSPR model, and this is discussed below (see section 3.2).
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met3
45
The EVA descriptor was originally formulated at Shell Research Ltd, Sittingbourne Research Centre, and was applied successfully in a series of agrochemical lead-optimisation studies.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac2
46
While much of the experimental data used in the development remain proprietary and hence are not available for publication, its utility has been demonstrated by modelling experimentally observed logP values.1
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac2
47
The intention of the present study is to illustrate the application of the method to biological data that are within the public domain, incidentally providing a comparison of the performance of the EVA descriptor with that of other, more familiar, descriptors.
				Type: Goal | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Goa1
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac3
49
This paper compares the performance of EVA with the published results of this study.
				Type: Goal | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Goa1
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met4
51
The additional inclusion of selected physicochemical descriptors, and the consequences of variable scaling, was then considered in turn to assess whether the QSAR model could be improved upon.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac3
52
All results were cited in terms of the squared correlation coefficient of the fitted PLS regression model, r2.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod3
53
The intention of the present studies is to conduct a parallel investigation to examine the performance of the EVA descriptor.
				Type: Goal | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Goa1
54
A particular focus will be to examine the robustness of the predictive ability of the EVA based QSAR model.
				Type: Goal | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Goa1
55
Results will therefore be presented in terms of the squared correlation coefficients of the cross-validated equations, q2, rather than the fitted regression model.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod3
56
The parameter q2 is commonly used as a measure of the predictive ability of the regression model, but in order to facilitate comparison between the two studies the values of r2 will also be included as appropriate.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod3
57
This paper reports the results of an investigation to show how to optimise the EVA descriptor in order to increase the predictive power of a QSAR equation.
				Type: Goal | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Goa1
Experimental
The Calcium Channel Agonist data set
58
Table 1 shows the structures of the compounds used by Davis et al.5,6
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac4
59
The compounds were assayed in vitro for their inotropic potency, i.e., their ability to increase cardiac contractibility, using guinea pig atria paced at 1 Hz.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met4
60
This was expressed as the concentration of the drug that increased the tension developed in the preparation to 50% of the isoprenaline maximum (EC50).
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met4
61
Results were expressed relative to the EC50 of the standard calcium channel agonist Bay K 8644 (Fig. 1) and these values are given in Table 1.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac4
62
Also listed in Table 1 are values of the calculated log octanol/water partition coefficient (clogP) and the calculated molar refractivity (cMR) respectively, corresponding to the various structures.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac4
63
These were calculated using MedChem software v.3.54 B (1989).
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met5
64
In the corresponding table in the paper by Davis et al. the quoted values for these parameters are different, having been modified by in-house fitting to a few experimentally determined values.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac4
65
In the present paper we use the standard values in order to maintain external consistency.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac4
GRID descriptor
66
For the 3-D QSAR analysis of the data set, Davis et al. made use of the GRID descriptor.7
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met4
67
The technique was developed originally as a method of probing proteins for potential binding sites and has been used successfully to predict sites of binding of small ligands in proteins,8 in addition to its more recent role as a structure descriptor.
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met4
68
A detailed discussion of the derivation of the GRID descriptor can be found in reference .7
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac5
The inclusion of physicochemical descriptors
69
A limitation of GRID is that by generating the descriptors from the calculated interaction energy of a probe molecule, only the enthalpic component of the drug–receptor interaction can be modelled.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met4
70
This neglects other factors which can be, potentially, highly influential, such as the entropic changes resulting from the interaction.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met4
71
In recognition of this limitation, Davis et al5,6. included two physicochemical descriptors, clogP and cMR.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod4
72
ClogP represents a particularly useful descriptor since by accounting for lipophilicity and solvation, features that are influenced heavily by hydrogen bonding, it can give information on the entropic changes associated with a molecule's coming out of solution and binding at a receptor site.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod4
EVA descriptor generation
73
EVA descriptors were generated for the 36 compounds in the calcium channel agonist set.5,6
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met3
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac6
75
The correct structures were used in the present study and were created using Sybyl 6.2 from Tripos Associates10 on a Silicon Graphics Indy R4000 platform.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met6
76
In the previous studies, Davis et al. constrained the energy minimisation of the structures in an attempt to minimise noise in the GRID descriptors and hence optimise the subsequent QSAR investigations.5,6
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met4
77
In order to perform the calculation of the normal modes of vibration required for EVA, it is a prerequisite to ensure that the potential energy of a molecular conformation is at a stationary point.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met3
78
Slight deviations away from this energy minimum can lead to the introduction of rotational and translational modes during the determination of the equations of motion in the NCA.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met3
79
This in effect ‘contaminates’ the vibrational modes, resulting in the determination of vibrations that are not purely harmonic in nature and degrading significantly the quality of the calculated vibrational spectrum.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met3
80
A more detailed discussion of this is provided in reference .11
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac7
81
No attempt was made therefore to reproduce directly the conformational co-ordinates used by Davis et al.5,6
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met3
82
Molecular structures were optimised using the MOPAC 6.0 AM1 Hamiltonian12 (keywords: SCFCRT = 1.D-12 GNORM = 0.05).
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met7
83
Successfully minimised structures were then used to generate the NMs (additional keyword: FORCE).
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met7
84
These were then converted into EVA descriptors in the manner described.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met3
85
It is acknowledged that the calculation of NMs using a semi-empirical route is less accurate than for example, by the use of Density Functional Theory methods.13
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met7
86
However, the use of the MOPAC AM1 Hamiltonian has previously been shown to produce qualitatively good results.14
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met7
87
Furthermore, EVA descriptors generated from the MOPAC AM1 Hamiltonian have been applied successfully in previously reported studies.14
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met7
88
Consequently, it is believed that the EVA descriptors generated in this way are acceptable for the present investigation.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp2
Statistical analysis
89
To generate useful QSARs between the activity, EC50, and the EVA descriptor, the Partial Least Squares regression method (PLS) was adopted for the investigations; this is in line with the GRID-based studies of Davis et al.5,6
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met8
90
PLS is a supervised approach that provides a reduced solution of new ‘latent’ variables that are linear combinations of the original variables and are well correlated with Y. In essence, it is a projection method that defines a hyperplane through the x-descriptor space.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod5
91
The latent variables represent the projected ‘summaries’ of the original variables onto this plane and each PLS component represents a co-ordinate dimension of the hyperplane.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod5
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac8
93
The PLS regression approach is an accepted technique for handling situations where over-square matrices are encountered and has some advantages over unsupervised techniques such as, for example, Principal Components analysis.
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met8
94
By constructing the components in a supervised fashion, it is possible to produce fewer numbers of components correlated with Y, thereby summarising the data more efficiently.
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met8
95
With matrices such as EVA and GRID this is a clear advantage.
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met8
96
As has been mentioned earlier in section 1.1, by having the ability to vary the form and dimension of the EVA descriptor (i.e., σ and L), it is possible to change the manner in which the structural information is presented.
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met3
97
In effect, this can be thought of as a form of data scaling, in line with log and variance scaling methods.
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met3
98
This affects ultimately the regression analysis, particularly with respect to the co-variance with the activity data and the way in which the information is loaded onto the components.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp3
99
Therefore an essential stage in the application of EVA is the generation of descriptors with different σ and L values and their subsequent systematic validation to identify the optimum predictive model.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp3
100
Previous studies have shown that a σ value of between 10 and 20 cm−1 is a reasonable starting point for such investigations.1
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac9
Model validation and significance testing
101
A key step in regression analysis is the validation to determine the correct dimensionality of the model, i.e., the number of significant PLS components to be included.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp4
102
Without this validation there is a risk of ‘over-training’, leading to apparently well-fitting models with effectively little predictive capacity.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp4
103
As the intention of the present study is to assess the predictive capability of the EVA descriptor, two forms of model validation have been incorporated.
				Type: Goal | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Goa1
104
In line with the study by Davis et al., Cross-Validation (CV) has been used to test each successive PLS component for significance.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met9
105
In addition, randomisation tests have been included to establish confidence limits on whether or not random correlations are being selected in place of real relationships.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met9
106
CV represents one of the most favoured methods of testing the PLS components for significance.
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met9
107
In essence, CV randomly divides the data set into groups of compounds.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod3
108
Each group is then omitted from the data set in turn and the remaining set used to construct a new regression model, which is then used to calculate the responses for the members of the omitted set.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod3
109
From the predicted y-values for each of the omitted ‘test’ sets, the sum of squares of the differences from actual values can be calculated.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod3
110
This yields the PRESS score (Predictive Residual Sum of Squares) and by weighting this against both the inclusion of successive components and the number of cases in the set, the standard error of cross-validation, Scv, can be calculated.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod3
111
This provides a good test of the significance of the inclusion of each component upon the overall regression model and has some advantages over other procedures such as, for example, the F-ratio test.
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met9
112
By simulating how well the model predicts new data, it provides an estimate of the robustness of the model and potentially can highlight models that are influenced by data points with high leverage (e.g., outliers).
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met9
113
As the underlying intention of a QSAR is to make predictions, then this is useful.
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met9
114
Initial analysis of the EVA descriptor models was performed using Leave-One-Out Cross-Validation (LOO CV).
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met10
115
To determine the optimum regression model, the Scv scores for the first 10 PLS components were calculated and the minimum Scv score identified.15
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met10
116
Previous experience has shown that the relevant information linking the response and descriptor variables is summarised in the first few components (typically ≤ 6).1
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac9
117
The squared correlation coefficient of the cross-validated equation, q2, was then cited for the inclusion of each successive component.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac9
118
In a previous publication,1 it was shown that the LOO CV q2 obtained on a training set of 135 structures, using EVA as the descriptor and logPow as the measured response, was a very good indicator of subsequent predictive power for logPow in an unrelated test set of 76 compounds.
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met10
119
However, work by Shao17 has challenged the suitability of LOO CV for testing the predictive robustness of a QSAR model and suggests that Leave-n-Out Cross-Validation, where n is greater than 1, represents a more robust alternative.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met10
120
Consequently, in the present investigations we have compared LOO CV results with those obtained using Leave-Group-Out Cross-Validation (LGO CV) with n = 7 (see section 3.2); this is also in line with the validation approach applied by Davis et al.5,6
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met11
121
However, while LOO CV represents an easily reproducible test, i.e., each run will yield the same q2 result, in LGO CV the cases assigned to each group to be left out are selected at random.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met11
122
This means that LGO CV must be repeated a large number of times (e.g., 1000) in order to obtain a realistic estimate of the internal predictivity and is hence more computationally demanding than LOO CV.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met11
123
Our work shows that with this data set and analytical approach, LOO CV gives very similar results to LGO CV and so we have adopted the simpler procedure.
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met10
124
It must be acknowledged that neither LOO CV nor LGO CV can be regarded as a sufficient indicator of model validity; in addition, some measure of the probability that the result may be a chance occurrence is needed.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp5
125
This is particularly important in regression studies where there are a smaller number of cases than the dimensionality of the x-descriptor, i.e., where an over-square matrix is present.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp5
126
In such instances, there is an increased possibility of the extraction of a random correlation.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp5
127
Consequently, in order to address this with regard to the EVA descriptor, a randomised permutation test was performed to examine the model for chance correlations; this is in line with previous studies that have applied the EVA descriptor.2,18
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
128
The process involves:
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
129
1.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
130
Randomly assigning observations to structures,
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
131
2.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
132
Performing a regression analysis using EVA and then evaluating q2 (or r2),
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
133
3.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
134
Repeating steps 1 and 2 a large number of times (e.g., 1000 times) to ensure thorough sampling,
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
135
4.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
136
Determining the frequency distribution of q2 (or r2),
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
137
5.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
138
Determining where in this frequency distribution the q2 (or r2) of the real assignment lies.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
139
An actual result that is extreme in the upper confidence tail of the distribution is regarded as having a low probability of occurring by chance
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
Results and discussion
140
In the following section, the results are reported for the QSAR investigation of a set of 36 calcium channel agonists (Table 1), using EVA as the structural descriptor and PLS regression as the statistical approach.
				Type: Goal | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Goa1
141
Except where stated otherwise, all results will be reported in terms of the squared correlation coefficient (q2) of the Leave-One-Out Cross-Validated (LOO CV) equation.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met10
142
To facilitate comparisons, Table 2 is taken directly from the work of Davis et al., detailing the PLS regression models reported for the full 36 compound data set.5,6
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac6
Determination of optimum PLS regression model
Optimum EVA descriptor model
143
Initial identification of the optimum EVA model was performed by comparing the results obtained with a series of descriptors using σ values of 10, 20, 30 and 40 cm−1, respectively.
				Type: Method | 
				Advantage: None | 
				Novelty: New | 
				ConceptID: Met3
144
The results established a narrowed focus of attention around σ = 20 cm−1 and a further series of EVA descriptors was then generated (16 ≤ σ ≤ 26 cm−1).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res1
145
Comparison of the predictive performance of the models identified the optimum EVA model as having σ = 24 cm−1.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res1
146
Throughout, the BFS start point, S, was fixed at 1 cm−1 and L was defined as σ/2.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod6
147
This has been assumed previously1 to represent a reasonable sampling frequency for a given value of σ and is justified below (section 3.2).
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac2
148
The results for the full range of EVA models considered are included in Table 3.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res1
Model transformations and variance scaling effects
149
As was demonstrated earlier,5,6 PLS analysis can be highly sensitive to the relative scales of variance between the x-block and y-variable (this is especially significant when different x-descriptors, each with its own unit and scale, are included in the regression equation, although this situation does not apply to EVA).
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met8
150
A series of transformations and scaling procedures was applied to both the activity and EVA variables, to investigate the impact on performance.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met13
151
Scaling and transformation tests were conducted in two parts.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met13
152
In the case of the activity variable, EC50, both log scaling and variable standardisation were considered, as in the study of Davis et al.5,6
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met13
153
For the EVA descriptor, two forms of variance scaling were applied, Autoscaling and Blockscaling.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met14
154
Autoscaling standardises each individual variable column to zero mean and unit variance, while Blockscaling scales the entire descriptor block to zero mean and unit variance.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met14
155
The impact of these two scaling methods is of particular interest, because differences in column variance across the descriptor are considered to contain structurally significant information.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp6
156
The full results are reported in Table 4.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res2
157
Variance scaling with the EC50 variable has no impact on the PLS regression model (4c model, q2 = 0.498).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res2
158
Similarly, blockscaling the EVA descriptor does not affect the performance.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res2
159
However, log10 transformation of the activity values does result in a significant degradation of the PLS model (1c model, q2 = 0.194).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res2
160
Notably, use of the autoscaled EVA descriptor yields the worst model of all.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res2
EVA QSAR model testing and stability
161
As discussed above (section 2.4.1), with over-square x-matrices the PLS regression approach, potentially, can extract chance correlations with the y-variable.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met8
162
In the case of the EVA descriptor, an extreme example of an over-square matrix, then the likelihood of such an occurrence may be increased.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp7
163
Therefore, a series of rigorous validation exercises was applied to establish confidence in the QSAR model obtained.
				Type: Goal | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Goa1
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met12
165
Fundamentally, two variables, namely σ (Gaussian standard deviation) and L (the BFS sampling interval), determine the form and dimension of the EVA descriptor when derived from the normal coordinate spectrum.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met3
166
Potentially, the value chosen for L can impact on the information content retained in the final descriptor model and an extreme illustration of this would occur if the sampling interval were so large that not all of the intensities under the Gaussian curves were to be sampled.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met3
167
As outlined earlier in the initial determination of the optimum EVA model, L was defined as σ/2.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod6
168
This ensures that all structural information is retained, while keeping the associated computational time within reason.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod6
169
Validation of this procedure was carried out by comparing the effects of using a range of sampling intervals (σ/4 ≤ L ≤ 4σ).
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met15
170
The full set of results is shown in Table 5 and illustrated in Fig. 2.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res3
171
For sampling intervals where L ≤ σ, the information content remains intrinsically stable, with no significant impact on the statistical performance of the EVA models.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res3
172
However when L > σ, decreasing stability in the performance in the EVA model is observed.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res3
173
The results confirm that L = σ/2 is a reasonable rule.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res3
174
To test rigorously the predictive performance of the optimum EVA model (EVA{24,12}), LGO CV was used and compared with the results of LOO CV.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met11
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met11
176
Repeated runs of the LGO CV (1000 iterations) were carried out with different randomisations to ensure that the range of possibilities was thoroughly sampled.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met11
177
The results for the LGO CV study are presented in Table 5.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res4
178
These show that the 4-component model occurs most frequently (560 iterations; Fig. 3a), in agreement with LOO CV.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res4
179
The range of q2 scores for the LGO CV 4 component model is approximately normal in distribution (Fig. 3b and 3c).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res4
180
Determination of the mean q2 reveals that q2LGO compares closely with q2LOO (Fig. 2c: q2LGO = 0.474, q2LOO = 0.498), which gives confidence in the use of LOO CV to test the range of EVA models.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res4
181
Figs. 4 and 5 show the results of the randomised activity permutation tests on the same data set, obtained by using 1000 permutations of the response data with respect to structure.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res5
182
Fig. 3 shows the frequency distribution and normalised plot of q2, obtained from the LOO CVs; the actual experimental result of 0.498 lies well outside the upper 95% confidence limit (>99%).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res6
183
A similar result is shown in Fig. 4 for the fitted r2 distribution.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res7
184
Table 6 summarises these results.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res8
Augmentation of EVA {24,12} with the physicochemical descriptors clogP and cMR
185
Davis et al5,6. showed that the 3D molecular descriptor GRID did not by itself encode sufficient information to give a high-quality QSAR regression model of −log EC50 (r2 = 0.42).
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met4
186
However, inclusion of clogP and cMR, in linear combination with GRID, gives a significant improvement (r2 = 0.86).
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod4
187
It is therefore of interest to examine the influence of including clogP and cMR with EVA.
				Type: Motivation | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mot2
188
The results are given in Table 6 and show that inclusion of these parameters (either singly or in combination) degrades the quality of the fit (r2) and also the predictive power (q2).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res9
EVA and analysis of the data of Davis et al5,6.
189
QSAR methods that relate biological response to molecular structure are traditionally based on linear free energy relationships, as illustrated in the pioneering work of Hansch et al.19
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met1
190
Typically, these methods attempt to draw correlations between empirical descriptors which supposedly represent generalised forms of enthalpic and entropic effects associated with a drug molecule's transport to, and interaction with, a receptor or enzyme active site.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met1
191
The current research is aimed at further developing a new molecular descriptor, EVA, which encodes a large amount of diverse chemical information and can be used more effectively than the previous empirical forms.
				Type: Goal | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Goa1
192
Analysis of the predictive performance of a range of EVA PLS models (Table 3) identifies the optimum descriptor as having a Gaussian sigma; = 24 cm−1 and a sampling interval L = 12 cm−1, EVA {24,12}.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res1
193
Cross-Validation (using both LOO CV and LGO CV methods) suggests that the first 4 components are significant, yielding a QSAR that is able to account for circa. 80% of the variation in the EC50 activity (fitted r2 = 0.801).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res4
194
Rigorous testing of the internal predictivity of the EVA QSAR model (Tables 3 and 5) shows that for structures outside the training set circa. 50% of the activity variation may be predicted with confidence, an experimentally useful result (4-component models, q2LGO = 0.474; q2LOO = 0.498).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res4
195
The close agreement between the two validation methods, illustrated in Fig. 2, gives considerable confidence in the use of LOO CV to validate the chosen EVA model and indicates that EVA is robust as a predictive QSAR tool, i.e., the model is stable.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res4
196
Since Davis et al5,6. analysed their data with log-transformation of the EC50 values (i.e., the y-variate was −log(EC50), Table 1) and their original data are not available to us, an exact comparison of the two sets of results is difficult.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res4
197
Moreover, cross-validation tests were not reported so it is only possible to compare values of r2, obtained by fitting the full-regression equation, rather than q2.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac10
198
The best model of Davis et al5,6. uses a linear combination of GRID, clogP and cMR.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod4
199
(GRID alone gives a 1-component model, r2 = 0.42, clogP alone a 1-component model, r2 = 0.69, and GRID in combination with clogP and cMR gives a 4-component model, r2 = 0.86).
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac3
200
The values of −log(EC50) predicted by this model are not quoted, but a comparison with experimentally observed values is given graphically in Fig. 5 of reference 5; by careful measurement, it is possible to obtain reliable estimates for the predicted −log(EC50) values.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac3
201
Taking antilogarithms yields values for predicted EC50 and the regression statistics may be calculated and compared with those obtained for the EVA model.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp8
202
Overall, the full EVA model (r2 = 0.80, standard error = 4.1) gives a considerably better fit to the untransformed experimental data than does the GRID + clogP + cMR model (r2 = 0.42, standard error = 7.0).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res10
203
The results are shown in Table 7 and are further illustrated in Fig. 6 which shows plots of residuals for both treatments.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res10
204
For small values of EC50 (typically less than 4.1, the standard error of the EVA treatment) the two models (not surprisingly) cannot be distinguished; however, for higher values of EC50 EVA consistently outperforms the other model by an impressive margin.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res10
205
A model-validation test using randomised activity permutations establishes confidence intervals for both the predictive (Fig. 4) and fitted (Fig. 5) regression models and demonstrates that, although the EVA descriptor presents an over square x-matrix in the PLS regression model, random correlations with the activity data are highly unlikely (>99% probability that the correlations have not arisen by chance).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res10
206
EVA contains structurally significant features that account well for the observed biological activity.
				Type: Conclusion | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Con1
207
It should also be emphasised that EVA is a robust descriptor of 3D molecular structure, which does not require prior alignment of structures in order for a QSAR study to be performed.
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met3
208
This is especially useful when dealing with non congeneric series, where alignment about a common skeletal feature is not feasible.1,3
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met3
209
Although the relevance, multicolinearity and redundancy of the EVA descriptors have not been addressed in this study, they probably provide the most likely explanation for the relatively small values obtained for the estimates of LOO q2.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res11
210
These aspects will be reported in a future publication.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res11
211
Clearly, a model that yielded a q2 of >50% would be even more useful.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res11
212
Plots of the predicted vs. actual activity for both the fitted and cross-validated regression models (Figs. 7 and 8) highlight problems with outliers, particularly those compounds with EC50 values which are essentially zero; these structures can only contribute noise to the analysis, and the consequences are addressed in detail in section 3.7.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res12
Data scaling
213
In studies where the x-variate function is a vector consisting of a number of variables such as molecular properties with different dimensionalities and variances, careful tailoring of the variable variances is critical in extracting an optimum QSAR model.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp9
214
This is true, for example, of the work by Davis et al.5,6
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac3
215
In the present study however, Table 4 shows that blockscaling and autoscaling of the x-variate function does not improve the EVA QSAR model.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res2
216
In particular, autoscaling (which scales columns independently to zero mean and unit variance), performs significantly worse.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res2
217
In the EVA descriptor, all variable columns are presented on the same scale with the differences in variance between them intrinsically encoding structurally relevant information.
				Type: Method | 
				Advantage: None | 
				Novelty: Old | 
				ConceptID: Met3
218
Consequently, individual columns should not be standardised since this would result in giving columns with little variation and therefore little information, undue weighting.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met3
219
Indeed, in-house development with proprietary data sets (Shell Research Ltd., the former Sittingbourne Research Centre) has in the past demonstrated that scaling typically does not improve the performance of EVA-based QSAR models.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac11
220
Log transformation of the y-variate, EC50, leads to a substantially worse EVA model.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac11
221
This contrasts markedly with the results of a similar transformation in the work of Davis et al., which gave the best result.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res13
222
This is investigated further in section 3.7, below.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac11
Inclusion of the physicochemical descriptors
223
It is reasonable to ask the question as to whether or not the inclusion of the physicochemical parameters clogP and cMR would improve the EVA model.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp10
224
The results of doing so are shown in Table 8 and provide an insight into the information content of EVA.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res14
225
It is clear that the inclusion of either clogP or cMR, or both, does not lead to an improved QSAR model, even with scaling of the variables in this more complex x-variate vector.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res14
226
The implication may be that the information contained within the physicochemical parameters is already encoded intrinsically within the EVA descriptor.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp11
227
In order to test this hypothesis, namely that EVA encodes physicochemical information, regression models were constructed for clogP and cMR, respectively.
				Type: Model | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Mod7
228
The results (Table 9) show that while EVA accounts for the variation in cMR extremely well (r2 = 0.983, q2 = 0.963), it does not do so at all well for clogP (r2 = 0.415, q2 = 0.281).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res15
229
The result with clogP is surprising, since in previous work it has been shown that EVA QSAR can explain the variation in experimentally measured logP values across a wide variety of structurally-diverse, non-related compounds and be usefully predictive (135 compounds, r2 = 0.96, q2 = 0.68).1
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res15
230
Why should the approach be less successful with the congeneric set of structures described in Table 1?
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp12
231
The most likely reason for this anomaly is in the variance in clogP; the range of values is relatively small, i.e., only 3 log units in comparison to 6 log units in the logP study.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp12
Investigation of outliers
232
It was noted that eight of the structures in the data set have corresponding values of EC50 that were approaching zero, ranging from 0.01 to 0.09 (see Table 1).
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac12
233
Given the nature of the bioassay from which they have been derived, within experimental error, they probably may not be significantly different from zero.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac12
234
Inclusion of these data has the consequence of biasing the activity data set (y-variate) heavily, and introducing noise in the form of irrelevant descriptor information (x-variate).
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp13
235
The consequence of removing these data to leave a set of 28 structures and activity scores was investigated.
				Type: Goal | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Goa2
236
Table 10 compares the regression results with those obtained for the original set of 36 structures.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res16
237
For the untransformed EC50 data, there was a slight improvement in both the fit of the full regression equation and in the predictive power, albeit with the addition of a further 2 components in the regression model (6 component model, r2 = 0.939, q2 = 0.527, cf. a 4-component model, r2 = 0.802, q2 = 0.498).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res16
238
In contrast, there was a highly significant change in the performance of the regression results for the −log(EC50) scores, Figs. 9 and 10 (5-component model, r2 = 0.963, q2 = 0.721, cf. a 1-component model, r2 = 0.332, q2 = 0.194).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res17
239
This compares favourably with the results of a similar transformation in the work of Davis et al. (4-component model, r2 = 0.86).
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res17
240
More importantly, the q2 result would provide significant confidence in support of the application of this model in a predictive role.
				Type: Conclusion | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Con2
241
However, it should be recognised that chemical interpretability will be non-trivial, given the Gaussian transformation and subsequent sampling of the normal modes of vibration.
				Type: Conclusion | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Con2
242
The degree of improvement in the log transformed EC50 was marked, however, it was not unexpected.
				Type: Result | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Res17
243
As was highlighted earlier, PLS analysis can be highly sensitive to the relative scales of variance between the x-block and y-variable.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met8
244
Certainly, for the log-transformed y-variable set, this variance would have been significant.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp14
245
Furthermore, given the likely experimental error in the bioassay, it is reasonable to postulate that relative differences in the log values for these compounds would have been subject to significant error.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp14
246
Consequently, in attempting to generate a model, these data would have had a greater bias in the log-transformed set.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp14
247
It remains interesting to speculate on whether the sensitivity in model formation was also in part a product of the choice of an essentially linear regression model.
				Type: Conclusion | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Con3
248
While outside the scope of this study, it remains an appealing point for future studies to test the performance of the EVA descriptor across a series of non-linear statistical methods.
				Type: Conclusion | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Con3
Chemical similarity searching
249
The formulation of the EVA descriptor as a bit string opens up the possibility of using the encoded vibrational mode information for chemical similarity searching.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp15
250
There is a substantial body of literature that describes the procedures and software available for this purpose and which points out the strengths and pitfalls of the approach.20–22
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac13
251
Current procedures are based principally on the use of chemical fragments represented as topological maps, atom counts and bond types.
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac13
252
Use of EVA for similarity searches would provide useful additional information to complement and extend this structural data.
				Type: Hypothesis | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Hyp15
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac13
Conclusions
254
Within the context of the present study, EVA is well suited to the production of usefully predictive QSARs with both biological and physicochemical data.
				Type: Conclusion | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Con1
255
The absence of a requirement for additional descriptors demonstrates that EVA has a very high information content encompassing, intrinsically, physicochemical descriptors such as clogP and cMR.
				Type: Conclusion | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Con1
256
In direct comparison with a composite descriptor (GRID + clogP + cMR), EVA has been shown to have superior predictive performance.
				Type: Conclusion | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Con2
257
As a general tool for QSAR studies, EVA advantageously may be applied without the prior alignment of structures required for other 3-D descriptors such as COMFA and GRID.
				Type: Method | 
				Advantage: Yes | 
				Novelty: Old | 
				ConceptID: Met3
258
However, optimisation of the EVA descriptor by adjusting how it is constructed can influence the predictive power of a QSAR equation.
				Type: Method | 
				Advantage: No | 
				Novelty: Old | 
				ConceptID: Met3
259
Further studies with other sets of biological data will be necessary in order to demonstrate the robustness of these conclusions.
				Type: Conclusion | 
				Advantage: None | 
				Novelty: None | 
				ConceptID: Con3
Abbreviations
260
QSAR, Quantitative Structure-Activity Relationship; EVA {σ, L, S}, Eigen VAlue molecular spectral descriptor model where: σ (sigma) = standard deviation of the EVA Gaussian curve, L = sampling interval of summed intensities and S = sampling interval start point on the bounded frequency scale; BFS, Bounded Frequency Scale (0 to 4000 cm−1); NCA, Normal Co-ordinate Analysis; NMs, Normal Modes (of Vibration); EC50, relative force of contraction; clogP, calculated log octanol/water partition coefficient; cMR, calculated molar refractivity; PLS, Partial Least Squares regression analysis; r2, squared correlation coefficient; CV, cross-validation; LOO, ‘Leave One Out’ cross-validation; LGO, ‘Leave One Group Out’ cross-validation; q2, squared correlation coefficient of the cross-validation equation; PRESS, Predictive REsidual Sum of Squares; Scv, cross-validated standard error, where ; n, number of compounds in data set; c, number of PLS components included in QSAR model; F-ratio, weighted ratio of r2 to 1.0 − r2, where 
				Type: Background | 
				Advantage: None  | 
				Novelty: None | 
				ConceptID: Bac14