PROLSQ: Constrained Refinement of Atomic Parameters
Authors: Wayne Hendrickson and John Konnert
Contact: James M. Stewart, Department of Chemistry,
University of Maryland, College Park, MD 20742
PROLSQ is a restrained-parameter structure-factor least-squares atomic
parameter refinement program for large molecules. This XTAL version is derived
from the program of John H. Konnert (1976) which was elaborated for protein
refinement by Wayne A. Hendrickson (1980). Hendrickson (1985) has described the
technique and philosophy of the use of PROLSQ in a readily acquireable
publication.Since that time a number of others have made changes, in details,
to the original code and documentation. Among these contributors are Barry
Finzel, Steven Sheriff, Anita Sielecki, Janet Smith, & Alex
Wlodawer.
The program will refine atomic models using both the conditional
structure-factor least-squares proceedure described by Waser and geometric
restraints so that the canonical stereochemistry of the sub-elements of the
structure will be taken into account in the least-squares process.
The information about the restraining geometry must be supplied by use of the
PROTIN program of Hendrickson which supplies the geometry in terms of
connectivity, covalent radii, bond angles, planarity, chirality, and other
structural features of macromolecular substances. This information and the
diffraction data are combined in the derivative matrix of the least-squares
procedure such that elastic bounds are imposed on the geometry of the model.
The restrained least-squares proceedure uses a greatly reduced, "sparse",
matrix. The parameter shifts are derived from this matrix not by inversion but
by the iterative conjugate gradient method of Hestenes & Stiefel applied,
described, and programmed by Konnert (1976). This approach to refinement has
several advantages over conventional structure factor least-squares procedures:
first, the size of the matrix increases linearly rather than quadratically.
Second, the restraints effectively reduce the number of parameters to be
determined. Third, the method usually shows a larger radius of convergence.
Fourth, all parameters may be refined simultaneously, but only correlations
dictated by the geometry are included in the derivative matrix.
Before the PROTIN/PROLSQ programs are used, Hendrickson's paper (1985) should
be read. In the section "Refinement Strategy", pages 264-268, an outline is
given of the appropriate use of these codes for the refinement of
macromolecular structures.
The method is described by Konnert (1976). The function minimized is the
weighted sums of the difference in the squared F calculated and F relative, of
squared ideal and observed interatomic distances, and of other types of
restraints. The list of the currently allowed restraints is given in the
documentation of PROTIN and will not be reiterated here. The weighted, squared
differences are multiplied by appropriate numerical differentials of the
structure factors, the distances, or other factors, with respect to the
parameters of the structure to form the vector of the normal equations. The
matrix of the normal equations is formed from the derivative terms alone. Only
correlations between atoms related by constraining features of the structure
are saved in memory. This results in the formation of a "sparse" matrix since
the off-diagonal elements are dispersed throught the matrix. Since all
parameters are refined simultaneously and standard matrix inversion methods are
not space efficient and the problem is not separable into a number of smaller
sets of equations, as in conventional block-diagonal least squares, Konnert
uses the conjugate gradient iterative technique for the solution of the normal
equations. The conjugate gradient technique retains the sparse derivative
matrix throughout the solution for the shifts in atomic parameters. The
non-zero elements of the hypothetical n by n derivative matrix are retained
during the iterative solution procedure. For the refinement of a structure of n
atoms and m distance restraints, for example, the number of related elements in
one half of the symmetric matrix that need be stored is 6n + 9m. Since m is
roughly linear with n, m is about 3n, storage space required varies linearly
with n. To ensure rapid convergence, it is necessary to know or determine the
optimum damping factor for the computed shifts. In the XTAL version of PROLSQ
this value is set to 0.5 by default. However, use of the rtest
line will permit calculating R for a selected sub-set of reflections. Once a
suitable damping factor is found it must be applied to all shifts if the
desired geometry is to be retained.
The optional wgtxxx input lines allow for the adjustment of
refinement weights. These weights are given in terms of "standard deviations"
of the class of observation. The default values set in this XTAL version are
similar to those shown by Hendrickson (1985). In this paper there is a section
devoted to the proper use of weights. The setting of these weights is very
important to the sucessful use of PROLSQ on two accounts. First, if the
restraint weights are imbalanced with respect to geometric restraints and
structure factor refinement, one will dominate the other giving rise to large
divergence from ideal geometry or divergence in terms of the structure factors.
If the restraints are too heavily weighted the R value may actually increase as
the refinement goes forward since the conformity to the idealized model will
dominate the structure factor information. Second, if any of the weights cause
the weighted Δs squared to be very large there is the possibility that
"floating point overflow" will occur during the congugate gradient process,
especially on 32 bit machines. The numbers in the output which should be
monitored are the "average weighted Δ squared" values for each of the
restraints and for the structure factors. In a reasonable refinement these
values all approch a common value, usually near 1.0 as the refinement proceeds
to completion. Futhermore, if any of the categories shows a value much larger
than the others, then that restraint will apply with the greatest effect in the
least-squres process.
The values of the weights may be set by the use of the wgtxxx
input lines and by the inclusion of experimental σF values. See
Hendrickson's paper for elaboration of this point. A value for the weights that
will force the average to 1.0 may be found by looking at the corresponding
value of the average Δ given in the output and supplying this value as
the sigma in the appropriate wgtxxx line. For example, consider
the following "snapshot" of input and output:
Including the input lines: wgtb *2 0.5 0.5 0.5 0.5 0.5 0.5 0.5
0.5
wgtref 1 2.5 -10.0
Produces the following output:
...........................................................................
>>>>>>>>>>>>>>>>>>>>>>
THERMAL PARAMETER RESTRAINTS
<<<<<<<<<<<<<<<<<<<<<<<
SUM(WGT*DELTA**2) = .4163904+04
AVG(WGT*DELTA**2) = 23.525 FOR ISOTROPIC THERMAL FACTORS
ROOT RMS MEAN RMS SIGMA
TYPE 1/WGT NUMBER DELTA SIGMA \DELB\ DELU U THERMAL FACTOR
TYPE
1 .500 21 1.912 .500 1.296 .1281 .07 MAIN-CHAIN
BOND
2 .500 21 2.278 .500 1.676 .1457 .07 MAIN-CHAIN
ANGLE
3 .500 42 2.487 .500 1.575 .1412 .07 SIDE-CHAIN
BOND
4 .500 63 3.060 .500 1.947 .1570 .07 SIDE-CHAIN
ANGLE
5 .500 0 .000 .500 .000 .0000 .07 X-H BOND
6 .500 0 .000 .500 .000 .0000 .07 X-H ANGLE
7 .500 0 .000 .500 .000 .0000 .07 METAL COORD.
ETC.
8 .500 3 1.342 .500 1.341 .1303 .07 HYDROGEN
BOND
...........................................................................
AGREEMENT FACTORS BASED ON PARAMETERS BEFORE THIS CYCLE
SUM(WGT*DELTA**2) = .5221144+07 AVG(WGT*DELTA**2) = 1508.57
NUMERATOR DENOMINATOR R
STANDARD R-FACTOR 3203.9 68627.7 .047
WEIGHTED R-FACTOR 2285.0 46805.6 .049
ERROR R-FACTOR 686.3 68627.7 .010
USED REJECTED R SURVEY R SURVEY
3461 0 0 0
.93 IS THE AVERAGE \FO-FC\ DISCREPANCY
---------- RESOLUTION BREAKDOWN ----------
DMIN REFS <SIGF>/DATA <SIGF>/USED <FO-FC>
R SHELL R SPHERE
5.00 36 .72 3.43 2.998 .041
.041
3.00 116 .73 2.79 3.673 .050
.048
2.50 101 .46 2.33 2.008 .044
.047
2.00 226 .39 1.90 1.977 .051
.048
1.70 277 .28 1.43 1.217 .043
.047
1.50 324 .22 1.02 1.020 .047
.047
1.30 2381 .12 .16 .568 .046
.047
ALL 3461
.047
MODELLED DELTA = ( 2.5) + ( -10.0)*(S-1/6)
FITTED <FO-FC> = ( 2.5) + ( -10.3)*(S-1/6)
...........................................................................
Then, from the same starting point, the run is made again having altered only
the weight line. Note the values of Δ in the first run and the values of
sigma in the second run:
Input: wgtb *2 1.9 2.3 2.5 3.1 0.5 0.5 0.5 1.3
wgtref 1 2.5 -10.0
Output:
...........................................................................
>>>>>>>>>>>>>>>>>>>>>>
THERMAL PARAMETER RESTRAINTS
<<<<<<<<<<<<<<<<<<<<<<<
SUM(WGT*DELTA**2) = .1480259+03
AVG(WGT*DELTA**2) = .836 FOR ISOTROPIC THERMAL FACTORS
ROOT RMS MEAN RMS SIGMA
TYPE 1/WGT NUMBER DELTA SIGMA \DELB\ DELU U THERMAL FACTOR
TYPE
1 1.900 21 1.912 1.900 1.296 .1281 .15 MAIN-CHAIN
BOND
2 2.300 21 2.278 2.300 1.676 .1457 .17 MAIN-CHAIN
ANGLE
3 2.500 42 2.487 2.500 1.575 .1412 .17 SIDE-CHAIN
BOND
4 3.100 63 3.060 3.100 1.947 .1570 .19 SIDE-CHAIN
ANGLE
5 .500 0 .000 .500 .000 .0000 .07 X-H BOND
6 .500 0 .000 .500 .000 .0000 .07 X-H ANGLE
7 .500 0 .000 .500 .000 .0000 .07 METAL COORD.
ETC.
8 1.300 3 1.342 1.300 1.341 .1303 .12 HYDROGEN
BOND
...........................................................................
AGREEMENT FACTORS BASED ON PARAMETERS BEFORE THIS CYCLE
SUM(WGT*DELTA**2) = .5221144+07 AVG(WGT*DELTA**2) = 1508.57
NUMERATOR DENOMINATOR R
STANDARD R-FACTOR 3203.9 68627.7 .047
WEIGHTED R-FACTOR 2285.0 46805.6 .049
ERROR R-FACTOR 686.3 68627.7 .010
USED REJECTED R SURVEY R SURVEY
3461 0 0 0
.93 IS THE AVERAGE \FO-FC\ DISCREPANCY
---------- RESOLUTION BREAKDOWN ----------
DMIN REFS <SIGF>/DATA <SIGF>/USED <FO-FC>
R SHELL R SPHERE
5.00 36 .72 3.43 2.998 .041
.041
3.00 116 .73 2.79 3.673 .050
.048
2.50 101 .46 2.33 2.008 .044
.047
2.00 226 .39 1.90 1.977 .051
.048
1.70 277 .28 1.43 1.217 .043
.047
1.50 324 .22 1.02 1.020 .047
.047
1.30 2381 .12 .16 .568 .046
.047
ALL 3461
.047
MODELLED DELTA = ( 2.5) + ( -10.0)*(S-1/6)
FITTED <FO-FC> = ( 2.5) + ( -10.3)*(S-1/6)
...........................................................................
Each atom in the structure is loaded from the XTAL program PROATM. PROATM
reads atom parameters and "names" in the Protein Data Bank format. The atom
names consist of an atom serial number, which changes as the numbers of known
atoms in the model changes; an atom name, which consists of a scattering factor
symbol, a remoteness indicator code, and a branch designator; an alternate
location indicator; a residue name; a chain identifier; a residue sequence
number; and a code for insertion of residues. In PROLSQ listings these name
parts are compressed to a 9 character string for compact listing purposes. This
string is structured in the following way:
Position Contents
1 An integer pointing to the chain to which the atom belongs
2 A single letter code for the residue to which the atom belongs
3-5 The residue sequence number from the PDB input to PROATM
6-9 The atom name; symbol, remotness, and branch; left justified
The single letter codes used by PROLSQ are shown in the following table which
shows the three character and one character code side by side.
Residue 3 1 Residue 3 1 Residue 3 1
Alanine ALA A Arginine ARG R Asparagine ASN N
Aspartic acid ASP D ASP/ASN ambig. ASX B Cysteine CYS C
Cystine CYS C Glutamine GLN Q Glutamic acid GLU E
GLU/GLN ambiguous GLX Z Glycine GLY G Histidine HIS H
Isoleucine ILE I Leucine LEU L Lysine LYS K
Methionine MET M Phenylalinine PHE F Proline PRO P
Serine SER S Threonine THR T Tryptophan TRP W
Tyrosine TYR Y Valine VAL V
The total mass of printing may be controlled with the use of a PRINT line. The
print line allows for selected and limited printing of shift, reflection, and
selected large deviations from distance restraints at priority 3. Most echoing
of constraint information is done at priority 4. The results of each cycle are
printed at priority 3 as are R values which result from use of the RTEST input
line. If a sampling of the matrix and vector elements is desired, print
priority 5 may be specified. This feature will result in copious printed output
in all but the smallest test cases.
Reads atom and reflection data from an input archive bdf
Writes refined atomic parameters and structure factors on the output archive
bdf
Reads geometric constraint data from PROTIN on file prt
Optionally writes images of Protein Data Bank ATOM lines on
pch file
Note that the program PROTIN must have been run before PROLSQ may be run in
order that the appropriate constraint information be available on file F for
use in the refinement process.
PROLSQ ncy 25 pch iso
print alis 1
stats 5.0 4.0 3.0 2.0 1.0
rtest 5 23 1 1 4 3 .25 .45 .65 1.25 .35 .65 .95
In this example the atomic parameters of crambin are to be refined with
individual isotropic thermal displacement parameters. Twenty five congugate
gradient cycles will be used to establish the parameter shifts, the refined
coordinates will be written to PCH as well as to the output
bdf. The print line specifies that a listing of the shift
information for all atoms will be written to the print file. The
stats line directs that R values be calculated for shells of
resolution as shown. The rtest line will cause every fifth
reflection to be written to a scratch bdf on file G and every twenty third
reflection to be written on file H. For both of these files the sampling
frequency will begin with the first reflection. Once refinement is complete
these scratch files will be used to calculate R values for the subset of
reflections contained in them. Four structure factor calculations will be made
damping the positional parameter shifts of the atoms by .25, .45, .65, and
1.25. Then three runs will be made damping the isotropic B shifts by .35, .65,
and .95. Because of the 23 in the second field of the rtest,
line the structure factors, and therefore, the R values for the excluded
reflections will also be calculated. This test using excluded reflections is a
sensitive measure of the success of a refinement. If the R value of the
excluded sub-set does not track downward with that of the included reflections
it may well be a signal of a problem with the model or that the restraints are
too slack. That is, the distribution of the restrained parameters about their
"ideal value" is too great; i.e. the energy is too high.
Finzel, B.C. "Incorporation of Fast Fourier Transforms to speed Restrained
Least-squares Refinement of Protein Structures." J. Appl. Cryst. (1987)
20, 53-55
Hendrickson, W.A. "Stereochemically Restrained Refinement of Macromolecular
Structures in Diffraction Methods for Biological Macromolecules Part B" from
Methods in Enzymology Edited by H.W. Wyckoff, C.H.W. Hirs, & S.N.
Timasheff Academic Press, Inc. (1985)
Hendrickson, W.A., & Konnert, J.H. (1979) in Biomolecular Structure,
Conformation, Function, and Evolution, edited by R. Srinivasan, Vol 1, pp.
43-57. Pergamon Press, New York
Konnert, J.H. "A Restrained-Parameter Structure-factor Least-squares Refinement
Procedure for Large Asymmetric Units" (1976). Acta Cryst. A32,
614-617.
Konnert, J.H. & Hendrickson, W.A. "A Restrained-parameter Thermal-factor
Refinement Procedure" (1980). Acta Cryst. A36, 344-350.
Sheriff, S. "Addition of Symmetry-Related Contact Restraints to PROTIN &
PROLSQ", J. Appl. Cryst. (1987) 20, 55-57.