PROLSQ: Constrained Refinement of Atomic Parameters

Authors: Wayne Hendrickson and John Konnert

Contact: James M. Stewart, Department of Chemistry, University of Maryland, College Park, MD 20742

PROLSQ is a restrained-parameter structure-factor least-squares atomic parameter refinement program for large molecules. This XTAL version is derived from the program of John H. Konnert (1976) which was elaborated for protein refinement by Wayne A. Hendrickson (1980). Hendrickson (1985) has described the technique and philosophy of the use of PROLSQ in a readily acquireable publication.Since that time a number of others have made changes, in details, to the original code and documentation. Among these contributors are Barry Finzel, Steven Sheriff, Anita Sielecki, Janet Smith, & Alex Wlodawer.

Purpose

The program will refine atomic models using both the conditional structure-factor least-squares proceedure described by Waser and geometric restraints so that the canonical stereochemistry of the sub-elements of the structure will be taken into account in the least-squares process.

The information about the restraining geometry must be supplied by use of the PROTIN program of Hendrickson which supplies the geometry in terms of connectivity, covalent radii, bond angles, planarity, chirality, and other structural features of macromolecular substances. This information and the diffraction data are combined in the derivative matrix of the least-squares procedure such that elastic bounds are imposed on the geometry of the model. The restrained least-squares proceedure uses a greatly reduced, "sparse", matrix. The parameter shifts are derived from this matrix not by inversion but by the iterative conjugate gradient method of Hestenes & Stiefel applied, described, and programmed by Konnert (1976). This approach to refinement has several advantages over conventional structure factor least-squares procedures: first, the size of the matrix increases linearly rather than quadratically. Second, the restraints effectively reduce the number of parameters to be determined. Third, the method usually shows a larger radius of convergence. Fourth, all parameters may be refined simultaneously, but only correlations dictated by the geometry are included in the derivative matrix.

Before the PROTIN/PROLSQ programs are used, Hendrickson's paper (1985) should be read. In the section "Refinement Strategy", pages 264-268, an outline is given of the appropriate use of these codes for the refinement of macromolecular structures.

Method

The method is described by Konnert (1976). The function minimized is the weighted sums of the difference in the squared F calculated and F relative, of squared ideal and observed interatomic distances, and of other types of restraints. The list of the currently allowed restraints is given in the documentation of PROTIN and will not be reiterated here. The weighted, squared differences are multiplied by appropriate numerical differentials of the structure factors, the distances, or other factors, with respect to the parameters of the structure to form the vector of the normal equations. The matrix of the normal equations is formed from the derivative terms alone. Only correlations between atoms related by constraining features of the structure are saved in memory. This results in the formation of a "sparse" matrix since the off-diagonal elements are dispersed throught the matrix. Since all parameters are refined simultaneously and standard matrix inversion methods are not space efficient and the problem is not separable into a number of smaller sets of equations, as in conventional block-diagonal least squares, Konnert uses the conjugate gradient iterative technique for the solution of the normal equations. The conjugate gradient technique retains the sparse derivative matrix throughout the solution for the shifts in atomic parameters. The non-zero elements of the hypothetical n by n derivative matrix are retained during the iterative solution procedure. For the refinement of a structure of n atoms and m distance restraints, for example, the number of related elements in one half of the symmetric matrix that need be stored is 6n + 9m. Since m is roughly linear with n, m is about 3n, storage space required varies linearly with n. To ensure rapid convergence, it is necessary to know or determine the optimum damping factor for the computed shifts. In the XTAL version of PROLSQ this value is set to 0.5 by default. However, use of the rtest line will permit calculating R for a selected sub-set of reflections. Once a suitable damping factor is found it must be applied to all shifts if the desired geometry is to be retained.

Refinement Weights

The optional wgtxxx input lines allow for the adjustment of refinement weights. These weights are given in terms of "standard deviations" of the class of observation. The default values set in this XTAL version are similar to those shown by Hendrickson (1985). In this paper there is a section devoted to the proper use of weights. The setting of these weights is very important to the sucessful use of PROLSQ on two accounts. First, if the restraint weights are imbalanced with respect to geometric restraints and structure factor refinement, one will dominate the other giving rise to large divergence from ideal geometry or divergence in terms of the structure factors. If the restraints are too heavily weighted the R value may actually increase as the refinement goes forward since the conformity to the idealized model will dominate the structure factor information. Second, if any of the weights cause the weighted Δs squared to be very large there is the possibility that "floating point overflow" will occur during the congugate gradient process, especially on 32 bit machines. The numbers in the output which should be monitored are the "average weighted Δ squared" values for each of the restraints and for the structure factors. In a reasonable refinement these values all approch a common value, usually near 1.0 as the refinement proceeds to completion. Futhermore, if any of the categories shows a value much larger than the others, then that restraint will apply with the greatest effect in the least-squres process.

The values of the weights may be set by the use of the wgtxxx input lines and by the inclusion of experimental σF values. See Hendrickson's paper for elaboration of this point. A value for the weights that will force the average to 1.0 may be found by looking at the corresponding value of the average Δ given in the output and supplying this value as the sigma in the appropriate wgtxxx line. For example, consider the following "snapshot" of input and output:

Including the input lines: wgtb *2 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5

wgtref 1 2.5 -10.0

Produces the following output:

...........................................................................

>>>>>>>>>>>>>>>>>>>>>> THERMAL PARAMETER RESTRAINTS <<<<<<<<<<<<<<<<<<<<<<<

SUM(WGT*DELTA**2) = .4163904+04

AVG(WGT*DELTA**2) = 23.525 FOR ISOTROPIC THERMAL FACTORS

ROOT RMS MEAN RMS SIGMA

TYPE 1/WGT NUMBER DELTA SIGMA \DELB\ DELU U THERMAL FACTOR TYPE

1 .500 21 1.912 .500 1.296 .1281 .07 MAIN-CHAIN BOND

2 .500 21 2.278 .500 1.676 .1457 .07 MAIN-CHAIN ANGLE

3 .500 42 2.487 .500 1.575 .1412 .07 SIDE-CHAIN BOND

4 .500 63 3.060 .500 1.947 .1570 .07 SIDE-CHAIN ANGLE

5 .500 0 .000 .500 .000 .0000 .07 X-H BOND

6 .500 0 .000 .500 .000 .0000 .07 X-H ANGLE

7 .500 0 .000 .500 .000 .0000 .07 METAL COORD. ETC.

8 .500 3 1.342 .500 1.341 .1303 .07 HYDROGEN BOND

...........................................................................

AGREEMENT FACTORS BASED ON PARAMETERS BEFORE THIS CYCLE

SUM(WGT*DELTA**2) = .5221144+07 AVG(WGT*DELTA**2) = 1508.57

NUMERATOR DENOMINATOR R

STANDARD R-FACTOR 3203.9 68627.7 .047

WEIGHTED R-FACTOR 2285.0 46805.6 .049

ERROR R-FACTOR 686.3 68627.7 .010

USED REJECTED R SURVEY R SURVEY

3461 0 0 0

.93 IS THE AVERAGE \FO-FC\ DISCREPANCY

---------- RESOLUTION BREAKDOWN ----------

DMIN REFS <SIGF>/DATA <SIGF>/USED <FO-FC> R SHELL R SPHERE

5.00 36 .72 3.43 2.998 .041 .041

3.00 116 .73 2.79 3.673 .050 .048

2.50 101 .46 2.33 2.008 .044 .047

2.00 226 .39 1.90 1.977 .051 .048

1.70 277 .28 1.43 1.217 .043 .047

1.50 324 .22 1.02 1.020 .047 .047

1.30 2381 .12 .16 .568 .046 .047

ALL 3461 .047

MODELLED DELTA = ( 2.5) + ( -10.0)*(S-1/6)

FITTED <FO-FC> = ( 2.5) + ( -10.3)*(S-1/6)

...........................................................................

Then, from the same starting point, the run is made again having altered only the weight line. Note the values of Δ in the first run and the values of sigma in the second run:

Input: wgtb *2 1.9 2.3 2.5 3.1 0.5 0.5 0.5 1.3

wgtref 1 2.5 -10.0

Output:

...........................................................................

>>>>>>>>>>>>>>>>>>>>>> THERMAL PARAMETER RESTRAINTS <<<<<<<<<<<<<<<<<<<<<<<

SUM(WGT*DELTA**2) = .1480259+03

AVG(WGT*DELTA**2) = .836 FOR ISOTROPIC THERMAL FACTORS

ROOT RMS MEAN RMS SIGMA

TYPE 1/WGT NUMBER DELTA SIGMA \DELB\ DELU U THERMAL FACTOR TYPE

1 1.900 21 1.912 1.900 1.296 .1281 .15 MAIN-CHAIN BOND

2 2.300 21 2.278 2.300 1.676 .1457 .17 MAIN-CHAIN ANGLE

3 2.500 42 2.487 2.500 1.575 .1412 .17 SIDE-CHAIN BOND

4 3.100 63 3.060 3.100 1.947 .1570 .19 SIDE-CHAIN ANGLE

5 .500 0 .000 .500 .000 .0000 .07 X-H BOND

6 .500 0 .000 .500 .000 .0000 .07 X-H ANGLE

7 .500 0 .000 .500 .000 .0000 .07 METAL COORD. ETC.

8 1.300 3 1.342 1.300 1.341 .1303 .12 HYDROGEN BOND

...........................................................................

AGREEMENT FACTORS BASED ON PARAMETERS BEFORE THIS CYCLE

SUM(WGT*DELTA**2) = .5221144+07 AVG(WGT*DELTA**2) = 1508.57

NUMERATOR DENOMINATOR R

STANDARD R-FACTOR 3203.9 68627.7 .047

WEIGHTED R-FACTOR 2285.0 46805.6 .049

ERROR R-FACTOR 686.3 68627.7 .010

USED REJECTED R SURVEY R SURVEY

3461 0 0 0

.93 IS THE AVERAGE \FO-FC\ DISCREPANCY

---------- RESOLUTION BREAKDOWN ----------

DMIN REFS <SIGF>/DATA <SIGF>/USED <FO-FC> R SHELL R SPHERE

5.00 36 .72 3.43 2.998 .041 .041

3.00 116 .73 2.79 3.673 .050 .048

2.50 101 .46 2.33 2.008 .044 .047

2.00 226 .39 1.90 1.977 .051 .048

1.70 277 .28 1.43 1.217 .043 .047

1.50 324 .22 1.02 1.020 .047 .047

1.30 2381 .12 .16 .568 .046 .047

ALL 3461 .047

MODELLED DELTA = ( 2.5) + ( -10.0)*(S-1/6)

FITTED <FO-FC> = ( 2.5) + ( -10.3)*(S-1/6)

...........................................................................

Atom Names In Lists

Each atom in the structure is loaded from the XTAL program PROATM. PROATM reads atom parameters and "names" in the Protein Data Bank format. The atom names consist of an atom serial number, which changes as the numbers of known atoms in the model changes; an atom name, which consists of a scattering factor symbol, a remoteness indicator code, and a branch designator; an alternate location indicator; a residue name; a chain identifier; a residue sequence number; and a code for insertion of residues. In PROLSQ listings these name parts are compressed to a 9 character string for compact listing purposes. This string is structured in the following way:

Position Contents

1 An integer pointing to the chain to which the atom belongs

2 A single letter code for the residue to which the atom belongs

3-5 The residue sequence number from the PDB input to PROATM

6-9 The atom name; symbol, remotness, and branch; left justified

The single letter codes used by PROLSQ are shown in the following table which shows the three character and one character code side by side.

Residue 3 1 Residue 3 1 Residue 3 1

Alanine ALA A Arginine ARG R Asparagine ASN N

Aspartic acid ASP D ASP/ASN ambig. ASX B Cysteine CYS C

Cystine CYS C Glutamine GLN Q Glutamic acid GLU E

GLU/GLN ambiguous GLX Z Glycine GLY G Histidine HIS H

Isoleucine ILE I Leucine LEU L Lysine LYS K

Methionine MET M Phenylalinine PHE F Proline PRO P

Serine SER S Threonine THR T Tryptophan TRP W

Tyrosine TYR Y Valine VAL V

Printer Output

The total mass of printing may be controlled with the use of a PRINT line. The print line allows for selected and limited printing of shift, reflection, and selected large deviations from distance restraints at priority 3. Most echoing of constraint information is done at priority 4. The results of each cycle are printed at priority 3 as are R values which result from use of the RTEST input line. If a sampling of the matrix and vector elements is desired, print priority 5 may be specified. This feature will result in copious printed output in all but the smallest test cases.

File Assignments

Reads atom and reflection data from an input archive bdf

Writes refined atomic parameters and structure factors on the output archive bdf

Reads geometric constraint data from PROTIN on file prt

Optionally writes images of Protein Data Bank ATOM lines on pch file

Example

Note that the program PROTIN must have been run before PROLSQ may be run in order that the appropriate constraint information be available on file F for use in the refinement process.

PROLSQ ncy 25 pch iso

print alis 1

stats 5.0 4.0 3.0 2.0 1.0

rtest 5 23 1 1 4 3 .25 .45 .65 1.25 .35 .65 .95

In this example the atomic parameters of crambin are to be refined with individual isotropic thermal displacement parameters. Twenty five congugate gradient cycles will be used to establish the parameter shifts, the refined coordinates will be written to PCH as well as to the output bdf. The print line specifies that a listing of the shift information for all atoms will be written to the print file. The stats line directs that R values be calculated for shells of resolution as shown. The rtest line will cause every fifth reflection to be written to a scratch bdf on file G and every twenty third reflection to be written on file H. For both of these files the sampling frequency will begin with the first reflection. Once refinement is complete these scratch files will be used to calculate R values for the subset of reflections contained in them. Four structure factor calculations will be made damping the positional parameter shifts of the atoms by .25, .45, .65, and 1.25. Then three runs will be made damping the isotropic B shifts by .35, .65, and .95. Because of the 23 in the second field of the rtest, line the structure factors, and therefore, the R values for the excluded reflections will also be calculated. This test using excluded reflections is a sensitive measure of the success of a refinement. If the R value of the excluded sub-set does not track downward with that of the included reflections it may well be a signal of a problem with the model or that the restraints are too slack. That is, the distribution of the restrained parameters about their "ideal value" is too great; i.e. the energy is too high.

References

Finzel, B.C. "Incorporation of Fast Fourier Transforms to speed Restrained Least-squares Refinement of Protein Structures." J. Appl. Cryst. (1987) 20, 53-55

Hendrickson, W.A. "Stereochemically Restrained Refinement of Macromolecular Structures in Diffraction Methods for Biological Macromolecules Part B" from Methods in Enzymology Edited by H.W. Wyckoff, C.H.W. Hirs, & S.N. Timasheff Academic Press, Inc. (1985)

Hendrickson, W.A., & Konnert, J.H. (1979) in Biomolecular Structure, Conformation, Function, and Evolution, edited by R. Srinivasan, Vol 1, pp. 43-57. Pergamon Press, New York

Konnert, J.H. "A Restrained-Parameter Structure-factor Least-squares Refinement Procedure for Large Asymmetric Units" (1976). Acta Cryst. A32, 614-617.

Konnert, J.H. & Hendrickson, W.A. "A Restrained-parameter Thermal-factor Refinement Procedure" (1980). Acta Cryst. A36, 344-350.

Sheriff, S. "Addition of Symmetry-Related Contact Restraints to PROTIN & PROLSQ", J. Appl. Cryst. (1987) 20, 55-57.