Chapter 3: Indexing, Integration and Scaling the Data
Step 14: Scaling
Click on the Scaling tab to pull up the Scalepack windows. Select the appropriate Global Refinement choice (previously called Postrefinement in the old Scalepack): Non-Slipping Crystal and Perfect Goniostat or Non-Slipping Crystal and Imperfect Goniostat or Small Slippage and Imperfect Goniostat or Custom Postrefinement or No Postrefinement (Figure 45). Click on Scale Sets in the Controls panel to start the first round of scaling (Figure 48).
Due to the correlation between crystal and detector parameters the values of the unit cell parameters refined from a single image may be quite imprecise. This lack of precision is of little significance to the process of integration, as long as the predicted positions are on target. There is no contradiction here, because at some crystal/detector orientation the positions of the reflections may only weakly depend on the value of a particular crystal parameter. At the end of the data reduction process one would wish to get precise unit cell values. This is done in a procedure referred to as a Global Refinement or Postrefinement. The implementation of this method in Scalepack allows for separate refinement of the orientation of each image, but with the same unit cell value for the whole data set. In each batch of data (a batch is typically one image) different unit cell parameters may be poorly determined. However, in a typical data set there are enough orientations to determine precisely all unit cell lengths and angles.
Global Refinement is also more precise than the processing of a single image in the determination of crystal mosaicity and the orientation of each image. This is particularly important when deciding which reflections are fully recorded and which are partially recorded, which is why you should routinely do the Global Refinement.
Within the global refinement, the following parameters can be refined: the unit cell lengths and angles, the crystal orientation parameters rotx, roty, and rotz, and the mosaicity. These can be refined either over the “Crystal” (generally this means over the whole data set) or over the “Batch” (generally this means each image). You can see these listed explicitly if you click on the fourth choice under the Global Refinement menu: Custom Postrefinement.
Figure 45. The global refinement options
The defaults are set so that the unit cell parameters and the mosaicity are refined to be the same over the entire data set. This makes sense for most data collections these days where the crystal is frozen and the entire data set is collected from a single crystal. However, if something happened to the crystal in the middle of the data collection (e.g. a temperature change that caused the unit cell constants to change, or the crystal began to decay severely), then one would want to refine these for each batch rather than over the entire data set.
By contrast, the crystal orientation parameters are by default refined for each batch, or frame, rather than over the entire data set. This allows you to spot any problems with the goniostat. If you refine the orientation for each batch and it doesn’t change much, then on subsequent rounds of refinement you can refine it once for the entire data set (crystal).
In general it is best to assume that the crystal is slipping and that the goniostat is imperfect when choosing Global Refinement options and only restrict the refinement with the other options if the data warrants it.
These options affect how scaling of the integrated diffraction intensities is performed. The default options are: Scale Restrain 0.01, B restrain 0.1, Write Rejection File, Ignore Overloads, Number of Zones 10, Error Scale Factor 1.3, Default Scale 10, and Resolution Limits as determined from the .x file headers (Figure 46). What follows is a discussion of each of the options listed in this panel.
Figure 46. The Scaling Options panel
scale restrain can be used to restrain scale factor differences between consecutive films or batches. The value entered in the box represents the amount by which you will allow the scale factors of consecutive films or batches to differ. It adds a term of (scale1 - scale2)² / (scale restrain)² to the target function minimized in scaling. This only applies to batches between which you add partials. The value should roughly represent the expected relative change in scale factors between adjacent frames.
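The restraint term described above can be illustrated with a minimal Python sketch. It is not Scalepack's code: the function name and the example scale factors are hypothetical, and for simplicity the sketch applies the term to every consecutive pair of batches, whereas the program applies it only to batches between which partials are added.

```python
def scale_restraint_penalty(scales, scale_restrain=0.01):
    """Sum of (scale1 - scale2)^2 / (scale restrain)^2 over consecutive batches.

    `scales` is a list of per-batch scale factors in collection order.
    Assumption: every consecutive pair is restrained.
    """
    penalty = 0.0
    for s1, s2 in zip(scales, scales[1:]):
        penalty += (s1 - s2) ** 2 / scale_restrain ** 2
    return penalty


# Hypothetical scale factors: a smooth drift versus frame-to-frame jumps.
smooth = [1.00, 1.01, 1.02, 1.03]
jumpy = [1.00, 1.20, 0.90, 1.10]

print("smooth:", scale_restraint_penalty(smooth))
print("jumpy: ", scale_restraint_penalty(jumpy))
```

With a scale restrain of 0.01, a change of 0.01 between neighbours contributes about 1 to the target function, while a jump of 0.2 contributes about 400, which is why erratic scale factors are strongly penalised.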
For very thin frames, or frames where most of the reflections are partially recorded (yellow, as opposed to green, spots on the image during integration, see Figure 28), or in sectors whose width is less than the mosaicity, or from data which consists of only a few frames, scale restrain is almost obligatory. In these cases, there may not be enough intersections between frames to get accurate scale and B factors. Indeed, what you may see is both scale and B factors ranging all over the place. If things get really bad, the program may crash due to floating point arithmetic exceptions when taking the exponent of unreasonable B factors. This does not mean that the data is unusable; it simply means that the scale and B factors must be restrained. Note that the restraints only apply to the frames over which you are adding up partially recorded reflections.
It is not correct to try to "bin" the individual frames into larger batches to try to overcome the problem of few intersections between frames. This is because you then lose the ability to add partials between the new "bins." You can, however, overcome the problem in another way, by including instead the statement number of iterations 0. In this case, the data will not be scaled, but simply merged. Obviously, this has its drawbacks.
b restrain can be used to restrain B factor differences between consecutive films or batches. The value which follows the flag represents the amount, in Å², by which you will allow the B factors of consecutive frames or batches to differ.
use rejections on next run means that the reflections that were written to the file reject are not used in subsequent scaling runs. Unless this is explicitly checked reflections will only be flagged for rejection, not actually rejected. Generally one uses this on the second, and all subsequent runs of scaling, unless an anomalous signal is expected.
write rejection file tells the program to create a list of reflections, stored in the file called reject that meet the criteria for rejection. They are applied, i.e. rejected, when the use rejections on next run button is checked.
fix b. This flag tells the program not to fit B factors at all. Usually it is combined with the input of the B factors you want to apply but do not wish to refine anymore or it is used for frozen crystals where you do not expect significant decay. This is in contrast to the default procedure, where the B factors are fit only after the convergence of the scaling. In the default procedure, if scaling does not converge in 20 (default) cycles of refinement, B factors will not be fitted.
anomalous. This is the flag for keeping Bijvoet pairs (I+ and I-) separate in the output file. If the anomalous flag is on, anomalous pairs are considered equivalent when calculating scale and B factors and when computing statistics, but are merged separately and output as I+ and I- for each reflection. See the next section for a discussion of how to treat data that contains an anomalous signal.
scale anomalous. This is the flag for keeping Bijvoet pairs (I+ and I-) separate both in scaling and in the output file. If the scale anomalous flag is on, anomalous pairs are considered non-equivalent when calculating scale and B factors and when computing statistics, and are merged separately and output as I+ and I- for each reflection.
This is a dangerous option because scaling may be unstable due to the reduced number of intersections between images. The danger is much greater in low symmetry space groups. scale anomalous will always reduce Rmerge, even in the absence of an anomalous signal, because of the reduced redundancy. However, the χ² values will not be affected in the absence of an anomalous signal.
ignore overloads. This is useful if you collect data at both low and high exposures and want to ignore the saturated reflections in the high exposure data. It applies to data read in after this command. If it is not checked, fitted profiles with some pixels missing (typically due to overload) are included in the scaling. Note that for summed profiles this does not apply because only profile fitting can estimate the value of the overloaded pixels.
direction cosines produce information that can be read by an outside absorption correction program, such as Shelx. The need for it will disappear as the HKL-2000 absorption correction routines are implemented.
fit goniostat gives the goniostat misalignment angles. It should really only be used by beamline staff.
number of zones. This is the number of resolution shells into which the data are divided for calculating statistics. This input is required and must match the number of zones specified under the estimated error keyword. Handy tip: it's nice to set the number of zones equal to the number of zones used by CNS for the output of refinement statistics by shell, for when you get around to publishing your data and refinement statistics together in a paper. The default value is 10, but for higher resolution data a larger value makes it possible to detect small crystals of ice or salt that are not always easily visible on the diffraction image.
error scale factor. This is a single multiplicative factor which is applied to the input σ. It should be adjusted so that the normal χ² (goodness of fit) value printed in the final table of the output comes close to 1. The default error scale factor is 1.3. It applies to the data which are read after this keyword, so you can apply a different error scale factor to subsequent batches by repeating this input with different values. Reasonable values of the error scale factor are between 1 and 2. If you have to resort to higher values (2, 3, …) in order to get your overall χ² values to 1, then you should be suspicious of some problem with your crystal, indexing, space group/lattice assignment, or detector/goniostat. That doesn’t mean that your data is useless, but it does mean that there are problems with it.
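To see why adjusting the error scale factor moves χ² toward 1, here is a minimal Python sketch. It assumes a simplified goodness of fit for repeated measurements of one reflection, with no systematic error term, so it is not the exact statistic Scalepack prints. The function name and the measurements are hypothetical.

```python
import math


def reduced_chi2(intensities, sigmas, error_scale_factor=1.0):
    """Simplified chi^2 per degree of freedom for repeated measurements.

    Assumption: each sigma is multiplied by the error scale factor and
    no systematic (estimated error) term is added.
    """
    scaled = [error_scale_factor * s for s in sigmas]
    weights = [1.0 / s ** 2 for s in scaled]
    mean = sum(w * i for w, i in zip(weights, intensities)) / sum(weights)
    chi2 = sum((i - mean) ** 2 / s ** 2 for i, s in zip(intensities, scaled))
    return chi2 / (len(intensities) - 1)


# Hypothetical repeated measurements of one reflection.
intensities = [100.0, 110.0, 90.0, 105.0]
sigmas = [5.0, 5.0, 5.0, 5.0]

chi2_raw = reduced_chi2(intensities, sigmas)
# In this simplified model chi^2 scales as 1 / factor^2, so the factor
# that brings chi^2 to 1 is the square root of the unscaled chi^2.
suggested = math.sqrt(chi2_raw)

print("chi2 with factor 1:", chi2_raw)
print("suggested factor:  ", suggested)
print("chi2 after scaling:", reduced_chi2(intensities, sigmas, suggested))
```

In this model a χ² of about 4 would call for a factor of about 2, which is the upper end of the range the text calls reasonable.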
default scale. Overall scale factor used in the absence of an initial scale factor. This is useful if the data are too strong, which is sometimes the case with small molecules. It will reduce the output intensities by the factor entered.
resolution. Minimum d-spacing for this run. Default is the maximum resolution found in the input data. Note that if you integrated your raw data correctly the maximum resolution should slightly exceed the resolution at which your signal drops below 2 σ. Therefore, in subsequent rounds of scaling it is usually important to enter a new value for the maximum resolution over which scaling is performed. Here’s an example: You collected data and integrated it over the resolution range of 50 to 2.2 Å. The first round of scaling was performed over this resolution range and you noted that the average I/σ dropped below 2 at 2.3 Å resolution. Therefore, you should do all subsequent scaling with a minimum resolution of 50 Å and a maximum resolution of 2.3 Å.
Change the resolution after the first round of scaling where you have properly adjusted the error scale factor and rejected outliers. It is only at this point that you can determine where the signal drops below 2 σ. This is your high resolution limit. Once this has been determined, you should delete the reject file and start your scaling over again, this time with the proper resolution limits.
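The rule for choosing the high resolution limit can be sketched in Python as follows. The shell table is hypothetical, loosely modeled on the 50 to 2.2 Å example, and the function is only an illustration of the rule; in practice you read the average I/σ per shell from the final table of the scaling output.

```python
def high_resolution_limit(shells, threshold=2.0):
    """Return the d_min of the last shell whose mean I/sigma is >= threshold.

    `shells` is a list of (d_max, d_min, mean_i_over_sigma) tuples ordered
    from low to high resolution. Returns None if even the first shell is
    below the threshold.
    """
    limit = None
    for d_max, d_min, i_over_sigma in shells:
        if i_over_sigma < threshold:
            break  # the signal has dropped below the threshold
        limit = d_min
    return limit


# Hypothetical statistics by resolution shell (values in Angstroms).
shells = [
    (50.0, 4.0, 25.1),
    (4.0, 3.0, 14.3),
    (3.0, 2.6, 7.8),
    (2.6, 2.4, 3.9),
    (2.4, 2.3, 2.2),
    (2.3, 2.2, 1.4),
]

print("high resolution limit:", high_resolution_limit(shells))
```

For this table the function returns 2.3, so the next round of scaling would run from 50 to 2.3 Å after deleting the reject file.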
The three files and their default names are output.sca, scale.out, and scale.log (Figure 47).
output.sca is the file that contains your scaled reflections. The header lists the unit cell and space group, and the five columns correspond to h, k, l, I, and σ(I). You may rename this file as you wish, although the convention is to retain the .sca suffix to show that it was created by Scalepack.
scale.log is a log of the scaling operations. It is often quite long once the list of rejected reflections is included. You may rename this file as you wish also.
scale.out (not visible in an Output File window) is a file that contains the data for the plots seen in the user interface. You shouldn’t rename this.
The files are all written to the output data directory you specified in the Main page. However, if you fill in the path explicitly in the box provided you can write the files anywhere you’d like.
Figure 47. The Scalepack output files
No. Combining into a single zone for the purposes of calculations those resolution shells where Rmerge is rapidly changing yields misleading statistics. In this case, the shell will be dominated by the strong data at the low resolution end of the zone and give the impression that the high resolution limit of the zone has better statistics than it really does. For example, if you combined all your data into a single zone, the Rmerge in the final shell would be pretty good (=Rmerge overall), when in fact it was substantially worse. It is more sensible to divide your zones into equal volumes and have enough of them so that you can accurately monitor the decay with resolution.
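Dividing the resolution range into zones of equal volume can be illustrated with a short Python sketch. It assumes that "equal volume" means equal volume in reciprocal space, which is proportional to 1/d³. The function name is hypothetical, and the boundaries Scalepack actually uses may differ.

```python
def equal_volume_shells(d_low, d_high, n_zones=10):
    """Return n_zones + 1 shell boundaries (in Angstroms) between d_low and d_high.

    Assumption: each shell spans the same reciprocal-space volume, so the
    boundaries are evenly spaced in 1/d^3.
    """
    v_low = 1.0 / d_low ** 3
    v_high = 1.0 / d_high ** 3
    edges = []
    for i in range(n_zones + 1):
        v = v_low + (v_high - v_low) * i / n_zones
        edges.append(v ** (-1.0 / 3.0))
    return edges


edges = equal_volume_shells(50.0, 2.3, n_zones=10)
for d_max, d_min in zip(edges, edges[1:]):
    print(f"{d_max:6.2f} - {d_min:5.2f}")
```

Under this assumption the first zone spans a wide range of d-spacings and the last zones are narrow, so the decay of the statistics near the high resolution limit is sampled finely.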
The error model is the estimate of the systematic error for each of the resolution shells. There will be exactly the same number of error estimates here as there are number of zones. So if you have 10 zones, you need 10 numbers - one for each zone.
The error estimates do not all have to be the same. The estimated error applies to the data which are read after this keyword, so you can apply different estimated errors to subsequent batches by repeating this input with different values. This is an important point if you enter data from a previous Scalepack output that does not need its σ to be increased.
The error estimates should be approximately equal to the R-factor in the table at the end of the output for resolution shells where statistical errors are small, namely the earlier resolution shells where the data is strong. This is a crude estimate of the systematic error, to be multiplied by I, and is usually invariant with resolution. Default = 0.03 (i.e. 3%) for all zones. Examine the difference between the total error and the statistical error in the final table of statistics. The difference between these numbers tells you what contribution statistical error makes to the total error (σ). If the difference is small, then reasonable changes in the estimated error values will not help your χ² much. This is because the estimated errors represent your guess at, or knowledge of, the contribution of systematic error to the total error, and a small difference indicates that systematic error is not contributing much. If the difference between total and statistical error is significant, and the χ² values are far from 1, then consider adjusting the estimated error values in the affected resolution shells.
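The relationship between statistical, systematic, and total error can be illustrated with a minimal Python sketch. It assumes that the two contributions add in quadrature, with the statistical part multiplied by the error scale factor and the systematic part equal to the estimated error times I. That combination rule is an assumption made for illustration, not a statement of Scalepack's internal formula. The function name and the reflections are hypothetical.

```python
import math


def total_sigma(intensity, sigma_stat, estimated_error=0.03, error_scale_factor=1.3):
    """Combine statistical and systematic error (assumed to add in quadrature)."""
    statistical = error_scale_factor * sigma_stat
    systematic = estimated_error * intensity
    return math.sqrt(statistical ** 2 + systematic ** 2)


# Hypothetical strong and weak reflections as (I, statistical sigma).
for label, intensity, sigma in [("strong", 10000.0, 100.0), ("weak", 100.0, 20.0)]:
    total = total_sigma(intensity, sigma)
    print(f"{label}: statistical = {1.3 * sigma:.1f}, total = {total:.1f}")
```

For the strong reflection the systematic term (3% of I) dominates, so total and statistical error differ markedly. For the weak reflection the two are nearly equal, and changing the estimated error would barely alter χ².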
The default for the Number of Global Refinement Cycles is 10. You can change this by changing the number in the box under Global Refinement (Figure 45).
The rejection file is called reject and contains the hkl’s to be rejected on subsequent rounds of scaling if the Use Rejections on Next Run button is set.