Linear Least-Squares Fitting

There are many situations in chemistry and the other sciences where there is an inherent linear variation between a dependent variable (say, y) and an independent variable (say, x). You will recall that a linear variation between the variables x and y is expressed by the relation y = mx + c, where m and c are two constants. For example, in the kinetics of first-order reactions there exists an inherent linear variation between log [A], i.e., the logarithm of the molar concentration of the reactant A, on one hand, and the time t on the other. (The exact relation for this inherent variation is
log [A] = (–k/2.303) t + log [A]0, where m equals –k/2.303 and c equals log [A]0.)

However, experimental errors of measurement occur to varying extents for the various
points (xi, yi) (where i = 1, 2, 3, ..., n), so the experimental points of a given set of such measurements generally do not lie exactly on a straight line (in the absence of experimental errors, they would surely all have lain on a line, because of the aforesaid inherent linear variation). For example, in the case of a first-order reaction, various experimental errors (naturally!) arise in measuring the varying reactant concentrations [A]i, because of which the points (ti, log [A]i) do not fall exactly on a straight line (you must have experienced this yourself). So, to minimize the effect of the underlying experimental errors, an obvious approach is to find a straight line that runs as close to the experimental points as possible, with the points not falling on the line lying on both sides of it to nearly equal extents (see figure).

One frequently finds such a straight line manually, simply by carefully inspecting the points and choosing a line by eye. There is, however, a statistical way of finding such a straight line through definite mathematical operations, known as the method of linear least-squares fitting. In this method, we consider the deviation di of the experimental value yi of the dependent variable from its value on the chosen straight line (mxi + c), corresponding to the value xi of the independent variable for that point (in the above figure, consider the point P). That is, di = yi – (mxi + c) = yi – mxi – c. Note that the deviation is taken wholly along the vertical direction (see the above figure). The sum S of the squares of these deviations over all the experimental points (i.e., S = Σi di²) is then minimised to obtain the proper straight line (n being the number of experimental points in the given set).
The deviations themselves are not summed, because positive and negative deviations tend to cancel one another even when the individual deviations are large! So that a smaller sum S always indicates smaller individual deviations, we consider the sum of the squares of the deviations instead. Also, because of the squaring, larger individual deviations carry more weight in deciding which line is chosen.
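The point about summing squares rather than raw deviations can be illustrated with a short sketch (the function name and sample points below are my own, chosen for illustration):

```python
# Sum S of squared vertical deviations for a candidate line y = m*x + c.
def sum_sq_dev(points, m, c):
    """S = sum over i of (yi - m*xi - c)**2."""
    return sum((y - m * x - c) ** 2 for x, y in points)

# Hypothetical points lying near (but not on) the line y = 2x + 1.
pts = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8)]

# A line close to the points gives a much smaller S than a poorer guess;
# squaring keeps positive and negative deviations from cancelling.
print(sum_sq_dev(pts, 2.0, 1.0))   # close line: small S
print(sum_sq_dev(pts, 1.0, 2.0))   # poorer line: much larger S
```

The least-squares method simply searches, among all (m, c) pairs, for the one that makes this S smallest.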

We thus get, S = Σi di² = Σi (yi – mxi – c)² = Σi (yi² + m²xi² + c² – 2mxiyi – 2cyi + 2mcxi)
    = Σi yi² + m² Σi xi² + nc² – 2m Σi xiyi – 2c Σi yi + 2mc Σi xi
where we note that the constant c² summed n times simply equals nc². Let us, for simplicity, denote the sums Σi yi² as Syy, Σi xi² as Sxx, Σi xiyi as Sxy, Σi yi as Sy and Σi xi as Sx. This gives us:
    S = Syy + m² Sxx + nc² – 2m Sxy – 2c Sy + 2mc Sx ------ (i). Noting that for a given set of experimental points the above sums Syy, Sxx, Sxy, Sy and Sx are obviously constants (the xi and yi values have already been obtained), we find from relation (i) that the sum S is a function of only the two variables m and c.
[The parameters m and c are, however, variables, because when one chooses different straight lines (say, manually) to represent the set of points, m and c obviously vary.]

Thus, using the differential calculus of a function of two variables, we find that for S to be a minimum, (∂S/∂m)c = 0 and (∂S/∂c)m = 0. Employing equation (i), the differentiations give:
    0 + 2m Sxx + 0 – 2Sxy + 2c Sx = 0 and 0 + 0 + 2cn – 2Sy + 2m Sx = 0. Rearranging and dividing both relations by 2, we get the following two linear equations in the two unknowns m and c:
    Sxx m + Sx c = Sxy ---- (ii) and  Sx m + n c = Sy ---- (iii).
To solve these two equations, we first multiply equation (ii) by n and equation (iii) by Sx, then subtract the second result from the first. This gives m = (nSxy – SxSy)/(nSxx – Sx²). Next we multiply equation (ii) by Sx and equation (iii) by Sxx, then subtract the first result from the second. This gives c = (SxxSy – SxSxy)/(nSxx – Sx²). Note that the denominator is common to both expressions -- commit it (and also the two numerators) to memory!
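The two closed-form expressions just derived translate directly into a few lines of code. Below is a minimal sketch (the function name is my own); Sx, Sy, Sxx, Sxy and n are exactly the sums defined in the text:

```python
# Closed-form linear least-squares fit: returns slope m and intercept c.
def least_squares_fit(xs, ys):
    n = len(xs)
    Sx = sum(xs)
    Sy = sum(ys)
    Sxx = sum(x * x for x in xs)
    Sxy = sum(x * y for x, y in zip(xs, ys))
    d = n * Sxx - Sx ** 2          # the common denominator n*Sxx - Sx^2
    m = (n * Sxy - Sx * Sy) / d    # slope:     (n*Sxy - Sx*Sy) / d
    c = (Sxx * Sy - Sx * Sxy) / d  # intercept: (Sxx*Sy - Sx*Sxy) / d
    return m, c

# Points lying exactly on y = 3x + 2 should return m = 3, c = 2.
print(least_squares_fit([0, 1, 2, 3], [2, 5, 8, 11]))
```

A quick sanity check of such a routine is to feed it points that lie exactly on a known line, as in the last line above: the fitted m and c must then reproduce that line's slope and intercept exactly.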
The above two expressions for m and c are rather easy to remember owing to their symmetry and dimensional characteristics. Count the numbers of x and y in either numerator or in the denominator terms -- they are the same (e.g., two x-s appear in both denominator terms, i.e., in nSxx and Sx²). The left (plus) term always tends to have them together (as in nSxy or nSxx), whereas the right (minus) term tends to have them separate (as in SxSy or Sx²); if in this way only one sum occurs in the left term, the factor n takes the other place. Dimensionally, m must obviously (considering the basic relation y = mx + c) have the dimension of y/x and c the dimension of y. This is wholly consistent with the fact that there are one x and one y in the numerator for m and two x-s in the (common) denominator, whereas there are two x-s and one y in the numerator for c. So, even if you forget the two expressions, you can work them out.

A good technique for performing a linear least-squares fitting calculation on a given set of points is to make a table with columns for the xi, yi, xi² and xiyi values. The sums of these four columns conveniently provide the values of Sx, Sy, Sxx and Sxy, which may then be put into the expressions for m and c to calculate these two parameters (see the following problem). Free software packages are also available for performing linear (or higher-power) least-squares fitting for one-variable (or even two-variable) situations; for such automated calculations, this author prefers the Power-Surface package developed by him.
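The tabular technique described above can itself be mechanised. The sketch below (function name and sample data are my own) builds the four columns row by row, totals them to obtain Sx, Sy, Sxx and Sxy, and then applies the expressions for m and c:

```python
# Build the least-squares table: one row (xi, yi, xi^2, xi*yi) per point,
# plus a TOTAL row giving the four column sums (Sx, Sy, Sxx, Sxy).
def sums_table(xs, ys):
    rows = [(x, y, x * x, x * y) for x, y in zip(xs, ys)]
    totals = tuple(sum(col) for col in zip(*rows))  # (Sx, Sy, Sxx, Sxy)
    return rows, totals

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.0]   # hypothetical data lying near y = 2x
rows, (Sx, Sy, Sxx, Sxy) = sums_table(xs, ys)
n = len(xs)
m = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
c = (Sxx * Sy - Sx * Sxy) / (n * Sxx - Sx ** 2)
print(Sx, Sy, Sxx, Sxy)
print(m, c)
```

Printing the individual rows as well makes the intermediate arithmetic easy to check by hand, exactly as with the paper table.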

Problem: In a kinetic investigation of the acid hydrolysis of AcOMe, the titre volumes Vt of NaOH solution obey the linear relation log (V∞ – Vt) = (–k/2.303) t + log (V∞ – V0). At times t = 0, 10, 20, 30, 45 and ∞ min, a student observed the titre values Vt = 24.2, 25.9, 27.5, 29.0, 31.1 and 44.7 mL respectively. Find the rate constant k using the linear least-squares method.

Answer: Denoting the time t as xi (and naturally excluding the infinite-time point at t = ∞ -- that is needed only to know V∞), and log (V∞ – Vt) as yi, we construct the following table with 5 points (i.e., n = 5):

Point No.   xi = t (min)   yi = log (V∞ – Vt)     xi²      xiyi
    1             0              1.312              0       0.00
    2            10              1.274            100      12.74
    3            20              1.236            400      24.72
    4            30              1.196            900      35.88
    5            45              1.134           2025      51.03
TOTAL           105              6.152           3425     124.37

Thus we get n = 5, Sx = 105, Sy = 6.152, Sxx = 3425 and Sxy = 124.37
This gives m = (5 × 124.37 – 105 × 6.152)/(5 × 3425 – 105²) = –0.003952,
and c = (3425 × 6.152 – 105 × 124.37)/(5 × 3425 – 105²) = 1.313.
So we have m = (–k/2.303), implying k = –2.303 m. This gives k = 9.101 × 10⁻³ min⁻¹.
