Linear Least-Squares Fitting

There are many situations in chemistry and the other sciences where there is an inherent linear variation between a dependent variable (say, y) and an independent variable (say, x). You will recall that a linear variation between the variables x and y is expressed by the relation y = mx + c, where m and c are two constants. For example, in the kinetics of first-order reactions there exists an inherent linear variation between log [A], i.e., the logarithm of the molar concentration of the reactant A, on one hand, and the time t on the other. (The exact relation for this inherent variation is
log [A] = (–k/2.303) t + log [A]0, where m equals –k/2.303 and c equals log [A]0.)

However, experimental errors of measurement occur to varying extents for the various
points (xi, yi) (where i = 1, 2, 3, ..., n), so the experimental points of a given set of such measurements generally do not lie exactly on a straight line (in the absence of experimental errors, they would surely all have lain on a line, because of the aforesaid inherent linear variation). For example, in the case of a first-order reaction, various experimental errors (naturally!) arise in measuring the varying reactant concentrations [A]i, because of which the points (ti, log [A]i) do not fall exactly on a straight line (you must have experienced this yourself). So, to minimize the effect of the underlying experimental errors, an obvious approach is to find a straight line that runs as close to the experimental points as possible, with the points not falling on the line lying on both sides of it to nearly equal extents (see figure).

One frequently finds such a straight line manually, simply by carefully inspecting the points and choosing a line by eye. There is, however, a statistical way of finding such a straight line through definite mathematical operations, known as the method of linear least-squares fitting. In this method, we consider the deviation di of the experimental value yi of the dependent variable from its value on the chosen straight line (mxi + c), corresponding to the value xi of the independent variable for that point (in the above figure, consider the point P). That is, di = yi – (mxi + c) = yi – mxi – c. Note that the deviation is taken wholly along the vertical direction (see the above figure). The sum S of the squares of these deviations over all the experimental points (i.e., S = Σi di²) is then minimised to obtain the proper straight line (n being the number of experimental points in the given set).
The deviations themselves are not summed, because positive and negative deviations tend to cancel one another even when the individual deviations are large! So that a smaller sum S always indicates smaller individual deviations, we consider the sum of the squares of the deviations instead. Also, because of the squaring, larger individual deviations carry more weight in deciding which line is chosen.
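The point about summing squares rather than raw deviations can be illustrated with a short sketch (the function name and sample points below are my own, chosen for illustration):

```python
# Sum S of squared vertical deviations for a candidate line y = m*x + c.
def sum_sq_dev(points, m, c):
    """S = sum over i of (yi - m*xi - c)**2."""
    return sum((y - m * x - c) ** 2 for x, y in points)

# Hypothetical points lying near (but not on) the line y = 2x + 1.
pts = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8)]

# A line close to the points gives a much smaller S than a poorer guess;
# squaring keeps positive and negative deviations from cancelling.
print(sum_sq_dev(pts, 2.0, 1.0))   # close line: small S
print(sum_sq_dev(pts, 1.0, 2.0))   # poorer line: much larger S
```

The least-squares method simply searches, among all (m, c) pairs, for the one that makes this S smallest.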

We thus get, S = Σi di² = Σi (yi – mxi – c)² = Σi (yi² + m²xi² + c² – 2mxiyi – 2cyi + 2mcxi)
    = Σi yi² + m² Σi xi² + nc² – 2m Σi xiyi – 2c Σi yi + 2mc Σi xi
where we note that the constant c² summed n times simply equals nc². Let us, for simplicity, denote the sums Σi yi² as Syy, Σi xi² as Sxx, Σi xiyi as Sxy, Σi yi as Sy and Σi xi as Sx. This gives us:
    S = Syy + m² Sxx + nc² – 2m Sxy – 2c Sy + 2mc Sx ------ (i). Noting that for a given set of experimental points the above sums Syy, Sxx, Sxy, Sy and Sx are obviously constants (the xi and yi values have already been obtained), we find from relation (i) that the sum S is a function of only the two variables m and c.
[The parameters m and c are, however, variables, because when one chooses different straight lines (say, manually) to represent the set of points, m and c obviously vary.]

Thus, using the differential calculus of a function of two variables, we find that for S to be a minimum, (∂S/∂m)c = 0 and (∂S/∂c)m = 0. Employing equation (i), the differentiations give:
    0 + 2m Sxx + 0 – 2Sxy + 2c Sx = 0 and 0 + 0 + 2cn – 2Sy + 2m Sx = 0. Rearranging and dividing both relations by 2, we get the following two linear equations in the two unknowns m and c:
    Sxx m + Sx c = Sxy ---- (ii) and  Sx m + n c = Sy ---- (iii).
To solve these two equations, we first multiply equation (ii) by n and equation (iii) by Sx, then subtract the second result from the first. This gives m = (nSxy – SxSy)/(nSxx – Sx²). Next we multiply equation (ii) by Sx and equation (iii) by Sxx, then subtract the first result from the second. This gives c = (SxxSy – SxSxy)/(nSxx – Sx²). Note that the denominator is common to both expressions -- commit it (and also the two numerators) to memory!
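The two closed-form expressions just derived translate directly into a few lines of code. Below is a minimal sketch (the function name is my own); Sx, Sy, Sxx, Sxy and n are exactly the sums defined in the text:

```python
# Closed-form linear least-squares fit: returns slope m and intercept c.
def least_squares_fit(xs, ys):
    n = len(xs)
    Sx = sum(xs)
    Sy = sum(ys)
    Sxx = sum(x * x for x in xs)
    Sxy = sum(x * y for x, y in zip(xs, ys))
    d = n * Sxx - Sx ** 2          # the common denominator n*Sxx - Sx^2
    m = (n * Sxy - Sx * Sy) / d    # slope:     (n*Sxy - Sx*Sy) / d
    c = (Sxx * Sy - Sx * Sxy) / d  # intercept: (Sxx*Sy - Sx*Sxy) / d
    return m, c

# Points lying exactly on y = 3x + 2 should return m = 3, c = 2.
print(least_squares_fit([0, 1, 2, 3], [2, 5, 8, 11]))
```

A quick sanity check of such a routine is to feed it points that lie exactly on a known line, as in the last line above: the fitted m and c must then reproduce that line's slope and intercept exactly.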
The above two expressions for m and c are rather easy to remember owing to their symmetry and dimensional characteristics. Count the numbers of x and y in either numerator or in the denominator terms -- they are the same (e.g., two x-s appear in both denominator terms, i.e., in nSxx and Sx²). The left (plus) term always tends to have them together (as in nSxy or nSxx), whereas the right (minus) term tends to have them separate (as in SxSy or Sx²); if in this way only one sum occurs in the left term, the factor n takes the other place. Dimensionally, m must obviously (considering the basic relation y = mx + c) have the dimension of y/x and c the dimension of y. This is wholly consistent with the fact that there are one x and one y in the numerator for m and two x-s in the (common) denominator, whereas there are two x-s and one y in the numerator for c. So, even if you forget the two expressions, you can work them out.

A good technique for performing a linear least-squares fitting calculation on a given set of points is to make a table with columns for the xi, yi, xi² and xiyi values. The sums of these four columns conveniently provide the values of Sx, Sy, Sxx and Sxy, which may then be put into the expressions for m and c to calculate these two parameters (see the following problem). Free software packages are also available for performing linear (or higher-power) least-squares fitting for one-variable (or even two-variable) situations; for such automated calculations, this author prefers the Power-Surface package developed by him.
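The tabular technique described above can itself be mechanised. The sketch below (function name and sample data are my own) builds the four columns row by row, totals them to obtain Sx, Sy, Sxx and Sxy, and then applies the expressions for m and c:

```python
# Build the least-squares table: one row (xi, yi, xi^2, xi*yi) per point,
# plus a TOTAL row giving the four column sums (Sx, Sy, Sxx, Sxy).
def sums_table(xs, ys):
    rows = [(x, y, x * x, x * y) for x, y in zip(xs, ys)]
    totals = tuple(sum(col) for col in zip(*rows))  # (Sx, Sy, Sxx, Sxy)
    return rows, totals

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.0]   # hypothetical data lying near y = 2x
rows, (Sx, Sy, Sxx, Sxy) = sums_table(xs, ys)
n = len(xs)
m = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
c = (Sxx * Sy - Sx * Sxy) / (n * Sxx - Sx ** 2)
print(Sx, Sy, Sxx, Sxy)
print(m, c)
```

Printing the individual rows as well makes the intermediate arithmetic easy to check by hand, exactly as with the paper table.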

Problem: In a kinetic investigation of the acid hydrolysis of AcOMe, the titre volumes Vt of NaOH solution obey the linear relation log (V∞ – Vt) = (–k/2.303) t + log (V∞ – V0). At times t = 0, 10, 20, 30, 45 and ∞ min, a student observed the titre values Vt = 24.2, 25.9, 27.5, 29.0, 31.1 and 44.7 mL respectively. Find the rate constant k using the linear least-squares method.

Answer: Denoting the time t as xi (and naturally excluding the infinite-time point at t = ∞ -- that is needed only to know V∞), and log (V∞ – Vt) as yi, we construct the following table with 5 points (i.e., n = 5):

Point No.   xi = t (min)   yi = log (V∞ – Vt)     xi²      xiyi
    1             0              1.312              0       0.00
    2            10              1.274            100      12.74
    3            20              1.236            400      24.72
    4            30              1.196            900      35.88
    5            45              1.134           2025      51.03
TOTAL           105              6.152           3425     124.37

Thus we get n = 5, Sx = 105, Sy = 6.152, Sxx = 3425 and Sxy = 124.37
This gives m = (5 × 124.37 – 105 × 6.152)/(5 × 3425 – 105²) = –0.003952,
and c = (3425 × 6.152 – 105 × 124.37)/(5 × 3425 – 105²) = 1.313.
So we have m = (–k/2.303), implying k = –2.303 m. This gives k = 9.101 × 10⁻³ min⁻¹.
