Let’s suppose we have 3 sets of data to create
3 models.
Here is the first data set:

     y    X1
   2.1     1
   4.5     2
   6.8     3
   9.4     4
  10.8     5
And the second:

     y    X2     X3
   4.8     1   0.25
   9.6     2   0.42
  13.7     3   0.84
  19.6     4   1.32
  21.4     5   1.64
And the third:

     y     X4
   0.5   10.2
   1.1   20.5
   1.4   30.4
   2.3   41.6
   3.4   50.2
Using least squares, the summary
statistics for each of the coefficients are:
Model #1

            Coefficients  Standard Error   t Stat   P-value  Lower 95%  Upper 95%
Intercept        0.03         0.3728       0.0805   0.9409    -1.1564     1.2164
X1               2.23         0.1124      19.8402   0.0003     1.8723     2.5877
Model #2

            Coefficients  Standard Error   t Stat   P-value  Lower 95%  Upper 95%
Intercept      1.1418         2.1708       0.5260   0.6514    -8.1986    10.4822
X2             3.8261         2.9894       1.2799   0.3290    -9.0363    16.6886
X3             1.3420         8.0463       0.1668   0.8829   -33.2786    35.9626
Model #3

            Coefficients  Standard Error   t Stat   P-value  Lower 95%  Upper 95%
Intercept     -0.3674         0.3177      -1.1562   0.3313    -1.3786     0.6438
X4             0.0689         0.0094       7.3223   0.0053     0.0390     0.0989
If all the variables are combined into one model, this is the result:

            Coefficients  Standard Error   t Stat   P-value  Lower 95%  Upper 95%
Intercept      0.0492         0.4615       0.1066   0.9172    -0.9791     1.0775
x1             2.2248         0.1557      14.2931   0.0000     1.8779     2.5716
x2             5.1036         0.9861       5.1753   0.0004     2.9064     7.3009
x3            -1.8103         2.9568      -0.6122   0.5540    -8.3984     4.7778
x4             0.0577         0.0153       3.7725   0.0036     0.0236     0.0918
The goal is to create models that estimate all of the coefficients simultaneously. Arnold Zellner introduced a technique called the seemingly unrelated regression (SUR) model. In this model, this is how the data would be set up:
   y1    x1   x2    x3     x4
  2.1     1    0   0        0
  4.5     2    0   0        0
  6.8     3    0   0        0
  9.4     4    0   0        0
 10.8     5    0   0        0
  4.8     0    1   0.25     0
  9.6     0    2   0.42     0
 13.7     0    3   0.84     0
 19.6     0    4   1.32     0
 21.4     0    5   1.64     0
  0.5     0    0   0     10.2
  1.1     0    0   0     20.5
  1.4     0    0   0     30.4
  2.3     0    0   0     41.6
  3.4     0    0   0     50.2
Of course, each model still needs a column of 1s for the intercept, which leads to the following matrix, which I call the X new matrix:
   y1   x10  x11  x20  x21  x22    x30   x31
  2.1    1    1    0    0   0       0     0
  4.5    1    2    0    0   0       0     0
  6.8    1    3    0    0   0       0     0
  9.4    1    4    0    0   0       0     0
 10.8    1    5    0    0   0       0     0
  4.8    0    0    1    1   0.25    0     0
  9.6    0    0    1    2   0.42    0     0
 13.7    0    0    1    3   0.84    0     0
 19.6    0    0    1    4   1.32    0     0
 21.4    0    0    1    5   1.64    0     0
  0.5    0    0    0    0   0       1   10.2
  1.1    0    0    0    0   0       1   20.5
  1.4    0    0    0    0   0       1   30.4
  2.3    0    0    0    0   0       1   41.6
  3.4    0    0    0    0   0       1   50.2
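Building the stacked y vector and the block-diagonal X new matrix is mechanical. A sketch in NumPy (scipy.linalg.block_diag would work equally well):

```python
import numpy as np

# Per-model design matrices, each with its own column of 1s for the intercept.
X1 = np.column_stack([np.ones(5), [1, 2, 3, 4, 5]])                 # 5x2
X2 = np.column_stack([np.ones(5), [1, 2, 3, 4, 5],
                      [0.25, 0.42, 0.84, 1.32, 1.64]])              # 5x3
X3 = np.column_stack([np.ones(5), [10.2, 20.5, 30.4, 41.6, 50.2]])  # 5x2

# Stack the responses into one 15-element vector.
y = np.concatenate([[2.1, 4.5, 6.8, 9.4, 10.8],
                    [4.8, 9.6, 13.7, 19.6, 21.4],
                    [0.5, 1.1, 1.4, 2.3, 3.4]])

# Place each model's X on the diagonal of a 15x7 block matrix.
X_new = np.zeros((15, 7))
X_new[0:5, 0:2] = X1
X_new[5:10, 2:5] = X2
X_new[10:15, 5:7] = X3

print(X_new.shape)  # (15, 7)
```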
In the frequentist school, the beta vector
β and the covariance matrix of the error terms ∑ are fixed
quantities. However, in the Bayesian school, β follows a normal
distribution and ∑ follows an inverse Wishart distribution.
This is the basic way the Gibbs sampler works
for the SUR model: Start off with least squares as in the frequentist school.
From that, you create the E matrix of the error terms. Since there are 3 models
and each model has 5 rows, the E matrix is a 5x3 matrix. For the above models,
this would be the E matrix:
-0.16   -0.5040   0.1645
 0.01    0.2414   0.0546
 0.08   -0.0492  -0.3276
 0.45    1.3795  -0.1994
-0.38   -1.0767   0.3079
From that, you create the S matrix, which is E′E. For the above models, this would be the S matrix:

 0.3790   1.1090  -0.2587
 1.1090   3.3771  -0.6602
-0.2587  -0.6602   0.2719
It should be noted that the diagonal elements are
equivalent to the sum of squared errors (residuals) from the regression ANOVA
table for each model.
To get the sigma matrix, we divide the
elements by the number of rows in each model. Since there are 5 rows, this is
the sigma matrix for the above data:
 0.0758   0.2218  -0.0517
 0.2218   0.6754  -0.1320
-0.0517  -0.1320   0.0544
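The E, S, and sigma matrices follow directly from the per-model least-squares residuals. A self-contained sketch (it refits the three models inline):

```python
import numpy as np

# Data and per-model design matrices, as set up earlier.
X1 = np.column_stack([np.ones(5), [1, 2, 3, 4, 5]])
X2 = np.column_stack([np.ones(5), [1, 2, 3, 4, 5],
                      [0.25, 0.42, 0.84, 1.32, 1.64]])
X3 = np.column_stack([np.ones(5), [10.2, 20.5, 30.4, 41.6, 50.2]])
ys = [np.array([2.1, 4.5, 6.8, 9.4, 10.8]),
      np.array([4.8, 9.6, 13.7, 19.6, 21.4]),
      np.array([0.5, 1.1, 1.4, 2.3, 3.4])]

# One column of least-squares residuals per model -> the 5x3 E matrix.
E = np.column_stack([y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
                     for X, y in zip([X1, X2, X3], ys)])

S = E.T @ E      # 3x3 cross-product of residuals
Sigma = S / 5    # divide by the number of rows in each model

print(np.round(Sigma, 4))
```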
As you can see from the above matrix, the error terms are correlated across models. However, we would like uncorrelated errors. We achieve that by first computing the lower Cholesky root of the sigma matrix, which we call the L matrix. This would be the result:

 0.275318   0          0
 0.805645   0.162314   0
-0.187950   0.119367   0.069388
We then take the inverse of the L matrix:

  3.6322    0.0000    0.0000
-18.0282    6.1609    0.0000
 40.8517  -10.5985   14.4117
The dimensions of the X new matrix are 15x7. However, the dimensions of the L^{-1} matrix are 3x3. To create a 15x15 matrix, we compute the Kronecker product of the L inverse matrix and the 5x5 identity matrix, since each model has 5 rows. This is what the result looks like for the first 5 rows and columns:
3.6322  0       0       0       0
0       3.6322  0       0       0
0       0       3.6322  0       0
0       0       0       3.6322  0
0       0       0       0       3.6322
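The whitening step, that is, the lower Cholesky root, its inverse, and the Kronecker product with a 5x5 identity, can be sketched as follows; the sigma matrix is hardcoded from the table above:

```python
import numpy as np

# Sigma from the walkthrough (third row/column entries are negative).
Sigma = np.array([[ 0.0758,  0.2218, -0.0517],
                  [ 0.2218,  0.6754, -0.1320],
                  [-0.0517, -0.1320,  0.0544]])

L = np.linalg.cholesky(Sigma)   # lower Cholesky root
L_inv = np.linalg.inv(L)

# Kronecker product with I5 expands the 3x3 L_inv into a 15x15 transform.
W = np.kron(L_inv, np.eye(5))
print(W.shape)  # (15, 15)

# Applying W to X_new and y gives the transformed system:
#   X_star = W @ X_new;  y_star = W @ y
```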
We then create the transformed X matrix, which is the product of the above Kronecker product and the X new matrix:

   3.6322    3.6322    0         0          0          0         0
   3.6322    7.2644    0         0          0          0         0
   3.6322   10.8966    0         0          0          0         0
   3.6322   14.5288    0         0          0          0         0
   3.6322   18.1610    0         0          0          0         0
 -18.0282  -18.0282    6.1609    6.1609     1.540225   0         0
 -18.0282  -36.0564    6.1609   12.3218     2.587578   0         0
 -18.0282  -54.0846    6.1609   18.4827     5.175156   0         0
 -18.0282  -72.1128    6.1609   24.6436     8.132388   0         0
 -18.0282  -90.1410    6.1609   30.8045    10.10388    0         0
  40.8517   40.8517  -10.5985  -10.5985    -2.64963   14.4117  146.9993
  40.8517   81.7034  -10.5985  -21.1970    -4.45137   14.4117  295.4399
  40.8517  122.5551  -10.5985  -31.7955    -8.90274   14.4117  438.1157
  40.8517  163.4068  -10.5985  -42.3940   -13.9900    14.4117  599.5267
  40.8517  204.2585  -10.5985  -52.9925   -17.3815    14.4117  723.4673
We can also create the transformed y vector, which is the product of the Kronecker product and the y vector. This is what it looks like for this data:

   7.6276
  16.3449
  24.69896
  34.14268
  39.22776
  -8.2869
 -21.9823
 -38.1874
 -48.7114
 -62.8613
  42.12162
  97.93992
 152.7685
 209.4223
 263.3902
This is the point at which the Bayesian approach departs from that of the frequentist school. We start with the prior distributions of β and ∑, which requires some new terms.
The vector β̄ is the average of previous beta vectors. If there are no previous beta vectors, it is a vector of 0s.
The A matrix is known as the precision matrix. To say there is a bit of literature on this matrix is a tad of an understatement. However, in a PDF file put together by Peter E. Rossi (one of the authors of Bayesian Statistics and Marketing), I found a formulation in which the identity matrix is multiplied by a scalar of 0.05. As a twist on this, I tried a scalar equal to the number of coefficients times 0.01. Since there are 4 X variables and 3 intercepts, for a total of 7 coefficients, the scalar for the A matrix is 0.07.
The prior distribution of β follows a normal distribution with mean equal to β̄ and covariance matrix equal to A^{-1}.
The V matrix is constructed by taking the
sigma matrix constructed previously and multiplying it by the degrees of
freedom v = n – k in which k represents the number of coefficients. Since n =
15 and k = 7, we have v = 8 degrees of freedom. For this data set, this is the
V matrix:
 0.5306   1.5526  -0.3619
 1.5526   4.7278  -0.9240
-0.3619  -0.9240   0.3808
The prior distribution of ∑ follows an
inverse Wishart distribution with parameters V and v degrees of freedom.
We now combine the prior distributions and the
data to create the posterior distributions.
In the frequentist school, B = (X′X)^{-1}X′y. However, in the Bayesian school,
β̃ = (X′X + A)^{-1}(X′y + Aβ̄),
where X and y here are the transformed matrix and vector constructed above. Once β̃ is constructed, the posterior distribution of β follows a normal distribution with mean β̃ and covariance matrix (X′X + A)^{-1}. If we assume that the vector β̄ is a vector of 0s, then β̃ = (X′X + A)^{-1}X′y.
For this data set, this is β̃:

0.1569
2.274553
0.383083
4.328316
0.331196
0.24892
0.066227
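The posterior-mean formula is a ridge-style shrinkage of the least-squares solution. A minimal illustration on toy data (not the SUR data; the names and the seed are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

A = 0.07 * np.eye(3)    # precision matrix, as in the walkthrough
beta_bar = np.zeros(3)  # prior mean: a vector of 0s

# Posterior mean: (X'X + A)^{-1} (X'y + A beta_bar)
beta_tilde = np.linalg.solve(X.T @ X + A, X.T @ y + A @ beta_bar)

# With beta_bar = 0 this is (X'X + A)^{-1} X'y: a slightly shrunken OLS fit.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(beta_tilde, 3), np.round(beta_ols, 3))
```

As A goes to the zero matrix, β̃ approaches the least-squares solution; a larger A pulls the estimate toward the prior mean.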
In order to generate β, we need a vector of Z values drawn from a normal distribution with mean 0 and standard deviation 1. Since there are 7 coefficients, we need a vector of 7 Z values. Suppose the computer generates this vector:

 0.7839
 2.5079
 1.8482
-1.2640
-0.2313
-0.8057
 0.2340
We also need the lower Cholesky root of the covariance matrix (X′X + A)^{-1}. For this data, the current value of this matrix is:
0.01062   0.00452   0.031115  0.01332   0.00025   0.00732   0.000305
0.00452   0.003009  0.01327   0.008877  0.00019   0.003126  0.0002
0.031115  0.01327   0.130568  0.06583   0.052223  0.00036   0.000331
0.01332   0.008877  0.06583   0.062417  0.09222   0.003747  0.00042
0.00025   0.00019   0.052223  0.09222   0.249939  0.00112   4.36E-05
0.00732   0.003126  0.00036   0.003747  0.00112   0.026158  0.00078
0.000305  0.0002    0.000331  0.00042   4.36E-05  0.00078   3.22E-05
The lower root is:

0.103054  0         0         0         0         0         0
0.04386   0.032934  0         0         0         0         0
0.301925  0.00082   0.198514  0         0         0         0
0.12922   0.097434  0.13465   0.134514  0         0         0
0.002426  0.00264   0.259367  0.42169   0.069514  0         0
0.07106   0.000278  0.10628   0.065777  0.01119   0.073216  0
0.002959  0.00221   0.00284   0.00151   0.00186   0.00196   0.000919
Since this is a multivariate normal distribution, we multiply the lower Cholesky root by the Z vector and add β̃. This is the result:

 0.7839
 2.5079
 1.8482
-1.2640
-0.2313
-0.8057
 0.2340
When we multiply X and β and calculate the error terms, we can create the E matrix:

-1.1918    4.273625  -1.0811
-1.2997   10.37695   -2.8913
-1.5076   15.83809   -4.9079
-1.4155   23.11312   -6.6287
-2.5234   26.25113   -7.5411
After that, we once again create the S matrix, which is E′E:

   13.75365  -141.416     40.85756
 -141.416    1600.128   -463.527
   40.85756  -463.527    134.4237
The posterior distribution of ∑ follows an
inverse Wishart distribution with parameters S + V and degrees of freedom v + n
in which n is the number of rows in each model. For this data set, this is the
sum of S and V:
   14.28425  -139.863     40.49566
 -139.863    1604.856   -464.451
   40.49566  -464.451    134.8045
The degrees of freedom are 8 + 5 = 13.
To generate ∑, we need the T matrix. Since there are 3 models, the dimensions of the T matrix are 3x3. The T matrix is lower triangular: each element below the diagonal is a standard normal draw, and each diagonal element is a chi-square draw. For example, t11 is a chi-square variable with 13 degrees of freedom.
Suppose the following T matrix is generated:

7.201962   0          0
0.1135    15.26426    0
0.567206   0.759346   8.226725
We also need U which is the upper Cholesky root of the S + V matrix:
3.779451  -37.0062    10.71469
0          15.34273   -4.42824
0           0          0.624956
We then generate C = T’U:
51.86826   -2.54992    8.665058
 0         232.9976   17.83779
 0           0         67.679
Finally, ∑ = (C^{-1})(C^{-1})′:

 0.000378119    3.76071E-06   -3.72939E-05
 3.76071E-06    1.96999E-05   -1.6714E-05
-3.72939E-05   -1.6714E-05     0.000218319
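One standard way to draw from an inverse Wishart distribution is the Bartlett decomposition. The sketch below is my formulation and may differ in detail from the T and C construction above; the S + V matrix and the 13 degrees of freedom are taken from the walkthrough:

```python
import numpy as np

def inv_wishart_draw(Phi, df, rng):
    """Draw Sigma ~ Inverse-Wishart(Phi, df) via the Bartlett decomposition."""
    p = Phi.shape[0]
    # Lower Cholesky root of Phi^{-1}.
    L = np.linalg.cholesky(np.linalg.inv(Phi))
    # Bartlett factor: sqrt(chi-square) on the diagonal, N(0,1) below it.
    T = np.zeros((p, p))
    for i in range(p):
        T[i, i] = np.sqrt(rng.chisquare(df - i))
        T[i, :i] = rng.normal(size=i)
    C = L @ T          # W = C C' ~ Wishart(Phi^{-1}, df)
    W = C @ C.T
    return np.linalg.inv(W)

rng = np.random.default_rng(1)
# S + V from the walkthrough, with 13 degrees of freedom.
SV = np.array([[  14.28425, -139.863,    40.49566],
               [-139.863,   1604.856,  -464.451],
               [  40.49566, -464.451,   134.8045]])
Sigma = inv_wishart_draw(SV, 13, rng)
```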
We then use this ∑ to generate the next β, and so on. After running 50,000 iterations of the Gibbs sampler, the averages of the generated β and ∑ values are calculated; averaging is needed because of the random Z and T values generated in each iteration. This is the result for β:
Coefficient   Beta (Gibbs)   SE (Gibbs)   Beta (LS)   SE (LS)
X10              0.1073        2.4016       0.0300     0.3728
X11              2.2077        0.2182       2.2300     0.1124
X20              1.1355        0.2255       1.1418     2.1708
X21              3.8157        0.0751       3.8261     2.9894
X22              1.3808        0.4142       1.3420     8.0463
X30             -0.3544        0.2041      -0.3674     0.3177
X31              0.0686        0.0002       0.0689     0.0094
As a comparison, the coefficients and standard errors from the individual least-squares models are included. It should be noted that the standard errors for models #2 and #3 under the Gibbs sampler are considerably smaller than under least squares.
After 50,000 iterations, this is the result for ∑:

2.187177   0.62216    0.501253
0.62216    0.188484   0.13562
0.501253   0.13562    0.183619
The aforementioned standard errors are computed from the generated draws using a formula in which n represents the number of rows in each model, which, in this case, is 5.
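Putting the pieces together, the entire sampler can be sketched in a single loop. This is my code, not the author's: the inverse Wishart draw uses the standard Bartlett construction, and it runs 2,000 iterations instead of 50,000 to keep the example quick, so the averages will be close to, but not identical to, the table above:

```python
import numpy as np

rng = np.random.default_rng(42)

# --- Data: stacked y and block-diagonal X (3 models, 5 rows each) ---
X1 = np.column_stack([np.ones(5), [1, 2, 3, 4, 5]])
X2 = np.column_stack([np.ones(5), [1, 2, 3, 4, 5],
                      [0.25, 0.42, 0.84, 1.32, 1.64]])
X3 = np.column_stack([np.ones(5), [10.2, 20.5, 30.4, 41.6, 50.2]])
y = np.concatenate([[2.1, 4.5, 6.8, 9.4, 10.8],
                    [4.8, 9.6, 13.7, 19.6, 21.4],
                    [0.5, 1.1, 1.4, 2.3, 3.4]])
X = np.zeros((15, 7))
X[0:5, 0:2], X[5:10, 2:5], X[10:15, 5:7] = X1, X2, X3
m, n, k = 3, 5, 7                   # models, rows per model, coefficients

# --- Priors, as in the walkthrough ---
A = 0.07 * np.eye(k)                # precision matrix
beta_bar = np.zeros(k)              # prior mean for beta
v = 15 - k                          # prior degrees of freedom (v = n - k)

def residual_matrix(beta):
    e = y - X @ beta
    return e.reshape(m, n).T        # 5x3, one column per model

def inv_wishart(Phi, df):
    p = Phi.shape[0]
    L = np.linalg.cholesky(np.linalg.inv(Phi))
    T = np.zeros((p, p))
    for i in range(p):
        T[i, i] = np.sqrt(rng.chisquare(df - i))
        T[i, :i] = rng.normal(size=i)
    C = L @ T
    return np.linalg.inv(C @ C.T)

# --- Initialize with least squares ---
beta = np.linalg.lstsq(X, y, rcond=None)[0]
Sigma = residual_matrix(beta).T @ residual_matrix(beta) / n
V = v * Sigma                       # prior scale matrix

draws = []
for _ in range(2000):
    # Whiten the system with the current Sigma.
    W = np.kron(np.linalg.inv(np.linalg.cholesky(Sigma)), np.eye(n))
    Xs, ys = W @ X, W @ y
    # Draw beta from its normal posterior.
    cov = np.linalg.inv(Xs.T @ Xs + A)
    beta_tilde = cov @ (Xs.T @ ys + A @ beta_bar)
    beta = beta_tilde + np.linalg.cholesky(cov) @ rng.normal(size=k)
    # Draw Sigma from its inverse-Wishart posterior.
    E = residual_matrix(beta)
    Sigma = inv_wishart(E.T @ E + V, v + n)
    draws.append(beta)

beta_mean = np.mean(draws, axis=0)
print(np.round(beta_mean, 3))
```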
In regard to the sum of the squared errors, these are the results:

                          SUR     Model #1   Model #2   Model #3   Combined
Sum of squared errors   4.0339     0.3790     3.3770     0.2719     4.6119
The sum of squared errors for each of the individual models is less than that of the SUR model, which, in turn, is less than that of the model in which all of the variables are combined into one. It should be noted that when the Gibbs sampler is run for 100,000 iterations, the sum of squared errors only goes down to 4.0331, which suggests that running the extra 50,000 iterations is not worth the while.
References:
Rossi, P.E., Allenby, G.M., and McCulloch, R. Bayesian Statistics and Marketing. Chichester: John Wiley & Sons, 2005.
Zellner, A. An Introduction to Bayesian Inference in Econometrics. New York: John Wiley & Sons, 1971.