# Multiple Regression Basics

## Motivation

### Introductory Example: the trees dataset from R

```julia
using DataFrames
using GLM          # for the lm/glm model fits below
using Plots
using RDatasets
using Statistics

trees = dataset("datasets", "trees")

# volume against girth
scatter(trees.Girth, trees.Volume,
    legend=false, xlabel="Girth", ylabel="Volume")
```

Aim: Find the relationship between the response variable volume and the explanatory variable (covariate) girth. Can we predict the volume of a tree given its girth?

```julia
scatter(trees.Girth, trees.Volume,
    legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)   # a rough first guess for a straight line
```

First Guess: There is a linear relation!

## Simple Linear Regression

Main assumption: up to some error term, each measurement $$y_i$$ of the response variable depends linearly on the corresponding value $$x_i$$ of the covariate.

$$\leadsto$$ (Simple) Linear Model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,$ where $$\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$$ are independent normally distributed errors with unknown variance $$\sigma^2$$.

Aim: Find the straight line that fits best, i.e., find the optimal estimators for $$\beta_0$$ and $$\beta_1$$.

Typical choice: Least squares estimator (= maximum likelihood estimator for normal errors)

$(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2$

where $$\mathbf{y}$$ is the vector of responses, $$\mathbf{x}$$ is the vector of covariates and $$\mathbf{1}$$ is a vector of ones.

Written in matrix style:

$(\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2$

Note: There is a closed-form expression for $$(\hat \beta_0, \hat \beta_1)$$. We will not make use of it here, but rather use Julia to solve the problem.

```julia
linmod1 = lm(@formula(Volume ~ Girth), trees)
```

Interpretation of the Julia output:

- column estimate: least squares estimates $$\hat \beta_0$$ and $$\hat \beta_1$$

- column Std. Error: estimated standard deviation $$\hat s_{\hat \beta_i}$$ of the estimator $$\hat \beta_i$$

- column t value: value of the $$t$$-statistic

$t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \quad i=0,1.$

Under the hypothesis $$\beta_i=0$$, the test statistic $$t_i$$ follows a $$t$$-distribution with $$n-2$$ degrees of freedom.

- column Pr(>|t|): $$p$$-values for the hypotheses $$\beta_i=0$$ for $$i=0,1$$

Tip

The command rand(n) generates a vector of n pseudo-random numbers, uniformly distributed on $$[0,1)$$. If you want to sample from another distribution, use the Distributions package: define an object representing the distribution of interest, e.g. d = Normal(0.0, 2.0) for a normal distribution with mean 0.0 and standard deviation 2.0, and sample n times from this distribution by rand(d, n).

1. Generate $$n=20$$ covariates $$\mathbf{x}$$ randomly.
2. Given these covariates and the true parameters $$\beta_0=-3$$, $$\beta_1=2$$ and $$\sigma=0.5$$, simulate responses from a linear model (with normally distributed errors) and estimate the coefficients $$\beta_0$$ and $$\beta_1$$; one possible approach is sketched below.
3. Play with different choices of the parameters above (including the sample size $$n$$) to see the effects on the parameter estimates and the $$p$$-values.
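
A minimal sketch for steps 1 and 2, assuming the GLM and Distributions packages; the seed and the simulated data are illustrative:

```julia
using DataFrames, Distributions, GLM, Random

Random.seed!(42)                      # illustrative seed, for reproducibility
n = 20
x = rand(n)                           # covariates, uniform on [0, 1)
β0, β1, σ = -3.0, 2.0, 0.5            # true parameters
y = β0 .+ β1 .* x .+ rand(Normal(0.0, σ), n)   # simulated responses

lm(@formula(y ~ x), DataFrame(x = x, y = y))   # estimates should be close to (-3, 2)
```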

## Multiple Regression Model

Idea: Generalize the simple linear regression model to multiple covariates, e.g., predict volume using girth and height.

$$\leadsto$$ Linear Model: $y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,$ where

- $$y_i$$: $$i$$-th measurement of the response,

- $$x_{i1}$$: $$i$$-th value of the first covariate,

- $$x_{ip}$$: $$i$$-th value of the $$p$$-th covariate,

- $$\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$$: independent normally distributed errors with unknown variance $$\sigma^2$$.

Task: Find the optimal estimators for $$\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$$.

Our choice again: Least squares estimator (= maximum likelihood estimator for normal errors)

$\hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2$

where $$\mathbf{y}$$ is the vector of responses, $$\mathbf{x}_j$$ is the vector of values of the $$j$$-th covariate and $$\mathbf{1}$$ is a vector of ones.

Written in matrix style:

$\mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2$

Defining the design matrix

$\mathbf{X} = \left( \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \ldots & x_{np} \end{array}\right) \qquad (\text{size } n \times (p+1)),$

we get the short form

$\mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$
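
As a quick sanity check, the closed-form expression can be evaluated directly on the trees data (a sketch; the backslash operator solves the normal equations without forming the explicit inverse):

```julia
using DataFrames, RDatasets

trees = dataset("datasets", "trees")
X = [ones(nrow(trees)) trees.Girth trees.Height]   # design matrix with intercept column
y = trees.Volume
β̂ = (X' * X) \ (X' * y)    # solves X'X β = X'y
```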

In Julia, the GLM package fits this model with the same interface as before:
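
```julia
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
```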

The interpretation of the Julia output is similar to that for the simple linear regression model, but now we provide explicit formulas:

- parameter estimates:

$\hat \beta = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$

- estimated standard errors:

$\hat s_{\hat \beta_i} = \sqrt{\left([\mathbf{X}^\top \mathbf{X}]^{-1}\right)_{ii} \frac{1}{n-p-1} \|\mathbf{y} - \mathbf{X} \hat \beta\|^2}$

- $$t$$-statistics:

$t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p.$

- $$p$$-values:

$p\text{-value} = \mathbb{P}(|T| > |t_i|), \quad \text{where } T \sim t_{n-p-1}.$

1. Implement functions that estimate the $$\beta$$-parameters, the corresponding standard errors and the $$t$$-statistics; one possible implementation is sketched below.
2. Test your functions with the trees data set and try to reproduce the output above.
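
A minimal sketch of such a function, assuming the design matrix already contains the column of ones (the name ols and the returned named tuple are our choices):

```julia
using Distributions, LinearAlgebra

# least squares estimates, standard errors, t-statistics and two-sided p-values;
# X is the n × (p+1) design matrix including the intercept column
function ols(X, y)
    n, k = size(X)                           # k = p + 1
    β̂ = (X' * X) \ (X' * y)                 # least squares estimates
    s² = sum(abs2, y - X * β̂) / (n - k)     # unbiased estimate of σ²
    se = sqrt.(s² .* diag(inv(X' * X)))      # standard errors
    t = β̂ ./ se                             # t-statistics
    p = 2 .* ccdf.(TDist(n - k), abs.(t))    # p-values
    (coef = β̂, stderror = se, tstat = t, pval = p)
end

ols([ones(nrow(trees)) trees.Girth trees.Height], trees.Volume)
```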

Which model is the best? For linear models, one often uses the $$R^2$$ statistic. Roughly speaking, it gives the proportion (between 0 and 1) of the variance that is explained by the linear model.

```julia
r2(linmod1)
r2(linmod2)

linmod3 = lm(@formula(Volume ~ Girth + Height + Girth * Height), trees)
r2(linmod3)
```
Note

The more covariates you add, the more variance can be explained by the linear model, so $$R^2$$ never decreases. To balance the goodness-of-fit of a model against its complexity, information criteria such as the AIC (Akaike information criterion) are considered.
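
For instance, the three models above can be compared via aic (smaller values are better); this assumes the aic function from StatsBase, which GLM makes available:

```julia
aic(linmod1), aic(linmod2), aic(linmod3)
```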

## Generalized Linear Models

Classical linear model

$\mathbf{y} = \mathbf{X} \beta + \varepsilon$

implies that $\mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).$

In particular, the conditional expectation satisfies $$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta$$.

However, the assumption of a normal distribution is unrealistic for non-continuous data. Popular alternatives include:

- for count data: $\mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta)$

Here, the components are considered to be independent and the exponential function is applied componentwise.

- for binary data: $\mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})}$

Again, the components are considered to be independent and all the operations are applied componentwise.

All these models are defined by the choice of a family of distributions and a function $$g$$ (the so-called link function) such that

$\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta).$

For the models above, these are:

| Type of Data | Distribution Family | Link Function |
|--------------|---------------------|---------------|
| continuous   | Normal              | identity: $$g(x) = x$$ |
| count        | Poisson             | log: $$g(x) = \log(x)$$ |
| binary       | Bernoulli           | logit: $$g(x) = \log\left( \frac{x}{1-x} \right)$$ |

In general, the parameter vector $$\beta$$ is estimated by maximizing the likelihood, i.e.,

$\hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{i\cdot}),$

which is equivalent to the maximization of the log-likelihood, i.e.,

$\hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{i\cdot}).$

In the Gaussian case, the maximum likelihood estimator is identical to the least squares estimator considered above.

```julia
using CSV
using DataFrames
using GLM
using HTTP

http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))

# recode participation from "yes"/"no" strings to Bool
SwissLabor[!, "participation"] .= (SwissLabor.participation .== "yes")

model = glm(@formula(participation ~ age^2),
    SwissLabor, Binomial(), ProbitLink())
```
1. Reproduce the results of our data analysis of the trees data set using a generalized linear model with normal distribution family.
2. Generate $$n=20$$ random covariates $$\mathbf{x}$$ and Poisson-distributed count data with means $$\exp(\beta_0 + \beta_1 x_i)$$. Re-estimate the parameters with a generalized linear model; a sketch for both tasks follows the list.
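
A minimal sketch for both tasks, assuming the trees data from above and GLM's defaults (the log link is the default for the Poisson family; the parameter values are illustrative):

```julia
using DataFrames, Distributions, GLM, Random

# task 1: a normal family with identity link reproduces the linear model fit
glm(@formula(Volume ~ Girth + Height), trees, Normal(), IdentityLink())

# task 2: simulate Poisson counts with mean exp(β0 + β1 xᵢ) and re-estimate
Random.seed!(1)
n = 20
x = rand(n)
β0, β1 = 0.5, 1.5                           # illustrative true parameters
y = rand.(Poisson.(exp.(β0 .+ β1 .* x)))    # simulated counts

glm(@formula(y ~ x), DataFrame(x = x, y = y), Poisson())   # LogLink() is the default
```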
## Linear Mixed Models

In the linear regression models so far, we assumed that the response variable $$\mathbf{y}$$ depends on the design matrix $$\mathbf{X}$$ of covariates, which are assumed to be given/fixed, through the term $$\mathbf{X}\beta$$ with the so-called fixed-effects coefficients $$\beta$$, plus independent errors $$\varepsilon$$. In many situations, however, there are also random effects acting on several components of the response variable. These can be included in the model by adding another design matrix $$\mathbf{Z}$$ multiplied by a random vector $$u$$, the so-called random-effects coefficients, which are assumed to be jointly normally distributed with mean vector $$0$$ and variance-covariance matrix $$\Sigma$$ (typically not a diagonal matrix). In matrix notation, the model takes the form

$\mathbf{y} = \mathbf{X} \beta + \mathbf{Z}u + \varepsilon$

Maximizing the likelihood, we can estimate $$\beta$$ and optimally predict the random vector $$u$$.
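
A minimal sketch of fitting such a model, assuming the MixedModels package (not introduced above) and simulated data with one random intercept per group:

```julia
using DataFrames, MixedModels, Random

Random.seed!(3)
n, ngroups = 20, 5
g = repeat(1:ngroups, inner = n ÷ ngroups)   # group label for each observation
u = randn(ngroups)                            # random intercepts, one per group
x = rand(n)
y = 1.0 .+ 2.0 .* x .+ u[g] .+ 0.1 .* randn(n)

df = DataFrame(x = x, y = y, g = string.(g))
fit(MixedModel, @formula(y ~ 1 + x + (1 | g)), df)   # random intercept for each group
```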