Parametric vs Nonparametric Regression
Parametric regression
If we assume a particular form of the regressor:
Goal: to learn the parameters which yield the minimum error/loss
Non-parametric regression
If no specific form of regressor is assumed:
Goal: to learn the predictor directly from the input data that yields the minimum error/loss
Linear Regression
Want to find a linear predictor \( f \), i.e., \( \vec{w} \) (the intercept \( w_0 \) is absorbed by appending a constant feature, i.e., lifting):
\[\hat{f}(\vec{x}) := \vec{w} \cdot \vec{x}\]which minimizes the prediction loss over the population:
\[\min_{\vec{w}} \mathbb{E}_{\vec{x}, y} \left[ L(\hat{f}(\vec{x}), y) \right]\]We estimate the parameters by minimizing the corresponding loss on the training data:
\[\arg \min_{\vec{w}} \frac{1}{n} \sum_{i=1}^n \left[ L(\vec{w} \cdot \vec{x}_i, y_i) \right] = \arg \min_{\vec{w}} \frac{1}{n} \sum_{i=1}^n \left( \vec{w} \cdot \vec{x}_i - y_i \right)^2\]
Learning the Parameters
Unconstrained problem:
\[= \arg \min_{\vec{w}} \left\| \begin{pmatrix} \cdots & \vec{x}_1 & \cdots \\ & \vdots & \\ \cdots & \vec{x}_i & \cdots \\ & \vdots & \\ \cdots & \vec{x}_n & \cdots \end{pmatrix} \vec{w} - \begin{pmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix} \right\|_2^2\] \[= \arg \min_{\vec{w}} \| X \vec{w} - \vec{y} \|_2^2\]Thus, to find the best-fitting \(\vec{w}\), take the gradient:
\[\frac{\partial}{\partial \mathbf{w}} \|X\mathbf{w} - \mathbf{y}\|^2 = 2X^T (X\mathbf{w} - \mathbf{y})\]At stationarity, this results in the equation:
\[(X^T X)\mathbf{w} = X^T \mathbf{y}\]This system is always consistent:
\[\mathbf{\vec{y}} = \mathbf{\vec{y}}_{\text{col}(X)} + \mathbf{\vec{y}}_{\text{null}(X^T)}\]Thus,
\[X^T \mathbf{y} = X^T \mathbf{\vec{y}}_{\text{col}(X)}\]Since,
\[{\vec{y}}_{\text{col}(X)} = \sum_i w_i \mathbf{x}_i \quad \text{(for some coefficients \(w_i\), where \(\mathbf{x}_i\) are columns of \(X\))}\]Define,
\[{\vec{w}} := \begin{bmatrix} w_1 \\ \vdots \\ w_d \end{bmatrix}\]Then,
\[(X^T X)\mathbf{\vec{w}} = X^T (X\mathbf{\vec{w}}) = X^T \left(\sum_i w_i \mathbf{x}_i \right) = X^T \mathbf{y}_{\text{col}(X)} = X^T \mathbf{y}\] \[\mathbf{\vec{w}}_{\text{ols}} = (X^T X)^{\dagger} X^T \mathbf{y}\]This is also called the Ordinary Least Squares (OLS) solution. It is unique and stable when $(X^T X)$ is invertible.
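As a concrete check, here is a minimal NumPy sketch of the OLS closed form; the synthetic data, sizes, and noise level are illustrative assumptions, not part of the derivation.

```python
import numpy as np

# Sketch: solve the normal equations (X^T X) w = X^T y via the pseudo-inverse.
rng = np.random.default_rng(0)
n, d = 100, 3                               # illustrative sizes
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])         # made-up ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=n)   # noisy observations

# w_ols = (X^T X)^+ X^T y  (the pseudo-inverse also covers the rank-deficient case)
w_ols = np.linalg.pinv(X.T @ X) @ X.T @ y

# Equivalent, and numerically preferable in practice:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_ols, w_lstsq)   # both recover w_true up to noise
```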
Linear Regression in the Statistical Modeling View
Let’s assume that data is generated from the following process:
- An example $x_i$ is drawn independently from the data space $X$: \(x_i \sim D_X\)
- $y_i^{clean}$ is computed as $(w \cdot x_i)$ from a fixed unknown $w$: \(y_i^{clean} := w \cdot x_i\)
- $y_i$ is obtained from $y_i^{clean}$ by adding independent Gaussian noise $N(0,\sigma^2)$: \(y_i := y_i^{clean} + \epsilon_i = w \cdot x_i + \epsilon_i \quad\quad \epsilon_i \sim N(0, \sigma^2)\)
- $(x_i, y_i)$ is revealed as the $i$-th sample \((x_1, y_1), ..., (x_n, y_n) =: S\)
How can we determine $w$ from Gaussian-noise-corrupted observations?
\[S = (x_1, y_1), ..., (x_n, y_n)\]Observation:
\[y_i \sim w \cdot x_i + N(0, \sigma^2) = N(w \cdot x_i, \sigma^2)\]The parameter is $w$ (terms independent of $w$ are ignored below):
\[\log L(w|S) = \sum_{i=1}^n \log p(y_i|w)\] \[\propto \sum_{i=1}^n \frac{-(w \cdot x_i - y_i)^2}{2\sigma^2}\]Maximizing this with respect to $w$ yields the same OLS result!
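To see this equivalence concretely, the sketch below runs gradient ascent on the Gaussian log-likelihood and compares the result to the OLS closed form; the toy data, noise level, and step size are assumptions for illustration.

```python
import numpy as np

# Maximizing log L(w|S) ∝ -sum_i (w·x_i - y_i)^2 / (2 sigma^2) is the same as
# minimizing the sum of squared residuals, so gradient ascent recovers w_ols.
rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.normal(size=(n, d))
w_star = np.array([2.0, -1.0])                    # the fixed unknown w of the model
y = X @ w_star + rng.normal(scale=0.5, size=n)    # Gaussian-noise-corrupted labels

w = np.zeros(d)
lr, sigma2 = 1e-3, 0.25
for _ in range(5000):
    # gradient of the log-likelihood in w is -X^T (Xw - y) / sigma^2
    w += lr * (-X.T @ (X @ w - y)) / sigma2

w_ols = np.linalg.pinv(X.T @ X) @ X.T @ y
print(np.allclose(w, w_ols, atol=1e-6))           # True: the MLE coincides with OLS
```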
Lasso Regression & Ridge Regression
Previously we looked at Ordinary Least Squares (OLS):
\[\text{minimize } \| X \vec{w} - \vec{y} \|_2^2\] \[\mathbf{\vec{w}}_{\text{ols}} = (X^T X)^{\dagger} X^T \mathbf{y}\]This is poorly behaved (prone to overfitting) when we have limited data. We can incorporate application-dependent prior knowledge via regularization.
Ridge regression:
Objective
minimize $\| X\vec{w} - \vec{y} \|^2 + \lambda\|\vec{w}\|^2$
The first term is the reconstruction error; $\lambda$ is the ‘regularization’ parameter.
$\vec{w}_{ridge} = (X^TX + \lambda I)^{-1}X^T\vec{y}$
The ‘regularization’ helps avoid overfitting and always results in a unique solution (since $X^TX + \lambda I$ is invertible for $\lambda > 0$). It is equivalent to the following constrained optimization problem:
minimize $\| X\vec{w} - \vec{y} \|^2$
such that $\|\vec{w}\|^2 \leq B$
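A minimal sketch of the ridge closed form, assuming made-up toy data and an arbitrary $\lambda$; note it stays well defined even when $X^TX$ is singular (here $d > n$).

```python
import numpy as np

# Ridge regression sketch: w_ridge = (X^T X + lambda I)^{-1} X^T y.
rng = np.random.default_rng(2)
n, d = 10, 30                                # fewer samples than features
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 1.0                                    # illustrative regularization strength
# X^T X is singular here (rank <= 10), but adding lambda*I makes the system
# invertible for any lambda > 0, so the solution is always unique.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_ridge.shape)                         # (30,)
```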
Lasso regression:
Objective
minimize $\| X\vec{w} - \vec{y} \|^2 + \lambda\|\vec{w}\|_1$
$\lambda\|\vec{w}\|_1$ is the ‘lasso’ penalty.
$\vec{w}_{lasso} = ?$ There is no closed-form solution.
Lasso regularization encourages sparse solutions. It is equivalent to the following constrained optimization problem:
minimize $\| X\vec{w} - \vec{y} \|^2$
such that $\|\vec{w}\|_1 \leq B$
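Since there is no closed form, one standard iterative approach (an assumption here, not something prescribed above) is proximal gradient descent, a.k.a. ISTA: alternate a gradient step on the squared error with soft-thresholding, the proximal operator of the $\ell_1$ penalty. The soft-thresholding step is what produces exact zeros, i.e., sparsity.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iters=1000):
    """Rough ISTA sketch for: minimize ||Xw - y||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    step = 0.5 / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz const of the gradient
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ w - y)              # gradient of the squared error
        z = w - step * grad                       # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 20))
w_sparse = np.zeros(20)
w_sparse[:3] = [3.0, -2.0, 1.5]                   # only 3 active features
y = X @ w_sparse + 0.1 * rng.normal(size=50)
print(np.round(lasso_ista(X, y, lam=5.0), 2))     # most entries shrink to exactly 0
```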
Logistic Regression
Linear regression for classification:
Although its name contains “regression,” it is actually used for classification tasks. For a binary classification problem: given input $x$, how likely is it to have label 1? Denote this probability by $P$, i.e., $P := P(Y=1 | X=x)$ is the chance that for a given $x$ the associated label is $y = 1$. Since $P$ ranges between 0 and 1, it cannot be modeled appropriately via linear regression. Instead, look at the ‘odds’ of getting $y=1$ (for a given $x$): $odds(P) := \frac{P}{1-P}$
For an event with $P=0.9$, odds $= 9$; but for an event with $P=0.1$, odds $\approx 0.11$. The odds are very asymmetric, so consider the ‘log’ of the odds:
$log(odds(P)) := logit(P) := log(\frac{P}{1-P})$
$logit(P) = -logit(1-P)$, which is symmetric! So we can model the logit as a linear function.
Model the log-odds (logit) with a linear function. Given an input $x$:
$logit(P(Y=1 | X=x)) = logit(P) = log(\frac{P}{1-P}) \stackrel{\text{modeling assumption}}{=} w \cdot x$
$\frac{P}{1-P} = e^{w \cdot x}$
$P(Y=1 | X=x) = P = \frac{e^{w \cdot x}}{1+e^{w \cdot x}} = \frac{1}{1+e^{-w \cdot x}}$ Sigmoid!
How do we learn the parameters?
Given samples $S = (x_1, y_1), \ldots, (x_n, y_n)$ (each $y_i$ is binary):
$\mathcal{L}(w | S) = \prod_{i=1}^{n} P((x_i, y_i) | w) \propto \prod_{i=1}^{n} P(y_i | x_i, w) = \prod_{i=1}^{n} P(y_i = 1 | x_i, w)^{y_i} (1 - P(y_i = 1 | x_i, w))^{1-y_i}$ (Bernoulli MLE)
Writing $P_{x_i} := P(y_i = 1 | x_i, w)$: $log \mathcal{L}(w | S) \propto \sum_{i=1}^{n} y_i log P_{x_i} + (1 - y_i) log(1 - P_{x_i}) = \sum_{i=1}^{n} y_i log \frac{P_{x_i}}{1 - P_{x_i}} + \sum_{i=1}^{n} log(1 - P_{x_i})$
Now, use logistic model!
$= \sum_{i=1}^{n} y_i (w \cdot x_i) - \sum_{i=1}^{n} log(1 + e^{w \cdot x_i})$
There is no closed-form solution, but we can use iterative methods like gradient ‘ascent’ to find it.
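A minimal gradient-ascent sketch for the logistic log-likelihood; the learning rate, iteration count, and synthetic data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=n) < sigmoid(X @ w_true)).astype(float)   # binary labels

# Gradient of log L(w) = sum_i [ y_i (w·x_i) - log(1 + e^{w·x_i}) ]
# is X^T (y - sigmoid(Xw)); take small steps uphill.
w = np.zeros(d)
lr = 0.01
for _ in range(2000):
    w += lr * X.T @ (y - sigmoid(X @ w))

print(w)   # close to w_true, up to sampling noise
```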
Non-parametric Regression
What if we don’t know the parametric form of the relationship between the independent and dependent variables? How can we predict the value of a new test point $x$ without model assumptions?
Kernel Regression
$y = f_n(x) := \sum_{i=1}^{n} w_i(x)y_i$
Want weights that emphasize local observations
Localization functions:
Gaussian kernel: \(K_h(x, x') = e^{-\|x-x'\|^2/h}\)
Box kernel: \(K_h(x, x') = 1[\|x - x'\| \leq h]\)
Triangle kernel: \(K_h(x, x') = [1 - (1/h)\|x - x'\|]_+\)
Then define: \[w_i(x) := \frac{K_h(x,x_i)}{\sum_{j=1}^n K_h(x,x_j)}\]
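Putting the pieces together, a short kernel-regression sketch with the Gaussian kernel; the bandwidth $h$ and the 1-D toy function are arbitrary choices for illustration.

```python
import numpy as np

def kernel_regression(x_query, X_train, y_train, h=0.1):
    """Predict f_n(x) = sum_i w_i(x) y_i with Gaussian-kernel weights."""
    sq_dists = np.sum((X_train - x_query) ** 2, axis=1)   # ||x - x_i||^2
    k = np.exp(-sq_dists / h)                             # K_h(x, x_i)
    w = k / k.sum()                                       # normalized local weights w_i(x)
    return w @ y_train

rng = np.random.default_rng(5)
X_train = rng.uniform(0, 2 * np.pi, size=(100, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=100)

x0 = np.array([np.pi / 2])
print(kernel_regression(x0, X_train, y_train))   # roughly sin(pi/2) = 1
```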