
I am trying to derive the derivative of the least-squares loss function. If I have this (I am using ' to denote the transpose, as in MATLAB)

(y-Xw)'(y-Xw)

and I expand it

= (y' - w'X')(y - Xw)
= y'y - y'Xw - w'X'y + w'X'Xw
= y'y - y'Xw - y'Xw + w'X'Xw     (w'X'y is a scalar, so it equals its transpose y'Xw)
= y'y - 2y'Xw + w'X'Xw
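
As a quick numerical sanity check of this expansion (just a sketch, assuming NumPy and arbitrary made-up sizes):

```python
import numpy as np

# Check that (y - Xw)'(y - Xw) == y'y - 2y'Xw + w'X'Xw on random data.
rng = np.random.default_rng(0)
m, n = 5, 3                       # arbitrary sizes
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
w = rng.standard_normal(n)

r = y - X @ w
lhs = r @ r                                        # (y - Xw)'(y - Xw)
rhs = y @ y - 2 * (y @ X @ w) + w @ (X.T @ X) @ w  # expanded form
print(np.isclose(lhs, rhs))  # True
```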

Now I take the gradient with respect to w, term by term (using that the gradient of a'w is a, and that the gradient of w'Aw is (A + A')w):

= -2X'y + X'(Xw) + (X'X)'w
= -2X'y + X'(Xw) + X'(Xw)
= -2X'y + 2X'(Xw)
= 2X'(Xw - y)

which is the intended result.
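
As a sanity check on this gradient, here is a small finite-difference comparison (again just a sketch, assuming NumPy and made-up sizes):

```python
import numpy as np

# Compare the closed-form gradient 2 X'(Xw - y) with central finite differences.
rng = np.random.default_rng(1)
m, n = 6, 4                       # arbitrary sizes
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
w = rng.standard_normal(n)

def loss(w):
    r = y - X @ w
    return r @ r                  # (y - Xw)'(y - Xw)

grad = 2 * X.T @ (X @ w - y)      # the result derived above

eps = 1e-6
numeric = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
print(np.allclose(grad, numeric, atol=1e-4))  # True
```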

Now, I saw in this post Vector derivation of $x^Tx$ that the gradient of x'x is 2x, so I am trying to get the same result by applying that, together with the chain rule, to compute the gradient of

(y-Xw)'(y-Xw)

So I think this might be

=2(y-Xw)(-X)
=-2yX + 2XwX

The result is similar, but the transpositions are missing, so it does not work... What am I doing wrong? My mathematical background has almost disappeared and I have only just begun to recover it, so please be patient if I did something terribly wrong.

Sfp

1 Answer


Here is a piece of background information that we must be clear on at the beginning. If $F:\mathbb R^p \to \mathbb R^q$ is differentiable at a point $z$, then $F'(z)$ is a $q \times p$ matrix.


I assume $X$ is a real $m \times n$ matrix and $y$ is an $m \times 1$ column vector. Let $g:\mathbb R^m \to \mathbb R$ be defined by $g(u) = u^T u$. Note carefully that for any $u \in \mathbb R^m$, $g'(u) = 2 u^T$ is a $1 \times m$ matrix.

Define $h:\mathbb R^n \to \mathbb R^m$ by $h(w) = y - X w$, and note that $h'(w) = - X$.

Now let $f:\mathbb R^n \to \mathbb R$ be defined by $$ f(w) = g(h(w)) = (y - X w)^T(y - X w). $$ The chain rule tells us that $$ \underbrace{f'(w)}_{1 \times n} = \underbrace{g'(h(w))}_{1 \times m} \underbrace{h'(w)}_{m \times n}. $$

With our particular choices of $g$ and $h$, we have $$ f'(w) = 2 (y - X w)^T(-X). $$ If we use the convention that the gradient is a column vector, then we have $$ \nabla f(w) = f'(w)^T = 2 X^T (X w - y). $$
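
For concreteness, here is a quick numerical check of the chain-rule factorization (a sketch assuming NumPy; the sizes are arbitrary) that forms $g'(h(w))\,h'(w)$ explicitly and compares it with $2X^T(Xw - y)$:

```python
import numpy as np

# Verify f'(w) = g'(h(w)) h'(w) and the column-vector gradient 2 X'(Xw - y).
rng = np.random.default_rng(2)
m, n = 5, 3                          # arbitrary sizes
X = rng.standard_normal((m, n))
y = rng.standard_normal((m, 1))
w = rng.standard_normal((n, 1))

g_prime = 2 * (y - X @ w).T          # g'(h(w)), a 1 x m matrix
h_prime = -X                         # h'(w),    an m x n matrix
f_prime = g_prime @ h_prime          # f'(w),    a 1 x n matrix

grad = 2 * X.T @ (X @ w - y)         # gradient as an n x 1 column vector
print(f_prime.shape, grad.shape)     # (1, 3) (3, 1)
print(np.allclose(f_prime.T, grad))  # True
```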

littleO
  • All the assumptions in the answer were perfect. I have started some machine learning courses, and they always use the same names for the variables, so I wrongly thought the variable names were some kind of standard in least squares. In my context, X has m rows, which are the training examples with n features, and y is a vector of m labels. So we want to find a vector w that multiplies X and predicts the labels y. Very helpful answer, many thanks, Dr. O'Connor. – Sfp Jun 08 '18 at 04:10