Matrix Calculus. So I stumbled upon something like this: https://www.dropbox.com/s/rc8qm4g7gbqat7i/Screenshot%202016-03-20%2021.06.05.png?dl=0 He is trying to take the derivative of the norm of a vector. My question is: is it common practice to use "rules of thumb" to come up with this derivative, without taking the derivative of each individual element of the vector? (i.e. in 'normal' calculus a rule of thumb would be the 'power rule' or 'quotient rule'.) So is there something like a set of rules of thumb that we use when we are trying to find the derivative of a vector with respect to a vector?
I'm not entirely sure what this notation means, but it's pretty common and useful to take a derivative with respect to a vector. It usually pops up as the directional derivative, which can be written as the dot product of a unit vector in the direction you want with the gradient of a scalar field, although it's a much more general concept than that.
Soooo before I can help you: what are X, w, y, N, and \(E_{in}\), and what is the significance of your \(\nabla\) operator? Are these matrices / vectors / scalars, and what specifically is the dependence of the stuff we're taking derivatives of?
the variable here is the vector w
E_in(w) is just like saying f(w) ---> again, w is a vector since it is a bolded lower case letter
The only scalar is N
bolded lower case letters are vectors, unbolded letters are scalars, and capital bolded letters are matrices
now I don't understand what you are asking when you say the significance of the gradient operator
basically we are trying to find d/dw
where w is a vector
Nah, it's clear to me now what you're doing after you cleared up the rest. Give me a sec
oki
Sidenote: w and y are 'column' vectors
Alright, so I know this in tensor notation, so I'm not quite sure if my way of doing it is cheating, but it ends up looking the same. First off you can write out: \[E_{in}(w) = \frac{1}{N} (Xw-y)^\top (Xw-y) \] Then when you take the derivative, apply the product rule. If you think about the entries of a matrix or vector, the derivative of the transpose is the same as the transpose of the derivative, so there's no issue there. When I do this with tensors I do it component-wise; however, here it works just as well to treat w as if it were just a scalar from normal calculus: \[\nabla E_{in}(w) = \frac{1}{N} [ X^\top(Xw-y) + (Xw-y)^\top X] \] So I guess now (I'm not sure how to show this in matrix notation without turning it into component form, oh well) we should somehow acknowledge that these are equal: \[X^\top(Xw-y) = (Xw-y)^\top X\] In summation notation you can differentiate with respect to an individual component of w, and then it becomes clear that what you are producing is a vector with components: \[\nabla E_{in}(w) = \frac{2}{N} X^\top(Xw-y) \] I guess that's the quick answer, but we can go more in depth if you want to focus on any details, cause I know this is pretty vague sounding right now. I'm just too lazy to keep track of the indices with summation notation on the matrices and vectors right now, so I avoided it heh
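If you want a quick numerical sanity check on that final formula, here's a sketch (not part of the original problem; it assumes numpy and uses made-up shapes) comparing the closed-form gradient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 3                      # made-up sizes
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)
w = rng.standard_normal(d)

def E(w):
    # E_in(w) = (1/N) (Xw - y)^T (Xw - y)
    r = X @ w - y
    return r @ r / N

# closed-form gradient from the derivation above
grad = (2 / N) * X.T @ (X @ w - y)

# central finite differences, one component of w at a time
eps = 1e-6
fd = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps)
               for e in np.eye(d)])

print(np.allclose(grad, fd))  # should print True
```

If the two agree to within numerical error, the algebra above is almost certainly right.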
where does the 2 come from
ah I see so you used the equality to have 2(Xw - y)^T
The product rule produces two identical terms. Our original \(E(w)\) is a scalar value, so when we differentiate it, all of the indices line up into the same component of our vector; that's the best way I can say it without writing out the computation (which I can show you if you really want). So here I copy pasted and am showing where the 2 comes from: \[\nabla E_{in}(w) = \frac{1}{N} [ X^\top(Xw-y) + (Xw-y)^\top X] \] By that scalar argument I gave at the beginning of this post, this thing is equal to its transpose: \[X^\top(Xw-y) = (Xw-y)^\top X\] Plug this into the right part of the equation there to get: \[\nabla E_{in}(w) = \frac{1}{N} [ X^\top(Xw-y) + X^\top(Xw-y)] \] They're the same so, 2: \[\nabla E_{in}(w) = \frac{2}{N} X^\top(Xw-y) \]
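One thing worth seeing concretely: \(X^\top(Xw-y)\) is a column and \((Xw-y)^\top X\) is a row, so "equal" here really means they hold the same components, one being the transpose of the other. A tiny numpy sketch (shapes made up) shows it:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))   # made-up shapes
w = rng.standard_normal((3, 1))   # column vector, like in the thread
y = rng.standard_normal((5, 1))

r = X @ w - y                     # residual, shape (5, 1)
left = X.T @ r                    # shape (3, 1): a column
right = r.T @ X                   # shape (1, 3): a row
print(np.allclose(left, right.T)) # True: same components, transposed layout
```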
yeah haha
the only thing I am struggling a bit to understand is this: https://www.dropbox.com/s/xjblntttrb91hfe/Screenshot%202016-03-21%2010.59.45.png?dl=0
would it be easy for you to show me the computation please @Kainui?
the derivative of the transpose is the same as the transpose of the derivative ---> is this a general matrix calculus rule?
Yeah I can show you, I just figured out some tricks to make it go easier too. Yeah, so here's how you can visualize it to get the intuition that it's true, check this out: \[\frac{d}{dt} \begin{pmatrix} f(t) & g(t) \end{pmatrix} = \begin{pmatrix} f'(t) \\ g'(t) \end{pmatrix} ^\top\] So on the left I transposed first and I'm differentiating second, and on the right I have already differentiated and I'm about to transpose. You can see they'll get the same result. Give me a sec and I'll show the calculation
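If it helps, you can check the same thing symbolically; here's a small sketch using sympy (sin, t^2, and exp are just arbitrary placeholder functions):

```python
import sympy as sp

t = sp.symbols('t')
# a column vector of arbitrary placeholder functions of t
v = sp.Matrix([sp.sin(t), t**2, sp.exp(t)])

lhs = v.T.diff(t)   # transpose first, then differentiate
rhs = v.diff(t).T   # differentiate first, then transpose

print(lhs == rhs)   # True: the two operations commute
```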
same question goes for --> a^T b = b^T a
ok
Yeah, don't think of the w in the expression as the same w that you are differentiating with respect to. Before anything else, I think we should clarify this first: the derivative of a vector with respect to itself is the identity matrix: \[\frac{dw}{dw} = I\] Why should this be? Well if you look at the individual entries of the matrix, you should think of them as: \[\frac{d w_i}{d w_j}\] so if \(i=j\) then you are taking the derivative of a variable with respect to itself, so for the same reason that \(\frac{dx}{dx} = 1\) and \(\frac{dy}{dy}=1\), we also have \(\frac{dw_1}{d w_1} = 1\). Since the components of our vector are independent though, that means \(\frac{dw_1}{dw_2} = 0\). You can also write the identity matrix in index form as the Kronecker delta if you're familiar with it: \[\frac{d w_i}{d w_j} = \delta_{ij}\] Hopefully this makes sense; then we can talk about differentiating the rest of the thing, but I think I should have started out by saying this first.
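A quick numerical way to see \(\frac{dw}{dw} = I\) is to estimate the Jacobian of the identity map by finite differences. This is just a sketch assuming numpy; fd_jacobian is a helper name I made up:

```python
import numpy as np

def fd_jacobian(f, w, eps=1e-6):
    # J[i, j] ~ d f_i / d w_j, estimated by central differences
    cols = [(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
            for e in np.eye(w.size)]
    return np.stack(cols, axis=1)

w = np.random.default_rng(2).standard_normal(5)
J = fd_jacobian(lambda w: w, w)    # derivative of w with respect to itself
print(np.allclose(J, np.eye(5)))   # True: dw_i/dw_j = delta_ij
```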
I see
Since this whole function represented by \(E_{in}(w)\) is a scalar, it is kind of silly to state but pretty important to say from the start that it is equal to its own transpose: \[E_{in}(w) = E_{in}(w)^\top\] Now let's write it out in full; I'll just write that thing as E cause I'm lazy: \[E = \frac{1}{N} (X w -y)^\top (Xw -y) \] We can distribute this all out to get: \[E = \frac{1}{N} [ w^\top X^\top X w -w^\top X^\top y - y^\top X w + y^\top y ]\] Ok, make sure you can do all of this before I continue on; I don't want you to get lost.
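If you want to double-check that expansion, you can evaluate both forms on random data (just a numpy sketch with made-up shapes):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 6, 3                       # made-up sizes
X = rng.standard_normal((N, d))
w = rng.standard_normal((d, 1))
y = rng.standard_normal((N, 1))

lhs = (X @ w - y).T @ (X @ w - y) / N
rhs = (w.T @ X.T @ X @ w - w.T @ X.T @ y - y.T @ X @ w + y.T @ y) / N
print(np.allclose(lhs, rhs))      # True: the expansion checks out
```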
yea so far so good
(by the way, in case my laptop battery runs out (5%) I'll return here asap)
ok cool, alright so now since all of these terms are scalars, each of them is equal to its own transpose. Specifically I want to change around this term now just to get it into the form I want later: \[ -w^\top X^\top y = - y^\top X w \] \[E = \frac{1}{N} [ w^\top X^\top X w -w^\top X^\top y - y^\top X w + y^\top y ]\] Plug that in: \[E = \frac{1}{N} [ w^\top X^\top X w -2w^\top X^\top y + y^\top y ]\] Now let's also recognize that \(w^\top X^\top X w\) is also equal to its own transpose; we'll use this in a second when taking the derivative. So let's take the derivative already! \[\nabla E = \frac{1}{N} [ \nabla (w^\top X^\top X w) -2\nabla (w^\top X^\top y )+ \nabla (y^\top y) ]\] Let's take care of the last two terms immediately. Because y doesn't depend on w, the derivative of \(y^\top y\) is 0. For the middle term I'll expand out the product rule; even though \(X^\top y\) is really a constant, you can see it anyways: \[\nabla (w^\top X^\top y )=\nabla (w^\top )X^\top y +w^\top\nabla ( X^\top y ) = I^\top X^\top y + 0 = X^\top y\] Lots going on here, so take a minute to make sure it all makes sense haha. Alright, so really we have part of our thing worked out: \[\nabla E = \frac{1}{N} [ \nabla (w^\top X^\top X w) -2X^\top y]\] This term we work similarly, although we gotta be a little trickier with it: \[ \nabla (w^\top X^\top X w) = \nabla (w^\top )X^\top X w +w^\top X^\top X \nabla ( w) \] Here we're gonna use the transpose property with the derivative quite a bit; the first piece after doing the product rule just turns into something simple: \[ \nabla (w^\top X^\top X w) = X^\top X w +w^\top X^\top X \nabla ( w) \] What about that last piece? Well, derivative and transpose still don't affect each other, so we can safely transpose this: \[(w^\top X^\top X \nabla ( w))^\top = (\nabla w)^\top X^\top X w = X^\top X w\] Ok, now we can plug it in since it looks like how we want: \[\nabla E = \frac{1}{N} [2X^\top X w -2X^\top y]\] Factor this out: \[\nabla E = \frac{2}{N} X^\top [Xw -y]\]
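You can also verify the two product-rule pieces numerically. This sketch (assuming numpy; fd_grad is a made-up helper) checks that \(\nabla(w^\top X^\top y) = X^\top y\) and \(\nabla(w^\top X^\top X w) = 2X^\top X w\):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 7, 4                       # made-up sizes
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)
w = rng.standard_normal(d)

def fd_grad(f, w, eps=1e-6):
    # central-difference gradient of a scalar function f at w
    return np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                     for e in np.eye(w.size)])

# grad of w^T X^T y should be X^T y
print(np.allclose(fd_grad(lambda w: w @ X.T @ y, w), X.T @ y))

# grad of w^T X^T X w should be 2 X^T X w
print(np.allclose(fd_grad(lambda w: w @ X.T @ X @ w, w), 2 * X.T @ X @ w))
```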
This is what it looks like if you do the same calculation in tensor notation btw, using Einstein summation notation: \[E = \frac{1}{N} (A_{ij} w_j - y_i)(A_{ik}w_k - y_i)\] \[\frac{\partial E}{\partial w_l} =\frac{1}{N} [A_{ij} \delta_{jl}(A_{ik}w_k - y_i)+(A_{ij} w_j - y_i)A_{ik}\delta_{kl}] \] \[\frac{\partial E}{\partial w_l} =\frac{1}{N} [A_{il} (A_{ik}w_k - y_i)+(A_{ij} w_j - y_i)A_{il}] \] \[\frac{\partial E}{\partial w_l} =\frac{2}{N}A_{il} (A_{ik}w_k - y_i) \] Things get substantially easier I think; I didn't have to worry so much about transposing things haha.
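The index form maps almost one-to-one onto np.einsum if you want to play with it; here's a minimal sketch (random made-up shapes, nothing from the original problem):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 6, 3                       # made-up sizes
A = rng.standard_normal((N, d))   # the matrix X from before, called A here
w = rng.standard_normal(d)
y = rng.standard_normal(N)

r = np.einsum('ik,k->i', A, w) - y            # r_i = A_{ik} w_k - y_i
grad = (2 / N) * np.einsum('il,i->l', A, r)   # (2/N) A_{il} r_i

# same answer as the matrix form (2/N) X^T (Xw - y)
print(np.allclose(grad, (2 / N) * A.T @ (A @ w - y)))
```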