
Multivariate Calculus

Notation problem

Different people contributed to calculus over time, and each chose their own notation

Some notations are better suited to the applications they were invented for

image-20200203182208329

 

Functions

 


 

Context

 

Modeling the world

 

Calculus

 

Rise over run - speed vs time


image-20200203183016487

Acceleration is the local gradient of a speed-time graph

It is itself a function of time (in this example)

 

image-20200203183221507

Drawing the tangent at a point shows the slope at that point

The slope at each point can then be plotted to show acceleration versus time instead of speed versus time

 


 

Constant speed = zero gradient = zero acceleration over time

 

image-20200203183419685

Orange line = acceleration, measured in distance per time squared

This is the first derivative of the speed curve

 

image-20200203183542464

The second derivative can be obtained by plotting the slope change of the acceleration function.

It is related to the car starting and stopping

 

 

Asking which curve has a given function as its derivative is anti-differentiation, or integration

 

image-20200203183915205

 

This graph shows the distance covered by the car; how much distance is covered per unit time is just the speed

 

Derivatives - definition

Rise over run of a linear function

 

The slope of that line is the rise (the growth between two points) divided by the run (the length of the considered interval) - this is called rise over run

image-20200203192326580

 

Rise over run for more complex functions

image-20200203192409289

 

The rise over run changes depending on the chosen points

 

What is the rise over run at a single point?

 

image-20200203192506815

 

image-20200203192538341

 

Define the second point at a small distance Δx from the first one - we can then use x + Δx to determine its position

 

Now, writing the rise over run as a function of x:

 

image-20200203192749351

 

image-20200203192807182


With smaller values of Δx, the result becomes a better approximation of the gradient

 

image-20200203192930481

 

image-20200203192954206

 

This concept can be expressed with limits - we want to know the gradient for a Δx as small as possible / needed

 

image-20200203193052130

 

Then, we can get the slope at any single point of the function

image-20200203193203309

 

This is often represented with the notation f′(x) = df/dx

Note that Δx is not zero:

  1. because we can't divide by zero - but Δx is a number really close to it
  2. we can make our results better as we "zoom in" at the point by making Δx smaller as we get closer
  3. if we are far away from the point, a Δx further from zero wouldn't make that much of a difference
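The limit above can be played with numerically. A minimal sketch in Python (the function x² and the point x = 3 are just illustrative choices, not from the notes):

```python
def rise_over_run(f, x, dx):
    """Slope of the secant line between x and x + dx."""
    return (f(x + dx) - f(x)) / dx

f = lambda x: x ** 2  # example function; its exact derivative is 2x

# As dx shrinks, the secant slope approaches the true gradient at x = 3,
# which is 6.
for dx in (1.0, 0.1, 0.001):
    print(rise_over_run(f, 3.0, dx))
```

Shrinking `dx` further keeps improving the estimate until floating-point precision takes over.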

 

image-20200203200109325

Meanings


 

https://math.dartmouth.edu/opencalc2/cole/lecture8.pdf

 

Critical point: a point where the slope is 0, often a local maximum or minimum

 

Imagine that a certain city has a single train line connecting a number of stations

The first derivative

The first derivative is the slope of the tangent line to a function at a given point

It tells us whether a function is increasing or decreasing, and by how much

 

 

image-20200221164614867

 

Other properties

 

In summary, the first derivative measures the rate of change of a function. If we have a function where the x axis indicates time and the y axis distance, the first derivative is the speed

 

The second derivative

Note that the first derivative does tell us where the critical points are, but it cannot show whether they are local maxima or local minima - the second derivative can tell us this

 

If the first derivative shows that there is a critical point at some x, i.e. f′(x) = 0, the sign of the second derivative there classifies it: f″(x) < 0 means a local maximum, f″(x) > 0 a local minimum

 

image-20200221170223692

Sum rule


 

 

image-20200203200127031

image-20200203201946281

 

 

Example

image-20200204173351532

 

 

 

Power rule


image-20200204173309744

 

image-20200205182511189
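The sum and power rules can be checked symbolically. A small sketch, assuming the sympy library is available (the example polynomial is an arbitrary choice):

```python
import sympy as sp

x = sp.symbols('x')

# Power rule: d/dx x^n = n * x^(n-1)
assert sp.diff(x**5, x) == 5 * x**4

# Sum rule: differentiate term by term
f = x**3 + 4 * x**2
print(sp.diff(f, x))  # the derivative 3*x**2 + 8*x
```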

Special Cases

image-20200204173557856

For f(x) = 1/x, the gradient is negative everywhere except at x = 0, where we can't evaluate the gradient

The function has a discontinuity at x = 0, because 1/x is not defined there

image-20200204181134316

image-20200204181149984

f(x) = e^x - the function is its own gradient

Positive case

image-20200204181341928

 

Besides f(x) = 0, only one function fits the criterion

image-20200204181415212

image-20200204181428826

e ≈ 2.718 is a universal constant

 

 

Sin and Cos

image-20200204181707350

image-20200204181722820

 

 

image-20200204181730808

image-20200204181748413

 

The trigonometric functions are really exponential functions in disguise (via Euler's formula, e^(ix) = cos x + i sin x)

image-20200204181824045

image-20200204182540826

 

Product rule

 


 

image-20200204182701045

If we differentiate the product f(x)g(x), what we are really looking for is the change in the area of the rectangle as we vary x

 

image-20200204182818807

 

Adding Δx changes the size of the rectangle - in this case both sides happily increase (which makes the concept easier to see).

We can subdivide the new rectangle into smaller rectangles, one of which has the same size as the original one.

image-20200204183023373

Then we can calculate the width and height of the other rectangles

image-20200204183140439

We can then write an expression only for the area of the new rectangles

image-20200204183228358

 

 

image-20200204183259578

As Δx approaches 0, all the extra rectangles shrink, but, analyzing the equations, note that the smallest rectangle shrinks the fastest (its area is the product of two vanishing terms)

 

image-20200204183511943

image-20200204183533802

We can therefore disregard the area of the smallest rectangle

image-20200204183715435

The limit will be calculated by

image-20200204183742193

It's useful to rearrange it in the following way:

 

image-20200204183833803

  1. Split into two fractions

image-20200204183916140

  2. Taking f(x) and g(x) out of the numerators

image-20200204183952773

 

 

Note that both fractions are just the derivatives of f and g

image-20200204184203429

 

This can be rewritten as

image-20200204184225657

 

Contemplate the product rule: d(fg)/dx = f·g′ + g·f′

image-20200204184458800
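The rectangle argument can be double-checked symbolically. A sketch assuming sympy is available (sin(x) and x² are arbitrary example factors, not from the notes):

```python
import sympy as sp

x = sp.symbols('x')
f, g = sp.sin(x), x**2

# Product rule: (f*g)' = f*g' + g*f'
lhs = sp.diff(f * g, x)
rhs = f * sp.diff(g, x) + g * sp.diff(f, x)

# The two sides agree identically
assert sp.simplify(lhs - rhs) == 0
print(lhs)
```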

 

Chain rule

 

Nested functions

image-20200204184940802

  1. h(p) - how happy I am
  2. p(m) - how many pizzas I have eaten
  3. m - how much money I make

 


 

 

 

 

Note that we are relating the concept of money to happiness, but via the concept of pizza

image-20200204185200355

image-20200204185210557

image-20200204185243278

 

Knowing how much money I have now, how much effort should I put into making more, if my aim is to be happy?

 

So, we need to know the rate of change of happiness with respect to money, which is dh/dm

In this simple example we could just substitute one function into the other and differentiate

 

image-20200204185710874

 

But the chain rule provides us with a more elegant solution that will work even for more complex functions, where simple plugging in into another (direct substitution) isn't an option.

 

In this particular notation convention, the product (dh/dp)(dp/dm) looks like it would give the desired derivative dh/dm - this approach is called the chain rule.

image-20200204185827384

 

image-20200204185858151


image-20200204190059257

 

image-20200204190140585

So, if we don't want p to appear in our final function, we just need to substitute p for its function in terms of m

 

image-20200204190256563

Then rearranging the terms

image-20200204190314549

Result

image-20200204190406445
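The money-pizza-happiness chain can be sketched with symbolic substitution. The specific functions p(m) = √m and h(p) below are made-up illustrations (any smooth choices work), assuming sympy is available:

```python
import sympy as sp

m = sp.symbols('m', positive=True)    # money
p = sp.sqrt(m)                        # hypothetical: pizzas as a function of money
h_of = lambda q: -(q - 2)**2 + 4      # hypothetical: happiness peaks at 2 pizzas

# Direct substitution: write h as a function of money, then differentiate
direct = sp.diff(h_of(p), m)

# Chain rule: dh/dm = dh/dp * dp/dm
q = sp.symbols('q')
chain = sp.diff(h_of(q), q).subs(q, p) * sp.diff(p, m)

# Both routes give the same derivative
assert sp.simplify(direct - chain) == 0
```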

 

 

Taming a beast

image-20200229021835346

~The beast~

image-20200205163742436

 

 


 

image-20200205192239815

 

 

Multivariate calculus

Multiple input or output variables

How to apply the concepts shown before to systems with multiple variables?

 

What is a variable?

1. One of the variables is a function of the other

 

 

2. Dependent variables

When we can say speed = f(time)

But not time = f(speed)

Because the relation the other way around doesn't necessarily make sense

 

 


image-20200206160347319

 

The vehicle speed is a function of time because at each point in time, the vehicle can be at one and only one speed

However, we cannot say that time is a function of speed, because the same speed can happen at different points in time

image-20200206160555532

Therefore, speed is the dependent variable, because it depends on time; time is the independent variable in this context

 

Typically, when you first learn calculus, you take functions containing variables and constants and then differentiate the dependent variables (such as speed) with respect to the independent variables (such as time).

However, what gets labeled as a constant or a variable can be subtler than expected - it requires you to understand the context of the problem being described

 

The car example

image-20200206161205415

 

image-20200206162559608

 

But if you're a car designer with a target speed, then the speed becomes the constant, and the mass and drag can be adjusted by changing the car's design

 

TL;DR - you can differentiate any term with respect to any other - it depends on the context

 

Another example - designing a can

image-20200206163842618

 

 

In principle we could change just about everything about the can, even the metal's density (except for π 🤷)

So, let's find the derivative of the can's mass with respect to each variable


When differentiating with respect to some variable, simply consider all of the other variables to behave as constants

 

image-20200206165337634

Note that the first term doesn't contain the variable we're differentiating with respect to, so, being a constant, its derivative just becomes 0, as usual.

The second term does contain it, multiplied by some constants - differentiating leaves just those constants

The partial derivative with respect to h doesn't even contain h, because the mass varies linearly with the height when all else is kept constant

 

image-20200206165745508

 

Note that the notation also changed - instead of the d we're using ∂, which indicates that we are differentiating a function of more than one variable


 

image-20200206171214984

image-20200206171320494

 

Partial differentiation is in essence just taking a multidimensional problem and pretending it's a set of standard 1-D problems, as we consider each variable separately
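Partial differentiation can be sketched symbolically. The thin-walled can formula below (side wall plus two end caps) is an assumption for illustration, not necessarily the exact formula used in the notes; sympy is assumed available:

```python
import sympy as sp

# Illustrative thin-walled can: mass = wall area * thickness * density
h, r, t, rho = sp.symbols('h r t rho', positive=True)
m = 2 * sp.pi * r * h * t * rho + 2 * sp.pi * r**2 * t * rho

# Partial derivatives: treat every other variable as a constant
print(sp.diff(m, h))    # no h left over: mass is linear in height
print(sp.diff(m, rho))  # the whole expression with rho stripped out
```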

 

Differentiate with respect to anything

Working on the example

 

image-20200206171713105

image-20200206173207354

 

 

Total Derivative

image-20200206171713105

Imagine that the variables x, y and z were actually themselves functions of a single other parameter t, where:

image-20200206173511532

 

We are looking for the derivative of f with respect to t

In this simple case x(t), y(t) and z(t) can be directly substituted into the function and the derivative taken

 

image-20200206173659787

 

In a more complicated scenario the chain rule comes in handy

image-20200206173756507

The derivative with respect to t will be the sum of the chain-rule terms of the three variables: df/dt = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt) + (∂f/∂z)(dz/dt)

So we need to know the partial derivatives of f, and the derivatives of x, y and z with respect to t

image-20200206174056331

image-20200206174205203

image-20200206174226337

image-20200206174250351
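The total derivative can be checked both ways. The function f and the parametrisations below are hypothetical illustrations, assuming sympy is available:

```python
import sympy as sp

t = sp.symbols('t')
x, y, z = sp.symbols('x y z')
f = x * y + z**2                     # hypothetical multivariate function

# Hypothetical parametrisations x(t), y(t), z(t)
xt, yt, zt = sp.cos(t), sp.sin(t), t

# Total derivative: df/dt = ∂f/∂x·dx/dt + ∂f/∂y·dy/dt + ∂f/∂z·dz/dt
total = (sp.diff(f, x) * sp.diff(xt, t)
         + sp.diff(f, y) * sp.diff(yt, t)
         + sp.diff(f, z) * sp.diff(zt, t)).subs({x: xt, y: yt, z: zt})

# Same answer by substituting first, then differentiating
direct = sp.diff(f.subs({x: xt, y: yt, z: zt}), t)
assert sp.simplify(total - direct) == 0
```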

 


 

Analogy

 

image-20200210183134072

 

Jacobian

 

Jacobian of a single function of many variables

image-20200210194919841

image-20200210195601341

 

The Jacobian here is a vector which, when we give it a specific coordinate, returns a vector pointing in the direction of the steepest uphill slope of this function

In this specific example, the Jacobian is a constant which does not depend on the location selected

 

 

image-20200210200041371

 

 

Another example

 

image-20200210200609217

 

image-20200210200702413

image-20200210200730457

image-20200210200747198

The steeper the slope, the greater the magnitude of the Jacobian at that point

 

Converting into a contour plot

image-20200210200912291

Plotting Jacobian vectors

image-20200210201105828

 

Jacobian Applied

Example I


image-20200211145931677

image-20200211152035296

Remember kids:

If we are dealing with multivariate calculus, we differentiate with respect to one variable, like x or y, treating the others as constants

 

 

image-20200211150055327

image-20200211150257357

image-20200211150429583

 

image-20200211154040514

 

image-20200211154907699

 

image-20200211155042059

image-20200211155048535

image-20200211155058704

 

Example II

 

image-20200211155203190

 

This function receives a vector as the input and also gives a vector as the output

We can think about this problem as involving two vector spaces, one for the input and another for the output

Each point in the input space has a corresponding point in the output space

As we move in the input space, we also move in the output space, but along a different path

 

image-20200211155219473

 

The Jacobian can then be represented as a matrix by stacking the rows

 

image-20200211155258872

image-20200211155910871

 

This matrix is just a transformation from the input space to the output space

image-20200211160059824

 

Example III

Many functions aren't so nice

They can be highly non-linear and much more complicated

But often they may still be smooth - by zooming in enough we can treat a small region as approximately linear

Therefore, by adding up the contributions from the Jacobian determinant at each point in space, we can still calculate the change in the size of a region after a transformation

 

TRANSFORMING BETWEEN CARTESIAN AND POLAR COORDINATE SYSTEMS

image-20200211160848151

 

Polar coordinates use a radius r and an angle θ; we want the Cartesian coordinates as values: x = r cos(θ), y = r sin(θ)

image-20200211161055324

 

Building the Jacobian and taking the determinant gives |J| = r

As we move along r, away from the origin, small regions of space scale as a function of r

image-20200211161206898

 

image-20200211161219453
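The polar-to-Cartesian determinant can be verified symbolically; a sketch assuming sympy is available:

```python
import sympy as sp

r, theta = sp.symbols('r theta', positive=True)
x = r * sp.cos(theta)
y = r * sp.sin(theta)

# Jacobian of the (r, theta) -> (x, y) transformation
J = sp.Matrix([x, y]).jacobian([r, theta])
det = sp.simplify(J.det())
print(det)  # r: small regions scale with the distance from the origin
```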

 

Sandpit


Optimization is largely about finding the inputs that give us the maximum or the minimum of a function

Solving the gradient-equals-zero equations analytically becomes much more complicated, and can also be insufficient, as functions can have more than one point where the gradient equals zero

 

image-20200211161657504

image-20200211161907016

 

 

Going to the highest peak can be like walking at night: maybe we don't have a nice analytical expression, and each point in the plot is the result of a week of processing on a supercomputer, or the result of a practical experiment

The problem: the Jacobian points uphill, but not necessarily toward the tallest peak; if you follow the arrows you will arrive at some hill with all arrows pointing towards you

In maths we don't need to follow the arrows, we can just teleport to any region of space - so we are not really walking

 

A better analogy is the sandpit:

image-20200211162934945

You're using a stick to poke the sand and measure how deep the soil beneath is; you can poke any point in space, and you can't see the hills because they're hidden by the sand

 

The Hessian

 

The Hessian collects the second-order partial derivatives into a matrix

image-20200211163301164

image-20200211164852095

 

We can pass a coordinate to the Hessian and it will return a matrix that tells us something about the curvature at that point in space

 

2-D example


image-20200211202140383

 

image-20200211202420780

 


At the origin we have a saddle point

image-20200211203103611

 

The determinant of the Hessian is negative, so it is not a maximum or a minimum

 

image-20200211203300749

But the gradient is flat

The slope is coming down in one direction and upwards in another direction

That feature is called a saddle point

They can cause a lot of confusion when searching for a peak
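A saddle can be detected from the Hessian. The function x² − y² below is an illustrative saddle, not necessarily the exact function plotted above; sympy is assumed available:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 - y**2          # illustrative function with a saddle at the origin

grad = [sp.diff(f, v) for v in (x, y)]
H = sp.hessian(f, (x, y))

at_origin = {x: 0, y: 0}
# The gradient is flat at the origin: it is a critical point
assert all(g.subs(at_origin) == 0 for g in grad)
# det(H) < 0 => neither a maximum nor a minimum: a saddle point
assert H.subs(at_origin).det() < 0
```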

 

Reality is hard


  1. In reality we deal with hundreds or thousands of dimensions - we can't simply plot them as a nice landscape, but we can use our intuition from 2-D to guide us and trust the math, because the math works in any case.
  2. You may not have a nice analytical formula that describes your problem - even if the problem is in 2-D space, calculating each point can be very expensive, depending on a supercomputer's processing power or laboratory staff
  3. Not all functions are nice and smooth - what if your function contains a sharp feature, like a discontinuity?
  4. For any number of reasons the function can contain noise, making the Jacobian useless unless we are careful

The question: how do we calculate the Jacobian for problems where we don't even have the function that we're trying to optimize?

The answer: numerical methods

There is a range of techniques that allows us to find approximate answers to that question

 

The derivative measures the slope between two points as their distance tends to zero - if we can't evaluate a formula at every point, let's use only the points we have to build approximate derivatives

 

image-20200211204951497

But that's not practical for higher dimension scenarios

 

If we start from an initial location and would like to approximate the Jacobian, we can approximate each partial derivative in turn:

image-20200211205523355

image-20200211205535620

 

  1. How big should the little step be?

Too big: bad approximation

Too small: numerical issues - if the points are too close, the computer may not register any movement at all (floating point range stuff)

 

  2. What if the data is a bit noisy?

Simplest approach: calculate the gradient using a few different step sizes and take some kind of average
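The nudge-one-coordinate-at-a-time idea can be sketched directly (the example function f(x, y) = x²y and the default step size are illustrative choices):

```python
def numeric_jacobian(f, x0, step=1e-6):
    """Approximate the Jacobian of f at x0 by forward differences,
    nudging each coordinate in turn. The step must balance truncation
    error (too big) against floating-point noise (too small)."""
    fx = f(x0)
    jac = []
    for i in range(len(x0)):
        nudged = list(x0)
        nudged[i] += step
        jac.append((f(nudged) - fx) / step)
    return jac

# Example: f(x, y) = x**2 * y; the true gradient at (2, 3) is (12, 4)
f = lambda v: v[0] ** 2 * v[1]
print(numeric_jacobian(f, [2.0, 3.0]))
```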

 

Multivariate Chain Rule

Simplifying the notation

image-20200219130410601

 

image-20200219130955869

image-20200219132245788

 


 

Univariate example

image-20200219145925626

Multivariate example

image-20200219150948007

Differentiating the scalar-valued function with respect to the input vector gives us a Jacobian row vector

Differentiating a vector-valued function with respect to the scalar variable gives us a column vector of derivatives

 

But what about the middle term?

 

For the middle function we need to find the derivative of each of the two output variables with respect to each of the two input variables - we end up with four terms in total

This can be arranged as a matrix - this object is referred to as a Jacobian

 

The derivative of the output with respect to the input is the product of the Jacobian of the outer function, the Jacobian of the middle function, and the derivative vector of the inner function

image-20200219180638560

image-20200219180728229

image-20200219181536244

Simple Neural Networks


Normally they are drawn as something like:

image-20200219181659368

But fundamentally, they're just a mathematical function

image-20200219181809964

It takes variables in and spits variables out - where both of these variables could be vectors

 

Simplest possible case

image-20200219181933164

image-20200219182710516

 

The relation between neural networks and the brain comes from σ, the activation function


Neurons in the brain receive information from their neighbors through chemical and electrical stimuli

When the sum of all these stimulations goes beyond a certain threshold, the neuron activates and starts stimulating its neighbors

This behavior can be mathematically expressed by some functions e.g. the hyperbolic tangent function

image-20200219183150712

image-20200219183243412

tanh belongs to a family of similar functions, all with an "s" shape - they're called sigmoids - hence the name/symbol sigma (σ)

 

Adding more neurons

image-20200219183610561

image-20200219183625400

 

For any number of neurons

image-20200219183655262

Using the algebraic notation

image-20200219183738666

Adding more outputs

image-20200219183823431

image-20200219183908383

Combining in vector form

image-20200219183924985

 

Beginning to generalize

image-20200219184019791

 

Hidden Layers

image-20200219184129406

 

In summary

This is the linear algebra needed to describe the output of a simple feed-forward neural network

image-20200219184225358
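The layer-by-layer formula can be sketched as a forward pass. The weights below are arbitrary illustrative numbers, and tanh stands in for whichever activation is chosen; numpy is assumed available:

```python
import numpy as np

def sigma(z):
    return np.tanh(z)  # one possible s-shaped activation

def feed_forward(a, layers):
    """Propagate input vector a through a list of (W, b) layers:
    a_next = sigma(W @ a + b)."""
    for W, b in layers:
        a = sigma(W @ a + b)
    return a

# Tiny illustrative network: 2 inputs -> 3 hidden neurons -> 1 output
layers = [
    (np.array([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]), np.zeros(3)),
    (np.array([[1.0, -1.0, 0.5]]), np.zeros(1)),
]
print(feed_forward(np.array([1.0, 2.0]), layers))  # a single output value
```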

 

Training

 

Back-propagation

Back-propagation looks at the output neurons and works back through the network

Objective: to find the weights and biases that make the network's outputs best match the labeled data

Initially we initialize them with random numbers, then calculate a cost function

image-20200219192512029

 

image-20200219192522884

image-20200219192804172

At some initial point, if we could work out the gradient of the cost with respect to each variable

image-20200219192931539

Then we could just head in the opposite direction

image-20200219193000731

More realistic scenario - there are lots of local minima

image-20200219193050601

And we need to minimize this cost function over all of the weights and biases - so we are actually looking for the minimum of the hyper-surface that they form

image-20200219193303248

image-20200219193317802

If we want to head downhill, we need to build the Jacobian by putting together the partial derivatives of the cost function with respect to all of the relevant variables


Knowing what we have to do, let's write a chain rule expression for the partial derivatives for the cost with respect to either the weight or the bias

The chain rule links those derivatives through their shared intermediate terms

image-20200219193626757

It's often convenient to pull the weighted sum plus bias into a separate function, i.e. to use a new term z = wa + b

This will allow us to think about differentiating the particular function that we had chosen separately

image-20200219194415443

Now we can navigate through the weight-bias space in order to minimize the cost of the network over a set of training examples
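The whole loop can be sketched for a single neuron. This is a minimal illustration (one weight, one bias, one training example, squared-error cost), not the full network from the notes:

```python
import numpy as np

# Single neuron a = sigma(w*x + b), trained by gradient descent
sigma = np.tanh
d_sigma = lambda z: 1 - np.tanh(z) ** 2

x, y = 1.0, 0.5          # one training example (input, label)
w, b = 0.1, 0.0          # arbitrary initial parameters
lr = 0.5                 # learning rate

for _ in range(200):
    z = w * x + b
    a = sigma(z)
    dC_da = 2 * (a - y)              # cost C = (a - y)^2
    # chain rule: dC/dw = dC/da * da/dz * dz/dw, likewise for b
    dC_dw = dC_da * d_sigma(z) * x
    dC_db = dC_da * d_sigma(z)
    w -= lr * dC_dw                  # step opposite the gradient
    b -= lr * dC_db

print(sigma(w * x + b))  # the output approaches the label 0.5
```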

 

 

image-20200219194559285

 

Building approximate functions

image-20200220114555778

image-20200220114740528

We want to derive a function that is a good representation of another function, at least inside some boundaries - we can do this with Taylor Series

The approximation is a good representation of the function around the expansion point - but it becomes a poor representation further away from it

 

Power Series


 

 

Taylor Series are composed of coefficients in front of increasing powers of x - a power series

Example

g(x) = a + bx + cx² + dx³ + …

Potentially going to infinity

 

In Taylor Series the approximation becomes better and better as we increase the number of terms - a pattern may emerge

 

| Order of approximation | General formula |
| --- | --- |
| Zeroth order | g₀(x) = a |
| First order | g₁(x) = a + bx |
| Second order | g₂(x) = a + bx + cx² |
| Third order | g₃(x) = a + bx + cx² + dx³ |
| Nth order | gₙ(x) = a + bx + cx² + dx³ + … |

 

These short sections of the series are called truncated series

The approximations will visually look like:

Zeroth: just a straight line; without any angle, the best we can do is make it pass through the point horizontally

image-20200220122628711

 

First: we can do a line with an angle, the best approximation would be a line with the same slope as our original function at that point

image-20200220122818730

 

Second: now we have a quadratic function, which allows us to make a single curved shape - for the points immediately around our point it's a nice approximation

image-20200220123233669

Third: we can approximate it even more closely, now drawing up to two curves

image-20200220123423336

Fourth

image-20200220123456277

Fifth

image-20200220123536055

 

Animation

 

Power Series derivation

image-20200220161643277

If we know everything about a function at a point (its value and all of its derivatives there):

then we can use that information to reconstruct the function everywhere else

If I know everything about it at one place, I also know everything about it everywhere

However, this is only true for a certain type of function that we call well behaved

 

Example

image-20200220162142361

 

Zeroth order approximation

Using only f(0) - the value at that point - the best we can do is a horizontal line

The result isn't even a function of x

image-20200220162242286

 

First order approximation

We will use the value of the function at 0 and also the gradient at 0, which we will call f dash: f′(0)

image-20200220162624745


 

Second order approximation

We will use f(0), f′(0), and additionally the second derivative f″(0):

 

image-20200220170501483

 

image-20200220170615982

image-20200220170701372

Third order approximation


 

image-20200220173744071

 

image-20200220174347239

image-20200220174356622

Note that we can add higher order terms one at a time, and the lower order terms remain the same


Fourth order approximation

Let's try to generalize

Note that the constant in the cubic term was a result of having to differentiate the cubic term repeatedly: differentiating x³ three times gives 3 · 2 · 1 = 3!

And we know that for the fourth order approximation we will need the fourth derivative, so we will differentiate x to the fourth power four times, giving 4 · 3 · 2 · 1 = 4!

image-20200220180916102

The same kind of notation can be used for the lower order terms

image-20200220180953663

 

Remember that n! = n · (n − 1) · … · 2 · 1, with 0! = 1

 

So, the nth term of the series will be f⁽ⁿ⁾(0) xⁿ / n!

image-20200220181110807

Therefore, the complete power series can be written as

 

image-20200220181144823

This is certainly a Taylor Series, but as we're expanding around the point x = 0, it is often called a Maclaurin Series


 

image-20200220181330412

 

Taylor Series

image-20200220182549426

Maclaurin said that if you have all the information about a function at x = 0, you can reconstruct it everywhere. The Taylor Series simply acknowledges that there isn't anything special about x = 0 - it says that if you know everything about a function at any point, you can reconstruct it anywhere.

 

image-20200220182937260

The number of terms that we include will correspond to the accuracy of our series

 

Now, let's solve it for an arbitrary point

 

image-20200220183548948

image-20200220191103644

 

By building the approximation around a point p, when using the gradient term f′(p), rather than applying it directly to x we instead apply it to (x − p): how far are you from p?

image-20200220191356918

 

image-20200220191652616

image-20200220191701346

image-20200220191822891

 

image-20200220191846481

image-20200220191904736

 

Examples

1. Cosine function

 

Maclaurin Series - to know everything about cos(x) at x = 0

 

image-20200220211658712

In this case, the differentiation cycle of the trigonometric functions (cos → −sin → −cos → sin → cos) means that every other term of the series receives a zero coefficient - the odd-position derivatives are ±sin(0) = 0, so the odd-position elements of the series vanish

At the even positions, the derivatives are ±cos(0), resulting in a coefficient of 1 or −1

This configuration indicates that cos is an even function (it only has x to even powers), being symmetrical along the vertical axis

image-20200220212430831

The resulting expression doesn't even contain references to the cosine function: cos(x) = 1 − x²/2! + x⁴/4! − …
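The truncated series can be compared against the real function; a small sketch (0.5 and 3.0 are arbitrary test points):

```python
import math

def cos_maclaurin(x, n_terms):
    """Truncated Maclaurin series: cos x ≈ sum of (-1)^k x^(2k) / (2k)!"""
    return sum((-1) ** k * x ** (2 * k) / math.factorial(2 * k)
               for k in range(n_terms))

# Near 0 a few terms suffice; further from 0 more terms are needed
print(cos_maclaurin(0.5, 3), math.cos(0.5))
print(cos_maclaurin(3.0, 3), math.cos(3.0))  # much poorer this far from 0
```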

2. f(x) = 1/x - not well behaved

image-20200220213906479

 

image-20200220213920019

We can't use x = 0 (the function is not defined there), we need to use another point - so we solve it with a Taylor Series

image-20200220223935367

This can tell us things about the power series more generally

  1. The approximation ignores the asymptote, going straight across it
  2. The approximations do not describe at all the regions where x < 0
  3. The approximation gradually improves for larger values of x as we increase the number of terms; however, for values greater than around five the function doesn't change much, failing to describe larger values of x, and its tail flips up and down as the sign of each additional term flips the function from positive to negative and back again

 

Linearisation

Re-framing the Taylor Series concepts to show things like the expected error in an approximation

image-20200225173746421

 

Adding higher power terms = improved approximation

 

Change in notation

Changing the first-order approximation (example)

image-20200225173941769

image-20200225174026392

The expression says: starting from the height f(p), as you move away from p your corresponding change in height equals your distance away from p times the gradient of your function at p

 

image-20200225175136687

The approximation will be used to evaluate the function near p, as you must already know everything about the function at p

Now, the distance from p to x, previously written (x − p), will be called Δx, meaning a small step size from p to x

image-20200225175449784

The first order approximation can now be expressed in terms of Δx

image-20200225175527747

image-20200225175536299

 

Now

Now, having removed x, we will put it back 🤔, just by swapping p for x - because x is more commonly used, and swapping makes no difference in this case because everything is in terms of Δx

image-20200225175940952

 

image-20200225180800750

image-20200225181351322

 

 

When using the first order approximation instead of evaluating the base function, how big should I expect the error to be?

 

image-20200225181544609

The gap between the green line and the other lines grows as we move away from the point

image-20200225181700373

Thinking about the series:

  1. If Δx is a small number, Δx² should be really small and Δx³ even smaller
  2. With all of its infinite terms, the series can exactly describe the function
  3. Although we couldn't evaluate all of the sums, the first term that we ignore when making a first order approximation is multiplied by Δx²
  4. Therefore, the error must be on the order of Δx²

 

image-20200225182324832

So, we can add to our first order approximation an error term

 

This process of taking a function and ignoring the terms above first order is called linearisation - we took a potentially very nasty function and approximated it with just a straight line

 

Note that

image-20200225213530150


 

 

The rise over run approximation and the first-order Taylor Series both give the tangent line that goes through the point

In the rise over run approach we used two points to draw a straight line. As the points become closer, the line becomes a better and better approximation of the slope at that point; when the points are indistinguishably close, we say that the line has become a tangent, and that its slope is the same as the slope of the function at that point

TL;DR

As Δx gets closer to zero, the approximation of the function's slope becomes exact

image-20200225213711345

image-20200225213722143

 

But, what if... they don't

If the points don't come closer - Δx is not fully tending towards zero - there is a finite amount of space between the points

So the resulting line - the gradient estimate - will have a certain amount of error

 

image-20200225214540489

Then, it is possible to rearrange the Taylor Series to indicate how big we expect that error to be

image-20200225215109791

The gradient term has been isolated to the left hand side of the expression, the result is exactly the same, but the isolated part is suspiciously similar to the rise over run expression plus a collection of higher order terms

If we remove everything except for that first term and add the error expression

image-20200225215436971

The expected error is proportional to the distance between the two points

The method is first order accurate

This is particularly useful for writing computer programs that solve these types of problems numerically rather than analytically
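First-order accuracy can be observed numerically: halving the step roughly halves the error. A sketch using sin at x = 1 (an arbitrary illustrative choice):

```python
import math

def forward_difference(f, x, dx):
    """Rise-over-run estimate of f'(x) with a finite step dx."""
    return (f(x + dx) - f(x)) / dx

# Error of the forward difference vs. the exact derivative of sin, cos(1).
exact = math.cos(1.0)
for dx in (0.1, 0.05, 0.025):
    print(dx, abs(forward_difference(math.sin, 1.0, dx) - exact))
```

Each line shows the error shrinking roughly in proportion to `dx`, as the Δx² analysis of the truncated series predicts.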

 

Multivariate Taylor Series

Recap

image-20200225215747187

image-20200225215803722

 

Two-dimensional case

image-20200225215902492

f is now a function of two variables, x and y

The truncated Taylor Series expressions will enable us to approximate the function at some point nearby

In the 2-D case, the approximating function will itself be a 2-D surface

 

 

 

Zeroth order approximation

It was a flat line with no gradient in 1-D; now it is a flat plane with no gradient

image-20200225221521437

image-20200225221547470

First order approximation

From the 1-D case, it should be something with a height and a gradient. In 2-D it is still a flat surface, but this time it can have an angle

image-20200225221715598

Second order approximation

In the 1-D analogy it has height, angle and a single parabolic curvature - now, we're expecting some kind of parabolic surface

Using the peak as our approximation point, the resulting paraboloid sits inside the curve

image-20200225222015868

Choosing a point on the side of the curve instead, we get a saddle-shaped surface

image-20200225222116705

image-20200225222142894

Expression

First order approximation

image-20200225222253357

Second order approximation

image-20200225222314373

 

Using the Jacobian

image-20200225222357343

Using the Hessian

image-20200225222439037

image-20200225222503847

image-20200225222528692

image-20200225222541537

image-20200225222555920

This can be generalized for even higher dimension curves (hyper-curves)
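Putting the pieces together, the truncated second order expansion can be written compactly with the Jacobian J and Hessian H of f:

```latex
f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + J_f\,\Delta\mathbf{x} + \frac{1}{2}\,\Delta\mathbf{x}^{T} H_f\,\Delta\mathbf{x}
```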

 

Newton-Raphson method

Let's say that we have a distribution of heights

image-20200225224838459

 

And we want to fit that data to some equation, then we could:

 

But, how do we find the right parameters for the model - the best fit we can get?

We will need an expression that indicates how well the model fits the data, and then look at how that goodness of fit varies as we change the parameters

 

Example for a simpler case:

 

Let's say that we have a function that describes how far away we are from the best value for a certain parameter: zero means a perfect fit, positive values mean the model's predictions are too large, and negative values mean the model's predictions are too small - in this case we want to find the roots of the function directly.

Or maybe the value of the function itself says how good our fit is. In this case, we want to find peaks and troughs in our function. It turns out that the roots of the first derivative lie at the same locations as those peaks and troughs, so we can take the first derivative and find its roots.

If the function depends on multiple variables, we can work with a partial derivative, which considers only one variable of interest at a time.

The Newton-Raphson method allows us to take the derivative of a function at some point and converge to one of its roots. The process is:

  1. Make a guess of where a root might be
  2. Take the slope at that point
  3. Extrapolate the line of the slope until it crosses the x axis; this will be your next guess
  4. Repeat steps 2 and 3 until convergence

The formula for the new guess (step 3) is:

x_{i+1} = x_i - f(x_i) / f'(x_i)

Where the new guess x_{i+1} depends on the last guess x_i

We are effectively pretending that the function is a straight line and guessing that the root is where that line crosses the x axis - by repeating this process over and over we hope to find a pretty good approximation to the real root.

Example - let's use the following expression:

f(x) = x³ - 5x² + 3x

It is plotted as:

image-20200227180422119

 

Rather than this simple function, which we can easily plot, the function could:

So, with this method we don't need to:

 

Python implementation and execution:

img
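The screenshot above is not reproduced here, so this is a minimal sketch of the method. The cubic f(x) = x³ - 5x² + 3x is an assumption that matches the roots quoted below (0, ≈0.7, ≈4.3), and iteration counts may differ slightly from the original run:

```python
def newton_raphson(f, df, x0, tol=1e-6, max_iter=100):
    """Iterate x_{i+1} = x_i - f(x_i) / df(x_i) until f(x) is near zero."""
    x = x0
    for i in range(max_iter):
        if abs(f(x)) < tol:
            return x, i
        x = x - f(x) / df(x)
    return x, max_iter

# Assumed cubic matching the roots quoted in the text (0, ~0.7, ~4.3)
f = lambda x: x**3 - 5*x**2 + 3*x
df = lambda x: 3*x**2 - 10*x + 3

root, iterations = newton_raphson(f, df, x0=10.0)
print(root, iterations)  # root is close to 4.303
```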

The function has roots at x = 0, x ≈ 0.7 and x ≈ 4.3

First guess: x = 10; we find a root at x ≈ 4.3 after 7 iterations

Second guess: x = -10; we find the root at x = 0 after 10 iterations

Third guess:

We already know the roots at x ≈ 4.3 and x = 0; our function is cubic, so it might have a third root

Some guesses could be: x < -10, x > 10, or x between 0 and 4.3. Running the method with really large or really small values will return the already known roots - a good guess could be in the middle of the known roots, at x = 2.15.

For the guess x = 2.15 we find a root at x ≈ 0.697 after 3 iterations

 

However, some things might go wrong. Some guesses could put us in a closed loop, going back and forth between the same values, never getting closer to the answer

image-20200227205144469

In this example, the initial guess produces a new guess which, when evaluated, leads right back to the initial guess, so we would cycle between the two values forever

img

Running

results in

 

Another problem

When you're close to a turning point, either a minimum or a maximum, the gradient will be very small, so when you divide by it the new guess might be a crazy value; the method won't converge, it will just throw you somewhere else.

image-20200227232146994

Example:

Using our already known function

If our first guess is close to a turning point, the first iteration will throw us far away

image-20200227233038467

The gradient at this point is

Changing the initial guess, we get:

image-20200227233417419

We end up using less than a third of the iterations to reach the same result

 

Bonus: simple polynomial printer
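The original notes don't include the printer itself, so here is one possible minimal sketch: it takes a list of coefficients (highest degree first) and renders the polynomial as text.

```python
def poly_to_str(coeffs):
    """Render a polynomial from a list of coefficients, highest degree first.
    e.g. [1, -5, 3, 0] -> 'x^3 - 5x^2 + 3x'."""
    degree = len(coeffs) - 1
    terms = []
    for i, c in enumerate(coeffs):
        power = degree - i
        if c == 0:
            continue  # skip absent terms entirely
        # Sign separator: bare '-' for a leading negative, ' + ' / ' - ' otherwise
        if not terms:
            sign = "-" if c < 0 else ""
        else:
            sign = " - " if c < 0 else " + "
        magnitude = abs(c)
        coef = "" if magnitude == 1 and power != 0 else str(magnitude)
        if power == 0:
            var = ""
        elif power == 1:
            var = "x"
        else:
            var = f"x^{power}"
        terms.append(f"{sign}{coef}{var}")
    return "".join(terms) or "0"

print(poly_to_str([1, -5, 3, 0]))  # x^3 - 5x^2 + 3x
```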

 

Gradient descent

img

Gradient?

 

This class is a bit confusing at first look; you may want to look at some other references first

Recommendations below

We've already learned that we can use derivatives to measure the slope of a function at some point, and that we can use that slope to navigate to lower or higher points on the function

We used the Newton-Raphson method to find the roots of a certain one-dimensional function numerically

In some scenarios we have a function that says how well or badly our model fits some data; in this case, we may want to find troughs in that function, since that means reducing how badly our model fits

In a multivariate space, the derivative at a point isn't enough:

 

The solution is to use a vector instead, composed of the partial derivatives of the function with respect to each of its variables

This vector is called grad, and it's written ∇f - the grad of f evaluated at some point

For real problems - especially training neural networks - we need to find the troughs and peaks of many multivariate functions, many times over

Let's start by looking at the function

f(x, y) = x²y

 

image-20200228135451778

 

Note that the function gets bigger when x is larger and y is positive, and it gets smaller whenever y goes negative

Spinning the plot and looking down the x axis, we get a projection of straight lines

image-20200228140053692

Spinning and looking down the y axis, we get an upward parabola for y > 0 and a downward parabola for y < 0

image-20200228141153109

 

And the function is equal to zero along both axes

 

The question is: how do I find the fastest or steepest way down this graph?

We can find the gradient of the function with respect to each of its axes - we differentiate with respect to each variable, treating the others as constants

Grad is an awesome vector - it's the thing that connects calculus to linear algebra

 

image-20200228215405192

 

The grad vector is defined as the vector of partial derivatives: ∇f = (∂f/∂x, ∂f/∂y)

In this case:

image-20200228215334630

∂f/∂x = 2xy and ∂f/∂y = x²

Therefore

∇f = (2xy, x²)
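We can sanity-check this gradient numerically. A minimal sketch, assuming the example function f(x, y) = x²y discussed here, comparing the analytic gradient (2xy, x²) against a central-difference estimate:

```python
def f(x, y):
    return x**2 * y

def grad_f(x, y):
    """Analytic gradient of f: (df/dx, df/dy) = (2xy, x^2)."""
    return (2 * x * y, x**2)

def numeric_grad(f, x, y, h=1e-6):
    """Central-difference estimate of the gradient, for checking."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return (dfdx, dfdy)

print(grad_f(3.0, 2.0))           # (12.0, 9.0)
print(numeric_grad(f, 3.0, 2.0))  # approximately the same
```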

 

We can think about grad as the combination of two vectors; the first is the partial derivative with respect to x

Both start from some given point

Let's ignore the y direction for now

The partial derivative with respect to x, evaluated at some point,

will result in a vector that starts from that point and goes in the x direction by an amount corresponding to the slope in the x direction at that point

The second vector is the partial derivative with respect to y

The partial derivative with respect to y, evaluated at the same point,

will result in a vector that starts from that point and goes in the y direction by an amount corresponding to the slope in the y direction at that point

image-20200229214831339

 

Then

So, if we sum the two component vectors, the resultant will be a vector that starts at the point and points towards the steepest uphill direction, with a magnitude proportional to the steepness of the hill

image-20200229215309324

 

 

 

The ∇f vector comes in handy because, once we've calculated it, we can get the derivative in any direction by taking its dot product with a unit vector in that direction

 

∇f · û = (2xy, x²) · û

 

Descent?

OK, now we know the direction we want to go: down the gradient, -∇f. But by how much should we walk?

It turns out that ∇f can help us with that problem too

If we begin at some point,

we will take small steps downhill, each corresponding to the slope of the hill at the current point

image-20200229221139840

image-20200229221912711
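The steps above can be sketched as a minimal gradient descent loop. Since f(x, y) = x²y has no minimum, this sketch uses an assumed bowl-shaped example, f(x, y) = x² + 2y², whose single minimum is at (0, 0); the step size value is also an assumption:

```python
def gradient_descent(grad, start, step=0.1, iters=100):
    """Repeatedly move against the gradient: p_{n+1} = p_n - step * grad(p_n)."""
    x, y = start
    for _ in range(iters):
        gx, gy = grad(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y

# Assumed bowl-shaped example: f(x, y) = x^2 + 2y^2, so grad f = (2x, 4y)
grad = lambda x, y: (2 * x, 4 * y)

print(gradient_descent(grad, start=(3.0, 2.0)))  # approaches (0, 0)
```

Because each step is proportional to the local gradient, the steps naturally shrink as the walk approaches the trough, where the slope flattens out.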

 

img