Multivariate Calculus

Contents: Notation problem · Functions · Rise over run - speed vs time · Derivatives - definition · Rise over run of a linear function · Rise over run for more complex functions · Meanings · The first derivative · The second derivative · Sum rule · Power rule · Special cases - the function is its own gradient · Positive case · Sin and Cos · Product rule · Chain rule · Nested functions · Taming a beast · Multivariate calculus · What a variable is? · Differentiate with respect to anything · Total derivative · Jacobian · Jacobian of a single function of many variables · Jacobian applied (Examples I-III) · Sandpit · The Hessian · 2-D example · Reality is hard · Multivariate chain rule · Simplifying the notation · Univariate example · Multivariate example · Simple neural networks · Simplest possible case · Adding more neurons · Adding more outputs · Beginning to generalize · Hidden layers · In summary · Training · Back-propagation · Building approximate functions · Power series · Power series derivation · Example (zeroth to fourth order approximations) · Taylor series · Examples (1. Cosine function - not well behaved) · Linearisation · Change in notation · Changing the first-order approximation (example) · Multivariate Taylor series · Recap · Two-dimensional case · Expression (first and second order approximations) · Newton-Raphson method · Gradient descent · Gradient? · Descent?
Different people invented or contributed to calculus over time, and each chose their own notation
Some notations are better suited to the applications they were invented for

INPUTS (a, b, c, ...) => [FUNCTION] => OUTPUTS (p, q, r, ...)
Context
Modeling the world
Calculus


Acceleration is the local gradient of a speed-time graph
Is a function of time (in this example)

Using the tangent at a point to see the slope at that point
The slope change can then be plotted to show acceleration vs. time instead of speed vs. time

Constant speed = zero gradient = zero acceleration over time

Orange line = acceleration = distance / time²
This is the first derivative

The **second derivative** can be taken by plotting the slope change of the acceleration function.
It is related to the car starting and stopping
Asking which curve generates a given derivative is called taking the anti-derivative, or integral

In this graph it shows the distance covered by the car - how much distance is covered per unit time - or just the speed
The slope of that line is represented by the amount of growth between two points (an interval) divided by the length of the considered interval - this is called rise over run


The rise over run changes depending on the chosen points
What is the rise over run at a single point?


Defining the second point at a distance Δx from the first one, we can use x + Δx to determine its position
Now, writing a function for any point based on the rise over run concept



With smaller values of Δx, the result becomes a better representation of the gradient


This concept can be expressed with limits - we want to know the gradient for a Δx as small as possible / as needed

Then, we can get the slope for any single point in the function

This is often represented with the notation df/dx or f'(x), with the limit taken as Δx → 0
Note that Δx tends to zero but is never actually zero
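This limiting process can be sketched numerically (a minimal illustration; the function f(x) = x² and the point x = 3 are my own choices, not from the notes):

```python
# Rise over run between x and x + h: as h shrinks, the ratio approaches
# the true gradient. f(x) = x**2 has exact derivative 2x, so 6 at x = 3.
def rise_over_run(f, x, h):
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2

for h in [1.0, 0.1, 0.001]:
    print(h, rise_over_run(f, 3, h))  # 7.0, then ~6.1, then ~6.001
```

Note that h never actually reaches zero here, it only gets small - which is exactly what the limit notation expresses.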


https://math.dartmouth.edu/opencalc2/cole/lecture8.pdf
Critical point: a point where the slope is 0, often a local maximum or minimum
Imagine that a certain city has a single train railway, connecting any number of stations
The first derivative is the slope of the tangent line to a function at a given point
It tells us whether and how much a function is increasing or decreasing

Other properties
In summary, the first derivative measures the rate of change of a function. If we have a function where the x axis indicates time and the y axis the distance covered, the first derivative gives the speed
Note that the first derivative does tell where the critical points are, but it cannot show whether they are local maxima or local minima - the second derivative can tell this
If the first derivative shows that there is a critical point at some x, i.e. a point where the derivative is zero




Example





The gradient is negative everywhere except at x = 0, but at this point we can't see the gradient
The function has a discontinuity because it is not defined at x = 0



Besides f(x) = 0, only one function fits the criteria: the exponential function eˣ


Euler's number e ≈ 2.718 is a universal constant




The trigonometric functions are really exponential functions




If we differentiate a product of two functions, what we are really looking for is the change in area of the rectangle as we vary x

Adding Δx changes the size of the rectangle - in this case both sides happily increase (which makes the concept easier to see).
We can subdivide the new rectangle into smaller rectangles, one of them with the same size as the original one.

Then we can calculate the width and height of the other rectangles

We can then write an expression only for the area of the new rectangles


As Δx approaches 0, all rectangles shrink, but, analyzing the equations, note that the smallest rectangle shrinks the fastest


We can ultimately disregard the area of the smallest rectangle

The limit will be calculated by

It's useful to rearrange it in the following way:



Note that both fractions are just the derivatives of the two functions

This can be rewritten as

Contemplate the product rule
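A quick numerical sanity check of the rule, d(uv)/dx = u'v + uv' (the functions u(x) = x² and v(x) = sin x are illustrative choices, not from the notes):

```python
import math

# Compare the derivative of the product u(x)*v(x) against u'(x)v(x) + u(x)v'(x),
# both estimated with a small forward difference.
def derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

u = lambda x: x ** 2
v = lambda x: math.sin(x)
uv = lambda x: u(x) * v(x)

x = 1.3
lhs = derivative(uv, x)                                  # d(uv)/dx directly
rhs = derivative(u, x) * v(x) + u(x) * derivative(v, x)  # product rule
print(lhs, rhs)  # the two values agree to several decimal places
```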





Note that we are relating the concept of money to happiness, but via the concept of pizza



Knowing how much money I have now, how much effort should I put into making more, if my aim is to be happy?
So, we need to know the rate of change of happiness with respect to money. Which is
In this simple example we could just substitute one function into another and derive it

But the chain rule provides a more elegant solution that works even for more complex functions, where simply plugging one function into another (direct substitution) isn't an option.
In this particular notation convention, the product looks like it would give the desired function - this approach is called the chain rule.





So, if we don't want the intermediate variable to appear in our final function, we just need to substitute it for its expression in terms of the input variable

Then rearranging the terms

Result
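The money → pizza → happiness chain can be checked numerically; the specific formulas below are made up for illustration only:

```python
import math

# Chain rule: dh/dm = dh/dp * dp/dm, compared against differentiating
# the directly composed function h(p(m)).
def derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

pizza = lambda m: 3 * m + 1               # p(m): pizza from money (illustrative)
happiness = lambda p: math.log(p)         # h(p): happiness from pizza (illustrative)
composed = lambda m: happiness(pizza(m))  # h(p(m)) by direct substitution

m = 2.0
direct = derivative(composed, m)
chained = derivative(happiness, pizza(m)) * derivative(pizza, m)
print(direct, chained)  # both ≈ 3/7
```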


~The beast~



Multiple input or output variables
How to apply the concepts shown before to systems with multiple variables?
1. One of the variables is a function of the other
2. Dependent variables
When we can say speed = f(time)
But not time = f(speed)
Because the other way around doesn't necessarily make sense



The vehicle speed is a function of time because at each point in time, the vehicle can be at one and only one speed
However, we cannot say that time is a function of speed, because the same speed can happen at different points in time

Therefore, the speed is a dependent variable, because it depends on time and the time is the independent variable in this context
Typically, when you first learn calculus, you take functions containing variables and constants and then differentiate the dependent variables (such as speed) with respect to the independent variables (such as time).
However, what gets labeled as a constant or a variable can be subtler than expected - it requires you to understand the context of the problem being described
The car example


But if you're a car designer, and have a target speed, then your speed becomes the constant and the mass and drag can be adjusted by changing the car's design
TL;DR - you can differentiate any term with respect to any other - it depends on the context
Another example - designing a can

In principle we could change just about everything about the can, even the metal's density (except for π 🤷)
So, let's find the derivative of the can's mass with respect to each variable

When differentiating with respect to some variable, simply consider all of the other variables to behave as constants

Note that the first term doesn't contain h, so, being a constant, its derivative just becomes 0, as usual.
The second term does contain h, and h is just multiplied by some constants - differentiating it leaves just those constants
The partial derivative with respect to h doesn't even contain h, because the mass varies linearly with the height when all else is kept constant

Note that the notation also changed - instead of the d we're using ∂, which indicates that we are differentiating a function of more than one variable


Partial differentiation is in essence just taking a multidimensional problem and pretending that it's a standard 1-D problem as we consider each variable separately
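Sticking with the can, the "hold everything else constant" recipe is easy to check numerically. I'm assuming the usual cylinder-area model m = ρ·t·(2πr² + 2πrh); the numbers below are arbitrary:

```python
import math

# Mass of an empty can: density * thickness * (two circular caps + side wall)
def mass(h, r, t, rho):
    return rho * t * (2 * math.pi * r ** 2 + 2 * math.pi * r * h)

# Partial derivative with respect to h: nudge h, hold r, t and rho constant
def dm_dh(h, r, t, rho, dh=1e-6):
    return (mass(h + dh, r, t, rho) - mass(h, r, t, rho)) / dh

h, r, t, rho = 10.0, 2.0, 0.1, 7.0
print(dm_dh(h, r, t, rho))        # numerical estimate
print(2 * math.pi * r * t * rho)  # analytic ∂m/∂h = 2πr·t·ρ - no h in it
```

The analytic partial contains no h at all, matching the observation above that the mass varies linearly with the height.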
Working on the example



Imagine that the variables were actually themselves functions of a single other parameter t, where:

We are looking for the derivative of f with respect to t
In this simple case the functions can be directly substituted into f and the derivative taken

In a more complicated scenario the chain rule comes in handy

The derivative with respect to t will be the sum of the chains of the three variables
So we need to know the derivatives of each of them with respect to t
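A numerical version of this sum-of-chains (total derivative) idea, with illustrative choices x(t) = t², y(t) = sin t, z(t) = 3t and f(x, y, z) = xy + z:

```python
import math

# Total derivative: df/dt = ∂f/∂x · dx/dt + ∂f/∂y · dy/dt + ∂f/∂z · dz/dt
x = lambda t: t ** 2
y = lambda t: math.sin(t)
z = lambda t: 3 * t
f = lambda x_, y_, z_: x_ * y_ + z_  # ∂f/∂x = y, ∂f/∂y = x, ∂f/∂z = 1

def d(g, t, h=1e-6):
    return (g(t + h) - g(t)) / h

t = 1.0
direct = d(lambda s: f(x(s), y(s), z(s)), t)             # substitute, then differentiate
chained = y(t) * d(x, t) + x(t) * d(y, t) + 1 * d(z, t)  # sum of the three chains
print(direct, chained)
```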





Analogy



The Jacobian is a vector which, when we give it a specific coordinate, returns a vector pointing in the direction of the steepest uphill slope of the function
In this specific example, the Jacobian is a constant which does not depend on the location selected

Another example




The steeper the slope, the greater the Jacobian becomes at that point
Converting into a contour plot

Plotting Jacobian vectors
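The "vector of partial derivatives that points uphill" can be computed numerically. f(x, y) = x² + y² is an illustrative bowl-shaped choice whose gradient [2x, 2y] always points away from the origin, in the direction of steepest ascent:

```python
# Numerical Jacobian (gradient) of a scalar function of two variables.
def jacobian(f, x, y, h=1e-6):
    dfdx = (f(x + h, y) - f(x, y)) / h  # partial with respect to x
    dfdy = (f(x, y + h) - f(x, y)) / h  # partial with respect to y
    return [dfdx, dfdy]

f = lambda x, y: x ** 2 + y ** 2
print(jacobian(f, 1.0, 2.0))  # ≈ [2, 4], the analytic [2x, 2y]
```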



Remember kids
If
Then
If we are dealing with multivariate calculus we will differentiate with respect to some variable like x or y









This function receives a vector as input and also gives a vector as output
We can think about this problem as being contained by two vector spaces, one for the inputs and another for the outputs
Each point in the input space has a corresponding point in the output space
As we move in the input space, we also move in the output space, but along a different path

The Jacobian can then be represented as a matrix by stacking the rows


This matrix is just a transformation from the input space to the output space

Many of the functions aren't so nice
They can be highly non-linear and much more complicated
But often they may still be smooth - by zooming in enough we can say that a region is approximately linear
Therefore adding all the contributions from the Jacobian determinants at each point in space, we can still calculate the change in the size of a region after a transformation
TRANSFORMING BETWEEN CARTESIAN AND POLAR COORDINATE SYSTEMS

Polar coordinates use a radius r and an angle θ; we want the Cartesian coordinates (x, y) as values

Making the Jacobian and taking the determinant
As we move along r, away from the origin, small regions of space will scale as a function of r
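Written out, the standard polar-coordinate Jacobian computation is:

```latex
x = r\cos\theta \qquad y = r\sin\theta

J = \begin{pmatrix}
\dfrac{\partial x}{\partial r} & \dfrac{\partial x}{\partial \theta} \\[4pt]
\dfrac{\partial y}{\partial r} & \dfrac{\partial y}{\partial \theta}
\end{pmatrix}
= \begin{pmatrix}
\cos\theta & -r\sin\theta \\
\sin\theta & r\cos\theta
\end{pmatrix}

|J| = r\cos^2\theta + r\sin^2\theta = r
```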



Optimization is largely about finding the inputs that give us the maximum or the minimum of a function
Solving for the gradient equal to zero analytically becomes much more complicated, and also ineffective, as functions can have more than one point where the gradient equals zero


Going to the highest peak can be like walking at night: maybe we don't have a nice analytical expression, and each point in the plot is the result of a week of processing on a supercomputer or of a practical experiment
The problem: the Jacobian points uphill, but not to the tallest hill - if you follow the arrows you will arrive at some hill with all arrows pointing towards you
In maths we don't need to follow the arrows, we can just teleport to any region of space - so we are not really walking
A better analogy is the sandpit:

You're using a stick to poke the sand and measure how deep the soil beneath is; you can poke any point in space, and you can't see the hills because they're blocked by the sand
The Hessian collects the second order derivatives into a matrix


We can pass a coordinate to the Hessian and it will return a matrix that tells us something about that point in space


At the origin we have a saddle point

The Hessian's determinant is negative, so it is not a maximum or a minimum

But the gradient is flat
The slope is coming down in one direction and upwards in another direction
That feature is called a saddle point
They can cause a lot of confusion when searching for a peak
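A numerical sketch of this situation, using f(x, y) = x² − y² (an illustrative saddle, my choice): at the origin the gradient is flat, yet the Hessian's determinant is negative, which is the saddle-point signature.

```python
# Central-difference Hessian of a function of two variables.
def hessian(f, x, y, h=1e-4):
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return [[fxx, fxy], [fxy, fyy]]

f = lambda x, y: x ** 2 - y ** 2
H = hessian(f, 0.0, 0.0)
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
print(H, det)  # ≈ [[2, 0], [0, -2]], det ≈ -4: curving up one way, down the other
```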

The question: how do we calculate the Jacobian for problems where we don't even have the function that we're trying to optimize?
The answer: numerical methods
There is a range of techniques that allow us to find approximate answers to that question
The derivative measures the slope between two points as their distance tends to zero - if we can't calculate every point from a formula, let's use only what we've got to build the derivatives - approximation

But that's not practical for higher dimension scenarios
If we start from an initial location and would like to approximate the Jacobian, we can approximate each partial derivative in turn:


Too big: bad approximation
Too small: numerical issues - if the points are too close, the computer may not register any movement at all (floating point range stuff)
Simplest approach: to calculate the gradient using a few different step sizes and taking some kind of average
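A sketch of that averaging idea, treating f(x) = x³ as a stand-in for an expensive black-box function (the step sizes are arbitrary):

```python
# Approximate a derivative with a few different step sizes and average them.
def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 3
estimates = [forward_difference(f, 2.0, h) for h in [1e-2, 1e-3, 1e-4]]
average = sum(estimates) / len(estimates)
print(estimates, average)  # the exact answer is 3 * 2**2 = 12
```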






Differentiating the scalar valued function with respect to the input vector gives us the Jacobian row vector
Differentiating a vector valued function with respect to the scalar variable gives us a column vector of derivatives
But what about the middle term?
For the middle function we need to find the derivative of each of the two output variables with respect to each of the two input variables - we end up with four terms in total
These can be arranged as a matrix - this object is referred to as a Jacobian
The derivative of the whole composition is then the product of the Jacobians and the derivative vector, multiplied together




Normally they are drawn as something like:

But fundamentally, they're just a mathematical function

It takes variables in and spits variables out - where both of these variables could be vectors


The relation between neural networks and the brain comes from σ, the activation function

Neurons in the brain receive information from their neighbors through chemical and electrical stimuli
When the sum of all these stimulations goes beyond a certain threshold, the neuron activates and starts stimulating its neighbors
This behavior can be mathematically expressed by some functions e.g. the hyperbolic tangent function


tanh belongs to a family of similar functions, all with an "s" shape - they're called sigmoids - hence the name/symbol sigma (σ)


For any number of neurons

Using the algebraic notation



Combining in vector form


This is the linear algebra needed to describe the output of a simple feed-forward neural network
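That vector form can be written out in plain Python for one layer, a = σ(Wx + b); tanh stands in for σ, and the weights, biases and inputs below are arbitrary illustrative numbers:

```python
import math

# One feed-forward layer: each neuron takes a weighted sum of the inputs,
# adds its bias, and passes the result through the activation function.
def layer(W, x, b):
    activations = []
    for row, bias in zip(W, b):
        z = sum(w * xi for w, xi in zip(row, x)) + bias
        activations.append(math.tanh(z))  # sigma = tanh, an s-shaped sigmoid
    return activations

W = [[0.5, -0.2],
     [0.1, 0.8]]
b = [0.0, -0.5]
x = [1.0, 2.0]
print(layer(W, x, b))  # two activations, one per neuron
```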

Back-propagation looks at the output neurons and works back through the network
Objective: to find the weights and biases that best match the input with the labeled data
Initially we set them to random numbers, then calculate a cost function



At some initial point - if we could work out the gradient of the cost with respect to the variable

Then we could just head in the opposite direction

More realistic scenario - there are lots of local minima

And we need to solve this cost function for all of the weights - so we are actually looking for the minimum of the hyper-surface that they form


If we want to head downhill to the minimum, we need to build the Jacobian by putting together the partial derivatives of the cost function with respect to all of the relevant variables

Knowing what we have to do, let's write a chain rule expression for the partial derivatives for the cost with respect to either the weight or the bias
The term links those derivatives

It's often convenient to pull the weighted sum plus bias into a separate function, i.e. to introduce a new intermediate term
This will allow us to think about differentiating the particular function that we had chosen separately

Now we can navigate through the space in order to minimize the cost of the network for a set of training examples
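The descent itself can be sketched in one dimension; C(w) = (w − 3)² is a toy stand-in for a network's cost, with its minimum at w = 3 (all numbers here are illustrative):

```python
# Gradient descent: repeatedly step opposite to the (numerically
# estimated) gradient of the cost.
def grad(C, w, h=1e-6):
    return (C(w + h) - C(w)) / h

C = lambda w: (w - 3) ** 2
w = 0.0             # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * grad(C, w)
print(w)  # converges towards the minimum at w = 3
```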



To derive a function that is a good representation of another function at least inside some boundaries - we can do this with Taylor Series
The approximation is a good representation of the function around the expansion point - but it becomes a poor representation further away from it

Taylor Series are composed of coefficients in front of increasing powers of x - a power series
Example
Potentially going to infinity
In Taylor Series the approximation becomes better and better as we increase the number of terms - a pattern may emerge
| Order of approximation | General Formula |
|---|---|
| Zeroth order approximation | g₀(x) = f(0) |
| First order approximation | g₁(x) = f(0) + f'(0)x |
| Second order approximation | g₂(x) = f(0) + f'(0)x + f''(0)x²/2 |
| Third order approximation | g₃(x) = f(0) + f'(0)x + f''(0)x²/2 + f'''(0)x³/6 |
| Nth order approximation | gₙ(x) = Σ (n = 0 → N) f⁽ⁿ⁾(0)xⁿ/n! |
These short sections of the series are called Truncated series
The approximations will visually look like:
Zeroth: just a straight line; without any angle, the best we can do is make it pass through the point horizontally

First: we can do a line with an angle, the best approximation would be a line with the same slope as our original function at that point

Second: now we got a quadratic function, this allows us to make a single curved shape - for the points immediately around our point it's a nice approximation

Third: we can approximate it even more, now we can draw up to two curves

Fourth

Fifth

Animation

If we know everything about a function at the point x = 0:
- Value
- First derivative
- Second derivative
- Nth derivative
then we can use that information to reconstruct the function everywhere else
If I know everything about the function at one place, I also know everything about it everywhere
However, this is only true for a certain type of functions that we call well behaved
- Continuous
- That you can differentiate as many times as you want

Using only f(0) - the value at that point - the best we can do is a horizontal line
The result isn't even a function of x

We will use the value of the function at x = 0 and also the gradient at x = 0, which we will call f dash (f'(0))


We will use:







Note that we can add higher order terms piece by piece, and the lower order terms will remain the same

Let's try to generalize
Note that the 1/6 in the cubic term is a result of having to differentiate the cubic term three times (x³ → 3x² → 6x → 6 = 3!)
And we know that for the fourth order approximation we will need the fourth derivative, so we will differentiate x to the fourth power four times, giving 4! = 24

The same kind of notation can be used for the lower order terms

Remember that 0! = 1
So, the nth term of the series will be f⁽ⁿ⁾(0)xⁿ/n!

Therefore, the complete power series can be written as g(x) = Σ (n = 0 → ∞) f⁽ⁿ⁾(0)xⁿ/n!

This is certainly a Taylor Series, but as we're looking at the point x = 0, it is often called a Maclaurin Series



Maclaurin said that if you have all the information about a function at x = 0, you can reconstruct it everywhere. The Taylor Series simply acknowledges that there is nothing special about x = 0 - it says that if you know everything about a function at any one point, you can reconstruct it anywhere.
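As a concrete check of the "know everything at one point, rebuild the function" claim, here is the Maclaurin series for eˣ (a standard example, not taken from the notes), where every derivative at 0 equals 1:

```python
import math

# Truncated Maclaurin series: g_N(x) = sum of f^(n)(0) * x**n / n!.
# For f(x) = e**x every derivative at 0 is 1, so the terms are x**n / n!.
def maclaurin_exp(x, terms):
    return sum(x ** n / math.factorial(n) for n in range(terms))

for terms in [1, 2, 4, 8]:
    print(terms, maclaurin_exp(1.0, terms), math.exp(1.0))  # approximation improves
```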

The number of terms that we include corresponds to the accuracy of our series
Now, let's solve it for an arbitrary point p


By building the approximation around the point p, when using the gradient term (f'(p)), rather than applying it directly to x we instead apply it to (x − p), i.e. how far are you from p?






Well behaved
Infinitely differentiable
Maclaurin Series - to know everything about

In this case, the differentiation cycle of the trigonometric functions means that every other term of the series receives a zero coefficient - the odd-position derivatives are zero at x = 0, so the elements of the series in odd positions vanish
At the even positions the derivatives give coefficients of 1 or -1
This configuration indicates that cos is an even function (it only has x to even powers), being symmetrical about the vertical axis

The resulting expression doesn't even contain references to the cosine function
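The alternating even-power pattern is easy to verify numerically with a truncated series, cos x ≈ 1 − x²/2! + x⁴/4! − …:

```python
import math

# Maclaurin series of cos(x): odd powers vanish, even powers alternate sign.
def maclaurin_cos(x, terms):
    return sum((-1) ** n * x ** (2 * n) / math.factorial(2 * n)
               for n in range(terms))

print(maclaurin_cos(0.5, 1), maclaurin_cos(0.5, 3), math.cos(0.5))
```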


We can't use x = 0, we need another point, so we solve it with a Taylor Series

This can tell us things about the power series more generally
The approximation ignores the asymptote, going straight across it

The approximations do not describe at all the region on the other side of the asymptote
The approximation gradually improves for larger values of x as we increase the number of terms; however, for values greater than around five the improvement stalls - the series doesn't describe larger values of x, and its tail flips up and down as the sign of each additional term flips the function from positive to negative and back again
Re-framing the Taylor Series concepts to show things like the expected error in an approximation

Adding higher power terms = improved approximation


The expression says: starting from the height f(p), as you move away from p, your corresponding change in height equals your distance from p times the gradient of your function at p

The approximation will be used to evaluate the function near p, as you must already know about it at p
Now, the distance from p to x, previously written (x − p), will be called Δx, meaning a small step from p

The series can now be expressed in terms of Δx


Now
Now, without x, we will put it back 🤔, just by swapping p for x - because x is more commonly used, and the swap makes no difference here because everything is in terms of Δx



When using the first order approximation, instead of evaluating the base function, how big should I expect the error to be?

The gap between the green and other lines grows as we get away from the point

Thinking about the series:

So, we can add an error term to our first order approximation
This process of taking a function and ignoring the terms above first order is called linearisation - we take a potentially very nasty function and approximate it with just a straight line
Note that


The rise over run approximation and the first-order Taylor Series are the tangent line that goes through the point
In the rise over run approach we used two points to draw a straight line. As the points become closer, the line becomes a better and better approximation for the slope at that point; when the points are indistinguishably close, we say that the line has become a tangent, and that its slope is the same as the slope of the function at that point
TL;DR
As Δx becomes closer to zero, the approximation of the function's slope becomes exact


But, what if... they don't
If the points don't come closer, Δx is not fully tending towards zero - there is a finite amount of space between the points
So, the resultant line - the gradient - will have a certain amount of error

Then, it is possible to rearrange the Taylor Series to indicate how big we expect that error to be

The gradient term has been isolated to the left hand side of the expression, the result is exactly the same, but the isolated part is suspiciously similar to the rise over run expression plus a collection of higher order terms
If we remove everything except for that first term and add the error expression

The expected error is proportional to the distance between the two points
The method is first order accurate
This is particularly useful for writing computer programs that solve these types of problems numerically rather than analytically
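A numerical illustration of first-order accuracy (sin x and the point x = 1 are my own example choices): halving Δx roughly halves the error of the rise-over-run estimate.

```python
import math

# Forward-difference gradient estimate and its error versus the exact cos(1).
def forward_difference(f, x, dx):
    return (f(x + dx) - f(x)) / dx

exact = math.cos(1.0)
for dx in [0.1, 0.05, 0.025]:
    error = abs(forward_difference(math.sin, 1.0, dx) - exact)
    print(dx, error)  # the error shrinks roughly in proportion to dx
```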



f is now a function of two variables, x and y
The truncated Taylor Series expressions will enable us to approximate the function at some point nearby
In the 2-D case, the approximation function should always be 2-D
It was a straight line with no gradient in 1-D, now it is a straight plane with no gradient


From the 1-D case, it should be something with a height and a gradient. In 2-D it is still a straight surface, but this time it can have an angle

In the 1-D analogy it has height, angle and a single parabolic curvature - now, we're expecting some kind of parabolic surface
Using the peak as our approximation point, the parabola is created inside the curve

Choosing a point on the lateral face of the curve we got a saddle function




Using the Jacobian

Using the Hessian





This can be generalized for even higher dimension curves (hyper-curves)
Let's say that we have a distribution of heights

And we want to fit that data to some equation, then we could:
But how do we find the right parameters for the model - the best fit we can manage?
We will need an expression that indicates how well the model fits the data, and to look at how that goodness of fit varies as we change the parameters
Example for a simpler case:
Let's say that we have a function that describes how far away we are from the best value for a certain parameter: 0 means a perfect fit, positive values mean the model's predictions are too large, and negative values mean the model's predictions are too low - in this case we want to find the roots of the function right away.
Or maybe the value itself says how good our fit is. In this case, we want to find peaks and troughs in our function. It turns out that the roots of the first derivative lie at those same peaks and troughs. So, we can take the first derivative and find its roots.
If we depend on multiple variables, we solve the function as a partial derivative, considering only one variable of interest at a time.
The Newton-Raphson method allows us to take the derivative of a function at some points and converge to some of its roots. The process is:
The formula for the new guess (step 3) is:
Where the new guess depends on the last guess
We are actually pretending that the function is a straight line and then guessing that the root is where that line crosses the x axis - by repeating this process over and over we hope to find a pretty good approximation to the real root.
Example - let's use the following expression:
It is plotted as:

Rather than this simple function, which we can easily plot, it could:
So, with this method we don't need to:
Calculate the function at a lot of points and plot the result (to solve it graphically)
Solve it algebraically ()
Python implementation and execution:
```python
# Messy definition of linear, quadratic and cubic functions
def cubic_function(a=0, b=0, c=0, d=0):
    return lambda x: (a * pow(x, 3)) + (b * pow(x, 2)) + (c * x) + d

def quadratic_function(a=0, b=0, c=0):
    return cubic_function(0, a, b, c)

def linear_function(a=0, b=0):
    return quadratic_function(0, a, b)


# The derivative at a point for any function (smaller values of h mean more precision)
def derivative(function, x, h=0.00001):
    '''
    lim   f(x + h) - f(x)
    h->0  ---------------
                 h
    '''
    return (function(x + h) - function(x)) / h

def evaluate_guess(function, guess_x):
    '''
    The distance between f(guess) and 0
    '''
    return abs(0 - function(guess_x))  # the distance from f(x) to 0

def newthon_raphson_step(function, guess_x=0):
    '''
    If you want to find the roots of a function f(x) numerically, the Newton-Raphson method says:

    1. Choose an initial guess x0
    2. Calculate new guesses until convergence with

                     f(x_n)
    x_n+1 = x_n - ---------
                    f'(x_n)
    '''
    return guess_x - (function(guess_x) / derivative(function, guess_x))


def newthon_raphson(function, x_0=0, precision=0.000001, max_iterations=1000):
    '''
    The execution of multiple steps of the Newton-Raphson method; it repeats until the
    distance between f(guess) and 0 is smaller than the precision, or until 1000 iterations
    have passed - in that case we'll assume that the method is stuck, although we aren't
    actively checking for that
    '''
    best_guess_x = x_0
    Δ = evaluate_guess(function, x_0)
    iteration = 0

    print(f'#0 iter | x={best_guess_x} | Δ={Δ} | f(x)={function(best_guess_x)}')

    while best_guess_x is not None and Δ > precision and iteration < max_iterations:
        iteration += 1

        best_guess_x = newthon_raphson_step(function, best_guess_x)
        Δ = evaluate_guess(function, best_guess_x)

        print(f'#{iteration} iter | x = {round(best_guess_x, 6):.6f} | f(x) = {round(function(best_guess_x), 6):.6f} | Δ = {round(Δ, 6):.6f}')

    print('\n')
    print('== Newthon-Raphson final result ==')
    print(f'Root found: {evaluate_guess(function, best_guess_x) < precision}')
    print(f'Iterations: {iteration}')
    print(f'Starting guess: x = {x_0}')
    print(f'Best guess: x = {round(best_guess_x, 10):.10f}')
    print(f'Solving for best guess: f({round(best_guess_x, 2):.2f}) = {round(best_guess_x, 10):.10f}')
    print(f'Error: Δ = {round(Δ, 10):.10f}')
    print('====================================')

    if iteration == max_iterations:
        print('''Maximum iterations reached - probably you reached a close loop, check for loops, try changing x_0 or increase the maximum iteration amount.''')
    return best_guess_x, Δ, iteration


# a=1, b=-5, c=3
# f = (pow(x, 3)) - (5 * pow(x, 2)) + (3 * x)  # alternative

f = cubic_function(a=1, b=-5, c=3)  # f(x) = x³ - 5x² + 3x
newthon_raphson(f, 10)
newthon_raphson(f, -10)
newthon_raphson(f, 2.15)
```
The function has roots at x = 0, x ≈ 0.7 and x ≈ 4.3
First guess: x = 10 - we find a root at x ≈ 4.3 after 7 iterations
```
#0 iter | x=10 | Δ=530 | f(x)=530
#1 iter | x = 7.389166 | f(x) = 152.615401 | Δ = 152.615401
#2 iter | x = 5.746512 | f(x) = 41.891150 | Δ = 41.891150
#3 iter | x = 4.807295 | f(x) = 9.968451 | Δ = 9.968451
#4 iter | x = 4.396350 | f(x) = 1.521767 | Δ = 1.521767
#5 iter | x = 4.306941 | f(x) = 0.064756 | Δ = 0.064756
#6 iter | x = 4.302784 | f(x) = 0.000137 | Δ = 0.000137
#7 iter | x = 4.302776 | f(x) = 0.000000 | Δ = 0.000000

== Newthon-Raphson final result ==
Root found: True
Iterations: 7
Starting guess: x = 10
Best guess: x = 4.3027756378
Solving for best guess: f(4.30) = 4.3027756378
Error: Δ = 0.0000000013
====================================
```
Second guess: x = -10 - we find the root at x = 0 after 10 iterations
```
#0 iter | x=-10 | Δ=1530 | f(x)=-1530
#1 iter | x = -6.203471 | f(x) = -449.754112 | Δ = 449.754112
#2 iter | x = -3.711532 | f(x) = -131.140034 | Δ = 131.140034
#3 iter | x = -2.101297 | f(x) = -37.659314 | Δ = 37.659314
#4 iter | x = -1.090559 | f(x) = -10.515290 | Δ = 10.515290
#5 iter | x = -0.488772 | f(x) = -2.777577 | Δ = 2.777577
#6 iter | x = -0.165962 | f(x) = -0.640173 | Δ = 0.640173
#7 iter | x = -0.030967 | f(x) = -0.097724 | Δ = 0.097724
#8 iter | x = -0.001465 | f(x) = -0.004405 | Δ = 0.004405
#9 iter | x = -0.000004 | f(x) = -0.000011 | Δ = 0.000011
#10 iter | x = 0.000000 | f(x) = 0.000000 | Δ = 0.000000

== Newthon-Raphson final result ==
Root found: True
Iterations: 10
Starting guess: x = -10
Best guess: x = 0.0000000000
Solving for best guess: f(0.00) = 0.0000000000
Error: Δ = 0.0000000001
====================================
```
Third guess:
We already know the roots 4.3 and 0, our function is cubic, so it might have three roots
Some guesses could be: x < 10, x > 10, or x between 0 and 4.3. Running the method with really large or really small values will return the already known roots - a good guess could be in the middle of the known roots, at x = 2.15.
For the guess x=2.15 we find a root at x=0.697 after 3 iterations
```
#0 iter | x=2.15 | Δ=6.724124 | f(x)=-6.724124
#1 iter | x = 0.698484 | f(x) = -0.003172 | Δ = 0.003172
#2 iter | x = 0.697226 | f(x) = -0.000005 | Δ = 0.000005
#3 iter | x = 0.697224 | f(x) = -0.000000 | Δ = 0.000000

== Newthon-Raphson final result ==
Root found: True
Iterations: 3
Starting guess: x = 2.15
Best guess: x = 0.6972243623
Solving for best guess: f(0.70) = 0.6972243623
Error: Δ = 0.0000000001
====================================
```
However, some things might go wrong: some guesses can put us in a closed loop, going back and forth between the same values, never getting closer to the answer

In this example, the initial guess is x = 0; the calculated new guess leads us to x = 1, and evaluating that results in x = 0 again, so we would cycle on it forever

Running
```python
f = cubic_function(a=1, b=0, c=-2, d=2)  # x³ - 2x + 2
newthon_raphson(f, 0)
```
results in
```
#0 iter | x=0 | Δ=2 | f(x)=2
#1 iter | x = 1.000000 | f(x) = 1.000000 | Δ = 1.000000
#2 iter | x = 0.000030 | f(x) = 1.999940 | Δ = 1.999940
#3 iter | x = 1.000000 | f(x) = 1.000000 | Δ = 1.000000
#4 iter | x = 0.000030 | f(x) = 1.999940 | Δ = 1.999940
#5 iter | x = 1.000000 | f(x) = 1.000000 | Δ = 1.000000
#6 iter | x = 0.000030 | f(x) = 1.999940 | Δ = 1.999940
#7 iter | x = 1.000000 | f(x) = 1.000000 | Δ = 1.000000
...
#995 iter | x = 1.000000 | f(x) = 1.000000 | Δ = 1.000000
#996 iter | x = 0.000030 | f(x) = 1.999940 | Δ = 1.999940
#997 iter | x = 1.000000 | f(x) = 1.000000 | Δ = 1.000000
#998 iter | x = 0.000030 | f(x) = 1.999940 | Δ = 1.999940
#999 iter | x = 1.000000 | f(x) = 1.000000 | Δ = 1.000000
#1000 iter | x = 0.000030 | f(x) = 1.999940 | Δ = 1.999940

== Newthon-Raphson final result ==
Root found: False
Iterations: 1000
Starting guess: x = 0
Best guess: x = 0.0000300103
Solving for best guess: f(0.00) = 0.0000300103
Error: Δ = 1.9999399794
====================================
Maximum iterations reached - probably you reached a close loop, check for loops, try changing x_0 or increase the maximum iteration amount.
```
Another problem
When you're close to a turning point, either a minimum or a maximum, the gradient will be so small that when you divide by it the new guess might come out as a crazy value; the method won't converge, it just throws you somewhere else.

Example:
Using our already known function
```python
f = cubic_function(a=1, b=-5, c=3)  # f(x) = x³ - 5x² + 3x
newthon_raphson(f, 3)
```
If our first guess is x₀ = 3, the first iteration will throw us to x ≈ 225000

The gradient at this point is f'(3) = 3·3² - 10·3 + 3 = 0 analytically; the numerically estimated slope is a tiny non-zero value, so dividing f(3) = -9 by it launches the next guess far away:
```
#0 iter | x=3 | Δ=9 | f(x)=-9
#1 iter | x = 224999.983822 | f(x) = 11390369418629708.000000 | Δ = 11390369418629708.000000
#2 iter | x = 150000.382718 | f(x) = 3374913333419859.500000 | Δ = 3374913333419859.500000
#3 iter | x = 100000.740748 | f(x) = 999972222170842.250000 | Δ = 999972222170842.250000
#4 iter | x = 66667.736126 | f(x) = 296288333250360.375000 | Δ = 296288333250360.375000
#5 iter | x = 44445.711830 | f(x) = 87789128875373.984375 | Δ = 87789128875373.984375
#6 iter | x = 29631.036208 | f(x) = 26011609715097.308594 | Δ = 26011609715097.308594
#7 iter | x = 19754.579076 | f(x) = 7707142837706.546875 | Δ = 7707142837706.546875
#8 iter | x = 13170.274868 | f(x) = 2283597801325.270020 | Δ = 2283597801325.270020
#9 iter | x = 8780.738876 | f(x) = 676621562090.689331 | Δ = 676621562090.689331
#10 iter | x = 5854.381602 | f(x) = 200480458823.617523 | Δ = 200480458823.617523
#11 iter | x = 3903.476773 | f(x) = 59401612691.887161 | Δ = 59401612691.887161
#12 iter | x = 2602.873752 | f(x) = 17600477174.935402 | Δ = 17600477174.935402
#13 iter | x = 1735.805199 | f(x) = 5214955351.280502 | Δ = 5214955351.280502
#14 iter | x = 1157.759699 | f(x) = 1545171242.574341 | Δ = 1545171242.574341
#15 iter | x = 772.396381 | f(x) = 457828058.790962 | Δ = 457828058.790962
#16 iter | x = 515.488020 | f(x) = 135652455.080934 | Δ = 135652455.080934
#17 iter | x = 344.216550 | f(x) = 40193116.884430 | Δ = 40193116.884430
#18 iter | x = 230.036731 | f(x) = 11908935.709860 | Δ = 11908935.709860
#19 iter | x = 153.918597 | f(x) = 3528482.455130 | Δ = 3528482.455130
#20 iter | x = 103.175803 | f(x) = 1045415.127232 | Δ = 1045415.127232
#21 iter | x = 69.351243 | f(x) = 309711.463050 | Δ = 309711.463050
#22 iter | x = 46.807548 | f(x) = 91738.526207 | Δ = 91738.526207
#23 iter | x = 31.787566 | f(x) = 27162.842722 | Δ = 27162.842722
#24 iter | x = 21.788263 | f(x) = 8035.229690 | Δ = 8035.229690
#25 iter | x = 15.143750 | f(x) = 2371.729702 | Δ = 2371.729702
#26 iter | x = 10.748096 | f(x) = 696.273432 | Δ = 696.273432
#27 iter | x = 7.871932 | f(x) = 201.581722 | Δ = 201.581722
#28 iter | x = 6.042412 | f(x) = 56.186449 | Δ = 56.186449
#29 iter | x = 4.964147 | f(x) = 14.008927 | Δ = 14.008927
#30 iter | x = 4.450753 | f(x) = 2.472117 | Δ = 2.472117
#31 iter | x = 4.312801 | f(x) = 0.156335 | Δ = 0.156335
#32 iter | x = 4.302827 | f(x) = 0.000790 | Δ = 0.000790
#33 iter | x = 4.302776 | f(x) = 0.000000 | Δ = 0.000000

== Newthon-Raphson final result ==
Root found: True
Iterations: 33
Starting guess: x = 3
Best guess: x = 4.3027756393
Solving for best guess: f(4.30) = 4.3027756393
Error: Δ = 0.0000000245
====================================
```
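The vanishing gradient at x = 3 can be checked directly. A small sketch with the exact derivative (so the pathological start would actually divide by zero) and a nearby start for comparison:

```python
f  = lambda x: x**3 - 5*x**2 + 3*x   # the cubic from this example
df = lambda x: 3*x**2 - 10*x + 3     # exact derivative

print(df(3))   # 0 - the exact gradient vanishes at x = 3 (a turning point)

x = 3.1        # a slightly shifted starting guess
for _ in range(12):
    x -= f(x) / df(x)                # Newton-Raphson update
print(round(x, 6))  # 4.302776 - the root (5 + √13) / 2
```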
Changing the initial guess to x₀ = 3.1 we get:

```
#0 iter | x=3.1 | Δ=8.959 | f(x)=-8.959
#1 iter | x = 13.893417 | f(x) = 1758.350053 | Δ = 1758.350053
#2 iter | x = 9.925548 | f(x) = 515.024498 | Δ = 515.024498
#3 iter | x = 7.341307 | f(x) = 148.208208 | Δ = 148.208208
#4 iter | x = 5.717490 | f(x) = 40.607005 | Δ = 40.607005
#5 iter | x = 4.792381 | f(x) = 9.608784 | Δ = 9.608784
#6 iter | x = 4.391632 | f(x) = 1.441647 | Δ = 1.441647
#7 iter | x = 4.306544 | f(x) = 0.058577 | Δ = 0.058577
#8 iter | x = 4.302783 | f(x) = 0.000112 | Δ = 0.000112
#9 iter | x = 4.302776 | f(x) = 0.000000 | Δ = 0.000000

== Newthon-Raphson final result ==
Root found: True
Iterations: 9
Starting guess: x = 3.1
Best guess: x = 4.3027756378
Solving for best guess: f(4.30) = 4.3027756378
Error: Δ = 0.0000000010
====================================
```
We end up using less than a third of the iterations to reach the same result
Bonus: simple polynomial printer
```python
# works for the simple cases
def simple_polynomial_to_str(*args):

    def power(of):  # quick and dirty way
        if of == 2: return '²'   # to represent a²
        if of == 3: return '³'   # to represent a³
        if of == 4: return '⁴'   # to represent a⁴
        if of == 1: return ''    # to represent a¹ = a
        else: return f'^{of}'    # to represent a^n e.g. a^-10

    def multiplier(value):
        if value == 1: return ''    # 1*x = x
        if value == -1: return '-'  # -1*x = -x
        else: return value

    terms = []
    for index, item in enumerate(args):
        if item != 0:
            if index != len(args) - 1:
                terms.append(f'{multiplier(item)}x{power(len(args) - (index + 1))}')
            else:
                terms.append(f'{multiplier(item)}')  # the last item (n) is n * x⁰, that's the same as just n
    return '+'.join(terms).replace('+-', '-').replace('+', ' + ').replace('-', ' - ')

print(simple_polynomial_to_str(1, -3, 1, 0, -2, 10))

# x^5 - 3x⁴ + x³ - 2x + 10
```

This class is a bit confusing at first look; maybe you want to look at some other references first
Recommendations below
We've already learned that we can use derivatives to measure the slope of a function at some point, and that we can use that slope to navigate to lower or higher points of the function
We used the Newton-Raphson method to find the roots of a certain one-dimensional function numerically
In some scenarios we have a function that tells us how well or how badly our model fits some data; in that case we may want to find the troughs of that function, since that means reducing our model's badness of fit
In a multivariate space, the derivative at a point isn't enough:
With 1-D plots, a scalar slope is enough to tell how steep the function is at a point
However, in a multivariate case, say 2-D, the slope of the function depends on which direction you're looking at
- Which line is the slope of the function?
The solution is using a vector instead. A vector is composed of:
- Starting point
- Direction
- Magnitude
This vector is called grad and it's represented by ∇f - the grad of f at some point
Properties:
Points towards the steepest slope at a point - the direction in which the function will increase the fastest
- This means that we can walk in the opposite direction, -∇f, to get down the hill
Has a magnitude equal to the slope at that point
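Both properties can be verified numerically: sample the slope ∇f · u over many unit vectors u, and the largest slope shows up along the grad direction, with magnitude |∇f| (the gradient values below are hypothetical):

```python
import math

grad = (3.0, 4.0)   # a hypothetical gradient at some point; |grad| = 5

slopes = []
for i in range(3600):
    t = i * 2 * math.pi / 3600
    u = (math.cos(t), math.sin(t))                    # unit vector
    slopes.append((grad[0]*u[0] + grad[1]*u[1], t))   # slope in direction u

steepest, angle = max(slopes)
print(round(steepest, 3))   # 5.0   -> the magnitude of grad
print(round(angle, 3))      # 0.927 -> atan2(4, 3), the direction of grad
```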
For real problems - especially training neural networks - we need to find the troughs and peaks of a lot of multivariate functions, a lot of times
Solving it algebraically is just not feasible - we need a solution that helps us navigate through that multivariate space, generally trying to minimize the function by walking down the hill using some numerical method
If we land at some random point in a function, we want to:
- Analyze how steep the hill is around us
- Pick the direction where the hill is the steepest
- Find which direction is downwards
- Walk in that direction an amount of steps that makes sense, i.e. we don't want to land somewhere totally unknown
Let's start by looking at the function f(x, y) = x²y

Note that the function gets bigger when x is larger and y is positive, and it gets smaller whenever y gets negative
Spinning and looking down the x axis we get a projection of a straight line

Spinning and looking down the y axis we get an upward parabola for y > 0 and a downward parabola for y < 0

And the function is equal to zero along both axes
The question is: how do I find the fastest or steepest way to get down in this graph?
We can find the gradient of the function with respect to each of its axes - we can differentiate for each variable by treating everything else as constant
Grad is an awesome vector - it's the thing that connects calculus to linear algebra

The grad vector is defined as: ∇f = (∂f/∂x, ∂f/∂y)
In this case: ∂f/∂x = 2xy and ∂f/∂y = x²

Therefore ∇f = (2xy, x²)
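Assuming the surface in this example is f(x, y) = x²y, the analytic gradient can be sanity-checked against central finite differences:

```python
def f(x, y):
    return x**2 * y           # the example surface

def grad_f(x, y):
    return (2*x*y, x**2)      # analytic gradient (∂f/∂x, ∂f/∂y)

h, x0, y0 = 1e-6, 2.0, 3.0    # sample point and step for finite differences
numeric = ((f(x0 + h, y0) - f(x0 - h, y0)) / (2*h),
           (f(x0, y0 + h) - f(x0, y0 - h)) / (2*h))

print(grad_f(x0, y0))  # (12.0, 4.0)
print(numeric)         # approximately (12.0, 4.0)
```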
We can think about grad as the combination of two vectors, the first being the ∂f/∂x component
Both start from some given point (x₀, y₀)
Let's ignore the second one for a moment
The partial derivative with respect to x at (x₀, y₀)
will result in a vector that starts from (x₀, y₀) and goes in the x direction an amount corresponding to the slope in the x direction at that point
The second vector is the ∂f/∂y component
The partial derivative with respect to y at (x₀, y₀)
will result in a vector that starts from (x₀, y₀) and goes in the y direction an amount corresponding to the slope in the y direction at that point
So, if we sum the first vector with the second, the resultant will be a vector that starts at (x₀, y₀) and points towards the steepest hill around, with a magnitude proportional to the steepness of the hill
The grad vector comes in handy because once we have calculated it, we can calculate the derivative in any direction (the directional derivative) by taking its dot product with a unit vector pointing in that direction
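A minimal sketch of that dot product (the gradient and the angle here are hypothetical):

```python
import math

grad = (12.0, 4.0)                # gradient at some point (hypothetical values)
t = math.pi / 3                   # a chosen direction, 60 degrees
u = (math.cos(t), math.sin(t))    # unit vector in that direction

slope = grad[0]*u[0] + grad[1]*u[1]   # directional derivative: grad · u
print(round(slope, 4))                # 6 + 2√3 ≈ 9.4641
```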
OK, now we know the direction we want to go: -∇f. But by how much should we walk?
Turns out that ∇f can help us with that problem too
If we begin at some point s₀
We will take small steps, with sizes corresponding to the slope of the hill
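Putting it together: repeatedly step against the gradient, scaled by some small step size. A minimal gradient descent sketch, using a stand-in bowl-shaped function f(x, y) = x² + y² whose trough is at the origin:

```python
def grad(x, y):
    return (2*x, 2*y)        # gradient of the stand-in bowl f(x, y) = x² + y²

x, y = 3.0, -2.0             # land at some random starting point
gamma = 0.1                  # step size
for _ in range(100):
    gx, gy = grad(x, y)
    x, y = x - gamma * gx, y - gamma * gy   # small step against the gradient
print(x, y)                  # both coordinates approach the trough at (0, 0)
```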


