Numerical Methods

Lecture 3: Minimization

by: Tomasz Romańczukiewicz
web: th.if.uj.edu.pl/~trom/
rm B-2-03


Outline

Basically, there are three types of minimization techniques:

  1. bracketing methods (more stable, but usually 1-dim),
  2. iterative (faster),
  3. stochastic methods (slow, but can find the global minimum)

In principle, if $f(x)=0$ for $x=x_*$, then $g(x)=[f(x)]^2$ has a minimum at $x_*$.
So it may seem that finding a root of $f(x)$ is equivalent to finding a minimum of $g(x)$. Analytically this might be true, but numerically they are different procedures (a short illustration follows the list below).

  • the minimum of $g(x)$ may never give $g(x_*)=0$, and it is difficult to tell whether $x_*$ is a zero of $f$ or not
  • if we search for $f=f_{min}$ but also know $f'(x)$, it is better to look for a root of $f'(x)$: root finding is more precise
  • functions are very flat around minima, so a precise $x_{min}$ is more difficult to find than $f_{min}$,
    especially when $|f_{min}|\gg1$
  • however, in more dimensions it might be easier to find a minimum of $F(\vec x)$ than a solution of $\displaystyle \frac{\partial F}{\partial\vec x}=0$
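
A short illustration of the first two points: a minimal sketch assuming SciPy is available; the function h, its square g and the bracket values are chosen only for this demonstration (they are not the lecture example). The root finder typically gets close to machine precision, while the minimizer of the squared function stalls around $\sqrt{\epsilon}$.

In [ ]:
# Minimal sketch (assumes SciPy): compare locating x* = 2 as a root of h(x)
# with locating it as a minimum of g(x) = h(x)**2.  The function h and the
# brackets are chosen only for this illustration.
import numpy as np
from scipy.optimize import brentq, minimize_scalar

def h(x):
    return (x - 2.0) * np.exp(x)      # simple function with a root at x = 2

def g(x):
    return h(x)**2                    # squared function: flat minimum at x = 2

x_root = brentq(h, 1.0, 3.0)                             # sign change available
x_min  = minimize_scalar(g, bracket=(1.0, 1.5, 3.0)).x   # only a flat minimum
print("root finder:", x_root, " error:", abs(x_root - 2))
print("minimizer  :", x_min,  " error:", abs(x_min - 2))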

Bracketing methods

All methods start with points satisfying the relations: $$a<b<c \qquad f(a)>f(b) \;\wedge\; f(b)<f(c)$$ Then we may expect (if the function is regular enough: continuous $f$ and $f'$) that $f$ has a minimum in $(a, c)$.
If we do not know such three points, we search for them (a minimal sketch follows the list):

  • take two points and choose the third one downhill
  • choose the next point by increasing the distance (for example doubling it), but not beyond some maximal step
  • repeat until success
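
A minimal sketch of this bracketing search; the doubling of the step and the max_step cap are illustration choices, not part of the lecture code:

In [ ]:
# Minimal down-hill bracketing sketch: starting from two points, walk downhill
# with a growing step until the function turns up again.  The growth factor 2
# and the max_step cap are arbitrary illustration choices.
def find_bracket(F, a, b, max_step=10.0, max_iter=50):
    if F(a) < F(b):                   # make sure we walk downhill from a to b
        a, b = b, a
    step = b - a
    for _ in range(max_iter):
        step = 2.0 * step             # increase the distance ...
        if abs(step) > max_step:      # ... but not beyond the maximal step
            step = max_step if step > 0 else -max_step
        c = b + step
        if F(c) > F(b):               # turned uphill: (a, b, c) brackets a minimum
            return (a, b, c) if a < c else (c, b, a)
        a, b = b, c                   # still descending, move the pair forward
    raise RuntimeError("no bracket found within max_iter steps")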

If we have three points bracketing the minimum, we can use interval divisions to localize it.
The interval $[a, c]$ can be divided into three sub-intervals (equal ones: ternary search, or unequal ones, e.g. using the golden ratio).

Example: $$f(x) = \frac{1-2x}{2x+10}e^{-x}$$
In [172]:
%matplotlib inline
import matplotlib.pyplot as pl
import numpy as np

def f(x):
    return np.exp(-x)*(1-2*x)/(2*x+10)
In [170]:
def show_minimization_steps(F, x1, x2, step=0, plot=True, prnt=False):
    """
        Function not optimized
        used only for presentation purposes
    """
    pl.figure(figsize=(6,3), dpi=150)
    x = np.linspace(x1, x2, 1000)
    y = F(x)    

    X = np.linspace(x1, x2, 4)
    Y = F(X)
    for n in range(step):
        if Y[1]<Y[2]: x1, x2 = X[0], X[2]
        else: x1, x2 = X[1], X[3]
        X = np.linspace(x1, x2, 4)
        Y = F(X)       
        if prnt: print("step :", n, X)
    if plot:
        pl.grid(True)
        pl.plot(x, y)    
        pl.plot(X, Y, 'o')
        pl.show()        
In [171]:
show_minimization_steps(f, 0, 6, 0)
In [160]:
show_minimization_steps(f, 0, 6, 1)
In [161]:
show_minimization_steps(f, 0, 6, 2)
In [162]:
show_minimization_steps(f, 0, 6, 3)
In [163]:
show_minimization_steps(f, 0, 6, 4)
In [165]:
show_minimization_steps(f, 0, 6, 10, plot=False, prnt=True)
step : 0 [0.         1.33333333 2.66666667 4.        ]
step : 1 [0.         0.88888889 1.77777778 2.66666667]
step : 2 [0.88888889 1.48148148 2.07407407 2.66666667]
step : 3 [0.88888889 1.28395062 1.67901235 2.07407407]
step : 4 [0.88888889 1.15226337 1.41563786 1.67901235]
step : 5 [1.15226337 1.32784636 1.50342936 1.67901235]
step : 6 [1.15226337 1.2693187  1.38637403 1.50342936]
step : 7 [1.2693187  1.34735559 1.42539247 1.50342936]
step : 8 [1.2693187  1.32134329 1.37336788 1.42539247]
step : 9 [1.32134329 1.35602635 1.39070941 1.42539247]

Each iteration uses new middle points, which is very inefficient.

In each iteration we reduce the segment to $2/3$ of its length at the cost of two new calculations of $f(x)$.

Perhaps we could reuse three points from the previous iteration and calculate $f(x)$ only once.

But where to put the fourth point?

Best strategy: divide the interval into three parts so that the same ratio appears on every level: two adjacent parts together have length $\tau$ times the whole interval, and the outer part has length $\tau$ times those two together.

$$|\;\; a\;\; |\;b\;|\;\; a\;\; |$$

$$a+b=(2a+b)\tau \;\wedge\; a=(a+b)\tau$$

$$\tau^2+\tau-1=0\;\;\Rightarrow\;\; \tau=\frac{\sqrt{5}-1}{2}=0.61803398875$$

The golden ratio minimizes the expected number of steps (it balances the larger and the smaller segment).

We converge as $\tau^n$ (linear convergence) with $n$ calculations of $f$.
Slower than bisection $(1/2)^n$, but faster than ternary search (which needs two new evaluations per step). A minimal implementation sketch follows.
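
A minimal golden-section search sketch along these lines; the tolerance and the bookkeeping are illustration choices (Task 1 below asks for your own implementation):

In [ ]:
# Minimal golden-section search sketch.  Assumes a valid bracket a < b < c with
# F(b) < F(a) and F(b) < F(c); the new probe point is always placed a fraction
# 1 - tau into the larger of the two sub-intervals.
def golden_section(F, a, b, c, tol=1e-8):
    tau = 0.61803398875                       # golden ratio (sqrt(5) - 1) / 2
    while abs(c - a) > tol:
        if (c - b) > (b - a):                 # probe the larger interval (b, c)
            x = b + (1.0 - tau) * (c - b)
            if F(x) < F(b): a, b = b, x       # new bracket (b, x, c)
            else:           c = x             # new bracket (a, b, x)
        else:                                 # probe the larger interval (a, b)
            x = b - (1.0 - tau) * (b - a)
            if F(x) < F(b): c, b = b, x       # new bracket (a, x, b)
            else:           a = x             # new bracket (x, b, c)
    return b

For the example above, golden_section(f, 0, 1, 3) converges to the same minimum as the ternary-search plots; a production version would additionally cache $f(b)$ so that only one new evaluation of $f$ is needed per step.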

A slightly better but far more complicated algorithm: Fibonacci search.
The lengths of the intervals (from smallest to largest) form a Fibonacci sequence

$$\lim_{n\to\infty}\frac{F_{n}}{F_{n+1}}=\tau$$

How accurately can we find $x_{min}$?
Around the minimum, usually: $$f(x)\approx f(x_{min})+\frac12f''(x_{min})(x-x_{min})^2$$ But $f$ can only be calculated with an error $f\pm\epsilon|f|$, which limits the precision of $x_{min}$ to

$$|x-x_{min}|\approx\sqrt{\epsilon}\sqrt{\frac{2|f(x_{min})|}{|f''(x_{min})|}}$$ Using 64-bit floats we obtain only about 32 bits of precision in the position of the minimum.
In [193]:
def plot_accuracy_comparison():
    pl.figure(figsize=(10,3), dpi=150)
    x = np.linspace(0, 6, 1000)
    y1 = np.float16(1 + 0.001*(x-2)**2-2e-4*(x-2.5)**3)
    y2 = np.float16(0.002*(x-2)-6e-4*(x-2.5)**2)
    pl.subplot(1, 2, 1)
    pl.grid(True)
    
    pl.title('Minimum')
    pl.plot(x, y1)
    pl.legend(['f(x)'])
    pl.subplot(1, 2, 2)
    pl.grid(True)
    pl.title('zero')
    pl.plot(x, y2, 'red')
    pl.legend(["f'(x)"])
    pl.show()

Expected:

In [194]:
ε = 2**(-11)  # 11 bits out of 16 used for mantissa
print("Accuracy of minimum: ", np.sqrt(ε*1/0.002))
plot_accuracy_comparison()
Accuracy of minimum:  0.49410588440130926

Parabola method

  • Through three given non-collinear points $a, b, c$ we can draw a parabola.
  • Vertex of the parabola
$$d=b-\frac12\frac{(b-a)^2[f(b)-f(c)]-(b-c)^2[f(b)-f(a)]} {(b-a)[f(b)-f(c)]-(b-c)[f(b)-f(a)]}$$

should approximate a minimum of $f(x)$ if we are close enough to the minimum.

  • Now there are four points $(a', b', c', d')$ (sorted $\{a,b,c,d\}$) limiting three intervals
  • if $f(b')>f(c')$ for next iteration we choose $(b', c', d')$
  • if $f(b')<f(c')$ for the next iteration we choose $(a', b', c')$ (a compact sketch follows below)
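
A compact sketch of this iteration without the plotting; the fixed iteration count is an illustration choice and there is no safeguard against collinear points (zero denominator). The demonstration cells below plot the individual steps.

In [ ]:
# Minimal parabola-method sketch: repeatedly place the vertex d of the parabola
# through (a, b, c) and keep the three of the four sorted points that still
# bracket the minimum.  No safeguards; the iteration count is fixed.
def parabola_min(F, a, b, c, n_iter=20):
    for _ in range(n_iter):
        fa, fb, fc = F(a), F(b), F(c)
        num = (b - a)**2 * (fb - fc) - (b - c)**2 * (fb - fa)
        den = (b - a)    * (fb - fc) - (b - c)    * (fb - fa)
        d = b - 0.5 * num / den                        # vertex of the interpolating parabola
        X = sorted([a, b, c, d])
        FX = [F(xi) for xi in X]
        if FX[1] < FX[2]: a, b, c = X[0], X[1], X[2]   # keep (a', b', c')
        else:             a, b, c = X[1], X[2], X[3]   # keep (b', c', d')
    return b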
In [11]:
def plot_parabola_method(f, a, b, c, step):
    """
        Function not optimized
        used only for presentation purposes
    """
    pl.figure(figsize=(6,3), dpi=150)
    x = np.linspace(0, 6, 1000)    
    X = np.array([a, b, c])
    parabola = (x-b)*(x-c)*f(a)/(a-b)/(a-c) + (x-a)*(x-c)*f(b)/(b-a)/(b-c) + (x-b)*(x-a)*f(c)/(c-b)/(c-a)
    for n in range(step):
        length = c-a
        parabola = (x-b)*(x-c)*f(a)/(a-b)/(a-c) + (x-a)*(x-c)*f(b)/(b-a)/(b-c) + (x-b)*(x-a)*f(c)/(c-b)/(c-a)
        d = b - 0.5 * ((b-a)**2*(f(b)-f(c))-(b-c)**2*(f(b)-f(a)) ) / \
                      ((b-a)*(f(b)-f(c))-(b-c)*(f(b)-f(a)) )
        X = np.array([a, b, c, d])
        X.sort()
        F = f(X)
        if F[1]<F[2]: a, b, c = X[0], X[1], X[2]
        else: a, b, c = X[1], X[2], X[3]    
            
        #print(X, " ", d, f(d))

    pl.plot(x, f(x), 'black')
    pl.ylim([-0.08,0.1])
   

    pl.plot(x, parabola, linestyle='dashed')
    pl.grid(True)
    pl.plot(X,f(X), 'o', ms=3)
    
    pl.show()
In [12]:
plot_parabola_method(f, 0, 1, 3, 0)
In [13]:
plot_parabola_method(f, 0, 1, 3, 1)
In [14]:
plot_parabola_method(f, 0, 1, 3, 2)
In [15]:
plot_parabola_method(f, 0, 1, 3, 3)
In [16]:
plot_parabola_method(f, 0, 1, 3, 10)
  • Better (superlinear) convergence than the golden ratio: $\epsilon_{n+1}\sim \epsilon_n^{1.325}$
  • But it can get stuck ($d\approx c$); then another division (bisection) is used.
  • It can be unstable if the function is very different from a parabola
  • Brent added extra safeguards:
    • if $d\notin (a,c)$,
    • or the new segment would be longer than half of the current segment length, then bisect: $d=\frac12(a+c)$ and proceed with the standard choice of the new segment
In [195]:
def plot_brent_method(f, a, b, c, step):
    """
        Function not optimized
        used only for presentation purposes
    """
    pl.figure(figsize=(6,3), dpi=150)
    x = np.linspace(0, 6, 1000)    
    X = np.array([a, b, c])
    parabola = (x-b)*(x-c)*f(a)/(a-b)/(a-c) + (x-a)*(x-c)*f(b)/(b-a)/(b-c) + (x-b)*(x-a)*f(c)/(c-b)/(c-a)
    bisect = False
    for n in range(step):
        length = c-a
        XXX = (a, b, c)
        bisect = False
        parabola = (x-b)*(x-c)*f(a)/(a-b)/(a-c) + (x-a)*(x-c)*f(b)/(b-a)/(b-c) + (x-b)*(x-a)*f(c)/(c-b)/(c-a)
        d = b - 0.5 * ((b-a)**2*(f(b)-f(c))-(b-c)**2*(f(b)-f(a)) ) / \
                      ((b-a)*(f(b)-f(c))-(b-c)*(f(b)-f(a)) )
        if d<a or d>c: 
            # print("Bisection outside")
            d=0.5*(a+c)
            bisect = True
        X = np.array([a, b, c, d])
        X.sort()
        F = f(X)
        if F[1]<F[2]: a, b, c = X[0], X[1], X[2]
        else: a, b, c = X[1], X[2], X[3]
        
        if c-a>0.5*length: 
            a, b, c = XXX
            d=0.5*(a+c)
            X = np.array([a, b, c, d])
            X.sort()
            F = f(X)
            bisect = True
            #  print("Bisection length")
            if F[1]<F[2]: a, b, c = X[0], X[1], X[2]
            else: a, b, c = X[1], X[2], X[3]
            
        #print(X, " ", d, f(d))

    pl.plot(x, f(x), 'black')
    pl.ylim([-0.08,0.1])
    if bisect==True: pl.text(4, 0.06, "Bisection")
    pl.plot(x, parabola, linestyle='dashed')
    pl.grid(True)
    pl.plot(X,f(X), 'o', ms=3)
    
    pl.show()
In [18]:
plot_brent_method(f, 0, 1, 3, 0)
In [19]:
 plot_brent_method(f, 0, 1, 3, 1)
In [20]:
 plot_brent_method(f, 0, 1, 3, 2)
In [21]:
 plot_brent_method(f, 0, 1, 3, 3)
In [22]:
 plot_brent_method(f, 0, 1, 3, 4)
In [23]:
 plot_brent_method(f, 0, 1, 3, 5)
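
In practice a robust implementation of Brent's method is available in SciPy; a minimal usage sketch, assuming SciPy is installed and using the bracket $(0, 1, 3)$ from the plots above:

In [ ]:
# Minimal usage sketch of SciPy's Brent minimizer (assumes SciPy is installed);
# f is the test function defined at the beginning of the lecture.
from scipy.optimize import minimize_scalar

res = minimize_scalar(f, bracket=(0, 1, 3), method='brent')
print("x_min =", res.x, " f(x_min) =", res.fun)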

Methods using derivatives

Knowing the derivative of $f(x)$ allows us to find a zero of $f'(x)$ instead, which might be slightly more difficult but can be done more precisely (see the sketch below).
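
For example, when both $f'$ and $f''$ are available, Newton's method applied to $f'(x)=0$ converges quadratically near the minimum. A minimal sketch; the tolerance and the iteration limit are illustration choices, and there is no safeguard against $f''$ changing sign (which would drive the iteration towards a maximum):

In [ ]:
# Minimal sketch: Newton's method applied to the equation f'(x) = 0.  Requires
# both derivatives; no safeguard against f'' <= 0.
def newton_minimize(df, d2f, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)     # Newton step for f'(x) = 0
        x -= step
        if abs(step) < tol:       # converged: the last correction is negligible
            break
    return x

For instance, newton_minimize(lambda x: 2*x - 2, lambda x: 2.0, 0.0) finds the minimum of $x^2-2x$ at $x=1$ in a single step.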

Global minimization

Stochastic algorithms

Many stochastic algorithms are based on natural processes:

  • simulated annealing
  • bee swarm
  • ant colony
  • genetic algorithms
  • differential evolution
  • simplex (amoeba) algorithm

Simulated annealing reproduces the physical process of heating a body followed by slow cooling, often used in metallurgy.

  • High temperatures allow the system to move around freely and probe many different states.
  • Cooling decreases the energy (the test function) and allows the system to relax at a minimum.
  • A non-zero temperature allows the system to jump out of the current (most likely only local) minimum.
  • The deeper the local minimum, the more difficult it is to jump out of it.
  • The jump probability decreases as the temperature drops.
  • Finally the lowest-energy state is found (in metallurgy: e.g. a uniform crystal).

Algorithm (details may vary):

  • randomly choose an initial point $x$
  • randomly choose a neighbour $y$ ($|x-y|$ can be proportional to the temperature $T$; in physics a Gaussian distribution)
  • if $f(y)<f(x)$ we accept $y$ as the new solution
  • if $f(y)>f(x)$ we accept $y$ as the new solution with probability $P(f(y)-f(x), T)$; in physics $P$ follows the Boltzmann distribution

    $$P(\Delta E, T) = e^{-\Delta E/T}$$

  • decrease the temperature, e.g. linearly: $T_n = T_0\,(n_{max}-n)/n_{max}$ (a minimal sketch follows below)
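
A minimal 1-dim sketch of the scheme above; the linear cooling schedule, the Gaussian jump scale proportional to $T$ and the iteration count are illustration choices:

In [ ]:
# Minimal 1-dim simulated-annealing sketch.  Cooling schedule, jump scale and
# iteration count are illustration choices; details of the algorithm may vary.
import numpy as np

def simulated_annealing(F, x0, T0=1.0, n_max=1000, seed=0):
    rng = np.random.default_rng(seed)
    x, fx = x0, F(x0)
    best_x, best_f = x, fx
    for n in range(n_max):
        T = T0 * (n_max - n) / n_max          # temperature decreases linearly to zero
        y = x + T * rng.normal()              # random neighbour: Gaussian jump ~ T
        fy = F(y)
        # accept downhill moves always, uphill moves with probability exp(-dE/T)
        if fy < fx or rng.random() < np.exp(-(fy - fx) / max(T, 1e-12)):
            x, fx = y, fy
            if fx < best_f:                   # remember the best point ever visited
                best_x, best_f = x, fx
    return best_x, best_f

Exponential cooling, $T_{n+1}=\alpha T_n$ with $\alpha$ slightly below one, is another common schedule.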

Differential evolution: starting from a set of agents (trial solutions), in each iteration we create a new set by mutating their values. In 1-dim:

  • randomly choose $N$ agents $x_i$, $i=0\ldots N-1$
  • until finished:
    • for each $x$ choose three agents $a$, $b$ and $c$ (usually required to be distinct)
    • calculate $y=a+F\cdot(b-c)$, where $F\in[0,2]$ is the differential weight (usually constant)
    • if $f(y)\leq f(x)$ then $x$ is replaced by $y$
  • the result is the agent with the lowest $f(x)$

In higher dimensions we also choose which coordinates are mutated, with a crossover probability.

In [197]:
def f(x):
    return 4*np.sin(0.5*x**2)+np.cos(10/(x**2+0.5))*(x-5)**2

def show_current_function(X, G, x, g, position, step):
    pl.subplot(3, 2, position)
    pl.plot(X, G, 'black')
    pl.grid(True) 
    pl.plot(x, g, "o", color='orange', ms=5)
    pl.text(1.2, 22, "step {:d}\nmin={:.6f}".format(step, np.amin(g)))
In [198]:
def diff_evolution_example():
    x1, x2, N = -3, 3, 10
    pl.figure(figsize=(12,7), dpi=120)
    # grid for plotting the function and the initial random choice of agents
    X = np.linspace(x1, x2, 1000)
    G = f(X)
    x = x1 + (x2-x1)*np.random.rand(N)
    g = f(x)
    show_current_function(X, G, x, g, 1, 0)
    F = 0.2  # differential weight
    for step in range(10):
        for n in range(N):            
            (a,b,c) = np.random.choice(x, 3)
            y = a + F*(b-c)
            gy = f(y)
            if gy<g[n] and x1<y<x2 :
                x[n], g[n] = y, gy
        if step%2==0:
            show_current_function(X, G, x, g, 2+step//2, step+1)
    pl.show()
In [142]:
diff_evolution_example()
  • This algorithm shows its strength in higher dimensions, where we also choose which directions are updated.
  • It quickly localizes the global minimum, though only approximately (SciPy's own implementation is shown below)
  • Stochastic methods usually converge more slowly
  • It might happen that the agents cluster:
    small differences then do not allow proper sampling (in biology it is bad if all individuals have similar DNA)
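
For comparison, SciPy ships a full implementation of this algorithm. A minimal usage sketch on the same test function and bounds as above, assuming SciPy is installed (the lambda only unpacks the 1-dim parameter vector that SciPy passes to the objective):

In [ ]:
# Minimal usage sketch of SciPy's differential evolution (assumes SciPy is
# installed); f is the multi-well test function defined above and the lambda
# extracts its single coordinate from the parameter vector supplied by SciPy.
from scipy.optimize import differential_evolution

res = differential_evolution(lambda v: float(f(v[0])), bounds=[(-3, 3)], seed=0)
print("x_min =", res.x, " f(x_min) =", res.fun)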

Tasks:

  1. Implement golden ratio search and apply it to some test function
  2. Implement the parabola method (or better, Brent's method)
  3. Implement one of the stochastic methods and test it on some function
In [ ]: