## Convergence of Online Adaptive and Recurrent Optimization Algorithms

We study the **parameterised dynamical system online optimisation algorithm**, which conducts a **gradient descent** on the **parameter**. It is well suited to **modelling** wide classes of training algorithms, especially in **machine learning**. We describe it here when applied to a **recurrent neural network**, and present our results establishing the **convergence** of the **training procedure**.

Let us consider a **recurrent neural network**, whose **activities** at time $t$, $a_t(1)$, $a_t(2),\,\ldots,\,a_t(\text{number of neurons})$, are **updated** at each time step according to
$$a_{t}=\mathrm{F}_\text{eedforward}(a_{t-1},\,w,\,\text{input}^t),$$
where $w$ is the set of **weights** of the connections between neurons, and $\mathrm{F}_\text{eedforward}$ is the **feedforward pass** in the neural net.
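As a concrete illustration, here is a minimal sketch of such a recurrent update, assuming a standard tanh transition with one recurrent and one input weight matrix (the function name and architecture are our choices, not fixed by the text):

```python
import numpy as np

def feedforward(a_prev, w, x):
    """One recurrent step: new activities from the previous activities,
    the weights w = (W_rec, W_in), and the current input x."""
    W_rec, W_in = w
    return np.tanh(W_rec @ a_prev + W_in @ x)

rng = np.random.default_rng(0)
n_neurons, n_inputs = 4, 3
w = (0.1 * rng.standard_normal((n_neurons, n_neurons)),
     0.1 * rng.standard_normal((n_neurons, n_inputs)))

a = np.zeros(n_neurons)            # a_0
for t in range(5):                 # run the network for a few steps
    x_t = rng.standard_normal(n_inputs)
    a = feedforward(a, w, x_t)     # a_t = F(a_{t-1}, w, input_t)
```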

To assess the **quality** of the **weights** with respect to the task at hand, we consider a **loss function** $\ell$, which evaluates the **quality of the prediction** performed by a set of **activities** $a=(a(1),\ldots,a(\text{number of neurons}))$ on some training example, by computing a real number $\ell(\text{example},a)$.

Let us write $\mathbf{a}_t(w)$ for the set of activities of the network at time $t$ computed with weights $w$, and $\ell_{\leadsto t}(w)=\ell(\text{example}^t,\mathbf{a}_t(w))$ for the loss at time $t$, seen as a function of the weights $w$.
**Optimising** the **weights** through **gradient descent** therefore amounts to performing weights updates of the form
$$ w \leftarrow w - \text{stepsize} \times \frac{\partial \ell_{\leadsto t}}{\partial \, w}(w).$$
Since we are working with a **recurrent** system, computing this derivative involves going backwards in time, using **backpropagation through time**.

### Differentiating the loss by backpropagation through time

We know that $$\frac{\partial \ell_{\leadsto t}}{\partial \, w}(w) = \frac{\partial \ell}{\partial a}(\text{example}^t,\mathbf{a}_t(w)) \cdot \frac{\partial \mathbf{a}_t}{\partial w}(w).$$

The differential of the activities at time $t$ has to be computed through **backpropagation through time**:
$$ \frac{\partial \mathbf{a}_{t}}{\partial \, w}(w)
=\frac{\partial F_{\text{eedforward}}}{\partial a}(\mathbf{a}_{t-1},\,w,\,\text{input}^t) \cdot \frac{\partial \mathbf{a}_{t-1}}{\partial \, w}(w)
+\frac{\partial F_{\text{eedforward}}}{\partial w}(\mathbf{a}_{t-1},\,w,\,\text{input}^t).
$$
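This recursion can be checked numerically. Below is a sketch that propagates the Jacobian $\partial \mathbf{a}_t/\partial w$ forward through the recursion and compares one of its columns against finite differences, for an illustrative tanh network of our choosing (not the text's):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
w0 = 0.5 * rng.standard_normal(n * n)      # flattened recurrent weights
xs = rng.standard_normal((6, n))           # a short input sequence
a0 = np.zeros(n)

def run(w_flat):
    """Forward pass, accumulating J_t = d a_t / d w by the recursion above."""
    W = w_flat.reshape(n, n)
    a, J = a0.copy(), np.zeros((n, n * n))
    for x in xs:
        pre = W @ a + x
        D = np.diag(1.0 - np.tanh(pre) ** 2)        # d tanh / d pre
        dF_da = D @ W                                # dF/da at (a_{t-1}, w, x_t)
        dF_dw = D @ np.kron(np.eye(n), a[None, :])   # dF/dw at (a_{t-1}, w, x_t)
        J = dF_da @ J + dF_dw                        # the recursion above
        a = np.tanh(pre)
    return a, J

a_T, J_T = run(w0)

# finite-difference check of one column of the Jacobian
eps, k = 1e-6, 4
w_pert = w0.copy()
w_pert[k] += eps
a_plus, _ = run(w_pert)
assert np.allclose((a_plus - a_T) / eps, J_T[:, k], atol=1e-4)
```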

Once we know how to differentiate the losses $\ell_{\leadsto t}$, we may define the RTRL algorithm.

### The Real Time Recurrent Learning algorithm

The RTRL algorithm proceeds **online** by computing an **approximation** of the differential of the activities with respect to the weights, that is, of their **Jacobian**. It does so by mimicking the backpropagation through time update. In its updates, the approximation coincides with the **true Jacobian** at **order 0** in the **stepsize**: when the weights are not updated, RTRL computes the correct Jacobian.
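A minimal runnable sketch of this online loop follows, with an illustrative tanh network, a constant toy input and target, and a stepsize of our choosing (none of which are from the text). At each step, the Jacobian estimate is updated as if the weights were frozen, and an online gradient step is then taken:

```python
import numpy as np

n = 3
w = np.zeros(n * n)                         # flattened recurrent weights
a, J = np.zeros(n), np.zeros((n, n * n))    # activities and Jacobian estimate
x, y = 0.1 * np.ones(n), 0.5 * np.ones(n)   # constant toy input and target
eta = 0.02                                  # stepsize

losses = []
for t in range(500):
    W = w.reshape(n, n)
    pre = W @ a + x
    D = np.diag(1.0 - np.tanh(pre) ** 2)
    # RTRL step: update the Jacobian estimate with the current weights
    J = D @ W @ J + D @ np.kron(np.eye(n), a[None, :])
    a = np.tanh(pre)
    losses.append(0.5 * np.sum((a - y) ** 2))
    grad = (a - y) @ J                      # dl/dw = dl/da . da/dw
    w = w - eta * grad                      # online gradient step
```

Even this tiny example shows the closed loop described above: the Jacobian estimate feeds the weight update, while its own recursion depends on the weights.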

To prove its convergence, we first need a notion of locally optimal weights.

In **independent** and **identically distributed** settings, we would require that the common **expectation** of the **gradients** of the losses $\ell_{\leadsto t}$ **vanish**, and that the common **expectation** of the **Hessians** of the losses be **positive definite**.

However, since the **inputs** may issue from potentially **infinite time-series**, exhibiting potentially **unbounded time dependencies**, we require instead that **temporal averages** of the **gradients** and **Hessians** of the losses satisfy similar **first** and **second order conditions**.

The exponent $a$ quantifies the **speed** at which the system **averages** itself towards its **mean behaviour** at the **local optimum**.

In turn, this speed determines the **admissible range** of **stepsizes** one may use during **optimisation**. The classical Robbins and Monro conditions require that the stepsize sequence $(\eta_t)$ satisfy $\sum_{t}\eta_t = \infty$ and $\sum_{t} \eta_t^2 < \infty$. The second condition stems from the fact that, in an i.i.d. setting, the **dynamics** at **equilibrium** is the one above, for any $a > 1/2$.
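For instance, polynomial stepsize schedules meet both conditions (we write $b$ for the decay exponent, to avoid a clash with the averaging exponent $a$):
$$\eta_t = \eta_0\, t^{-b}, \qquad \sum_t \eta_t = \infty \iff b \leq 1, \qquad \sum_t \eta_t^2 < \infty \iff b > 1/2,$$
so any $b \in (1/2,\,1]$ is admissible.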

Therefore, our definition of **local optimum** **extends** these conditions to **more general dynamics**.

For the neural network to be able to **"learn"** something, it must **forget** **past activities** which are not the optimal ones, when run with **optimal weights**. We therefore require that it have **spectral radius less than one**. This is a well-known **stability** condition in the study of dynamical systems.

A **linear mapping** $A$ has **spectral radius (strictly) less than $1$** when all its **eigenvalues** have **modulus (strictly) less than $1$**. In that case, there exists some integer $k\geq 1$ such that $A^k$ has **operator norm less than $1$**.
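This distinction matters: a matrix can have large operator norm while all its eigenvalues lie inside the unit disc, in which case only a sufficiently high power is a contraction. A small numerical illustration (the example matrix is ours):

```python
import numpy as np

# A has both eigenvalues equal to 0.9 (modulus < 1),
# yet its operator norm is much larger than 1:
A = np.array([[0.9, 10.0],
              [0.0,  0.9]])
assert np.max(np.abs(np.linalg.eigvals(A))) < 1
assert np.linalg.norm(A, 2) > 1

# Nevertheless, some power of A has operator norm < 1:
P, k = np.eye(2), 0
while np.linalg.norm(P, 2) >= 1 or k == 0:
    P, k = P @ A, k + 1
print(k)                 # the first power that is a contraction
```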

Since we work in a **nonlinear**, and **time-inhomogeneous** setting, we extend the definition slightly, and say the network has **spectral radius less than $1$** if there exist $0 < \alpha \leq 1$ and an integer $k\geq 1$ such that any **consecutive product** of $k$ **differentials** $\frac{\partial F_{\text{eedforward}}}{\partial a}(a_{t-1}^*,w^*,\text{input}^t)$ has **operator norm less than $1-\alpha$**. Here, the activities $a_t^*$ designate the **optimal sequence of activities**, defined by $a_t^*=F_{\text{eedforward}}(a_{t-1}^*,w^*,\text{input}^t)$ for $t\geq 1$, where $w^*$ and $a_0^*$ are introduced in the definition of local optimum.

Together with **regularity assumptions** at the vicinity of the **optimal weights** and **optimal activities**, the **spectral radius** condition enforces **stability** of the system: provided the **weights** remain **close** to the **optimal weights**, the **activities** remain **close** to the **optimal activities**.

Since the **spectral radius** condition is valid only at **horizon** potentially **greater than one**, the sets which are **guaranteed** to contain the activities are more complicated than simple balls: in our work, we call them **stable tubes**.

To prove convergence, we construct an abstract model of a **gradient descent algorithm** applied to a **dynamical system**.

Our abstract model first comprises a **memory variable**, which represents everything which is **maintained in memory** by the system (in the case of neural networks, the **activities** and the **estimation** of the **Jacobian**). The behaviour of the latter depends on a **parameter** (in the case of neural networks, the **weights**). The **transition operator** takes as inputs the **memory variable**, and the **parameter**, and updates the **memory**.

Then, an operator computes **gradients**, which are eventually fed to an operator which **updates** the **parameter**.

The **abstract algorithm** thus alternates a **memory update**, a **gradient computation**, and a **parameter update**.
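This loop can be sketched in code. The following is our reconstruction from the description above, with illustrative operator names; the instantiation below it is a scalar toy system of our own choosing, whose memory holds a state and a Jacobian estimate, as in the RTRL case:

```python
def abstract_descent(memory, theta, transition, gradient, update, data, steps):
    """Abstract online descent: the transition operator updates the memory,
    a gradient is read off the memory, and the parameter is updated in turn."""
    for t in range(steps):
        memory = transition(memory, theta, data(t))   # update the memory
        g = gradient(memory, theta, data(t))          # compute a gradient from it
        theta = update(theta, g, t)                   # update the parameter
    return memory, theta

# Toy instantiation: state s follows s' = 0.5 s + theta, loss (s - 1)^2,
# and the memory also tracks J = ds/dtheta (which converges to 2).
trans = lambda m, th, x: (0.5 * m[0] + th, 0.5 * m[1] + 1.0)
grad  = lambda m, th, x: 2.0 * (m[0] - 1.0) * m[1]
upd   = lambda th, g, t: th - 0.05 * g
m, th = abstract_descent((0.0, 0.0), 0.0, trans, grad, upd, lambda t: None, 500)
# th approaches 0.5, the parameter whose equilibrium state is 1
```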

One of the main **difficulties** in showing that this abstract algorithm **converges** is that "everything moves at the same time": the memory influences the parameter update, through the computation of the gradient, but its own update depends on the parameter.

To remedy this, we **compare** the true **abstract trajectory** to what we call an **open-loop trajectory**. In the latter, **parameter updates** are computed, but **not applied**: the memory and the gradients are computed with a **fixed parameter**. We compare these two trajectories on **time intervals** whose **lengths** depend on the **dynamics** of the **losses**.

The **open-loop trajectory** is amenable to analysis, and the difference with the true trajectory is at **second order** in the quantity controlling convergence, which allows us to obtain convergence of the true trajectory.

Since the RTRL algorithm maintains an **approximation** of the **differential of the state of the system** with respect to the **parameter**, its **storage cost quickly becomes prohibitive**, even for moderately large systems.

The No Back Track and UORO (for "Unbiased Online Recurrent Optimisation") algorithms replace this differential by an **unbiased, rank-one, random approximation**.

They thus **reduce** the required **memory cost**, at the expense of **introducing noise** into the learning process. Our results show that this does not prevent convergence: the **probability** that they **converge**, to the **same optimum** as **RTRL**, is **arbitrarily close to one**, provided the **stepsize** is **small enough**.
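The rank-one trick can be illustrated in isolation: if $\nu$ is a random sign vector with $\mathbb{E}[\nu\nu^\top] = I$, the rank-one matrix $\nu(\nu^\top J)$ has expectation $J$. This is a generic sketch of the unbiasedness property, not the exact No Back Track or UORO construction (which in particular rebalances the two factors to reduce variance):

```python
import numpy as np

rng = np.random.default_rng(3)
J = rng.standard_normal((4, 5))           # the matrix to approximate

def rank_one_estimate(J, rng):
    nu = rng.choice([-1.0, 1.0], size=J.shape[0])   # E[nu nu^T] = I
    return np.outer(nu, nu @ J)                      # rank one, unbiased

# a Monte Carlo average of the estimates recovers J
est = np.mean([rank_one_estimate(J, rng) for _ in range(20000)], axis=0)
assert np.max(np.abs(est - J)) < 0.1
```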

On each interval on which we compare the true trajectory to the open-loop one, the **noise added** by No Back Track or UORO is of size
$$
\text{stepsize}\,\sum_{\text{interval}} \, \xi_t,
$$
where the $\xi_t$'s are **uncorrelated** random variables.

As a result, the size of the **perturbation** is of order $\text{stepsize}\times\sqrt{\text{length of interval}}$, which is **negligible** compared to the **quantity controlling convergence**, equal to $\text{stepsize}\times\text{length of interval}$.
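This scaling can be checked directly: the standard deviation of a sum of $n$ uncorrelated unit-variance terms grows like $\sqrt{n}$, hence is of lower order than $n$. A generic illustration with Gaussian noise (our choice of distribution):

```python
import numpy as np

rng = np.random.default_rng(4)
stepsize, n = 1e-3, 10000                  # n = length of the interval
trials = np.array([stepsize * rng.standard_normal(n).sum()
                   for _ in range(1000)])

drift = stepsize * n                       # the quantity controlling convergence
noise = trials.std()                       # ~ stepsize * sqrt(n)
assert noise < 0.2 * drift                 # the noise is lower order
```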

## Learning the Learning Rate algorithm

The **stochastic gradient algorithm** and its derivatives are the **main optimisation algorithms** used to train **deep systems**. They depend on a **descent step size** (or "**learning rate**"). Unfortunately, the **descent trajectory** is **highly sensitive to this value**, as may be seen on the animation on the Home page. The Learning the Learning Rate algorithm **adaptively updates the step size**, by conducting a **gradient descent on it**, at a **cost similar** to that of the **original algorithm** on top of which it is implemented.
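A hedged sketch of the idea, in the spirit of hypergradient-style updates (this is not the exact update rule of the algorithm: the choice of adapting the log step size via the sign of the correlation between successive gradients is our simplification):

```python
import numpy as np

def sgd_adapt_lr(grad, theta, eta=1e-3, meta=0.02, steps=200):
    """SGD whose step size is itself adapted online: eta is increased when
    successive gradients point the same way, decreased otherwise.
    A simplified sketch, not the exact update of the paper."""
    g_prev = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta)
        eta *= np.exp(meta * np.sign(g @ g_prev))  # gradient-like step on log eta
        theta = theta - eta * g
        g_prev = g
    return theta, eta

# toy quadratic loss 0.5 * ||theta||^2, whose gradient is theta itself
theta, eta = sgd_adapt_lr(lambda th: th, np.ones(3))
```

On this toy problem the step size grows from its deliberately small initial value while the gradients stay aligned, so the iterate reaches the optimum far faster than plain SGD with the initial step size would.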

## Adaptive confidence bands in nonparametric regression

We show that **adaptive confidence bands** exist in **the nonparametric fixed design regression model**. In the course of the proof, we show that **sup-norm adaptive estimators** exist as well in **regression**.