$$ \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} $$

I recently developed an interest in system identification for dynamical systems. This blog post is a brief, and by no means exhaustive, introduction to the basics of some classical methods for system identification. For a thorough treatment of the subject, I recommend reading Ljung (1998) and Verhaegen (2007). All of the results below can be found in these books.

Prediction Error Methods

Let's start with parametric estimation. Given a dataset $\left[u(1), y(1), \ldots, u(N), y(N)\right]$ and the predictor: $$ \begin{align} \label{eq:system} \hat{x}(t+1) = A \hat{x}(t) + B u(t)\\ \hat{y}(t) = C \hat{x}(t) + D u(t), \end{align} $$ we want to find matrices $A, B, C, D$ such that $\hat{y}(t)$ is close to $y(t)$ for $t = 1, \ldots, N$. Note that the initial state $\hat{x}(0)$ is also a parameter to be estimated.
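
To make the predictor concrete, here is a minimal sketch of the recursion above in Python with NumPy (the function name and array shapes are my own conventions, not from the references):

```python
import numpy as np

def simulate_predictor(A, B, C, D, x0, u):
    """Run the predictor recursion y_hat(t) = C x(t) + D u(t), x(t+1) = A x(t) + B u(t).

    u has shape (N, m); the returned y_hat has shape (N, l).
    """
    x = x0
    y_hat = []
    for u_t in u:
        y_hat.append(C @ x + D @ u_t)  # output at time t uses the current state
        x = A @ x + B @ u_t            # then propagate the state
    return np.asarray(y_hat)
```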

There are different ways to parameterize this system, but one simple way is to let each entry of the matrices $A, B, C, D$ be an entry of the parameter vector $\theta$. If the dimensions of $\hat{x}$, $\hat{y}$, and $u$ are $n$, $l$, and $m$ respectively, the number of parameters is $ p = n^2 + n(m+l) + ml$ ($+ n$ if the initial state is also estimated). Since we want $\hat{y}(t)$ to be "close" to $y(t)$, we can use the following quadratic cost function (any norm on $\mathbb{R}^l$ would work): $$ \begin{align} J_N(\theta) = \sum_{t=1}^N \lVert y(t) - \hat{y}(t, \theta) \rVert_{2}^{2}. \label{eq:obj} \end{align} $$ We'd like to minimize Equation (\ref{eq:obj}), and thus the parameter estimate is defined as: $$ \begin{align} \hat{\theta}_N = \argmin_{\theta \in \Omega} J_N(\theta), \label{eq:opt} \end{align} $$ where $\Omega \subset \mathbb{R}^p$ is the set of permissible parameter vectors. For example, we might want to constrain the matrices so that the system is stable. In general, $J_N(\theta)$ is not convex (as an exercise, compute the Hessian $\nabla^2 J_N(\theta)$ and find an $x$ such that $x^T\nabla^2 J_N(\theta) x < 0$), and solving for the global minimum can be computationally intensive. For a brief description of why non-convex optimization can be difficult, take a look at Ben Recht's blog post. Several gradient-based optimization methods, such as Gauss-Newton, have been employed to solve (\ref{eq:opt}). Any method that solves problems of the form (\ref{eq:opt}) falls under the umbrella of prediction-error identification methods (PEM).
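
As an illustration, here is a sketch of a PEM fit using `scipy.optimize.least_squares` (which takes Gauss-Newton-type steps), assuming the dense parameterization described above with $\Omega = \mathbb{R}^p$ (no stability constraint); the helper names and the toy data-generating system are my own and are only there to make the snippet runnable:

```python
import numpy as np
from scipy.optimize import least_squares

def unpack(theta, n, m, l):
    """Split the parameter vector theta into (A, B, C, D, x0)."""
    sizes = [n * n, n * m, l * n, l * m, n]
    A, B, C, D, x0 = np.split(theta, np.cumsum(sizes)[:-1])
    return (A.reshape(n, n), B.reshape(n, m),
            C.reshape(l, n), D.reshape(l, m), x0)

def residuals(theta, u, y, n):
    """Stacked prediction errors y(t) - y_hat(t, theta)."""
    m, l = u.shape[1], y.shape[1]
    A, B, C, D, x = unpack(theta, n, m, l)
    r = np.empty_like(y)
    for t in range(len(u)):
        r[t] = y[t] - (C @ x + D @ u[t])   # prediction error at time t
        x = A @ x + B @ u[t]               # predictor state update
    return r.ravel()

# Toy data from a simple single-state SISO system (used only for illustration).
rng = np.random.default_rng(0)
N, n, m, l = 200, 1, 1, 1
u = rng.standard_normal((N, m))
y = np.zeros((N, l))
x = np.zeros(n)
for t in range(N):
    y[t] = 0.5 * x + 0.1 * u[t]
    x = 0.9 * x + 1.0 * u[t]

theta0 = 0.1 * rng.standard_normal(n * n + n * m + l * n + l * m + n)
fit = least_squares(residuals, theta0, args=(u, y, n))
A_hat, B_hat, C_hat, D_hat, x0_hat = unpack(fit.x, n, m, l)
```

Because the estimate is only determined up to the choice of state coordinates (see the similarity transformations below), the fitted matrices need not match the data-generating ones entry-wise, even when the fit is good.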

A natural question arises: how good is the estimate $\hat{\theta}_N$ as we get more data? We consider the asymptotic regime as $N \rightarrow \infty$. The basic result (under some non-restrictive assumptions about the data, cost criterion, and system) is: $$ \begin{align*} \hat{\theta}_N \to \argmin_{\theta \in \Omega} \mathbb{E}\ \ell\left(y(t) - \hat{y}(t, \theta)\right), \;\text{with probability 1 as}\ N \to \infty, \end{align*} $$ where $\ell$ is a suitable distance measure (such as the squared norm used above) and the expectation is taken over all possible random disturbances that affect the data. This is a strong result that tells us the estimate converges to the best possible approximation of the true system, subject to the constraints in $\Omega$ and our choice of parameterization. For a more precise statement of the theorem and a proof, check out the paper by Ljung (1978).

Subspace Identification Methods

The major issue with PEM is that the objective is non-convex and prone to many local minima, which makes the optimization difficult. In this section, we describe a non-parametric method that recovers the matrices ($A, B, C, D$) from the Markov parameters of the linear system. For brevity, we will only describe the algorithm for deterministic linear systems, to give a flavour of subspace identification methods.

Consider a linear time-invariant state-space system: $$ \begin{align} \label{eq:lti1} x(t+1) = A x(t) + B u(t)\\ y(t) = C x(t) + D u(t),\label{eq:lti2} \end{align} $$ with $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$, $C \in \mathbb{R}^{l \times n}$, and $D \in \mathbb{R}^{l \times m}$.

First, we need to define observability, reachability, Markov parameters, similarity transformations, and the Hankel matrix.

Definition 1: The system in (\ref{eq:lti1})-(\ref{eq:lti2}) is reachable if for any two states $x_a$ and $x_b$, there exists an input signal $u(k)$ for $k_a \leq k \leq k_b$ that will transfer the system from state $x(k_a) = x_a$ to $x(k_b) = x_b$.

Informally, reachability tells us whether the input can steer the system from any state to any other state in a finite time interval. The system is reachable if and only if the matrix $\mathcal{C}_{n}$ has rank $n$, where $$ \begin{equation}\mathcal{C}_{n}=\left[\begin{array}{llll} B & A B & \cdots & A^{n-1} B \end{array}\right].\end{equation} $$
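
A quick numerical check of this rank condition, sketched in Python with NumPy (the example matrices are an arbitrary choice of mine and will be reused below):

```python
import numpy as np

def reachability_matrix(A, B):
    """Stack [B, AB, ..., A^(n-1) B] column-wise."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

# A two-state example: rank n = 2, so the pair (A, B) is reachable.
A = np.array([[0.9, 1.0], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
print(np.linalg.matrix_rank(reachability_matrix(A, B)))  # prints 2
```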

Definition 2: The system in (\ref{eq:lti1})-(\ref{eq:lti2}) is observable if any initial state $x(k_a)$ can be uniquely determined by the corresponding zero-input response $y(k)$ for $k_a \leq k \leq k_b$ with $k_b$ finite.

Roughly, observability tells us whether we can estimate the state $x(t)$ by observing the outputs $y(t)$. The system is observable if and only if the matrix $\mathcal{O}_{n}$ has rank $n$, where $$ \begin{equation}\mathcal{O}_{n}=\left[\begin{array}{c} C \\ C A \\ \vdots \\ C A^{n-1} \end{array}\right].\end{equation} $$
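
The observability matrix admits the same kind of check; a sketch using the same example $A$, with a $C$ of my own choosing:

```python
import numpy as np

def observability_matrix(A, C):
    """Stack [C; CA; ...; CA^(n-1)] row-wise."""
    n = A.shape[0]
    blocks = [C]
    for _ in range(n - 1):
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)

# The same two-state example: rank n = 2, so the pair (A, C) is observable.
A = np.array([[0.9, 1.0], [0.0, 0.8]])
C = np.array([[1.0, 0.0]])
print(np.linalg.matrix_rank(observability_matrix(A, C)))  # prints 2
```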

Definition 3: The Markov parameters of system (\ref{eq:lti1})-(\ref{eq:lti2}) are defined as: \begin{equation}h(t)=\left\{\begin{array}{ll} \label{eq:markov} D, & t=0 \\ C A^{t-1} B, & t>0 \end{array}\right.\end{equation}

This is essentially the impulse response of the system with the initial state assumed to be zero: the impulse response is the output of the system when the input is a unit impulse, i.e., $u(0) = 1$ and $u(t) = 0$ for $t > 0$ (applied to one input channel at a time in the multi-input case).
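
As a sanity check, here is a sketch (continuing the running example, with a $D$ of my own choosing) that computes the Markov parameters from the definition and verifies that they match the simulated impulse response:

```python
import numpy as np

def markov_parameters(A, B, C, D, T):
    """Return [h(0), h(1), ..., h(T)] with h(0) = D and h(t) = C A^(t-1) B."""
    h = [D]
    AkB = B                      # holds A^(t-1) B
    for _ in range(T):
        h.append(C @ AkB)
        AkB = A @ AkB
    return h

A = np.array([[0.9, 1.0], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.1]])
h = markov_parameters(A, B, C, D, T=5)

# Simulate the impulse response (zero initial state) and compare.
x = np.zeros((2, 1))
for t in range(6):
    u = np.array([[1.0]]) if t == 0 else np.array([[0.0]])
    assert np.allclose(C @ x + D @ u, h[t])
    x = A @ x + B @ u
```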

An important aspect of this problem is that there are different state representations $x(t)$ that can give the same input-output behaviour between $u(t)$ and $y(t)$. To see this, consider transforming the state in the system (\ref{eq:lti1})-(\ref{eq:lti2}) as follows: $$ x_T(t) = T^{-1} x(t), $$ where $T$ is an arbitrary invertible matrix called a similarity transformation. The system corresponding to this new state (with the same input-output behaviour) is: $$ \begin{align} \label{eq:lti3} x_T(t+1) = A_T x_T(t) + B_T u(t)\\ y(t) = C_T x_T(t) + D_T u(t),\label{eq:lti4} \end{align} $$ where $$ \begin{equation*}\begin{array}{ll} A_{T}=T^{-1} A T, & B_{T}=T^{-1} B \\ C_{T}=C T, & D_{T}=D. \end{array}\end{equation*} $$ Thus, we can only hope to recover the system matrices up to a similarity transformation $T$.
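
This invariance is easy to verify numerically; a sketch using the running example and a random invertible $T$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 1.0], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.1]])

# Any invertible T gives a different realization with the same Markov parameters.
T = rng.standard_normal((2, 2))   # invertible with probability one
Ti = np.linalg.inv(T)
A_T, B_T, C_T, D_T = Ti @ A @ T, Ti @ B, C @ T, D

for t in range(1, 6):
    h = C @ np.linalg.matrix_power(A, t - 1) @ B
    h_T = C_T @ np.linalg.matrix_power(A_T, t - 1) @ B_T
    assert np.allclose(h, h_T)
assert np.allclose(D, D_T)
```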

The last definition we require is the Hankel matrix $\mathcal{H}_{n+1,n+1}$. It is constructed from the Markov parameters as follows: $$ \begin{equation}\mathcal{H}_{n+1, n+1}=\left[\begin{array}{cccc} h(1) & h(2) & \dots & h(n+1) \\ h(2) & h(3) & \dots & h(n+2) \\ \vdots & \vdots & \ddots & \vdots \\ h(n+1) & h(n+2) & \dots & h(2 n+1) \end{array}\right]\end{equation} $$
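
Given the Markov parameters, building this matrix is a short exercise; a sketch (the helper name is mine):

```python
import numpy as np

def hankel_from_markov(h, n):
    """Build the block-Hankel matrix from h(1), ..., h(2n+1),
    where h is a list of l x m arrays indexed so that h[t] = h(t)."""
    return np.block([[h[i + j + 1] for j in range(n + 1)]
                     for i in range(n + 1)])
```

Combined with the `markov_parameters` sketch above, `hankel_from_markov(markov_parameters(A, B, C, D, T=2 * n + 1), n)` constructs $\mathcal{H}_{n+1,n+1}$ for the running example.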

Provided the system (\ref{eq:lti1})-(\ref{eq:lti2}) is observable and reachable, the Hankel matrix has two properties:

  1. rank$(\mathcal{H}_{n+1,n+1}) = n$.
  2. $\mathcal{H}_{n+1,n+1} = \mathcal{O}_{n+1} \mathcal{C}_{n+1}$.

Plugging the definition of the Markov parameters (\ref{eq:markov}) into the Hankel matrix shows that it is the product of the observability and reachability matrices, which is the second property. The first property follows from the fact that the reachability and observability matrices each have rank $n$, combined with Sylvester's rank inequality.
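
Both properties are easy to check numerically on the running example; a sketch assuming NumPy:

```python
import numpy as np

A = np.array([[0.9, 1.0], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.1]])
n = 2

# Markov parameters h(0), ..., h(2n+1) and the Hankel matrix built from them.
h = [D] + [C @ np.linalg.matrix_power(A, t - 1) @ B for t in range(1, 2 * n + 2)]
H = np.block([[h[i + j + 1] for j in range(n + 1)] for i in range(n + 1)])

O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n + 1)])  # O_{n+1}
R = np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n + 1)])  # C_{n+1}

assert np.linalg.matrix_rank(H) == n   # property 1
assert np.allclose(H, O @ R)           # property 2
```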

We are now ready to state the problem. Given a dataset $\left[u(1), y(1), \ldots, u(N), y(N)\right]$ generated by the reachable and observable system (\ref{eq:lti1})-(\ref{eq:lti2}), we want to find the system matrices $A, B, C, D$ up to a similarity transformation.

There are two steps to recovering the system matrices:

  1. Estimate the Markov parameters from the dataset.
  2. Given the Markov parameters, estimate the system matrices.

There exist multiple methods to estimate the Markov parameters from input-output data, and in this post we will focus on the second step of the algorithm, assuming we already have an estimate of the Markov parameters. This step is based on the work of Ho and Kalman (1966); a code sketch of the full procedure is given after the steps below.

  1. First, we recover $D_T$ immediately, since $h(0) = D = D_T$.
  2. We construct the Hankel matrix as defined above using the Markov parameters and compute its singular value decomposition: $$ \begin{equation}\mathcal{H}_{n+1, n+1}=\left[\begin{array}{ll} U_{n} & \bar{U}_{n} \end{array}\right]\left[\begin{array}{cc} \Sigma_{n} & 0 \\ 0 & 0 \end{array}\right]\left[\begin{array}{c} V_{n}^{\mathrm{T}} \\ \bar{V}_{n}^{\mathrm{T}} \end{array}\right]=U_{n} \Sigma_{n} V_{n}^{\mathrm{T}},\end{equation} $$ where $\Sigma_{n} \in \mathbb{R}^{n \times n}$ is full rank and thus invertible. A decomposition of this form exists because rank$(\mathcal{H}_{n+1,n+1}) = n$, with $n < (n+1)l$ and $n < (n+1)m$, the numbers of rows and columns of the Hankel matrix.
  3. Since $\mathcal{H}_{n+1,n+1} = \mathcal{O}_{n+1} \mathcal{C}_{n+1}$, right-multiplying $U_n \Sigma_n V_n^T = \mathcal{O}_{n+1} \mathcal{C}_{n+1}$ by $V_n \Sigma_n^{-1}$ gives $$ \begin{equation}U_{n}=\mathcal{O}_{n+1} T=\left[\begin{array}{c} C T \\ C T T^{-1} A T \\ \vdots \\ C T\left(T^{-1} A T\right)^{n} \end{array}\right]=\left[\begin{array}{c} C_{T} \\ C_{T} A_{T} \\ \vdots \\ C_{T} A_{T}^{n} \end{array}\right],\end{equation} $$ where $T = \mathcal{C}_{n+1} V_n \Sigma_n^{-1}$ is an invertible $n \times n$ matrix.
  4. We can now recover $C_T$ from the first $l$ rows of $U_n$.
  5. We can recover $A_T$ by solving the linear equation $U_n\left[1:nl, :\right] A_T = U_n\left[l+1:(n+1)l,:\right]$ (in the least-squares sense when it is overdetermined); the solution is unique because the first $nl$ rows of $U_n$ form $\mathcal{O}_{n} T$, which has rank $n$.
  6. Finally, we can recover $B_T$ as the first $m$ columns of $\Sigma_n V_n^T$. To see this, substitute $\mathcal{O}_{n+1} = U_n T^{-1}$ and multiply on the left by $U_n^T$ (whose columns are orthonormal): $$ \begin{equation}\begin{aligned} U_n \Sigma_n V_n^T &= \mathcal{O}_{n+1} \mathcal{C}_{n+1} \\ \Sigma_{n} V_{n}^{T} &=T^{-1} \mathcal{C}_{n+1} \\ &=\left[T^{-1} B \quad T^{-1} A T T^{-1} B \quad \cdots \quad\left(T^{-1} A T\right)^{n} T^{-1} B\right] \\ &=\left[\begin{array}{cccc} B_{T} & A_{T} B_{T} & \cdots & A_{T}^{n} B_{T} \end{array}\right] \end{aligned}\end{equation} $$
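
Putting the steps together, here is a minimal sketch of the procedure, assuming the order $n$ is known and exact Markov parameters are available (the function name is mine); the usage at the end verifies on the running example that the recovered realization reproduces the Markov parameters:

```python
import numpy as np

def ho_kalman(h, n, m, l):
    """Recover (A_T, B_T, C_T, D_T) from Markov parameters h(0), ..., h(2n+1).

    h is a list of l x m arrays with h[t] = h(t).
    """
    D_T = h[0]                                                   # step 1
    H = np.block([[h[i + j + 1] for j in range(n + 1)]
                  for i in range(n + 1)])                        # Hankel matrix
    U, s, Vt = np.linalg.svd(H)                                  # step 2
    Un, Sn, Vtn = U[:, :n], np.diag(s[:n]), Vt[:n, :]            # step 3's T is implicit
    C_T = Un[:l, :]                                              # step 4
    A_T = np.linalg.lstsq(Un[:n * l, :], Un[l:(n + 1) * l, :],
                          rcond=None)[0]                         # step 5 (shift equation)
    B_T = (Sn @ Vtn)[:, :m]                                      # step 6
    return A_T, B_T, C_T, D_T

# Usage on the running example.
A = np.array([[0.9, 1.0], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.1]])
n, m, l = 2, 1, 1
h = [D] + [C @ np.linalg.matrix_power(A, t - 1) @ B for t in range(1, 2 * n + 2)]

A_T, B_T, C_T, D_T = ho_kalman(h, n, m, l)
assert np.allclose(D_T, D)
for t in range(1, 2 * n + 2):
    assert np.allclose(C_T @ np.linalg.matrix_power(A_T, t - 1) @ B_T, h[t])
```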