---
title: "gmfd"
author: "Andrea Martino"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{gmfd}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction
Package __gmfd__ (Generalized Mahalanobis Functional Distance) is an R package which gathers statistical methods for both inference and clustering of functional data based on a generalization of Mahalanobis distance. The package supports both univariate and multivariate functional data.

# Simulation of functional data
Let us consider a process $X(t) \in L^2(I)$, with $I \in \mathbb{R}$, with mean function $m(t) = \mathbb{E}[X(t)]$ and covariance operator $V$, i.e. V is a linear compact integral operator from $L^2(I)$ to $L^2(I)$ such that $(Va)(s) = \int_I v(s,t) a(t) dt \quad \forall a \in L^2(I)$, where $v$ is the covariance function defined as $v = \mathbb{E}[(X(t) - m(t))(X(s)-m(s))]$. Then, denote with $\{\rho_k, k \ge 1\}$ and $\{\theta_k, k \ge 1\}$ respectively the sequences of eigenvalues and eigenfunctions of $v$. The Karhunen-Loève expansion decomposes the process $X(t)$ in a sum of its mean $m(t)$ and the series of orthonormal functions $\theta_k(t)$, each one multiplied by zero-mean uncorrelated random variables $\sqrt{\rho_k} Z_k$, ($\rho_k>0, \mathbb{V}ar(Z_k) = 1)$. Then we can write 
\begin{equation*}
X(t) = m(t) + \sum_{k=1}^\infty Z_{k} \sqrt{\rho_k} \cdot \theta_k(t)
\end{equation*} 
The function `gmfd_simulate( size, mean, covariance, rho, theta )` can simulate an univariate sample of functional data where `size` represents the number of elements to be generated while `mean` is the center of the distribution. The user can choose to use the argument `covariance` for the covariance of the data or alternatively the sequences of eigenvalues and eigenfunctions of the covariance matrix.

```{r, fig.show='hold'}
library( gmfd )

# Define the parameters
n <- 50
P <- 100
K <- 150

# Grid of the functional dataset
t <- seq( 0, 1, length.out = P )

# Define the means and the parameters to use in the simulation
m1 <- t^2 * ( 1 - t )

rho <- rep( 0, K )
theta <- matrix( 0, K, P )
for ( k in 1:K ) {
  rho[k] <- 1 / ( k + 1 )^2
  if ( k%%2 == 0 )
    theta[k, ] <- sqrt( 2 ) * sin( k * pi * t )
  else if ( k%%2 != 0 && k != 1 )
    theta[k, ] <- sqrt( 2 ) * cos( ( k - 1 ) * pi * t )
  else
    theta[k, ] <- rep( 1, P )
}

# Simulate the functional data
X <- gmfd_simulate( size = n, mean = m1, rho = rho, theta = theta )

```

# `S3` class `funData` for functional data

For ease of manipulation and visualization, a `S3` class for both univariate and multivariate functional data has been created. A `funData` object represents a functional data and it is defined by the function `funData( grid, F )` where `grid` is the grid of evenly spaced points over which the functional data is defined and `F` is a matrix (if it is a univariate functional data) or a list (if it is a multivariate functional data).
A functional data as it has been just described can be then represented by using the function `plot.funData` which takes as argument all the usual customisable graphical parameters.

```{r, fig.show='hold', fig.align ='center'}

# Create a functional data object
FD1 <- funData( t, X )

# Plot the funData object
plot( FD1, col = "blue" , xlab = "grid", ylab = "data")

```


# Generalized Mahalanobis Distance

Let us consider a $J$-dimensional sample of $n$ realizations $\mathbf{X}_1(t), ...,\mathbf{X}_n(t)$ of a stochastic process in $(L^2(I))^J$, with $J\geq 1$. Let $\bar{\mathbf{X}}_n(t) = n^{-1}(\mathbf{X}_1(t) + \ldots + \mathbf{X}_n(t))$ be the empirical mean. The estimated covariance function is defined as follows:
\begin{equation}
\label{hat_v}
\hat{v}(s,t) := \frac{1}{n-1} \sum_{i=1}^n \big(\mathbf{X}_i(s) - \bar{\mathbf{X}}_n(s)\big)\big(\mathbf{X}_i(t) - \bar{\mathbf{X}}_n(t)\big)^\top,
\end{equation}
from which we can compute the sequences of its eigenfunctions $\{\hat{\boldsymbol{\varphi}}_{k}=(\hat{\varphi}_k^{(1)},..., \hat{\varphi}_k^{(J)})^\top,\, k\geq 1\}$ and the associated eigenvalues $\{\hat{\lambda}_k;\, k \geq 1\}$. Since in this case the covariance function is computed using $n$ curves, we have $\hat{\lambda}_k = 0$ for all $k\geq n$, and hence the functions $\{\hat{\boldsymbol{\varphi}}_k;\, k \geq n\}$ can be arbitrary chosen such that $\{\hat{\boldsymbol{\varphi}}_k;\, k \geq 1\}$ is an orthonormal basis of $(L^2(I))^J$. \par
The empirical version of the generalized Mahalanobis distance based on the covariance estimator $\hat{v}$ can be written as follows:
\begin{equation}\begin{aligned}
\label{def:hat_dp}
\hat{d}^2_p(\textbf{X}_i(t),\textbf{X}_j(t)) &= \sum_{k=1}^{\text{min}\{n-1,P\}} \hat{d}^2_{M,k}(\textbf{X}_i(t), \textbf{X}_j(t)) \hat{h}_k(p) \\
& + \sum_{k=\text{min}\{n-1,P\}+1}^{P} p \Big( \langle \textbf{X}_i(t) - \textbf{X}_j(t), \hat{\boldsymbol{\varphi}}_k \rangle \Big)^2,
\end{aligned}
\end{equation}
where $P$ is the length of the independent variable grid, while $\hat{d}^2_{M,k}(\cdot,\cdot)$ and $\hat{h}(p)$ represent the estimates of the square of the contribution to this distance along the $k$-th component and the regularizing function, respectively.
The function `funDist( FD1, FD2, metric, p, lambda, phi )` computes the distance between two functional data `FD1` and `FD2` by using the chosen `metric`. The last three parameters are used only for the generalized Mahalanobis distance.

```{r, fig.show = 'hold'}
# We estimate the eigenvalues and eigenfunctions of the covariance matrix of data
lambda <- eigen( cov( FD1$data[[1]] ) )$values
phi <- eigen( cov( FD1$data[[1]] ) )$vectors

# Extract two curves from the samples to compute the distance between them
x <- funData( t, FD1$data[[1]][1, ] )
y <- funData( t, FD1$data[[1]][2, ] )

distance <- funDist( x, y, metric = "mahalanobis", p = 10^5, lambda, phi )
distance
```

It is also possible to compute the dissimilarity matrix of a given sample by using the function `gmfd_diss( FD, metric, p )`.

# Inference on the means of functional data: two sample hypotesis tests

We want now to compare the means of two functional data samples. We simulate two samples $\mathbf{X}_1(t), ...,\mathbf{X}_{n_1}(t)$ and $\mathbf{Y}_{n_2}(t), ...,\mathbf{Y}_{n_2}(t)$ and consider the following asymptotic hypotesis test:
\begin{equation}
H_0 : m_{1} = m_{2} \qquad \text{vs} \qquad H_1: m_{1} \neq m_{2}.
\end{equation}
where $m_1$ and $m_2$ are the real means of the two simulated samples. We can infer on the means of the two samples by using the function `gmfd_test(FD1, FD2, conf.level, stat_test, p)` where we have the two samples `FD1` and `FD2`, the confidence level for the test `conf.level`, a string to choose the test statistic to use `stat_test` and the parameter of the regularizing function `p`. The function then returns the value of the test statistics, the value of the quantile and the p-value for the test.

```{r, fig.show = 'hold'}

# Simulate another functional sample
s <- 0
for ( k in 4:K ) {
  s <- s + sqrt( rho[k] ) * theta[k, ]
}

m2 <- m1 + s
Y <- gmfd_simulate( n, m2, rho = rho, theta = theta )
FD2 <- funData( t, Y )

test_output <- gmfd_test(FD1, FD2, conf.level = 0.95, stat_test = "mahalanobis", p = 10^5)
test_output
```

# Clustering: the k-means algorithm

The functional $k$-means clustering algorithm is an iterative procedure, alternating a step of cluster assignment, where all the curves are assigned to a cluster, and a step of centroid calculation, where a relevant functional representative (the centroid) for each cluster is identified. More precisely, the algorithm is initialized by fixing the number $k$ of clusters and by randomly selecting a set of $k$ initial centroids $\{\boldsymbol{\chi}_1^{(0)}(t), \ldots , \boldsymbol{\chi}_k^{(0)}(t)\}$ among the curves of the dataset. Given this initial choice, the algorithm iteratively repeats the two basic steps mentioned above. Formally, at the $m^{th}$ iteration of the algorithm, $m\geq 1$, the two following steps are performed:

- __Step 1 (cluster assignment step)__: each curve is assigned to the cluster with the nearest centroid at the $(m-1)^{th}$ iteration, according to the distance $\hat{d}_p$. Formally, the $m^{th}$ cluster assignment $C_i^{(m)}$ of the $i^{th}$ statistical unit, for $i=1,\ldots,n$, can be written as follows:
\begin{equation}	C_i^{(m)} := \underset{l=1,\ldots,k}{\operatorname{argmin}}\:\hat{d}_p(\textbf{X}_i(t), \boldsymbol{\chi}_l^{(m-1)}(t));
\end{equation} 

- __Step 2 (centroid calculation step)__: the computation of the centroids  at the $m^{th}$ iteration is performed by solving the optimization problems: for any $l=1,\ldots,k$,
\begin{equation}	\boldsymbol{\chi}_l^{(m)}(t) := \underset{\boldsymbol{\chi} \in (L^2(I))^J}{\operatorname{argmin}}  \sum_{i:C_i^{(m)} = l} \hat{d}_p(\textbf{X}_i(t),\boldsymbol{\chi}(t))^2,
\end{equation}	
where $C_i^{(m)}$ is the cluster assignment of the $i^{th}$ statistical unit at the $m^{th}$ iteration.
The algorithm stops when the same cluster assignments are obtained at two subsequent iterations, i.e. the set of cluster assignments $\{C_1^{(\bar{m})},\ldots,C_n^{(\bar{m})}\}$ and the set of centroids $\{\boldsymbol{\chi}_1^{(\bar{m})}(t),\ldots, \boldsymbol{\chi}_k^{(\bar{m})}(t)\}$ are considered final solutions of the algorithm if $\bar{m}$ is the minimum integer such that $C_i^{(\bar{m}+1)} \equiv C_i^{(\bar{m})}$ for all $i=1,\ldots,n$.

We apply the procedure by merging the two samples $\mathbf{X}_1(t), ...,\mathbf{X}_{n_1}(t)$ and $\mathbf{Y}_{n_2}(t), ...,\mathbf{Y}_{n_2}(t)$ previously simulated using the function `gmfd_kmeans( FD, n.cl , metric, p )` where `n.cl` is the number of cluster. It returns a vector of the clusters and a vector or a list of vectors of the centers, other than a plot of the clusters along with their empirical means.

```{r, fig.show='hold', fig.width=6,  fig.align ='center'}

# We estimate the eigenvalues and eigenfunctions of the covariance matrix of all merged data
lambda <- eigen( cov( rbind( FD1$data[[1]], FD2$data[[1]] ) ) )$values
phi <- eigen( cov ( rbind( FD1$data[[1]], FD2$data[[1]] ) ) )$vectors

# Functional data sample containing the merged samples
FD <- funData( t, rbind( X, Y ) )

kmeans_output <- gmfd_kmeans( FD, n.cl = 2, metric = "mahalanobis", p = 10^5 )

```