Analytics / By Yunzhu Liu

##### Index selection using PCA

One of the common problems we face in token analysis is the lack of a market index that can be used as a benchmark. This brings many difficulties such as being unable to determine a specific token’s risk and returns relative to the market. In this article, we aim to create our own index through Principal Component Analysis.

**What is PCA?**

Principal Component Analysis (PCA) is a useful tool for exploratory data analysis. It is usually used on large datasets with multiple variables per sample, which cannot be easily analysed using simple regression models. With PCA, we reduce the dimensionality of large datasets while preserving most of the information. This is done by finding the eigenvector and eigenvalue pairs of the dataset's covariance matrix. Each eigenvector represents a direction of variance, and the corresponding eigenvalue gives the magnitude of variance along that direction. In simpler terms, we analyse the variables that make up a complex dataset, study the variance contributed by each variable, and simplify the original dataset into a smaller one consisting of fewer variables, the “principal components”.

**Procedure**

While the mathematics behind PCA is relatively complex and may be difficult to follow, having an idea of how it works helps us apply this useful tool better. In general, PCA is conducted in four steps:

Step 1: Standardisation of data

Standardisation ensures that all initial variables contribute equally to the analysis, reducing the bias caused by variables with different numerical scales. It is done by subtracting the mean from each value of a variable and dividing by that variable's standard deviation.
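As a minimal sketch of this step (the return figures below are made up for illustration), standardisation turns each column into z-scores:

```python
import numpy as np

# Hypothetical daily-return matrix: rows are days, columns are tokens.
X = np.array([[ 0.01, -0.02,  0.03],
              [ 0.02,  0.01, -0.01],
              [-0.01,  0.04,  0.00],
              [ 0.03, -0.03,  0.02]])

# Standardise each column: subtract its mean and divide by its standard
# deviation, so every token contributes on the same scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(Z.mean(axis=0))           # each column mean is ~0 after standardising
print(Z.std(axis=0, ddof=1))    # each column standard deviation is 1
```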

Step 2: Calculation of covariance for all variable pairs

Covariance tells us whether there is a relationship between any two variables. When two variables are highly correlated, the information they contribute to the sample is largely duplicated, and we can consider removing one of them to obtain a simpler dataset. This is done by computing the covariance matrix.
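A sketch of this step, using randomly generated returns as a stand-in for real data. Note that once the data are standardised, the covariance matrix coincides with the correlation matrix:

```python
import numpy as np

# Hypothetical standardised returns: 250 trading days, 4 tokens.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance matrix of all variable (token) pairs.
C = np.cov(Z, rowvar=False)

# On standardised data this equals the correlation matrix.
print(np.allclose(C, np.corrcoef(Z, rowvar=False)))
```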

Step 3: Computation of eigenvectors and eigenvalues

We then rank the eigenvalues of the covariance (or correlation) matrix in descending order. The first principal component (PC1) is the eigenvector with the highest eigenvalue, the second principal component (PC2) has the second highest eigenvalue, and so on. The percentage of variance accounted for by each component is obtained by dividing its eigenvalue by the sum of all eigenvalues.
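This step can be sketched directly with an eigen-decomposition (again on synthetic data):

```python
import numpy as np

# Hypothetical standardised returns: 250 days, 4 tokens.
rng = np.random.default_rng(1)
X = rng.normal(size=(250, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
C = np.cov(Z, rowvar=False)

# Eigen-decompose the symmetric covariance matrix, then sort descending.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Proportion of variance captured by each principal component.
explained = eigvals / eigvals.sum()
print(explained)   # PC1 first; the proportions sum to 1
```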

Step 4: Selection

With the results from the previous step, we keep the components that account for the most information and discard those with very small contributions.
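A common selection rule, sketched here with made-up eigenvalues and an assumed 75% cut-off, is to keep the smallest number of leading components whose cumulative variance passes a threshold:

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order.
eigvals = np.array([4.0, 2.5, 1.5, 1.0, 0.6, 0.4])
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

# Keep the fewest components reaching an (assumed) 75% variance threshold.
k = int(np.searchsorted(cumulative, 0.75) + 1)
print(k)   # 3 components suffice for these example eigenvalues
```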

**An actual PCA example**

In this analysis, our aim is to create an index using 15 Layer 1 and Layer 2 tokens: BTC, ETH, DOT, SOL, ADA, LUNA, ATOM, AVAX, MATIC, DOGE, BCH, XLM, THETA, PPAY and POA, using their daily prices from October 2020 to October 2021. The main idea of our index selection mechanism is that we treat daily market returns as a series of correlated samples, with the daily return of each token as a variable. The daily return of a token is calculated as the percentage change in its price between the current and the previous day. By selecting the variables carrying the most information and assigning weights based on their level of importance, we can formulate a new index.
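The return calculation described above can be sketched as follows (the prices are illustrative, not the actual dataset):

```python
import numpy as np

# Hypothetical daily closing prices for one token.
prices = np.array([100.0, 102.0, 99.96, 104.0])

# Daily return: percentage change between the current and previous day.
returns = prices[1:] / prices[:-1] - 1.0

print(returns)   # first entry is (102 - 100) / 100 = 0.02
```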

A PCA function is built into most statistical packages. In this example, we will use *prcomp* from the *stats* package in R, which carries out singular value decomposition on mean-centred data, breaking the large dataset down into matrices of eigenvectors and singular values.
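For readers without R, the same SVD-on-centred-data computation that *prcomp* performs can be sketched in numpy (synthetic data; the `sdev` and `scores` names mirror prcomp's `sdev` and `x` outputs as an assumption for illustration):

```python
import numpy as np

# Hypothetical returns: 100 days, 5 tokens.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))

# prcomp-style PCA: centre each column, then singular value decomposition.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Columns of Vt.T are the principal directions (prcomp's "rotation");
# singular values relate to prcomp's sdev via s / sqrt(n - 1).
sdev = s / np.sqrt(X.shape[0] - 1)
scores = Xc @ Vt.T   # sample coordinates on the PCs (prcomp's "x")

print(sdev)   # standard deviation of each component, in descending order
```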

We first normalise the daily returns, then compute the correlation and covariance matrices on the normalised values.

Fig 1. Correlation matrix

Fig 2. Covariance matrix

At this point, we see that ALGO is the token most different from the rest, as it is the only one with negative correlation and covariance with the others.

Next, we pass the normalised daily returns to the *prcomp* function in R. Using the standard deviations printed in the PCA summary, we can calculate the proportion of variance explained by each component:
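The conversion from standard deviations to variance proportions is simply sdev² / Σ sdev², sketched here with made-up standard deviations in place of the actual prcomp output:

```python
import numpy as np

# Hypothetical standard deviations as printed in a prcomp summary.
sdev = np.array([2.1, 1.4, 1.1, 0.8, 0.5])

# Variance explained by each component and the running total.
prop = sdev**2 / np.sum(sdev**2)
cumprop = np.cumsum(prop)

print(np.round(prop, 3))
print(np.round(cumprop, 3))   # the cumulative proportions end at 1.0
```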

Fig 3. PCA output

From the summary, we see that the first five principal components account for over 50% of the variance, and the first ten account for almost 80%. This significantly reduces the number of variables while still retaining most of the information in the original dataset.

To better visualise the relationship between components, we can create a biplot that plots each sample against PC1 and PC2. The left and bottom axes give the PCA scores of the samples (dots), while the top and right axes show how strongly each variable (vectors) influences the principal components. The pairs of red arrows closest to perpendicular mark the best portfolio complements; in our case, they are DOGE & LUNA or SOL & XLM. The remaining arrows lie close to one another, indicating that the corresponding variables move in a similar way.
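The geometry behind this reading of the biplot is that the angle between two loading arrows approximates the correlation between the variables: near-parallel arrows mean strongly correlated tokens, near-perpendicular arrows mean roughly uncorrelated ones. A sketch on synthetic data (two correlated "tokens" built from a shared factor, one independent):

```python
import numpy as np

rng = np.random.default_rng(3)
base = rng.normal(size=200)
# Two strongly correlated series and one roughly independent one (hypothetical).
X = np.column_stack([base + 0.1 * rng.normal(size=200),
                     base + 0.1 * rng.normal(size=200),
                     rng.normal(size=200)])
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Biplot arrows: each variable's loadings on PC1 and PC2.
arrows = (Vt.T * s)[:, :2]

def angle(a, b):
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1, 1)))

# Correlated variables point the same way; the independent one sits near 90°.
print(angle(arrows[0], arrows[1]), angle(arrows[0], arrows[2]))
```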

Fig 4. PCA biplot

Another tool that helps us choose suitable components is the scree plot, which shows the amount of variation each PC captures from the data. In an ideal situation, the scree plot gives a steep curve that bends and then flattens out, identifying the first few PCs that contribute a large proportion of the information. In our example, there is a distinct cut-off point at PC1, due to the generally high level of correlation among the variables used.

Fig 5. PCA scree plot

**Potential limitations of PCA**

One major limitation of PCA is that the proportion of variance explained by each component depends significantly on the choice of sample basket, specifically the number of variables in the original dataset and the level of correlation among them. Different choices of variable pool may lead to very different results due to subjective bias. In addition, since the final model is a linear combination of the principal components, excessive removal of originally correlated variables may make it less readable and interpretable.

**Conclusion**

PCA is a powerful tool for extracting the underlying structure of a dataset. It selects the variables that contribute the most to explaining variance across the sample, which can then be used to build a simplified index. The resulting index retains the majority of the information yet is significantly less complex. Although PCA does not directly give the coefficients for the index, knowing which variables carry more importance in the sample gives us a valuable basis for comparison. This can potentially be used to approximate token returns, especially for a token known to be highly correlated with the components of the reduced index.