Principal Components Analysis (PCA) on Soil Characteristics

R
Modeling
Quarto
Author

Zoe Zhou

Published

February 17, 2025

Cattle on Namibia Rangeland. Photo by Tim Brunauer on behalf of Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ)

About

Principal Component Analysis (PCA) is an ordination method that extracts as much information as possible from multivariate data by projecting it onto a smaller number of dimensions. In this study, we use PCA to eliminate multicollinearity, identify key variables, and visualize interactions among soil characteristics that may influence plant trait responses to livestock grazing. By reducing the complexity of the dataset, PCA helps uncover patterns and relationships that are critical for understanding the ecological dynamics of grazing systems.
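
To make the multicollinearity point concrete, here is a minimal sketch on simulated data (not the soil dataset; the variables x1, x2 and x3 are made up for illustration). It shows that two nearly redundant variables are strongly correlated before PCA, while the component scores returned by prcomp() are uncorrelated with each other.

```r
# Simulated data (an assumption, not the soil dataset): x2 is nearly a copy of x1
set.seed(42)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.2)
x3 <- rnorm(200)
sim <- data.frame(x1, x2, x3)

# The raw variables x1 and x2 are strongly correlated
raw_cor <- cor(sim$x1, sim$x2)

# PCA on standardized variables
sim_pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# Scores on different components are uncorrelated by construction
score_cor <- cor(sim_pca$x)
max_offdiag <- max(abs(score_cor[upper.tri(score_cor)]))

round(raw_cor, 2)   # close to 1
max_offdiag         # numerically zero
```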

Data Summary

Soil Environmental Data: The data used here come from a study on the effects of grazing on soil properties (Wesuls et al., 2012). The dataset includes environmental variables grouped into five categories: grazing parameters, soil chemical parameters, soil physical parameters, soil surface parameters, and topographical parameters. The variables used in the analysis are summarized below:

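Category | Variables | Description
Grazing parameters | logDist | Logarithm of distance from watering point (m)
Grazing parameters | GrazInt | Grazing intensity (unitless)
Soil chemical parameters | pH | pH value (measured in CaCO₃, unitless)
Soil chemical parameters | Conductivity | Electrical conductivity (µS/cm)
Soil chemical parameters | Cl | Chloride concentration (ppm)
Soil chemical parameters | NO2 | Nitrite concentration (ppm)
Soil chemical parameters | NO3 | Nitrate concentration (ppm)
Soil chemical parameters | PO4 | Phosphate concentration (ppm)
Soil chemical parameters | SO4 | Sulphate concentration (ppm)
Soil chemical parameters | Na | Sodium concentration (ppm)
Soil chemical parameters | NH4 | Ammonium concentration (ppm)
Soil chemical parameters | K | Potassium concentration (ppm)
Soil chemical parameters | Mg | Magnesium concentration (ppm)
Soil chemical parameters | Ca | Calcium concentration (ppm)
Soil physical parameters | Skeleton | Skeleton fraction of the soil (% of particles >0.2 cm)
Soil physical parameters | Soildepth | Soil depth (cm)
Soil surface parameters | Fine | Cover of fine material <0.2 cm (%)
Soil surface parameters | Gravel | Cover of gravel 0.2–2.0 cm (%)
Soil surface parameters | Stones | Cover of stones >2 cm (%)
Soil surface parameters | Blocks | Cover of blocks >60 cm (%)
Soil surface parameters | Wood | Cover of dead wood (%)
Soil surface parameters | Litter | Cover of litter (%)
Soil surface parameters | Dung | Cover of dung (%)
Soil surface parameters | Biocrust | Cover of biological soil crust (%)
Topographical parameters | Inclination | Inclination (% slope)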

Citation: Wesuls, D., Oldeland, J., and Dray, S. (2012). Disentangling plant trait responses to livestock grazing from spatio-temporal variation: the partial RLQ approach. Journal of Vegetation Science, 23: 98-113. https://doi.org/10.1111/j.1654-1103.2011.01342.x

Analysis Outline

  1. Preliminary data exploration
  2. Data Wrangling
  3. Run PCA function
  4. Principal Components
  5. Scree Plots
  6. PCA Biplot
  7. Discussion

Set Up

We use the following libraries throughout this analysis:

Code
library(tidymodels)
library(tidyverse)
library(ggfortify)
library(kableExtra)
library(skimr)
library(patchwork)

Preliminary Data Exploration

The data summary below reveals several challenges in the variable distributions. Many variables are strongly right-skewed, particularly the chemical concentrations, and need a transformation to compress their ranges. No variable has missing values, but some, such as Blocks, consist almost entirely of zeros, which makes them less informative for analysis.

Code
# Load data  
soil <- read.csv("data/grazing_env.csv")
#head(soil)
#summary(soil)
#glimpse(soil)
skim(soil)
Data summary
Name soil
Number of rows 378
Number of columns 25
_______________________
Column type frequency:
numeric 25
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Inclination 0 1 0.14 0.37 0.00 0.00 0.00 0.00 3.00 ▇▁▁▁▁
GrazInt 0 1 0.05 0.03 0.01 0.03 0.04 0.06 0.21 ▇▃▁▁▁
logDist 0 1 5.39 1.23 1.61 4.39 5.63 6.44 7.25 ▁▂▅▆▇
pH 0 1 5.71 0.98 4.00 4.84 5.50 6.61 7.74 ▆▇▃▆▃
Conductivity 0 1 56.56 72.78 6.20 15.62 28.15 69.95 633.00 ▇▁▁▁▁
Skeleton 0 1 16.15 16.97 0.47 3.80 9.82 22.09 69.55 ▇▂▁▁▁
Wood 0 1 0.38 0.89 0.00 0.10 0.10 0.50 15.00 ▇▁▁▁▁
Litter 0 1 10.77 13.57 0.00 1.00 3.00 20.00 60.00 ▇▁▂▁▁
Dung 0 1 0.72 1.44 0.00 0.10 0.10 0.50 15.00 ▇▁▁▁▁
Biocrust 0 1 0.95 3.36 0.00 0.00 0.00 0.10 40.00 ▇▁▁▁▁
Fine 0 1 67.69 17.21 0.00 56.00 70.00 80.00 99.00 ▁▁▅▇▆
Gravel 0 1 17.04 14.10 0.00 5.00 15.00 25.00 70.00 ▇▆▂▁▁
Stones 0 1 2.90 4.52 0.00 0.00 2.00 3.00 38.00 ▇▁▁▁▁
Blocks 0 1 0.04 0.33 0.00 0.00 0.00 0.00 5.00 ▇▁▁▁▁
Soildepth 0 1 32.80 19.45 0.00 15.00 27.88 56.00 60.00 ▅▆▅▂▇
Cl 0 1 1.19 0.93 0.00 0.63 0.95 1.51 7.55 ▇▂▁▁▁
NO2 0 1 0.00 0.02 0.00 0.00 0.00 0.00 0.30 ▇▁▁▁▁
NO3 0 1 0.56 1.32 0.00 0.06 0.19 0.57 18.74 ▇▁▁▁▁
PO4 0 1 0.09 0.16 0.00 0.00 0.01 0.13 1.82 ▇▁▁▁▁
SO4 0 1 1.20 0.90 0.00 0.61 1.08 1.56 4.81 ▇▇▂▁▁
Na 0 1 2.06 2.23 0.05 0.57 1.12 3.40 10.53 ▇▃▁▁▁
NH4 0 1 0.11 0.31 0.00 0.00 0.00 0.00 1.40 ▇▁▁▁▁
K 0 1 1.11 1.39 0.02 0.33 0.48 1.33 6.17 ▇▁▂▁▁
Mg 0 1 2.19 1.95 0.00 0.53 1.56 3.40 10.98 ▇▅▁▁▁
Ca 0 1 2.74 2.14 0.00 1.18 2.02 3.96 10.05 ▇▅▂▂▁

Figure 1 is a correlation heatmap showing the pairwise correlations between variables in the dataset. The color scale on the right indicates the strength and direction of each correlation, ranging from -1.0 (blue) for strong negative correlations to 1.0 (brown) for strong positive correlations. For example, logDist is negatively correlated with GrazInt, pH, Conductivity, and Dung. Soildepth is negatively correlated with pH, Skeleton, and Conductivity (blue) and positively correlated with Fine (brown). Highly correlated variables such as Ca, Na, K, and Mg may indicate redundancy.

Code
# Create correlation heatmap
# Select numeric columns for correlation analysis
numeric_features <- soil[sapply(soil, is.numeric)]

# Compute the correlation matrix
cor_matrix <- cor(numeric_features, use = "complete.obs")

# Create df
cor_df <- as.data.frame(as.table(cor_matrix))

# Create the heatmap
ggplot(cor_df, aes(Var1, Var2, fill = Freq)) +
  geom_tile(color = "white") +  # Add gridlines
  scale_fill_gradient2(low = "lightblue", high = "#964B00", mid = "white", midpoint = 0, 
                       limits = c(-1, 1), space = "Lab", name = "Correlation") +
  labs(title = "Correlation Heatmap of Soil Data", x = "", y = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Figure 1: Pairwise Correlation Heatmap of Soil Data
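
Reading the strongest pairs off a heatmap by eye is error-prone, so a rough numeric companion can help. The sketch below ranks variable pairs by absolute correlation. It runs on a small simulated table whose column names are borrowed from the soil data purely for illustration; the values are made up.

```r
# Simulated stand-in for the soil table (assumed data, made-up values)
set.seed(1)
n <- 100
Ca <- rnorm(n)
toy <- data.frame(
  Ca   = Ca,
  Mg   = Ca + rnorm(n, sd = 0.3),  # built to correlate with Ca
  pH   = rnorm(n),
  Fine = rnorm(n)
)

cor_mat <- cor(toy)

# Keep each pair once by using only the upper triangle
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Sort by absolute correlation, strongest first
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
head(pairs_df, 3)
```

On the real data, replacing toy with the numeric soil columns would give the same kind of ranked list.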

Manually identifying redundant variables from the heatmap is tedious, which motivates a principal component analysis. Before we start, let’s check for NA values and select only numeric variables.

PCA requires continuous numeric data with no NAs, so we must drop categorical and character columns and exclude any rows with NAs. We should also rescale the numeric variables so that each has mean 0 and standard deviation 1.
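
The sketch below illustrates these preparation steps on a tiny made-up data frame (the site, pH and Na values are assumptions for illustration only). Note that prcomp() can do the centering and scaling itself; the explicit scale() call is shown only to make the rescaling visible.

```r
# Toy data frame (assumed for illustration) with one NA and one character column
toy <- data.frame(
  site = c("a", "b", "c", "d", "e"),
  pH   = c(5.1, 6.3, NA, 4.8, 7.0),
  Na   = c(0.5, 2.1, 1.2, 3.4, 0.9)
)

# Count missing values per column
colSums(is.na(toy))

# Keep complete rows and numeric columns only
toy_num <- toy[complete.cases(toy), sapply(toy, is.numeric), drop = FALSE]

# Standardize: subtract the column mean, divide by the column sd
toy_scaled <- scale(toy_num, center = TRUE, scale = TRUE)

round(colMeans(toy_scaled), 10)  # column means are (numerically) zero
apply(toy_scaled, 2, sd)         # column standard deviations are one
```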

Data Wrangling

For variables dominated by zeros, particularly the Blocks column (97% zeros), we dropped the column entirely. For the chemical concentration variables, we applied a log(x + 1) transformation to reduce skew and compress the range of values.
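
As a rough illustration of why this transformation helps, the sketch below applies log1p() to simulated right-skewed concentrations that include exact zeros. The data are simulated and the skewness() helper is a hypothetical function written here, not taken from a package.

```r
# Simulated right-skewed concentrations with some exact zeros (assumed data)
set.seed(7)
conc <- c(rep(0, 20), rlnorm(180, meanlog = 0, sdlog = 1))

# Simple moment-based skewness (hypothetical helper, not from a package)
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# log1p(x) is log(x + 1), so zeros map to 0 instead of -Inf
conc_log <- log1p(conc)

c(raw = skewness(conc), log1p = skewness(conc_log))
all(is.finite(conc_log))
```

Adding 1 before taking the log is what keeps zero concentrations usable; a plain log() would return -Inf for them.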

Code
# select variables to transform
chem_vars <- c('Cl', "NO2", "NO3", "PO4", "SO4", "Na", "NH4", "K", "Mg", "Ca")

# Apply log transformation
soil_log <- soil %>% 
  drop_na() %>% 
  select(-Blocks) %>% 
  mutate(across(all_of(chem_vars), log1p))  # log(x + 1), so zero concentrations stay finite

# Check results
#skim(soil_log)

# Visualize results
soil_log_long <- soil_log %>% 
  pivot_longer(names_to = 'name', values_to = 'value', where(is.numeric)) 

ggplot(soil_log_long, aes(x=value))+
  geom_histogram()+
  facet_wrap(~name, scales="free_x")+
  theme_minimal()

Figure 2. Histograms of Variable Distributions After Log Transformation

Principal Component Analysis

# Run PCA function
soil_pca <- soil_log %>% 
  select(where(is.numeric))%>% 
  prcomp(center = TRUE, scale. = TRUE)  # standardize each variable before PCA
Code
# Check results
summary(soil_pca)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     2.0770 1.9051 1.5629 1.26768 1.21341 1.12259 1.05198
Proportion of Variance 0.1797 0.1512 0.1018 0.06696 0.06135 0.05251 0.04611
Cumulative Proportion  0.1797 0.3310 0.4328 0.49971 0.56106 0.61357 0.65968
                           PC8     PC9    PC10    PC11    PC12    PC13    PC14
Standard deviation     0.99704 0.94626 0.93328 0.81894 0.77381 0.76795 0.72014
Proportion of Variance 0.04142 0.03731 0.03629 0.02794 0.02495 0.02457 0.02161
Cumulative Proportion  0.70110 0.73841 0.77470 0.80265 0.82760 0.85217 0.87378
                          PC15    PC16    PC17    PC18    PC19    PC20    PC21
Standard deviation     0.70379 0.67570 0.59953 0.57469 0.56590 0.49866 0.48562
Proportion of Variance 0.02064 0.01902 0.01498 0.01376 0.01334 0.01036 0.00983
Cumulative Proportion  0.89442 0.91344 0.92842 0.94218 0.95552 0.96588 0.97571
                          PC22    PC23    PC24
Standard deviation     0.45498 0.44855 0.41810
Proportion of Variance 0.00863 0.00838 0.00728
Cumulative Proportion  0.98433 0.99272 1.00000
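
The rows of this summary are linked by simple arithmetic: each component's variance is its squared standard deviation, and its proportion is that variance divided by the total. A minimal sketch, assuming the PCA was run on 24 standardized variables so that the total variance is 24, using the first three standard deviations copied from the output above:

```r
# Standard deviations of the first three components, copied from the summary above
sdev <- c(PC1 = 2.0770, PC2 = 1.9051, PC3 = 1.5629)

# Assumption: 24 standardized variables, so the total variance is 24
n_vars <- 24

eigenvalues <- sdev^2            # variance captured by each component
prop_var <- eigenvalues / n_vars # share of the total variance

round(prop_var, 4)
round(cumsum(prop_var), 3)
```

The results should agree with the Proportion of Variance and Cumulative Proportion rows, up to rounding of the printed standard deviations.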

The loadings (eigenvectors) of each variable on the 24 principal components are listed in the table below.

Code
kable(soil_pca$rotation)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24
Inclination 0.2539393 0.0650623 0.2297164 0.0580529 0.0476872 0.3963533 -0.0449732 -0.0515420 -0.1964872 0.0607843 -0.4447328 0.2533805 -0.2782397 0.1914872 -0.1345121 -0.0027120 0.3931800 -0.1207966 -0.1179230 -0.0730326 0.1209062 0.0247223 -0.2185124 -0.1523718
GrazInt 0.0198537 0.2639080 0.0845635 -0.0248342 -0.0579691 0.4852557 -0.3343718 -0.0983452 -0.3454373 0.0412619 -0.0155562 -0.2148365 0.3416707 0.1121799 0.2005560 0.0286634 -0.2009362 0.1422254 0.1215824 0.1278545 -0.3194311 0.1604583 0.0457422 0.0497922
logDist 0.0578106 -0.3565990 0.1763430 0.2271823 -0.1947831 0.0591965 0.0980750 0.1479102 0.0953776 -0.2135787 -0.0916105 0.1475339 -0.3661647 -0.0083419 -0.0337423 0.0158058 -0.0536997 0.1966714 0.3221000 0.2277691 -0.4264260 0.2515008 0.2240570 0.0309153
pH 0.0159855 0.3799065 0.1383138 -0.1710469 -0.1037849 -0.0447218 0.2331377 0.1407851 0.0120061 0.1572887 0.3274969 0.1251140 0.1154500 -0.0606484 -0.1342897 0.1513944 0.5023992 0.1922263 -0.0651812 -0.1388447 -0.1718654 0.2772770 0.3009512 -0.0667539
Conductivity -0.0804869 0.4215503 0.0192748 -0.0569194 0.1430074 -0.1488532 -0.0034214 0.0890602 0.0051359 -0.0648229 -0.0544584 -0.1943055 -0.3158206 -0.0237637 0.1143789 0.2285838 0.2296048 0.1381356 0.2556716 0.4356479 -0.0241378 -0.3243819 -0.2276640 0.2578194
Skeleton 0.3740760 0.0955476 0.2222969 0.0673576 0.1451579 0.1122915 -0.0064025 -0.0476296 0.0018446 -0.0517589 0.0360828 0.0953704 -0.0069136 -0.2332236 0.0774118 -0.0289380 -0.0691977 -0.2898124 0.1104439 -0.0864533 0.2178310 -0.1189558 0.4797966 0.5351514
Wood -0.0014977 -0.0762013 0.1177754 0.0870432 -0.1615695 -0.4277991 -0.3305480 -0.1984217 -0.6718946 -0.1478084 0.2330092 0.1829620 -0.1047250 -0.1037805 -0.1129499 -0.0299904 0.0813919 -0.0015864 -0.0513372 0.0553450 0.0138313 -0.0662852 0.0203841 -0.0348499
Litter -0.3366835 -0.0244869 0.1551774 0.0743487 -0.2582520 0.0012811 0.0482718 0.0599108 -0.0472552 0.0058470 -0.3961124 -0.1217574 0.1899654 0.0047011 -0.4294992 0.2030788 -0.0803051 0.2042751 -0.2964977 0.1946763 0.2399902 -0.0373025 0.2350417 0.2445129
Dung -0.0932240 0.2359014 -0.1921818 -0.2905806 0.2155088 -0.2237510 -0.0512624 -0.0963766 -0.0203824 0.2973056 -0.3586543 0.5060489 -0.0728913 -0.1090018 -0.0738766 0.0246898 -0.3244309 -0.0595670 0.0150675 0.0647375 -0.2566416 0.1465322 0.0825262 0.0413395
Biocrust -0.0601542 0.0081315 0.2725246 0.2167519 -0.2491879 -0.1476091 0.2242942 0.0434494 -0.1524751 0.6852049 -0.0206843 -0.1890456 -0.2158228 0.0080686 0.3306872 -0.1117271 -0.1388469 -0.0496236 -0.0207427 -0.0502446 0.0584419 0.0783582 -0.0642877 0.0565628
Fine -0.3331280 0.0769206 -0.1968402 -0.1370929 -0.2137517 0.2533994 0.0767520 0.0013240 -0.0950304 -0.1240232 0.1500182 0.1811970 -0.1302849 -0.1366587 0.0847356 -0.1021856 -0.0002402 -0.1463968 0.2352979 0.1782701 0.4837707 0.4581900 -0.1174352 0.1020399
Gravel 0.2848551 -0.1133042 -0.0472434 0.0980092 0.4410881 -0.2413856 -0.1924158 -0.0401444 0.0374486 0.1072176 -0.0801840 -0.2234664 0.0455110 0.0341883 -0.0175163 -0.1241602 0.1160451 0.3289112 -0.0514915 0.2005419 0.2496451 0.5316026 0.0021983 0.0710435
Stones 0.2450656 0.2144571 0.3165837 0.0365426 -0.0468223 0.0734916 0.0475801 0.0263857 0.1337868 -0.1585352 0.1677852 0.2499540 -0.0720814 -0.2404235 0.1126783 0.1757577 -0.4277106 0.3925879 -0.2567525 0.0533731 0.2138729 -0.0186729 -0.2071613 -0.2162201
Soildepth -0.1125974 -0.2178989 -0.3697970 -0.0687573 0.1405507 0.2928878 -0.0229655 -0.0458967 -0.1389780 0.1632949 0.1937010 0.1362654 -0.3336418 0.1611334 0.2048896 0.0629920 0.0444800 0.4057586 -0.2906494 -0.0392950 0.0601039 -0.2226013 0.2750225 0.1531502
Cl -0.1722264 0.0935049 0.0200394 0.5297524 0.2227122 -0.0306259 -0.0216152 -0.0974893 0.0914743 0.0546746 0.2249745 0.2384655 0.0898750 0.4456517 0.0024589 0.4409059 -0.0753521 -0.2424615 0.0377073 0.1160407 0.0702210 0.0875443 0.0635886 -0.0642748
NO2 0.0267892 -0.0812970 -0.0309618 -0.0067018 0.0641105 0.0100664 -0.3276688 0.8965307 -0.1141700 0.1270301 0.0411850 0.1222369 0.0732664 -0.0101842 -0.0204803 0.0520621 -0.0471019 -0.0796327 0.0366642 -0.0184513 0.0854992 -0.0537895 -0.0309727 -0.0224835
NO3 -0.2576116 0.2884030 0.0436264 0.0541128 0.1050382 -0.0820376 -0.2141835 0.0510311 0.0013296 -0.2431456 -0.1749648 -0.2990270 -0.3957910 0.0219281 0.0548266 0.0618273 -0.1352112 0.0273654 0.0621720 -0.5145279 0.0869837 0.1402002 0.2751608 -0.2137856
PO4 -0.2808329 0.1423000 0.0987316 0.2824848 0.0353945 -0.1128036 -0.0144136 0.0789924 0.0906207 -0.2171343 -0.2108518 0.3049105 0.2542037 0.0202246 0.4551069 -0.4711552 0.2224280 0.1543632 -0.1259659 0.0184160 -0.0198214 -0.0866207 0.0686905 0.0431674
SO4 -0.1228572 0.1941863 -0.1106035 0.3734078 0.3147183 0.2071764 0.2263396 0.1063871 -0.1636480 -0.0214117 0.1780244 -0.1261630 -0.1472230 -0.2940371 -0.3654662 -0.3338114 -0.1058572 -0.1171068 -0.2563159 0.0880423 -0.2321107 0.0262218 -0.0973960 0.0248773
Na 0.1868670 0.1289495 -0.3734227 0.3476698 -0.2322611 -0.0606111 0.0440100 0.0086923 -0.0505677 -0.0130020 -0.0917187 0.1120927 0.0933374 -0.0090546 -0.1151274 0.1035680 -0.0080006 0.2801555 0.1943636 -0.4525322 -0.0302682 0.0464908 -0.3170906 0.3837800
NH4 -0.1225743 0.0545190 0.0710605 0.1265304 -0.1869802 0.1134638 -0.6077156 -0.1809255 0.4724127 0.3005274 0.1731947 0.0700292 -0.1240187 -0.1682535 -0.2375240 -0.1850958 0.1152072 0.0465455 0.0554706 0.0193226 -0.0083214 -0.0898006 -0.0255858 0.0248830
K 0.2447722 0.1790877 -0.3574563 0.2291239 -0.1539171 0.0152788 0.0975051 -0.0122431 -0.0748274 0.1302239 -0.1341643 -0.0202668 0.0780070 -0.0946630 -0.0680839 -0.1111737 -0.0099697 0.0915677 0.2965476 0.2224671 0.2271014 -0.2281904 0.3558868 -0.4949076
Mg 0.2182573 0.1008650 -0.3390312 0.1122050 -0.3310605 -0.0684199 -0.1604597 0.0427000 0.1217436 -0.1172659 -0.1060288 -0.1159445 -0.1158754 -0.1532328 0.2728809 0.2126120 0.1002475 -0.3122382 -0.5064922 0.1850132 -0.1546163 0.1937166 0.0539820 0.0178529
Ca 0.2153951 0.2740671 -0.0223808 -0.1535181 -0.2309902 -0.1172696 0.0160856 0.0948907 0.0725268 -0.1089284 0.1430596 0.0002931 -0.1553977 0.6471494 -0.1906402 -0.4123894 -0.2142564 -0.0201166 -0.1122681 0.1015696 0.0248113 0.0083516 0.0520306 0.1359311
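
To clarify what the rotation matrix contains, the sketch below checks on a stand-in dataset that the loadings from prcomp() are the eigenvectors of the correlation matrix (up to the sign of each column) and that the squared standard deviations are the eigenvalues. The built-in mtcars data is used only as an assumption-free example; it is not related to the soil data.

```r
# Built-in mtcars data used as a stand-in (not the soil dataset)
dat <- mtcars[, c("mpg", "disp", "hp", "wt")]

pca <- prcomp(dat, center = TRUE, scale. = TRUE)
eig <- eigen(cor(dat))

# Eigenvalues of the correlation matrix equal the squared standard deviations
all.equal(unname(pca$sdev^2), eig$values)

# Loadings match the eigenvectors up to the sign of each column
all.equal(abs(unname(pca$rotation)), abs(eig$vectors))

# Each loading vector has unit length
colSums(pca$rotation^2)
```

Because the sign of a component is arbitrary, only the relative signs of loadings within one component carry meaning.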

Principal Components Results

Interpretation: PC1 explains 17.97% of the variance, and PC2 explains 15.12%, together accounting for approximately 33.1% of the total variance.

In PC1, Skeleton (0.37), Gravel (0.28), Inclination (0.25), Stones (0.25), and K (0.24) are the strongest positive contributors. Litter (-0.34), Fine (-0.33), PO4 (-0.28), and NO3 (-0.26) are the strongest negative contributors.

In PC2, Conductivity (0.42), pH (0.38), NO3 (0.29), Ca (0.27), and GrazInt (0.26) are the strongest positive contributors, followed by Dung (0.24), while logDist (-0.36) and Soildepth (-0.22) are the strongest negative contributors.

In PC3, Stones (0.32), Biocrust (0.27), Inclination (0.23), and Skeleton (0.22) are the strongest positive contributors. Na (-0.37), Soildepth (-0.37), K (-0.36), and Mg (-0.34) are the strongest negative contributors.
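
Rankings like these can be read off the rotation table programmatically. The sketch below sorts the PC2 loadings, copied from the table above and rounded to four decimals, to list the strongest positive and negative contributors. With the fitted object available, soil_pca$rotation[, "PC2"] could presumably be used in place of the hand-copied vector.

```r
# PC2 loadings copied from the rotation table above (rounded to 4 decimals)
pc2 <- c(
  Inclination = 0.0651, GrazInt = 0.2639, logDist = -0.3566, pH = 0.3799,
  Conductivity = 0.4216, Skeleton = 0.0955, Wood = -0.0762, Litter = -0.0245,
  Dung = 0.2359, Biocrust = 0.0081, Fine = 0.0769, Gravel = -0.1133,
  Stones = 0.2145, Soildepth = -0.2179, Cl = 0.0935, NO2 = -0.0813,
  NO3 = 0.2884, PO4 = 0.1423, SO4 = 0.1942, Na = 0.1289, NH4 = 0.0545,
  K = 0.1791, Mg = 0.1009, Ca = 0.2741
)

# Five largest positive loadings and two most negative loadings
top_pos <- head(sort(pc2[pc2 > 0], decreasing = TRUE), 5)
top_neg <- head(sort(pc2[pc2 < 0]), 2)

top_pos
top_neg
```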

Scree Plots

A scree plot visualizes the share of variance explained by each principal component (Figure 3).

Code
# screeplot
#screeplot(soil_pca, type='lines')

# create df
pc_names <- colnames(soil_pca$rotation)
sd_vec <- soil_pca$sdev
var_vec <- sd_vec^2

pct_expl_df <- data.frame(v=var_vec,
                          pct_v = var_vec/sum(var_vec),
                          pc = pc_names)
pct_expl_df$pc <- factor(pct_expl_df$pc, levels = pc_names)
# plot 
ggplot(pct_expl_df, aes(x = pc, y = pct_v)) +
  geom_col(fill = "lightblue",alpha = 0.7) +
  labs(title = "Scree Plot", x = 'Principal component', y = 'Variance explained')+
  scale_y_continuous(labels = scales::percent)+
  theme_minimal()+
  theme(panel.grid = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1))+
  geom_text(aes(label = scales::percent(pct_v, accuracy = 0.1)), 
            angle = 25, hjust = -0.25, size = 3) 

Figure 3. Scree-plot explaining variance captured by each component

Biplot

A biplot (Figure 4) shows (1) the loadings of the variables on the first two principal components (brown arrows) and (2) the score of each observation on those two components (points).

The length of an arrow indicates how strongly the variable contributes to the first two principal components; longer arrows represent stronger contributions. The direction of the arrows shows how variables are correlated with each other and with the principal components. The points are colored by the Skeleton variable, with darker points indicating higher skeleton content.

Code
autoplot(soil_pca, 
         data = soil_log,
         loadings=TRUE,
         colour = 'Skeleton',
         loadings.label=TRUE,
         loadings.colour = "#964B00",
         loadings.label.colour = "#964B00",
         loadings.label.vjust = -0.5) +
  scale_color_gradient(low="lightblue", high='darkblue') +
  theme_minimal()

Figure 4. PCA Biplot: Soil Environmental Variables
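
The arrow geometry can be approximated directly from the loadings. The sketch below computes the length and angle of four arrows in the PC1-PC2 plane, using loadings copied from the rotation table and rounded to four decimals. This is an illustration under the assumption that the plotting function may rescale the arrows for display, so only relative lengths and directions should be compared with the figure.

```r
# PC1 and PC2 loadings for a few variables, copied from the rotation table (4 decimals)
loadings_12 <- rbind(
  Skeleton     = c(PC1 =  0.3741, PC2 =  0.0955),
  Conductivity = c(PC1 = -0.0805, PC2 =  0.4216),
  logDist      = c(PC1 =  0.0578, PC2 = -0.3566),
  NO2          = c(PC1 =  0.0268, PC2 = -0.0813)
)

# Arrow length in the PC1-PC2 plane
arrow_len <- sqrt(rowSums(loadings_12^2))

# Angle measured from the positive PC1 axis, in degrees
arrow_angle <- atan2(loadings_12[, "PC2"], loadings_12[, "PC1"]) * 180 / pi

round(cbind(length = arrow_len, angle = arrow_angle), 2)
```

A short arrow such as NO2 means the variable is poorly represented by the first two components, not that it is unimportant overall; in the rotation table its loading is concentrated on PC8.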

Discussion

This biplot visualizes how soil properties and environmental factors vary and interact across the dataset, although the first two components together capture only about 33% of the total variance. To reach roughly 80% of the variance explained, we must retain 11 principal components.
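
The number of components needed for a given share of variance follows from the cumulative sum of the squared standard deviations. A minimal sketch, using the 24 standard deviations copied from the summary output (with the fitted object available, soil_pca$sdev would be the natural input):

```r
# Standard deviations of all 24 components, copied from summary(soil_pca)
sdev <- c(2.0770, 1.9051, 1.5629, 1.26768, 1.21341, 1.12259, 1.05198,
          0.99704, 0.94626, 0.93328, 0.81894, 0.77381, 0.76795, 0.72014,
          0.70379, 0.67570, 0.59953, 0.57469, 0.56590, 0.49866, 0.48562,
          0.45498, 0.44855, 0.41810)

# Cumulative share of the total variance
cum_var <- cumsum(sdev^2) / sum(sdev^2)

# First component at which the cumulative share reaches 80%
n_needed <- which(cum_var >= 0.80)[1]

n_needed
round(cum_var[n_needed], 3)
```

The 80% threshold is a convention rather than a rule; changing the cutoff in the which() call shows how the required number of components shifts.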

Skeleton is strongly positively correlated with PC1, indicating that areas with higher skeleton content have higher PC1 scores.

Variables such as Skeleton, Gravel, Inclination, Ca, and Stones have arrows pointing in the same direction along PC1. Variables like Conductivity, pH, GrazInt, and Dung are more aligned with PC2. This suggests that PC2 captures a gradient related to soil chemistry and grazing intensity.