Code
# Import libraries
library(tidyverse)
library(tidymodels)
library(here)
library(cowplot)
library(patchwork)
library(dplyr)
library(lubridate)
library(ggplot2)
library(kableExtra)
library(stats)
library(yardstick)
Zoe Zhou
February 18, 2025
Saw palmetto (Serenoa repens) and scrub palmetto (Sabal etonia) are both keystone species in south-central Florida that support the pine flatwoods ecosystem. This analysis uses logistic regression to classify these two native palmetto species based on plant characteristics. The goal is to evaluate whether variables such as plant height, canopy dimensions, and the number of green leaves can effectively predict species classification. The highlight of this study includes data visualization, logistic regression modeling, and models assessment with cross-validation.
Palmetto Monitoring Data The dataset used for this analysis can be accessed through EDS Data Portal. It provides detailed measurements of the growth and survival characteristics of Serenoa repens and Sabal etonia, two prominent palmetto species in Florida. The specific variables included in the dataset are:
Column Name | Description | Type |
---|---|---|
year |
Sample year. | date |
plant |
Plant ID number. | float |
species |
Palmetto species (Serenoa repens or Sabal etonia). | string |
site |
Site name. | string |
habitat |
Habitat type. | string |
treatment |
Experimental treatment applied. | string |
survival |
Survival from previous census (1981–2017). | string |
height |
Maximum height (1981–2017). | float |
length |
Widest length of the canopy (1981–2017). | float |
width |
Widest width of the canopy perpendicular to the canopy length (1981–2017). | float |
green_lvs |
Count of green leaves (1981–2017). | float |
scape |
Count of inflorescence scapes (1981–2017). | float |
new_lvs |
Count of new leaves (1982–2017). | float |
biomass |
Calculated biomass estimate of dry mass (1989–2017). | float |
canopy |
Average percent canopy cover across the four cardinal directions, taken in January 1993. | float |
lf_long |
Leaf longevity (1990–1997). | float |
comments |
Notes made in 2017. | string |
Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 - 2017, ongoing at 5-year intervals ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/f2f96ec76fbbd4b9db431c79a770c4d5
We will use the following libraries and set-up through this analysis
From the metadata, code 1 represents Serenoa repens, code 2 represents Sabal etonia. We will first load the dataset and preprocess the data by selecting the relevant columns and removing missing values, then factorize the string type species column into dummy variables: 0 for Serenoa repens and 1 for Sabal etonia.
# Load the dataset
df <- read_csv(here("posts","2025-02-18-plant-classifier","data", "palmetto.csv"))
# Preprocess data
df_clean <- df %>%
select(species, height, length, width, green_lvs) %>%
drop_na() %>%
mutate(species = factor(species, levels = c('1','2'), labels = c(0, 1)))
kable(head(df_clean), title = "Palmetto Data cleaned")
species | height | length | width | green_lvs |
---|---|---|---|---|
0 | 51 | 69 | 63 | 6 |
1 | 79 | 129 | 71 | 3 |
0 | 80 | 90 | 66 | 6 |
1 | 59 | 80 | 72 | 2 |
0 | 83 | 96 | 79 | 6 |
1 | 93 | 119 | 90 | 2 |
Both species exhibit overlapping distributions for height, width, and length, indicating these variables alone may not perfectly distinguish between Serenoa repens and Sabal etonia.
Height (Figure 1): The height distribution of Sabal etonia is slightly higher than that of Serenoa repens.
Width (Figure 2): Sabal etonia shows a slightly higher peak in the middle range (100–120 cm) compared to Serenoa repens.
Length (Figure 3): Significant overlap exists, but Sabal etonia tends to grow longer (100–200 cm), while Serenoa repens is more concentrated in shorter lengths (<200 cm).
Green Leaves (Figure 4): Serenoa repens generally has more green leaves than Sabal etonia, highlighting a potential growth pattern difference.
##| fig-cap: "Figure 1: Histogram of Height by Species"
#| fig-subcap: "This histogram shows the distribution of plant height (in cm) for two species of palmetto: Serenoa repens (red bars) and Sabal etonia (blue bars). The x-axis represents plant height, while the y-axis represents the count of plants within each height bin."
# height comparison
height <- ggplot(df_clean, aes(x=height, fill = species)) +
geom_histogram(position ="dodge", alpha = 0.8) +
labs(title = "Height Distribution by Species",
x = "Height (cm)",
y = "Count",
fill = "Species") +
scale_fill_manual(values = c("0" = "#964B00", "1" = "skyblue"),
labels = c("Serenoa repens", "Sabal etonia")) +
theme_minimal() +
theme(legend.position = "none")
#ggplot(df_clean, aes(x=height, fill = species)) +
# geom_histogram(position ="identity", alpha = 0.5)
##| fig-cap: "Figure 2: Histogram of Width (in cm) by Species"
#| fig-subcap: "This histogram shows the distribution of plant width (in cm) for two species of palmetto: Serenoa repens (red bars) and Sabal etonia (blue bars). The x-axis represents plant height, while the y-axis represents the count of plants within each height bin."
# width comparison
width <- ggplot(df_clean, aes(x=width, fill = species)) +
geom_histogram(position ="dodge", alpha = 0.8) +
labs(title = "Width Distribution by Species",
x = "Width (cm)",
y = "Count",
fill = "Species") +
scale_fill_manual(values = c("0" = "#964B00", "1" = "skyblue"),
labels = c("Serenoa repens", "Sabal etonia")) +
theme_minimal() +
theme(legend.position = "none")
##| fig-cap: "Figure 3: Histogram of Maximum Canopy Length by Species"
#| fig-subcap: "The chart highlights subtle differences in growth patterns between the two species, with Sabal etonia tending to grow longer canopy than Serenoa repens. "
# length comparison
length <- ggplot(df_clean, aes(x=length, fill = species)) +
geom_histogram(position ="dodge", alpha = 0.8) +
labs(title = "Length Distribution by Species",
x = "Length (cm)",
y = "Count",
fill = "Species") +
scale_fill_manual(values = c("0" = "#964B00", "1" = "skyblue"),
labels = c("Serenoa repens", "Sabal etonia")) +
theme_minimal() +
theme(legend.position = "none")
##| fig-cap: "Figure 4: Green Leaves Counts by Species"
#| fig-subcap: "Differences in growth patterns between the two species is shown here, with Sabal etonia tending to grow less green leaves than Serenoa repens."
# leaves comparison
leaves <- ggplot(df_clean, aes(x=green_lvs, fill = species)) +
geom_histogram(position ="dodge", alpha = 0.8) +
labs(title = "Number of Green Leaves by Species",
x = "Number of Green Leaves",
y = "Count",
fill = "Species") +
scale_fill_manual(values = c("0" = "#964B00", "1" = "skyblue"),
labels = c("Serenoa repens", "Sabal etonia")) +
theme_minimal() +
theme(legend.position = "right")
We will create paired scatter plots to visualize the relationship between plant characteristics. The scatter plots show the relationship between green leaves and width (Figure 5) and height and canopy length (Figure 6). The plots are colored by species to highlight differences in growth patterns. The scatter plot of green leaves vs. width (Top) shows less overlap compared to the height vs. canopy length plot (Bottom), indicating that green leaves and width may be more useful for species classification. However, all variables exhibit some degree of overlap, emphasizing the need for a combination of predictors to improve classification accuracy.
# leaves vs. width
p1 <- ggplot(df_clean, aes(x = width, y = green_lvs, color = species)) +
geom_point(alpha = 0.6) +
labs(
title = "Green Leaves vs. Width by Species",
x = "Width (cm)",
y = "Number of Green Leaves",
color = "Species"
) +
scale_color_manual(values = c("0" = "#964B00", "1" = "skyblue"),
labels = c("Serenoa repens", "Sabal etonia")) +
theme(legend.position = "none") +
theme_minimal()
# Height vs. Length
p2 <- ggplot(df_clean, aes(x = height, y = length, color = species)) +
geom_point(alpha = 0.6) +
labs(
title = "Height vs. Canopy Length by Species",
x = "Height (cm)",
y = "Max Canopy Length (cm)",
color = "Species"
) +
scale_color_manual(values = c("0" = "#964B00", "1" = "skyblue"),
labels = c("Serenoa repens", "Sabal etonia")) +
theme(legend.position = "right") +
theme_minimal()
p1/p2
Then, we will use a heatmap to visualize the pairwise correlations among all numeric variables in the palmetto dataset. The heatmap suggests that height
, length
, canopy
, biomss
and width
are highly positively correlated, which can be explained by larger plants having larger dimensions and more biomass. The variable species
shows a negatively correlation with green_lvs
, new_lvs
and biomass
, with the strength of correlation descending in that order.
# try a heatmap
# Select numeric columns for correlation analysis
numeric_features <- df[sapply(df, is.numeric)]
numeric_features <- numeric_features[, !(colnames(numeric_features) %in% c("year", "site", "survival","plant"))]
# Compute the correlation matrix
cor_matrix <- cor(numeric_features, use = "complete.obs")
# Convert the correlation matrix to a long format for ggplot2
cor_data <- as.data.frame(as.table(cor_matrix))
# Create the heatmap
ggplot(cor_data, aes(Var1, Var2, fill = Freq)) +
geom_tile(color = "white") + # Add gridlines
scale_fill_gradient2(low = "lightblue", high = "salmon", mid = "white", midpoint = 0,
limit = c(-1, 1), space = "Lab", name = "Correlation") +
labs(title = "Correlation Heatmap of Palmetto Data", x = "", y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
We will split the data into training and testing sets using a 70/30 split and check for class imbalance.
set.seed(123)
# Split the data
df_split <- initial_split(df_clean, prop = 0.7, strata = species)
df_train <- training(df_split)
df_test <- testing(df_split)
# Check class imbalance
train_class <- df_train %>%
group_by(species) %>%
summarize(n = n()) %>%
ungroup() %>%
mutate(prop = n / sum(n))
test_class <- df_test %>%
group_by(species) %>%
summarize(n = n()) %>%
ungroup() %>%
mutate(prop = n / sum(n))
Set engine
Model 1: log odds of species based on height, length, width and leaves as predictor variables.
Model 2: log odds of species based on height, width and green leaves as predictor variables.
Randomly divide training data into k groups, train models with k-1 folds, leaving one out for validation. Compute performance metrics for k number of models. Average k test errors to get an estimate of the model’s performance on unseen data.
Model 1
Model 2
Model Comparison
Model | Accuracy_Mean | ROC_AUC_Mean |
---|---|---|
Model 1 | 0.9180062 | 0.9732010 |
Model 2 | 0.8998379 | 0.9640701 |
Model Comparison
From 10 fold cross-validation, Model 1 has a higher accuracy and ROC AUC compared to Model 2, suggesting that model 1 is better at classification.
# Finalize the model using entire data set
log_1_final <- log_1 %>%
last_fit(split = df_split)
# Extract actual models
model_final <- log_1_final %>%
extract_fit_parsnip()
model_table <- model_final %>%
tidy() %>%
mutate(
term = case_when(
term == "height" ~ "Maximum Height",
term == "length" ~ "Maximum Canopy Length",
term == "width" ~ "Maximum Canopy Width",
term == "green_lvs" ~ "Count of Green Leaves",
TRUE ~ term
),
p.value = ifelse(as.numeric(p.value) < 0.01,'<0.01', as.character(p.value))
) %>%
mutate(p.value = as.character(p.value)) %>%
kable(format = "html", caption = "Logistic Regression Model Results") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
model_table
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 3.2721657 | 0.1709524 | 19.14081 | <0.01 |
Maximum Height | -0.0285369 | 0.0027702 | -10.30135 | <0.01 |
Maximum Canopy Length | 0.0471518 | 0.0022824 | 20.65872 | <0.01 |
Maximum Canopy Width | 0.0377992 | 0.0025076 | 15.07359 | <0.01 |
Count of Green Leaves | -1.9338916 | 0.0471980 | -40.97402 | <0.01 |
Each coefficient estimate shows how much the likelihood of the outcome being species Sabal etonia changes when the corresponding predictor increases by one unit.
This section evaluates how successfully the selected model would classify a plant as the correct species. I use a confusion matrix and histogram of predicted probabilities to assess the model’s performance.
# Make confusion matrix with true/false positive and true/false negative
conf_matrix <- result %>%
conf_mat(truth = species, estimate = .pred_class)
cm <- autoplot(conf_matrix, type = "heatmap") +
scale_fill_gradient(low = "salmon", high = "lightblue") +
labs(
title = "Confusion Matrix Heatmap",
x = "Predicted Species",
y = "Actual Species"
) +
scale_x_discrete(labels = c("Serenoa repens", "Sabal etonia")) + # Custom labels for x-axis
scale_y_discrete(labels = c("Sabal etonia","Serenoa repens")) + # Custom labels for y-axis
theme_minimal() +
theme(
plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 10),
legend.position = "none"
)
cm
# Pivot result table for plotting
result_long <- result %>%
pivot_longer(cols = c(.pred_0, .pred_1), names_to = "plant", values_to = "probability")
# Plot predicted probabilities
prob_p <- ggplot(result, aes(x=.pred_1, fill = as.factor(species)))+
geom_density(stat ='density',position="identity",alpha=0.7)+
labs(title="Density Plot of Predicted Probabilities",
x="Predicted Probability of Sabal etonia",
y="Frequency",
fill="Species")+
scale_fill_manual(values = c("0" = "#964B00", "1" = "skyblue"),
labels = c("Serenoa repens", "Sabal etonia"))+
theme_minimal()+
theme(
plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 10),
legend.position = "right"
)
prob_p
This study utilizes plant height (in cm), canopy length (in cm), canopy width (in cm), and the count of green leaves to develop a regression model aimed at predicting two closely related plant species. The results indicate promising accuracy, achieving 91.7% for Serenoa repens and 91.13% for Sabal etonia. The model’s sensitivity is calculated as 91.83%, indicating its ability to correctly identify positive cases, while its specificity is 91.01%, reflecting its accuracy in identifying negative cases.