Exploring tree outcomes following fires

Overview

There’s an awesome dataset on tree survival following fires, the Fire and Tree Mortality Database (FTM), and I want to explore it & compare fire survival across species. Some fun with tidymodels, data visualization, binary logistic regression, and my first shot at using the fantastic geomtextpath package!

Citations:

Cansler, C. Alina; Hood, Sharon M.; Varner, J. Morgan; van Mantgem, Phillip J.; Agne, Michelle C.; Andrus, Robert A.; Ayres, Matthew P.; Ayres, Bruce D.; Bakker, Jonathan D.; Battaglia, Michael A.; Bentz, Barbara J.; Breece, Carolyn R.; Brown, James K.; Cluck, Daniel R.; Coleman, Tom W.; Corace, R. Gregory; Covington, W. Wallace; Cram, Douglas S.; Cronan, James B.; Crouse, Joseph E.; Das, Adrian J.; Davis, Ryan S.; Dickinson, Darci M.; Fitzgerald, Stephen A.; Fulé, Peter Z.; Ganio, Lisa M.; Grayson, Lindsay M.; Halpern, Charles B.; Hanula, Jim L.; Harvey, Brian J.; Hiers, J. Kevin; Huffman, David W.; Keifer, MaryBeth; Keyser, Tara L.; Kobziar, Leda N.; Kolb, Thomas E.; Kolden, Crystal A.; Kopper, Karen E.; Kreitler, Jason R.; Kreye, Jesse K.; Latimer, Andrew M.; Lerch, Andrew P.; Lombardero, Maria J.; McDaniel, Virginia L.; McHugh, Charles W.; McMillin, Joel D.; Moghaddas, Jason J.; O’Brien, Joseph J.; Perrakis, Daniel D.B.; Peterson, David W.; Prichard, Susan J.; Progar, Robert A.; Raffa, Kenneth F.; Reinhardt, Elizabeth D.; Restaino, Joseph C.; Roccaforte, John P.; Rogers, Brendan M.; Ryan, Kevin C.; Safford, Hugh D.; Santoro, Alyson E.; Shearman, Timothy M.; Shumate, Alice M.; Sieg, Carolyn H.; Smith, Sheri L.; Smith, Rebecca J.; Stephenson, Nathan L.; Steuver, Mary; Stevens, Jens T.; Stoddard, Michael T.; Thies, Walter G.; Vaillant, Nicole M.; Weiss, Shelby A.; Westlind, Douglas J.; Woolley, Travis J.; Wright, Micah C. 2020. Fire and tree mortality database (FTM). Fort Collins, CO: Forest Service Research Data Archive. Updated 24 July 2020. https://doi.org/10.2737/RDS-2020-0001

Cansler, C. Alina; Hood, Sharon M.; Varner, J. Morgan; van Mantgem, Phillip J.; Agne, Michelle C.; Andrus, Robert A.; Ayres, Matthew P.; Ayres, Bruce D.; Bakker, Jonathan D.; Battaglia, Michael A.; Bentz, Barbara J.; Breece, Carolyn R.; Brown, James K.; Cluck, Daniel R.; Coleman, Tom W.; Corace, R. Gregory; Covington, W. Wallace; Cram, Douglas S.; Cronan, James B.; Crouse, Joseph E.; Das, Adrian J.; Davis, Ryan S.; Dickinson, Darci M.; Fitzgerald, Stephen A.; Fulé, Peter Z.; Ganio, Lisa M.; Grayson, Lindsay M.; Halpern, Charles B.; Hanula, Jim L.; Harvey, Brian J.; Hiers, J. Kevin; Huffman, David W.; Keifer, MaryBeth; Keyser, Tara L.; Kobziar, Leda N.; Kolb, Thomas E.; Kolden, Crystal A.; Kopper, Karen E.; Kreitler, Jason R.; Kreye, Jesse K.; Latimer, Andrew M.; Lerch, Andrew P.; Lombardero, Maria J.; McDaniel, Virginia L.; McHugh, Charles W.; McMillin, Joel D.; Moghaddas, Jason J.; O’Brien, Joseph J.; Perrakis, Daniel D.B.; Peterson, David W.; Prichard, Susan J.; Progar, Robert A.; Raffa, Kenneth F.; Reinhardt, Elizabeth D.; Restaino, Joseph C.; Roccaforte, John P.; Rogers, Brendan M.; Ryan, Kevin C.; Safford, Hugh D.; Santoro, Alyson E.; Shearman, Timothy M.; Shumate, Alice M.; Sieg, Carolyn H.; Smith, Sheri L.; Smith, Rebecca J.; Stephenson, Nathan L.; Steuver, Mary; Stevens, Jens T.; Stoddard, Michael T.; Thies, Walter G.; Vaillant, Nicole M.; Weiss, Shelby A.; Westlind, Douglas J.; Woolley, Travis J.; Wright, Micah C. 2020. The Fire and Tree Mortality Database, for empirical modeling of individual tree mortality after fire. Scientific Data 7: 194. https://doi.org/10.1038/s41597-020-0522-7

Attach packages & read in the data

library(tidyverse)
library(here)
library(naniar)
library(tidymodels)
library(geomtextpath)
library(paletteer)
trees <- read_csv(here("content", "post", "2022-03-10-tree-mortality-fires", "data", "Data", "FTM_trees.csv")) # Tree outcomes and records

Important information: See attributes in _metadata_RDS-2020-0001.html for variable definitions.
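Since naniar is attached above (but not otherwise used in this post), a quick missingness check on the yearly status columns is a natural first look at the data. A minimal sketch, assuming the yr0status:yr10status columns described in the metadata:

# Missingness summary for the yearly status columns:
trees %>% 
  select(yr0status:yr10status) %>% 
  miss_var_summary()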

Exploratory data visualization

Counts of tree species in the dataset:

# Find the top 20 most counted tree species
trees <- trees %>% 
  mutate(sci_name = paste(Genus, Species_name)) %>% 
  filter(sci_name != "Pinus jeffreyi or ponderosa")

tree_count_top_20 <- trees %>% 
  count(sci_name) %>% 
  mutate(sci_name = fct_reorder(sci_name, n)) %>% 
  slice_max(n, n = 20)

tree_20_gg <- ggplot(data = tree_count_top_20, aes(x = sci_name, y = n)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  labs(y = "\nObservations in dataset",
       x = "Scientific name")

Counts of live (0) and dead (1), plotted as proportions over time, for the top 20 most recorded tree species in the dataset:

# Make a long form of the trees dataset (top 20 most observed tree species)
trees_long <- trees %>% 
  pivot_longer(cols = yr0status:yr10status, names_to = "yr_outcome", values_to = "live_dead") %>% 
  mutate(yr_since_fire = parse_number(yr_outcome),
         live_dead_chr = case_when(
           live_dead == 0 ~ "live",
           live_dead == 1 ~ "dead"
         )) %>% 
  filter(sci_name %in% tree_count_top_20$sci_name)

trees_live_dead <- trees_long %>% 
  count(sci_name, yr_since_fire, live_dead_chr) %>% 
  drop_na()
  
tree_survival_gg <- ggplot(data = trees_live_dead, aes(x = yr_since_fire, y = n)) +
  geom_col(aes(fill = live_dead_chr), position = "fill") +
  scale_fill_manual(values = c("lightsalmon", "forestgreen"),
                    name = "Live / dead:") +
  scale_x_continuous(breaks = c(0, 5, 10), labels = c("0", "5", "10")) +
  theme_minimal() +
  labs(x = "Years since fire",
       y = "Proportion live / dead",
       title = "Tree survival post-fire",
       subtitle = "Only includes the 20 most observed trees in the dataset",
       caption = "Data: Fire and tree mortality database (FTM)") +
  facet_wrap(~sci_name)

We can already see some interesting differences in survival across species. For example, Picea mariana and Abies lasiocarpa experience quick mortality within the first year; others like Pinus jeffreyi and Abies concolor appear more resilient. However, near-complete mortality is observed across all species within 10 years.

Ponderosa pines - diving a bit deeper

Since it’s the most observed species in the dataset (and happens to be one of my favorite trees), I’ll dive a bit deeper into factors that may influence Pinus ponderosa mortality post-fire.

ponderosa <- trees_long %>% 
  filter(sci_name == "Pinus ponderosa")

First, let’s take a look at mortality over time (years since fire):

survival_gg <- ggplot(data = ponderosa, aes(x = yr_since_fire, y = live_dead)) +
  geom_jitter(alpha = 0.008) +
  labs(x = "Years since fire",
       y = "Tree status (live / dead)",
       title = "Ponderosa pine mortality post-fire",
       caption = "Data: Fire and tree mortality database (FTM)") +
  scale_y_continuous(breaks = c(0, 1), labels = c("Live", "Dead")) +
  scale_x_continuous(breaks = c(0, 5, 10), labels = c("0", "5", "10")) +
  theme_minimal()

Classification: binary logistic regression in tidymodels

Create the training & testing sets

ponderosa <- ponderosa %>% 
  drop_na(yr_since_fire, live_dead) %>% 
  mutate(live_dead = as.factor(live_dead))

# Make the training & testing dataset:
ponderosa_split <- ponderosa %>% 
  initial_split(prop = 4/5)

# Confirm the splits (Analysis/Assess/Total): 
ponderosa_split
## <Analysis/Assess/Total>
## <293297/73325/366622>
# Extract the training and testing sets: 
ponderosa_train <- training(ponderosa_split)
ponderosa_test <- testing(ponderosa_split)

# Check them out a bit: 
ponderosa_train %>% 
  count(yr_since_fire, live_dead)
## # A tibble: 22 × 3
##    yr_since_fire live_dead     n
##            <dbl> <fct>     <int>
##  1             0 0         39815
##  2             0 1           535
##  3             1 0         39058
##  4             1 1         10455
##  5             2 0         26918
##  6             2 1         12846
##  7             3 0         23654
##  8             3 1         14864
##  9             4 0          8630
## 10             4 1         15046
## # … with 12 more rows
ponderosa_test %>% 
  count(yr_since_fire, live_dead)
## # A tibble: 22 × 3
##    yr_since_fire live_dead     n
##            <dbl> <fct>     <int>
##  1             0 0          9854
##  2             0 1           154
##  3             1 0          9821
##  4             1 1          2514
##  5             2 0          6790
##  6             2 1          3248
##  7             3 0          5899
##  8             3 1          3637
##  9             4 0          2173
## 10             4 1          3707
## # … with 12 more rows
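A side note on the split: the live/dead classes are quite imbalanced in some years (e.g., year 0), so one optional tweak - not used here - is to stratify the split on the outcome so that the live/dead proportions stay similar across the training & testing sets:

# Optional (hypothetical) alternative: a stratified split
ponderosa_split_strat <- ponderosa %>% 
  initial_split(prop = 4/5, strata = live_dead)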

Make a recipe

# Just using the single predictor here:
ponderosa_recipe <- recipe(live_dead ~ yr_since_fire, data = ponderosa)
ponderosa_recipe 
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          1

Make the model

ponderosa_model <- 
  logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") # Binary classificiation

Make the workflow

ponderosa_wf <- workflow() %>% 
  add_recipe(ponderosa_recipe) %>% 
  add_model(ponderosa_model)

Fit the model:

ponderosa_fit <- ponderosa_wf %>% 
  last_fit(ponderosa_split)

# Which returns high accuracy and roc_auc:
ponderosa_fit %>% collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.807 Preprocessor1_Model1
## 2 roc_auc  binary         0.880 Preprocessor1_Model1

Proof of concept: check out the test set predictions

…just for the first 20 rows:

ponderosa_fit %>% 
  collect_predictions() %>% 
  head(20)
## # A tibble: 20 × 7
##    id               .pred_0 .pred_1  .row .pred_class live_dead .config         
##    <chr>              <dbl>   <dbl> <int> <fct>       <fct>     <chr>           
##  1 train/test split  0.565   0.435      3 0           1         Preprocessor1_M…
##  2 train/test split  0.912   0.0882    11 0           0         Preprocessor1_M…
##  3 train/test split  0.565   0.435     14 0           0         Preprocessor1_M…
##  4 train/test split  0.246   0.754     19 1           1         Preprocessor1_M…
##  5 train/test split  0.838   0.162     29 0           1         Preprocessor1_M…
##  6 train/test split  0.394   0.606     32 1           1         Preprocessor1_M…
##  7 train/test split  0.912   0.0882    39 0           0         Preprocessor1_M…
##  8 train/test split  0.0102  0.990     60 1           1         Preprocessor1_M…
##  9 train/test split  0.912   0.0882    61 0           0         Preprocessor1_M…
## 10 train/test split  0.838   0.162     62 0           0         Preprocessor1_M…
## 11 train/test split  0.912   0.0882    65 0           0         Preprocessor1_M…
## 12 train/test split  0.565   0.435     68 0           1         Preprocessor1_M…
## 13 train/test split  0.140   0.860     71 1           1         Preprocessor1_M…
## 14 train/test split  0.565   0.435     79 0           0         Preprocessor1_M…
## 15 train/test split  0.722   0.278     82 0           0         Preprocessor1_M…
## 16 train/test split  0.394   0.606     87 1           1         Preprocessor1_M…
## 17 train/test split  0.0756  0.924     90 1           1         Preprocessor1_M…
## 18 train/test split  0.0394  0.961     91 1           1         Preprocessor1_M…
## 19 train/test split  0.0102  0.990    103 1           1         Preprocessor1_M…
## 20 train/test split  0.0394  0.961    126 1           1         Preprocessor1_M…

Confusion matrix of truth / predictions

Recall here: 0 = “Live”, 1 = “Dead”

ponderosa_fit %>% 
  collect_predictions() %>% 
  conf_mat(truth = live_dead, estimate = .pred_class)
##           Truth
## Prediction     0     1
##          0 32364  9553
##          1  4596 26812
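The conf_mat object can also be piped into summary() to get a table of derived metrics (sensitivity, specificity, and friends) in one go - a quick sketch:

# Derived metrics (sensitivity, specificity, etc.) from the confusion matrix:
ponderosa_fit %>% 
  collect_predictions() %>% 
  conf_mat(truth = live_dead, estimate = .pred_class) %>% 
  summary()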

Fit on entire dataset

ponderosa_model_full <- fit(ponderosa_wf, ponderosa)

ponderosa_model_full
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##   (Intercept)  yr_since_fire  
##       -2.3424         0.6929  
## 
## Degrees of Freedom: 366621 Total (i.e. Null);  366620 Residual
## Null Deviance:       508200 
## Residual Deviance: 318000    AIC: 318000
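To put that yr_since_fire coefficient on a more interpretable scale: exp(0.6929) ≈ 2.0, so the model estimates that the odds of mortality roughly double with each additional year since fire. A sketch for pulling tidy, exponentiated coefficients (odds ratios) out of the fitted workflow:

# Coefficients as odds ratios (exponentiate = TRUE):
ponderosa_model_full %>% 
  extract_fit_parsnip() %>% 
  tidy(exponentiate = TRUE)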

Making new predictions

Let’s say we want to predict survival of other ponderosa pines based solely on years post-fire:

# Make a data frame containing a "yr_since_fire" variable as a new model input:
new_yr <- data.frame(yr_since_fire = c(0, 0.4, 1, 2.2, 5.7, 8.3))

# Then use the model to predict outcomes, bind together: 
example_predictions <- data.frame(new_yr, predict(ponderosa_model_full, new_yr))

example_predictions
##   yr_since_fire .pred_class
## 1           0.0           0
## 2           0.4           0
## 3           1.0           0
## 4           2.2           0
## 5           5.7           1
## 6           8.3           1

This does seem to align with what we’d expect based on the data visualization. We can also get the predicted probability of each outcome (“live” = 0, “dead” = 1) by adding type = "prob" within the predict() function.

predict_over <- data.frame(yr_since_fire = seq(from = 0, to = 10, by = 0.1))
predictions_full <- data.frame(predict_over, predict(ponderosa_model_full, predict_over, type = "prob"))
names(predictions_full) <- c("yr_since_fire", "prob_alive", "prob_dead")

# Plot probability of mortality:
ponderosa_prob_alive <- ggplot() +
  geom_line(data = predictions_full, aes(x = yr_since_fire, y = prob_alive), color = "gray30", size = 1) +
  labs(x = "Years since fire",
       y = "Probability of tree being alive",
       title = "Predicted Ponderosa pine mortality post-fire",
       caption = "Data: Fire and tree mortality database (FTM)") +
  scale_y_continuous(breaks = c(0, 0.5,  1), 
                     labels = c("0%", "50%", "100%"),
                     limits = c(0, 1)) +
  scale_x_continuous(breaks = c(0, 5, 10), labels = c("0", "5", "10")) +
  theme_minimal()
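A quick sanity check on that curve: the predicted probability crosses 50% where the linear predictor equals zero, i.e. at 2.3424 / 0.6929 ≈ 3.4 years post-fire - right where the line above drops below half. A sketch, assuming the underlying glm can be pulled out with extract_fit_engine():

# Years since fire at which predicted P(dead) = 0.5 (i.e. logit = 0):
glm_fit <- extract_fit_engine(ponderosa_model_full)
-coef(glm_fit)[["(Intercept)"]] / coef(glm_fit)[["yr_since_fire"]]
## ~ 3.38 years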

Extending the model

I want to extend this model to the 20 most observed trees in the dataset, so I’ll include species as a predictor variable.

Create the training & testing sets

trees_20 <- trees_long %>% 
  filter(sci_name %in% tree_count_top_20$sci_name) %>% 
  drop_na(yr_since_fire, live_dead) %>% 
  mutate(live_dead = as.factor(live_dead))

# Make the training & testing dataset:
trees_20_split <- trees_20 %>% 
  initial_split(prop = 4/5)

# Confirm the splits (Analysis/Assess/Total): 
trees_20_split
## <Analysis/Assess/Total>
## <830790/207698/1038488>
# Extract the training and testing sets: 
trees_20_train <- training(trees_20_split)
trees_20_test <- testing(trees_20_split)

# Check them out a bit: 
trees_20_train %>% 
  count(yr_since_fire, live_dead)
## # A tibble: 22 × 3
##    yr_since_fire live_dead     n
##            <dbl> <fct>     <int>
##  1             0 0         77746
##  2             0 1          2902
##  3             1 0         73109
##  4             1 1         30620
##  5             2 0         51806
##  6             2 1         47314
##  7             3 0         43230
##  8             3 1         54831
##  9             4 0         22408
## 10             4 1         55584
## # … with 12 more rows
trees_20_test %>% 
  count(yr_since_fire, live_dead)
## # A tibble: 22 × 3
##    yr_since_fire live_dead     n
##            <dbl> <fct>     <int>
##  1             0 0         19380
##  2             0 1           752
##  3             1 0         18654
##  4             1 1          7574
##  5             2 0         12730
##  6             2 1         11752
##  7             3 0         10731
##  8             3 1         13637
##  9             4 0          5618
## 10             4 1         13859
## # … with 12 more rows

Make a recipe

# Now using two predictors (years since fire & species):
trees_20_recipe <- recipe(live_dead ~ yr_since_fire + sci_name, data = trees_20)
trees_20_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          2

Make the model

trees_20_model <- 
  logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") # Binary classificiation

Make the workflow

trees_20_wf <- workflow() %>% 
  add_recipe(trees_20_recipe) %>% 
  add_model(trees_20_model)

Fit the model:

trees_20_fit <- trees_20_wf %>% 
  last_fit(trees_20_split)

# Which returns high accuracy and roc_auc:
trees_20_fit %>% collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.820 Preprocessor1_Model1
## 2 roc_auc  binary         0.893 Preprocessor1_Model1
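As a visual companion to that roc_auc value, here’s a sketch of the ROC curve itself (yardstick’s roc_curve() takes the class probability for the first factor level, .pred_0, as the estimate):

# ROC curve for the test set predictions:
trees_20_fit %>% 
  collect_predictions() %>% 
  roc_curve(truth = live_dead, .pred_0) %>% 
  autoplot()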

Confusion matrix of truth / predictions

Recall here: 0 = “Live”, 1 = “Dead”

trees_20_fit %>% 
  collect_predictions() %>% 
  conf_mat(truth = live_dead, estimate = .pred_class)
##           Truth
## Prediction      0      1
##          0  57541  19948
##          1  17434 112775

Fit on entire dataset

trees_20_model_full <- fit(trees_20_wf, trees_20)

trees_20_model_full
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##                   (Intercept)                  yr_since_fire  
##                      -1.89320                        0.61524  
##         sci_nameAbies grandis       sci_nameAbies lasiocarpa  
##                       0.69818                        2.72269  
##           sci_nameAcer rubrum   sci_nameCalocedrus decurrens  
##                       0.54265                        0.24419  
##  sci_nameJuniperus scopulorum     sci_nameLarix occidentalis  
##                       0.94475                       -1.21916  
##     sci_namePicea engelmannii          sci_namePicea mariana  
##                       1.88703                        5.68710  
##      sci_namePinus albicaulis         sci_namePinus contorta  
##                       2.17161                        1.78648  
##        sci_namePinus echinata         sci_namePinus jeffreyi  
##                       0.56136                       -0.54492  
##     sci_namePinus lambertiana        sci_namePinus palustris  
##                       0.30286                       -1.42278  
##       sci_namePinus ponderosa            sci_namePinus taeda  
##                      -0.21649                        0.06294  
##   sci_namePopulus tremuloides  sci_namePseudotsuga menziesii  
##                       1.30545                       -0.27530  
##    sci_nameTsuga heterophylla  
##                       0.82540  
## 
## Degrees of Freedom: 1038487 Total (i.e. Null);  1038467 Residual
## Null Deviance:       1358000 
## Residual Deviance: 818700    AIC: 818700
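One note for reading those species coefficients: Abies concolor (alphabetically first of the 20 species) is absorbed into the intercept as the reference level, so each sci_name coefficient is a shift in the log-odds of mortality relative to Abies concolor. Reassuringly, the large positive coefficient for Picea mariana (5.69) matches the rapid mortality we saw in the faceted plot earlier, while negative values (e.g., Pinus palustris, Larix occidentalis) suggest better survival odds.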

Mortality (probability)

# Make a data frame containing a "sci_name" and "yr_since_fire" variable as a new model input:
new_data <- data.frame(sci_name = rep(unique(trees_20$sci_name), 100)) %>% 
  arrange(sci_name)

new_data <- data.frame(new_data, yr_since_fire = rep(seq(from = 0, to = 10, length.out = 100), 20))

tree_20_predictions <- data.frame(new_data, predict(trees_20_model_full, new_data, type = "prob"))
names(tree_20_predictions) <- c("sci_name", "yr_since_fire", "prob_alive", "prob_dead")

# Plot probability of mortality:
all_prob_gg <- ggplot() +
  geom_textline(data = tree_20_predictions, 
                aes(x = yr_since_fire, 
                    y = prob_alive, 
                    label = sci_name,
                    group = sci_name,
                    color = sci_name), 
                size = 2.5,
                show.legend = FALSE) +
  labs(x = "Years since fire",
       y = "Probability of tree being alive",
       title = "Predicted tree mortality post-fire",
       caption = "Data: Fire and tree mortality database (FTM)") +
  scale_y_continuous(breaks = c(0, 0.5,  1), 
                     labels = c("0%", "50%", "100%"),
                     limits = c(0, 1)) +
  scale_x_continuous(breaks = c(0, 5, 10), labels = c("0", "5", "10")) +
  scale_color_paletteer_d("ggthemes::Tableau_20") +
  theme_minimal()
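As an aside, the prediction grid built above with rep() & arrange() could also be made in one call with tidyr::expand_grid(), which crosses the two variables directly - a sketch:

# Equivalent prediction grid via expand_grid():
new_data <- expand_grid(sci_name = unique(trees_20$sci_name),
                        yr_since_fire = seq(from = 0, to = 10, length.out = 100))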

More opportunities

There are a bunch of other variables in this dataset worth considering - like how scorched the tree was post-fire, how large it was to start (height and diameter), evidence of beetle infestation, and more. I’m looking forward to coming back to this dataset in the future & revisiting this model with those variables included.
