1992 - 2015 US Wildfires

Analysis of Wildfire Data Using SQL & Python
Data
Code

Overview & Objectives
The goal of this project was to identify the causes, seasonality, and geographical foci of wildfires in the US. The data was posted on Kaggle by user Rachael Tatman and consists of spatial data on fire locations, temporal data on fire discovery and containment dates and times, and other variables, for fires across the US from 1992 to 2015. The questions that inspired the project are as follows:
  1. Have fires become more or less frequent over time, and can we spot seasonality?
  2. What areas are most impacted by fires and what areas suffer from the largest fires?
  3. How have causes of fires and containment times changed over time?
Method & Skills
The dataset has over 1.8 million data points, with the original tables being organized in a SQLite database. To analyze this data, both SQLite and Python were used to perform the initial exploration, cleaning, and joining of data, whilst Python was mainly used to design and create visuals that better explained the relationships within the dataset. The following skills were valuable in the data wrangling and analysis & visualization phases, respectively:
  • SQLite: COUNT, AVG, DATE, DATETIME, STRFTIME, SUBSTR, CAST, CASE, WHERE, JOIN, GROUP BY, ORDER BY, and Common Table Expression (CTE).
  • Python: NumPy, Pandas, SciPy, Matplotlib, Plotly, seaborn, and sqlite3.
Key Insights
It is important to note that the data covers wildfires that were published by federal, state, and local fire organizations over the years 1992 – 2015 for the purposes of supporting the national Fire Program Analysis system. Regarding the questions that inspired this project, the answers are as follows:
  1. Fire count totals do not have a strong positive correlation with time aggregated over years (r = 0.19), yet average fire size somewhat does (r = 0.59). There is a strong seasonality regarding fires, with most fires and fires of the largest size occurring in the summer months of the year and least occurring in the winter months.
  2. The states with the most fires over the years covered in this dataset are California, Georgia, and Texas. The states suffering from the largest fires on average are Alaska, Nevada, and Idaho. The geographical region suffering from the most fires is the South.
  3. Containment times exhibit the same seasonality as fires and have not changed drastically over the years. Causes of fires also exhibit seasonality, with lightning-caused fires occurring most during summer months and human-caused fires decreasing the fastest during the peak of lightning-caused fires. Within human-caused fires, independently caused fires have become less frequent over time, instead being replaced by more fires caused by equipment, infrastructure, and debris burning.
Project Summary

We begin this project by querying the variables of interest into a master table to perform an initial exploration of the data. Using sqlite3 in Python, we can connect to the original database file and use SQLite to extract the variables in the manner shown below. Note that most queries performed in this project will not be shown in order to avoid redundancy. From the initial exploration, we gather that the data is mostly complete, with only some variables like NWCG_REPORTING_UNIT_GEO_AREA and NWCG_REPORTING_UNIT_STATE having missing values, explained by the fact that some of the reporting agencies work nationwide and thus don't have an associated geographic region or state.
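As a rough illustration of this step, the sketch below builds a small master view with sqlite3 and counts missing values in the two reporting-unit columns. It is not the project's actual query: the in-memory database, the toy rows, and the `Fires` / `NWCGUnits` table layout (including the `UnitId`, `GeographicArea`, and `State` column names) are assumptions made for the example.

```python
import sqlite3

# In-memory stand-in for the original database file (assumption: the real
# project would call sqlite3.connect() on the downloaded .sqlite file).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Fires (
    FIRE_YEAR INTEGER, STATE TEXT, FIRE_SIZE REAL,
    NWCG_REPORTING_UNIT_ID TEXT
);
CREATE TABLE NWCGUnits (UnitId TEXT, GeographicArea TEXT, State TEXT);

-- Toy rows for illustration only; they are not taken from the dataset.
INSERT INTO Fires VALUES
    (2005, 'CA', 0.5,  'U1'),
    (2006, 'GA', 12.0, 'U2'),
    (2006, 'TX', 3.0,  'U3');
INSERT INTO NWCGUnits VALUES
    ('U1', 'Area-1', 'CA'),
    ('U2', 'Area-2', 'GA'),
    ('U3', NULL,     NULL);  -- a nationwide unit with no region or state

-- Master view: fire records joined to their reporting unit.
CREATE VIEW master AS
SELECT f.FIRE_YEAR, f.STATE, f.FIRE_SIZE,
       u.GeographicArea AS NWCG_REPORTING_UNIT_GEO_AREA,
       u.State          AS NWCG_REPORTING_UNIT_STATE
FROM Fires AS f
LEFT JOIN NWCGUnits AS u ON u.UnitId = f.NWCG_REPORTING_UNIT_ID;
""")

# "x IS NULL" evaluates to 0 or 1 in SQLite, so SUM counts the missing values.
total, missing_area, missing_state = conn.execute("""
SELECT COUNT(*),
       SUM(NWCG_REPORTING_UNIT_GEO_AREA IS NULL),
       SUM(NWCG_REPORTING_UNIT_STATE IS NULL)
FROM master
""").fetchone()

print(total, missing_area, missing_state)
```

A LEFT JOIN is used so that fires whose reporting unit has no region or state still appear in the master view instead of being dropped.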

Moving on to the first question of interest, we would like to take a temporal look at fire count and fire size. To do so, we perform a quick query grouping the count of fires and the average fire sizes by three temporal scopes: date (MM-DD-YYYY), month, and year. Below we have a dashboard showing the output of using seaborn and Matplotlib to visualize the queried fire counts. Visually, we can see that there is a seasonality in terms of fire count, with spring and summer months hosting a greater number of fires and autumn and winter months hosting fewer. To better understand the type of seasonal trends occurring, it would be better to run a deeper time series analysis on the data; however, this remains outside the scope of this particular project. One thing to note regarding the trend is the dip in fire count in May and June compared to their surrounding months. This is slightly counterintuitive and would require additional data to fully understand. What we will see later in the analysis is that human-caused fires decline during these months, which is the season when most lightning-caused fires tend to occur. It could be the case that the reduction of human-caused fires is being captured here. Finally, we see only a weak correlation between year and fire count (r = 0.19), indicating that there isn't a general trend of increasing number of wildfire reports over the years.
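The grouping and the correlation check can be sketched as follows. This is a minimal example under stated assumptions, not the project's code: the table holds a handful of made-up rows, `DISCOVERY_DATE` is assumed to be stored as a Julian day number (which SQLite's STRFTIME accepts directly), and `pearson_r` and `grouped` are hypothetical helpers written here so the example has no third-party dependencies.

```python
import sqlite3
from math import sqrt

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Fires (DISCOVERY_DATE REAL, FIRE_SIZE REAL)")

# Toy rows for illustration only (date, size in acres).
toy_rows = [
    ("2001-07-10", 1.0),
    ("2002-03-05", 2.0), ("2002-07-20", 4.0),
    ("2003-01-15", 1.0), ("2003-07-01", 2.0), ("2003-07-30", 6.0),
]
# Store dates as Julian day numbers (an assumption about the source layout).
conn.executemany("INSERT INTO Fires VALUES (julianday(?), ?)", toy_rows)


def grouped(fmt):
    """Fire count and average size per period, e.g. fmt='%m' or '%Y'."""
    sql = f"""
        SELECT STRFTIME('{fmt}', DISCOVERY_DATE) AS period,
               COUNT(*), AVG(FIRE_SIZE)
        FROM Fires
        GROUP BY period
        ORDER BY period
    """
    return conn.execute(sql).fetchall()


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


by_month = grouped("%m")
by_year = grouped("%Y")

years = [int(period) for period, _, _ in by_year]
counts = [count for _, count, _ in by_year]
r_count = pearson_r(years, counts)
print(by_month, by_year, r_count)
```

On the real data, the same yearly aggregate would feed the correlation between year and fire count, and swapping `counts` for the average sizes would give the corresponding figure for fire size.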

Moving on, we now take a look at the same type of analysis but for average fire size instead. Some may argue that looking at averages is not as beneficial due to the large impact of outliers on the aggregate; nonetheless, I believe it is important in this case to consider the largest fires. These are the fires that cause the most damage and the ones that receive the most attention and documentation. Looking at the average fire size by date, we see that there isn't a clear trend in terms of magnitude, but fire size is generally larger in the middle of the year than at the beginning or end of the year. The month graph corroborates this, showing the largest fires occurring in the summer months. Finally, there is a moderately positive correlation between average fire size and year (r = 0.59), indicating a general increase in fire severity over time.

We can now move on to the second question, regarding where fires have occurred. Using Plotly, we can plot fire count by state on a map of the US. From this map, we can see that California, Georgia, and Texas were the most impacted by fires within this dataset. This map is also dynamic, allowing users to hover over the states to see each state's average fire size (note that this is not possible here, as this is a static image of the original Plotly map). Hovering, we also note that Alaska, Nevada, and Idaho experience the largest fires on average. Although it is useful to aggregate the entire dataset like this, it is also helpful to see how fires spread over time. We will do this next, producing a map that shows the spread of the largest fires aggregated monthly for all the years present in the dataset.

Again, what we see below is the static form of a dynamic Plotly map, but what I witnessed while replaying the map's evolution is that there seemed to be a pattern of how fires were spreading over time. They seemed to originate south, travel upwards, and then travel back south. To better sense if this pattern actually exists, we can aggregate fire counts by region by month, where regions are as defined by the NWCG.

Below we have the plot of fire count aggregated by region and month. Note that for this case, we have Texas as being part of the "Southwest" and Idaho being part purely of the "Rocky Mountains" and not the "Northern Rockies". What we see is that the spike of fires in the earlier months is driven by the southern states, mainly Georgia, as seen in the graph above. Then, they calm down (although still maintaining a large count, by comparison), giving way to the eastern states, which spike in April, followed by spikes in most other regions during the summer months. Finally, the southern states retake an important lead in the last months of the year. Given this information, it would be important to analyze human activity and/or climate patterns in the southern region to see what is causing them to have so many more fires in comparison, especially in the colder months.

Code

Moving on, we take a brief detour to look at what type of aid different states are receiving, be it in-state or out-of-state. To do so, we perform a special query to calculate what is shown as the "aid ratio", which is the count of fires reported by organizations based out-of-state divided by the count of fires reported by organizations based in-state.
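One possible way to compute such a ratio is sketched below. It assumes that a fire's state and the home state of its reporting unit are both available after a join; the table layout, the `UnitId` / `State` unit columns, and the toy rows are illustrative assumptions rather than the project's actual query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Fires (STATE TEXT, NWCG_REPORTING_UNIT_ID TEXT);
CREATE TABLE NWCGUnits (UnitId TEXT, State TEXT);

-- Toy rows for illustration only.
INSERT INTO NWCGUnits VALUES ('U_SD', 'SD'), ('U_CA', 'CA'), ('U_WY', 'WY');
INSERT INTO Fires VALUES
    ('SD', 'U_WY'), ('SD', 'U_WY'), ('SD', 'U_CA'), ('SD', 'U_SD'),
    ('CA', 'U_CA'), ('CA', 'U_CA'), ('CA', 'U_WY');
""")

aid_sql = """
WITH tagged AS (
    -- Tag each fire by whether its reporting unit is based in the same state.
    SELECT f.STATE,
           CASE WHEN u.State = f.STATE THEN 'in' ELSE 'out' END AS origin
    FROM Fires AS f
    JOIN NWCGUnits AS u ON u.UnitId = f.NWCG_REPORTING_UNIT_ID
    WHERE u.State IS NOT NULL          -- skip nationwide units
)
SELECT STATE,
       SUM(origin = 'out') AS out_count,
       SUM(origin = 'in')  AS in_count,
       -- NULLIF avoids division by zero when a state has no in-state reports.
       CAST(SUM(origin = 'out') AS REAL) / NULLIF(SUM(origin = 'in'), 0)
           AS aid_ratio
FROM tagged
GROUP BY STATE
ORDER BY aid_ratio DESC
"""
aid_rows = conn.execute(aid_sql).fetchall()
print(aid_rows)
```

Under this definition, a ratio above 1 means more of a state's fires were reported by out-of-state organizations than by in-state ones.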

Code

Here we see that South Dakota, although not having the most or largest fires, requires by far the most out-of-state aid of any state. Also, most of the fires these out-of-state organizations have to deal with are not small, being between 0.25 and 9.9 acres in size. If efforts were being made to establish new organizational bases in certain states, South Dakota might be a good state to focus on so that fewer out-of-state resources are used. Of course, it would also be important to further investigate the data before any strong conclusions are made, as South Dakota might not be straining as many resources as this graph suggests.

Code

Now we take a final turn to look for answers to the third question of interest, regarding fire causes and containment times. Below we see a query grouping fires as either human-caused or lightning-caused, ignoring reports that marked fires as "miscellaneous" or "missing/undefined" in their cause.
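A grouping of this kind can be sketched with a CASE expression, as below. The `STAT_CAUSE_DESCR` column name and the cause labels are assumptions about how the dataset encodes causes, and the rows are toy data for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Fires (STAT_CAUSE_DESCR TEXT)")

# Toy rows for illustration only; the labels are assumed, not verified.
causes = ["Lightning", "Lightning", "Campfire", "Arson", "Debris Burning",
          "Miscellaneous", "Missing/Undefined"]
conn.executemany("INSERT INTO Fires VALUES (?)", [(c,) for c in causes])

cause_sql = """
SELECT CASE WHEN STAT_CAUSE_DESCR = 'Lightning' THEN 'Lightning'
            ELSE 'Human'
       END AS cause_group,
       COUNT(*) AS fire_count
FROM Fires
-- Drop reports whose cause is not informative before grouping.
WHERE STAT_CAUSE_DESCR NOT IN ('Miscellaneous', 'Missing/Undefined')
GROUP BY cause_group
"""
cause_counts = dict(conn.execute(cause_sql).fetchall())
print(cause_counts)
```

Filtering in the WHERE clause, rather than adding a third CASE branch, keeps the uninformative reports out of both groups.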

Code

Shown below are fire counts for fires caused by human activity and fires caused by lightning. The lightning-caused fires share the seasonality we have been seeing, with spikes during the summer months, most likely attributable to climate cycles. On the other hand, fire counts for human-caused fires do not show an obvious seasonality. However, it is interesting that for most years, counts of these fires spike in the spring months (March, April), which is consistent with the first dashboard as well as the graph showing fire count by region and month. Also, it is interesting that the counts begin to decrease as lightning-caused fires spike, indicating what could be a "reaction". Human-caused fires could be reduced during the months when lightning-caused fires are most probable due to limited human activity. It would be interesting to see what is causing these dips.

Code

Here, we separate human-caused fires into two further categories: independently caused (think smoking, campfires, etc.) and equipment, infrastructure, and debris burning (think faulty maintenance, industrial, etc.). By converting to percentages, we can see how the composition of fires (in terms of cause without miscellaneous or undefined fires) changes over time. The most intriguing thing to notice is the dropping fire count percentage for independently caused fires vs. fires caused by equipment, infrastructure, and debris burning.
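The percentage breakdown could be computed along the lines of the sketch below. Which specific causes fall under "independently caused" versus "equipment, infrastructure, and debris burning" is an assumption made for this example, as are the cause labels, the `FIRE_YEAR` and `STAT_CAUSE_DESCR` column names, and the toy rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Fires (FIRE_YEAR INTEGER, STAT_CAUSE_DESCR TEXT)")

# Toy rows for illustration only.
conn.executemany("INSERT INTO Fires VALUES (?, ?)", [
    (1992, "Smoking"), (1992, "Campfire"), (1992, "Debris Burning"),
    (1992, "Lightning"), (1992, "Miscellaneous"),
    (2015, "Smoking"), (2015, "Equipment Use"), (2015, "Powerline"),
    (2015, "Lightning"),
])

share_sql = """
WITH grouped AS (
    SELECT FIRE_YEAR,
           CASE
               WHEN STAT_CAUSE_DESCR = 'Lightning' THEN 'Lightning'
               -- Assumed membership of the "independently caused" group.
               WHEN STAT_CAUSE_DESCR IN ('Smoking', 'Campfire', 'Arson',
                                         'Children', 'Fireworks')
                   THEN 'Independent'
               ELSE 'Equipment/Infrastructure/Debris'
           END AS cause_group
    FROM Fires
    WHERE STAT_CAUSE_DESCR NOT IN ('Miscellaneous', 'Missing/Undefined')
),
yearly AS (
    SELECT FIRE_YEAR, COUNT(*) AS total FROM grouped GROUP BY FIRE_YEAR
)
SELECT g.FIRE_YEAR, g.cause_group,
       100.0 * COUNT(*) / y.total AS pct   -- share of that year's fires
FROM grouped AS g
JOIN yearly AS y ON y.FIRE_YEAR = g.FIRE_YEAR
GROUP BY g.FIRE_YEAR, g.cause_group
ORDER BY g.FIRE_YEAR, g.cause_group
"""
shares = conn.execute(share_sql).fetchall()
for row in shares:
    print(row)
```

Because each year's percentages sum to 100, a falling share for one group necessarily shows up as a rising share elsewhere, which is worth keeping in mind when reading the composition chart.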

Code

The last thing we will look at is containment times, which are calculated in the way seen in the query below.
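A containment time in hours can be derived roughly as in the sketch below, which is not the project's query. It assumes that discovery and containment dates are stored as Julian day numbers and times as four-character HHMM strings, in columns named `DISCOVERY_DATE`, `DISCOVERY_TIME`, `CONT_DATE`, and `CONT_TIME`; the rows are toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Fires (
    DISCOVERY_DATE REAL, DISCOVERY_TIME TEXT,
    CONT_DATE REAL, CONT_TIME TEXT
)
""")

# Toy rows for illustration only; the last fire has no containment record.
conn.executemany(
    "INSERT INTO Fires VALUES (julianday(?), ?, julianday(?), ?)",
    [
        ("2005-02-02", "1300", "2005-02-02", "1730"),
        ("2005-07-01", "2300", "2005-07-02", "0100"),
        ("2005-08-10", "0900", None, None),
    ],
)

containment_sql = """
WITH stamps AS (
    -- Rebuild full timestamps from the date and the HHMM time string.
    SELECT DATETIME(DATE(DISCOVERY_DATE) || ' ' ||
                    SUBSTR(DISCOVERY_TIME, 1, 2) || ':' ||
                    SUBSTR(DISCOVERY_TIME, 3, 2)) AS disc_dt,
           DATETIME(DATE(CONT_DATE) || ' ' ||
                    SUBSTR(CONT_TIME, 1, 2) || ':' ||
                    SUBSTR(CONT_TIME, 3, 2)) AS cont_dt
    FROM Fires
    WHERE DISCOVERY_TIME IS NOT NULL
      AND CONT_DATE IS NOT NULL
      AND CONT_TIME IS NOT NULL
)
SELECT disc_dt,
       -- STRFTIME('%s', ...) yields Unix seconds; divide by 3600 for hours.
       (CAST(STRFTIME('%s', cont_dt) AS INTEGER)
        - CAST(STRFTIME('%s', disc_dt) AS INTEGER)) / 3600.0 AS hours
FROM stamps
ORDER BY disc_dt
"""
rows = conn.execute(containment_sql).fetchall()
hours = [h for _, h in rows]
print(rows)
```

Fires without a recorded containment date or time are excluded rather than treated as zero, so they do not pull the averages down.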

Code

Plotting containment times in hours over time, we see that there is seasonality, with times being the highest during summer months (most likely due to increasing fire counts and average fire sizes), and the lowest during the winter months. However, we don't see an overall trend of changing containment times, indicating that there hasn't been much change in terms of fire response over time. Nonetheless, average fire size has increased whilst fire count has remained somewhat constant. As such, bigger fires are now being contained in a similar amount of time as smaller fires were before. This is a good sign.

Code

Overall, answers to the posed questions were sufficiently provided, although more work could be done to deepen this analysis. For example, if more recent data could be added to this dataset, it could be used in machine learning models to predict fire causes or fire size class. Additionally, it would be interesting to dive deeper into the spatial data that was present in the original database but not touched in this project. I am also particularly interested in what seems to be a "reaction" by humans whenever lightning-caused fires increase, indicated by the negative slope of human-caused fires during the lightning-caused fire peaks. It may be beneficial to look at company or local government rulings that limit human activity during wildfire season to inform any conclusion concerning that occurrence.