Accident Analytics: A Deep Dive into US Traffic Data

In this project, we will focus on exploratory data analysis and visualization, practicing the following:

  • Analyze Traffic Accident Severity: Explore accident severity across different states and times to identify which conditions lead to more serious incidents.
  • Identify Time-Based Patterns: Investigate temporal trends in accidents, such as peak accident hours, days, or seasonal variations across years.
  • Geographical Analysis of Accident Hotspots: Determine accident hotspots in various regions of the US and correlate them with environmental and road conditions.
  • Understand the Impact of Weather and Road Conditions: Examine how weather, road surface conditions, and visibility impact accident occurrence and severity.
  • Discover Correlations Between Traffic Accidents and External Factors: Analyze correlations between accidents and external factors such as traffic density, speed limits, or road types.
  • Predictive Analysis (Optional): Develop models to predict the likelihood of accidents based on various features like time, location, weather, and road conditions.

Introducing the Dataset

This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains approximately 7.7 million accident records.

Dataset Overview

The dataset used in this project is US_Accidents_March2023.csv, containing detailed information about U.S. traffic accidents. Below is a summary of the key features:

Feature             Description
-----------------   --------------------------------------------------------
ID                  Unique identifier for each accident
Start_Time          The start time of the accident
End_Time            The end time of the accident
Severity            Accident severity level (1 = Minor, 4 = Fatal)
State               The U.S. state where the accident occurred
City                The city where the accident occurred
Weather_Condition   Weather conditions during the accident
Visibility          Visibility at the time of the accident (in miles)
Temperature         Temperature at the time of the accident (in Fahrenheit)
Start_Lat           Latitude of the start point (GPS coordinates)
Start_Lng           Longitude of the start point (GPS coordinates)

This dataset offers opportunities to explore traffic accident trends and the impact of factors like weather and road conditions.
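
Since the file has 46 columns but each question below only touches a handful, it is worth noting that Parquet supports column pruning. A minimal sketch (the file name here is illustrative; substitute your local path):

# Read only the columns needed for a given analysis; unselected
# columns are never loaded into memory.
import dask.dataframe as dd

cols = ['ID', 'Start_Time', 'Severity', 'State', 'City', 'Weather_Condition']
df_small = dd.read_parquet('US_Accidents_March23.parquet', columns=cols)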


Import Libraries and Load the Data

The first step is importing the libraries I will use to load and explore the data:

  • Dask
  • NumPy
  • Pandas
  • Matplotlib
  • Seaborn
  • Folium
  • re (regular expressions)
  • scikit-learn (preprocessing)
  • gc (garbage collection)
In [1]:
# Data processing and manipulation
import dask.dataframe as dd                         
import pandas as pd                                 
import numpy as np                                  
import re                                           

# Visualization libraries
import matplotlib.pyplot as plt                     
from matplotlib.ticker import StrMethodFormatter     
plt.rcParams['axes.grid'] = False
import seaborn as sns      
import folium
from folium.plugins import MarkerCluster                        

# Machine learning and preprocessing
from sklearn.preprocessing import LabelEncoder      

# Memory management
import gc                                          

First, I’ll load the dataset using Dask for efficient memory handling, check the size, and preview the first few rows to get a feel for the data. This sets the stage for deeper analysis of patterns like accident severity and time-based trends.

In [2]:
# Load the dataset as a Dask DataFrame
df = dd.read_parquet('/Users/er/Desktop/Data Analysis/Projects/Python/US Accidents/USTrafficAccidents/Data/Parquet/US_Accidents_March23.parquet')

# Compute dataset dimensions
num_rows, num_cols = df.shape[0].compute(), df.shape[1]

# Print dataset overview
print(f"Number of features (columns): {num_cols}")
print(f"Total accidents recorded (rows): {num_rows}")

# Display the first 5 rows of the DataFrame
df.head(5)
Number of features (columns): 46
Total accidents recorded (rows): 7728394
Out[2]:
ID Source Severity Start_Time End_Time Start_Lat Start_Lng End_Lat End_Lng Distance(mi) ... Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
0 A-1 Source2 3 2016-02-08 05:46:00 2016-02-08 11:00:00 39.865147 -84.058723 NaN NaN 0.01 ... False False False False False False Night Night Night Night
1 A-2 Source2 2 2016-02-08 06:07:59 2016-02-08 06:37:59 39.928059 -82.831184 NaN NaN 0.01 ... False False False False False False Night Night Night Day
2 A-3 Source2 2 2016-02-08 06:49:27 2016-02-08 07:19:27 39.063148 -84.032608 NaN NaN 0.01 ... False False False False True False Night Night Day Day
3 A-4 Source2 3 2016-02-08 07:23:34 2016-02-08 07:53:34 39.747753 -84.205582 NaN NaN 0.01 ... False False False False False False Night Day Day Day
4 A-5 Source2 2 2016-02-08 07:39:07 2016-02-08 08:09:07 39.627781 -84.188354 NaN NaN 0.01 ... False False False False True False Day Day Day Day

5 rows × 46 columns

General Trends and Exploratory Data Analysis

This section delves into understanding the underlying patterns and trends within the dataset. Exploratory Data Analysis (EDA) serves as a crucial step in the data analysis process, allowing us to uncover insights that might not be immediately evident. We aim to analyze:

Overall Accident Volume: Assess the total number of accidents within the dataset's timeframe to provide a foundational understanding of the data's scale.

In [3]:
# Load the dataset as a Dask DataFrame
df = dd.read_parquet('/Users/er/Desktop/Data Analysis/Projects/Python/US Accidents/USTrafficAccidents/Data/Parquet/US_Accidents_March23.parquet')

# Compute the shape of the DataFrame
num_rows = df.shape[0].compute()                   # Total number of rows (accidents)
print(f"There are a total of {num_rows} accidents in the dataset's time range.")
There are a total of 7728394 accidents in the dataset's time range.

Temporal Trends: Investigate trends over the years to determine whether accident rates are increasing or decreasing. This insight can be essential for identifying potential safety improvements or areas requiring more attention.

What is the trend in accidents over the years? Are accidents increasing or decreasing over time?

In [4]:
# Define a list of years for analysis
years = [2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]

# Initialize total accident count and dictionary for yearly counts
total_count = 0
yearly_counts = {}

# Count occurrences of accidents per year
for year in years:
    count = df['Start_Time'].str.contains(str(year)).sum().compute()
    yearly_counts[year] = count
    total_count += count
    #print(f"Accidents in {year}: {count}")

# Print total accidents from 2016 to 2023
#print(f"Total accidents from 2016 to 2023: {total_count}")

# Plotting the results
plt.figure(figsize=(10, 6))
plt.bar(yearly_counts.keys(), yearly_counts.values(), color='lightsteelblue')

# Format the Y-axis for readability
plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))

# Label axes and title
plt.xlabel('Year')
plt.ylabel('Number of Accidents')
plt.title('Accidents per Year (2016-2023)')
plt.xticks(years)

plt.tight_layout()  # Adjust layout
plt.show()
[Figure: Accidents per Year (2016-2023)]
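
As an aside, the loop above triggers one .compute() per year. A single-pass sketch (assuming Start_Time is still stored as a 'YYYY-MM-DD HH:MM:SS' string, as it is when the Parquet file is first loaded) lets Dask aggregate all years at once:

# Extract the four-character year prefix and count every year in one pass
yearly_counts_alt = (
    df['Start_Time'].str.slice(0, 4)
      .value_counts()
      .compute()
      .sort_index()
)
print(yearly_counts_alt)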

Accident Data Summary and Analysis

From 2016 to 2022, there has been a noticeable and concerning upward trend in the number of traffic accidents each year. In 2016, there were 410,821 recorded accidents, and by 2022, that number had skyrocketed to 1,762,452. Here’s the yearly breakdown:

  • 2016: 410,821 accidents
  • 2017: 718,093 accidents
  • 2018: 893,426 accidents
  • 2019: 954,303 accidents
  • 2020: 1,178,913 accidents
  • 2021: 1,563,753 accidents
  • 2022: 1,762,452 accidents
  • 2023: 246,633 accidents (so far)

In total, from 2016 to 2023, there have been over 7.7 million accidents. This consistent rise in accidents highlights growing road safety concerns over the years. Although 2023’s data is incomplete, early numbers suggest that this upward trend may continue. To better understand and anticipate future trends, we plan to implement a forecasting model to project accident rates for the remainder of 2023 and beyond.
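
As a placeholder for that forecasting work, here is a deliberately naive linear-trend sketch fitted to the seven complete years; a real model would need to handle seasonality and the partial 2023 data:

# Fit a straight line to the 2016-2022 annual counts and extrapolate to 2023
yrs = np.array([2016, 2017, 2018, 2019, 2020, 2021, 2022])
counts = np.array([410_821, 718_093, 893_426, 954_303,
                   1_178_913, 1_563_753, 1_762_452])
slope, intercept = np.polyfit(yrs, counts, 1)
print(f"Linear-trend projection for 2023: {slope * 2023 + intercept:,.0f} accidents")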


Seasonal Patterns: Analyze the distribution of accidents across months and days of the week. Identifying seasonal trends can help in understanding when accidents are most likely to occur, aiding in resource allocation for traffic management and safety initiatives.

What is the distribution of accidents across different months? Are there seasonal trends?

In [5]:
# List of years
years = [2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]

# Initialize dictionary to hold total counts for each month
monthly_totals = {f"{str(month).zfill(2)}": 0 for month in range(1, 13)}

# For loop to count occurrences for each month across all years
for year in years:
    for month in range(1, 13):
        year_month = f"{year}-{str(month).zfill(2)}"  # Format as 'YYYY-MM'
        
        # Using Dask to compute the count
        count = (df['Start_Time'].str.contains(year_month).sum().compute())
        
        # Accumulate the count to the corresponding month
        monthly_totals[str(month).zfill(2)] += count

# Prepare data for plotting
months = [
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December'
]
values = list(monthly_totals.values())

# Plotting the total accidents per month
plt.figure(figsize=(12, 6))
plt.bar(months, values, color='lightblue')
plt.title('Cumulative Monthly Accident Trends (2016-2023)')
plt.xlabel('Months')
plt.ylabel('Number of Accidents')
plt.xticks(rotation=45, ha='right')
plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))  # Format Y-axis
plt.tight_layout()
plt.show()
[Figure: Cumulative Monthly Accident Trends (2016-2023)]

Analysis of Seasonal Trends

U.S. traffic accidents follow a clear seasonal trend, with winter months (December and January) being the most hazardous, likely due to a combination of weather and holiday travel. In contrast, the summer months, particularly July, experience fewer accidents. This trend is crucial for public awareness, policy-making, and resource allocation for traffic safety initiatives.


Now, we are going to explore seasonal trends in U.S. traffic accidents from 2016 to 2023. These visualizations break down the number of accidents by month, allowing us to spot any patterns or fluctuations across different years and seasons.

In [6]:
# List of years
years = [2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]

months = [
    'January',
    'February',
    'March',
    'April',
    'May',
    'June',
    'July',
    'August',
    'September',
    'October',
    'November',
    'December'
]

# Initialize dictionary to hold values for each month across the years
monthly_trends = {f"{str(month).zfill(2)}": [] for month in range(1, 13)}

# For loop to count occurrences for each month across all years
for year in years:
    for month in range(1, 13):
        year_month = f"{year}-{str(month).zfill(2)}"  # Format as 'YYYY-MM'
        count = df['Start_Time'].str.contains(year_month).sum().compute()
        
        # Append the count to the corresponding month list
        monthly_trends[str(month).zfill(2)].append(count) 

# Create a 3x4 grid for the subplots (3 rows and 4 columns)
fig, axs = plt.subplots(3, 4, figsize=(20, 15))
axs = axs.flatten()  # Flatten the 2D array of axes for easy indexing

# Loop through each month in monthly_trends to generate bar charts
for i, month in enumerate(monthly_trends):
    axs[i].bar(years, monthly_trends[month], color='skyblue')
    
    # Get the month name from the months list based on the month number
    month_index = int(month) - 1  # Convert to 0-based index
    axs[i].set_title(f"{months[month_index]} Trends in Traffic Accidents (2016-2023)")
    axs[i].set_xlabel("Year")
    axs[i].set_ylabel("Number of Accidents")
    axs[i].set_xticks(years)  # Ensure all years are labeled on the x-axis

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()
[Figure: 3×4 grid of monthly accident trends by year (2016-2023)]
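
Both of the monthly loops above can be collapsed into a single compute(). A sketch (again assuming string Start_Time) that slices the 'YYYY-MM' prefix once and pivots the result into a year-by-month table; its column sums reproduce the cumulative monthly totals, and its rows feed the per-year charts:

# One pass over the data for all 96 year-month combinations
ym_counts = (
    df['Start_Time'].str.slice(0, 7)   # 'YYYY-MM'
      .value_counts()
      .compute()
      .sort_index()
)
trend = ym_counts.rename_axis('year_month').reset_index(name='count')
trend[['year', 'month']] = trend['year_month'].str.split('-', expand=True)
pivot = trend.pivot(index='year', columns='month', values='count')
print(pivot)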

The trend shows a general increase in accidents from 2016 to 2022, with a visible dip in the spring of 2020 that likely reflects COVID-19 lockdowns (the 2020 annual total still exceeded 2019's). Accidents rebounded sharply in 2021, and winter months (November, December, January) consistently show the most accidents. Notably, the early months of 2023 (January to March, where the dataset ends) show a decline compared to the same months in previous years, possibly indicating improved safety measures or other factors. Summer months (July-September) are relatively stable with moderate growth.


Time of Day and Weekday Analysis: Explore which days of the week and times of day (morning, afternoon, night) most accidents happen. This information can be invaluable for informing traffic safety campaigns and planning.

Which days of the week are most accidents likely to occur?

In [7]:
# Ensure 'Start_Time' is in datetime format, using mixed format to handle variations
df['Start_Time'] = dd.to_datetime(df['Start_Time'], format='mixed', errors='coerce')

# Extract only the date (removes the time)
df['date_only'] = df['Start_Time'].dt.date

# Extract the day of the week (e.g., 'Monday', 'Tuesday') and store in 'weekday' column
df['weekday'] = df['Start_Time'].dt.day_name()

# Create a new DataFrame with only the 'weekday' column
weekday_df = df[['weekday']]

# Count how many times each day appears and sort by counts in descending order
weekday_counts = weekday_df['weekday'].value_counts().compute().sort_values(ascending=False)

# Print the counts for each day
#print("Counts of each weekday:")
#print(weekday_counts)

# Plotting the counts
plt.figure(figsize=(10, 6))
weekday_counts.plot(kind='bar', color='skyblue')
plt.title('Count of Accidents by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Count of Accidents')
plt.xticks(rotation=45)

# Set the y-axis limit to start from 0 and end at the maximum count
plt.ylim(0, weekday_counts.max() * 1.1)  # Optional padding above the max count

# Format the Y-axis for readability
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{int(x):,}'))

plt.tight_layout()  # Adjust layout
plt.show()
[Figure: Count of Accidents by Day of the Week]

Fridays lead with the highest number of accidents, totaling 1.37 million. Thursday, Wednesday, and Tuesday follow closely, each with around 1.3 million accidents, making the mid-to-late workweek the most accident-prone. Mondays also see a high number of accidents, with 1.21 million, reflecting a busy start to the week.

In contrast, weekends have fewer accidents, especially Sunday, with just 562,744 accidents, likely due to lighter traffic and fewer commuters.

Summary:

  • Weekdays, particularly Friday, experience the most accidents.
  • Weekends, especially Sunday, are the safest, likely because of lower traffic volumes and fewer commuters.
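
Since value_counts sorts by frequency, the chart above orders days by count. A small variant (reusing weekday_counts from the cell above) plots the same numbers in calendar order, which makes the weekday-versus-weekend contrast easier to read:

# Reindex the counts into Monday-to-Sunday order before plotting
calendar_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                  'Friday', 'Saturday', 'Sunday']
plt.figure(figsize=(10, 6))
weekday_counts.reindex(calendar_order).plot(kind='bar', color='skyblue')
plt.title('Accidents by Day of the Week (Calendar Order)')
plt.xlabel('Day of the Week')
plt.ylabel('Count of Accidents')
plt.tight_layout()
plt.show()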

At what time of day do most accidents happen (morning, afternoon, or night)?

In [8]:
# Ensure 'Start_Time' is in datetime format
df['Start_Time'] = dd.to_datetime(df['Start_Time'], format='mixed', errors='coerce')

# Extract the hour from 'Start_Time'
df['hour'] = df['Start_Time'].dt.hour

# Define a function to categorize the time of day
def categorize_time_of_day(hour):
    if 0 <= hour < 6:
        return 'Night'
    elif 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    elif 18 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'  # For hours 21-23

# Apply the categorization function using Dask's map_partitions
df['time_of_day'] = df['hour'].map(categorize_time_of_day, meta=('x', 'object'))

# Count how many times each time of day appears and sort by counts in descending order
time_of_day_counts = df['time_of_day'].value_counts().compute().sort_values(ascending=False)

# Print the counts for each time of day
print("Counts of each time of day:")
print(time_of_day_counts)

# Plotting the counts
plt.figure(figsize=(10, 6))
time_of_day_counts.plot(kind='bar', color='skyblue')
plt.title('Count of Accidents by Time of Day')
plt.xlabel('Time of Day')
plt.ylabel('Count of Accidents')
plt.xticks(rotation=45)
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{int(x):,}'))  # Format y-axis with commas
plt.tight_layout()  # Adjust layout
plt.show()
Counts of each time of day:
time_of_day
Afternoon    2884131
Morning      2631665
Night        1260209
Evening       952389
Name: count, dtype: int64
[Figure: Count of Accidents by Time of Day]

Summary of Accident Counts by Time of Day

Overview of Accident Distribution:

  • Afternoon: 2,884,131 accidents
  • Morning: 2,631,665 accidents
  • Night: 1,260,209 accidents
  • Evening: 952,389 accidents

Key Insights:

  • Afternoon Peak: The afternoon sees the highest number of accidents, likely due to increased traffic from commuters returning home.
  • Morning Activity: Mornings follow closely, aligning with rush hour and higher traffic congestion.
  • Nighttime Incidents: While nighttime accidents are fewer, they may involve greater severity due to reduced visibility.
  • Evening Decline: Evenings experience the fewest accidents, possibly due to lower traffic volumes.

Conclusion: Most accidents occur during daytime hours, particularly in the afternoon and morning. These insights can help inform traffic management strategies, such as:

  • Enhanced Enforcement: Targeting peak hours to reduce risky driving.
  • Awareness Campaigns: Promoting safe driving during high-traffic periods.
  • Infrastructure Improvements: Optimizing traffic signals and road design for enhanced safety.

Geographic Distribution of Accidents

Understanding the geographic distribution of accidents is key to identifying high-risk areas and tailoring interventions effectively. In this section, we will explore:

State-Level Analysis: Identify which states have the highest and lowest accident counts, providing insights into regional differences in traffic safety.

Which states had the highest and lowest number of accidents?

In [9]:
# Check if 'State' exists in the DataFrame
if 'State' not in df.columns:
    raise ValueError("The column 'State' does not exist in the DataFrame.")

# Count accidents for each state and sort in descending order
state_counts = df['State'].value_counts().compute().sort_values(ascending=False)

# Get the top 5 and least 5 states
top_5_states = state_counts.head(5)
least_5_states = state_counts.tail(5)

# Print the total number of accidents
# total_accidents = state_counts.sum()
# print(f"Total number of accidents: {total_accidents}")

# Print the top 5 and least 5 states
# print("Top 5 States with the Most Accidents:")
# for state, count in top_5_states.items():
#     print(f"{state}: {count}")

# print("\nBottom 5 States with the Least Accidents:")
# for state, count in least_5_states.items():
#     print(f"{state}: {count}")

# Plotting the results for all states
plt.figure(figsize=(12, 8))
plt.bar(state_counts.index, state_counts.values, color='lightsteelblue')
plt.title('Accidents per State', fontsize=14)
plt.xlabel('States', fontsize=12)
plt.ylabel('Number of Accidents', fontsize=12)
plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Plotting for Top 5 States
plt.figure(figsize=(10, 6))
plt.bar(top_5_states.index, top_5_states.values, color='lightsteelblue')
plt.title('Top 5 States with Most Accidents', fontsize=14)
plt.xlabel('States', fontsize=12)
plt.ylabel('Number of Accidents', fontsize=12)
plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Plotting for Least 5 States
plt.figure(figsize=(10, 6))
plt.bar(least_5_states.index, least_5_states.values, color='salmon')
plt.title('States with Least Accidents', fontsize=14)
plt.xlabel('States', fontsize=12)
plt.ylabel('Number of Accidents', fontsize=12)
plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
[Figures: Accidents per State; Top 5 States with Most Accidents; States with Least Accidents]

Analysis of Accident Data by State

States with the Most Incidents:

  • California (CA): 1,741,433 incidents
  • Florida (FL): 880,192 incidents
  • Texas (TX): 582,837 incidents
  • South Carolina (SC): 382,557 incidents
  • New York (NY): 347,960 incidents

States with the Least Incidents:

  • Wyoming (WY): 3,757 incidents
  • North Dakota (ND): 3,487 incidents
  • Maine (ME): 2,698 incidents
  • Vermont (VT): 926 incidents
  • South Dakota (SD): 289 incidents

This analysis highlights the significant disparities in the number of accidents across different states, with California, Florida, Texas, South Carolina, and New York experiencing the highest incidents, while Wyoming, North Dakota, Maine, Vermont, and South Dakota report the fewest.


City-Level Insights: Dive deeper into urban areas to discover which cities experience the most accidents. This data can help municipalities target their traffic safety initiatives more effectively.

Which cities have the highest and lowest number of accidents?

In [10]:
# Count accidents for each city
city_counts = df['City'].value_counts().compute()  # Get counts directly

# Sort the city counts in descending order
sorted_city_counts = city_counts.sort_values(ascending=False)

# Get the top 5 cities with the most incidents
top_5_cities = sorted_city_counts.head(5)

# Get the bottom 5 cities with the least incidents
least_5_cities = sorted_city_counts.tail(5)

# Print the results
print("The first 5 cities with the most incidents are:")
for city, count in top_5_cities.items():
    print(f"{city} ({count})")

print("\nThe cities with the least incidents are:")
for city, count in least_5_cities.items():
    print(f"{city} ({count})")

# Check whether city names repeat across rows (each city can record many accidents)
has_duplicates = df['City'].nunique().compute() < len(df)

# Print the result
print("\nHas duplicates:", has_duplicates)

# Plotting the top 5 cities
plt.figure(figsize=(12, 6))
plt.bar(top_5_cities.index, top_5_cities.values, color='lightsteelblue')
plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.xlabel('Cities')
plt.ylabel('Number of Accidents')
plt.title('Top 5 Cities with Most Accidents')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Plotting the least 5 cities
plt.figure(figsize=(12, 6))
plt.bar(least_5_cities.index, least_5_cities.values, color='lightcoral')
plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.xlabel('Cities')
plt.ylabel('Number of Accidents')
plt.title('Least 5 Cities with Fewest Accidents')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
The first 5 cities with the most incidents are:
Miami (186917)
Houston (169609)
Los Angeles (156491)
Charlotte (138652)
Dallas (130939)

The cities with the least incidents are:
Willow City (1)
Window Rock (1)
Wingina (1)
Yeso (1)
Young (1)

Has duplicates: True
[Figures: Top 5 Cities with Most Accidents; Least 5 Cities with Fewest Accidents]

Which regions or states have the highest accident severity, and what factors contribute to more serious accidents in these areas?

Note: This question focuses on states with the highest accident severity (level 4, fatal) and differs from the analysis of states with the most accidents overall, regardless of severity.

In [11]:
# Dictionary to map state abbreviations to full names
state_names = {
    'AL': 'Alabama', 'AK': 'Alaska', 'AZ': 'Arizona', 'AR': 'Arkansas', 'CA': 'California',
    'CO': 'Colorado', 'CT': 'Connecticut', 'DE': 'Delaware', 'FL': 'Florida', 'GA': 'Georgia',
    'HI': 'Hawaii', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana', 'IA': 'Iowa', 'KS': 'Kansas',
    'KY': 'Kentucky', 'LA': 'Louisiana', 'ME': 'Maine', 'MD': 'Maryland', 'MA': 'Massachusetts',
    'MI': 'Michigan', 'MN': 'Minnesota', 'MS': 'Mississippi', 'MO': 'Missouri', 'MT': 'Montana',
    'NE': 'Nebraska', 'NV': 'Nevada', 'NH': 'New Hampshire', 'NJ': 'New Jersey', 'NM': 'New Mexico',
    'NY': 'New York', 'NC': 'North Carolina', 'ND': 'North Dakota', 'OH': 'Ohio', 'OK': 'Oklahoma',
    'OR': 'Oregon', 'PA': 'Pennsylvania', 'RI': 'Rhode Island', 'SC': 'South Carolina',
    'SD': 'South Dakota', 'TN': 'Tennessee', 'TX': 'Texas', 'UT': 'Utah', 'VT': 'Vermont',
    'VA': 'Virginia', 'WA': 'Washington', 'WV': 'West Virginia', 'WI': 'Wisconsin', 'WY': 'Wyoming'
}
# Group by 'State' and 'Severity' and count the occurrences of each
state_severity_counts = df.groupby(['State', 'Severity']).size()

# Compute the result
result = state_severity_counts.compute()

# Convert to a Pandas DataFrame for easier manipulation
result_df = result.reset_index(name='Count')

# Filter the DataFrame for Severity 4 (Critical Incidents)
severity_4_df = result_df[result_df['Severity'] == 4]

# Sort by 'Count' and get the top 5 states
top_5_states_severity_4 = severity_4_df.sort_values(by='Count', ascending=False).head(5)

# Assign different colors to each state using the updated method
colors = plt.colormaps.get_cmap('Set1')(np.linspace(0, 1, len(top_5_states_severity_4)))

# Plot the top 5 states with different colors
plt.figure(figsize=(10, 6))
bars = plt.bar(top_5_states_severity_4['State'], top_5_states_severity_4['Count'], color=colors)

# Set plot title and labels
plt.title('States with the Highest Accident Severity Category 4 (Critical Incident)')
plt.xlabel('State')
plt.ylabel('Count of Critical Incidents (Severity 4)')
plt.xticks(rotation=45)

# Create the legend based on the top 5 states
state_legend = {abbr: state_names[abbr] for abbr in top_5_states_severity_4['State']}

# Create colored legend
for i, bar in enumerate(bars):
    plt.text(1.05, 0.9 - (i * 0.1), f"{top_5_states_severity_4['State'].iloc[i]}: {state_legend[top_5_states_severity_4['State'].iloc[i]]}",
             transform=plt.gca().transAxes, fontsize=10, verticalalignment='center', color=bar.get_facecolor())

# Add space for the legend
plt.tight_layout()

# Show plot
plt.show()
[Figure: States with the Highest Accident Severity Category 4 (Critical Incident)]
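
Raw counts favor populous states, so a complementary view (a sketch reusing result_df and severity_4_df from the cell above) is the share of each state's accidents that are critical:

# Severity-4 share per state: critical incidents divided by all incidents
totals = result_df.groupby('State')['Count'].sum()
sev4 = severity_4_df.set_index('State')['Count']
sev4_share = (sev4 / totals).dropna().sort_values(ascending=False)
print(sev4_share.head(5))   # highest *proportion* of critical accidents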

Cluster Map Visualization: Generate a cluster map that groups areas with high and low accident concentrations across the U.S. This visual tool will highlight accident clusters, allowing for the identification of accident-prone regions. By clustering similar areas based on accident frequency, it facilitates targeted strategic planning for traffic safety measures and resource allocation.

In [12]:
# Load the Dask DataFrame
df = dd.read_parquet('/Users/er/Desktop/Website Esteban/errosal.github.io/Python/US_Traffic_Accidents/US_Accidents_March23.parquet')

# Filter relevant columns and rows for clustering (Severity greater than 3)
df_filtered = df[['Start_Lat', 'Start_Lng', 'Severity']]
df_filtered = df_filtered[df_filtered['Severity'] > 3]  # Only severity greater than 3

# Compute the filtered DataFrame
df_computed = df_filtered.compute()

# Step 1: Count accidents by severity (for severity > 3)
severity_counts = df_computed['Severity'].value_counts()
print("Accident Count by Severity (Severity > 3):")
print(severity_counts)

# Step 2: Identify the severity level with the most accidents
max_severity = severity_counts.idxmax()

# Initialize a map centered around the average location of the accidents
avg_lat = df_computed['Start_Lat'].mean()
avg_lng = df_computed['Start_Lng'].mean()
map_cluster = folium.Map(location=[avg_lat, avg_lng], zoom_start=5)

# Initialize the marker cluster
marker_cluster = MarkerCluster().add_to(map_cluster)

# Step 3: Add points to the map and color code them
for idx, row in df_computed.iterrows():
    # Default marker color
    marker_color = 'blue'
    
    # Set color to red if it's the severity with the highest count
    if row['Severity'] == max_severity:
        marker_color = 'red'
    
    # Add marker to the map
    folium.Marker(
        location=[row['Start_Lat'], row['Start_Lng']],  # accident start coordinates
        popup=f'Severity: {row["Severity"]}',
        icon=folium.Icon(color=marker_color)
    ).add_to(marker_cluster)

# Save and display the map
map_cluster.save('accident_cluster_map.html')
map_cluster

Since the Cluster Map is too large to be hosted on GitHub, this video is a demonstration of it.
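
Worth noting: adding a folium.Marker per row for a couple hundred thousand points is slow and produces a very large HTML file. A lighter-weight sketch using folium.plugins.FastMarkerCluster, which clusters plain coordinate pairs client-side (at the cost of the per-marker popups and colors used above):

from folium.plugins import FastMarkerCluster

fast_map = folium.Map(location=[avg_lat, avg_lng], zoom_start=5)
FastMarkerCluster(
    df_computed[['Start_Lat', 'Start_Lng']].values.tolist()
).add_to(fast_map)
fast_map.save('accident_cluster_map_fast.html')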


Temporal Patterns and Accident Severity

This section of the notebook explores the time-related patterns of accidents and their severity. By analyzing when accidents happen and how severe they are, we can identify high-risk periods and factors contributing to more serious incidents.

  • Time-Related Trends:

    • Analyze accidents by time of day and rush hour patterns.
    • Investigate yearly trends to identify whether accident rates are increasing or decreasing over time.
  • Accident Severity and Influencing Factors:

    • Assess the distribution of accidents by severity level (minor, moderate, severe, fatal).
    • Explore external factors such as weather, road conditions, and time of day that may correlate with higher accident severity.

What are the peak hours for accidents?

In [20]:
df = dd.read_parquet('/Users/er/Desktop/Data Analysis/Projects/Python/US Accidents/USTrafficAccidents/Data/Parquet/US_Accidents_March23.parquet') 

df['hour'] = df['Start_Time'].str.slice(11, 13)

hourly_unique_counts = df.groupby('hour')['ID'].nunique().compute()

hourly_unique_counts_sorted = hourly_unique_counts.sort_index(ascending=True)

print(hourly_unique_counts_sorted)

total_unique_counts = hourly_unique_counts_sorted.sum()
print("Total unique accidents:", total_unique_counts)

plt.figure(figsize=(10, 6))
hourly_unique_counts_sorted.plot(kind='bar', color='skyblue')

# Adding titles and labels
plt.title('Accidents by Hour (Unique Counts)', fontsize=16)
plt.xlabel('Hour of the Day (00-23)', fontsize=12)
plt.ylabel('Number of Unique Accidents', fontsize=12)

# Show the plot
plt.xticks(rotation=0)  # Keep x-axis labels horizontal
plt.show()
hour
00    112378
01     97071
02     93227
03     83863
04    159852
05    228182
06    405837
07    587472
08    577576
09    363034
10    342706
11    355040
12    355001
13    396445
14    448846
15    525855
16    581969
17    576015
18    432042
19    295121
20    225226
21    191452
22    167645
23    126539
Name: ID, dtype: int64
Total unique accidents: 7728394
[Figure: Accidents by Hour (Unique Counts)]

Is there a pattern of accidents during rush hours (morning and evening)?

In [21]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set the Seaborn style
sns.set(style='whitegrid')

# Create a bar plot with Seaborn
plt.figure(figsize=(10, 6))
sns.barplot(x=hourly_unique_counts_sorted.index, y=hourly_unique_counts_sorted.values, color='skyblue')

# Adding titles and labels
plt.title('Accidents by Hour (Unique Counts)', fontsize=16)
plt.xlabel('Hour of the Day (00-23)', fontsize=12)
plt.ylabel('Number of Unique Accidents', fontsize=12)

# Highlight morning rush hour (7 AM to 9 AM)
plt.axvspan(7, 9, color='yellow', alpha=0.3, label='Morning Rush Hour (7 AM - 9 AM)')

# Highlight evening rush hour (4 PM to 6 PM)
plt.axvspan(16, 18, color='orange', alpha=0.3, label='Evening Rush Hour (4 PM - 6 PM)')

# Add a legend to explain the shaded areas
plt.legend()

# Show the plot
plt.xticks(rotation=0)  # Keep x-axis labels horizontal
plt.show()
[Figure: Accidents by Hour with morning and evening rush-hour windows highlighted]
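
To quantify the shaded windows, a quick sketch (reusing hourly_unique_counts_sorted, whose index holds the zero-padded hour strings '00' through '23') computes each rush-hour window's share of all accidents:

# Label-based slicing is inclusive on the sorted hour-string index
rush_am = hourly_unique_counts_sorted.loc['07':'09'].sum()
rush_pm = hourly_unique_counts_sorted.loc['16':'18'].sum()
total = hourly_unique_counts_sorted.sum()
print(f"Morning rush (07:00-09:59): {rush_am / total:.1%} of accidents")
print(f"Evening rush (16:00-18:59): {rush_pm / total:.1%} of accidents")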

How do accidents change over the course of a year?

In [22]:
# Assuming df is your Dask DataFrame with 'Start_Time' column
# List of years
years = [2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]

months = [
    'January', 'February', 'March', 'April', 'May', 'June', 
    'July', 'August', 'September', 'October', 'November', 'December'
]

# Initialize a dictionary to hold values for each month across the years
monthly_trends = {year: {month: 0 for month in months} for year in years}

# For loop to count occurrences for each month across all years
for year in years:
    for month in range(1, 13):
        year_month = f"{year}-{str(month).zfill(2)}"  # Format as 'YYYY-MM'
        count = df['Start_Time'].str.contains(year_month).sum().compute()
        
        # Store the count in the corresponding year and month
        monthly_trends[year][months[month - 1]] = count

# Set up the bar chart for each year
for year in years:
    counts = [monthly_trends[year][month] for month in months]
    
    # Calculate percentage changes for the current year
    percentage_changes = [0]  # Start with 0% for the first month
    for i in range(1, len(counts)):
        if counts[i-1] != 0:  # Avoid division by zero
            percentage_change = ((counts[i] - counts[i-1]) / counts[i-1]) * 100
        else:
            percentage_change = 0  # Set to 0 if previous month's count is 0
        percentage_changes.append(percentage_change)

    # Plotting
    plt.figure(figsize=(10, 6))
    x = np.arange(len(months))  # The x locations for the groups
    bar_width = 0.4  # Width of the bars

    plt.bar(x, counts, width=bar_width, color='skyblue', label='Accidents')
    plt.title(f"Monthly Accidents in {year}", fontsize=16)
    plt.xlabel("Month", fontsize=12)
    plt.ylabel("Number of Accidents", fontsize=12)
    plt.xticks(x, months, rotation=45)  # Rotate month labels for better readability
    plt.ylim(0, max(counts) * 1.1)  # Set y-limit for better visualization
    plt.legend(title='Metrics')

    # Adding percentage change annotations
    for month_index in range(1, len(months)):
        if counts[month_index - 1] != 0:  # Avoid division by zero
            plt.annotate(f"{percentage_changes[month_index]:.2f}%", 
                         xy=(x[month_index], counts[month_index]), 
                         ha='center', 
                         va='bottom', 
                         color='red', fontsize=10)

    plt.tight_layout()
    plt.show()
[Figures: Monthly Accidents for each year from 2016 to 2023, with month-over-month percentage changes annotated]

What is the distribution of accidents by severity levels (minor, moderate, severe, fatal)?

In [23]:
# Load the dataset
df = dd.read_parquet('/Users/er/Desktop/Data Analysis/Projects/Python/US Accidents/USTrafficAccidents/Data/Parquet/US_Accidents_March23.parquet')

# Count accidents at each severity level
severity_counts = df['Severity'].value_counts().compute()
total_severity = severity_counts.sum()

print(severity_counts)
print(total_severity)

# Map severity numbers to descriptive names
severity_labels = {
    1: 'Minor Incident',
    2: 'Moderate Incident',
    3: 'Severe Incident',
    4: 'Fatal Incident',
}

# Update index to descriptive names
severity_counts.index = severity_counts.index.map(severity_labels)

# Create a categorical index with the specified order
ordered_severity = ['Minor Incident', 'Moderate Incident', 'Severe Incident', 'Fatal Incident']
severity_counts = severity_counts.reindex(ordered_severity)

# Adjust the scale (convert counts to millions)
severity_counts_millions = severity_counts / 1_000_000

# Plot the distribution
plt.figure(figsize=(8, 5))
severity_counts_millions.plot(kind='bar', color='skyblue')
plt.title('Distribution of Accidents by Severity Levels')
plt.xlabel('Severity Level')
plt.ylabel('Number of Accidents (in millions)')
plt.xticks(rotation=45)
plt.grid(axis='y')

# Set y-ticks for better readability
plt.yticks([i for i in range(0, int(severity_counts_millions.max()) + 2)], 
           [f'{i}M' for i in range(0, int(severity_counts_millions.max()) + 2)])

# Show the plot
plt.tight_layout()
plt.show()
Severity
3    1299337
1      67366
2    6156981
4     204710
Name: count, dtype: int64
7728394
[Figure: Distribution of Accidents by Severity Levels]
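
It can also help to read the split as percentages; a one-line sketch on the severity_counts Series computed above:

# Share of each severity level as a percentage of all recorded accidents
print((severity_counts / severity_counts.sum() * 100).round(1))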

What factors (e.g., weather, road conditions, time of day) seem to correlate with higher accident severity?

In [24]:
# Load the dataset
df = dd.read_parquet('/Users/er/Desktop/Data Analysis/Projects/Python/US Accidents/USTrafficAccidents/Data/Parquet/US_Accidents_March23.parquet')

# Select relevant columns for analysis
columns_of_interest = [
    'Severity', 
    'Temperature(F)', 
    'Humidity(%)', 
    'Wind_Speed(mph)', 
    'Precipitation(in)', 
    'Distance(mi)',         
    'Traffic_Signal',        
]

# Filter the DataFrame to include only relevant columns
df_filtered = df[columns_of_interest]

# Drop rows with NaN values in the selected columns
df_filtered = df_filtered.dropna()

# Filter the DataFrame to include only rows where Severity equals 4 (the most severe)
df_severity_4 = df_filtered[df_filtered['Severity'] == 4]

# Compute the filtered Dask DataFrame to convert it to a Pandas DataFrame
df_severity_4 = df_severity_4.compute()

# Calculate the correlation matrix for the filtered DataFrame
# (Severity is constant within this subset, so its row and column come out NaN)
correlation_matrix = df_severity_4.corr()

# Print the correlation matrix
print("Correlation Matrix for Severity Level 4:")
print(correlation_matrix)

# Create a heatmap for the correlation matrix
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', cbar=True, square=True)
plt.title('Heatmap of Correlation Matrix for Severity Level 4 Accidents')
plt.show()

# Print the first few rows (now a Pandas DataFrame)
print(df_severity_4.head())

# Print the number of rows in the filtered DataFrame for Severity 4
print(f"Number of rows with Severity level 4: {df_severity_4.shape[0]}")
Correlation Matrix for Severity Level 4:
                   Severity  Temperature(F)  Humidity(%)  Wind_Speed(mph)  \
Severity                NaN             NaN          NaN              NaN   
Temperature(F)          NaN        1.000000    -0.288157        -0.031746   
Humidity(%)             NaN       -0.288157     1.000000        -0.205287   
Wind_Speed(mph)         NaN       -0.031746    -0.205287         1.000000   
Precipitation(in)       NaN       -0.000622     0.081820         0.026067   
Distance(mi)            NaN       -0.025354     0.012569         0.036061   
Traffic_Signal          NaN        0.029288    -0.044182         0.018568   

                   Precipitation(in)  Distance(mi)  Traffic_Signal  
Severity                         NaN           NaN             NaN  
Temperature(F)             -0.000622     -0.025354        0.029288  
Humidity(%)                 0.081820      0.012569       -0.044182  
Wind_Speed(mph)             0.026067      0.036061        0.018568  
Precipitation(in)           1.000000      0.002701        0.001823  
Distance(mi)                0.002701      1.000000       -0.086258  
Traffic_Signal              0.001823     -0.086258        1.000000  
[Figure: Heatmap of Correlation Matrix for Severity Level 4 Accidents]
        Severity  Temperature(F)  Humidity(%)  Wind_Speed(mph)  \
14035          4            63.0         70.0             13.8   
58391          4            59.0         93.0              5.8   
133648         4            89.1         63.0             12.7   
135764         4            75.0         94.0             17.3   
140384         4            80.1         85.0             10.4   

        Precipitation(in)  Distance(mi)  Traffic_Signal  
14035                0.00          0.01           False  
58391                0.01          0.01           False  
133648               0.00          0.00           False  
135764               0.27          0.00           False  
140384               0.00          0.00           False  
Number of rows with Severity level 4: 132126
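
Because every row in this subset has Severity 4, the Severity column is constant and its correlations are undefined, which is why its row and column are NaN above. A follow-up sketch correlates the same factors with Severity across all levels, using the LabelEncoder imported earlier to turn Weather_Condition into integer codes (the encoding imposes an arbitrary ordering, so treat the result as a rough signal only):

# Correlate the candidate factors with Severity across all severity levels
sample = df[columns_of_interest + ['Weather_Condition']].dropna().compute()
sample['Weather_Code'] = LabelEncoder().fit_transform(sample['Weather_Condition'])
print(sample.drop(columns='Weather_Condition').corr()['Severity'].sort_values())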

Conclusion

Summary of Accident Analytics: A Deep Dive into US Traffic Data

The project analyzed over 7.7 million traffic accidents across the United States from 2016 to 2023, focusing on accident severity, geographic hotspots, and external factors such as weather and road conditions.

Key Insights:

  • Accident Trends: Recorded accidents rose steadily from 2016 to 2022, with a temporary slowdown during the spring 2020 COVID-19 lockdowns (annual totals still increased that year).
  • Time-Based Patterns: Accidents peaked during morning (6-9 AM) and evening (4-6 PM) rush hours, reflecting the role of heavy traffic.
  • Geographic Hotspots: States like California, Florida, and Texas had the highest accident rates, while less populated states like Wyoming and South Dakota saw fewer incidents.
  • Weather Impact: Severe accidents were more likely to occur during rainy or cloudy conditions, although fair weather accounted for the majority of accidents overall.
  • Road Conditions: Presence of traffic signals and road features (e.g., traffic calming measures) influenced accident outcomes, especially for severe or fatal incidents.
  • Seasonality: Winter months (December and January) were the most accident-prone, likely due to weather conditions and holiday travel.

These insights highlight the importance of addressing accident-prone areas, improving road safety during high-risk times, and focusing on external factors like weather and traffic conditions to reduce accident severity.