Waiting for Our Autonomous Cars… a Look at Vehicle-Pedestrian Accidents

Dan Anderson
Mar 7, 2020

Technology holds great promise to improve our lives, including saving them. According to a US Department of Transportation study, innovations such as autonomous vehicles could save over 35,000 lives a year.

Until that particular holy grail arrives, a tragic reality persists: pedestrians are killed or injured in vehicular accidents every year. According to the CDC, in 2016 alone, nearly 6,000 pedestrians were killed and 129,000 required emergency care for “non-fatal crash-related injuries” (link).

Let’s take advantage of a rich dataset of vehicle-pedestrian accidents and see what findings we can draw.

The Data & Data Pre-Processing

Our data comes from chapelhillopendata.org (link) and describes vehicle-pedestrian accidents over 11 years. It is a robust collection of categorical data depicting road conditions, demographic data, and incident observations.

GOAL: PREDICT SERIOUS PEDESTRIAN OUTCOMES

While the incident data observed on the scene, such as the presence of alcohol, driver age group, and pedestrian gender, is interesting, our goal is to use general data such as weather, posted speed limit, and road conditions to determine whether pedestrian outcomes can be reliably predicted. (notebook link; github.com)

PREP DATA

Given our goal above, only the a priori data attributes will be used to build a prediction model. This helps prevent data leakage and focuses on dimensions that could be actionable by both policymakers and motorists. So we’ll focus on these general data attributes:

Focus on General Data Attributes (Outcome = ‘PedInjury’)

The dataset is very clean, but we’ll tweak some of the categorical labels to be a little friendlier.

Simplify Labels
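The relabeling step might look something like the following minimal sketch in pandas. The column name `PedInjury` matches the outcome attribute named above, but the raw label values here are assumptions for illustration; the linked notebook holds the actual mapping.

```python
import pandas as pd

# Hypothetical raw labels -- the real dataset's codes may differ.
df = pd.DataFrame(
    {"PedInjury": ["K: Killed", "A: Disabling Injury", "O: No Injury"]}
)

# Map terse codes to friendlier, report-ready labels.
label_map = {
    "K: Killed": "Killed",
    "A: Disabling Injury": "Serious Injury",
    "O: No Injury": "No Injury",
}
df["PedInjury"] = df["PedInjury"].map(label_map)
print(df["PedInjury"].tolist())
```

Using `map` with an explicit dictionary (rather than string surgery) keeps the relabeling auditable: any label missing from the dictionary becomes `NaN` and is easy to spot.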

Let’s get a sense of the outcomes and see what we’re dealing with. Out of 33,707 incidents, 4,587 (13.6%) resulted in a pedestrian death or serious injury.

Number of Pedestrians by Type of Outcome (13.6% killed or serious injury)

Let’s take a sneak peek at a potential factor in these incidents: weather. Surprisingly, the vast majority (3,474 of 4,587, or 76%) of the serious pedestrian outcomes occurred in “clear” weather conditions.

“Clear” Weather Conditions for Most Serious Outcome Incidents (~76%)

How about another attribute that would seem to influence serious outcomes: the hour of the day? In this case, intuition appears to hold, as outcomes worsen at night.

Serious Outcomes and Hour of the Day

At a cursory level, the data contains some intuitive observations (the share of pedestrians experiencing serious injuries, the importance of the time of day) and some counterintuitive ones (clear weather for most bad outcomes).

Predictive Modeling

The dataset consists of categorical attributes and an outcome variable. Before modeling, we’ll perform one more data-wrangling step: creating a boolean outcome column in which True indicates a pedestrian death or serious injury and False indicates any other outcome.
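That wrangling step can be sketched as below. The column name `PedInjury` and the label values are assumptions carried over from the earlier relabeling; adjust the `serious` set to whatever labels the dataset actually uses.

```python
import pandas as pd

# Hypothetical outcome labels standing in for the real dataset's values.
df = pd.DataFrame(
    {"PedInjury": ["Killed", "Serious Injury", "Minor Injury", "No Injury"]}
)

# Collapse the multi-valued outcome into a boolean target:
# True = pedestrian death or serious injury, False = anything else.
serious = {"Killed", "Serious Injury"}
df["SeriousOutcome"] = df["PedInjury"].isin(serious)
print(df["SeriousOutcome"].tolist())
```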

BASELINE MODEL

Our first step is to generate a baseline model against which to compare a predictive model. Using the sklearn package’s DummyClassifier, simulated guessing yields a prediction score of 76.2%.

Use sklearn’s DummyClassifier as a Baseline Prediction (.7623)
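A minimal sketch of that baseline follows, on synthetic stand-in data. Note that with ~13.6% positive outcomes, always predicting the majority class would score ~86%; the article’s 76.2% is consistent with DummyClassifier’s `stratified` strategy (random guesses drawn from the class distribution), which is the assumption made here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in data: 4 ordinal features, ~13.6% serious outcomes.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1000, 4))
y = rng.random(1000) < 0.136

# 'stratified' guesses classes in proportion to their frequency,
# which for p = 0.136 gives expected accuracy p^2 + (1-p)^2 ~= 0.765.
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X, y)
print(round(baseline.score(X, y), 3))
```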

RANDOM FOREST MODEL

With a baseline model in hand, we employed a Random Forest Classifier against the categorical data attributes encoded as ordinal integers. The random forest improves prediction accuracy over the baseline from 76% to 86%.

Random Forest Classifier Model
Improved Accuracy vs. Baseline (.8604 vs. .7623)
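The encode-then-fit pipeline can be sketched as follows, again on synthetic stand-in data. The feature names (`weather`, `light`) and values are illustrative assumptions; the article’s linked notebook contains the real features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Synthetic categorical features standing in for the real dataset.
rng = np.random.default_rng(0)
weather = rng.choice(["Clear", "Rain", "Snow"], size=500)
light = rng.choice(["Daylight", "Dark", "Dusk"], size=500)
X_cat = np.column_stack([weather, light])
# Give the outcome a weak dependence on light conditions so there is signal.
y = (light == "Dark") & (rng.random(500) < 0.5)

# Encode categorical strings as ordinal integers, then fit the forest.
X = OrdinalEncoder().fit_transform(X_cat)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 3))
```

Ordinal encoding imposes an arbitrary order on categories, which tree ensembles tolerate well; for linear models like the logistic regression mentioned below, one-hot encoding would usually be the safer choice.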

MODEL EVALUATION

An alternative approach using a Logistic Regression model yields a similar accuracy score (0.86) to the classifier’s. In this case, we’ll stick with the random forest classifier and get a sense of how that model performs.

The classifier model’s performance is reflected in its ROC (Receiver Operating Characteristic) AUC score (0.7431), which illustrates how well the model distinguishes between our two outcomes.

ROC AUC (0.7431)
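The key detail in this evaluation step, sketched below on synthetic stand-in data, is that ROC AUC is computed from predicted probabilities (`predict_proba`), not hard class labels, so it measures ranking quality across all decision thresholds.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy signal in feature 0 drives the outcome.
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = (X[:, 0] + 0.5 * rng.random(500)) > 0.8

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Score the positive-class probabilities, not the 0/1 predictions.
proba = model.predict_proba(X_test)[:, 1]
print(round(roc_auc_score(y_test, proba), 3))
```

With imbalanced outcomes like ours (~13.6% positive), this separation is exactly why AUC (0.74) can look much less flattering than raw accuracy (0.86).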

While the prediction accuracy seems good, the ROC AUC score, though better than mere guessing, does not excel; it is, however, acceptable for our purpose of drawing some insights from the data.

Discussion

At the end of the day, we have a good, clean dataset depicting an important set of outcomes (pedestrian deaths and injuries from vehicular accidents). A straightforward random forest classifier appears to do a good job predicting known outcomes, with solid performance distinguishing serious pedestrian outcomes from other types.

What matters most? Does our initial intuition hold? We can use Permutation Importance to assess how much each independent feature contributes to predicting outcomes. The top five features are:

Top 5 Features by Importance

All of these features make intuitive sense as outcome factors: Light Conditions, Traffic Controls, Month of the Year, Type of Road, and Speed Limit.
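Permutation importance works by shuffling one feature at a time and measuring how much the model’s score drops; a large drop means the model relied on that feature. A minimal sketch, on synthetic data where only the first feature carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 determines the outcome.
rng = np.random.default_rng(0)
X = rng.random((500, 3))
y = X[:, 0] > 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature n_repeats times on held-out data and record
# the mean drop in score.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranked = np.argsort(result.importances_mean)[::-1]
print(ranked[0])  # feature 0 carries all the signal, so it ranks first
```

Running the importance calculation on held-out data, as above, is what lets it surface genuinely predictive features rather than ones the model merely memorized.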

The least important features are:

Bottom 5 Features by Importance

Perhaps counterintuitively, features such as Weather and certain Road Characteristics were not important in predicting outcomes.

One deduction we can make is that drivers’ limited visibility of pedestrians may be a culprit. Important features such as Light Conditions, Month of Crash (winter months mean shorter daylight hours), and even Speed Limit hint that reduced driver visibility may be a major cause.

A deeper analysis is needed but it seems tangible actions can be taken to improve safety… while we wait for our autonomous vehicles to arrive.


Dan Anderson

Product development mensch. I dig Go, Angular, MongoDB, and Lean Startup. Studying Data Science at Lambda School (DSPT4)