Jen Arriaza


Data Engineer at Rolls-Royce
New York

I build analytic tools to understand critical industry problems, and develop strategies based on quantitative insights.



hello@jenarriaza.com
Github
LinkedIn
Instagram


Classification of Accident Severity Using Historical NYC Crash Data


View Repository

Using historical NYC crash data, I developed a model that classifies an accident as severe or not severe. The model is a binary classifier whose target variable, serious accident, is True when at least one injury was reported and False otherwise.
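As a minimal sketch of how such a target could be derived, the snippet below labels a record True when its injury count is at least one. The column name follows the NYC OpenData collisions export and is an assumption here, as is the choice to treat blank counts as zero; the helper `is_serious` is hypothetical and not taken from the project repository.

```python
# Hypothetical sketch: derive a binary severity label from injury counts.
# The column name is an assumption based on the NYC OpenData export.
INJURED_COLUMN = "NUMBER OF PERSONS INJURED"


def is_serious(record: dict) -> bool:
    """Return True when the record reports at least one injury."""
    raw = record.get(INJURED_COLUMN)
    # Missing or blank counts are treated as zero here (an assumption).
    if raw is None or str(raw).strip() == "":
        return False
    return float(raw) >= 1


# Made-up records for illustration only.
records = [
    {INJURED_COLUMN: "0"},
    {INJURED_COLUMN: "2"},
    {INJURED_COLUMN: ""},
]
labels = [is_serious(r) for r in records]
print(labels)
```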

The dataset contains approximately 1.2 million records, each representing a motor vehicle accident that occurred within the 5 boroughs of NYC between April 2012 and August 2022. This is the entirety of the crash data available as of August 2022, accessed directly from NYC OpenData.

Classification Summary & Model Performance

Model performance analysis revealed that the contributing factors and the vehicle types involved carry significant importance in predicting the severity classification. This is consistent with previous studies concluding that point of impact and/or vehicle size are highly relevant factors in crash severity. Additionally, the results indicated that off-street and the hour of day in which an accident occurred also have relatively high importance in predicting severity.

For model selection, I used VotingClassifier to assess various classification algorithms. The visualizations below depict the macro-averaged performance results. The Random Forest Classifier returned the best performance metrics.
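For readers unfamiliar with voting ensembles, here is a small library-free sketch of the majority rule that scikit-learn's VotingClassifier applies when `voting="hard"`. Whether this project used hard or soft voting is not stated, so treat the choice as an assumption; the function `hard_vote` and the three prediction lists are made up for illustration.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Combine per-model label lists by majority vote, one sample at a time."""
    combined = []
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest count;
        # ties resolve to the label seen first among the votes.
        label, _ = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined


# Made-up predictions from three hypothetical models on four crashes.
model_a = [True, False, True, False]
model_b = [True, True, False, False]
model_c = [False, True, True, False]

ensemble = hard_vote([model_a, model_b, model_c])
print(ensemble)
```

Using an odd number of models with binary labels avoids ties, so the tie-breaking rule never comes into play.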

          precision    recall  f1-score   support

   False       0.69      0.78      0.73    123286
    True       0.75      0.65      0.69    123397
accuracy                           0.71    246683
macro avg      0.72      0.71

Classification Report
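The macro average in the report is the unweighted mean of the per-class scores. The short check below recomputes it for precision and recall from the rounded values shown in the report, so small differences from the printed row are expected (the report averages unrounded scores). The dictionary layout and the helper `macro_average` are illustrative assumptions.

```python
# Per-class scores copied from the classification report above.
report = {
    "False": {"precision": 0.69, "recall": 0.78, "support": 123286},
    "True": {"precision": 0.75, "recall": 0.65, "support": 123397},
}


def macro_average(report, metric):
    """Unweighted mean of a metric across classes (support is ignored)."""
    values = [scores[metric] for scores in report.values()]
    return sum(values) / len(values)


macro_precision = macro_average(report, "precision")
macro_recall = macro_average(report, "recall")
total_support = sum(scores["support"] for scores in report.values())

print(f"macro precision: {macro_precision:.3f}")
print(f"macro recall:    {macro_recall:.3f}")
print(f"total support:   {total_support}")
```

Because the two classes have nearly equal support after down-sampling, the macro and support-weighted averages would be almost identical here.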

nyc_classification_auc

Area Under Curve

nyc_classification_precisionRecall

Precision/Recall

nyc_classification_confusionMatrix

Confusion Matrix

Methodology

To handle class imbalance in the data (such as over-representation of “fender bender” accidents), pre-processing included down-sampling, which restored balance before the models were trained. The dataset is mostly complete; the missing values that remained were filled with mean values.
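The two pre-processing steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's actual code: the helpers `downsample` and `impute_mean`, the `hour` feature, the fixed seed, and the order of the two steps are all hypothetical.

```python
import random
from statistics import mean


def downsample(records, label_key, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    by_label = {}
    for row in records:
        by_label.setdefault(row[label_key], []).append(row)
    target = min(len(rows) for rows in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for rows in by_label.values():
        balanced.extend(rng.sample(rows, target))
    return balanced


def impute_mean(records, column):
    """Replace None in a numeric column with the mean of the observed values."""
    observed = [row[column] for row in records if row[column] is not None]
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in records
    ]


# Toy rows: "serious" is the label, "hour" is a hypothetical numeric feature.
rows = [
    {"serious": False, "hour": 8},
    {"serious": False, "hour": None},
    {"serious": False, "hour": 17},
    {"serious": False, "hour": 23},
    {"serious": True, "hour": 2},
    {"serious": True, "hour": 14},
]

filled = impute_mean(rows, "hour")
balanced = downsample(filled, "serious")
print(len(balanced), sum(r["serious"] for r in balanced))
```

Down-sampling discards majority-class rows rather than synthesizing minority ones, which keeps every training record real at the cost of a smaller training set.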

Deep-Dive Data Mining: Analyzing Pre & Post-COVID Car Crash Data

This exploratory analysis further examined the data for patterns and points of interest pertaining to car accidents in the NYC boroughs. The results are useful for developing technology and informing programs that can mitigate the economic impact and loss of life caused by traffic accidents. This project was completed mainly with Python libraries and Tableau.
View the full report here

Insights

image

A time series analysis of data from January 2019 to March 2022 shows that accident numbers fell to about half at the peak of quarantine. But what’s more interesting is which numbers did not fall:

image

Fewer accidents, but rising fatalities. Although there are significantly fewer vehicles and overall accidents on NYC roads post-COVID, as expected, pedestrian deaths continue on an upward trend, as seen in the graph. This phenomenon may be due to increased driver anxiety, larger vehicles, and changes in social norms, as discussed in this New York Times article.
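A comparison like this rests on a monthly aggregation of crashes and pedestrian deaths. The sketch below shows one way to build it with the standard library. The column names and the MM/DD/YYYY date format are assumptions based on the NYC OpenData collisions export, and the helper `monthly_totals` and the sample rows are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Column names and the MM/DD/YYYY date format are assumptions based on the
# NYC OpenData collisions export; adjust them to match the actual file.
DATE_COLUMN = "CRASH DATE"
PED_KILLED_COLUMN = "NUMBER OF PEDESTRIANS KILLED"


def monthly_totals(records):
    """Return {(year, month): [crash_count, pedestrian_deaths]} sorted by month."""
    totals = defaultdict(lambda: [0, 0])
    for row in records:
        day = datetime.strptime(row[DATE_COLUMN], "%m/%d/%Y")
        bucket = totals[(day.year, day.month)]
        bucket[0] += 1  # one record is one crash
        bucket[1] += int(row[PED_KILLED_COLUMN] or 0)  # blank counts as zero
    return dict(sorted(totals.items()))


# Made-up rows for illustration only.
sample = [
    {DATE_COLUMN: "01/15/2019", PED_KILLED_COLUMN: "0"},
    {DATE_COLUMN: "01/20/2019", PED_KILLED_COLUMN: "1"},
    {DATE_COLUMN: "04/02/2020", PED_KILLED_COLUMN: ""},
]
totals = monthly_totals(sample)
for (year, month), (crashes, deaths) in totals.items():
    print(f"{year}-{month:02d}: {crashes} crashes, {deaths} pedestrian deaths")
```

Plotting the two series on separate axes makes it easier to see crash counts falling while pedestrian deaths do not.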

Further exploration of the data revealed other interesting factors, e.g., locations with high fatalities and sedans vs. SUVs. For the full project report click here.

image