Classification

Clustering model using K-Means, with an analysis of weather trends across US time zones, using attributes such as severity, precipitation (in), and weather type

Introduce the Problem

Introducing the Data

Pre-Processing Data

Data Visualization

Storytelling

Impact Session

References & Codes

In this project, the main objective is to analyze how clothing companies can use weather data and weather trends to make better business decisions. This allows companies to judge whether they can maximize their profits based on the weather in different regions of the United States. This project answers the following questions:

Can we determine which weather type is most likely to occur nationwide (by airport code), so that businesses can sell products year-round?
Which weather types correlate more strongly with Precipitation and Severity?
Which cluster is most affected by each weather type?

By answering these questions, clothing companies will have a foundation for estimating inventory and preventing a surplus of stock.

This data was found on Kaggle (https://www.kaggle.com/datasets/sobhanmoosavi/us-weather-events?resource=download). It is a collection of weather reports covering 49 states and recording 7.5 million events. The reports were collected from over 2,000 airport-based weather stations across the United States from 2016 to 2019. The features most important for this project are weather event type (rain, snow, hail, fog, severe-cold, storm, and other precipitation), EventId (primary key), severity, time zone, and state, all of which are used in the visualizations for this project.

The first pre-processing step was to clean the data by filtering out records whose severity is unknown or other, because I wanted to analyze the relationship between the severity of each weather type and the amount of precipitation accumulated in each event. Second, I kept only the features that I found relevant to the objective: EventId, Type, Severity, Precipitation, TimeZone, City, State, and AirportCode.
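The two pre-processing steps above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code actually used for the project. The sample rows are made up, the column names are assumed from the feature list above, and the labels `UNK` and `Other` for the excluded severities are an assumption about how the file encodes them.

```python
import csv
import io

# Hypothetical sample rows; the real file has millions of rows and more columns.
SAMPLE_CSV = """EventId,Type,Severity,StartTime(UTC),Precipitation(in),TimeZone,AirportCode,City,State
W-1,Snow,Light,2016-01-06 23:14:00,0.00,US/Mountain,K04V,Saguache,CO
W-2,Rain,UNK,2016-01-07 04:14:00,0.03,US/Pacific,KSEA,Seattle,WA
W-3,Fog,Severe,2016-01-07 05:54:00,0.00,US/Central,KORD,Chicago,IL
W-4,Precipitation,Other,2016-01-08 05:34:00,0.10,US/Eastern,KJFK,New York,NY
W-5,Rain,Heavy,2016-01-08 13:54:00,0.42,US/Pacific,KPDX,Portland,OR
"""

# Features kept for the analysis (names assumed from the write-up).
KEEP_COLUMNS = [
    "EventId", "Type", "Severity", "Precipitation(in)",
    "TimeZone", "City", "State", "AirportCode",
]

# Assumed labels for the "unknown" and "other" severities.
DROP_SEVERITIES = {"UNK", "Other"}


def preprocess(csv_text):
    """Drop rows with unknown/other severity and keep only the relevant columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Severity"] in DROP_SEVERITIES:
            continue
        cleaned = {col: row[col] for col in KEEP_COLUMNS}
        # Precipitation is read as text, so convert it to a number.
        cleaned["Precipitation(in)"] = float(cleaned["Precipitation(in)"])
        rows.append(cleaned)
    return rows


cleaned_rows = preprocess(SAMPLE_CSV)
for row in cleaned_rows:
    print(row["EventId"], row["Type"], row["Severity"], row["Precipitation(in)"])
```

With a dataframe library the same step would typically be a boolean filter followed by a column selection; the plain-Python version is used here only to keep the sketch self-contained.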

To better understand the data, I charted the number of occurrences of each weather type across the US, which gave me a clearer idea of what to expect from my model's results.
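Counting occurrences per weather type can be sketched in a few lines. The list of event types below is a made-up stand-in for the Type column, and the text bar chart is only an illustration of the idea, not the chart shown in the screenshot.

```python
from collections import Counter

# Hypothetical event types standing in for the Type column of the cleaned data.
event_types = [
    "Rain", "Fog", "Rain", "Snow", "Rain",
    "Fog", "Cold", "Rain", "Snow", "Fog",
]

# Tally how often each weather type occurs.
type_counts = Counter(event_types)

# Print a simple text bar chart, most frequent type first.
for weather_type, count in type_counts.most_common():
    print(f"{weather_type:<6}{'#' * count} {count}")
```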

Screen Shot 2022-12-01 at 5.12.54 PM.png

I wanted to identify whether there was a correlation between the Severity of a weather event and its weather type, between Precipitation and weather type, and between Severity and Precipitation. This allows for a better understanding of the clusters and how they may form once the K-Means model is run.
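One way to put numbers on these relationships is a Pearson correlation. The sketch below assumes an ordinal encoding of Severity (Light < Moderate < Heavy < Severe) and uses a 0/1 indicator for a single weather type, since weather type is a category rather than a number. Both encodings and the sample events are my own assumptions for illustration.

```python
import math

# Assumed ordinal encoding of the Severity labels.
SEVERITY_RANK = {"Light": 1, "Moderate": 2, "Heavy": 3, "Severe": 4}


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical events: (Type, Severity, Precipitation in inches).
events = [
    ("Rain", "Light", 0.02),
    ("Rain", "Moderate", 0.15),
    ("Rain", "Heavy", 0.60),
    ("Snow", "Light", 0.01),
    ("Snow", "Heavy", 0.30),
    ("Fog", "Moderate", 0.00),
    ("Fog", "Severe", 0.00),
]

severity = [SEVERITY_RANK[sev] for _, sev, _ in events]
precip = [p for _, _, p in events]
is_rain = [1 if t == "Rain" else 0 for t, _, _ in events]

print("Severity vs Precipitation:", round(pearson(severity, precip), 3))
print("Rain indicator vs Precipitation:", round(pearson(is_rain, precip), 3))
print("Rain indicator vs Severity:", round(pearson(is_rain, severity), 3))
```

A value near +1 or -1 suggests a strong linear relationship and a value near 0 suggests little linear relationship. Repeating the indicator step for each weather type gives one coefficient per type.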

Screen Shot 2022-12-01 at 5.14.33 PM.png

Screen Shot 2022-12-01 at 5.15.52 PM.png

Screen Shot 2022-12-01 at 5.33.05 PM.png

Based on the data gathered, the model separated the events into 4 clusters. The time zone associated with each cluster and its most frequent weather type are as follows:

cluster 0: US/Central (Fog)

cluster 1: US/Eastern (Cloudy)

cluster 2: US/Mountain (Snow/Fog)

cluster 3: US/Pacific (Rain)
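To illustrate how a summary like the one above can be produced, here is a small K-Means sketch (Lloyd's algorithm) in plain Python. It is not the model used for this project: the features (an assumed severity rank and precipitation), the sample events, and the farthest-point initialisation are all my own choices to keep the example deterministic. After clustering, it tallies the most common time zone and weather type in each cluster.

```python
from collections import Counter

# Hypothetical events: (TimeZone, Type, severity rank, precipitation in inches).
events = [
    ("US/Pacific", "Rain", 1, 0.05),
    ("US/Pacific", "Rain", 1, 0.10),
    ("US/Eastern", "Rain", 1, 0.02),
    ("US/Central", "Fog", 2, 0.00),
    ("US/Central", "Fog", 2, 0.00),
    ("US/Eastern", "Fog", 2, 0.01),
    ("US/Mountain", "Snow", 3, 0.20),
    ("US/Mountain", "Snow", 3, 0.30),
    ("US/Central", "Snow", 3, 0.25),
    ("US/Eastern", "Fog", 4, 0.00),
    ("US/Eastern", "Fog", 4, 0.00),
    ("US/Pacific", "Fog", 4, 0.01),
]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Features are used unscaled here; real features would normally be standardised first.
points = [(float(sev), precip) for _, _, sev, precip in events]
labels, centroids = kmeans(points, k=4)

# For each cluster, find its most common time zone and weather type.
summary = {}
for cluster in sorted(set(labels)):
    members = [e for e, lab in zip(events, labels) if lab == cluster]
    top_zone = Counter(e[0] for e in members).most_common(1)[0][0]
    top_type = Counter(e[1] for e in members).most_common(1)[0][0]
    summary[cluster] = (top_zone, top_type)
    print(f"cluster {cluster}: {top_zone} ({top_type})")
```

The cluster numbers themselves carry no meaning; only the grouping does, so the same data can come out with different cluster indices under a different initialisation.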

From these results, companies in the Pacific region should focus their sales on clothing that customers need most during rainy seasons.

The Mountain time zone shows snow and fog as the two most prevalent weather types in the region. Therefore, companies in this time zone should focus on clothing that keeps customers comfortable in cold weather, such as winter coats, snow shoes, and fleece clothing.

The Central and Eastern time zones experience cloudy weather and fog more often than the rest. Businesses in these areas need to focus on clothing suited to high humidity and wind, such as windbreakers, vests, and ponchos.

Reflecting on the data and the results, I believe several factors could have affected my findings.

The first is the scale of the data: with approximately 1.7 million records of weather events, it was difficult to form and analyze accurate clusters.

The second is the number of clusters. Because K-Means assigns each record to its nearest cluster center, a dataset this large could have benefited from more clusters.
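A common way to check whether more clusters would help is to compare the within-cluster sum of squared distances (often called inertia) across several values of k and look for the point where the improvement levels off. The sketch below does this on made-up one-dimensional precipitation values with a compact K-Means; the data, the helper name `kmeans_inertia`, and the initialisation are assumptions for illustration and were not part of the original analysis.

```python
def kmeans_inertia(values, k, max_iter=100):
    """Run 1-D K-Means and return the within-cluster sum of squared distances."""
    # Deterministic farthest-point initialisation.
    centroids = [values[0]]
    while len(centroids) < k:
        centroids.append(
            max(values, key=lambda v: min(abs(v - c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assign every value to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of its assigned values.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return sum((v - centroids[lab]) ** 2 for v, lab in zip(values, labels))


# Hypothetical precipitation values (inches) with three visible groups.
precipitation = [0.00, 0.01, 0.02, 0.50, 0.52, 0.55, 1.50, 1.55, 1.60]

inertias = {k: kmeans_inertia(precipitation, k) for k in range(1, 6)}
for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.4f}")
```

On this sample the inertia drops sharply up to k=3 and changes little afterwards, which is the kind of "elbow" one would look for when deciding whether 4 clusters were enough for the full dataset.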

Data collected from: https://www.kaggle.com/datasets/sobhanmoosavi/us-weather-events?resource=download