
Yelp & Weather Intelligence Pipeline

End-to-end data engineering pipeline correlating weather patterns with Yelp restaurant sentiment using PySpark, Snowflake, Airflow, and Tableau — processing 10M+ records.

PySpark · Snowflake · Airflow · Tableau · VADER NLP · Python

Problem Statement

Does weather actually affect how people dine out and rate restaurants? To answer this at scale, I built a full data engineering pipeline. It ingests the Yelp Academic Dataset (JSON/CSV) and live weather data from OpenWeatherMap, transforms and joins them in PySpark, warehouses the result in Snowflake, and runs under Airflow orchestration. The end product is a Tableau dashboard that reveals actionable weather–revenue correlations for restaurant operators.

Approach & Methodology

1

Data Ingestion

Pulled the Yelp Academic Dataset (reviews + business metadata, ~8GB JSON) via bulk download and supplemented with OpenWeatherMap historical weather records fetched via paginated REST API calls, covering 15 major US cities over 5 years.
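The paginated fetch can be sketched as a generator that follows a "next page" pointer until none is left. This is an illustration only: the page shape (`records`, `next_page`) and the `fake_fetch` stand-in are hypothetical, since the actual OpenWeatherMap request and response format is not described here.

```python
from typing import Callable, Iterator

# Hypothetical page shape: {"records": [...], "next_page": <int or None>}.
# The real API response layout is an assumption made for this sketch.
Page = dict


def fetch_all(fetch_page: Callable[[int], Page], first_page: int = 1) -> Iterator[dict]:
    """Yield every record by following next_page pointers until exhausted."""
    page_no = first_page
    while page_no is not None:
        page = fetch_page(page_no)
        yield from page["records"]
        page_no = page.get("next_page")


# Stand-in for an HTTP call so the sketch runs offline.
_FAKE_PAGES = {
    1: {"records": [{"city": "austin", "temp_c": 31.0}], "next_page": 2},
    2: {"records": [{"city": "austin", "temp_c": 29.5}], "next_page": None},
}


def fake_fetch(page_no: int) -> Page:
    return _FAKE_PAGES[page_no]


records = list(fetch_all(fake_fetch))
```

Passing the fetch function in as a parameter keeps the paging loop independent of the HTTP client, so retries or rate limiting could be added in one place.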

2

PySpark ETL

Cleaned and deduplicated 10M+ records using distributed PySpark on a 3-node local cluster. Joined review and weather datasets on composite (city_slug, date) keys. Handled schema drift, null imputation, and timezone normalization.
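The composite-key logic behind the join can be shown in plain Python (not the PySpark code itself). The helper names, the sample rows, and the fixed UTC offset are assumptions for illustration; a fixed offset ignores daylight saving time, which a real pipeline would need to handle.

```python
import re
from datetime import datetime, timedelta, timezone


def city_slug(name: str) -> str:
    """Lowercase the name and collapse non-alphanumeric runs into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.strip().lower()).strip("-")


def local_date(ts_utc: datetime, utc_offset_hours: int) -> str:
    """Convert a UTC timestamp to the city's local calendar date (ISO string)."""
    local = ts_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.date().isoformat()


# Hypothetical weather lookup keyed by (city_slug, date).
weather = {("new-york", "2019-01-14"): {"condition": "snow"}}

# A review written at 03:30 UTC belongs to the previous local day in New York.
review = {
    "city": "New York",
    "ts": datetime(2019, 1, 15, 3, 30, tzinfo=timezone.utc),
    "stars": 4,
}

key = (city_slug(review["city"]), local_date(review["ts"], -5))
# Left-join semantics: keep the review even when no weather row matches.
joined = {**review, **weather.get(key, {"condition": None})}
```

Without the timezone step, this review would be matched to weather for 2019-01-15 rather than the day the diner actually experienced.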

3

Sentiment Analysis

Applied VADER NLP to raw review text to produce compound sentiment scores per review. Bucketed scores into positive (≥0.05), neutral, and negative (≤−0.05). Validated a random 2,000-review sample against manual labels, achieving 85%+ accuracy.
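The bucketing thresholds above translate directly into a small function. The sketch below assumes the compound score has already been computed (for example by VADER's `polarity_scores`); the `accuracy` helper is a hypothetical stand-in for the manual-label validation step.

```python
def sentiment_bucket(compound: float) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def accuracy(predicted: list, manual: list) -> float:
    """Share of predictions that agree with the manual labels."""
    if len(predicted) != len(manual) or not manual:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(p == m for p, m in zip(predicted, manual))
    return matches / len(manual)


labels = [sentiment_bucket(score) for score in (0.62, 0.01, -0.40)]
```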

4

Snowflake Data Warehouse

Loaded enriched data into a Snowflake star schema: fact table daily_review_weather (grain: business × date) with dimension tables for weather_condition, business, and calendar. Enabled Automatic Clustering with a date-based clustering key, keeping query latency under 5 seconds.
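To make the fact-table grain concrete, here is a rough sketch of rolling per-review rows up to one row per business per date. It is plain Python rather than the warehouse load itself, and the column names (`review_count`, `avg_stars`, `avg_compound`) are assumptions, not the actual schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-review rows after sentiment scoring.
reviews = [
    {"business_id": "b1", "date": "2019-01-14", "stars": 4, "compound": 0.8},
    {"business_id": "b1", "date": "2019-01-14", "stars": 2, "compound": -0.2},
    {"business_id": "b2", "date": "2019-01-14", "stars": 5, "compound": 0.6},
]


def to_fact_rows(rows: list) -> list:
    """Aggregate reviews to the business x date grain."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["business_id"], r["date"])].append(r)
    return [
        {
            "business_id": business_id,
            "date": date,
            "review_count": len(rs),
            "avg_stars": mean([r["stars"] for r in rs]),
            "avg_compound": mean([r["compound"] for r in rs]),
        }
        for (business_id, date), rs in sorted(groups.items())
    ]


fact_rows = to_fact_rows(reviews)
```

Fixing the grain at business × date means every dashboard metric (volume, stars, sentiment) can be summed or averaged without double counting reviews.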

5

Airflow Orchestration

Scheduled a daily incremental DAG in Apache Airflow: API pull → PySpark transform → VADER scoring → Snowflake upsert. Configured email alerting on task failure and SLA misses. Backfilled 5 years of historical data on first run.
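What makes a daily incremental run safe to retry is that the upsert is idempotent: loading the same day twice overwrites rows instead of duplicating them. The sketch below models that with an in-memory dict keyed on (business_id, date); in Snowflake this would typically be a MERGE statement, and the key choice and row fields here are assumptions.

```python
def upsert(table: dict, rows: list) -> dict:
    """Insert new rows and overwrite existing ones that share the same key."""
    for row in rows:
        table[(row["business_id"], row["date"])] = row
    return table


# Hypothetical warehouse state and one day's batch.
table = {}
day_batch = [
    {"business_id": "b1", "date": "2019-01-14", "review_count": 2},
    {"business_id": "b2", "date": "2019-01-14", "review_count": 1},
]

upsert(table, day_batch)
# A rerun with a late-arriving correction for b1 replaces the old row.
rerun_batch = [{"business_id": "b1", "date": "2019-01-14", "review_count": 3}]
upsert(table, rerun_batch)
```

The same property lets the 5-year backfill and the daily runs share one load path.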

6

Tableau Visualization

Connected Tableau Desktop to Snowflake via native connector. Built 12 interactive dashboards covering: precipitation vs. review volume, temperature vs. star rating distribution, sentiment heatmaps by city and season, and revenue anomaly detection overlaid with weather events.

Architecture

[Architecture diagram: Airflow orchestrates the full flow. Yelp Dataset (6M reviews) and OpenWeatherMap (Historical API) → PySpark ETL (distributed) → VADER NLP (sentiment score) → Snowflake DW (star schema) → Tableau (executive dashboard). Annotations: clustered on date_key + weather_category; VADER compound score → Positive/Neutral/Negative; 10M+ reviews, historical weather aligned by geo + timestamp.]

Results & Impact

10M+

Records Processed

Yelp reviews + weather data

0.71

Peak Sentiment

Freezing weather paradox

< 5s

Query Latency

Snowflake star-schema DW

3.85★

Avg Review Stars

Across all weather types

50.7%

Volume Drop (Rain)

vs. pleasant-day baseline

Discovered the "Cold Weather Sentiment Paradox": Freezing weather drops volume to 101/day but yields the highest average sentiment index (0.71)
Identified Extreme Heat as the major deterrent to dining out, dropping review volume to ~30/day with the lowest sentiment (0.65)
Found that Rainy/Snowy weather causes a 50.7% drop in volume (143/day vs 290/day) but retains a resilient sentiment index identical to pleasant days (0.69)
Generated Regional Penalty Heatmaps highlighting specific cities where weather unfairly skews ratings, isolating weather biases
Designed a Snowflake star-schema data warehouse that answers queries over 10M+ records in under 5 seconds
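The 50.7% figure for rainy/snowy days follows from the two daily volumes quoted above (143/day against a 290/day baseline). A small worked check, with `pct_drop` as a hypothetical helper name:

```python
def pct_drop(baseline: float, observed: float) -> float:
    """Percentage decrease from baseline to observed, rounded to one decimal."""
    return round((baseline - observed) / baseline * 100, 1)


# (290 - 143) / 290 = 147 / 290, roughly 0.5069
rain_drop = pct_drop(290, 143)
```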

The Weather Paradox — Key Findings

Extreme Heat
Volume: 30/day
Sentiment: 0.65

Major deterrent — lowest volume AND sentiment

Pleasant
Volume: 305/day
Sentiment: 0.69

Baseline — highest volume, average sentiment

Rainy / Snowy
Volume: 143/day
Sentiment: 0.69

Volume halved but sentiment holds identical to pleasant

Freezing
Volume: 101/day
Sentiment: 0.71

Cold Weather Paradox: low volume (second only to extreme heat), highest sentiment

Tableau Dashboard

public.tableau.com · Weather-Driven Consumer Experience


Cold Weather Paradox

Volume drops (101/day) but sentiment peaks at 0.71

Extreme Heat Effect

Drops to ~30 reviews/day with lowest sentiment (0.65)

Rain Resilience

50% volume drop (143/day) yet sentiment stays stable at 0.69