Yelp & Weather Intelligence Pipeline

End-to-end data engineering pipeline correlating weather patterns with Yelp restaurant sentiment using PySpark, Snowflake, Airflow, and Tableau — processing 2M+ records.

PySparkSnowflakeAirflowTableauVADER NLPPython

View on GitHub

Problem Statement

Does weather actually affect how people dine out and rate restaurants? To answer this at scale, I built a full data engineering pipeline ingesting the Yelp Academic Dataset (JSON/CSV) and live weather data from OpenWeatherMap, transforming and joining them in PySpark, warehousing in Snowflake, and orchestrating the pipeline with Airflow — culminating in a Tableau dashboard that reveals actionable weather–revenue correlations for restaurant operators.

Results & Impact

2M+

Records Processed

Yelp reviews + weather data

0.71

Peak Sentiment

Freezing weather paradox

< 5s

Query Latency

Snowflake star-schema DW

3.85★

Avg Review Stars

Across all weather types

50.7%

Volume Drop (Rain)

vs. pleasant-day baseline

Discovered the "Cold Weather Sentiment Paradox": Freezing weather drops volume to 101/day but yields the highest average sentiment index (0.71)

Identified Extreme Heat as the major deterrent to dining out, dropping review volume to ~30/day with the lowest sentiment (0.65)

Found that Rainy/Snowy weather causes a 50.7% drop in volume (143/day vs 290/day) but retains a resilient sentiment index identical to pleasant days (0.69)

Generated Regional Penalty Heatmaps highlighting specific cities where weather unfairly skews ratings, isolating weather biases

Designed a Snowflake star-schema data warehouse with sub-5-second query latency processing 2M+ records

Tableau Dashboard

Live · Interactive Dashboard

Open in Tableau Public

public.tableau.com · Weather-Driven Consumer Experience

Loading dashboard…

Cold Weather Paradox

Volume drops (101/day) but sentiment peaks at 0.71

Extreme Heat Effect

Drops to ~30 reviews/day with lowest sentiment (0.65)

Rain Resilience

50% volume drop (143/day) yet sentiment stays stable at 0.69

Approach & Methodology

Data Ingestion

Pulled the Yelp Academic Dataset (reviews + business metadata, ~8GB JSON) via bulk download and supplemented with OpenWeatherMap historical weather records fetched via paginated REST API calls, covering 15 major US cities over 5 years.

PySpark ETL

Cleaned and deduplicated 2M+ records using distributed PySpark on a 3-node local cluster. Joined review and weather datasets on composite (city_slug, date) keys. Handled schema drift, null imputation, and timezone normalization.

Sentiment Analysis

Applied VADER NLP to raw review text to produce compound sentiment scores per review. Bucketed scores into positive (≥0.05), neutral, and negative (≤−0.05). Validated a random 2,000-review sample against manual labels, achieving 85%+ accuracy.

Snowflake Data Warehouse

Loaded enriched data into a Snowflake star schema: fact table daily_review_weather (grain: business × date) with dimension tables for weather_condition, business, and calendar. Applied auto-clustering on date partitions for sub-5s query latency.

Airflow Orchestration

Scheduled a daily incremental DAG in Apache Airflow: API pull → PySpark transform → VADER scoring → Snowflake upsert. Configured email alerting on task failure and SLA misses. Backfilled 5 years of historical data on first run.

Tableau Visualization

Connected Tableau Desktop to Snowflake via native connector. Built 12 interactive dashboards covering: precipitation vs. review volume, temperature vs. star rating distribution, sentiment heatmaps by city and season, and revenue anomaly detection overlaid with weather events.

Architecture

architecture-diagram

The Weather Paradox — Key Findings

Extreme Heat

Volume30/day

Sentiment0.65

Major deterrent — lowest volume AND sentiment

Pleasant

Volume305/day

Sentiment0.69

Baseline — highest volume, average sentiment

Rainy / Snowy

Volume143/day

Sentiment0.69

Volume halved but sentiment holds identical to pleasant

Freezing

Volume101/day

Sentiment0.71

Cold Weather Paradox — lowest volume, highest sentiment

Back to Projects GitHub