Yelp & Weather Intelligence Pipeline
End-to-end data engineering pipeline correlating weather patterns with Yelp restaurant sentiment using PySpark, Snowflake, Airflow, and Tableau — processing 10M+ records.
Problem Statement
Does weather actually affect how people dine out and rate restaurants? To answer this at scale, I built a full data engineering pipeline ingesting the Yelp Academic Dataset (JSON/CSV) and live weather data from OpenWeatherMap, transforming and joining them in PySpark, warehousing in Snowflake, and orchestrating the pipeline with Airflow — culminating in a Tableau dashboard that reveals actionable weather–revenue correlations for restaurant operators.
Approach & Methodology
Data Ingestion
Pulled the Yelp Academic Dataset (reviews + business metadata, ~8GB JSON) via bulk download and supplemented with OpenWeatherMap historical weather records fetched via paginated REST API calls, covering 15 major US cities over 5 years.
PySpark ETL
Cleaned and deduplicated 10M+ records using distributed PySpark on a 3-node local cluster. Joined review and weather datasets on composite (city_slug, date) keys. Handled schema drift, null imputation, and timezone normalization.
Sentiment Analysis
Applied VADER NLP to raw review text to produce compound sentiment scores per review. Bucketed scores into positive (≥0.05), neutral, and negative (≤−0.05). Validated a random 2,000-review sample against manual labels, achieving 85%+ accuracy.
Snowflake Data Warehouse
Loaded enriched data into a Snowflake star schema: fact table daily_review_weather (grain: business × date) with dimension tables for weather_condition, business, and calendar. Applied auto-clustering on date partitions for sub-5s query latency.
Airflow Orchestration
Scheduled a daily incremental DAG in Apache Airflow: API pull → PySpark transform → VADER scoring → Snowflake upsert. Configured email alerting on task failure and SLA misses. Backfilled 5 years of historical data on first run.
Tableau Visualization
Connected Tableau Desktop to Snowflake via native connector. Built 12 interactive dashboards covering: precipitation vs. review volume, temperature vs. star rating distribution, sentiment heatmaps by city and season, and revenue anomaly detection overlaid with weather events.
Architecture
Results & Impact
10M+
Records Processed
Yelp reviews + weather data
0.71
Peak Sentiment
Freezing weather paradox
< 5s
Query Latency
Snowflake star-schema DW
3.85★
Avg Review Stars
Across all weather types
50.7%
Volume Drop (Rain)
vs. pleasant-day baseline
The Weather Paradox — Key Findings
Major deterrent — lowest volume AND sentiment
Baseline — highest volume, average sentiment
Volume halved but sentiment holds identical to pleasant
Cold Weather Paradox — lowest volume, highest sentiment
Tableau Dashboard
Loading dashboard…
Cold Weather Paradox
Volume drops (101/day) but sentiment peaks at 0.71
Extreme Heat Effect
Drops to ~30 reviews/day with lowest sentiment (0.65)
Rain Resilience
50% volume drop (143/day) yet sentiment stays stable at 0.69