Food Recipe Insights

Data analysis and prediction model for food recipe ratings. Final project for EECS 398 at University of Michigan.

Project Overview

We’re analyzing the Food.com recipe and rating dataset to understand what factors influence recipe ratings. The dataset contains 83,782 recipes with user ratings and detailed characteristics about each recipe.

Our analysis focuses on a key question: What factors influence a recipe’s rating on Food.com? This question is particularly relevant in today’s digital age, where home cooks increasingly rely on online recipes and reviews to guide their culinary adventures.

Dataset Overview

Relevant Columns

Our analysis focuses on these key features:

  • minutes: total cooking time
  • n_steps: number of preparation steps
  • n_ingredients: number of ingredients
  • calories, protein_pdv, total_fat_pdv, sodium_pdv: nutritional information (calories and percent daily values)
  • cooking_time_category: binned cooking-time duration
  • avg_rating: the average user rating for each recipe (our prediction target)

Data Cleaning and Exploratory Data Analysis

Dataset Overview

Recipe Complexity Distribution

Cooking Time Impact

Relationship Analysis

Health Metrics Correlation

The nutritional values show strong correlations between certain health metrics: total fat and saturated fat have a strong positive correlation (0.85), while protein and carbohydrates are only weakly related (0.32). This suggests that recipes tend to cluster into distinct nutritional profiles.
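As a rough illustration, here is a minimal pandas sketch of how such a correlation matrix can be computed. The file path, the DataFrame name `recipes`, and the exact nutrition column names (e.g. `saturated_fat_pdv`, `carbs_pdv`) are assumptions for this sketch, not the project's actual code.

```python
import pandas as pd

# Hypothetical load step; the actual file name/path in the project may differ.
recipes = pd.read_csv("recipes_cleaned.csv")

# Nutrition expressed as percent-daily-value columns (names assumed here).
nutrition_cols = ["calories", "total_fat_pdv", "saturated_fat_pdv",
                  "protein_pdv", "carbs_pdv", "sugar_pdv", "sodium_pdv"]

# Pairwise Pearson correlations between the nutritional values; for example,
# corr.loc["total_fat_pdv", "saturated_fat_pdv"] corresponds to the 0.85 above.
corr = recipes[nutrition_cols].corr()
print(corr.round(2))
```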

Key Findings from EDA

Interesting Aggregates

Recipe Ratings by Cooking Time

| Cooking Time Category | Average Rating | Recipe Count | Average Minutes |
|-----------------------|----------------|--------------|-----------------|
| Very Quick (<15m)     | 4.67           | 16,303       | 9.33            |
| Quick (15-30m)        | 4.62           | 20,115       | 24.87           |
| Medium (30-60m)       | 4.61           | 24,570       | 45.68           |
| Long (60-120m)        | 4.63           | 11,840       | 81.71           |
| Very Long (>120m)     | 4.59           | 8,344        | 779.34          |
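Both this table and the complexity table below come from the same kind of groupby aggregation. The sketch below, continuing from the earlier snippet, shows one way to bin cooking time and aggregate; the bin edges are chosen to match the category labels and are assumptions about the exact implementation.

```python
# Bin raw cooking time (minutes) into the five categories used above.
bins = [0, 15, 30, 60, 120, float("inf")]
labels = ["Very Quick (<15m)", "Quick (15-30m)", "Medium (30-60m)",
          "Long (60-120m)", "Very Long (>120m)"]
recipes["cooking_time_category"] = pd.cut(recipes["minutes"], bins=bins, labels=labels)

# Average rating, recipe count, and mean minutes per category.
summary = (recipes.groupby("cooking_time_category", observed=True)
                  .agg(avg_rating=("avg_rating", "mean"),
                       recipe_count=("avg_rating", "size"),
                       avg_minutes=("minutes", "mean")))
print(summary.round(2))
```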

Recipe Complexity Analysis

| Cooking Time Category | Avg Ingredients | Avg Steps | Avg Calories |
|-----------------------|-----------------|-----------|--------------|
| Very Quick (<15m)     | 6.48            | 5.55      | 313.49       |
| Quick (15-30m)        | 8.88            | 9.34      | 375.64       |
| Medium (30-60m)       | 10.09           | 11.49     | 445.81       |
| Long (60-120m)        | 10.97           | 13.14     | 558.00       |
| Very Long (>120m)     | 10.19           | 12.31     | 553.61       |

Missing Value Analysis

| Column                | Missing Count | Missing Percentage |
|-----------------------|---------------|--------------------|
| name                  | 1             | 0.00%              |
| description           | 70            | 0.08%              |
| avg_rating            | 2,609         | 3.11%              |
| cooking_time_category | 1             | 0.00%              |
| time_category         | 1             | 0.00%              |
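A report like the one above can be generated in a few lines of pandas; this is a sketch assuming the same `recipes` DataFrame as in the earlier snippets.

```python
# Per-column missing counts and percentages, keeping only columns
# that have at least one missing value.
missing = recipes.isna().sum()
report = pd.DataFrame({
    "missing_count": missing,
    "missing_pct": (missing / len(recipes) * 100).round(2),
})
print(report[report["missing_count"] > 0]
      .sort_values("missing_count", ascending=False))
```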

Key insights from this data:

  • Quicker recipes earn slightly higher average ratings, though the spread across categories is small (4.59 to 4.67 stars).
  • Recipe complexity grows with cooking time: longer recipes use more ingredients, more steps, and more calories, leveling off for the longest category.
  • Missing values are rare; only avg_rating has a noticeable share missing (3.11%), likely recipes that have not yet received any ratings.

Data Imputation

We chose not to impute missing values in our analysis because:

  1. The share of missing values was small (3.11% for avg_rating and under 0.1% for every other column)
  2. Missing values appeared to be Missing Completely at Random (MCAR)
  3. For our analysis questions, removing recipes with missing values wouldn’t introduce significant bias
  4. Maintaining data authenticity was crucial for accurate rating prediction
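In practice, this means rows without a target value are simply dropped before modeling; a minimal sketch, continuing from the `recipes` DataFrame above:

```python
# Keep only recipes with an observed average rating (our prediction target);
# the handful of missing names/descriptions do not affect the model features.
model_data = recipes.dropna(subset=["avg_rating"]).copy()
print(f"{len(recipes):,} recipes before, {len(model_data):,} after dropping unrated ones")
```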

Prediction Problem

Problem Definition

We aim to predict recipe ratings on a scale of 1-5 based on recipe characteristics that are available when a recipe is first posted. This is framed as a regression problem because the target, avg_rating, is a continuous average of many individual user ratings rather than a discrete class, and we care about how far a prediction is from the true rating, not just which category it falls into.

Model Components

  1. Target Variable: avg_rating
    • Scale: 1-5 stars
    • Represents average user satisfaction with recipe
    • Dataset size: 83,782 recipes with ratings
  2. Selected Features
    • Recipe Characteristics:
      • n_ingredients: Number of ingredients
      • minutes: Cooking time
      • n_steps: Number of preparation steps
    • Nutritional Information:
      • calories: Total calories
      • protein_pdv: Protein (% daily value)
      • total_fat_pdv: Fat (% daily value)
      • sodium_pdv: Sodium (% daily value)
    • Categorical Features:
      • cooking_time_category: Duration category
  3. Feature Selection Rationale: We carefully selected features that would be available at the “time of prediction” (when a recipe is first posted). Specifically excluded:
    • review_count: Not available for new recipes
    • User-generated tags: May not be present initially
    • Submission date: To avoid temporal bias
  4. Evaluation Metric: Mean Squared Error (MSE); a small worked example follows this list
    • Chosen because it:
      • Penalizes larger prediction errors more heavily
      • Has a square root (RMSE) that is interpretable in the same units as ratings
      • Aligns with our goal of accurate numerical predictions
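For concreteness, here is a tiny worked example of the metric with scikit-learn; the numbers are purely illustrative, not results from our data.

```python
from sklearn.metrics import mean_squared_error

# Illustrative ratings only, not values from the dataset.
y_true = [4.5, 3.0, 5.0, 4.0]
y_pred = [4.2, 3.5, 4.8, 4.1]

mse = mean_squared_error(y_true, y_pred)  # average of squared errors (~0.098)
rmse = mse ** 0.5                         # back in rating units, i.e. stars (~0.31)
print(round(mse, 3), round(rmse, 3))
```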

Baseline Model

Model Description

We developed a Random Forest Regression model to predict recipe ratings. Our model processes both quantitative and nominal features through a preprocessing pipeline that includes scaling and encoding steps.
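A hedged sketch of what such a pipeline can look like with scikit-learn, continuing from `model_data` above; the exact feature subset, train/test split, and hyperparameters shown here are assumptions rather than the project's exact configuration.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["minutes", "n_ingredients"]    # quantitative features
categorical = ["cooking_time_category"]   # nominal feature

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),                            # scale quantitative features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),  # encode nominal feature
])

baseline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    model_data[numeric + categorical], model_data["avg_rating"],
    test_size=0.2, random_state=42)

baseline.fit(X_train, y_train)
mse = mean_squared_error(y_test, baseline.predict(X_test))
print(f"Baseline test MSE: {mse:.3f}")
```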

Feature Composition

Data Processing

Model Performance

Our Random Forest Regressor achieved the following metrics:

Feature Importance Analysis

  1. Cooking time (minutes): 58.25%
  2. Number of ingredients: 40.71%
  3. Cooking time categories: ~1% combined
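These percentages come from the fitted forest's impurity-based importances; a sketch of how such a breakdown can be extracted from the pipeline sketched above:

```python
# Map the transformed feature names back to their importance scores.
feature_names = baseline.named_steps["preprocess"].get_feature_names_out()
importances = baseline.named_steps["model"].feature_importances_
for name, score in sorted(zip(feature_names, importances), key=lambda t: -t[1]):
    print(f"{name}: {score:.2%}")
```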

Model Assessment

The current model’s performance leaves clear room for improvement.

This baseline provides a useful starting point, but it needs enhancement through engineered features that capture recipe complexity and nutritional balance, along with hyperparameter tuning, both of which we pursue in the final model below.

Final Model

Feature Engineering

We introduced several carefully engineered features designed to capture key aspects of recipes that could influence ratings:

  1. Recipe Complexity Metrics
    • steps_per_ingredient: Measures recipe intricacy by relating number of steps to ingredients
    • time_per_step: Indicates how much time each step takes on average
    • time_per_ingredient: Shows time investment per ingredient

Rationale: Complex recipes with efficient time usage may be better documented and tested, potentially leading to higher user satisfaction.
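A minimal sketch of how these ratios can be derived with pandas, continuing from `model_data`; the small epsilon guarding against division by zero is an implementation detail assumed here.

```python
# Recipe complexity ratios; eps avoids division by zero for degenerate rows.
eps = 1e-6
model_data["steps_per_ingredient"] = model_data["n_steps"] / (model_data["n_ingredients"] + eps)
model_data["time_per_step"] = model_data["minutes"] / (model_data["n_steps"] + eps)
model_data["time_per_ingredient"] = model_data["minutes"] / (model_data["n_ingredients"] + eps)
```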

  2. Nutritional Balance Features
    • protein_fat_ratio: Relationship between protein and fat content
    • health_score: Weighted combination of nutritional elements (protein, saturated fat, sugar, sodium)

Rationale: Recipes with balanced nutritional profiles may better meet user expectations for both taste and health.
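A corresponding sketch for the nutritional features; the specific weights in health_score (protein rewarded; saturated fat, sugar, and sodium penalized) and the nutrition column names are illustrative assumptions, not the project's exact formula.

```python
# Nutritional balance features (column names and weights are assumptions).
model_data["protein_fat_ratio"] = model_data["protein_pdv"] / (model_data["total_fat_pdv"] + eps)
model_data["health_score"] = (model_data["protein_pdv"]
                              - 0.5 * model_data["saturated_fat_pdv"]
                              - 0.5 * model_data["sugar_pdv"]
                              - 0.5 * model_data["sodium_pdv"])
```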

Model Selection and Optimization

Algorithm: Random Forest Regressor

Hyperparameter Optimization:
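A hedged sketch of how this tuning step might look with cross-validated grid search over the same pipeline structure (extended with the engineered features); the parameter grid below is hypothetical, not the grid we actually searched.

```python
from sklearn.model_selection import GridSearchCV

# Hypothetical search space; the values actually searched may differ.
param_grid = {
    "model__n_estimators": [100, 200, 400],
    "model__max_depth": [None, 10, 20],
    "model__min_samples_leaf": [1, 2, 5],
}

search = GridSearchCV(baseline, param_grid, scoring="neg_mean_squared_error",
                      cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```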

Feature Importance

Top predictive features:

  1. health_score (23.1%)
  2. protein_fat_ratio (21.1%)
  3. time_per_step (14.0%)
  4. steps_per_ingredient (13.2%)
  5. time_per_ingredient (13.2%)

Performance Improvement

The final model showed improvements over the baseline.

The gains, while modest in RMSE, include a significant improvement in R², suggesting our engineered features better capture the factors that influence recipe ratings. The dominance of nutritional and complexity metrics in the feature-importance rankings validates our feature-engineering approach.

Note: All features used are available at recipe creation time, ensuring our model can make predictions for new recipes before they receive any ratings.