House Price Prediction: Regression Modeling with the Ames Housing Dataset

The Ames Housing Dataset, curated by Dean De Cock, is a cornerstone dataset for the Kaggle competition “House Prices - Advanced Regression Techniques.” With 81 columns spanning numerical and categorical variables, it captures detailed characteristics of residential homes in Ames, Iowa. The goal is to predict SalePrice, a continuous target variable present in the training set. In this project I performed exploratory data analysis (EDA), feature engineering, and handling of missing data and outliers; checked the regression assumptions; and applied multiple linear regression, Ridge regression, and Lasso regression models. I also discuss the theoretical underpinnings and assumptions of these models. Because an app built on dozens of features is hard to use, I selected a few features with a strong relationship to the target variable SalePrice, fit a linear regression on them, and deployed it as a Streamlit app.

Why This Approach?

The dataset’s complexity (81 features, missing values, skewed distributions, and potential multicollinearity) demands a systematic approach. My strategy emphasizes thorough EDA, careful treatment of missing values and outliers, transformations to correct skewness, and regularization to control multicollinearity and overfitting.

This approach balances predictive accuracy with interpretability of the variables and usability of the models.

Theoretical Background and Regression Assumptions

Linear regression models rely on several key assumptions:

Figure: diagram of the assumptions required for a valid linear regression model.
  1. Linearity: The relationship between predictors and the target variable is linear.
  2. Independence: Observations are independent of each other.
  3. Homoscedasticity: Constant variance of residuals across predictor values.
  4. Normality: Residuals are normally distributed.
  5. No Multicollinearity: Predictors are not highly correlated with each other.

Violations of these assumptions can lead to biased or inefficient models. For instance, skewed target variables violate normality, and multicollinearity can inflate variance in coefficient estimates. I addressed these through transformations, feature selection, and regularization techniques like Lasso.
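The snippet below is a minimal sketch (not the notebook's exact code) of two visual checks used throughout this project: residuals versus fitted values for homoscedasticity, and a Q-Q plot for normality of residuals. The `fitted` and `resid` arrays are assumed to come from any fitted regression model.

```python
# Hedged sketch: visual checks of homoscedasticity and residual normality.
import matplotlib.pyplot as plt
import scipy.stats as stats

def residual_diagnostics(fitted, resid):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Homoscedasticity: residuals should scatter evenly around zero.
    ax1.scatter(fitted, resid, alpha=0.5)
    ax1.axhline(0, color="red", linestyle="--")
    ax1.set_xlabel("Fitted values")
    ax1.set_ylabel("Residuals")
    ax1.set_title("Residuals vs fitted")

    # Normality: points should follow the diagonal if residuals are ~normal.
    stats.probplot(resid, dist="norm", plot=ax2)
    ax2.set_title("Normal Q-Q plot of residuals")

    plt.tight_layout()
    plt.show()
```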

Exploratory Data Analysis

EDA is critical for understanding the dataset’s structure and identifying key predictors. The dataset comprises 81 columns, including numerical (e.g., GrLivArea, TotalBsmtSF), ordinal, and nominal (e.g., Neighborhood, GarageType) features.

Key observations (see the figures below): SalePrice is right-skewed, as its distribution and Q-Q plots show, and numerical predictors such as GrLivArea, TotalBsmtSF, OverallQual, and TotRmsAbvGrd show strong positive relationships with SalePrice, which the correlation heatmap of the most important features confirms.

Figures: SalePrice distribution; SalePrice Q-Q plot; SalePrice vs GrLivArea; SalePrice vs TotalBsmtSF; SalePrice vs OverallQual; SalePrice vs TotRmsAbvGrd; correlation heatmap of the most important features.
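The following is a minimal sketch of the kind of EDA summarized above, assuming the Kaggle train.csv is available locally (the file name and number of heatmap features are assumptions, not taken from the notebook).

```python
# Hedged sketch: target distribution, one scatter plot, and a correlation heatmap.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")

# Target distribution and skewness.
sns.histplot(train["SalePrice"], kde=True)
plt.title("SalePrice distribution")
plt.show()
print("Skewness:", train["SalePrice"].skew())

# Scatter plot of one strong numerical predictor against the target.
train.plot.scatter(x="GrLivArea", y="SalePrice", alpha=0.5)
plt.show()

# Correlation heatmap of the features most correlated with SalePrice.
corr = train.corr(numeric_only=True)
top = corr["SalePrice"].abs().sort_values(ascending=False).head(10).index
sns.heatmap(train[top].corr(), annot=True, fmt=".2f", cmap="viridis")
plt.show()
```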

Handling Missing Data

Missing data was prevalent, with some features missing over 15% of their values. The table below summarizes the number and percentage of missing values for each affected feature, and a pandas sketch for producing it follows.

Missing values in the dataset:

              Total   Percent
PoolQC         1453  0.995205
MiscFeature    1406  0.963014
Alley          1369  0.937671
Fence          1179  0.807534
MasVnrType      872  0.597260
FireplaceQu     690  0.472603
LotFrontage     259  0.177397
GarageYrBlt      81  0.055479
GarageCond       81  0.055479
GarageType       81  0.055479
GarageFinish     81  0.055479
GarageQual       81  0.055479
BsmtExposure     38  0.026027
BsmtFinType2     38  0.026027
BsmtCond         37  0.025342
BsmtQual         37  0.025342
BsmtFinType1     37  0.025342
MasVnrArea        8  0.005479
Electrical        1  0.000685
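A summary like the one above can be produced with a few lines of pandas; this is a minimal sketch assuming the training data is loaded from train.csv, not necessarily the notebook's exact code.

```python
# Hedged sketch: count and share of missing values per feature.
import pandas as pd

train = pd.read_csv("train.csv")

total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum() / train.isnull().count()).sort_values(ascending=False)
missing = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
print(missing[missing["Total"] > 0])
```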

Outlier Detection and Removal

Outliers can distort regression models. I identified two extreme outliers in GrLivArea (above 4000 square feet but with unusually low SalePrice) and removed them, reducing the training set to 1456 rows. Scatter plots of the standardized data (using StandardScaler) confirmed their impact.

Figure: GrLivArea vs SalePrice after removing the two outliers.
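A minimal sketch of this step is below; the SalePrice threshold is an assumption chosen to match the description (large GrLivArea, low SalePrice), not a value taken from the notebook.

```python
# Hedged sketch: drop the two very large, low-priced GrLivArea points.
import pandas as pd

train = pd.read_csv("train.csv")

outliers = train[(train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)]
train = train.drop(outliers.index)
print(f"Dropped {len(outliers)} rows; {len(train)} rows remain.")
```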

Feature Engineering and Transformations

To capture non-linear relationships and correct skewness, I applied log transformations to the target and to skewed numerical features, as illustrated in the figures and the sketch below.

Figures: SalePrice distribution after the log transform; GrLivArea after the log transform; TotalBsmtSF vs log-transformed SalePrice.
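The sketch below illustrates these transformations under the assumption that plain log is used for strictly positive columns and log1p for TotalBsmtSF, which can be zero for houses without a basement; the notebook's exact handling may differ.

```python
# Hedged sketch: log transforms of the target and two skewed features.
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")

train["SalePrice"] = np.log(train["SalePrice"])       # strictly positive target
train["GrLivArea"] = np.log(train["GrLivArea"])       # strictly positive feature
train["TotalBsmtSF"] = np.log1p(train["TotalBsmtSF"]) # may be zero, so log1p
```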

Multicollinearity Check (VIF)

VIF values:
       Feature        VIF
0    Intercept  21.075941
1  OverallQual   2.122960
2    GrLivArea   1.605156
3   GarageCars   1.670124
4  TotalBsmtSF   1.477056

The VIF results show that all selected features have values well below 5, indicating low multicollinearity; the intercept's large VIF is expected and is not part of this check.
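A minimal sketch of this check using statsmodels is shown below; the intercept column is added so the feature VIFs are computed correctly (statsmodels names it "const" rather than "Intercept").

```python
# Hedged sketch: variance inflation factors for the selected features.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

train = pd.read_csv("train.csv")
features = ["OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF"]
X = sm.add_constant(train[features])  # adds an intercept column named "const"

vif = pd.DataFrame({
    "Feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif)
```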

Model Fitting

I implemented three models: multiple linear regression, Ridge regression, and Lasso regression. For better usability and interpretability, I first fit a multiple linear regression using the selected features above, then fit Ridge and Lasso using all usable variables and compare them.

Multiple Linear Regression

Figures: residual plots and actual vs predicted SalePrice (log scale).

This model is deployed as a Streamlit app; you can try it here: Open in Streamlit
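The sketch below illustrates the shape of this model: ordinary least squares on the four selected features with a log-transformed target, persisted for the app. The split, file name, and preprocessing details are assumptions, not the deployed code.

```python
# Hedged sketch: multiple linear regression on the selected features.
import numpy as np
import pandas as pd
import joblib
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")
features = ["OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF"]

X = train[features]
y = np.log(train["SalePrice"])  # model the log of the target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression().fit(X_tr, y_tr)
pred = lr.predict(X_te)
print("Test R^2:", r2_score(y_te, pred))
print("Test RMSE (log scale):", np.sqrt(mean_squared_error(y_te, pred)))

# Persist the model for the Streamlit app (file name is hypothetical).
joblib.dump(lr, "house_price_model.joblib")
```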

Ridge Regression

Linear regression with Ridge regularization

Figure: coefficients in the Ridge model.

Lasso Regression

Linear regression with Lasso regularization

Figure: coefficients in the Lasso model.
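The following is a minimal sketch of how the two regularized models can be fit and compared; the feature set (all numerical columns), the simple zero-fill of missing values, and the alpha grids are illustrative assumptions rather than the notebook's exact configuration.

```python
# Hedged sketch: cross-validated Ridge and Lasso on standardized features.
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")
X = train.select_dtypes(include=np.number).drop(columns=["Id", "SalePrice"]).fillna(0)
y = np.log(train["SalePrice"])

ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 50)))
lasso = make_pipeline(StandardScaler(), LassoCV(alphas=np.logspace(-4, 1, 50), max_iter=10000))

ridge.fit(X, y)
lasso.fit(X, y)

# Lasso drives some coefficients exactly to zero (implicit feature selection).
lasso_coefs = pd.Series(lasso[-1].coef_, index=X.columns)
print("Features kept by Lasso:", (lasso_coefs != 0).sum(), "of", len(lasso_coefs))
```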

Model Performance Summary

Model   RMSE Train  RMSE Test  R² Train  R² Test  Adjusted R² Train  Adjusted R² Test
Linear  0.007       0.016      0.955     0.899    0.934              0.625
Ridge   0.009       0.012      0.940     0.922    0.912              0.711
Lasso   0.010       0.011      0.939     0.928    0.911              0.733
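The helper below is a minimal sketch of how metrics of this kind can be computed for one model, assuming true and predicted targets on the same (log) scale and the number of predictors used by the model.

```python
# Hedged sketch: RMSE, R^2, and adjusted R^2 for a set of predictions.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def regression_report(y_true, y_pred, n_features):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    # Adjusted R^2 penalizes R^2 for the number of predictors.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return rmse, r2, adj_r2
```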

Conclusion

The Ames Housing Dataset posed challenges such as missing data, skewed distributions, and multicollinearity, which I addressed through systematic EDA, feature engineering, and transformations. Ridge and Lasso regression outperformed multiple linear regression thanks to regularization (and, in Lasso's case, implicit feature selection), with Lasso achieving a test R² of 0.928. This project underscores the importance of validating regression assumptions and tailoring preprocessing to the dataset’s characteristics. For the full code, refer to the notebook.

View Notebook
