Tutorial: Predicting House Prices in Sydney with Machine Learning
What you’ll get: a rigorous, reproducible pipeline that goes from data acquisition (open sources), through exploratory data analysis and geospatial feature engineering (distance to beaches, train stations, CBD), to modeling with XGBoost / LightGBM, evaluation, and interpretation (SHAP). All code examples are ready to copy into a Jupyter notebook.
TL;DR
- Use an open dataset (Kaggle sample or NSW Valuer-General exports).
- Clean & geocode (use lat/lon if present; otherwise geocode addresses).
- Create geospatial features: distance to nearest train station, distance to nearest beach/coastline, distance to Sydney CBD, counts of amenities (optional).
- Train XGBoost and LightGBM with a scikit-learn pipeline, cross-validation, and SHAP for interpretation.
- Evaluate on RMSE, MAE, and R²; produce scatter & residual plots.
Legal & ethical note about scraping
Public listing websites (Domain, realestate.com.au, etc.) may restrict scraping in their Terms of Service, and scraping can carry legal and privacy implications. For tutorial or production work, prefer open official exports such as the NSW Valuer-General property sales CSVs, AURIN, or data.gov.au datasets. If you must scrape, check robots.txt and the site's terms, rate-limit your requests, and prefer APIs or licensed feeds where available.
1. Which dataset to use (practical picks)
- Kaggle: "Sydney House Prices — Greater Sydney" — good for reproducible teaching notebooks (often contains lat/lon & basic attributes).
- NSW Valuer-General / NSW Government property sales files — authoritative transaction-level CSVs for production-grade analysis.
- AURIN / data.gov.au — useful aggregated or enriched datasets (e.g., environmental, transport).
Throughout the tutorial, assume a CSV with the columns: price, date, property_type, bedrooms, bathrooms, car_spaces, land_area, building_area, address, suburb, postcode, latitude, longitude.
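If your export has addresses but no coordinates, geocode them once up front and cache the result. Below is a minimal sketch using geopy's Nominatim wrapper (pip install geopy; it is not in the install line of section 2). The user_agent string is a placeholder you must set, and Nominatim's usage policy allows roughly one request per second, hence the rate limiter.
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="sydney-prices-tutorial")  # set your own identifier
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # honour the rate limit

def add_latlon(df):
    """Fill latitude/longitude by geocoding 'address, suburb NSW postcode'."""
    query = df['address'] + ', ' + df['suburb'] + ' NSW ' + df['postcode'].astype(str)
    located = query.apply(geocode)
    df['latitude'] = [loc.latitude if loc else None for loc in located]
    df['longitude'] = [loc.longitude if loc else None for loc in located]
    return df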
2. Environment & libraries
pip install pandas numpy geopandas shapely scikit-learn xgboost lightgbm osmnx folium matplotlib seaborn shap jupyterlab
3. Workflow
Acquire → Clean/Geocode → EDA → Geospatial feature engineering → Model pipeline → Evaluate → Interpret → Deploy
4. Code: Full notebook outline
import numpy as np, pandas as pd, geopandas as gpd
from shapely.geometry import Point
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb, lightgbm as lgb, shap, osmnx as ox
import matplotlib.pyplot as plt, seaborn as sns
4.1 Load and clean
df = pd.read_csv("data/sydney_sales_sample.csv", parse_dates=['date'])
df = df[df['price'].notnull()]
# Strip "$" and "," before casting; the raw string avoids an invalid-escape warning.
df['price'] = df['price'].astype(str).str.replace(r'[\$,]', '', regex=True).astype(float)
df = df[df['price'] > 10000]  # drop token/peppercorn transfers
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
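Sales exports often contain repeated transactions and data-entry outliers, so a light de-duplication and sanity pass is worth adding here; the bounds below are illustrative, not taken from any particular dataset:
# Drop exact duplicate transactions (same address, sale date, and price).
df = df.drop_duplicates(subset=['address', 'date', 'price'])
# Illustrative sanity bounds; tune them against your own data's distributions.
df = df[df['bedrooms'].isna() | df['bedrooms'].between(0, 10)]
df = df[df['land_area'].isna() | df['land_area'].between(10, 20000)]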
4.2 Quick EDA
sns.histplot(np.log1p(df['price']), bins=60)
plt.title('Log Price Distribution')
plt.show()
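Beyond the price histogram, a quick correlation check across the numeric attributes is cheap and often catches data problems early (column names as assumed in section 1):
num_eda = ['price', 'bedrooms', 'bathrooms', 'car_spaces', 'land_area', 'building_area']
sns.heatmap(df[num_eda].corr(), annot=True, fmt='.2f', cmap='vlag')
plt.title('Correlation between price and numeric attributes')
plt.show()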
5. Geospatial feature engineering
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.longitude, df.latitude),
                       crs="EPSG:4326").to_crs(epsg=3577)  # GDA94 / Australian Albers: metres
# Query area: convex hull of the sales, buffered ~0.2 deg (~20 km) so coastal
# amenities just outside the data footprint are still captured.
area_polygon = gdf.to_crs(epsg=4326).geometry.unary_union.convex_hull.buffer(0.2)
# OSMnx >= 1.3 renamed geometries_from_polygon to features_from_polygon.
tags_station = {'railway': ['station', 'halt', 'tram_stop']}
stations = ox.features_from_polygon(area_polygon, tags_station)[['geometry']].to_crs(epsg=3577)
tags_beach = {'natural': 'beach'}
beaches = ox.features_from_polygon(area_polygon, tags_beach)[['geometry']].to_crs(epsg=3577)
from scipy.spatial import cKDTree
# Properties are points; stations may come back as points or polygons, so use
# centroids to get a uniform coordinate array before building the KD-tree.
pts_prop = np.vstack([gdf.geometry.x, gdf.geometry.y]).T
pts_station = np.vstack([stations.geometry.centroid.x, stations.geometry.centroid.y]).T
tree = cKDTree(pts_station)
dists, _ = tree.query(pts_prop, k=1)  # nearest-station distance in metres
gdf['dist_to_station_m'] = dists
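The model below also uses dist_to_beach_m and dist_to_cbd_m (the latter promised in the TL;DR). Beaches come back from OSM mostly as polygons, so reduce them to centroids as well; for the CBD, a single fixed point suffices (the coordinates below are the commonly used centre of Sydney, near Town Hall):
# Nearest-beach distance, reusing the property coordinate array from above.
pts_beach = np.vstack([beaches.geometry.centroid.x, beaches.geometry.centroid.y]).T
dists_beach, _ = cKDTree(pts_beach).query(pts_prop, k=1)
gdf['dist_to_beach_m'] = dists_beach
# Distance to the CBD: one fixed point, projected into the same metric CRS.
cbd = gpd.GeoSeries([Point(151.2093, -33.8688)], crs="EPSG:4326").to_crs(epsg=3577).iloc[0]
gdf['dist_to_cbd_m'] = gdf.geometry.distance(cbd)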
6. Modeling (XGBoost / LightGBM)
features = ['bedrooms','bathrooms','car_spaces','land_area','building_area',
            'dist_to_station_m','dist_to_beach_m','dist_to_cbd_m','year','month',
            'property_type','suburb']
y = np.log1p(gdf['price'])
X = gdf[features]
num_cols = ['bedrooms','bathrooms','car_spaces','land_area','building_area',
            'dist_to_station_m','dist_to_beach_m','dist_to_cbd_m','year','month']
cat_cols = ['property_type','suburb']
num_pipe = Pipeline([('imp', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
cat_pipe = Pipeline([('imp', SimpleImputer(strategy='constant', fill_value='missing')), ('ohe', OneHotEncoder(handle_unknown='ignore'))])
pre = ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)])
model = xgb.XGBRegressor(n_estimators=1000, learning_rate=0.05, max_depth=6, subsample=0.8, random_state=42)
pipe = Pipeline([('pre', pre), ('model', model)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipe.fit(X_train, y_train)
pred = np.expm1(pipe.predict(X_test))
true = np.expm1(y_test)
rmse = np.sqrt(mean_squared_error(true, pred))  # portable; squared=False was removed in newer scikit-learn
mae = mean_absolute_error(true, pred)
r2 = r2_score(true, pred)
print(f"RMSE: {rmse:,.0f}, MAE: {mae:,.0f}, R²: {r2:.3f}")
7. Interpretation (SHAP)
X_test_enc = pipe.named_steps['pre'].transform(X_test)
if hasattr(X_test_enc, 'toarray'):  # densify if OneHotEncoder produced sparse output
    X_test_enc = X_test_enc.toarray()
feat_names = pipe.named_steps['pre'].get_feature_names_out()
shap_values = shap.TreeExplainer(pipe.named_steps['model'])(X_test_enc)
shap.summary_plot(shap_values.values, features=X_test_enc, feature_names=list(feat_names))
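To drill into one engineered feature, a dependence plot shows SHAP value against the raw feature. The num__ prefix comes from the ColumnTransformer's transformer names ('num', 'cat') via get_feature_names_out():
shap.dependence_plot('num__dist_to_station_m', shap_values.values,
                     X_test_enc, feature_names=list(feat_names))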
8. Tips
- Log-transform price.
- Use temporal splits if forecasting.
- Use spatial CV for honest evaluation (see the sketch after this list).
- Cache OSM data for reproducibility.
- Save the full pipeline (preprocessing + model) with joblib.dump().
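On the spatial-CV tip: a cheap approximation is to group folds by suburb so every suburb lands entirely in train or entirely in test. The sketch below uses GroupKFold; blocked or buffered spatial folds are stricter if you need them.
from sklearn.model_selection import GroupKFold

spatial_rmse = -cross_val_score(pipe, X, y, groups=X['suburb'],
                                cv=GroupKFold(n_splits=5),
                                scoring='neg_root_mean_squared_error')
print(f"Suburb-grouped CV RMSE (log-price): {spatial_rmse.mean():.3f}")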
9. Extensions
- Add public transport travel time (GTFS).
- Add school quality and amenity scores.
- Visualize predicted prices on a Folium map (sketch below).
- Deploy with Streamlit or FastAPI.
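For the Folium extension, a minimal sketch assuming the gdf, X_test, and pred objects from sections 5-6; the $1.5M colour threshold is purely illustrative:
import folium

m = folium.Map(location=[-33.87, 151.21], zoom_start=11)  # centred on Sydney
sample = gdf.to_crs(epsg=4326).loc[X_test.index]  # back to lat/lon for web maps
for (_, row), p in zip(sample.head(500).iterrows(), pred[:500]):
    folium.CircleMarker(location=[row.geometry.y, row.geometry.x], radius=3,
                        color='red' if p > 1_500_000 else 'blue',
                        tooltip=f"${p:,.0f}").add_to(m)
m.save('predicted_prices.html')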
References
- NSW Valuer-General Property Sales Data (data.nsw.gov.au)
- Kaggle: Sydney House Prices — Greater Sydney
- OSMnx library docs for geospatial queries
- SHAP documentation for model explainability