Tutorial: Predicting House Prices in Sydney with Machine Learning
What you’ll get: a rigorous, reproducible pipeline that goes from data acquisition (open sources), through exploratory data analysis and geospatial feature engineering (distance to beaches, train stations, CBD), to modeling with XGBoost / LightGBM, evaluation, and interpretation (SHAP). All code examples are ready to copy into a Jupyter notebook.
TL;DR
- Use an open dataset (Kaggle sample or NSW Valuer-General exports).
- Clean & geocode (use lat/lon if present; otherwise geocode addresses).
- Create geospatial features: distance to nearest train station, distance to nearest beach/coastline, distance to Sydney CBD, counts of amenities (optional).
- Train XGBoost and LightGBM with a scikit-learn pipeline, cross-validation, and SHAP for interpretation.
- Evaluate on RMSE, MAE, and R²; produce scatter & residual plots.
Legal & ethical note about scraping
Public listing websites (Domain, realestate.com.au, etc.) may restrict scraping in their Terms of Service, and scraping can carry legal and privacy implications. For tutorial or production work, prefer open official exports such as the NSW Valuer-General property sales CSVs, AURIN, or data.gov.au datasets. If you must scrape, check robots.txt and the site's terms, rate-limit your requests, and prefer APIs or licensed feeds where available.
1. Which dataset to use (practical picks)
- Kaggle: "Sydney House Prices — Greater Sydney" — good for reproducible teaching notebooks (often contains lat/lon & basic attributes).
- NSW Valuer-General / NSW Government property sales files — authoritative transaction-level CSVs for production-grade analysis.
- AURIN / data.gov.au — useful aggregated or enriched datasets (e.g., environmental, transport).
Throughout the tutorial, assume a CSV with the columns: price, date, property_type, bedrooms, bathrooms, car_spaces, land_area, building_area, address, suburb, postcode, latitude, longitude.
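If your export has addresses but no coordinates, geocode them once up front and cache the result. Below is a minimal sketch using geopy's Nominatim wrapper (pip install geopy; it is not in the install line of section 2). The user_agent string is a placeholder you must set, and Nominatim's usage policy allows roughly one request per second, hence the rate limiter.
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="sydney-prices-tutorial")  # set your own identifier
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # honour the rate limit

def add_latlon(df):
    """Fill latitude/longitude by geocoding 'address, suburb NSW postcode'."""
    query = df['address'] + ', ' + df['suburb'] + ' NSW ' + df['postcode'].astype(str)
    located = query.apply(geocode)
    df['latitude'] = [loc.latitude if loc else None for loc in located]
    df['longitude'] = [loc.longitude if loc else None for loc in located]
    return df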
2. Environment & libraries
pip install pandas numpy geopandas shapely scikit-learn xgboost lightgbm osmnx folium matplotlib seaborn shap jupyterlab
3. Workflow
Acquire → Clean/Geocode → EDA → Geospatial feature engineering → Model pipeline → Evaluate → Interpret → Deploy
4. Code: Full notebook outline
import numpy as np, pandas as pd, geopandas as gpd
from shapely.geometry import Point
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb, lightgbm as lgb, shap, osmnx as ox
import matplotlib.pyplot as plt, seaborn as sns
4.1 Load and clean
df = pd.read_csv("data/sydney_sales_sample.csv", parse_dates=['date'])
df = df[df['price'].notnull()]
# Strip "$" and "," before casting; the raw string avoids an invalid-escape warning.
df['price'] = df['price'].astype(str).str.replace(r'[\$,]', '', regex=True).astype(float)
df = df[df['price'] > 10000]  # drop token/peppercorn transfers
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
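Sales exports often contain repeated transactions and data-entry outliers, so a light de-duplication and sanity pass is worth adding here; the bounds below are illustrative, not taken from any particular dataset:
# Drop exact duplicate transactions (same address, sale date, and price).
df = df.drop_duplicates(subset=['address', 'date', 'price'])
# Illustrative sanity bounds; tune them against your own data's distributions.
df = df[df['bedrooms'].isna() | df['bedrooms'].between(0, 10)]
df = df[df['land_area'].isna() | df['land_area'].between(10, 20000)]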
4.2 Quick EDA
sns.histplot(np.log1p(df['price']), bins=60)
plt.title('Log Price Distribution')
plt.show()
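Beyond the price histogram, a quick correlation check across the numeric attributes is cheap and often catches data problems early (column names as assumed in section 1):
num_eda = ['price', 'bedrooms', 'bathrooms', 'car_spaces', 'land_area', 'building_area']
sns.heatmap(df[num_eda].corr(), annot=True, fmt='.2f', cmap='vlag')
plt.title('Correlation between price and numeric attributes')
plt.show()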
5. Geospatial feature engineering
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.longitude, df.latitude),
                       crs="EPSG:4326").to_crs(epsg=3577)  # GDA94 / Australian Albers: metres
# Query area: convex hull of the sales, buffered ~0.2 deg (~20 km) so coastal
# amenities just outside the data footprint are still captured.
area_polygon = gdf.to_crs(epsg=4326).geometry.unary_union.convex_hull.buffer(0.2)
# OSMnx >= 1.3 renamed geometries_from_polygon to features_from_polygon.
tags_station = {'railway': ['station', 'halt', 'tram_stop']}
stations = ox.features_from_polygon(area_polygon, tags_station)[['geometry']].to_crs(epsg=3577)
tags_beach = {'natural': 'beach'}
beaches = ox.features_from_polygon(area_polygon, tags_beach)[['geometry']].to_crs(epsg=3577)
from scipy.spatial import cKDTree
# Properties are points; stations may come back as points or polygons, so use
# centroids to get a uniform coordinate array before building the KD-tree.
pts_prop = np.vstack([gdf.geometry.x, gdf.geometry.y]).T
pts_station = np.vstack([stations.geometry.centroid.x, stations.geometry.centroid.y]).T
tree = cKDTree(pts_station)
dists, _ = tree.query(pts_prop, k=1)  # nearest-station distance in metres
gdf['dist_to_station_m'] = dists
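The model below also uses dist_to_beach_m and dist_to_cbd_m (the latter promised in the TL;DR). Beaches come back from OSM mostly as polygons, so reduce them to centroids as well; for the CBD, a single fixed point suffices (the coordinates below are the commonly used centre of Sydney, near Town Hall):
# Nearest-beach distance, reusing the property coordinate array from above.
pts_beach = np.vstack([beaches.geometry.centroid.x, beaches.geometry.centroid.y]).T
dists_beach, _ = cKDTree(pts_beach).query(pts_prop, k=1)
gdf['dist_to_beach_m'] = dists_beach
# Distance to the CBD: one fixed point, projected into the same metric CRS.
cbd = gpd.GeoSeries([Point(151.2093, -33.8688)], crs="EPSG:4326").to_crs(epsg=3577).iloc[0]
gdf['dist_to_cbd_m'] = gdf.geometry.distance(cbd)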
6. Modeling (XGBoost / LightGBM)
features = ['bedrooms','bathrooms','car_spaces','land_area','building_area',
            'dist_to_station_m','dist_to_beach_m','dist_to_cbd_m','year','month',
            'property_type','suburb']
y = np.log1p(gdf['price'])
X = gdf[features]
num_cols = ['bedrooms','bathrooms','car_spaces','land_area','building_area',
            'dist_to_station_m','dist_to_beach_m','dist_to_cbd_m','year','month']
cat_cols = ['property_type','suburb']
num_pipe = Pipeline([('imp', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
cat_pipe = Pipeline([('imp', SimpleImputer(strategy='constant', fill_value='missing')), ('ohe', OneHotEncoder(handle_unknown='ignore'))])
pre = ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)])
model = xgb.XGBRegressor(n_estimators=1000, learning_rate=0.05, max_depth=6, subsample=0.8, random_state=42)
pipe = Pipeline([('pre', pre), ('model', model)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipe.fit(X_train, y_train)
pred = np.expm1(pipe.predict(X_test))
true = np.expm1(y_test)
rmse = np.sqrt(mean_squared_error(true, pred))  # portable; squared=False was removed in newer scikit-learn
mae = mean_absolute_error(true, pred)
r2 = r2_score(true, pred)
print(f"RMSE: {rmse:,.0f}, MAE: {mae:,.0f}, R²: {r2:.3f}")
7. Interpretation (SHAP)
X_test_enc = pipe.named_steps['pre'].transform(X_test)
if hasattr(X_test_enc, 'toarray'):  # densify if OneHotEncoder produced sparse output
    X_test_enc = X_test_enc.toarray()
feat_names = pipe.named_steps['pre'].get_feature_names_out()
shap_values = shap.TreeExplainer(pipe.named_steps['model'])(X_test_enc)
shap.summary_plot(shap_values.values, features=X_test_enc, feature_names=list(feat_names))
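To drill into one engineered feature, a dependence plot shows SHAP value against the raw feature. The num__ prefix comes from the ColumnTransformer's transformer names ('num', 'cat') via get_feature_names_out():
shap.dependence_plot('num__dist_to_station_m', shap_values.values,
                     X_test_enc, feature_names=list(feat_names))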
8. Tips
- Log-transform price.
- Use temporal splits if forecasting.
- Use spatial CV for honest evaluation (see the sketch after this list).
- Cache OSM data for reproducibility.
- Save the full pipeline (preprocessing + model) with joblib.dump().
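On the spatial-CV tip: a cheap approximation is to group folds by suburb so every suburb lands entirely in train or entirely in test. The sketch below uses GroupKFold; blocked or buffered spatial folds are stricter if you need them.
from sklearn.model_selection import GroupKFold

spatial_rmse = -cross_val_score(pipe, X, y, groups=X['suburb'],
                                cv=GroupKFold(n_splits=5),
                                scoring='neg_root_mean_squared_error')
print(f"Suburb-grouped CV RMSE (log-price): {spatial_rmse.mean():.3f}")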
9. Extensions
- Add public transport travel time (GTFS).
- Add school quality and amenity scores.
- Visualize predicted prices on a Folium map (sketch below).
- Deploy with Streamlit or FastAPI.
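For the Folium extension, a minimal sketch assuming the gdf, X_test, and pred objects from sections 5-6; the $1.5M colour threshold is purely illustrative:
import folium

m = folium.Map(location=[-33.87, 151.21], zoom_start=11)  # centred on Sydney
sample = gdf.to_crs(epsg=4326).loc[X_test.index]  # back to lat/lon for web maps
for (_, row), p in zip(sample.head(500).iterrows(), pred[:500]):
    folium.CircleMarker(location=[row.geometry.y, row.geometry.x], radius=3,
                        color='red' if p > 1_500_000 else 'blue',
                        tooltip=f"${p:,.0f}").add_to(m)
m.save('predicted_prices.html')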
References
- NSW Valuer-General Property Sales Data (data.nsw.gov.au)
- Kaggle: Sydney House Prices — Greater Sydney
- OSMnx library docs for geospatial queries
- SHAP documentation for model explainability