A notable component of my pipeline was leveraging modern transformer-based language models for feature creation. By applying BERT embeddings to short text fields, I was able to infuse deep semantic understanding into a classical machine learning setting. This approach bridged the gap between large-scale, pre-trained NLP techniques and a relatively quick-to-deploy regression model.
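The idea can be sketched as follows. This is a minimal illustration, not the original pipeline: the hash-based `embed_text` function is a lightweight stand-in for a real BERT encoder (which would produce 768-dimensional vectors), and the feature names and dimensions are assumptions for the example.

```python
import numpy as np

EMB_DIM = 8  # a real BERT-base embedding would be 768-dimensional

def embed_text(text: str) -> np.ndarray:
    """Stand-in for a BERT encoder: maps a short text field to a fixed-size vector.
    In practice this would call a pre-trained model (e.g. via the transformers library)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(EMB_DIM)

def build_features(rows):
    """Concatenate each row's tabular features with its text embedding."""
    feats = []
    for tabular, text in rows:
        feats.append(np.concatenate([np.asarray(tabular, dtype=float),
                                     embed_text(text)]))
    return np.vstack(feats)

# Hypothetical rows: (tabular features, short text field)
rows = [([120.0, 3.0], "arena show, upper bowl"),
        ([45.0, 1.0], "club gig, general admission")]
X = build_features(rows)
print(X.shape)  # (2, 2 + EMB_DIM) -> (2, 10)
```

The resulting matrix `X` can be fed directly to any classical regressor, which is what makes this pattern quick to deploy.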
While building and refining this ticket price predictor, I encountered significant hurdles due to the highly skewed nature of the dataset: most ticket prices clustered at lower levels, while a minority extended into disproportionately high tiers. Attempts to address this skew with log transformations and alternative loss functions, such as a Poisson loss, yielded only limited benefit due to inherent data constraints.
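The log-transform approach mentioned above can be sketched in a few lines. The price values here are invented for illustration; the point is that `log1p` compresses the long tail before fitting and `expm1` inverts the transform at prediction time.

```python
import numpy as np

# Hypothetical skewed ticket prices: many cheap tickets, a long expensive tail.
prices = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 60.0, 250.0, 900.0])

# Fit the regressor on log1p(price) so the tail compresses,
# then map predictions back with expm1.
y_log = np.log1p(prices)
recovered = np.expm1(y_log)

print(np.allclose(recovered, prices))  # True: round-trip is exact up to float error
print(prices.std(), y_log.std())       # the transform sharply reduces the spread
```

Even with the compressed target, the tail examples remain rare, which is one reason the transform alone could not fully fix the skew problem.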
Nevertheless, by integrating advanced feature engineering, optimizing hyperparameters, and experimenting with multiple regressors, I achieved a Mean Absolute Error (MAE) of 8.52 using Random Forest on the final test set. This performance demonstrates the model’s practical utility, although the skewed data distribution and the limited predictive power of the features still constrain further accuracy gains.
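The evaluation setup can be sketched as below. This uses synthetic data rather than the real feature matrix, so the printed MAE is not the reported 8.52; the hyperparameters shown are illustrative assumptions, not the tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix and ticket prices.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))
y = 50 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative hyperparameters; the real ones would come from a tuning search.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)

mae = mean_absolute_error(y_te, model.predict(X_te))
print(f"MAE: {mae:.2f}")
```

Reporting MAE on a held-out test set, as here, keeps the metric in the same units as the target (ticket price), which is what makes the 8.52 figure directly interpretable.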
In the future, I intend to enrich the dataset with more extensive textual descriptions to make even better use of BERT for identifying semantic information; these descriptions could come from Wikipedia venue pages or Spotify artist bios.