Used Cars Price Prediction
This project was carried out as part of the TechLabs “Digital Shaper Program” in Düsseldorf (Summer Term 2021).
Abstract
This article briefly presents our data science project and its most important results. Our task was to identify suitable factors and build a model that can explain the price of a used car. After cleaning and transforming the data, we trained three well-known models. The random forest model proved to be the most suitable for our dataset.
Introduction
Determining the selling price of a used car is challenging because many factors drive its price in the market. Factors such as mileage, manufacturer, model, colour and year can all influence the actual worth of a car. Buyers find it hard to decide whether a used car is worth its listed price, and sellers find it equally hard to price a used car appropriately.
The goal of this project is to develop machine learning models that can accurately predict the price of a used car based on its features. We train and evaluate various machine learning models on a dataset that contains prices for a wide variety of makes and models across four nations.
Methodology
We started the project by searching for relevant datasets, using Kaggle.com as our source. We selected four datasets from different countries: the United States, India, Germany and Belarus. Next, we cleaned and transformed the data. We first detected outliers and removed them where applicable. We also dropped unnecessary or redundant columns and harmonized the column names across all datasets. Furthermore, we replaced or dropped null values, aligned the data types and harmonized the units across all datasets. The final merged dataset has 407,552 rows and 8 columns, that is 3,260,416 individual values. To set up a meaningful model for the price ($) of a used car, we included the features manufacturer, year, odometer (km), gear, fuel and country and consolidated the columns of the datasets accordingly.
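The cleaning steps above can be sketched with pandas. This is only a minimal illustration, not the project's actual code: the two raw tables, their column names, the assumption that prices are already in US dollars, and the fixed price cap used as an outlier rule are all hypothetical.

```python
import pandas as pd

# Hypothetical raw extracts; column names and values are made up for illustration.
us_raw = pd.DataFrame({
    "Make": ["ford", "toyota", "bmw", "ford"],
    "Year": [2015, 2012, 2018, 2016],
    "Mileage_mi": [60000.0, 90000.0, None, 40000.0],        # one missing odometer value
    "Price_usd": [12000.0, 8000.0, 30000.0, 9_999_999.0],   # last price is implausible
})
de_raw = pd.DataFrame({
    "brand": ["vw", "audi"],
    "registration_year": [2014, 2017],
    "kilometer": [125000.0, 70000.0],
    "price_usd": [9000.0, 21000.0],  # assumed to be converted to USD already
})

MILES_TO_KM = 1.609344
SCHEMA = ["manufacturer", "year", "odometer", "price"]
MAX_PLAUSIBLE_PRICE = 200_000  # assumed outlier threshold, not taken from the project


def harmonize(df, column_map, country, odometer_factor=1.0):
    """Rename columns to the shared schema, convert the odometer to km, tag the country."""
    out = df.rename(columns=column_map)[SCHEMA].copy()
    out["odometer"] = out["odometer"] * odometer_factor
    out["country"] = country
    return out


us = harmonize(
    us_raw,
    {"Make": "manufacturer", "Year": "year", "Mileage_mi": "odometer", "Price_usd": "price"},
    country="USA",
    odometer_factor=MILES_TO_KM,
)
de = harmonize(
    de_raw,
    {"brand": "manufacturer", "registration_year": "year", "kilometer": "odometer", "price_usd": "price"},
    country="Germany",
)

# Merge the sources, drop rows with missing values, then remove price outliers.
cars = pd.concat([us, de], ignore_index=True)
cars = cars.dropna(subset=["odometer", "price"])
cars = cars[cars["price"] <= MAX_PLAUSIBLE_PRICE]

print(cars.shape)  # (rows, columns)
print(cars.size)   # rows * columns, i.e. the number of individual values
```

Note that `cars.size` is the product of rows and columns, which is the same kind of quantity as the 3,260,416 (407,552 × 8) reported for the final dataset.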
The exploratory data analysis of our final dataset gave us the following main insights: there are 107 different manufacturers (brands), most cars have an automatic gear (280,000), only 2000 used cars are offered, the average odometer value in the merged dataset is 140,000 km and the year of construction of the cars ranges from 1982 to 2021.

The goal is to predict the selling price of a car from input features such as the odometer reading and the year of manufacture. Because our dataset is labelled, we treat this as a supervised learning problem: each record pairs the input features with the target, the selling price. Since we predict a continuous range of numerical values, we are solving a regression problem.
We split the dataset into two parts, a training set and a test set. The training set was used to fit the models, and the test set was used to evaluate how well the fitted models generalize. We trained three different models on the training set, starting with a simple linear model, linear regression, and then moving to non-linear models, decision trees and random forest.
For linear regression, the fit on the training set was poor and the R² score on the test set was also low, which indicates that the model was underfitting. The decision tree, on the other hand, fit the training set well but had a low R² score on the test set, which indicates that it was overfitting to the training data.
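The split and the comparison of training and test R² described above can be sketched with scikit-learn as follows. The data here is synthetic and the price formula is invented purely so the example runs; the 80/20 split, the one-hot encoding of the categorical columns and all hyperparameters are assumptions, not the project's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the merged dataset (values are made up).
rng = np.random.default_rng(0)
n = 2000
cars = pd.DataFrame({
    "manufacturer": rng.choice(["ford", "vw", "toyota"], size=n),
    "gear": rng.choice(["automatic", "manual"], size=n),
    "fuel": rng.choice(["gas", "diesel"], size=n),
    "country": rng.choice(["USA", "Germany"], size=n),
    "year": rng.integers(1982, 2022, size=n),
    "odometer": rng.uniform(0, 300_000, size=n),
})
age = 2021 - cars["year"]
# Invented non-linear price relation plus noise, only for demonstration.
cars["price"] = (
    30_000 * np.exp(-0.08 * age) * np.exp(-cars["odometer"] / 200_000)
    + rng.normal(0, 500, size=n)
)

X = cars.drop(columns="price")
y = cars["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

categorical = ["manufacturer", "gear", "fuel", "country"]


def make_model(estimator):
    """One-hot encode the categorical columns, pass numeric ones through, then fit."""
    encode = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",
    )
    return make_pipeline(encode, estimator)


models = {
    "linear regression": make_model(LinearRegression()),
    "decision tree": make_model(DecisionTreeRegressor(random_state=0)),
    "random forest": make_model(RandomForestRegressor(n_estimators=50, random_state=0)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For regressors, score() returns the R² (coefficient of determination).
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R2 = {scores[name][0]:.3f}, test R2 = {scores[name][1]:.3f}")
```

R² is defined as 1 - SS_res / SS_tot, so a value close to 1 means the model explains most of the variance in the price. A low score on both sets points to underfitting, while a high training score combined with a clearly lower test score points to overfitting. On this synthetic data the gaps will not necessarily look like the ones we observed on the real dataset.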

Finally, we found that the random forest fits the training data well and also achieves a better R² score on the test set. We also tuned the hyperparameters of the random forest model to see whether the R² score could be improved further, using RandomizedSearchCV from the scikit-learn library. The initial parameters turned out to be the best among the randomly sampled combinations. The most important features that the random forest model has learned for predicting the selling price of a used car are shown in the figure below.
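A randomized hyperparameter search and the extraction of feature importances can be sketched as follows. Again, this is an illustration under assumptions: the data is synthetic, and the search space, the number of sampled combinations (`n_iter=5`) and the 3-fold cross-validation are hypothetical choices rather than the settings used in the project.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data: two informative features and one unrelated feature.
rng = np.random.default_rng(0)
n = 600
year = rng.integers(1982, 2022, size=n)
odometer = rng.uniform(0, 300_000, size=n)
unrelated = rng.normal(size=n)
X = np.column_stack([year, odometer, unrelated])
feature_names = ["year", "odometer", "unrelated"]
# Invented price relation, only for demonstration.
y = (
    30_000 * np.exp(-0.08 * (2021 - year)) * np.exp(-odometer / 200_000)
    + rng.normal(0, 500, size=n)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: random forest with scikit-learn's default parameters.
baseline = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Hypothetical search space; each iteration samples one combination from it.
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="r2",
    random_state=0,
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("baseline test R2:", round(baseline.score(X_test, y_test), 3))
print("tuned test R2:", round(search.best_estimator_.score(X_test, y_test), 3))

# Impurity-based importances of the refitted best model; they sum to 1.
importances = dict(zip(feature_names, search.best_estimator_.feature_importances_))
for name, value in sorted(importances.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

If the tuned model does not beat the baseline on the test set, as was the case in our project, keeping the initial parameters is the simpler choice.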

Technology used: Python, R, scikit-learn, GitHub, Git.
GitHub repository: https://github.com/adithya36/Used_Cars_Price_Prediction
The Team:
Adithya Krishna Moku: Artificial Intelligence
David Brüninghoff: Data Science (Python)
Dayo Adedokun: Data Science (Python)
Server K: Data Science (Python)
Giuliana Moroni: Data Science (Python)