Diabetes prediction: Know your risk of Type 2 diabetes using data science

Mar 23, 2023

This project was carried out as part of the TechLabs “Digital Shaper Program” in Düsseldorf (Winter Term 2022/23).

Abstract

Many people are unaware of how strongly their lifestyle affects their health. Human biology also plays a role, of course, but it is significantly influenced by lifestyle. Type 2 diabetes, for example, is one of the diseases that has become more common in Europe in recent years, and Germany has the highest rate of diabetes patients in Europe. Type 2 diabetes can often be prevented by taking simple measures. With our model, we would like to use criteria such as age, BMI, or blood pressure to determine whether someone is prone to Type 2 diabetes.

Introduction

How do you come up with the idea of predicting diseases, and who benefits from it? Quite simply, a prediction can be an incentive for people to make fundamental changes to their lifestyle. After all, who wants to be sick?

However, a disease does not only affect the quality of life of those who have it. It is also associated with high costs for insurers. The disease first has to be diagnosed, which is a hurdle in itself because incorrect diagnoses are often made, and then it must be treated correctly. Type 2 diabetes may require long-term treatment. With a prediction of Type 2 diabetes based on data science and machine learning, one can test for it in a more targeted manner and either take preventive measures, and thus save costs, or make a targeted diagnosis and work towards fast, effective treatment without great effort.

Method

In this section, we describe the data used, the reasons behind some of our decisions, and the final outcome.

The Data

· The correlations between the features were comparatively weak.

Modelling

In the modelling process, we normalized the data since we had a mix of quantitative and qualitative data. We then split the data into training and test sets (67:33) and built our model using the Random Forest algorithm, with a Decision Tree as a comparison. The resulting accuracy was 84%, and we created a confusion matrix to see which kinds of errors the model makes. Together, these give a first indication of how well the model performs. The model built with the Decision Tree, another supervised machine learning algorithm, also shows high accuracy. However, we stick with the Random Forest, since it is an ensemble algorithm and generally performs better than a single Decision Tree. Given more time, we could have tried other, possibly more suitable algorithms.
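The steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn, uses a synthetic data set from `make_classification` in place of the real data, and picks `MinMaxScaler` as one possible way to normalize. The feature count, class weights, and random seeds are all made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project's data (an assumption, not the real data):
# 1,000 rows, 8 numeric features, with class 1 as the minority class.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)

# 67:33 split; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42
)

# Fit the scaler on the training data only, so no test information leaks in.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Random Forest as the main model, Decision Tree as the comparison.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test_scaled))
    print(f"{name}: accuracy = {accuracies[name]:.2f}")
```

The numbers printed here come from synthetic data and say nothing about the 84% reported for the real data set.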

Also noteworthy is that we used the F1 score as an additional evaluation metric, since it is more informative than accuracy when the classes are imbalanced. Below is the generated confusion matrix.

The models predict whether a new instance belongs to class 0 or 1 based on the information given. An output of 0 means you are not prone to Type 2 diabetes, and 1 means you are prone to Type 2 diabetes.
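To illustrate why accuracy can mislead on imbalanced classes, here is a small self-contained sketch. The labels are invented for the example, and the helper functions `confusion_counts` and `f1_score` are hypothetical names written for this illustration rather than taken from the project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall: 2*tp / (2*tp + fp + fn)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up labels: 8 negatives and 2 positives (an imbalanced toy example).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A "model" that always predicts the majority class 0.
y_pred = [0] * 10

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
f1 = f1_score(tp, fp, fn)

print(f"accuracy = {accuracy}")  # accuracy = 0.8
print(f"F1 = {f1}")              # F1 = 0.0
```

A classifier that never predicts class 1 still reaches 80% accuracy here, while its F1 score for the positive class is 0.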

Result

With our model, we found that data science can be used to predict with good accuracy (84% in our tests) whether someone is prone to Type 2 diabetes or not. However, the predictions are only as reliable as the underlying data sets, some of which are imprecise. For example, heavy alcohol consumption is known to increase the risk of Type 2 diabetes, but in our model the risk from alcohol is very low, which does not reflect the current state of science.

Furthermore, it would be interesting not only to say whether someone is prone to diabetes or not, but also to express the predicted risk as a percentage and to make recommendations for lifestyle changes if necessary.
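One possible way to get such a percentage, assuming a scikit-learn Random Forest, is `predict_proba`, which returns the share of trees voting for each class. The sketch below again uses synthetic data rather than the project's data set, and the resulting value is an uncalibrated model score, not a clinical risk estimate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the project's data set).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Treat the first row as a hypothetical new person.
new_person = X[:1]

# predict_proba returns one column per class, ordered as in model.classes_.
class_index = list(model.classes_).index(1)
risk = model.predict_proba(new_person)[0, class_index]

print(f"Estimated risk of class 1: {risk:.0%}")
```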

GitHub repository: https://github.com/NKWOCHA/TechLab_Group4

The Team:

Arlen Euan: Data Science

Obinna Patrick Nkwocha: Data Science

Tara Zakholy: Data Science

Aakash Dhekane: Mentor