Everybody uses AI for almost everything these days. From helping shape ideas to the recommendation systems that tell us which movie to watch next, it’s almost omnipresent. But for many of us (myself included), what happens “under the hood” remains a complete mystery, seeming, at times, almost magical.
This post is the first in the AI for Dummies series, where we’ll try to demystify AI by learning what it is and how it works, starting from simple use cases and concepts and evolving into more complex, real-world scenarios. It also serves as an introduction to a few key tools, such as TensorFlow and Keras, and to some important concepts in this domain.
The Problem: Predicting Diabetes
What better way to learn something than by solving a real(ish) problem?
We’ll try to determine whether a patient is likely to develop diabetes based on their medical history and recent test results.
The Data: The “Diabetes” Dataset
To train a machine learning model, we need data. For this problem, we’ll use the “GB2024/diabetes” dataset from Hugging Face. It contains data from 100,000 patients and includes the following information:
Input Features:
- gender: Patient’s gender
- age: Patient’s age
- hypertension: Whether the patient has hypertension (0 for no, 1 for yes)
- heart_disease: Whether the patient has heart disease (0 for no, 1 for yes)
- smoking_history: Patient’s smoking history
- bmi: Body mass index
- HbA1c_level: Hemoglobin A1c level
- blood_glucose_level: Blood glucose level
Target Variable:
- diabetes: 0 for no, 1 for yes – what we are trying to determine
Our goal is to build a model that takes the input features and learns to predict whether a patient has the potential to develop diabetes or not.
How Machine Learning “Learns”
Machine Learning models learn from data in a process called “training.” We can think of it like a child learning a new skill. We show the model examples (the input features) and the correct answers (the target variables). The model then tries to find a pattern or a set of rules that connect the examples to the answers.
A key part of this learning process is backpropagation. Imagine the child trying the new skill for the first time. At first it may not go well. The child then finds something that improves the results and tries again. When a new approach actually produces worse results, it is discarded. Repeating this trial and error many times improves the child’s ability to properly master the skill.
Backpropagation is similar. The model makes a prediction, and we calculate the “error” or how far off the prediction was from the correct answer. The model then uses this error to adjust its internal settings (we’ll talk more about this subject in a future article about neural networks), so it tries to do a better job on the next prediction. This process is repeated many times, and with each iteration (epoch), the model gets better and better at making accurate predictions.
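To make this concrete, here’s a toy sketch of that “guess, measure the error, adjust” loop, written as plain gradient descent on a single weight, with made-up numbers. It’s only an illustration of the idea; TensorFlow performs these adjustments for us automatically.

# A toy model with one internal setting (a weight) learning that y = 3 * x
x, y_true = 2.0, 6.0
w = 0.0               # initial guess for the weight
learning_rate = 0.1

for epoch in range(10):
    y_pred = w * x                 # make a prediction
    error = y_pred - y_true        # how far off was it?
    gradient = 2 * error * x       # direction and size of the adjustment
    w -= learning_rate * gradient  # nudge the weight to reduce the error
    print(f"epoch {epoch}: w = {w:.4f}")

After a few iterations, `w` converges toward 3, much like the child converging on the technique that works.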
Metrics: Measuring Performance
But how do we know how well the model is doing? We’ll use a metric called accuracy: the percentage of predictions that the model gets right. For example, if our model correctly predicts 90 out of 100 patients, its accuracy is 90%.
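As a quick illustration, here’s how accuracy could be computed by hand for a handful of made-up predictions:

import numpy as np

# Hypothetical true labels and model predictions for 5 patients
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

accuracy = np.mean(y_true == y_pred)  # 4 correct out of 5
print(f"Accuracy: {accuracy:.0%}")    # Accuracy: 80%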
There are other relevant metrics, but let’s keep it simple for now.
The Tools: TensorFlow and Keras
To build and train our model, we’ll use two popular open-source tools:
- TensorFlow: A powerful and widely used library for numerical computation and large-scale machine learning. It provides the fundamental building blocks for creating and training models.
- Keras: A high-level API that runs on top of TensorFlow. It provides a simple and intuitive interface for building models.
Logistic Regression: A Simple Use Case
Logistic regression is an important concept in machine learning for classification problems. In our case, we want to classify patients into two categories: those likely to have or develop diabetes, and those who are not.
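To get a feel for what such a model computes, here’s a minimal sketch with a single feature and hypothetical parameter values (the real model learns one weight per feature):

import numpy as np

def sigmoid(z):
    # Squashes any number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

w, b = 0.8, -2.0  # hypothetical learned weight and bias
x = 3.5           # a single input feature value

probability = sigmoid(w * x + b)             # ~0.69
prediction = 1 if probability >= 0.5 else 0  # classify with a 0.5 threshold
print(probability, prediction)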
In the next section, we’ll show you the code to build a simple logistic regression model to predict diabetes.
Setting up your Environment
Before running the code, we’ll need to install the necessary libraries. We’ve provided a `requirements.txt` file with all the dependencies:
# requirements.txt
tensorflow
datasets
pandas
You can install them using pip:
pip install -r requirements.txt
This will install `tensorflow`, `datasets`, and `pandas`, which are essential for running the example code.
Note: It may make sense to create a Python virtual environment first.
python -m venv venv
source venv/bin/activate  # the activation command may differ depending on your OS
Model Creation and Training Code
The code, although quite simple, does a lot of different things. The detailed explanation that follows will shed some light on what it is doing.
from datasets import load_dataset
import tensorflow as tf
import pandas as pd
import numpy as np
# Load the dataset from Hugging Face
diabetes_dataset = load_dataset("GB2024/diabetes")
# Convert to pandas DataFrame
df = diabetes_dataset['train'].to_pandas()
# Select numerical features and target
numerical_features = ['age', 'hypertension', 'heart_disease', 'bmi', 'HbA1c_level', 'blood_glucose_level']
X = df[numerical_features]
y = df['diabetes']
# Preprocessing with TensorFlow
def preprocess_data(X, y):
    # Convert to TensorFlow tensors
    X = tf.convert_to_tensor(X.values, dtype=tf.float32)
    y = tf.convert_to_tensor(y.values, dtype=tf.float32)
    # Normalize numerical features (the small epsilon guards against division by zero)
    mean, variance = tf.nn.moments(X, axes=[0])
    X = (X - mean) / tf.sqrt(variance + 1e-8)
    return X, y, mean, variance  # Return normalization parameters
X, y, train_mean, train_variance = preprocess_data(X, y)
# Save normalization parameters for later use
np.save('normalization_mean.npy', train_mean.numpy())
np.save('normalization_variance.npy', train_variance.numpy())
# Split the data
dataset = tf.data.Dataset.from_tensor_slices((X, y))
dataset = dataset.shuffle(buffer_size=len(X), reshuffle_each_iteration=False)  # keep the train/test split stable across epochs
train_size = int(0.8 * len(X))
train_dataset = dataset.take(train_size).batch(32)
test_dataset = dataset.skip(train_size).batch(32)
# Build the model
model = tf.keras.Sequential([
tf.keras.layers.Dense(units=1, activation='sigmoid', input_shape=[X.shape[1]])
])
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(train_dataset, epochs=10)
# Evaluate the model
loss, accuracy = model.evaluate(test_dataset)
print(f"\nAccuracy: {accuracy}")
model_save_path = "diabetes_model.keras"
model.save(model_save_path)
print(f"\nModel saved to: {model_save_path}")
Code Explanation
1. Load the dataset: We use the `load_dataset` function from the `datasets` library to load the “GB2024/diabetes” dataset from Hugging Face.
2. Select features: To keep things simple, we select only the numerical features for this example. A more detailed explanation on why can be found later in this article.
3. Preprocess the data: Real-world data is often messy and needs to be cleaned up before it can be used for training. In our example we perform a simple yet very important preprocessing step called scaling. We use `tf.nn.moments` to calculate the mean and variance of the numerical features and then normalize them. A more detailed explanation on why this is needed is also present later in this article.
We’ll save the normalization mean and variance, as we’ll need to normalize the input data in the same way later on, when we use the model to make predictions.
4. Split the data: We use `tf.data.Dataset` to create a TensorFlow dataset and then use `take` and `skip` to split it into a training set and a testing set. It’s important to save some data to test our model, as otherwise we have no way of assessing its accuracy.
5. Build the model: We create a simple `Sequential` model with a single `Dense` layer. The `activation='sigmoid'` function is what makes this a logistic regression model. It squashes the output of the neuron to be a value between 0 and 1, which we can interpret as a probability.

There will be a dedicated post to activation functions later in this series.
6. Compile the model: We compile the model using the `adam` optimizer, `binary_crossentropy` as the loss function, and `accuracy` as our metric. `binary_crossentropy` is a loss function that is well-suited for binary classification problems like this one. Don’t worry too much about these, as they’ll be explained in a future article in this series.
7. Train the model: We can then train the model using the `fit` method, specifying the training data and the number of `epochs` (iterations over the entire dataset).
8. Evaluate the model: After training, we use the `evaluate` method to see how well our model performs on the unseen test data. We capture and display the accuracy.
9. Save the model for future reuse: We will not retrain the model every time we need to use it. Instead we can simply save it so it can be quickly loaded next time we actually need to use it. Later in this article we’ll load it and use it to predict if a patient is likely to have diabetes.
A Note on Non-Numerical Features
One of the things that may be bothering you at this point is that we’re only using numerical features. What about the non-numerical (or “categorical”) ones like `gender` and `smoking_history`?
These features can be very valuable for making accurate predictions. However, machine learning models are essentially mathematical equations, and they can only understand numbers. To use these non-numerical features, we would need to convert them into a numerical format.
This process is a common preprocessing step, and there are several techniques to do it, with the most popular being one-hot encoding. We will cover this and other techniques for handling non-numerical data in a future article in this series to keep this first example as simple as possible.
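Just to give you a taste, here’s a minimal sketch of one-hot encoding with pandas; the column values below are made up for illustration:

import pandas as pd

df = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female'],
    'smoking_history': ['never', 'current', 'former'],
})

# Each category becomes its own 0/1 column that a model can consume
encoded = pd.get_dummies(df, columns=['gender', 'smoking_history'], dtype=int)
print(encoded)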
Why is Normalization Required?
Another question probably on your mind is why we need to normalize the data.
In our dataset, the numerical features have very different scales. For example, `age` might range from 0 to 100, while `bmi` might range from 15 to 40, and `blood_glucose_level` could be in the hundreds. Without adjustment, each feature would influence the result with a very different weight.
A decent analogy to this problem is using different currencies. If you are comparing the prices of similar items in different currencies it’s usually a good idea to find common ground, by converting them into a single currency.
Normalization (or “scaling”) in machine learning is very similar. It’s the process of transforming all the numerical features to be on a similar scale.
If we don’t normalize our data, the features with larger scales can dominate the learning process. The model might incorrectly assume that a feature is more important simply because its values are larger. By normalizing the data, we ensure that all features contribute to the learning process more equally, which helps the model to learn the underlying patterns in the data more efficiently and effectively. This leads to a better-performing and more accurate model.
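Here’s a small, self-contained illustration with made-up numbers of what this kind of scaling (a z-score, which is what our `tf.nn.moments`-based code computes) does to two features on very different scales:

import numpy as np

age = np.array([25.0, 50.0, 75.0])
glucose = np.array([90.0, 140.0, 200.0])

def zscore(x):
    # Center on the mean and divide by the standard deviation
    return (x - x.mean()) / x.std()

print(zscore(age))      # roughly [-1.22,  0.00,  1.22]
print(zscore(glucose))  # roughly [-1.19, -0.07,  1.26]

After scaling, both features live on a comparable scale, even though their raw values differed by an order of magnitude.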
Running the code
Running the code will show the different epochs (iterations) with increasing accuracy (and decreasing loss).
At the end, we output the accuracy and save the model.

Note that 3 files were also produced: the model file, and the normalization mean and variance.
Using the Model to Predict if a Patient Has Diabetes
We can now use the model (and normalization mean and variance) to predict if a list of patients are likely to have diabetes. The code is commented and should be self-explanatory.
import tensorflow as tf
import numpy as np
# Load the saved model
loaded_model = tf.keras.models.load_model("diabetes_model.keras")
numerical_features = ['age', 'hypertension', 'heart_disease', 'bmi', 'HbA1c_level', 'blood_glucose_level']
# Load the saved training statistics so we normalize the new input the same way
train_mean = np.load('normalization_mean.npy')
train_variance = np.load('normalization_variance.npy')
# Example data for 10 patients
patients = [
{'age': 54, 'hypertension': 0, 'heart_disease': 0, 'bmi': 28.7, 'HbA1c_level': 6.5, 'blood_glucose_level': 140},
{'age': 48, 'hypertension': 1, 'heart_disease': 0, 'bmi': 25.3, 'HbA1c_level': 5.8, 'blood_glucose_level': 120},
{'age': 65, 'hypertension': 1, 'heart_disease': 1, 'bmi': 32.1, 'HbA1c_level': 7.2, 'blood_glucose_level': 160},
{'age': 35, 'hypertension': 0, 'heart_disease': 0, 'bmi': 23.4, 'HbA1c_level': 5.2, 'blood_glucose_level': 95},
{'age': 72, 'hypertension': 1, 'heart_disease': 1, 'bmi': 29.8, 'HbA1c_level': 6.8, 'blood_glucose_level': 150},
{'age': 41, 'hypertension': 0, 'heart_disease': 0, 'bmi': 26.5, 'HbA1c_level': 5.5, 'blood_glucose_level': 110},
{'age': 58, 'hypertension': 1, 'heart_disease': 0, 'bmi': 31.2, 'HbA1c_level': 6.9, 'blood_glucose_level': 145},
{'age': 45, 'hypertension': 0, 'heart_disease': 0, 'bmi': 24.8, 'HbA1c_level': 5.6, 'blood_glucose_level': 105},
{'age': 62, 'hypertension': 1, 'heart_disease': 1, 'bmi': 33.5, 'HbA1c_level': 7.5, 'blood_glucose_level': 170},
{'age': 39, 'hypertension': 0, 'heart_disease': 0, 'bmi': 22.9, 'HbA1c_level': 5.1, 'blood_glucose_level': 98}
]
eps = 1e-8
# Build input array for all patients
patients_arr = np.array([[patient[f] for f in numerical_features] for patient in patients], dtype=np.float32)
# Normalize using training stats
patients_norm = (patients_arr - train_mean) / np.sqrt(train_variance + eps)
# Convert to tensor and predict
patients_tensor = tf.convert_to_tensor(patients_norm, dtype=tf.float32)
probabilities = loaded_model.predict(patients_tensor)
predicted_classes = (probabilities >= 0.5).astype(int)
# Print results for each patient
print("\nPrediction Results:")
print("-" * 50)
for i, (prob, pred_class) in enumerate(zip(probabilities, predicted_classes), 1):
    print(f"\nPatient {i}:")
    print(f"Predicted probability of diabetes: {float(prob[0]):.4f}")
    print(f"Predicted class: {int(pred_class[0])} ({'diabetes' if pred_class[0]==1 else 'no diabetes'})")

Wrap-up and next steps
This simple example illustrates the power and simplicity of using TensorFlow and Keras for machine learning tasks. Real-world examples are much more complex than this, but it is, nevertheless, an important first step toward understanding what ML is about.
The companion GitHub repository can be found here
