In this article, we will learn how to load a dataset into Keras and prepare it for training. We will use the `text_dataset_from_directory` utility to create a labeled `tf.data.Dataset`. We will also split the dataset into three splits: train, validation, and test.
Create a directory structure
First, we need to lay out our data on disk. `text_dataset_from_directory` expects a main directory containing one subdirectory per class. For example:
main_directory/
    class_a/
        a_text_1.txt
        a_text_2.txt
    class_b/
        b_text_1.txt
        b_text_2.txt
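If you want to experiment without a real corpus, the layout above can be generated with a short script. This is only a sketch: the directory and file names follow the example, and the file contents are placeholders.

```python
import pathlib

# Create main_directory/ with one subdirectory per class, matching the
# layout above, and write a placeholder file for each example.
samples = {
    "class_a": ["a_text_1.txt", "a_text_2.txt"],
    "class_b": ["b_text_1.txt", "b_text_2.txt"],
}
main_directory = pathlib.Path("main_directory")
for class_name, filenames in samples.items():
    class_dir = main_directory / class_name
    class_dir.mkdir(parents=True, exist_ok=True)
    for filename in filenames:
        (class_dir / filename).write_text(f"placeholder text for {class_name}\n")
```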
Load the data
Now, we can load the data into a `tf.data.Dataset` using the `text_dataset_from_directory` utility. The utility infers each example's label from the name of the subdirectory containing it. For example:
dataset = tf.keras.utils.text_dataset_from_directory(
    main_directory,
    batch_size=32,
    label_mode="binary",
    shuffle=True,
)
The `batch_size` argument specifies the number of examples in each batch. The `label_mode` argument specifies how labels are encoded; since this is a binary classification problem with two classes, we use "binary", which yields a single 0 or 1 label per example. The `shuffle` argument specifies whether to shuffle the data; when it is False, examples are yielded in alphanumeric file order.
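To see what the loader produces, we can build a tiny throwaway corpus and inspect one batch. The directory layout, file contents, and batch size here are chosen purely for illustration:

```python
import pathlib
import tempfile

import tensorflow as tf

# Build a minimal two-class corpus in a temporary directory.
root = pathlib.Path(tempfile.mkdtemp()) / "main_directory"
for class_name in ("class_a", "class_b"):
    class_dir = root / class_name
    class_dir.mkdir(parents=True)
    for i in range(4):
        (class_dir / f"text_{i}.txt").write_text(f"example {i} from {class_name}\n")

dataset = tf.keras.utils.text_dataset_from_directory(
    root,
    batch_size=2,
    label_mode="binary",
    shuffle=True,
)

# Each element is a (texts, labels) pair: a batch of raw strings and
# a column of 0.0/1.0 labels, one per example.
for texts, labels in dataset.take(1):
    print(texts.shape)   # (2,)
    print(labels.shape)  # (2, 1)
```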
Split the dataset
Finally, we need to split the dataset into three splits: train, validation, and test. We will use the `tf.keras.utils.split_dataset` utility to do this, applying it twice: first to hold out 20% of the data for testing, then to carve 25% of the remainder (20% of the total) out as validation data. For example:

train_dataset, test_dataset = tf.keras.utils.split_dataset(
    dataset,
    right_size=0.2,
)
train_dataset, validation_dataset = tf.keras.utils.split_dataset(
    train_dataset,
    right_size=0.25,
)
The `train_dataset` will contain 60% of the data, the `validation_dataset` will contain 20% of the data, and the `test_dataset` will contain 20% of the data.
Now that we have loaded the data and split it into three splits, we can start training our model. In the next article, we will learn how to create a simple model and train it on the train dataset.