ML Basics with Keras in TensorFlow: Load the dataset

In this article, we will learn how to load a dataset into Keras and prepare it for training. We will use the `text_dataset_from_directory` utility to create a labeled `tf.data.Dataset`, and then split the data into train, validation, and test sets.

Create a directory structure

First, we need to create a directory structure for our data, with one subfolder per class. For example:

main_directory/
  class_a/
    a_text_1.txt
    a_text_2.txt
  class_b/
    b_text_1.txt
    b_text_2.txt
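If you are following along without data on disk, a layout like the one above can be generated with a short script. The file names and texts below are placeholders for illustration, not from any real dataset:

```python
from pathlib import Path

# Placeholder texts for each class; replace with real data.
samples = {
    "class_a": ["First class_a example.", "Second class_a example."],
    "class_b": ["First class_b example.", "Second class_b example."],
}

root = Path("main_directory")
for class_name, texts in samples.items():
    class_dir = root / class_name
    class_dir.mkdir(parents=True, exist_ok=True)
    prefix = class_name.split("_")[-1]  # "a" or "b"
    for i, text in enumerate(texts, start=1):
        (class_dir / f"{prefix}_text_{i}.txt").write_text(text)
```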

Load the data

Now, we can load the data into a `tf.data.Dataset` using the `text_dataset_from_directory` utility. It reads the files from all class subfolders into a single dataset and labels each example according to the subdirectory it came from. For example:

import tensorflow as tf

dataset = tf.keras.utils.text_dataset_from_directory(
    "main_directory",
    batch_size=32,
    label_mode="binary",
    shuffle=True,
)

The `batch_size` argument specifies how many examples each batch contains. The `label_mode` argument specifies the type of label; since this is a binary classification problem, we use "binary", which yields float labels of 0 or 1. The `shuffle` argument specifies whether to shuffle the data.
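The integer labels themselves come from the class subdirectory names sorted alphabetically, so `class_a` maps to 0 and `class_b` to 1. A minimal sketch of that rule (a re-implementation for illustration, not the Keras code itself):

```python
# Mimics how text_dataset_from_directory derives labels: the class folder
# names are sorted, and each folder gets its index as the label.
def infer_class_labels(class_dirs):
    return {name: idx for idx, name in enumerate(sorted(class_dirs))}

print(infer_class_labels(["class_b", "class_a"]))  # → {'class_a': 0, 'class_b': 1}
```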

Split the dataset

Finally, we need to split the dataset into three parts: train, validation, and test. Keras does not ship a `train_test_split` utility for `tf.data` datasets, so we use the dataset's own `take` and `skip` methods to carve out a 60/20/20 split by batch. For example:

num_batches = dataset.cardinality().numpy()
train_size = int(0.6 * num_batches)
val_size = int(0.2 * num_batches)

train_dataset = dataset.take(train_size)
validation_dataset = dataset.skip(train_size).take(val_size)
test_dataset = dataset.skip(train_size + val_size)

The `train_dataset` will contain about 60% of the batches, the `validation_dataset` about 20%, and the `test_dataset` the remaining 20%.
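A common way to split an already-batched `tf.data.Dataset` is with its `take` and `skip` methods, which behave like list slicing. The arithmetic can be sketched on a plain Python list standing in for the batches (the batch count here is illustrative):

```python
# Ten pretend batches; tf.data's take(n)/skip(n) behave like the slices below.
batches = list(range(10))
train_size = int(0.6 * len(batches))             # 6 batches
val_size = int(0.2 * len(batches))               # 2 batches

train = batches[:train_size]                     # dataset.take(train_size)
val = batches[train_size:train_size + val_size]  # dataset.skip(train_size).take(val_size)
test = batches[train_size + val_size:]           # dataset.skip(train_size + val_size)

print(len(train), len(val), len(test))  # → 6 2 2
```

Because the split happens after batching, the percentages apply to whole batches rather than individual examples.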

Now that we have loaded the data and split it into three splits, we can start training our model. In the next article, we will learn how to create a simple model and train it on the train dataset.
