Text classification is a machine learning task that assigns a label to a piece of text. It is used for purposes such as spam filtering, sentiment analysis, and topic classification.
In this article, we will learn how to perform basic text classification using Keras with TensorFlow. We will use the IMDB dataset, which contains movie reviews that have been labeled as either positive or negative. We will build a simple model that learns to predict the sentiment of a movie review.
Importing the libraries
The first step is to import the necessary libraries. We will need Keras, TensorFlow, and NumPy.
import keras
import tensorflow as tf
import numpy as np
Loading the data
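The next step is to load the data. The IMDB dataset ships with Keras and is downloaded automatically the first time it is requested. We can load it using the `keras.datasets` module, keeping only the 10,000 most frequent words.
max_words = 10000
(train_data, train_labels), (test_data, test_labels) = keras.datasets.imdb.load_data(num_words=max_words)
The `train_data` and `test_data` variables contain the movie reviews, already encoded as lists of integers in which each integer stands for a word. The `max_words` value passed as `num_words` limits the vocabulary to the most frequent words; rarer words are replaced by an out-of-vocabulary index. The `train_labels` and `test_labels` variables contain the sentiment labels for the movie reviews (1 for positive and 0 for negative).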
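To make the integer encoding concrete, here is a minimal sketch of how a review could map to indices and back. It assumes the default conventions of `keras.datasets.imdb.load_data` (0 for padding, 1 for the start of a review, 2 for out-of-vocabulary words, and word ranks shifted by 3). The tiny `word_rank` dictionary and the helpers `encode_review` and `decode_review` are hypothetical stand-ins for illustration, not part of Keras.

```python
# Hypothetical miniature vocabulary: word -> frequency rank (1 = most frequent).
# The real mapping is far larger; this one exists only for illustration.
word_rank = {"the": 1, "movie": 2, "was": 3, "great": 4}

PAD, START, UNKNOWN = 0, 1, 2  # reserved indices under the assumed defaults
INDEX_OFFSET = 3               # word ranks are shifted past the reserved indices


def encode_review(text):
    """Turn raw text into a list of integer word indices."""
    tokens = text.lower().split()
    return [START] + [
        word_rank[t] + INDEX_OFFSET if t in word_rank else UNKNOWN
        for t in tokens
    ]


def decode_review(indices):
    """Turn a list of integer word indices back into readable text."""
    index_to_word = {rank + INDEX_OFFSET: word for word, rank in word_rank.items()}
    # Skip padding and start markers; show "?" for anything unknown.
    words = [index_to_word.get(i, "?") for i in indices if i not in (PAD, START)]
    return " ".join(words)


encoded = encode_review("The movie was surprisingly great")
decoded = decode_review(encoded)
print(encoded)
print(decoded)
```

The word "surprisingly" is not in the toy vocabulary, so it is encoded with the out-of-vocabulary index and decodes to "?". This is the same thing that happens to rare words when the vocabulary is capped.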
Preprocessing the data
The data needs to be preprocessed before we can train the model. The reviews returned by `load_data` are already sequences of integers, so no tokenizer is needed, but they differ in length and the model expects inputs of one fixed length. We can pad or truncate each review with `keras.utils.pad_sequences` (in older releases, `keras.preprocessing.sequence.pad_sequences`).
max_length = 256
train_data = keras.utils.pad_sequences(train_data, maxlen=max_length)
test_data = keras.utils.pad_sequences(test_data, maxlen=max_length)
vocab_size = int(max(train_data.max(), test_data.max())) + 1
The `max_length` variable specifies the number of word indices kept per review; 256 is an arbitrary choice that can be tuned. Shorter reviews are padded with zeros and longer reviews are truncated. The `vocab_size` variable is one more than the largest word index in the data, which is the input size the `Embedding` layer needs. The `train_data` and `test_data` variables are now two-dimensional integer arrays with one row per review and `max_length` columns.
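Reviews encoded as integer lists have different lengths, while a model expects a fixed input length. As a minimal sketch of how padding and truncation produce fixed-length sequences, the pure-Python helper below pads and truncates at the front, which is my understanding of the default behavior of Keras `pad_sequences`. The name `pad_sequence` and the sample reviews are hypothetical.

```python
def pad_sequence(seq, maxlen, value=0):
    """Left-pad or left-truncate one list of integers to exactly maxlen items.

    Assumes maxlen is a positive integer.
    """
    if len(seq) >= maxlen:
        return seq[-maxlen:]  # keep only the last maxlen items
    return [value] * (maxlen - len(seq)) + seq


# Two made-up encoded reviews of different lengths.
reviews = [[1, 14, 22], [1, 9, 5, 7, 31, 8]]
padded = [pad_sequence(r, maxlen=4) for r in reviews]
print(padded)
```

After this step every review has the same length, so the whole batch can be stored as a rectangular array.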
Building the model
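Now that the data is preprocessed, we can build the model. We will use a simple model that consists of an embedding layer, a pooling layer, a dense hidden layer, and a sigmoid output layer.
embedding_size = 128
model = keras.Sequential([
    # Map each word index to a trainable vector
    keras.layers.Embedding(vocab_size, embedding_size),
    # Average the word vectors so each review becomes a single vector
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(128, activation='relu'),
    # One output unit: the probability that the review is positive
    keras.layers.Dense(1, activation='sigmoid')
])
The `embedding_size` variable specifies the size of the embedding vectors. The `Embedding` layer converts the sequence of integers into a sequence of embedding vectors. The `GlobalAveragePooling1D` layer averages those vectors over the sequence, so each review is represented by one vector regardless of its length. The hidden `Dense` layer learns the relationship between that vector and the sentiment labels. The final `Dense` layer has a single unit with a sigmoid activation, which outputs the probability that the review is positive.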
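To make the roles of the layers concrete, here is a pure-Python sketch of a forward pass: an embedding lookup, an average over the word vectors, and a single sigmoid output. The averaging step is one simple way to collapse a sequence of vectors into a single vector, the hidden ReLU layer is left out for brevity, and all numbers in `embedding_table` and `output_weights` are made up for illustration rather than learned.

```python
import math

# Hypothetical 4-word vocabulary with 2-dimensional embeddings (made-up values).
embedding_table = [
    [0.0, 0.0],   # index 0 (padding)
    [0.5, -0.2],  # index 1
    [-0.3, 0.8],  # index 2
    [0.9, 0.1],   # index 3
]
output_weights = [1.0, -1.0]  # made-up weights of the single output unit
output_bias = 0.0


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(indices):
    """Return a probability for one review given as a list of word indices."""
    # Embedding lookup: one vector per word index.
    vectors = [embedding_table[i] for i in indices]
    # Average over the sequence: one vector for the whole review.
    dim = len(embedding_table[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    # Output unit: weighted sum followed by a sigmoid.
    score = sum(w * m for w, m in zip(output_weights, mean)) + output_bias
    return sigmoid(score)


probability = predict([1, 3, 2])
print(probability)
```

Because the output is a single probability between 0 and 1, a value above 0.5 can be read as a positive prediction and a value below 0.5 as a negative one.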
Training the model
The model can now be trained. We first configure it with the `compile` method and then call the `fit` method, which takes the training data, the training labels, and the number of epochs as arguments.
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_data, train_labels, epochs=10)
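The `compile` method configures the model for training: it selects the RMSprop optimizer, the binary cross-entropy loss, which suits a single probability output for two classes, and accuracy as the metric to report. The `fit` method trains the model on the training data for 10 epochs.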
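To show what the `binary_crossentropy` loss measures, here is a minimal pure-Python sketch of the formula, averaged over a batch. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero, and the function is an illustration rather than the Keras implementation.

```python
import math


def binary_crossentropy(labels, probabilities, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Confident and correct predictions give a small loss ...
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
# ... while confident and wrong predictions give a large one.
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

The loss penalizes confident mistakes much more heavily than hesitant ones, which is why it keeps decreasing even after accuracy stops changing.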
Evaluating the model
The model can now be evaluated using the `evaluate` method. The `evaluate` method takes the test data and the test labels as arguments.
loss, accuracy = model.evaluate(test_data, test_labels)
print('Test loss:', loss)
print('Test accuracy:', accuracy)
The `evaluate` method returns the loss and accuracy on the test data. In this case, the loss is about 0.3 and the accuracy is about 0.86, although the exact values vary from run to run. This means that the model correctly classifies roughly 86% of the test reviews.
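As a minimal sketch of how an accuracy figure is derived from predicted probabilities, the helper below thresholds each probability at 0.5 and counts the matches with the true labels. The 0.5 threshold reflects my understanding of the default for binary accuracy in Keras, and the labels and probabilities are made up for illustration.

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(1 for y, pred in zip(labels, predictions) if y == pred)
    return correct / len(labels)


# Made-up labels and model outputs for five reviews.
sample_labels = [1, 0, 1, 1, 0]
sample_probabilities = [0.92, 0.20, 0.41, 0.77, 0.63]
accuracy_example = binary_accuracy(sample_labels, sample_probabilities)
print(accuracy_example)
```

Three of the five made-up predictions match their labels, so the sketch reports an accuracy of 0.6. Accuracy only looks at which side of the threshold a prediction falls on, whereas the loss also reflects how confident each prediction was.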