In this article, we will learn how to use Keras and TensorFlow Hub to classify movie reviews as positive or negative. This is an example of binary (two-class) classification, an important and widely applicable kind of machine learning problem.
Getting started
First, we need to install the necessary libraries. TensorFlow itself is required, and NLTK provides the stop-word list used during preprocessing.
pip install tensorflow tensorflow-hub tensorflow-datasets nltk
Next, we need to load the IMDB dataset. This dataset contains the text of 50,000 movie reviews from the Internet Movie Database. They are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.
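import tensorflow as tf
import tensorflow_datasets as tfds

# Load the training split; each example is a dict with "text" and "label" entries
dataset = tfds.load("imdb_reviews", split="train")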
Preprocessing the data
The data needs to be preprocessed before we can train a model on it. This includes cleaning the text, removing stop words, and vectorizing the words.
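import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
VOCAB_SIZE = 10000  # keep only the most frequent words

# Build one regular expression that matches any English stop word as a whole word
stop_words = sorted(stopwords.words("english"))
stop_pattern = r"\b(" + "|".join(stop_words) + r")\b"

def clean_text(text):
    # Clean the text: lowercase it and replace non-letters with spaces
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, "[^a-z]", " ")
    # Remove stop words
    return tf.strings.regex_replace(text, stop_pattern, " ")

# Vectorize the words: learn the vocabulary once, from the training text only
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE, standardize=clean_text, output_sequence_length=200)
vectorizer.adapt(dataset.map(lambda example: example["text"]).batch(256))

def preprocess_text(example):
    # Turn a batch of reviews into (word IDs, labels)
    return vectorizer(example["text"]), example["label"]

dataset = dataset.batch(32).map(preprocess_text)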
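To make the three preprocessing steps concrete, here is a minimal pure-Python sketch that applies them to two made-up reviews. The stop-word list, the helper names (clean, remove_stop_words, build_vocabulary, vectorize), and the sample sentences are all hypothetical and chosen for illustration. A library vectorizer typically also reserves an ID for unknown words, which this sketch leaves out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"the", "a", "was", "and", "it", "is", "this"}

def clean(text):
    """Lowercase the text and replace every non-letter with a space."""
    return re.sub(r"[^a-z]", " ", text.lower())

def remove_stop_words(text):
    """Split on whitespace and drop words that appear in STOP_WORDS."""
    return [word for word in text.split() if word not in STOP_WORDS]

def build_vocabulary(token_lists):
    """Map each word to an integer ID, most frequent first; 0 is kept for padding."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: index for index, word in enumerate(ordered, start=1)}

def vectorize(tokens, vocabulary, length):
    """Convert words to IDs, then pad with 0 or truncate to a fixed length."""
    ids = [vocabulary[word] for word in tokens if word in vocabulary]
    return (ids + [0] * length)[:length]

reviews = ["The movie was GREAT, great fun!", "This movie was boring."]
tokens = [remove_stop_words(clean(review)) for review in reviews]
vocabulary = build_vocabulary(tokens)
vectors = [vectorize(t, vocabulary, length=5) for t in tokens]

print(tokens)      # [['movie', 'great', 'great', 'fun'], ['movie', 'boring']]
print(vocabulary)  # {'great': 1, 'movie': 2, 'boring': 3, 'fun': 4}
print(vectors)     # [[2, 1, 1, 4, 0], [2, 3, 0, 0, 0]]
```

Every review ends up as a list of integers of the same length, which is the shape a neural network expects as input.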
Training the model
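Now that the data is preprocessed, we can train a model on it. We will use a small recurrent network: an embedding layer that turns word IDs into vectors, an LSTM layer that reads the sequence, and a single sigmoid output unit that gives the probability that a review is positive.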
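VOCAB_SIZE = 10000  # number of distinct word IDs the embedding must cover

model = tf.keras.Sequential([
    # mask_zero=True tells the LSTM to skip the zero padding at the end of each review
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=128, mask_zero=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(dataset, epochs=10)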
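The sigmoid output and the binary cross-entropy loss are easier to understand when computed by hand. The sketch below is plain Python rather than Keras internals; the function names and the example probabilities are made up for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(labels, probabilities, epsilon=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, epsilon), 1.0 - epsilon)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

labels = [1, 0, 1]             # 1 = positive review, 0 = negative review
confident = [0.9, 0.1, 0.8]    # close to the true labels
unsure = [0.5, 0.5, 0.5]       # no better than a coin flip
wrong = [0.1, 0.9, 0.2]        # confidently incorrect

print(sigmoid(0.0))
for name, probs in [("confident", confident), ("unsure", unsure), ("wrong", wrong)]:
    print(name, binary_crossentropy(labels, probs))
```

The loss is lowest for the confident and correct predictions, equals log(2) when every prediction is 0.5, and grows quickly when the model is confidently wrong. Training pushes the weights toward the first case.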
Evaluating the model
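We can evaluate the model on the test set to see how well it performs. The test split is loaded separately and must go through the same preprocessing as the training data.
test_dataset = tfds.load("imdb_reviews", split="test").batch(32).map(preprocess_text)
test_loss, test_accuracy = model.evaluate(test_dataset)
print("Test loss:", test_loss)
print("Test accuracy:", test_accuracy)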
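To see what the accuracy number means, the following sketch turns predicted probabilities into class labels and counts how many match the true labels. It assumes a cutoff of 0.5, which to our knowledge is the default for binary accuracy in Keras. The function names and the sample values are hypothetical.

```python
def predict_labels(probabilities, threshold=0.5):
    """Label a review positive (1) when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

true_labels = [1, 0, 1, 1, 0]
probabilities = [0.92, 0.35, 0.48, 0.71, 0.66]  # made-up model outputs

predicted = predict_labels(probabilities)
print(predicted)                         # [1, 0, 0, 1, 1]
print(accuracy(true_labels, predicted))  # 0.6
```

Because the test set is balanced, a model that always predicts the same class would score 0.5, so that is the baseline any trained model should beat.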