• The TensorFlow tutorials are written as Jupyter notebooks and run directly in Google Colab—a hosted notebook environment that requires no setup. Click the Run in Google Colab button.


  • This notebook classifies movie reviews as *positive* or *negative* using the text of the review. This is an example of *binary*—or two-class—classification, an important and widely applicable kind of machine learning problem.


  • The tutorial demonstrates the basic application of transfer learning with [TensorFlow Hub](https://tfhub.dev) and Keras.


  • We'll use the [IMDB dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) that contains the text of 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/).


  • These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are *balanced*, meaning they contain an equal number of positive and negative reviews.


  • This notebook uses [`tf.keras`](https://www.tensorflow.org/guide/keras), a high-level API to build and train models in TensorFlow, and [`tensorflow_hub`](https://www.tensorflow.org/hub), a library for loading trained models from [TFHub](https://tfhub.dev) in a single line of code.


  • For a more advanced text classification tutorial using `tf.keras`, see the [MLCC Text Classification Guide](https://developers.google.com/machine-learning/guides/text-classification/).


  • 
    !pip install tfds-nightly
    !pip install tensorflow-hub
    
    import numpy as np
    
    import tensorflow as tf
    import tensorflow_hub as hub
    import tensorflow_datasets as tfds
    
    print("Version: ", tf.__version__)
    print("Eager mode: ", tf.executing_eagerly())
    print("Hub version: ", hub.__version__)
    print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")
     
    
  • ## Download the IMDB dataset
    The IMDB dataset is available on [imdb reviews](https://www.tensorflow.org/datasets/catalog/imdb_reviews) or on [TensorFlow datasets](https://www.tensorflow.org/datasets).


  • The following code downloads the IMDB dataset to your machine (or the Colab runtime):


  • 
    # Split the training set into 60% and 40%, so we'll end up with 15,000 examples
    # for training, 10,000 examples for validation and 25,000 examples for testing.
    train_data, validation_data, test_data = tfds.load(
        name="imdb_reviews", 
        split=('train[:60%]', 'train[60%:]', 'test'),
        as_supervised=True)
     
    
  • ## Explore the data
    Let's take a moment to understand the format of the data. Each example is a sentence representing the movie review and a corresponding label.


  • The sentence is not preprocessed in any way. The label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.
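

  • If you want to confirm this programmatically, you can inspect the dataset's element structure and count the positive labels in the training split. Below is a minimal sketch, assuming the `train_data` split loaded above; the count should be roughly balanced rather than exactly 50/50, since it depends on how the 60% slice was drawn.


  • 
    # Sanity check (not part of the original notebook): each element is a
    # (text, label) pair, and the labels are 0 (negative) or 1 (positive).
    print(train_data.element_spec)
    
    num_examples = 0
    num_positive = 0
    for _, label in train_data:
        num_examples += 1
        num_positive += int(label.numpy())
    print(f"{num_positive} positive out of {num_examples} training examples")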


  • Let's print the first 10 examples.


  • 
    train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
    train_examples_batch
     
    
  • Let's also print the first 10 labels.


  • 
    train_labels_batch
     
    
  • ## Build the model
    The neural network is created by stacking layers—this requires three main architectural decisions:


  • * How to represent the text?
    * How many layers to use in the model?
    * How many *hidden units* to use for each layer?


  • In this example, the input data consists of sentences. The labels to predict are either 0 or 1.


  • One way to represent the text is to convert sentences into embedding vectors. We can use a pre-trained text embedding as the first layer, which will have three advantages:


  • * we don't have to worry about text preprocessing,
    * we can benefit from transfer learning,
    * the embedding has a fixed size, so it's simpler to process.


  • For this example we will use a **pre-trained text embedding model** from [TensorFlow Hub](https://tfhub.dev) called [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2).


  • There are many other pre-trained text embeddings from TFHub that can be used in this tutorial:


  • * [google/nnlm-en-dim128/2](https://tfhub.dev/google/nnlm-en-dim128/2) - trained with the same NNLM architecture on the same data as [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2), but with a larger embedding dimension. A larger embedding dimension can improve performance on your task, but it may take longer to train your model.


  • * [google/nnlm-en-dim128-with-normalization/2](https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2) - the same as [google/nnlm-en-dim128/2](https://tfhub.dev/google/nnlm-en-dim128/2), but with additional text normalization such as removing punctuation. This can help if the text in your task contains additional characters or punctuation.


  • * [google/universal-sentence-encoder/4](https://tfhub.dev/google/universal-sentence-encoder/4) - a much larger model yielding 512 dimensional embeddings trained with a deep averaging network (DAN) encoder.


  • And many more! Find more [text embedding models](https://tfhub.dev/s?module-type=text-embedding) on TFHub.
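

  • Swapping in one of the alternatives above only means passing a different handle to `hub.KerasLayer`; the rest of the tutorial stays the same. Here is a minimal sketch (the `embedding_handles` dictionary is just for convenience and is not part of the original notebook):


  • 
    # Handles for the embeddings discussed above. Swapping one in only requires
    # passing a different URL to hub.KerasLayer in the next cell.
    embedding_handles = {
        "nnlm-en-dim50": "https://tfhub.dev/google/nnlm-en-dim50/2",
        "nnlm-en-dim128": "https://tfhub.dev/google/nnlm-en-dim128/2",
        "nnlm-en-dim128-norm": "https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2",
        "universal-sentence-encoder": "https://tfhub.dev/google/universal-sentence-encoder/4",
    }
    
    # For example, to use the 128-dimensional embedding instead of the 50-dimensional one:
    # hub_layer = hub.KerasLayer(embedding_handles["nnlm-en-dim128"],
    #                            input_shape=[], dtype=tf.string, trainable=True)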


  • Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples. Note that no matter the length of the input text, the output shape of the embeddings is: `(num_examples, embedding_dimension)`.


  • 
    embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
    hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                               dtype=tf.string, trainable=True)
    hub_layer(train_examples_batch[:3])
     
    
  • Let's now build the full model:


  • 
    model = tf.keras.Sequential()
    model.add(hub_layer)
    model.add(tf.keras.layers.Dense(16, activation='relu'))
    model.add(tf.keras.layers.Dense(1))
    
    model.summary()
     
    
  • The layers are stacked sequentially to build the classifier:


  • 1. The first layer is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map a sentence into its embedding vector. The pre-trained text embedding model that we are using ([google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2)) splits the sentence into tokens, embeds each token, and then combines the embeddings. The resulting dimensions are `(num_examples, embedding_dimension)`. For this NNLM model, the `embedding_dimension` is 50 (see the shape check after this list).


  • 2. This fixed-length output vector is piped through a fully-connected (`Dense`) layer with 16 hidden units.


  • 3. The last layer is densely connected with a single output node.
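

  • To see those shapes concretely, you can push a few example sentences through each layer by hand. A minimal sketch, assuming the `model` and `train_examples_batch` defined above:


  • 
    # Shape check (a sketch, not part of the original notebook): the output of
    # each layer for a small batch of 3 reviews.
    sample = train_examples_batch[:3]
    embedded = model.layers[0](sample)   # hub layer -> (3, 50)
    hidden = model.layers[1](embedded)   # Dense(16, relu) -> (3, 16)
    logits = model.layers[2](hidden)     # Dense(1) -> (3, 1), one logit per review
    print(embedded.shape, hidden.shape, logits.shape)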


  • Let's compile the model.


  • ### Loss function and optimizer
    A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs logits (a single-unit layer with a linear activation), we'll use the `binary_crossentropy` loss function.


  • This isn't the only choice for a loss function; you could, for instance, choose `mean_squared_error`. But, generally, `binary_crossentropy` is better for dealing with probabilities—it measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and the predictions.
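

  • The `from_logits=True` argument used when compiling below tells the loss to apply the sigmoid itself. A minimal sketch of that equivalence, with made-up labels and logits:


  • 
    # Two equivalent ways to compute binary cross-entropy (a sketch): feed raw
    # logits with from_logits=True, or feed sigmoid probabilities without it.
    example_labels = tf.constant([[1.0], [0.0]])
    example_logits = tf.constant([[2.0], [-1.0]])
    
    bce_from_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    bce_from_probs = tf.keras.losses.BinaryCrossentropy()
    
    print(bce_from_logits(example_labels, example_logits).numpy())
    print(bce_from_probs(example_labels, tf.sigmoid(example_logits)).numpy())  # same value (up to clipping)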


  • Later, when we are exploring regression problems (say, to predict the price of a house), we will see how to use another loss function called mean squared error.


  • Now, configure the model to use an optimizer and a loss function:


  • 
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=['accuracy'])
     
    
  • ## Train the model
    Train the model for 10 epochs in mini-batches of 512 samples.


  • This is 10 iterations over all the examples in `train_data`; with 15,000 training examples and a batch size of 512, that is about 30 steps per epoch.


  • While training, monitor the model's loss and accuracy on the 10,000 samples from the validation set:


  • 
    history = model.fit(train_data.shuffle(10000).batch(512),
                        epochs=10,
                        validation_data=validation_data.batch(512),
                        verbose=1)
     
    
  • ## Evaluate the model
    Let's see how the model performs. Two values will be returned: loss (a number which represents our error; lower values are better) and accuracy.


  • 
    results = model.evaluate(test_data.batch(512), verbose=2)
    
    for name, value in zip(model.metrics_names, results):
      print("%s: %.3f" % (name, value))
     
    
  • This fairly naive approach achieves an accuracy of about 87%. With more advanced approaches, the model should get closer to 95%.
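

  • Finally, you can try the trained model on raw text of your own. A minimal sketch; the two reviews below are made up, and `tf.sigmoid` turns the model's logits into probabilities of a positive review:


  • 
    # Predict on new, made-up reviews (a sketch, not part of the original notebook).
    new_reviews = tf.constant([
        "This movie was fantastic, I would happily watch it again.",
        "Dull, predictable, and far too long."
    ])
    
    new_logits = model.predict(new_reviews)
    new_probabilities = tf.sigmoid(new_logits)  # probability that each review is positive
    
    for review, p in zip(new_reviews.numpy(), new_probabilities.numpy()):
        print(f"{p[0]:.2f}  {review.decode('utf-8')}")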