• The TensorFlow tutorials are written as Jupyter notebooks and run directly in Google Colab—a hosted notebook environment that requires no setup. Click the Run in Google Colab button.


  • Colab link - Open colab


  • Description: Training a video classifier with transfer learning and a recurrent model on the UCF101 dataset.


  • This example demonstrates video classification, an important use-case with applications in recommendations, security, and so on. We will be using the UCF101 dataset to build our video classifier.


  • The dataset consists of videos categorized into different actions, like cricket shot, punching, biking, etc. This dataset is commonly used to build action recognizers, which are an application of video classification.


  • A video consists of an ordered sequence of frames. Each frame contains spatial information, and the sequence of those frames contains temporal information.


  • To model both of these aspects, we use a hybrid architecture that consists of convolutions (for spatial processing) as well as recurrent layers (for temporal processing).


  • Specifically, we'll use a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) consisting of GRU layers. This kind of hybrid architecture is popularly known as a CNN-RNN.
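

  • To make the data flow concrete, here is a minimal conceptual sketch of a CNN-RNN (this is for illustration only; the example below does not train end to end but instead precomputes the CNN features once and trains only the recurrent model on them):


  • 
    # Conceptual sketch of a CNN-RNN: a CNN turns each frame into a feature vector,
    # and a GRU reads the resulting sequence. Shapes are illustrative (20 frames,
    # 224x224 RGB, 5 classes); weights=None avoids downloading pretrained weights.
    from tensorflow import keras

    frames_in = keras.Input(shape=(20, 224, 224, 3))          # (batch, time, H, W, C)
    cnn = keras.applications.InceptionV3(include_top=False, pooling="avg", weights=None)
    per_frame = keras.layers.TimeDistributed(cnn)(frames_in)  # (batch, time, 2048)
    encoded = keras.layers.GRU(16)(per_frame)                 # (batch, 16)
    outputs = keras.layers.Dense(5, activation="softmax")(encoded)
    sketch_model = keras.Model(frames_in, outputs)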


  • This example requires TensorFlow 2.5 or higher, as well as TensorFlow Docs, which can be installed using the following command:


  • 
    
    
    !pip install -q git+https://github.com/tensorflow/docs
       
    
  • Data collection: In order to keep the runtime of this example relatively short, we will be using a subsampled version of the original UCF101 dataset.


  • You can refer to [this notebook](https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb) to know how the subsampling was done.


  • The wget command downloads a file from an external URL or web address. The URL to download from is given after wget.


  • After the URL, the -O option gives the name under which the downloaded file is saved. A .tar.gz file is a compressed archive, similar to a zip file.


  • Then, using the tar xf command, we extract the archive to obtain its contents in a folder.


  • The ! symbol before the wget and tar xf commands tells the notebook to run them as shell commands. It is needed in both Colab and Jupyter notebooks; it is only omitted when running the commands directly in a terminal.


  • 
    !wget -q https://git.io/JGc31 -O ucf101_top5.tar.gz
    !tar xf ucf101_top5.tar.gz
    
       
    
  • Next, we import the required libraries.


  • 
    from tensorflow_docs.vis import embed
    from tensorflow import keras
    from imutils import paths
    
    import matplotlib.pyplot as plt
    import tensorflow as tf
    import pandas as pd
    import numpy as np
    import imageio
    import cv2
    import os
       
    
  • The downloaded data is extracted into a folder containing videos of different lengths. We also have two files, train.csv and test.csv, which have the video file name in one column and the video's label in the second column.


  • We read both train.csv and test.csv into pandas DataFrames using read_csv.


  • len() gives the number of rows in each DataFrame, i.e. the total number of videos.


  • 
    ## Data preparation
    
    train_df = pd.read_csv("train.csv")
    test_df = pd.read_csv("test.csv")
    
    print(f"Total videos for training: {len(train_df)}")
    print(f"Total videos for testing: {len(test_df)}")
    
    train_df.sample(10)
       
    
  • Since a video is an ordered sequence of frames, we could just extract the frames and put them in a 3D tensor. But the number of frames may differ from video to video which would prevent us from stacking them into batches (unless we use padding).


  • As an alternative, we can **save video frames at a fixed interval until a maximum frame count is reached**. In this example we will do the following:


  • 1. Capture the frames of a video.


  • 2. Extract frames from the videos until a maximum frame count is reached.


  • 3. In the case where a video's frame count is less than the maximum frame count, we will pad the video with zeros (see the short sketch after this list).
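

  • A minimal sketch of the padding idea (pad_frames is a hypothetical helper; the actual code below pads the extracted feature vectors and tracks a boolean mask rather than padding raw frames):


  • 
    import numpy as np

    MAX_SEQ_LENGTH = 20  # matches the hyperparameter defined later
    IMG_SIZE = 224

    def pad_frames(frames):
        """Zero-pad (or truncate) a (num_frames, H, W, 3) array to MAX_SEQ_LENGTH frames."""
        padded = np.zeros((MAX_SEQ_LENGTH, IMG_SIZE, IMG_SIZE, 3), dtype=frames.dtype)
        num_frames = min(MAX_SEQ_LENGTH, len(frames))
        padded[:num_frames] = frames[:num_frames]
        return padded

    # A hypothetical 12-frame clip becomes a 20-frame tensor; the last 8 frames stay zero.
    clip = np.zeros((12, IMG_SIZE, IMG_SIZE, 3), dtype="uint8")
    print(pad_frames(clip).shape)  # (20, 224, 224, 3)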


  • Videos of the UCF101 dataset are [known](https://www.crcv.ucf.edu/papers/UCF101_CRCV-TR-12-01.pdf) to not contain extreme variations in objects and actions across frames. Because of this, it may be okay to only consider a few frames for the learning task.


  • But this approach may not generalize well to other video classification problems.


  • We will be using [OpenCV's `VideoCapture()` method](https://docs.opencv.org/master/dd/d43/tutorial_py_video_display.html) to read frames from videos.


  • We define some variables. IMG_SIZE is the height and width to which each frame extracted from a video is resized, since the neural network needs all images to be the same size.


  • 
    ## Define hyperparameters
    
    IMG_SIZE = 224
    BATCH_SIZE = 64
    EPOCHS = 10
    
    MAX_SEQ_LENGTH = 20
    NUM_FEATURES = 2048
       
    
  • crop_center_square() takes as input the NumPy array of an image. If the width x and height y are equal, the function returns the frame unchanged.


  • If x and y differ, it crops away part of the larger dimension to obtain a square image.


  • So if x is larger than y, roughly (x - y) / 2 pixels are trimmed from each side of the width, keeping the crop centered.


  • 
    def crop_center_square(frame):
        y, x = frame.shape[0:2]
        min_dim = min(y, x)
        start_x = (x // 2) - (min_dim // 2)
        start_y = (y // 2) - (min_dim // 2)
        return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]
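

  • A quick check on a hypothetical 480x640 frame shows the crop keeping a centered 480x480 square:


  • 
    import numpy as np

    # A 480x640 frame (height x width x channels): 80 columns are trimmed from each
    # side of the width, leaving a centered square.
    frame = np.zeros((480, 640, 3), dtype="uint8")
    print(crop_center_square(frame).shape)  # (480, 480, 3)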
       
    
  • load_video() extracts the frames from a video and resizes each frame to IMG_SIZE x IMG_SIZE.


  • VideoCapture() reads the video from path. The while loop runs until the video ends or, if max_frames is positive, until max_frames frames have been captured. Each frame is center-cropped, resized, and has its channels reordered from BGR (OpenCV's default) to RGB.


  • 
    def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
        cap = cv2.VideoCapture(path)
        frames = []
        try:
            while True:
                ret, frame = cap.read()
                if not ret:
                    break
                frame = crop_center_square(frame)
                frame = cv2.resize(frame, resize)
                frame = frame[:, :, [2, 1, 0]]
                frames.append(frame)
    
                if len(frames) == max_frames:
                    break
        finally:
            cap.release()
        return np.array(frames)
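

  • For example, loading the first clip listed in train.csv (assuming the archive extracts into the train/ and test/ folders used later by prepare_all_videos()) gives an array of shape (num_frames, 224, 224, 3), where the frame count varies from video to video:


  • 
    # Shape check on one training clip.
    sample_frames = load_video(os.path.join("train", train_df["video_name"].values[0]))
    print(sample_frames.shape)  # e.g. (num_frames, 224, 224, 3)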
    
       
    
  • We can use a pre-trained network to extract meaningful features from the extracted frames. Applied directly to each frame, the pretrained model produces an array that serves as the feature representation of that frame.


  • The [`Keras Applications`](https://keras.io/api/applications/) module provides a number of state-of-the-art models pre-trained on the [ImageNet-1k dataset](http://image-net.org/).


  • We will be using the [InceptionV3 model](https://arxiv.org/abs/1512.00567) for this purpose. Any other model can also be used.


  • build_feature_extractor() initializes InceptionV3 with ImageNet weights, removes the top classification layer (include_top=False), applies global average pooling, and sets input_shape to (IMG_SIZE, IMG_SIZE, 3).


  • 
    def build_feature_extractor():
        feature_extractor = keras.applications.InceptionV3(
            weights="imagenet",
            include_top=False,
            pooling="avg",
            input_shape=(IMG_SIZE, IMG_SIZE, 3),
        )
        preprocess_input = keras.applications.inception_v3.preprocess_input
    
        inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
        preprocessed = preprocess_input(inputs)
    
        outputs = feature_extractor(preprocessed)
        return keras.Model(inputs, outputs, name="feature_extractor")
    
    
    feature_extractor = build_feature_extractor()
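

  • A quick sanity check (the all-zeros frame is just a placeholder) confirms that every frame is mapped to a 2048-dimensional vector, which is why NUM_FEATURES is set to 2048:


  • 
    # One dummy frame in, one 2048-dimensional feature vector out.
    dummy_frame = np.zeros((1, IMG_SIZE, IMG_SIZE, 3), dtype="float32")
    print(feature_extractor(dummy_frame).shape)  # (1, 2048)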
       
    
  • The labels of the videos are strings. Neural networks do not understand string values, so they must be converted to some numerical form before they are fed to the model.


  • Here we will use the [`StringLookup`](https://keras.io/api/layers/preprocessing_layers/categorical/string_lookup) layer to encode the class labels as integers.


  • 
    label_processor = keras.layers.StringLookup(
        num_oov_indices=0, vocabulary=np.unique(train_df["tag"])
    )
    print(label_processor.get_vocabulary())
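

  • For example, we can encode the first two training labels (the exact integers depend on the vocabulary order):


  • 
    # Map a couple of string labels to their integer indices.
    sample_tags = train_df["tag"].values[:2]
    print(sample_tags, "->", label_processor(sample_tags[..., None]).numpy().flatten())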
       
    
  • Finally, we can put all the pieces together to create our data processing utility.


  • prepare_all_videos():


    1. First, we read the DataFrame of video names and labels.


    2. num_samples is the length of the DataFrame, i.e. the total number of videos.


    3. video_paths is the list of file paths of all the videos.


    4. labels is the class of each video.


    5. label_processor converts the string labels to integer indices.


    6. `frame_masks` and `frame_features` are what we will feed to our sequence model.


    7. `frame_masks` will contain a bunch of booleans denoting if a timestep is masked with padding or not. Initially it is all zeros.


    8. Then, in the for loop, we go through the videos one by one.


    9. Each video is passed to load_video() and its frames are extracted.


    10. Then a batch dimension is added to the frames.


    11. A second loop goes through each frame and passes it through the pretrained model to get the feature embedding.


  • At the end of execution, each video is represented as a 20 x 2048 array: 20 because only the first 20 frames are taken from each video (this can be changed via MAX_SEQ_LENGTH), and 2048 because that is the number of features the pretrained model produces for each frame. After all 592 training videos are processed, the total training data has shape 592 x 20 x 2048.


  • 
    def prepare_all_videos(df, root_dir):
        num_samples = len(df)
        video_paths = df["video_name"].values.tolist()
        labels = df["tag"].values
        labels = label_processor(labels[..., None]).numpy()
    
    
        frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
        frame_features = np.zeros(
            shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )
    
        # For each video.
        for idx, path in enumerate(video_paths):
            # Gather all its frames and add a batch dimension.
            frames = load_video(os.path.join(root_dir, path))
            frames = frames[None, ...]
    
            # Initialize placeholders to store the masks and features of the current video.
            temp_frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
            temp_frame_features = np.zeros(
                shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
            )
    
            # Extract features from the frames of the current video.
            for i, batch in enumerate(frames):
                video_length = batch.shape[0]
                length = min(MAX_SEQ_LENGTH, video_length)
                for j in range(length):
                    temp_frame_features[i, j, :] = feature_extractor.predict(
                        batch[None, j, :]
                    )
                temp_frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked
    
            frame_features[idx,] = temp_frame_features.squeeze()
            frame_masks[idx,] = temp_frame_mask.squeeze()
    
        return (frame_features, frame_masks), labels
    
    
    train_data, train_labels = prepare_all_videos(train_df, "train")
    test_data, test_labels = prepare_all_videos(test_df, "test")
    
    print(f"Frame features in train set: {train_data[0].shape}")
    print(f"Frame masks in train set: {train_data[1].shape}")
       
    
  • The above code block will take around 20 minutes to execute, depending on the machine it is run on.
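

  • If the runtime is a concern, one optional tweak (a sketch, not part of the original code) is to featurize all of a clip's frames in a single batched call instead of one predict() call per frame:


  • 
    # Hypothetical helper: extract features for up to MAX_SEQ_LENGTH frames at once.
    def extract_clip_features(frames):
        clip = frames[:MAX_SEQ_LENGTH]          # (<= 20, 224, 224, 3)
        return feature_extractor.predict(clip)  # (<= 20, 2048)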


  • Now, we can feed this data to a sequence model consisting of recurrent layers like `GRU`.


  • GRU and LSTM layers are both options here, and both are designed to preserve long-range dependencies better than a vanilla RNN.


  • LSTMs are often the default choice, but here we use GRU layers, which have fewer parameters.


  • A bidirectional version of either layer could also be useful.


  • We could experiment by starting with a simple RNN, then a GRU, then an LSTM, and then their bidirectional variants (a sketch of a bidirectional variant follows the model definition below).


  • 
    def get_sequence_model():
        class_vocab = label_processor.get_vocabulary()
    
        frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
        mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")
    
        # Refer to the following tutorial to understand the significance of using `mask`:
        # https://keras.io/api/layers/recurrent_layers/gru/
        x = keras.layers.GRU(16, return_sequences=True)(
            frame_features_input, mask=mask_input
        )
        x = keras.layers.GRU(8)(x)
        x = keras.layers.Dropout(0.4)(x)
        x = keras.layers.Dense(8, activation="relu")(x)
        output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)
    
        rnn_model = keras.Model([frame_features_input, mask_input], output)
    
        rnn_model.compile(
            loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
        )
        return rnn_model
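

  • If you want to try the bidirectional variants mentioned above, here is a hedged sketch (not used in this example) that swaps the GRU stack for bidirectional LSTM layers, keeping everything else the same:


  • 
    def get_bidirectional_sequence_model():
        class_vocab = label_processor.get_vocabulary()

        frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
        mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")

        # Same structure as get_sequence_model(), with bidirectional LSTMs in place of GRUs.
        x = keras.layers.Bidirectional(keras.layers.LSTM(16, return_sequences=True))(
            frame_features_input, mask=mask_input
        )
        x = keras.layers.Bidirectional(keras.layers.LSTM(8))(x)
        x = keras.layers.Dropout(0.4)(x)
        x = keras.layers.Dense(8, activation="relu")(x)
        output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

        model = keras.Model([frame_features_input, mask_input], output)
        model.compile(
            loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
        )
        return model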
       
    
  • Utility for running experiments. filepath is the path where the model checkpoint is stored.


  • ModelCheckpoint() saves the model weights to that path whenever the validation loss improves (save_best_only=True together with save_weights_only=True).


  • After training, we load the best weights back into the model and then check the results on the test set using evaluate().


  • 
    def run_experiment():
        filepath = "/tmp/video_classifier"
        checkpoint = keras.callbacks.ModelCheckpoint(
            filepath, save_weights_only=True, save_best_only=True, verbose=1
        )
    
        seq_model = get_sequence_model()
        history = seq_model.fit(
            [train_data[0], train_data[1]],
            train_labels,
            validation_split=0.3,
            epochs=EPOCHS,
            callbacks=[checkpoint],
        )
    
        seq_model.load_weights(filepath)
        _, accuracy = seq_model.evaluate([test_data[0], test_data[1]], test_labels)
        print(f"Test accuracy: {round(accuracy * 100, 2)}%")
    
        return history, seq_model
    
    
    _, sequence_model = run_experiment()
       
    
  • To keep the runtime of this example relatively short, we just used a few training examples.


  • This number of training examples is low relative to the sequence model being used, which has 99,909 trainable parameters.
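

  • You can verify the parameter count by printing the model summary:


  • 
    # Print the layer-by-layer summary, including the trainable parameter count.
    get_sequence_model().summary()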


  • You are encouraged to sample more data from the UCF101 dataset using [the notebook](https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb) mentioned above and train the same model.


    • Finally, we can take a sample video from the test set and run the trained model on it.


    • In this particular run, the randomly chosen test video is v_Punch_g02_c04.avi.


    • After extracting all the frames with load_video() and adding a batch dimension inside prepare_single_video(), the frames array for this video has shape (1, 55, 224, 224, 3).


    • The frames array has five dimensions.


    • First: the batch dimension, which is 1 because we are processing a single video.


    • Next: the number of frames taken from the video, 55 here, which depends on the video's length.


    • Next are the height, width, and channels.


    • Channels are 3 for RGB or color video and 1 for black and white (grayscale).


    • 
      def prepare_single_video(frames):
          frames = frames[None, ...]
          frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
          frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")
      
          for i, batch in enumerate(frames):
              video_length = batch.shape[0]
              length = min(MAX_SEQ_LENGTH, video_length)
              for j in range(length):
                  frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])
              frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked
      
          return frame_features, frame_mask
      
      
      def sequence_prediction(path):
          class_vocab = label_processor.get_vocabulary()
      
          frames = load_video(os.path.join("test", path))
          frame_features, frame_mask = prepare_single_video(frames)
          probabilities = sequence_model.predict([frame_features, frame_mask])[0]
      
          for i in np.argsort(probabilities)[::-1]:
              print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
          return frames
         
      
      
      # This utility is for visualization.
      # Referenced from:
      # https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
      
      def to_gif(images):
          converted_images = images.astype(np.uint8)
          imageio.mimsave("animation.gif", converted_images, fps=10)
          return embed.embed_file("animation.gif")
      
      
      test_video = np.random.choice(test_df["video_name"].values.tolist())
      print(f"Test video path: {test_video}")
      test_frames = sequence_prediction(test_video)
      to_gif(test_frames[:MAX_SEQ_LENGTH])