• The TensorFlow tutorials are written as Jupyter notebooks and run directly in Google Colab—a hosted notebook environment that requires no setup. Click the Run in Google Colab button.


  • Colab link - Open colab


  • Description: Training a video classifier with a 3D convolutional neural network (CNN) on the UCF101 dataset.


  • This example demonstrates video classification, an important use case with applications in recommendations, security, and so on. We will be using the UCF101 dataset to build a 3D CNN video classifier.


  • The dataset consists of videos categorized into different actions, like cricket shot, punching, biking, etc. This dataset is commonly used to build action recognizers, which are an application of video classification.


  • A video consists of an ordered sequence of frames. Each frame contains spatial information, and the sequence of those frames contains temporal information.


  • To model both of these aspects, we use convolutions that operate over time as well as over space.


  • Specifically, we'll use a 3D Convolutional Neural Network (CNN).
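

  • As a minimal illustration of this idea (the shapes below are only for demonstration and are not part of the tutorial code), a single Conv3D layer convolves over the frame axis as well as over height and width:


  • 
    import tensorflow as tf
    from tensorflow import keras
    
    # Illustration only: one Conv3D layer applied to a dummy clip of
    # 20 frames of size 224x224x3. The kernel slides over the frame axis
    # as well as over height and width, so spatial and temporal patterns
    # are captured jointly.
    dummy_clip = tf.random.normal((1, 20, 224, 224, 3))  # (batch, frames, H, W, C)
    conv3d = keras.layers.Conv3D(16, kernel_size=(3, 3, 3), activation="relu")
    print(conv3d(dummy_clip).shape)  # (1, 18, 222, 222, 16)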


  • In order to keep the runtime of this example relatively short, we will be using a subsampled version of the original UCF101 dataset.


  • You can refer to [this notebook](https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb) to learn how the subsampling was done.


  • 
    !pip install -q git+https://github.com/tensorflow/docs
    
    !wget -q https://git.io/JGc31 -O ucf101_top5.tar.gz
    !tar xf ucf101_top5.tar.gz
      
    


  • Import the packages and classes needed.


  • IMG_SIZE defines the width and height of a frame, BATCH_SIZE is the mini-batch size used for gradient descent, EPOCHS is the number of epochs (full passes over the training data) run by fit(), and MAX_SEQ_LENGTH is the number of frames to be extracted from each video.


  • 
    from tensorflow_docs.vis import embed
    from tensorflow import keras
    from imutils import paths
    
    import matplotlib.pyplot as plt
    import tensorflow as tf
    import pandas as pd
    import numpy as np
    import imageio
    import cv2
    import os
    
    ## Define hyperparameters
    
    IMG_SIZE = 224
    BATCH_SIZE = 64
    EPOCHS = 10
    MAX_SEQ_LENGTH = 20
      
    


  • Two dataframes are created, one from train.csv and one from test.csv, to hold the lists of training and test videos.


  • 
    train_df = pd.read_csv("train.csv")
    test_df = pd.read_csv("test.csv")
    
    print(f"Total videos for training: {len(train_df)}")
    print(f"Total videos for testing: {len(test_df)}")
    
    train_df.sample(10)
      
    
  • One of the many challenges of training video classifiers is figuring out a way to feed the videos to a network.


  • [This blog post](https://blog.coast.ai/five-video-classification-methods-implemented-in-keras-and-tensorflow-99cad29cc0b5) discusses five such methods.


  • Since a video is an ordered sequence of frames, we could just extract the frames and put them in a 3D tensor.


  • But the number of frames may differ from video to video, which would prevent us from stacking them into batches (unless we use padding).


  • As an alternative, we can **save video frames at a fixed interval until a maximum frame count is reached**. In this example we will do the following:


  • 1. Capture the frames of a video.


  • 2. Extract frames from the videos until a maximum frame count is reached.


  • 3. If a video's frame count is less than the maximum frame count, we will pad the video with zeros (a minimal sketch of this step is shown after this list).
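

  • As a sketch of steps 2 and 3, a hypothetical pad_or_truncate() helper (not part of the tutorial code, which simply truncates in load_video(), and using the IMG_SIZE and MAX_SEQ_LENGTH hyperparameters defined above) could look like this:


  • 
    import numpy as np
    
    # Hypothetical helper, for illustration only: trim a clip that is too
    # long, or zero-pad one that is too short, so every video ends up with
    # exactly MAX_SEQ_LENGTH frames of shape (IMG_SIZE, IMG_SIZE, 3).
    def pad_or_truncate(frames, max_frames=MAX_SEQ_LENGTH):
        frames = frames[:max_frames]
        if len(frames) < max_frames:
            padding = np.zeros(
                (max_frames - len(frames), IMG_SIZE, IMG_SIZE, 3), dtype=frames.dtype
            )
            frames = np.concatenate([frames, padding], axis=0)
        return frames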


  • Note that this workflow is identical to [problems involving text sequences](https://developers.google.com/machine-learning/guides/text-classification/).


  • Videos of the UCF101 dataset are [known](https://www.crcv.ucf.edu/papers/UCF101_CRCV-TR-12-01.pdf) not to contain extreme variations in objects and actions across frames.


  • Because of this, it may be okay to only consider a few frames for the learning task.


  • But this approach may not generalize well to other video classification problems.


  • We will be using [OpenCV's `VideoCapture()` method](https://docs.opencv.org/master/dd/d43/tutorial_py_video_display.html) to read frames from videos.


  • The following two methods are taken from [this tutorial](https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub).


  • 
    def crop_center_square(frame):
        y, x = frame.shape[0:2]
        min_dim = min(y, x)
        start_x = (x // 2) - (min_dim // 2)
        start_y = (y // 2) - (min_dim // 2)
        return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]
    
    
    def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
        cap = cv2.VideoCapture(path)
        frames = []
        try:
            while True:
                ret, frame = cap.read()
                if not ret:
                    break
                frame = crop_center_square(frame)
                frame = cv2.resize(frame, resize)
                frame = frame[:, :, [2, 1, 0]]
                frames.append(frame)
    
                if len(frames) == max_frames:
                    break
        finally:
            cap.release()
        frames = frames[:MAX_SEQ_LENGTH]  # Keep at most MAX_SEQ_LENGTH frames per video.
        return np.array(frames)
      
    
  • The labels of the videos are strings.


  • Neural networks do not understand string values, so they must be converted to some numerical form before they are fed to the model.


  • Here we will use the [`StringLookup`](https://keras.io/api/layers/preprocessing_layers/categorical/string_lookup) layer to encode the class labels as integers.


  • 
    label_processor = keras.layers.StringLookup(
        num_oov_indices=0, vocabulary=np.unique(train_df["tag"])
    )
    print(label_processor.get_vocabulary())
    
    # Take all class labels from the 'tag' column of train_df.
    labels = train_df["tag"].values
    # Convert the string labels to integer indices.
    labels = label_processor(labels[..., None]).numpy()
    
    # A (594, 1)-shaped NumPy array.
    labels.shape
      
    
  • Finally, we can put all the pieces together to create our data processing utility.


  • prepare_all_videos() takes a dataframe with two columns: one with video paths and the other with class labels.


  • An empty list is created; then, for each video, the frames are extracted with load_video() and appended to the list.


  • After all videos are processed, we get a list of frame arrays with overall shape (594, 20, 224, 224, 3): 594 is the batch dimension, 20 is the number of frames, and 224 × 224 × 3 is the width, height, and channels of each frame.


  • Then we convert the list to a NumPy array and save the arrays as pickle files.


  • 
    def prepare_all_videos(df, root_dir):
        frame = []
        num_samples = len(df)
        video_paths = df["video_name"].values.tolist()
        labels = df["tag"].values
        labels = label_processor(labels[..., None]).numpy()
    
        # For each video.
        for idx, path in enumerate(video_paths):
            # Gather all its frames and add to a list.
            frames = load_video(os.path.join(root_dir, path))
            frame.append(frames)
            
        return frame, labels
    
    
    train_data, train_labels = prepare_all_videos(train_df, "train")
    test_data, test_labels = prepare_all_videos(test_df, "test")
    
    print(f"train_data in train set: {len(train_data)}")
    print(f"train_labels in train set: {train_labels.shape}")
    print(f"test_data in train set: {len(test_data)}")
    print(f"test_labels in train set: {test_labels.shape}")
    
    #train_data each index: (20, 224, 224, 3)
    print(f"train_data each index: {train_data[0].shape}")
    
    train_data = np.array(train_data)
    
    train_data.shape
    
    import pickle
    
    # Save the processed arrays to files in the current working directory.
    pkl_filename1 = "traindata.pkl"
    with open(pkl_filename1, 'wb') as file:
        pickle.dump(train_data, file)
    
    pkl_filename2 = "trainlabels.pkl"
    with open(pkl_filename2, 'wb') as file:
        pickle.dump(train_labels, file)
    
    pkl_filename3 = "testdata.pkl"
    with open(pkl_filename3, 'wb') as file:
        pickle.dump(test_data, file)
    
    pkl_filename4 = "testlabels.pkl"
    with open(pkl_filename4, 'wb') as file:
        pickle.dump(test_labels, file)
    
    !ls -a
      
    
  • The above code block will take roughly 20 minutes to execute, depending on the machine it is run on.
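

  • If you have already run the preprocessing once, a sketch like the following (assuming the pickle files written above are still in the working directory) reloads the saved arrays so the frame extraction does not have to be repeated:


  • 
    import pickle
    
    # Reload the arrays saved above instead of re-extracting the frames.
    with open("traindata.pkl", "rb") as file:
        train_data = pickle.load(file)
    with open("trainlabels.pkl", "rb") as file:
        train_labels = pickle.load(file)
    with open("testdata.pkl", "rb") as file:
        test_data = pickle.load(file)
    with open("testlabels.pkl", "rb") as file:
        test_labels = pickle.load(file)
    
    print(train_data.shape, train_labels.shape)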


  • Now we will build a 3D CNN model, first importing the required layers and classes: Dense, MaxPooling3D, Conv3D, Flatten, and Dropout.


  • A Conv3D layer followed by MaxPooling3D is added; then, due to memory restrictions, only a couple of dense layers are added. A ModelCheckpoint callback is used to store the best weights seen during training.


  • compile() uses sparse categorical crossentropy as the loss, Adam as the optimizer, and accuracy as the metric.


  • 
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, MaxPooling3D, Conv3D, Flatten, Dropout
    
    class_vocab = label_processor.get_vocabulary()
    
    model = Sequential()
    model.add(Conv3D(
                16, (3,3,3), activation='relu', input_shape=(20, 224, 224, 3)
            ))
    model.add(MaxPooling3D(pool_size=(1, 2, 2), strides=(1, 2, 2)))
    model.add(Flatten())
    model.add(Dense(128))
    model.add(Dropout(0.5))
    model.add(Dense(len(class_vocab), activation='softmax')) 
    model.compile(
            loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
        )
    
    filepath = "/tmp/video_classifier"
    
    checkpoint = keras.callbacks.ModelCheckpoint(
            filepath, save_weights_only=True, save_best_only=True, verbose=1
        )
    
    
    
    history = model.fit(
            train_data,
            train_labels,
            validation_split=0.3,
            batch_size=BATCH_SIZE,
            epochs=EPOCHS,
            callbacks=[checkpoint],
        )
      
    
  • To keep the runtime of this example relatively short, we just used a few training examples.


  • This number of training examples is low with respect to the 3D CNN being used, which has a very large number of trainable parameters (you can inspect the exact count as shown below).


  • You are encouraged to sample more data from the UCF101 dataset using [the notebook](https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb) mentioned above and train the same model.
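

  • To check the trainable-parameter count for yourself, you can inspect the model defined above (a quick sanity check, not part of the original workflow):


  • 
    # Print a layer-by-layer breakdown, including the total and trainable
    # parameter counts of the model defined above.
    model.summary()
    
    # Alternatively, count the trainable parameters directly.
    trainable_params = sum(int(tf.size(w)) for w in model.trainable_weights)
    print(f"Trainable parameters: {trainable_params:,}")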


  • To predict on a single video file, we simply extract the first twenty frames, add a batch dimension, and pass the result to the trained model. This is all achieved with the code below.


  • sequence_prediction() takes the path of a video. The video is passed to load_video(), which extracts its first twenty frames.


  • Then a batch dimension is added so the data structure has five dimensions: batch, number of frames, width, height, and channels.


  • We then pass this array to the model for prediction and sort the predicted probabilities to display the classes from most to least likely.


  • 
    def sequence_prediction(path):
        # Extract up to MAX_SEQ_LENGTH frames and add a batch dimension:
        # (1, num_frames, height, width, channels).
        frames = load_video(os.path.join("test", path))
        frame_features = frames[None, ...]
        probabilities = model.predict(frame_features)[0]
        for i in np.argsort(probabilities)[::-1]:
            print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
        return frames
    
    
    # This utility is for visualization.
    # Referenced from:
    # https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
    def to_gif(images):
        converted_images = images.astype(np.uint8)
        imageio.mimsave("animation.gif", converted_images, fps=10)
        return embed.embed_file("animation.gif")
    
    
    test_video = np.random.choice(test_df["video_name"].values.tolist())
    print(f"Test video path: {test_video}")
    test_frames = sequence_prediction(test_video)
    to_gif(test_frames[:MAX_SEQ_LENGTH])
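

  • Note that the ModelCheckpoint callback above saved only the best weights to /tmp/video_classifier. To predict with those weights rather than the weights from the final epoch, a sketch like the following restores them before calling sequence_prediction():


  • 
    # Restore the best weights saved by the ModelCheckpoint callback so that
    # predictions use them rather than the weights from the final epoch.
    model.load_weights(filepath)
    
    # Re-run the prediction on the same test video with the restored weights.
    _ = sequence_prediction(test_video)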