In this blog post, we will build a simple video-to-slides converter application that extracts slide images from slide or lecture videos using basic frame differencing and background subtraction techniques in OpenCV.
This is highly useful when one wishes to have a video lecture (with or without animations) in the form of slides, either a PPT or a PDF. However, more often than not, slides are not provided when such video lectures are hosted on platforms such as YouTube. This article aims to build a robust application that can convert video lectures into the corresponding slides using basic frame differencing and statistical background subtraction models such as KNN and GMG, already available in OpenCV.
- What is Background Subtraction?
- Background Subtraction using Frame Differencing in OpenCV
- OpenCV Background Subtraction Techniques
- Workflow for Video to Slides Converter Application
- Code Explanation for Video to Slides Converter Application
- Comparison Across GMG and KNN Background Estimation
- Scope for Improvements
- Summary
- References
What is Background Subtraction?
Before we dive deep into building the application, it is important to understand background subtraction, since it is the crux of our application.
Background subtraction is a technique for separating foreground objects from the background in a video sequence. The idea is to model the scene’s background and subtract it from each frame to obtain the foreground objects. This is useful in many computer vision applications, such as object tracking, activity recognition, and crowd analysis. Therefore, we can extend this concept to convert slide videos into the corresponding slides where the notion of motion is the various animations encountered through the video sequence.
Background modeling consists of two main steps:
- Background Initialization
- Background Update.
In the first step, an initial model of the background is computed, while in the second, the model is updated to adapt to possible changes in the scene. Background Estimation can also be applied to motion-tracking applications such as traffic analysis, people detection, etc. This article on background estimation for motion tracking will undoubtedly help you gain a better understanding.
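To make these two steps concrete, here is a minimal, illustrative sketch (not the approach used later in this article) that initializes the background from the first grayscale frame and then updates it as a running average with cv2.accumulateWeighted; the video file name is a placeholder:
import cv2
import numpy as np

cap = cv2.VideoCapture('sample_video.mp4')  # placeholder video file

# Background Initialization: start the model from the first frame.
ret, frame = cap.read()
background = np.float32(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Background Update: slowly blend the current frame into the model.
    cv2.accumulateWeighted(gray, background, 0.05)

    # Foreground = |current frame - background model|, thresholded.
    fg_mask = cv2.absdiff(gray, cv2.convertScaleAbs(background))
    _, fg_mask = cv2.threshold(fg_mask, 30, 255, cv2.THRESH_BINARY)

cap.release()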
Background Subtraction using Frame Differencing in OpenCV
A common question is whether we can model the background by simply performing frame differencing between the previous and current frames. While this approach can work for videos with mostly static frames, it does not yield good results for videos with significant animations.
Therefore, for videos with significant animations, it becomes imperative to model the background with statistical approaches instead of naive frame differencing. There are quite a few background subtraction approaches already provided by OpenCV.
OpenCV Background Subtraction Techniques
The major background subtraction approaches from OpenCV that are popularly used are:
- KNN-based Background Subtraction: A non-parametric modeling approach that implements the K-nearest neighbors technique for background/foreground segmentation. The function prototype for creating an object for the KNN Background subtraction model in OpenCV is:
cv2.createBackgroundSubtractorKNN([, history[, dist2Threshold[, detectShadows]]])
- Mixture of Gaussians (MOG v2): This parametric modeling approach implements an efficient adaptive algorithm using a Gaussian mixture probability density function for Background/Foreground Segmentation to better handle variations in the background over time and complex backgrounds with multiple colors and textures.
Its function prototype is:
cv2.createBackgroundSubtractorMOG2([, history[, varThreshold[, detectShadows]]])
- GMG Background Subtraction: This approach is named after the initials of the authors of the paper “Visual Tracking of Human Visitors under Variable-Lighting Conditions for a Responsive Audio Art Installation”: Andrew B. Godbehere, Akihiro Matsukawa, and Ken Goldberg. It is a parametric approach that combines statistical background image estimation with per-pixel Bayesian segmentation, employing a probabilistic foreground segmentation algorithm that identifies possible foreground objects using Bayesian inference.
Its function prototype is:
cv2.bgsegm.createBackgroundSubtractorGMG([, initializationFrames[, decisionThreshold]])
Of the above, we are going to use the GMG and KNN background subtraction models for our application, as they yield better results compared to MOG2.
One can also use an improved version of the spatiotemporal Local Binary Similarity Patterns (LBSP) approach, such as Self-Balanced Sensitivity Segmenter (SuBSENSE), for background estimation. SuBSENSE is already implemented in the BGS Library along with several other background subtraction methods.
In the coming sections, we shall discuss the arguments for the above Background Subtractor classes.
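All three subtractors share the same usage pattern: each frame is passed to the subtractor's apply() method, which updates the internal background model and returns a binary foreground mask. A minimal sketch (with a placeholder video file name and the KNN parameters we settle on later) looks like this:
import cv2

cap = cv2.VideoCapture('sample_video.mp4')  # placeholder video file
bg_sub = cv2.createBackgroundSubtractorKNN(history=15,
                                           dist2Threshold=100,
                                           detectShadows=False)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Update the background model and get the foreground mask.
    fg_mask = bg_sub.apply(frame)

    # Percentage of foreground pixels -- used later to detect animations.
    p_non_zero = (cv2.countNonZero(fg_mask) / (1.0 * fg_mask.size)) * 100

cap.release()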
Workflow for the Video to Slides Converter Application
For our application, we will employ both frame differencing (for handling static frames) and probabilistic background subtraction techniques (for handling videos with large animations).
Background Subtraction using Frame Differencing
Background subtraction through frame differencing is quite simple. We begin by retrieving the video frames in grayscale. We then compute the absolute difference between successive frames and, after some morphological operations, calculate the foreground mask percentage. If this percentage exceeds a certain threshold, we wait for a fixed number of frames to let the transition complete and then save the frame.
The flowchart below demonstrates this.
Probabilistic Background Modeling
Recall that we can use algorithms to model the background pixels when there are many video animations.
- Just as was the case with frame differencing, we start by retrieving the video frames.
- We then pass each of them through a background subtraction model, which generates a binary mask, and then calculate the percentage of foreground pixels in that frame.
- If this percentage is above a specific threshold T1, it indicates some motion (animations in our case), and we wait till the motion settles down. Once the percentage is below a threshold T2, we save the corresponding frame.
The same is shown in the flowchart below.
Code Explanation for the Video to Slides Converter Application
Before we jump into the respective algorithms of frame differencing and background subtraction, there are some code segments common to both approaches. Let’s discuss them first.
The following hierarchy shows all the scripts that would be used for this article:
├── frame_differencing.py
├── post_process.py
├── utils.py
└── video_2_slides.py
The file video_2_slides.py contains the main script for running the application. The others are the supporting utility modules.
We first begin with our imports:
import os
import time
import sys
import cv2
import argparse
import shutil
import img2pdf
import glob
import imagehash
from PIL import Image
Note: You need to have opencv-contrib-python installed to apply the GMG background subtractor method.
The additional utility packages used in this article are:
- img2pdf: converts all the generated slide images into a single PDF. We could also have used the PIL library for this; however, PIL requires each image in the set to be opened first, and using img2pdf overcomes this problem. Instead of a PDF, you can also transform the slide images into a PowerPoint presentation (ppt).
- imagehash: removes similar generated images using a popular image processing technique known as image hashing. This is applied as a post-processing step after the corresponding slide images are generated.
Now, let’s take a look at the create_output_directory function present in the utils module:
def create_output_directory(video_path, output_path, type_bgsub):
    vid_file_name = video_path.rsplit('/')[-1].split('.')[0]
    output_dir_path = os.path.join(output_path, vid_file_name, type_bgsub)

    # Remove the output directory if there is already one.
    if os.path.exists(output_dir_path):
        shutil.rmtree(output_dir_path)

    # Create the output directory.
    os.makedirs(output_dir_path, exist_ok=True)
    print('Output directory created...')
    print('Path:', output_dir_path)
    print('***'*10,'\n')

    return output_dir_path
The function above creates a directory where all the obtained slide png images and the final PDF will be dumped. If such a folder already exists, it is automatically deleted first.
The arguments are:
- video_path: the path to the video file.
- output_path: the path to the output directory where all the generated image slides will be stored.
- type_bgsub: the type of background subtraction to be performed. It can be one of Frame_Diff, KNN, or GMG.
Next, we look at the convert_slides_to_pdf function below:
def convert_slides_to_pdf(video_path, output_path):
    pdf_file_name = video_path.rsplit('/')[-1].split('.')[0]+'.pdf'
    output_pdf_path = os.path.join(output_path, pdf_file_name)

    print('Output PDF Path:', output_pdf_path)
    print('Converting captured slide images to PDF...')

    with open(output_pdf_path, "wb") as f:
        f.write(img2pdf.convert(sorted(glob.glob(f"{output_path}/*.png"))))

    print('PDF Created!')
    print('***'*10,'\n')
The function takes the video file path and the output directory path discussed above. It converts all the generated png slide images into a single PDF file named after the video file.
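Putting the two utilities together, a typical call sequence (with a placeholder video file name) might look like this:
video_path = 'sample_video.mp4'  # placeholder video file
output_dir_path = create_output_directory(video_path, 'output_results', 'GMG')

# ... capture the slide images into output_dir_path here ...

convert_slides_to_pdf(video_path, output_dir_path)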
Now, let us move on to discuss each of the approaches separately.
Frame Differencing for Background Subtraction
We start with the corresponding initializations:
prev_frame = None
curr_frame = None
screenshots_count = 0
capture_frame = False
frame_elapsed = 0
Here, the variables prev_frame and curr_frame are used to keep track of the previous and current frames, respectively. The variable screenshots_count tracks the number of screenshots saved during the process, and frame_elapsed counts the frames that have passed since motion was detected. We have also used a flag, capture_frame, that gets set whenever there is some motion between successive frames.
Next, we create the VideoCapture object with the path to the video file as follows:
cap = cv2.VideoCapture(video_file)
Next, we create the structuring element and retrieve the first frame of the video, as shown below:
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7,7))
success, first_frame = cap.read()
Since the first frame should always be part of the final set of slide images, we need to save it.
if success:
    # Convert the frame to grayscale.
    first_frame_gray = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
    prev_frame = first_frame_gray

    screenshots_count += 1

    filename = f"{screenshots_count:03}.png"
    out_file_path = os.path.join(output_dir_path, filename)
    print(f"Saving file at: {out_file_path}")

    # Save the frame.
    cv2.imwrite(out_file_path, first_frame)
We continue retrieving each video frame and converting it from BGR into gray-scale.
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    curr_frame = frame_gray
We set curr_frame to the converted grayscale image.
Next, we compute the absolute difference between consecutive frames and apply thresholding to generate a binary mask. This mask then undergoes dilation using the kernel defined earlier.
We then calculate the percentage of non-zero pixels (the foreground mask) across the entire mask.
frame_diff = cv2.absdiff(curr_frame, prev_frame)
_, frame_diff = cv2.threshold(frame_diff, 80, 255, cv2.THRESH_BINARY)
# Perform dilation to capture motion.
frame_diff = cv2.dilate(frame_diff, kernel)
# Compute the percentage of non-zero pixels in the frame.
p_non_zero = (cv2.countNonZero(frame_diff) / (1.0*frame_gray.size))*100
The video clip below shows how the foreground mask would look using frame differencing.
This percentage goes up whenever there is some sort of animation, and the capture_frame flag is then enabled, indicating the frame is in a transition phase. To capture the transition phase, we have defined MIN_PERCENT_THRESH, which is kept at 0.06 by default.
if p_non_zero >= MIN_PERCENT_THRESH and not capture_frame:
    capture_frame = True
Since we are performing frame differencing for videos with mostly static frames, we can safely skip some n frames before saving the appropriate frame. The value of n (ELAPSED_FRAME_THRESH in the code below) is set to 85 based on multiple experiments.
elif capture_frame:
    frame_elapsed += 1

    if frame_elapsed >= ELAPSED_FRAME_THRESH:
        capture_frame = False
        frame_elapsed = 0

        screenshots_count += 1

        filename = f"{screenshots_count:03}.png"
        out_file_path = os.path.join(output_dir_path, filename)
        print(f"Saving file at: {out_file_path}")

        cv2.imwrite(out_file_path, frame)
In the end, we set prev_frame to curr_frame.
prev_frame = curr_frame
This concludes our frame differencing approach, which is simple yet very effective for videos with mostly static frames, as opposed to the probabilistic background subtraction approaches.
Statistical Modeling of Background Pixels using OpenCV
The above method of naive frame differencing works well when the video consists mostly of static frames. However, for videos with too many animations, this approach fails.
Therefore, it becomes essential to model the background pixels using a statistical approach. One of the popular background subtraction methods is the GMG background subtractor.
The process of obtaining slide images is handled inside the capture_slides_bg_modeling function, as shown below:
def capture_slides_bg_modeling(video_path, output_dir_path, type_bgsub,
                               history, threshold, MIN_PERCENT_THRESH,
                               MAX_PERCENT_THRESH):
Let’s examine the arguments for this function:
- video_path: the path to the video file.
- output_dir_path: the output directory path where the snapshots and the PDF will be stored.
- type_bgsub: the type of background subtraction algorithm we want to opt for, e.g., GMG or KNN.
- history: the frame history over which to model the background.
- threshold: the required threshold for the corresponding background subtraction algorithm.
  - For GMG, it is the decisionThreshold, the value above which a pixel is marked as foreground.
  - For KNN, it is the dist2Threshold, the squared distance between the pixel and a sample that decides whether a pixel is close to that sample.
- MIN_PERCENT_THRESH: threshold to check whether there is motion (animations, in our case) across subsequent frames.
- MAX_PERCENT_THRESH: threshold to determine whether the motion across frames has stopped.
We now create the Background subtractor class object as follows:
if type_bgsub == 'GMG':
    bg_sub = cv2.bgsegm.createBackgroundSubtractorGMG(
        initializationFrames=history,
        decisionThreshold=threshold)
elif type_bgsub == 'KNN':
    bg_sub = cv2.createBackgroundSubtractorKNN(
        history=history,
        dist2Threshold=threshold,
        detectShadows=False)
After subsequent experiments, we have used the following frame history and thresholds:
- GMG: history = 15, decisionThreshold = 0.75
- KNN: history = 15, dist2Threshold = 100
To reduce the computational cost, we have disabled the detectShadows flag.
Next, we retrieve the video frames as we did for frame differencing. We create a copy of the original frame and store it in orig_frame.
Applying background subtraction to a frame is computationally intensive. Hence, we resize the frame to a lower dimension while keeping the aspect ratio intact. The resizing is performed in the resize_image_frame utility function, where we pass the image frame and the target frame width. This is shown in the following lines of code; a sketch of the utility itself follows the snippet.
# Create a copy of the original frame.
orig_frame = frame.copy()
# Resize the frame keeping aspect ratio.
frame = resize_image_frame(frame, resize_width=640)
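The resize_image_frame utility itself is not listed in this article; a minimal sketch consistent with its description (resize to a given width while keeping the aspect ratio intact) could look like this:
def resize_image_frame(frame, resize_width):
    # Scale the height by the same factor as the width so that
    # the aspect ratio stays intact.
    ht, wd = frame.shape[:2]
    new_height = int(ht * resize_width / wd)
    frame = cv2.resize(frame, (resize_width, new_height),
                       interpolation=cv2.INTER_AREA)
    return frame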
The next step is to apply background estimation to the resized frame. After that, we calculate the percentage of non-zero pixels (i.e., the percentage of foreground mask) across the binary mask obtained after background subtraction.
# Apply each frame through the background subtractor.
fg_mask = bg_sub.apply(frame)
# Compute the percentage of the foreground mask.
p_non_zero = (cv2.countNonZero(fg_mask) / (1.0 * fg_mask.size)) * 100
The videos below depict the foreground mask and its percentage generated while applying background subtraction, for both GMG and KNN:
GMG Background Subtraction
KNN Background Subtraction
If this percentage is less than the maximum threshold percentage, it indicates that motion (or animation) between successive frames has subsided, and we are ready to capture the frame. Otherwise, the scene is still in motion, and we must wait until it settles down.
The following lines of code indicate this.
if p_non_zero < MAX_PERCENT_THRESH and not capture_frame:
    capture_frame = True

    screenshots_count += 1

    png_filename = f"{screenshots_count:03}.png"
    out_file_path = os.path.join(output_dir_path, png_filename)
    print(f"Saving file at: {out_file_path}")
    cv2.imwrite(out_file_path, orig_frame)

elif capture_frame and p_non_zero >= MIN_PERCENT_THRESH:
    capture_frame = False
Once we finish retrieving the frames, we can display the statistics showcasing the total time taken and the number of screenshots captured during the process.
end_time = time.time()
print('***'*10,'\n')
print("Statistics:")
print('---'*5)
print(f'Total Time taken: {round(end_time-start, 3)} secs')
print(f'Total Screenshots captured: {screenshots_count}')
print('---'*10,'\n')
As a final step, we can convert all the captured snapshots into a single PDF.
As mentioned earlier, we can also convert the snapshots into a PowerPoint presentation (ppt), as sketched below.
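The PowerPoint conversion is not part of the code discussed here, but one possible way to do it is with the python-pptx library; a rough sketch, assuming the captured images live in output_dir_path:
import glob
import os
from pptx import Presentation

def convert_slides_to_ppt(output_dir_path, ppt_path):
    prs = Presentation()
    blank_layout = prs.slide_layouts[6]  # blank slide layout

    for img_path in sorted(glob.glob(os.path.join(output_dir_path, '*.png'))):
        slide = prs.slides.add_slide(blank_layout)
        # Stretch each captured slide image to fill the whole slide.
        slide.shapes.add_picture(img_path, 0, 0,
                                 width=prs.slide_width,
                                 height=prs.slide_height)

    prs.save(ppt_path)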
Post-Processing Step for the Video to Slides Converter Application
Now that we have obtained slide images using background modeling, a significant concern remains: many of the generated screenshots are broadly similar. Our next task, therefore, is to eliminate such near-duplicate images.
We apply a popular technique called image hashing to achieve this task. It should be borne in mind that we cannot apply the more popular cryptographic hashing algorithms such as MD5 or SHA-1.
Recall that our post-processing objective is to identify similar images. Cryptographic hashing techniques cause minor differences in the pixel values of similar images to be completely different. Image hashing techniques can mitigate this problem by yielding similar (or identical) hashes for similar images.
There are several approaches to image hashing, such as average hashing, perceptual hashing, difference hashing, wavelet hashing, etc. For our application, we shall use difference hashing to map similar images because of its extremely fast computation and greater robustness compared to algorithms such as average and perceptual hashing.
Enough of the theory!
Let’s get to the interesting part: the code for image hashing.
As mentioned earlier, we shall use imagehash to perform image hashing. You can simply install it using pip.
pip install imagehash
Using imagehash is quite simple. You simply specify the hashing algorithm of your choice using:
imagehash.hashing_algo(PIL_Image, hash_size)
Since we have used difference hashing, the function call should look like this:
imagehash.dhash(PIL_Image, hash_size)
The hash_size parameter determines the size of the output hash: a hash_size of 8, for example, results in a 64-bit (8*8) hash value represented in hexadecimal format.
Increasing the hash size allows the algorithm to store more detail in its hash. For our application, we have kept the hash size at 12.
One more point to consider is that the input to the corresponding hashing algorithm is a PIL Image and not a numpy array.
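Since OpenCV reads frames as BGR numpy arrays, an image has to be wrapped in a PIL Image before hashing. For example (the snapshot path is a placeholder):
import cv2
import imagehash
from PIL import Image

# Placeholder path to one of the captured snapshots.
frame = cv2.imread('output_results/sample/GMG/001.png')

# Convert BGR -> RGB and wrap in a PIL Image before hashing.
pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
print(imagehash.dhash(pil_image, hash_size=12))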
Let us take a look at the find_similar_images function in the post_process.py module.
The function signature is shown below.
def find_similar_images(base_dir, hash_size=8):
It takes in the image set directory path and the hash size.
We first sort the filenames and then create a dictionary, hash_dict, to hold the unique hash values across the image set. We also initialize a list, duplicates, that maintains the duplicate (similar) image files; in the end, we delete all the files it contains. We also keep a count of the number of duplicate images in the directory in num_duplicates.
hash_dict = {}
duplicates = []
num_duplicates = 0
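The snapshots_files list referenced below simply holds the sorted snapshot filenames; it can be built, for example, as:
# Gather and sort the snapshot filenames in the image set directory.
snapshots_files = sorted(os.listdir(base_dir))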
Next, we iterate through the files and update the dictionary of unique hashes, hash_dict, and the list of similar files, duplicates.
for file in snapshots_files:
    read_file = Image.open(os.path.join(base_dir, file))
    comp_hash = str(imagehash.dhash(read_file, hash_size=hash_size))

    if comp_hash not in hash_dict:
        hash_dict[comp_hash] = file
    else:
        print('Duplicate file: ', file)
        duplicates.append(file)
        num_duplicates += 1
Our final step is to remove all the duplicate files contained in the duplicates list.
for dup_file in duplicates:
    file_path = os.path.join(base_dir, dup_file)

    if os.path.exists(file_path):
        os.remove(file_path)
    else:
        print('Filepath: ', file_path, 'does not exist.')
This completes our final post-processing step.
Command-Line Options
As mentioned earlier, video_2_slides.py contains the main execution script. It has the following arguments for running the script:
parser = argparse.ArgumentParser(description="This script is used to convert video frames into slide PDFs.")

parser.add_argument("-v", "--video_file_path", help="Path to the video file", type=str)
parser.add_argument("-o", "--out_dir", default='output_results',
                    help="Path to the output directory", type=str)
parser.add_argument("--type", choices=['Frame_Diff', 'GMG', 'KNN'], default='GMG',
                    help="type of background subtraction to be used", type=str)
parser.add_argument("--no_post_process", action="store_true", default=False,
                    help="flag to apply post processing or not")
parser.add_argument("--convert_to_pdf", action="store_true", default=False,
                    help="flag to convert the entire image set to pdf or not")

args = parser.parse_args()
The arguments are as follows:
- video_file_path: the path to the input video file.
- out_dir: the path to the output directory where the results will be stored.
- type: the type of background subtraction method to be applied. It can be one of Frame_Diff, GMG (default), or KNN.
- no_post_process: flag to skip the post-processing step (removal of similar images). If not specified, post-processing is applied by default.
- convert_to_pdf: flag to specify whether to convert the image set into a single PDF file.
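For example, a typical run (with a placeholder video file name) would look like this:
python video_2_slides.py -v sample_video.mp4 -o output_results --type GMG --convert_to_pdf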
Comparisons Across GMG and KNN Background Estimation
We ran the GMG and KNN methods across several slide videos and found that the KNN background estimation approach yielded almost four times the FPS of its GMG counterpart.
However, in some video samples, we found that the KNN approach missed out on a few frames.
Do take a look at the video inferences!
GMG Background Estimation
KNN Background Estimation:
We did not opt for the MOG2 (Mixture of Gaussians v2) background estimation technique because it captured most of the video frames during the transition phase and tended to miss the important frames. With more careful settings of the frame history and the variance threshold, however, we may obtain better results.
Scope for Improvements
We have seen how the various background estimation approaches produce decent results for scenes with significant animations. Similarly, scenes containing mostly static frames can be handled by a naive frame differencing approach.
However, for video sequences containing majorly static frames but simultaneously having facial camera movements embedded within the video, none of the above approaches yield good results as the facial movements are predicted to be animations. This leads to a significant amount of redundant captured frames. There was little improvement in the results, even with frame differencing and applying the post-processing step.
Have a look at the results.
It is also observed that even after the post-processing step, there have been instances where redundant slide images remain. Instead of applying image hashing, we can opt for better techniques, such as cosine similarity, to determine similar images.
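As an illustration of the cosine-similarity idea (a rough sketch, not part of the code discussed in this article), two slide images can be compared by flattening them into vectors and measuring the angle between them:
import cv2
import numpy as np

def cosine_similarity(img_path_1, img_path_2, size=(256, 256)):
    # Read both images in grayscale, resize to a common shape,
    # and flatten them into 1-D vectors.
    a = cv2.resize(cv2.imread(img_path_1, cv2.IMREAD_GRAYSCALE), size).flatten().astype(np.float32)
    b = cv2.resize(cv2.imread(img_path_2, cv2.IMREAD_GRAYSCALE), size).flatten().astype(np.float32)

    # A value close to 1.0 indicates very similar images.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))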
The redundant captures caused by facial movements could be mitigated by increasing the frame buffer history, but neither of the approaches guarantees good results. This problem could also be addressed by a deep-learning based approach, where we capture facial features and pass them to a non-parametric supervised classifier such as K-Nearest Neighbors to obtain unique samples.
The application yields almost perfect results for lectures having voice-over presentations. However, for lectures having interactive sessions, we might have to look for techniques other than the ones discussed throughout this article.
Summary
The objective of this blog post was to build a simple Python application to convert voice-over video lectures into slides using the following approaches:
- The naive frame differencing approach can produce decent results for video lectures with majorly static frames.
- Probabilistic approaches such as GMG and KNN for modeling background pixels. We can apply background modeling for lectures containing significant animations.
Besides, we have also discussed that for videos with static slides containing facial movements, both approaches yield redundant slides. However, we can still obtain pretty decent results in most video lectures, even with these simple techniques.
We hope this article gives you enough guidance and intuition to build a simple yet effective application to convert video lectures into slide pdf or PowerPoint ppts.
References
Here are a few additional resources on Background Subtraction. Do give them a read!
- How to Use Background Subtraction Methods
- Visual tracking of human visitors under variable-lighting conditions for a responsive audio art installation
- Difference Image Hashing
- Background Subtraction Library
- OpenCV Background Subtraction Methods