
January 19, 2024

Automatically Narrating Videos with OpenAI Vision and Whisper

Synchronizing Narration with Video Length, Tone, and Frame Content for Contextually Rich and Dynamic Storytelling

This guide presents an end-to-end solution for automatically narrating videos with AI, leveraging the cutting-edge capabilities of OpenAI’s GPT-4 Vision and Text-to-Speech technologies. OpenAI’s GPT-4 Vision transcends traditional language models by understanding and interpreting visual content, thereby unlocking a myriad of possibilities.

We will go through a Python-based method that generates narrations and aligns them with both the timeline and visual elements of videos. Furthermore, GPT-4 Vision’s ability to comprehend images, charts, and graphs, and even to generate creative content, marks a significant step forward in how we create and consume content.

1. Understanding the Technology

GPT-4 Vision notably excels in visual question answering (VQA). It can comprehend the context and relationships within an image, even deciphering text and code. For example, in tests, GPT-4 Vision successfully described why a particular image was humorous by referencing its components and their interconnections. 

GPT-4 with Vision processes images alongside text by accepting either image URLs or base64-encoded images within the request body. Upon receiving an image, GPT-4 with Vision analyzes it and produces text-based responses or descriptions.

1.1 Key Capabilities

The key capabilities unlocked by adding vision to GPT-4 include:

  • Image Captioning: Generating natural language descriptions of image contents.
  • Visual Question Answering: Answering text-based questions about images.
  • Multimodal Reasoning: Making inferences using both text and visual inputs.
  • OCR for Text Extraction: Recognizing and extracting text from images.

To use GPT-4 with Vision in Python, you can make an HTTP request to the OpenAI API using the requests module. The payload of the request should include the model name (gpt-4-vision-preview), a user message containing either an image_url or a base64-encoded image, and a max_tokens limit. The model then processes the image and returns a text interpretation of its primary content.
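
For illustration, a minimal sketch of such a request might look like the following (the file name, prompt text, and max_tokens value are placeholders):

# Minimal example request to GPT-4 with Vision using the requests module
import base64
import requests

api_key = "YOUR_OPENAI_API_KEY"  # placeholder: use your own key

# Encode a local image as base64 (a public image URL works as well)
with open("sample_frame.jpg", "rb") as f:  # placeholder file name
    b64_image = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the primary content of this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"},
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json=payload,
)
print(response.json()["choices"][0]["message"]["content"])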

While GPT-4 with Vision offers a range of capabilities, it’s not optimized for tasks like precise object localization or counting, and its performance may vary with rotated, blurry, or obscured images. Therefore, it’s crucial to consider these limitations when applying the technology to specific use cases.

Furthermore, OpenAI recommends that image inputs should be under 1 MB and no larger than 1024×1024 pixels for optimal performance, though other sizes are supported. Additionally, the detail parameter in the request can control the fidelity of image interpretation, with options for lower or higher resolutions based on the processing needs.
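
If needed, the detail option is attached per image inside the image_url object. A brief sketch, reusing the base64 string from the previous snippet:

# Request a lower-fidelity (faster, cheaper) interpretation of the image
image_part = {
    "type": "image_url",
    "image_url": {
        "url": f"data:image/jpeg;base64,{b64_image}",  # b64_image from the snippet above
        "detail": "low",  # "low", "high", or "auto" (the default)
    },
}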


Figure 1: Conceptual Solution Architecture - Narrating Videos using OpenCV, GPT-4 Vision, and OpenAI's Whisper.

2. Setting Up the Environment

  1. Python Installation: Ensure that your system has Python installed. If not, download and install it from the official Python website.
  2. Library Installation: Install the OpenAI Python library, which is essential for interacting with the OpenAI API. You can install it with pip inside a Jupyter Notebook (!pip install openai) or from the command line (pip install openai); a quick sanity check follows this list.
  3. API Key: Sign up for an OpenAI account and obtain your API key. This key is required to authenticate your requests to OpenAI’s servers.
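
Once the library is installed, a quick sanity check is to create a client with your key. The key below is a placeholder; in practice, load it from an environment variable or a secrets manager rather than hard-coding it:

# Verify the installation and authenticate with your API key
import os
from openai import OpenAI

os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder: your own key
client = OpenAI()  # the client reads OPENAI_API_KEY from the environment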

However, the cost of using GPT-4 is higher than that of its predecessors. Understanding the pricing structure is important for managing your project budget effectively, especially with long videos.

The API pricing varies based on the model and the number of tokens used. Detailed pricing information can be found on OpenAI’s pricing page. Refer further to OpenAI’s official documentation for detailed guidance.

3. Python Implementation

This section provides a detailed walkthrough of the Python code, which is designed to integrate AI-driven narration into videos using OpenAI’s GPT-4 Vision and Text-to-Speech technologies.

Main Process:

  1. Initialization: Sets up the OpenAI client with the API key.
  2. Video Processing: Reads the video file and extracts frames at set intervals.
  3. Narration Generation: Uses GPT-4 Vision to create a script based on the frames.
  4. Text-to-Speech Conversion: Turns the script into an audio file.
  5. Saving Output: Saves the audio narration for later use or integration with the video.

3.1 Python Functions

In this section, we break down the Python functions used in our video narration project. Each function is carefully annotated with commentary to ensure readers understand its specific role and purpose. 

From initializing the OpenAI client to extracting video frames, creating AI-generated narration, and converting it to speech, these functions collectively orchestrate the process of adding AI-driven audio narration to video content. 

# Import necessary libraries
from IPython.display import display, Image, Audio
import cv2
import base64
import time
from openai import OpenAI
import os
import requests

# Initialize OpenAI client
def initialize_openai_api(api_key):
    """
    Initializes the OpenAI API client with the provided API key.
    
    Parameters:
    api_key (str): Your OpenAI API key.
    
    Returns:
    OpenAI: An instance of the OpenAI client.
    """
    os.environ["OPENAI_API_KEY"] = api_key
    return OpenAI()

# Function to read and process the video
def read_video(video_path, frame_interval):
    """
    Reads a video file and extracts frames at specified intervals.
    
    Parameters:
    video_path (str): The path to the video file.
    frame_interval (int): The interval at which frames are extracted, e.g. 60 means every 60th frame is extracted.
    
    Returns:
    tuple: A tuple containing a list of base64-encoded frames and the duration of the video. 
    """
    video = cv2.VideoCapture(video_path)
    frame_count = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_rate = video.get(cv2.CAP_PROP_FPS)
    duration = frame_count / frame_rate
    base64Frames = []

    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

    video.release()
    return base64Frames[::frame_interval], duration

# Function to display frames (Optional)
def display_frames(frames):
    """
    Displays each frame from a list of base64-encoded frames every 0.025 seconds.
    
    Parameters:
    frames (list): A list of base64-encoded frames.
    """
    display_handle = display(None, display_id=True)
    for img in frames:
        display_handle.update(Image(data=base64.b64decode(img.encode("utf-8"))))
        time.sleep(0.025)

# Function to generate narration script using GPT with Vision
def generate_gpt_vision_prompt(frames, prompt_text, max_tokens):
    """
    Generates a narration script using GPT with Vision.
    
    Parameters:
    frames (list): A list of base64-encoded frames.
    prompt_text (str): The prompt text to guide the AI model.
    max_tokens (int): The maximum number of tokens in the generated text.
    
    Returns:
    str: The generated narration script.
    """
    PROMPT_MESSAGES = [
        {
            "role": "user",
            "content": [
                prompt_text,
                *map(lambda x: {"image": x, "resize": 768}, frames),
            ],
        },
    ]
    params = {
        "model": "gpt-4-vision-preview",
        "messages": PROMPT_MESSAGES,
        "max_tokens": max_tokens,
    }
    client = OpenAI()
    result = client.chat.completions.create(**params)
    return result.choices[0].message.content

# Function to convert text to speech
def text_to_speech(text, voice, speaking_rate):
    """
    Converts the provided text to speech using OpenAI's Text-to-Speech API.
    
    Parameters:
    text (str): The text to be converted into speech.
    voice (str): The voice model to be used.
    speaking_rate (float): The rate of speech.
    
    Returns:
    bytes: The audio data of the spoken text.
    """
    response = requests.post(
        "https://api.openai.com/v1/audio/speech",
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
        json={
            "model": "tts-1-1106",
            "input": text,
            "voice": voice,
            "speaking_rate": speaking_rate,
        },
    )
    audio = b""
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        audio += chunk
    return audio

3.2 Executing the Functions

The main script combines all the above functions to create a workflow. It initializes the OpenAI client, processes the video to extract frames, uses GPT-4 Vision to generate a narration script based on these frames, converts this script into audio, and finally saves the audio file. Importantly, it ensures that the narration aligns with the duration and sequence of the video frames.
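
A condensed sketch of that main script, assuming illustrative file names, prompt wording, and parameter values, could look like this:

# Sketch of the main workflow, wiring together the functions defined above
if __name__ == "__main__":
    client = initialize_openai_api("YOUR_OPENAI_API_KEY")  # placeholder key

    # 1. Extract every 60th frame and measure the video's duration
    frames, duration = read_video("input_video.mp4", frame_interval=60)

    # 2. Generate a narration script sized to the video's length
    prompt_text = (
        "These are frames from a video. Create a short voiceover script that "
        f"can be narrated in roughly {int(duration)} seconds."
    )
    script = generate_gpt_vision_prompt(frames, prompt_text, max_tokens=500)

    # 3. Convert the script to speech and save the narration audio
    audio = text_to_speech(script, voice="onyx", speaking_rate=1.0)
    with open("narration.mp3", "wb") as f:
        f.write(audio)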
