
January 19, 2024

Automatically Narrating Videos with OpenAI Vision and Whisper

Synchronizing Narration with Video Length, Tone, and Frame Content for Contextually Rich and Dynamic Storytelling

This guide presents an end-to-end solution for automatically narrating videos with AI, leveraging the cutting-edge capabilities of OpenAI’s GPT-4 Vision and Text-to-Speech technologies. OpenAI’s GPT-4 Vision transcends traditional language models by understanding and interpreting visual content, thereby unlocking a myriad of possibilities.

We will go through a Python-based method that generates narrations and aligns them with both the timeline and visual elements of videos. Furthermore, GPT-4 Vision’s ability to comprehend images, charts, and graphs, and even to generate creative content, marks a significant step forward in how we create and consume content.

1. Understanding the Technology

GPT-4 Vision notably excels in visual question answering (VQA). It can comprehend the context and relationships within an image, even deciphering text and code. For example, in tests, GPT-4 Vision successfully described why a particular image was humorous by referencing its components and their interconnections. 

GPT-4 with Vision processes images alongside text by accepting either image URLs or base64-encoded images within the request body. Upon receiving an image, GPT-4 with Vision analyzes it and produces text-based responses or descriptions.

1.1 Key Capabilities

The key capabilities unlocked by adding vision to GPT-4 include:

  • Image Captioning: Generating natural language descriptions of image contents.
  • Visual Question Answering: Answering text-based questions about images.
  • Multimodal Reasoning: Making inferences using both text and visual inputs.
  • OCR for Text Extraction: Recognizing and extracting text from images.

To use GPT-4 with Vision in Python, you can make an HTTP request to the OpenAI API using the requests module. The payload of the request should include the model name (gpt-4-vision-preview), a user message containing either an image_url or a base64-encoded image, and a max_tokens limit. The model then processes the image and returns a text interpretation of its primary content.
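
For illustration, a minimal sketch of such a request might look like the following (the file name, prompt text, and max_tokens value are placeholders):

# Minimal example request to GPT-4 with Vision using the requests module
import base64
import requests

api_key = "YOUR_OPENAI_API_KEY"  # placeholder: use your own key

# Encode a local image as base64 (a public image URL works as well)
with open("sample_frame.jpg", "rb") as f:  # placeholder file name
    b64_image = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the primary content of this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"},
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json=payload,
)
print(response.json()["choices"][0]["message"]["content"])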

While GPT-4 with Vision offers a range of capabilities, it’s not optimized for tasks like precise object localization or counting, and its performance may vary with rotated, blurry, or obscured images. Therefore, it’s crucial to consider these limitations when applying the technology to specific use cases.

Furthermore, OpenAI recommends that image inputs should be under 1 MB and no larger than 1024×1024 pixels for optimal performance, though other sizes are supported. Additionally, the detail parameter in the request can control the fidelity of image interpretation, with options for lower or higher resolutions based on the processing needs.
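
If needed, the detail option is attached per image inside the image_url object. A brief sketch, reusing the base64 string from the previous snippet:

# Request a lower-fidelity (faster, cheaper) interpretation of the image
image_part = {
    "type": "image_url",
    "image_url": {
        "url": f"data:image/jpeg;base64,{b64_image}",  # b64_image from the snippet above
        "detail": "low",  # "low", "high", or "auto" (the default)
    },
}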


Figure 1: Conceptual Solution Architecture - Narrating Videos using OpenCV, GPT-4 Vision, and OpenAI's Whisper.

2. Setting Up the Environment

  1. Python Installation: Ensure that your system has Python installed. If not, download and install it from the official Python website.
  2. Library Installation: Install the OpenAI Python library, which is essential for interacting with the OpenAI API. You can install it with pip inside a Jupyter Notebook (!pip install openai) or from the command line (pip install openai); a quick sanity check follows this list.
  3. API Key: Sign up for an OpenAI account and obtain your API key. This key is required to authenticate your requests to OpenAI’s servers.
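
Once the library is installed, a quick sanity check is to create a client with your key. The key below is a placeholder; in practice, load it from an environment variable or a secrets manager rather than hard-coding it:

# Verify the installation and authenticate with your API key
import os
from openai import OpenAI

os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder: your own key
client = OpenAI()  # the client reads OPENAI_API_KEY from the environment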

However, the cost of using GPT-4 is higher than that of its predecessors. Understanding the pricing structure is important for managing your project budget effectively, especially with long videos.

The API pricing varies based on the model and the number of tokens used. Detailed pricing information can be found on OpenAI’s pricing page. Refer further to OpenAI’s official documentation for detailed guidance.

3. Python Implementation

This section provides a detailed walkthrough of the Python code, which is designed to integrate AI-driven narration into videos using OpenAI’s GPT-4 Vision and Text-to-Speech technologies.

Main Process:

  1. Initialization: Sets up the OpenAI client with the API key.
  2. Video Processing: Reads the video file and extracts frames at set intervals.
  3. Narration Generation: Uses GPT-4 Vision to create a script based on the frames.
  4. Text-to-Speech Conversion: Turns the script into an audio file.
  5. Saving Output: Saves the audio narration for later use or integration with the video.

3.1 Python Functions

In this section, we break down the Python functions used in our video narration project. Each function is carefully annotated with commentary to ensure readers understand its specific role and purpose. 

From initializing the OpenAI client to extracting video frames, creating AI-generated narration, and converting it to speech, these functions collectively orchestrate the process of adding AI-driven audio narration to video content. 

# Import necessary libraries
from IPython.display import display, Image, Audio
import cv2
import base64
import time
from openai import OpenAI
import os
import requests

# Initialize OpenAI client
def initialize_openai_api(api_key):
    """
    Initializes the OpenAI API client with the provided API key.
    
    Parameters:
    api_key (str): Your OpenAI API key.
    
    Returns:
    OpenAI: An instance of the OpenAI client.
    """
    os.environ["OPENAI_API_KEY"] = api_key
    return OpenAI()

# Function to read and process the video
def read_video(video_path, frame_interval):
    """
    Reads a video file and extracts frames at specified intervals.
    
    Parameters:
    video_path (str): The path to the video file.
    frame_interval (int): The interval at which frames are extracted, e.g. 60 means every 60th frame is extracted.
    
    Returns:
    tuple: A tuple containing a list of base64-encoded frames and the duration of the video. 
    """
    video = cv2.VideoCapture(video_path)
    frame_count = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_rate = video.get(cv2.CAP_PROP_FPS)
    duration = frame_count / frame_rate
    base64Frames = []

    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

    video.release()
    return base64Frames[::frame_interval], duration

# Function to display frames (Optional)
def display_frames(frames):
    """
    Displays each frame from a list of base64-encoded frames every 0.025 seconds.
    
    Parameters:
    frames (list): A list of base64-encoded frames.
    """
    display_handle = display(None, display_id=True)
    for img in frames:
        display_handle.update(Image(data=base64.b64decode(img.encode("utf-8"))))
        time.sleep(0.025)

# Function to generate narration script using GPT with Vision
def generate_gpt_vision_prompt(frames, prompt_text, max_tokens):
    """
    Generates a narration script using GPT with Vision.
    
    Parameters:
    frames (list): A list of base64-encoded frames.
    prompt_text (str): The prompt text to guide the AI model.
    max_tokens (int): The maximum number of tokens in the generated text.
    
    Returns:
    str: The generated narration script.
    """
    PROMPT_MESSAGES = [
        {
            "role": "user",
            "content": [
                prompt_text,
                *map(lambda x: {"image": x, "resize": 768}, frames),
            ],
        },
    ]
    params = {
        "model": "gpt-4-vision-preview",
        "messages": PROMPT_MESSAGES,
        "max_tokens": max_tokens,
    }
    client = OpenAI()
    result = client.chat.completions.create(**params)
    return result.choices[0].message.content

# Function to convert text to speech
def text_to_speech(text, voice, speaking_rate):
    """
    Converts the provided text to speech using OpenAI's Text-to-Speech API.
    
    Parameters:
    text (str): The text to be converted into speech.
    voice (str): The voice model to be used.
    speaking_rate (float): The rate of speech.
    
    Returns:
    bytes: The audio data of the spoken text.
    """
    response = requests.post(
        "https://api.openai.com/v1/audio/speech",
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
        json={
            "model": "tts-1-1106",
            "input": text,
            "voice": voice,
            "speaking_rate": speaking_rate,
        },
    )
    audio = b""
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        audio += chunk
    return audio

3.2 Executing the Functions

The main script combines all the above functions to create a workflow. It initializes the OpenAI client, processes the video to extract frames, uses GPT-4 Vision to generate a narration script based on these frames, converts this script into audio, and finally saves the audio file. Importantly, it ensures that the narration aligns with the duration and sequence of the video frames.
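
A condensed sketch of that main script, assuming illustrative file names, prompt wording, and parameter values, could look like this:

# Sketch of the main workflow, wiring together the functions defined above
if __name__ == "__main__":
    client = initialize_openai_api("YOUR_OPENAI_API_KEY")  # placeholder key

    # 1. Extract every 60th frame and measure the video's duration
    frames, duration = read_video("input_video.mp4", frame_interval=60)

    # 2. Generate a narration script sized to the video's length
    prompt_text = (
        "These are frames from a video. Create a short voiceover script that "
        f"can be narrated in roughly {int(duration)} seconds."
    )
    script = generate_gpt_vision_prompt(frames, prompt_text, max_tokens=500)

    # 3. Convert the script to speech and save the narration audio
    audio = text_to_speech(script, voice="onyx", speaking_rate=1.0)
    with open("narration.mp3", "wb") as f:
        f.write(audio)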
