
Create your own GenAI video pipeline

Thierry Moreau


We’ve all been blown away by the minute-long videos generated by OpenAI’s latest Sora models. At the time of this writing, the technology has been showcased but is not generally available yet. I’m personally very curious to know how much GPU compute time it will take to produce just one minute of Sora video; my guess is a lot.

If you can’t wait to generate your own videos from a simple text prompt, there’s an alternative path available to developers today. This method relies on readily available APIs, all of which are built on open source generative models from Mistral AI, Nous Research, and Stability AI.

Let’s find out how close we can get to Sora-level results, all for under $3 in compute costs per minute-long video.

Let’s first take a look at the overall GenAI video generation experience we’re trying to deliver. I settled on this simple idea: name any dish — it can be a real dish or something completely made up like a “skittles omelette” — and the pipeline generates a video ready for posting on TikTok that shows you how to cook the dish.

Here’s a preview of what comes out after you ask the tool to generate a recipe video for a “dorritos consomme” (which, by the way, is not a made-up dish but a real recipe):

To make this work, however, I’m not going to rely on a single model as with OpenAI’s Sora. Instead, I’m building a pipeline specialized for the task of “recipe video generation” out of a handful of GenAI models. I’ll follow the GenAI model chain (which I also like to refer to as a model cocktail) below.

Let’s take a look at the whole pipeline of models here:

  • Nous Hermes 2 Mixtral 8x7B LLM (a community fine-tune by Nous Research) to generate a recipe from the name of a dish.
  • Mixtral 8x7B Instruct LLM in JSON mode to convert the recipe into a structured JSON format that breaks the recipe down into the following fields: recipe title, prep time, cooking time, difficulty, ingredients list, and instruction steps.
  • SDXL to generate a frame for the finished dish, each one of the ingredients, and each of the recipe steps.
  • Stable Video Diffusion 1.1 to animate each frame into a short 4-second video.

Finally, once all of the videos are generated, I stitch the clips together using MoviePy, then add subtitles and a human-generated soundtrack.

OctoAI

You can deploy all of the models above yourself on powerful enough hardware, but chances are it’s going to be a bit of a cumbersome task. You can also obtain these models from various providers that host LLMs, text-to-image, or image-animation models, but the simplest path to getting started is to get all of your models from OctoAI to power the code I’ve prepared.

  • To use OctoAI, you just need to go to https://octoai.cloud/ and sign in using your Google or GitHub account.
  • Next you’ll need to produce an OctoAI API token by following these instructions.
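Once you have the token, a simple pattern is to expose it to your code as an environment variable. The snippet below is a minimal sketch, and the variable name OCTOAI_API_TOKEN is simply the one the rest of the code in this post assumes:

import os

# Read the OctoAI API token from the environment
# (set it beforehand, e.g. with: export OCTOAI_API_TOKEN=...)
OCTOAI_API_TOKEN = os.environ["OCTOAI_API_TOKEN"]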

When signing up, you’ll receive $10 worth of free credits (they expire within about one month). That is the equivalent of:

  • Over 500,000 words with the largest Llama 2 70B model, and over a million words with the new Mixtral 8x7B model
  • 1,000 SDXL default images
  • 66 SVD videos

Jupyter Notebook

Next you’ll need a place to run your code, and the simplest way to get started is to launch a Jupyter notebook on your own machine or on Google Colab (link to notebook).

You just need to install ImageMagick and the following pip packages:

  • octoai-sdk
  • langchain
  • openai
  • pillow
  • ffmpeg
  • moviepy
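For example, here’s a minimal sketch for installing the pip packages from within the notebook (ImageMagick itself has to be installed separately with your system package manager; versions aren’t pinned here):

import sys
import subprocess

# Install the required pip packages into the notebook's Python environment
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "octoai-sdk", "langchain", "openai", "pillow", "ffmpeg", "moviepy",
])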

First, I’m going to show how you can use a great community model, Nous Hermes 2 Mixtral-8x7B, to generate a full recipe from a dish name. Developers typically have favorite libraries they use to build LLM apps — in this section we’ll use the LangChain Python SDK, which is extremely popular.

Nous Hermes 2 Mixtral is a community fine-tune of Mixtral-8x7B by Nous Research that has been trained on over 1 million entries generated with GPT-4. We’re using the SFT + DPO version of Nous Hermes 2 Mixtral.

The key, if you’re a LangChain developer who wants to use this cool Nous Hermes 2 model, is to rely on the OctoAIEndpoint LLM by adding the following line to your Python script:

from langchain.llms.octoai_endpoint import OctoAIEndpoint

Then you can instantiate your OctoAIEndpoint LLM by passing in, under the model_kwargs dictionary, which model you wish to use (there is a rather wide selection you can consult here) and what the maximum number of tokens should be set to.

Next you need to define your prompt template. The key here is to provide enough rules to guide the LLM into generating a recipe with just the right amount of information and detail. This will make the text generated by the LLM usable in the next generation steps (image generation, image animation etc.).

Finally you’ll need to instantiate an LLM chain by passing in the LLM and the prompt template you just created.

This chain is now ready to be invoked by passing in the user input, namely the name of the dish to generate a recipe for. Let’s invoke the chain and see what recipe our LLM comes up with.

Let’s take a look at the code.

from langchain.llms.octoai_endpoint import OctoAIEndpoint
from langchain import PromptTemplate, LLMChain

llm = OctoAIEndpoint(
    endpoint_url="https://text.octoai.run/v1/chat/completions",
    model_kwargs={
        "model": "nous-hermes-2-mixtral-8x7b-dpo",
        "messages": [
            {
                "role": "system",
                "content": "Below is an instruction that describes a task. Write a response that appropriately completes the request.",
            }
        ],
        "stream": False,
        "max_tokens": 1024,
        "temperature": 0.01
    },
)

# Define a recipe template
template = """
You are a food recipe generator.

Given the name of a dish, generate a recipe that's easy to follow and leads to a delicious and creative dish.

Here are some rules to follow at all costs:
1. Provide a list of ingredients needed for the recipe.
2. Provide a list of instructions to follow the recipe.
3. Each instruction should be concise (1 sentence max) yet informative. It's preferred to provide more instruction steps with shorter instructions than fewer steps with longer instructions.
4. For the whole recipe, provide the amount of prep and cooking time, with a classification of the recipe difficulty from easy to hard.

Human: Generate a recipe for a dish called {human_input}
AI: """

prompt = PromptTemplate(template=template, input_variables=["human_input"])

# Set up the language model chain
llm_chain = LLMChain(prompt=prompt, llm=llm)

# Let's request user input for the recipe name
print("Provide a recipe name, e.g. baked alaska")
recipe_title = input()

# Invoke the LLM chain and print the response
response = llm_chain.predict(human_input=recipe_title)
print(response)

So, for instance, if I provide “dorritos consomme”, I’ll get the following recipe:

Sure, I'd be happy to help you create a unique dish named "Doritos Consomme". Here's the recipe:

**Ingredients:**
1. 2 cups of Doritos, any flavor
2. 1 small onion, chopped
3. 2 cloves of garlic, minced
4. 1 celery stalk, chopped
5. 1 carrot, chopped
6. 6 cups of vegetable broth
7. Salt and pepper to taste
8. Optional garnish: a handful of crushed Doritos and a sprig of fresh cilantro

**Instructions:**
1. In a large pot, sauté the Doritos, onion, garlic, celery, and carrot over medium heat until the Doritos are slightly toasted and the vegetables are softened.
2. Add the vegetable broth, bring to a boil, then reduce heat and let it simmer for 30 minutes.
3. Strain the mixture through a fine-mesh sieve into a large bowl, pressing on the solids to extract as much liquid as possible.
4. Season the consomme with salt and pepper to taste.
5. Serve the consomme hot, garnished with crushed Doritos and a sprig of fresh cilantro if desired.

**Prep Time:** 10 minutes
**Cooking Time:** 30 minutes
**Difficulty:** Medium (due to the straining process)

Enjoy your creative and delicious Doritos Consomme!

Now that I have a recipe, I want to generate media out of it (images, videos, captions). Unfortunately, the raw recipe is not as easily parseable as it seems. While everything is formatted with numbered lists and bullet points, that formatting can change from one LLM generation to the next. We also get those annoying helpful-assistant messages (“Sure…”) that make parsing a bit challenging too.

The solution is to get that recipe into a nice JSON object format — nothing is easier than a dictionary for processing the details of my recipe (ingredients, instructions, metadata such as cook time) in a structured way.

To do that, I’m going to define a Pydantic class from which to derive a JSON object. At a high level, the structure will look like this:

  • A dish_name field (string) – name of the dish
  • An ingredient_list field (List[Ingredient]) – which lists ingredients. Each Ingredient contains an ingredient field (string) which describes the ingredient and an illustration field (string) that describes a visual for that ingredient.
  • A recipe_steps field (List[RecipeStep]) – which lists the recipe steps. Each RecipeStep contains a step field (string) which describes the step and an illustration field (string) that describes a visual for that instruction step.
  • A prep_time field (int) – preparation time in minutes
  • A cook_time field (int) – cooking time in minutes
  • A difficulty field (string) – difficulty rating of the recipe

This time around, I assume the developer likes to use the OpenAI SDK (maybe they are users of GPT-4 and find it great but a bit too expensive for their wallet). Allow me to share a trick that will let you use models like the hugely popular Mixtral-8x7B by Mistral AI via the OpenAI API.

All you need is to override OpenAI’s base URL and API key:

client = openai.OpenAI(
    base_url="https://text.octoai.run/v1", api_key=OCTOAI_API_TOKEN
)

Next, when you create your chat completion, you just need to set the model to mixtral-8x7b-instruct. Oh, and the Recipe Pydantic class I sketched out above? We’ll pass it in as our response format constraint.

client.chat.completions.create(
    model="mixtral-8x7b-instruct",
    # Other arguments
    response_format={"type": "json_object", "schema": Recipe.model_json_schema()}
)

Let’s take a look at the code for formatting our recipe in JSON mode using the OpenAI SDK overridden to invoke Mixtral-8x7B.

import openai
import json
from pydantic import BaseModel, Field
from typing import List

client = openai.OpenAI(
    base_url="https://text.octoai.run/v1", api_key=OCTOAI_API_TOKEN
)

class Ingredient(BaseModel):
    """The object representing an ingredient"""

    ingredient: str = Field(description="Ingredient")
    illustration: str = Field(description="Text-based detailed visual description of the ingredient for a photograph or illustrator")

class RecipeStep(BaseModel):
    """The object representing a recipe step"""

    step: str = Field(description="Recipe step/instruction")
    illustration: str = Field(description="Text-based detailed visual description of the instruction for a photograph or illustrator")

class Recipe(BaseModel):
    """The format of the recipe answer."""
    dish_name: str = Field(description="Name of the dish")
    ingredient_list: List[Ingredient] = Field(description="List of the ingredients")
    recipe_steps: List[RecipeStep] = Field(description="List of the recipe steps")
    prep_time: int = Field(description="Recipe prep time in minutes")
    cook_time: int = Field(description="Recipe cooking time in minutes")
    difficulty: str = Field(description="Rating in difficulty, can be easy, medium, hard")

chat_completion = client.chat.completions.create(
    model="mixtral-8x7b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "{}".format(response)},
    ],
    temperature=0,
    response_format={"type": "json_object", "schema": Recipe.model_json_schema()},
    max_tokens=1024
)

formatted_response = chat_completion.choices[0].message.content
recipe_dict = json.loads(formatted_response)
print(json.dumps(recipe_dict, indent=2))

Running this code will produce the following JSON output:

{
  "ingredient_list": [
    {
      "ingredient": "2 cups of Doritos, any flavor",
      "illustration": "A pile of Doritos chips"
    },
    {
      "ingredient": "1 small onion, chopped",
      "illustration": "A chopped onion on a cutting board"
    },
    {
      "ingredient": "2 cloves of garlic, minced",
      "illustration": "Two cloves of garlic on a cutting board"
    },
    {
      "ingredient": "1 celery stalk, chopped",
      "illustration": "A chopped celery stalk"
    },
    {
      "ingredient": "1 carrot, chopped",
      "illustration": "A chopped carrot"
    },
    {
      "ingredient": "6 cups of vegetable broth",
      "illustration": "Six cups of vegetable broth in a pot"
    },
    {
      "ingredient": "Salt and pepper to taste",
      "illustration": "A salt and pepper shaker"
    },
    {
      "ingredient": "Optional garnish: a handful of crushed Doritos and a sprig of fresh cilantro",
      "illustration": "A handful of crushed Doritos and a sprig of fresh cilantro"
    }
  ],
  "recipe_steps": [
    {
      "step": "In a large pot, saut\u00e9 the Doritos, onion, garlic, celery, and carrot over medium heat until the Doritos are slightly toasted and the vegetables are softened.",
      "illustration": "A pot with Doritos, onion, garlic, celery, and carrot being saut\u00e9ed"
    },
    {
      "step": "Add the vegetable broth, bring to a boil, then reduce heat and let it simmer for 30 minutes.",
      "illustration": "A pot with vegetable broth added to the saut\u00e9ed ingredients, being brought to a boil, and then simmering"
    },
    {
      "step": "Strain the mixture through a fine-mesh sieve into a large bowl, pressing on the solids to extract as much liquid as possible.",
      "illustration": "A fine-mesh sieve with the mixture being strained into a large bowl, with pressure being applied to the solids"
    },
    {
      "step": "Season the consomm\u00e9 with salt and pepper to taste.",
      "illustration": "A bowl of consomm\u00e9 being seasoned with salt and pepper"
    },
    {
      "step": "Serve the consomm\u00e9 hot, garnished with crushed Doritos and a sprig of fresh cilantro if desired.",
      "illustration": "A bowl of hot consomm\u00e9 being garnished with crushed Doritos and a sprig of fresh cilantro"
    }
  ],
  "prep_time": 10,
  "cook_time": 30,
  "difficulty": "Medium"
}

Beauty! Now we’re ready to generate media.

Now that I have a nice JSON object to work with, it should be straightforward to generate an image for each one of the ingredients and each one of the recipe steps.

I’m basically using the JSON object as a storyboard for my recipe video.

To generate images, I’m going to rely on the SDXL model by Stability AI. To invoke it, this time I’m going to use the OctoAI Python SDK.

All you need here is to instantiate the OctoAI ImageGenerator with your OctoAI API token, then invoke the generate method for each set of images you want to produce. You’ll need to pass in the following arguments:

  • engine which selects what model to use – we use SDXL here
  • prompt which describes the image we want to generate
  • negative_prompt which provides image attributes/keywords that we absolutely don’t want to have in our final image
  • width, height which helps us specify a resolution and aspect ratio of the final image
  • sampler which selects the sampler used in every denoising step; you can read more about samplers here
  • steps which specifies the number of denoising steps to obtain the final image
  • cfg_scale which specifies the classifier-free guidance scale, i.e. how closely the generation adheres to the original prompt
  • num_images which specifies the number of images to generate at once
  • use_refiner which, when turned on, lets us use the SDXL refiner model to enhance the quality of the image
  • high_noise_frac which specifies the ratio of steps to perform with the base SDXL model vs. the refiner model
  • style_preset which specifies a style preset to apply to the negative and positive prompts; you can read more about them here

To read more about the API and what options are supported in OctoAI, head over to this link.

Note: Looking to use a specific SDXL checkpoint, LoRA or controlnet for your image generation needs? You can manage and upload your own collection of stable diffusion assets via the OctoAI CLI, or via the web UI. You can then invoke your own checkpoint, LoRA, textual inversion, or controlnet via the ImageGenerator API.

import PIL
from octoai.clients.image_gen import Engine, ImageGenerator

# Instantiate the OctoAI SDK image generator
image_gen = ImageGenerator(token=OCTOAI_API_TOKEN)

# Ingredients stills dictionary (Ingredient -> Image)
ingredient_images = {}
# Iterate through the list of ingredients in the recipe dictionary
for ingredient in recipe_dict["ingredient_list"]:
    # We do some simple prompt engineering to achieve a consistent style
    prompt = "RAW photo, Fujifilm XT, clean bright modern kitchen photograph, ({})".format(ingredient["illustration"])
    # The parameters below can be tweaked as needed, the resolution is intentionally set to portrait mode
    image_gen_response = image_gen.generate(
        engine=Engine.SDXL,
        prompt=prompt,
        negative_prompt="Blurry photo, distortion, low-res, poor quality, watermark",
        width=768,
        height=1344,
        num_images=1,
        sampler="DPM_PLUS_PLUS_2M_KARRAS",
        steps=30,
        cfg_scale=12,
        use_refiner=True,
        high_noise_frac=0.8,
        style_preset="Food Photography",
    )
    ingredient_images[ingredient["ingredient"]] = image_gen_response.images[0].to_pil()
    display(ingredient_images[ingredient["ingredient"]])

# Recipe steps stills dictionary (Step -> Image)
step_images = {}
# Iterate through the list of steps in the recipe dictionary
for step in recipe_dict["recipe_steps"]:
    # We do some simple prompt engineering to achieve a consistent style
    prompt = "RAW photo, Fujifilm XT, clean bright modern kitchen photograph, ({})".format(step["illustration"])
    # The parameters below can be tweaked as needed, the resolution is intentionally set to portrait mode
    image_gen_response = image_gen.generate(
        engine=Engine.SDXL,
        prompt=prompt,
        negative_prompt="Blurry photo, distortion, low-res, poor quality, watermark",
        width=768,
        height=1344,
        num_images=1,
        sampler="DPM_PLUS_PLUS_2M_KARRAS",
        steps=30,
        cfg_scale=12,
        use_refiner=True,
        high_noise_frac=0.8,
        style_preset="Food Photography",
    )
    step_images[step["step"]] = image_gen_response.images[0].to_pil()
    display(step_images[step["step"]])

# Final dish in all of its glory
prompt = "RAW photo, Fujifilm XT, clean bright modern kitchen photograph, professionally presented ({})".format(recipe_dict["dish_name"])
image_gen_response = image_gen.generate(
    engine=Engine.SDXL,
    prompt=prompt,
    negative_prompt="Blurry photo, distortion, low-res, poor quality",
    width=768,
    height=1344,
    num_images=1,
    sampler="DPM_PLUS_PLUS_2M_KARRAS",
    steps=30,
    cfg_scale=12,
    use_refiner=True,
    high_noise_frac=0.8,
    style_preset="Food Photography",
)
final_dish_still = image_gen_response.images[0].to_pil()
display(final_dish_still)

Now I have beautiful stills that I’ll just have to animate.

I’m going to use Stable Video Diffusion 1.1 to animate each one of the images I generated. The model is openly released, but under a “non-commercial research community license”. Thankfully, OctoAI has a commercial partnership with Stability AI that lets you use SVD 1.1 commercially.

To animate an image using OctoAI’s Python SDK, you just need to instantiate the OctoAI VideoGenerator with your OctoAI API token, then invoke the generate method for each animation you want to produce. You’ll need to pass in the following arguments:

  • engine which selects what model to use – we use SVD here
  • image which encodes the input image we want to animate as a base64 string
  • steps which specifies the number of denoising steps to obtain each frame in the video
  • cfg_scale which specifies the classifier-free guidance scale, i.e. how closely the generation adheres to the input image
  • fps which specifies the number of frames per second
  • motion_scale which indicates how much motion should be in the generated animation
  • noise_aug_strength which specifies how much noise to add to the initial image; a higher value encourages more creative videos
  • num_videos which represents how many output animations to generate

To read more about the API and what options are supported in OctoAI, head over to this link.

from PIL import Image
from io import BytesIO
from base64 import b64encode, b64decode
from octoai.clients.video_gen import Engine, VideoGenerator

# We'll need this helper to convert PIL images into a base64 encoded string
def image_to_base64(image: Image.Image) -> str:
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    img_b64 = b64encode(buffered.getvalue()).decode("utf-8")
    return img_b64

# Instantiate the OctoAI SDK video generator
video_gen = VideoGenerator(token=OCTOAI_API_TOKEN)

# Dictionary that stores the videos for ingredients (ingredient -> video)
ingredient_videos = {}
# Iterate through every ingredient in the recipe
for ingredient in recipe_dict["ingredient_list"]:
    key = ingredient["ingredient"]
    # Retrieve the image from the ingredient_images dict
    still = ingredient_images[key]
    # Generate a video with the OctoAI video generator
    video_gen_response = video_gen.generate(
        engine=Engine.SVD,
        image=image_to_base64(still),
        steps=25,
        cfg_scale=3,
        fps=6,
        motion_scale=0.5,
        noise_aug_strength=0.02,
        num_videos=1,
    )
    video = video_gen_response.videos[0]
    # Store the video in the ingredient_videos dict
    ingredient_videos[key] = video

# Dictionary that stores the videos for recipe steps (step -> video)
steps_videos = {}
# Iterate through every step in the recipe
for step in recipe_dict["recipe_steps"]:
    key = step["step"]
    # Retrieve the image from the step_images dict
    still = step_images[key]
    # Generate a video with the OctoAI video generator
    video_gen_response = video_gen.generate(
        engine=Engine.SVD,
        image=image_to_base64(still),
        steps=25,
        cfg_scale=3,
        fps=6,
        motion_scale=0.5,
        noise_aug_strength=0.02,
        num_videos=1,
    )
    video = video_gen_response.videos[0]
    # Store the video in the steps_videos dict
    steps_videos[key] = video

# Generate a video for the final dish presentation (it'll be used in the intro and at the end)
video_gen_response = video_gen.generate(
    engine=Engine.SVD,
    image=image_to_base64(final_dish_still),
    steps=25,
    cfg_scale=3,
    fps=6,
    motion_scale=0.5,
    noise_aug_strength=0.02,
    num_videos=1,
)
final_dish_video = video_gen_response.videos[0]

It takes about 30 seconds to generate a 4-second clip. Since we’re generating each video sequentially, this step can take a few minutes to finish, but it’s easy to extend this code to make it asynchronous or to parallelize it.
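For example, here’s a minimal sketch of one way to parallelize the ingredient animations with a thread pool. The helper name animate_still and the worker count are my own choices, not part of the original notebook; it reuses the video_gen, image_to_base64, and ingredient_images objects defined above:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical helper: animate one (ingredient, still) pair and return (ingredient, video)
def animate_still(item):
    key, still = item
    response = video_gen.generate(
        engine=Engine.SVD,
        image=image_to_base64(still),
        steps=25,
        cfg_scale=3,
        fps=6,
        motion_scale=0.5,
        noise_aug_strength=0.02,
        num_videos=1,
    )
    return key, response.videos[0]

# The generate calls are I/O bound, so a small thread pool gives a nice speedup
with ThreadPoolExecutor(max_workers=4) as pool:
    ingredient_videos = dict(pool.map(animate_still, ingredient_images.items()))

The same pattern applies to the recipe-step animations.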

I’m going to rely on the MoviePy library to create a montage of the videos.

For each short animation (dish, ingredients, steps), we also have corresponding text from the original recipe_dict JSON object. This allows us to generate captions for the montage.

Each video has 25 frames and plays at 6 FPS, so each clip lasts about 4.17 seconds. Because the ingredients list can be rather long, we crop each ingredient clip to 2 seconds to keep the flow of the video going. For the step clips, we play 4 seconds of each, since we need to give the viewer time to read the instructions.

The code below will perform 3 tasks:

  • Stitch together the clips into a montage that starts with the final dish, shows each ingredient and each instruction step, and ends with the final dish again.
  • Add subtitles throughout the video to provide clear instructions to the viewer.
  • Add a soundtrack to the video to make it a bit more catchy.

from IPython.display import Video
from moviepy.editor import *
from moviepy.video.tools.subtitles import SubtitlesClip
import textwrap

# Video collage
collage = []

# To prepare the closed caption of the video, we define
# two durations: ingredient duration (2.0s) and step duration (4.0s)
ingredient_duration = 2
step_duration = 4
# We keep track of the time ellapsed
time_ellapsed = 0
# This sub list will contain tuples in the following form:
# ((t_start, t_end), "caption")
subs = []

# Let's create the intro clip presenting the final dish
with open('final_dish.mp4', 'wb') as wfile:
    wfile.write(final_dish_video.to_bytes())
vfc = VideoFileClip('final_dish.mp4')
collage.append(vfc)
# Add the subtitle which provides the name of the dish, along with prep time, cook time and difficulty
subs.append(((time_ellapsed, time_ellapsed+step_duration), "{} Recipe\nPrep: {}min\nCook: {}min\nDifficulty: {}".format(
    recipe_dict["dish_name"].title(), recipe_dict["prep_time"], recipe_dict["cook_time"], recipe_dict["difficulty"]))
)
time_ellapsed += step_duration

# Go through the ingredients list to stich together the ingredients clip
for idx, ingredient in enumerate(recipe_dict["ingredient_list"]):
    # Write the video to disk and load it as a VideoFileClip
    key = ingredient["ingredient"]
    video = ingredient_videos[key]
    with open('clip_ingredient_{}.mp4'.format(idx), 'wb') as wfile:
        wfile.write(video.to_bytes())
    vfc = VideoFileClip('clip_ingredient_{}.mp4'.format(idx))
    vfc = vfc.subclip(0, ingredient_duration)
    collage.append(vfc)
    # Add the subtitle which just provides each ingredient
    subs.append(((time_ellapsed, time_ellapsed+ingredient_duration), "Ingredients:\n{}".format(textwrap.fill(ingredient["ingredient"], 35))))
    time_ellapsed += ingredient_duration

# Go through the recipe steps to stitch together each step of the recipe video
for idx, step in enumerate(recipe_dict["recipe_steps"]):
    # Write the video to disk and load it as a VideoFileClip
    key = step["step"]
    video = steps_videos[key]
    with open('clip_step_{}.mp4'.format(idx), 'wb') as wfile:
        wfile.write(video.to_bytes())
    vfc = VideoFileClip('clip_step_{}.mp4'.format(idx))
    collage.append(vfc)
    # Add the subtitle which just provides each recipe step (numbered from 1)
    subs.append(((time_ellapsed, time_ellapsed+step_duration), "Step {}:\n{}".format(idx + 1, textwrap.fill(step["step"], 35))))
    time_ellapsed += step_duration

# Add the outtro clip
vfc = VideoFileClip('final_dish.mp4')
collage.append(vfc)

# Add the subtitle: Enjoy your {dish_name}
subs.append(((time_ellapsed, time_ellapsed+step_duration), "Enjoy your {}!".format(recipe_title.title())))
time_ellapsed += step_duration

# Concatenate the clips into one initial collage
final_clip = concatenate_videoclips(collage)
final_clip.write_videofile("collage.mp4", fps=vfc.fps)

# Add subtitles to the collage
generator = lambda txt: TextClip(
    txt,
    font='Century-Schoolbook-Roman',
    fontsize=30,
    color='white',
    stroke_color='black',
    stroke_width=1.5,
    method='label',
    transparent=True
)
subtitles = SubtitlesClip(subs, generator)
result = CompositeVideoClip([final_clip, subtitles.margin(bottom=70, opacity=0).set_pos(('center','bottom'))])
result.write_videofile("collage_sub.mp4", fps=vfc.fps)

# Now add a soundtrack: you can browse https://pixabay.com for a track you like
# I'm downloading a track called "once in paris" by artist pumpupthemind
import subprocess

subprocess.run(["wget", "-O", "audio_track.mp3", "http://cdn.pixabay.com/download/audio/2023/09/29/audio_0eaceb1002.mp3"])

# Add the soundtrack to the video
videoclip = VideoFileClip("collage_sub.mp4")
audioclip = AudioFileClip("audio_track.mp3").subclip(0, videoclip.duration)
video = videoclip.set_audio(audioclip)
video.write_videofile("collage_sub_sound.mp4")

Now if you open “collage_sub_sound.mp4”, you’ll get a full video showing you how to make a dorritos consomme (or any dish that you enter, for that matter).
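If you’re running this in a notebook, one simple way to preview the result inline is with the IPython Video class we imported earlier (the embed and width arguments are just one reasonable choice):

from IPython.display import Video

# Preview the final montage inline in the notebook
Video("collage_sub_sound.mp4", embed=True, width=384)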

Here’s a video for “skittles omelette” for your enjoyment.

Sora may not be mainstream yet, but creative developers already have the building blocks today to build valuable GenAI media generation pipelines.

How would you build upon this work to create something completely new? Add some GenAI narration? Or GenAI soundtrack? Mix and match new model combinations?

Let me know in the comments or on the OctoAI Discord.

Source: https://medium.com/@thierryjmoreau/build-your-own-genai-video-generation-pipeline-cdc1515d1db9
