
January 18, 2024

AI Singing Voice Cloning in Python

An End-to-End Python Guide to Data Processing, Training, and Inference for AI-Cloned Singing Voices: From Voice Data to Pre-trained and Custom Models

Imagine a world where your voice could harmonize with any tune, adopt any accent, or even replicate the iconic timbre of legendary singers. This is a reality made possible through AI singing voice cloning. 

This groundbreaking technology merges the art of music with the precision of machine learning, enabling us to create new songs or reimagine classics in any voice we desire.

AI voice cloning is a technology that captures the unique characteristics of a voice and then reproduces it with astonishing accuracy. This digital alchemy allows us to not only replicate existing voices but also to create entirely new ones. 

It’s a tool that has revolutionized content creation, from personalized songs to custom voiceovers, opening up a world of creative possibilities that transcend language and cultural barriers.

The objective of this article is to provide technical readers with a comprehensive Python guide to utilizing AI voice cloning technology — an end-to-end solution for transforming any audio into the tones of a chosen artist or even one’s own voice by training a custom model.

This tutorial is structured as follows: we first cover the technology background, then run inference with pre-trained artist models, then train a custom voice model, and finally look at practical applications of the solution.

1. Technology Background

The technology that we will be using in this article is called Singing Voice Conversion (SVC), in particular a system called SO-VITS-SVC, which stands for “SoftVC VITS Singing Voice Conversion”. 

The SO-VITS-SVC system represents a sophisticated implementation of Singing Voice Conversion (SVC) using deep learning technologies. Understanding this system requires an appreciation of the specific machine learning architectures and algorithms it employs.

1.1 Variational Inference and Generative Adversarial Networks

  • At the heart of SO-VITS-SVC lies the Variational Inference for Text-to-Speech (VITS) architecture. This system ingeniously combines Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).
  • VAEs are utilized for modeling the distribution of mel-spectrograms, which are a key representation of the audio signal in SVC. The VAE component helps in capturing the latent variables of speech.
  • The VAE loss function is given by the formula below, where x is the input mel-spectrogram, z is the latent variable, and KL denotes the Kullback-Leibler divergence:

$$\mathcal{L}_{\text{VAE}}(x) = -\,\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big] + \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)$$

Equation 1. This formula encapsulates the VAE loss function, balancing the reconstruction of mel-spectrograms with the regularization of the latent space through the Kullback-Leibler divergence.

  • GANs enhance the realism of the synthesized audio. The discriminator in the GAN critiques the output of the generator, improving its accuracy. The GAN loss function is given by:
$$\min_G \max_D \;\; \mathbb{E}_{x}\big[\log D(x)\big] + \mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big]$$

Equation 2. The GAN loss function showcases the adversarial training dynamics, driving the generative model to produce indistinguishable singing voice replicas.
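To make Equations 1 and 2 more concrete, below is a minimal PyTorch sketch of how these two loss terms are commonly computed. It is illustrative only: the tensor names (x, x_hat, mu, logvar, and the discriminator logits) are assumptions for this example and are not taken from the SO-VITS-SVC codebase.

import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    # Equation 1: reconstruction error plus KL regularization of the latent space
    recon = F.l1_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def gan_losses(d_real_logits, d_fake_logits):
    # Equation 2: the discriminator separates real from generated audio,
    # while the generator tries to make generated audio look real
    d_loss = (F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits))
              + F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits)))
    g_loss = F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))
    return d_loss, g_loss

# Dummy tensors standing in for a batch of mel-spectrograms and discriminator outputs
x, x_hat = torch.rand(4, 80, 100), torch.rand(4, 80, 100)
mu, logvar = torch.zeros(4, 192), torch.zeros(4, 192)
print(vae_loss(x, x_hat, mu, logvar))
print(gan_losses(torch.randn(4, 1), torch.randn(4, 1)))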

For a comprehensive understanding of Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), you might want to refer to the original papers introducing these concepts:

  • Auto-Encoding Variational Bayes (Kingma & Welling, 2013)
  • Generative Adversarial Networks (Goodfellow et al., 2014)

1.2 Shallow Diffusion Process

  • As illustrated in the accompanying diagram, the shallow diffusion process starts with a noise sample that is progressively refined into a structured mel-spectrogram through a series of transformations.

Figure. 1: Shallow Diffusion - The diagram presents the SO-VITS-SVC synthesis pipeline, from the initial noise generation in the shallow diffusion model to the mel-spectrogram refinement and final vocoding for audible voice output.

1. Initial Noise Sample: A visual representation of noise that serves as the starting point for the diffusion process.

2. Transformation Steps: The noise undergoes a series of steps within the diffusion model, transitioning from a disordered state towards a structured mel-spectrogram. One step of this forward (noising) process can be written as below, where x_t is the data at step t, ε represents Gaussian noise, and β_t is the amount of noise applied at that step (see the short code sketch after this list):

$$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

Equation 3. This formula illustrates the gradual transformation in the diffusion process, morphing random noise into structured data to capture the nuances of the target singing voice.

    • In the context of SO-VITS-SVC, ‘shallow’ likely signifies fewer layers or steps, striking a balance between computational efficiency and audio quality.

3. Mel-Spectrogram Refinement: The outcome of this process is a mel-spectrogram that encapsulates the audio content of the singing voice, ready for the next synthesis phase.

4. Vocoding: The final vocoding step converts the mel-spectrogram into an audio waveform, which is the audible singing voice.
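The forward (noising) direction of Equation 3 is easy to simulate. The NumPy sketch below is purely illustrative and is not the SO-VITS-SVC implementation; the trained model learns the reverse direction, turning noise back into a clean mel-spectrogram.

import numpy as np

def forward_diffuse(x0, betas):
    # Repeatedly apply Equation 3, adding a little Gaussian noise at each step
    x, trajectory = x0, [x0]
    for beta in betas:
        eps = np.random.randn(*x.shape)                   # Gaussian noise
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps
        trajectory.append(x)
    return trajectory

# Example: 50 "shallow" steps over a placeholder 80-bin mel-spectrogram
mel = np.random.rand(80, 200)
noised = forward_diffuse(mel, betas=np.linspace(1e-4, 0.05, 50))
print(len(noised), noised[-1].shape)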

For an in-depth exploration of diffusion models, the following papers provide detailed explanations and research context:

  • Deep Unsupervised Learning using Nonequilibrium Thermodynamics (Sohl-Dickstein et al., 2015)
  • Denoising Diffusion Probabilistic Models (Ho et al., 2020)

1.3 Integration of the Synthesis Process with the SVC System

  1. Mel-Spectrogram Refinement:

After the shallow diffusion model has structured the noise into a more coherent form, as visualized in the previously mentioned diagram, the resulting mel-spectrogram captures the nuanced audio content of the singing voice. This mel-spectrogram serves as a pivotal bridge between the raw, unstructured data and the final vocal output.

2. Vocoding:

The vocoder is then employed to convert the refined mel-spectrogram into an audio waveform. This step is where the transformation from visual data to the audible singing voice occurs. The vocoder’s role is to synthesize the nuances of pitch, timbre, and rhythm that have been captured in the mel-spectrogram, thereby producing the final singing voice output.

3. Training and Optimization:

To achieve this high-fidelity synthesis, the SO-VITS-SVC system undergoes rigorous training and optimization. The training involves optimizing a combined loss function that balances the contributions of the VAE, GAN, and the diffusion model components.

This optimization is conducted using algorithms such as Stochastic Gradient Descent or Adam, with the goal of minimizing the overall loss; a toy sketch of such a combined update is shown after step 4 below. The process ensures that the final output closely resembles the target singing voice in terms of timbre, pitch, and rhythm.

4. Final Output:

The end product of this process is a synthesized voice that closely mirrors the target singing voice. The ability to maintain the musicality and expressive nuances of the source while adopting the tonal qualities of the target is a testament to the sophistication of the SO-VITS-SVC system.
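As a rough illustration of the combined optimization described in step 3, the toy snippet below adds up stand-in loss terms and performs one Adam update. The network, loss values, and weights are placeholders, not the actual SO-VITS-SVC training code.

import torch
import torch.nn.functional as F

model = torch.nn.Linear(80, 80)                      # placeholder for the synthesis network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

mel = torch.randn(4, 80)                             # dummy batch of mel-spectrogram frames
loss_vae = F.l1_loss(model(mel), mel)                # reconstruction-style term (Equation 1)
loss_gan = torch.tensor(0.5)                         # placeholder adversarial term (Equation 2)
loss_diffusion = torch.tensor(0.3)                   # placeholder diffusion term (Equation 3)

total_loss = loss_vae + loss_gan + loss_diffusion    # combined objective
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
print(float(total_loss))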

For those who are new to machine learning and deep learning, foundational resources and courses from educational platforms can offer a starting point for building up the necessary background knowledge.

1.4 Python Libraries Used

The SO-VITS-SVC Fork, hosted on GitHub, is a specialized tool designed for real-time singing voice conversion. This fork of the original SO-VITS-SVC project offers enhanced features such as more accurate pitch estimation using CREPE, a graphical user interface (GUI), faster training times, and the convenience of installing the tool with pip.

It also integrates QuickVC and corrects some issues present in the original repository. The fork supports real-time voice conversion, making it a versatile tool for voice cloning tasks. Additionally, it simplifies the installation and setup process, making it more accessible for users who wish to experiment with voice cloning technology.

2. Inference: Sing with any Artist’s AI Voice

Inference refers to the process where a neural network model, after being trained on a dataset to understand a particular voice, generates new content with that learned voice. 

This phase is where we can ‘sing’ with an artist’s AI voice by providing new inputs (raw singing voice audio) to a pre-trained model, which then produces an output that mimics the artist’s singing style on the raw voice audio.

2.1 Setting Up the SO-VITS-SVC Environment

For simplicity, we will be using a Jupyter Notebook with a dedicated Conda environment, so we suggest starting there:

  • Install Anaconda: Download and install Anaconda on your system, which will allow you to create isolated environments for different projects.
  • Open the Anaconda terminal, create a new environment by running the following command.
				
					conda create -n sovits-svc
				
			
  • If you use VS Code, you can select the environment from the kernel picker; otherwise, if you continue to use Anaconda, activate the environment with conda activate and then run the Jupyter notebook.
				
					conda activate sovits-svc
				
			
  • Install the necessary libraries in the environment from within the notebook.
				
					!python -m pip install -U pip wheel
%pip install -U ipython
%pip install -U so-vits-svc-fork
				
			
  • To avoid the issue below later on when we run the !svc command, we should go to the Anaconda environment, uninstall TorchAudio with pip uninstall torchaudio, and install it again with pip install torchaudio.

Figure. 2: This error is often related to the TorchAudio library, which is essential for running the !svc command. Ensuring all dependencies are correctly installed and paths are properly set can resolve this issue; typically, uninstalling and reinstalling TorchAudio is enough.
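If you prefer to do the reinstall from within the notebook rather than the Anaconda prompt, the same fix can be applied like this (the -y flag skips pip's confirmation prompt):

%pip uninstall -y torchaudio
%pip install torchaudio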

We are now ready to perform a singing voice conversion using a pre-trained model on clean vocals (i.e., no background noise).

2.2 Using a Pre-trained Model for Singing Voice Generation

1) Selecting a pre-trained model 

With the environment ready, the next step is to obtain a pre-trained model. We’ve made several pre-trained models available that can be used, from Drake to Michael Jackson. Here are the options:


Figure. 3: Diverse collection of SO-VITS-SVC pre-trained models, each fine-tuned to replicate the unique vocal stylings of different artists, available for voice synthesis experimentation and application on Huggingface via Entreprenerdly.com.

Once it’s decided which model to use, the .pth PyTorch model file and the associated config.json must be retrieved and downloaded:

				
					from huggingface_hub import hf_hub_download
import os

# Set the repository ID and local directory
repo_id = 'Entreprenerdly/drake-so-vits-svc'
local_directory = '.'

# Download the config.json file
config_file = hf_hub_download(
    repo_id=repo_id,
    filename='config.json',
    local_dir=local_directory,
    local_dir_use_symlinks=False
)

# Construct the path to the config file in the current directory
local_config_path = os.path.join(local_directory, 'config.json')
print(f"Downloaded config file: {local_config_path}")

# Download the model file
model_file = hf_hub_download(
    repo_id=repo_id,
    filename='G_106000.pth',
    local_dir=local_directory,
    local_dir_use_symlinks=False
)

# Construct the path to the model file in the current directory
local_model_path = os.path.join(local_directory, 'G_106000.pth')
print(f"Downloaded model file: {local_model_path}")
				
			

2) Selecting a clean audio file

Next, we’ll download a clean audio file for conversion. The audio file we’re using can be accessed and downloaded via Google Drive. 

It’s a clean Justin Bieber vocal track, which you can then input into the SO-VITS-SVC system for voice conversion using the selected model.

While this might not be the cleanest track, when preparing your own audio files for conversion, it’s crucial to ensure that they are as clean as possible to avoid any unintended artifacts or quality issues in the generated audio. 

The quality of the source audio significantly impacts the fidelity of the voice conversion, and thus a high-quality, clean recording is always recommended.

				
					import requests

vocals_url = 'https://drive.google.com/uc?id=154awrw0VxIZKQ2jQpHQQSt__cOUdM__y'
response = requests.get(vocals_url)
with open('vocals.wav', "wb") as file:
    file.write(response.content)
				
			

3) Running the inference

To perform voice conversion using the SO-VITS-SVC model, you will need to specify the paths to your audio file, model checkpoint, and configuration file. Here’s how you can set the paths and run the inference.

Make sure that you are running this command in an environment that has access to the !svc command, such as a command prompt or a script that is executed in an environment where the SO-VITS-SVC tool is installed.

				
					from IPython.display import Audio, display
import os

# Filenames
audio_filename = 'vocals.wav'
model_filename = 'G_106000.pth'
config_filename = 'config.json'

# Construct the full local paths
audio_file = f"\"{os.path.join('.', audio_filename)}\""
model_path = f"\"{os.path.join('.', model_filename)}\""
config_path = f"\"{os.path.join('.', config_filename)}\""

# Running the inference command
!svc infer {audio_file} -m {model_path} -c {config_path}
				
			

4) Display the output

After running the inference, you can display the output audio directly in your Jupyter notebook or any IPython interface. The following code snippet will play the resulting audio file:

				
					from IPython.display import Audio, display

# Path for the output audio file
output_audio_path = "vocals.out.wav"

# Display the output audio
display(Audio(output_audio_path, autoplay=True))
				
			

5) Optional — Use the GUI to make inferences

For users who prefer a graphical interface, the SO-VITS-SVC system provides an optional GUI for performing voice conversion. 

This feature-rich GUI streamlines the inference process, offering an alternative to command-line operations. It can be launched with the following command:

				
					svcg
				
			

Figure. 4: A screenshot of the SO-VITS-SVC graphical user interface, displaying the settings for input paths, audio processing parameters, and real-time inference options.

  1. Setup: Launch the GUI and configure the necessary paths for the model, configuration, and audio files using the ‘Browse’ buttons.
  2. Model Selection: Choose the appropriate pre-trained model via the ‘Model path’ field.
  3. Configuration: Select the configuration file that corresponds to your model in the ‘Config path’ field.
  4. Audio Input: Load your target audio file for conversion using the ‘Input audio path’.
  5. Inference: Click on the ‘Infer’ button to start the conversion process, adjusting any additional parameters if needed.
  6. Output: Save and listen to the generated output directly through the GUI, ensuring the quality meets your expectations.

The GUI also offers real-time inference capabilities, allowing for adjustments on-the-fly and immediate auditory feedback. It’s designed for ease of use, catering to users who might not be comfortable with scripting or command-line tools.

3. Training a Custom Voice Model

Beyond the pre-trained models above, you can train SO-VITS-SVC on your own recordings to create a custom voice model. The workflow is: prepare a clean vocal dataset, run the automatic preprocessing commands, configure and start training, and finally run inference with the resulting checkpoint.

3.1 Preparing the Dataset

  1. Isolating the vocals: The training data should contain vocals only. If your source recordings include background music, Spleeter can separate each file into two stems; it writes a vocals.wav and an accompaniment.wav into a subfolder named after the input file.

				
					!pip install spleeter

from spleeter.separator import Separator

# Initialize the separator with the desired configuration.
# Here, 'spleeter:2stems' means we want to separate the audio into two stems: vocals and accompaniment.
separator = Separator('spleeter:2stems')

# Use the separator on the audio file.
# This function will separate the audio file into two files: one containing the vocals, and one containing the background music.
separator.separate_to_file('audiofile.mp3', './')
				
			
  2. Splitting the audio track into snippets: We can use AudioSlicer to split an extensive audio file into 10–15 second snippets suitable for training the model.
				
					from audioslicer import slice_audio

# Path to the input audio file
input_audio_path = 'long_audio_file.mp3'

# Path to the output directory where snippets will be saved
output_directory = 'output/snippets/'

# Length of each audio snippet in seconds
snippet_length = 15  

# Slice the audio file into snippets
slice_audio(input_audio_path, output_directory, snippet_length)
				
			

3.2 Automatic Preprocessing
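Before running the preprocessing commands, the sliced snippets need to be arranged into the dataset_raw/{speaker_id} layout that the !svc commands expect. A minimal sketch is shown below; the speaker name 'my_voice' and the snippets folder are just the examples used earlier, so adjust them to your own paths.

import shutil
from pathlib import Path

speaker_id = 'my_voice'                  # any name; it becomes the speaker label of your model
src = Path('output/snippets')            # folder produced by the slicing step above
dst = Path('dataset_raw') / speaker_id
dst.mkdir(parents=True, exist_ok=True)

# Copy every sliced .wav file into the layout expected by `svc pre-resample`
for wav in src.glob('*.wav'):
    shutil.copy(wav, dst / wav.name)
print(f"Copied {len(list(dst.glob('*.wav')))} files to {dst}")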

With the dataset_raw folder created in the current directory and the recordings stored in the dataset_raw/{speaker_id} directory, as outlined in the folder structure below, you’re set to initiate the preprocessing and training phases for your singing voice conversion model.

				
					.
├── dataset_raw
│   └── {speaker_id}
│       └── {wav_file}.wav
				
			

The preprocessing involves three key steps to prepare the audio data for model training: !svc pre-resample, !svc pre-config, and !svc pre-hubert. After these, !svc train starts the training itself.

1. Resampling Audio: Using the !svc pre-resample command, the system will standardize the sample rate of your audio files. This process will navigate through the dataset_raw/{speaker_id} directory and ensure all .wav files are resampled correctly.

				
					!svc pre-resample
				
			

After execution, you will find the resampled audio files in a similar directory structure under a new sample rate directory, typically dataset/44k/{speaker_id}.
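A quick way to confirm that the resampling step produced output is to count the resulting files (dataset/44k is the default path mentioned above):

from pathlib import Path

# List the resampled clips written by `svc pre-resample`
resampled = list(Path('dataset/44k').rglob('*.wav'))
print(f"{len(resampled)} resampled files found")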

2. Configuration File Generation: Next, you’ll generate a configuration file that guides the training process. The svc pre-config command creates this file, leveraging the resampled audio data.

				
					!svc pre-config
				
			

The newly generated config.json will be in a config/44k/ directory relative to your current working directory.

3. Pitch Contour Extraction: Pitch extraction is crucial for a singing voice model. By running svc pre-hubert, the system will analyze and extract pitch contours from each audio file in your dataset.

				
					!svc pre-hubert
				
			

This step may require you to specify the preferred method for pitch extraction, such as DIO or CREPE.
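The exact option name for choosing the pitch-extraction method varies between versions of the fork, so it is safest to check the command’s built-in help rather than rely on a specific flag:

!svc pre-hubert --help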

3.3 Training Configuration

Prior to initiating training, it’s essential to configure your model to ensure optimal learning conditions. This involves editing the config.json file created in the config/44k/ directory during preprocessing. Key parameters within this configuration file include:

  • log_interval: Sets the frequency at which the loss will be printed to the console during training, allowing you to monitor the model’s learning progress.
  • eval_interval: Determines how often the model’s state will be saved. This is particularly important as model checkpoints can be large and you want to ensure consistent progress without using excessive storage.
  • epochs: Defines the total number of training cycles the model will undergo. More epochs can lead to better learning but also require more time.
  • batch_size: Specifies how many samples will be processed at once. The ideal batch size depends on the available VRAM on your GPU; too large a batch size may exceed your VRAM capacity, while too small could lead to inefficient learning.

Figure. 5: Snapshot of the configuration settings for the model training, as specified in the 'config.json' file, displaying key parameters like batch size, learning rate, and intervals for evaluation and logging.

For a dataset with 200 samples and a batch size of 20, each epoch equates to 10 steps. If you aim for 100 epochs, this translates to 1,000 steps. It’s advisable to add one to the epoch count to ensure the last step is saved. 

The default setting might suggest 10,000 epochs, but depending on your hardware and dataset size, you may need to adjust this. A practical approach could be aiming for 20,000 steps and evaluating the performance before deciding whether to extend training.
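You can edit config.json by hand, or tweak the key parameters programmatically. The sketch below assumes the usual layout with a top-level 'train' block (open your generated file to confirm the exact keys); the values are illustrative, not recommendations.

import json

config_path = 'config/44k/config.json'   # path created by `svc pre-config`

with open(config_path) as f:
    cfg = json.load(f)

# Adjust key training parameters (illustrative values)
cfg['train']['log_interval'] = 50        # print the loss every 50 steps
cfg['train']['eval_interval'] = 200      # save a checkpoint every 200 steps
cfg['train']['epochs'] = 101             # +1 so the final step is checkpointed
cfg['train']['batch_size'] = 20          # lower this if you run out of VRAM

with open(config_path, 'w') as f:
    json.dump(cfg, f, indent=2)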

3.4 Start Training

Begin the actual model training with the svc train command. This will start the machine learning process, utilizing the preprocessed data from the dataset/44k/{speaker_id} directory and configuration settings in config/44k/config.json.

				
					!svc train
				
			

During training, the model’s progress will be saved periodically, as specified by your eval_interval. If you need to stop the training for any reason, you can rest assured that you can resume from the last saved checkpoint. This flexibility allows for training to be a more manageable task, especially when dealing with large datasets or limited computational resources.

As the training proceeds, your console will display the loss at intervals set by log_interval, giving you insight into how the model is learning. Upon completion, the logs/44k directory will contain various files, including the trained models (denoted as G followed by a number). You can then use these models to synthesize your voice, creating your desired audio outputs.

By carefully configuring and running these steps, you will train a model capable of converting singing voice audio with nuances and qualities similar to the original recordings found in your dataset_raw/{speaker_id} directory.
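To pick up the newest generator checkpoint for inference in the next step, a small helper like this works, assuming the default logs/44k output directory described above:

from pathlib import Path

# Generator checkpoints are named G_<step>.pth; sort them by step number
ckpts = sorted(Path('logs/44k').glob('G_*.pth'),
               key=lambda p: int(p.stem.split('_')[1]))
latest = ckpts[-1] if ckpts else None
print(f"Latest generator checkpoint: {latest}")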

3.5 Model Inference

After the model has been trained, fine-tuned, and validated, the next step is to run inference to convert the source audio into the target voice as per the previous section.

				
					from IPython.display import Audio, display
import os

# Filenames
audio_filename = 'vocals.wav'    # vocals to apply the trained model to
model_filename = 'model.pth'     # trained model file (e.g. a G_*.pth checkpoint from logs/44k)
config_filename = 'config.json'  # config file generated during preprocessing

# Construct the full local paths
audio_file = f"\"{os.path.join('.', audio_filename)}\""
model_path = f"\"{os.path.join('.', model_filename)}\""
config_path = f"\"{os.path.join('.', config_filename)}\""

# Running the inference command
!svc infer {audio_file} -m {model_path} -c {config_path}
				
			

After running the inference, you might want to fine-tune the output further. For instance, if the source audio was singing and you need to mix it with background music, or if you wish to adjust the pitch, you can use audio processing tools like Audacity or any other digital audio workstation to mix and match tracks to your preference.
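If you want to stay inside Python instead of a DAW, a library such as pydub can overlay the converted vocals onto an instrumental. The file names below are assumptions for this example (for instance, the accompaniment stem Spleeter produced earlier), and pydub relies on ffmpeg for most audio formats.

%pip install pydub

from pydub import AudioSegment

# Load the converted vocals and an instrumental track (paths are illustrative)
vocals = AudioSegment.from_file('vocals.out.wav')
instrumental = AudioSegment.from_file('accompaniment.wav')

# Overlay the vocals on top of the instrumental and export the final mix
mix = instrumental.overlay(vocals)
mix.export('final_mix.wav', format='wav')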

4. Practical Applications of the Solution

  • Music Production Enhancement — Producers experiment with different vocal textures and styles without requiring the physical presence of the artists. This can be particularly useful for conceptualizing new tracks or for adding backing vocals in different pitches and timbres.
  • Personalized Music Experience — Imagine a service where fans can receive a version of their favorite song sung in the voice of another artist or even in their own voice using a custom model.
  • Film and Animation Dubbing — Dubbing for films, cartoons, and games can benefit from SO-VITS-SVC by generating voice overs in various character voices, especially when the original actors are unavailable. It can create consistent vocal performances across different language versions.
  • Educational Tools — In language learning, SO-VITS-SVC can enhance comprehension by allowing learners to hear text read aloud in their own voice with the correct accent and intonation of a variety of languages, supporting the development of listening and speaking skills.
  • Voice Restoration — For individuals who have lost the ability to speak, SO-VITS-SVC could be trained on recordings of their voice to restore their ability to communicate in a manner that retains the essence of their original voice.
  • Reviving Historical Voices — Bring historical speeches to life by cloning voices from the past, providing a compelling auditory experience in museums or educational content.

Conclusion

The implications for this technology are profound and far-reaching. In the immediate future, we might see a democratization of music production, where independent artists leverage AI to duet with any voice from history, or educators use it to obliterate language barriers in classrooms across the globe. But the true potential of SO-VITS-SVC lies in its long-term applications — those we are yet to fully comprehend or imagine.

Imagine a world where every digital interaction is personalized not just to your visual preferences but tailored to auditory comfort. A world where virtual reality isn’t just a visual spectacle but an auditory feast, where every character, every entity, can converse with you in voices that are indistinguishable from real human ones. The SO-VITS-SVC could be the harbinger of virtual beings with their own unique vocal identities, capable of singing, speaking, and interacting in a multitude of languages and styles.

As we push the boundaries of AI voice cloning, we may soon blur the line between creators and creations. Characters from fiction might release their own albums, and poets long gone might recite new verses. The implications for copyright, identity, and even the essence of creation itself could be challenged and redefined.
