End-to-End Python Guide for Data Processing, Training, and Inference of AI-Cloned Voices: From Voice Data to Using Pre-trained and Custom Models
Imagine a world where your voice could harmonize with any tune, adopt any accent, or even replicate the iconic timbre of legendary singers. This is a reality made possible through AI singing voice cloning.
This groundbreaking technology merges the art of music with the precision of machine learning, enabling us to create new songs or reimagine classics in any voice we desire.
AI voice cloning is a technology that captures the unique characteristics of a voice and then reproduces it with astonishing accuracy. This digital alchemy allows us to not only replicate existing voices but also to create entirely new ones.
It’s a tool that has revolutionized content creation, from personalized songs to custom voiceovers, opening up a world of creative possibilities that transcend language and cultural barriers.
The objective of this article is to provide technical readers with a comprehensive Python guide to utilizing AI voice cloning technology — an end-to-end solution for transforming any audio into the tones of a chosen artist or even one’s own voice by training a custom model.
This tutorial article is structured as follows:
- 1. Technology and Theoretical Concepts Explained
- 2. Using SO-VITS-SVC Python Library for Inference
- 3. Training Your Own Custom AI Model to Sing
- 4. Practical Applications and Conclusion
1. Technology Background
The technology that we will be using in this article is called Singing Voice Conversion (SVC), in particular a system called SO-VITS-SVC, which stands for “SoftVC VITS Singing Voice Conversion”.
The SO-VITS-SVC system represents a sophisticated implementation of Singing Voice Conversion (SVC) using deep learning technologies. Understanding this system requires an appreciation of the specific machine learning architectures and algorithms it employs.
1.1 Variational Inference and Generative Adversarial Networks
- At the heart of SO-VITS-SVC lies the Variational Inference for Text-to-Speech (VITS) architecture. This system ingeniously combines Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).
- VAEs are utilized for modeling the distribution of mel-spectrograms, which are a key representation of the audio signal in SVC. The VAE component helps in capturing the latent variables of speech.
- The VAE loss function can be written as
$\mathcal{L}_{\mathrm{VAE}} = -\,\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big] + \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)$
where x is the input mel-spectrogram, z is the latent variable, and KL denotes the Kullback-Leibler divergence.
Equation 1. This formula encapsulates the VAE loss function, balancing the reconstruction of mel-spectrograms with the regularization of the latent space through the Kullback-Leibler divergence.
- GANs enhance the realism of the synthesized audio. The discriminator in the GAN critiques the output of the generator, improving its accuracy. The GAN loss function is given by
$\min_G \max_D \; \mathbb{E}_{x}\big[\log D(x)\big] + \mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big]$
Equation 2. The GAN loss function showcases the adversarial training dynamics, driving the generative model to produce indistinguishable singing voice replicas.
For a comprehensive understanding of Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), you might want to refer to the original papers introducing these concepts:
- VAEs: Kingma, D. P., and Welling, M. “Auto-Encoding Variational Bayes.” arXiv:1312.6114, 2013.
- GANs: Goodfellow, I. J., et al. “Generative Adversarial Nets.” arXiv:1406.2661, 2014.
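To make the two loss terms above concrete, here is a minimal, illustrative PyTorch sketch that combines a VAE reconstruction/KL objective with a GAN adversarial objective on toy mel-spectrogram tensors. It is not the SO-VITS-SVC implementation; the tensor shapes, the L1 reconstruction term, and the loss formulation are assumptions made purely for demonstration.
import torch
import torch.nn.functional as F
def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction term: how well the decoded mel-spectrogram matches the input
    recon = F.l1_loss(x_hat, x)
    # KL term: regularizes the approximate posterior q(z|x) towards N(0, I)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
def gan_losses(disc_real, disc_fake):
    # Standard adversarial losses on discriminator logits (toy formulation)
    d_loss = F.binary_cross_entropy_with_logits(disc_real, torch.ones_like(disc_real)) \
        + F.binary_cross_entropy_with_logits(disc_fake, torch.zeros_like(disc_fake))
    g_loss = F.binary_cross_entropy_with_logits(disc_fake, torch.ones_like(disc_fake))
    return d_loss, g_loss
# Toy tensors standing in for mel-spectrograms and discriminator outputs
x = torch.randn(4, 80, 100)        # batch of 4 mel-spectrograms (80 bins, 100 frames)
x_hat = torch.randn(4, 80, 100)    # decoder output
mu, logvar = torch.zeros(4, 192), torch.zeros(4, 192)  # latent statistics
print("VAE loss:", vae_loss(x, x_hat, mu, logvar).item())
d_loss, g_loss = gan_losses(torch.randn(4, 1), torch.randn(4, 1))
print("D loss:", d_loss.item(), "G loss:", g_loss.item())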
1.2 Shallow Diffusion Process
- As illustrated in the accompanying diagram, the shallow diffusion process starts with a noise sample that is progressively refined into a structured mel-spectrogram through a series of transformations.
Figure. 1: Shallow Diffusion - The diagram presents the SO-VITS-SVC synthesis pipeline, from the initial noise generation in the shallow diffusion model to the mel-spectrogram refinement and final vocoding for audible voice output.
1. Initial Noise Sample: A visual representation of noise that serves as the starting point for the diffusion process.
2. Transformation Steps: The noise undergoes a series of steps within the diffusion model, transitioning from a disordered state towards a structured mel-spectrogram. Each forward (noising) step can be described as
$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$
where $x_t$ is the data at step t, $\beta_t$ is the noise schedule, and $\epsilon$ represents Gaussian noise; the diffusion model is trained to reverse this process.
Equation 3. This formula illustrates the gradual transformation in the diffusion process, morphing random noise into structured data to capture the nuances of the target singing voice.
- In the context of SO-VITS-SVC, ‘shallow’ likely signifies fewer layers or steps, striking a balance between computational efficiency and audio quality.
3. Mel-Spectrogram Refinement: The outcome of this process is a mel-spectrogram that encapsulates the audio content of the singing voice, ready for the next synthesis phase.
4. Vocoding: The final vocoding step converts the mel-spectrogram into an audio waveform, which is the audible singing voice.
For an in-depth exploration of diffusion models, the following resources provide detailed explanations and research context:
- Sohl-Dickstein, J., et al. “Deep Unsupervised Learning using Nonequilibrium Thermodynamics.” arXiv:1503.03585, 2015.
- Ho, J., et al. “Denoising Diffusion Probabilistic Models.” arXiv:2006.11239, 2020.
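As a concrete illustration of Equation 3, the short NumPy sketch below repeatedly applies the forward (noising) step to a toy mel-spectrogram. The noise schedule, the number of steps, and the array shape are arbitrary choices for demonstration; the actual shallow diffusion model in SO-VITS-SVC learns the reverse, denoising direction.
import numpy as np
rng = np.random.default_rng(0)
# Toy "mel-spectrogram": 80 mel bins x 100 frames
x = rng.standard_normal((80, 100))
# Linear noise schedule (arbitrary illustrative values)
betas = np.linspace(1e-4, 0.05, 50)
for beta_t in betas:
    eps = rng.standard_normal(x.shape)                      # Gaussian noise
    x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * eps   # x_t from x_{t-1}
print(round(float(x.mean()), 3), round(float(x.std()), 3))  # approaches pure Gaussian noise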
1.3 Integration of the Synthesis Process with the SVC System
1. Mel-Spectrogram Refinement:
After the shallow diffusion model has structured the noise into a more coherent form, as visualized in the previously mentioned diagram, the resulting mel-spectrogram captures the nuanced audio content of the singing voice. This mel-spectrogram serves as a pivotal bridge between the raw, unstructured data and the final vocal output.
2. Vocoding:
The vocoder is then employed to convert the refined mel-spectrogram into an audio waveform. This step is where the transformation from visual data to the audible singing voice occurs. The vocoder’s role is to synthesize the nuances of pitch, timbre, and rhythm that have been captured in the mel-spectrogram, thereby producing the final singing voice output.
3. Training and Optimization:
To achieve this high-fidelity synthesis, the SO-VITS-SVC system undergoes rigorous training and optimization. The training involves optimizing a combined loss function that balances the contributions of the VAE, GAN, and the diffusion model components.
This optimization is conducted using algorithms like Stochastic Gradient Descent or Adam, with the ultimate goal of minimizing the overall loss. The process ensures that the final output closely resembles the target singing voice in terms of timbre, pitch, and rhythm.
4. Final Output:
The end product of this process is a synthesized voice that closely mirrors the target singing voice. The ability to maintain the musicality and expressive nuances of the source while adopting the tonal qualities of the target is a testament to the sophistication of the SO-VITS-SVC system.
For those who are new to machine learning and deep learning, foundational resources and courses from educational platforms can offer a starting point to build up the necessary background knowledge:
- Coursera — Deep Learning Specialization: https://www.coursera.org/specializations/deep-learning
- MIT OpenCourseWare — Introduction to Deep Learning: https://ocw.mit.edu/courses/6-036-introduction-to-deep-learning-spring-2021/
1.4 Python Libraries Used
The SO-VITS-SVC Fork, hosted on GitHub, is a specialized tool designed for real-time singing voice conversion. This fork of the original SO-VITS-SVC project offers enhanced features such as more accurate pitch estimation using CREPE, a graphical user interface (GUI), faster training times, and the convenience of installation with pip.
It also integrates QuickVC and corrects some issues present in the original repository. The fork supports real-time voice conversion, making it a versatile tool for voice cloning tasks. Additionally, it simplifies the installation and setup process, making it more accessible for users who wish to experiment with voice cloning technology.
2. Inference: Sing with any Artist’s AI Voice
Inference refers to the process where a neural network model, after being trained on a dataset to understand a particular voice, generates new content with that learned voice.
This phase is where we can ‘sing’ with an artist’s AI voice by providing new inputs (raw singing voice audio) to a pre-trained model, which then produces an output that mimics the artist’s singing style on the raw voice audio.
2.1 Setting Up the SO-VITS-SVC environment
For simplicity, we will be using a Jupyter Notebook with a virtual environment. So we suggest starting there:
- Install Anaconda: Download and install Anaconda on your system, which will allow you to create isolated environments for different projects.
- Open the Anaconda terminal, create a new environment by running the following command.
conda create -n sovits-svc
- If you use VS Code, you can select the environment from the kernel picker; otherwise, if you continue in Anaconda, activate the environment with conda activate and then run the Jupyter notebook.
conda activate sovits-svc
- Install the necessary libraries in the environment from within the notebook.
!python -m pip install -U pip wheel
%pip install -U ipython
%pip install -U so-vits-svc-fork
- To avoid the issue below later on when we run the !svc command, go to the Anaconda environment, uninstall Torchaudio with pip uninstall torchaudio, and install it again with pip install torchaudio.
Figure. 2: This error is often related to the TorchAudio library, which is essential for running the !svc command. Ensuring all dependencies are correctly installed and paths are properly set can resolve this issue; typically, uninstalling and reinstalling TorchAudio resolves it.
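Before moving on, it helps to confirm from within the notebook that the Torch stack imports cleanly and that the svc command-line tool is on the path. The cell below is only a sanity check and simply prints versions plus the CLI help text.
import torch
import torchaudio
# Confirm the Torch stack imports cleanly and report whether a GPU is visible
print('torch:', torch.__version__)
print('torchaudio:', torchaudio.__version__)
print('CUDA available:', torch.cuda.is_available())
# The CLI should print its help text if so-vits-svc-fork installed correctly
!svc --help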
We are now ready to make a 'Singing Voice Conversion' using a pre-trained model on clean vocals (i.e., vocals with no background noise).
2.2 Using a Pre-trained Model for Singing Voice Generation
1) Selecting a pre-trained model
With the environment ready, the next step is to obtain a pre-trained model. We’ve made several pre-trained models available that can be used, from Drake to Michael Jackson! Here are the options:
Figure. 3: Diverse collection of SO-VITS-SVC pre-trained models, each fine-tuned to replicate the unique vocal stylings of different artists, available for voice synthesis experimentation and application on Huggingface via Entreprenerdly.com.
Once it’s decided which model to use, the .pth PyTorch model file and the associated config.json must be retrieved and downloaded:
from huggingface_hub import hf_hub_download
import os
# Set the repository ID and local directory
repo_id = 'Entreprenerdly/drake-so-vits-svc'
local_directory = '.'
# Download the config.json file
config_file = hf_hub_download(
repo_id=repo_id,
filename='config.json',
local_dir=local_directory,
local_dir_use_symlinks=False
)
# Construct the path to the config file in the current directory
local_config_path = os.path.join(local_directory, 'config.json')
print(f"Downloaded config file: {local_config_path}")
# Download the model file
model_file = hf_hub_download(
repo_id=repo_id,
filename='G_106000.pth',
local_dir=local_directory,
local_dir_use_symlinks=False
)
# Construct the path to the model file in the current directory
local_model_path = os.path.join(local_directory, 'G_106000.pth')
print(f"Downloaded model file: {local_model_path}")
2) Selecting a clean audio file
Next, we’ll download a clean audio file for conversion. The audio file we’re using can be accessed and downloaded via Google Drive.
It’s a clean Justin Bieber vocal track, which you can then feed into the SO-VITS-SVC system for voice conversion using the selected model.
While this might not be the cleanest track, when preparing your own audio files for conversion, it’s crucial to ensure that they are as clean as possible to avoid any unintended artifacts or quality issues in the generated audio.
The quality of the source audio significantly impacts the fidelity of the voice conversion, and thus a high-quality, clean recording is always recommended.
import requests
vocals_url = 'https://drive.google.com/uc?id=154awrw0VxIZKQ2jQpHQQSt__cOUdM__y'
response = requests.get(vocals_url)
with open('vocals.wav', "wb") as file:
file.write(response.content)
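Since source quality strongly affects the result, it is worth inspecting the downloaded file before converting it. The snippet below uses the soundfile library, which is not part of the SO-VITS-SVC tooling and may require a pip install soundfile, to report duration, sample rate, and channel count.
import soundfile as sf
# Inspect the downloaded vocal track before running the conversion
data, sample_rate = sf.read('vocals.wav')
duration_s = len(data) / sample_rate
channels = 1 if data.ndim == 1 else data.shape[1]
print(f'Duration: {duration_s:.1f}s, sample rate: {sample_rate} Hz, channels: {channels}')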
3) Running the inference
To perform voice conversion using the SO-VITS-SVC model, you will need to specify the paths to your audio file, model checkpoint, and configuration file. Here’s how you can set the paths and run the inference.
Make sure that you are running this command in an environment that has access to the !svc command, i.e., a notebook or shell where the SO-VITS-SVC tool is installed.
from IPython.display import Audio, display
import os
# Filenames
audio_filename = 'vocals.wav'
model_filename = 'G_106000.pth'
config_filename = 'config.json'
# Construct the full local paths
audio_file = f"\"{os.path.join('.', audio_filename)}\""
model_path = f"\"{os.path.join('.', model_filename)}\""
config_path = f"\"{os.path.join('.', config_filename)}\""
# Running the inference command
!svc infer {audio_file} -m {model_path} -c {config_path}
4) Display the output
After running the inference, you can display the output audio directly in your Jupyter notebook or any IPython interface. The following code snippet will play the resulting audio file:
from IPython.display import Audio, display
# Path for the output audio file
output_audio_path = "vocals.out.wav"
# Display the output audio
display(Audio(output_audio_path, autoplay=True))
5) Optional — Use the GUI to make inferences
For users who prefer a graphical interface, the SO-VITS-SVC system provides an optional GUI for performing voice conversion.
This feature-rich GUI streamlines the inference process, offering an alternative to command-line operations. It can be launched with the following command:
svcg
Figure. 4: A screenshot of the SO-VITS-SVC graphical user interface, displaying the settings for input paths, audio processing parameters, and real-time inference options.
- Setup: Launch the GUI and configure the necessary paths for the model, configuration, and audio files using the ‘Browse’ buttons.
- Model Selection: Choose the appropriate pre-trained model via the ‘Model path’ field.
- Configuration: Select the configuration file that corresponds to your model in the ‘Config path’ field.
- Audio Input: Load your target audio file for conversion using the ‘Input audio path’.
- Inference: Click on the ‘Infer’ button to start the conversion process, adjusting any additional parameters if needed.
- Output: Save and listen to the generated output directly through the GUI, ensuring the quality meets your expectations.
The GUI also offers real-time inference capabilities, allowing for adjustments on-the-fly and immediate auditory feedback. It’s designed for ease of use, catering to users who might not be comfortable with scripting or command-line tools.
3. Training Your Own Custom AI Model to Sing
Training a custom model starts with preparing a dataset of clean vocal recordings of the target voice.
3.1 Data Preparation
- Separating vocals from background music: If your source recordings contain instrumentals, a tool such as Spleeter can be used to isolate the vocal stem before training, as shown below.
!pip install spleeter
from spleeter.separator import Separator
# Initialize the separator with the desired configuration.
# Here, 'spleeter:2stems' means we want to separate the audio into two stems: vocals and accompaniment.
separator = Separator('spleeter:2stems')
# Use the separator on the audio file.
# This function will separate the audio file into two files: one containing the vocals, and one containing the background music.
separator.separate_to_file('audiofile.mp3', './')
- Splitting the audio track into 15-second snippets: We can use AudioSlicer to split an extensive audio file into 10–15 second snippets suitable for training the model.
from audioslicer import slice_audio
# Path to the input audio file
input_audio_path = 'long_audio_file.mp3'
# Path to the output directory where snippets will be saved
output_directory = 'output/snippets/'
# Length of each audio snippet in seconds
snippet_length = 15
# Slice the audio file into snippets
slice_audio(input_audio_path, output_directory, snippet_length)
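If the AudioSlicer package is not available in your environment, the same splitting can be approximated with pydub (pip install pydub, with ffmpeg on the system path). This is an alternative sketch rather than part of the SO-VITS-SVC tooling, and the file names are the same placeholders as above.
import os
from pydub import AudioSegment
# Load the long recording (pydub relies on ffmpeg for mp3 input)
audio = AudioSegment.from_file('long_audio_file.mp3')
chunk_ms = 15 * 1000  # 15-second chunks
os.makedirs('output/snippets', exist_ok=True)
for i, start in enumerate(range(0, len(audio), chunk_ms)):
    chunk = audio[start:start + chunk_ms]
    chunk.export(f'output/snippets/snippet_{i:03d}.wav', format='wav')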
3.2 Automatic Preprocessing
With the dataset_raw folder created in the current directory and the recordings stored in the dataset_raw/{speaker_id} directory, as outlined in the folder structure below, you’re set to initiate the preprocessing and training phases for your singing voice conversion model.
.
├── dataset_raw
│ └── {speaker_id}
│ └── {wav_file}.wav
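The snippet below shows one way to assemble that structure from a folder of prepared snippets; the speaker name and the source directory are placeholders to adapt to your own setup.
import os
import shutil
from glob import glob
speaker_id = 'my_voice'          # placeholder: name of the target speaker
snippet_dir = 'output/snippets'  # folder containing the 10-15 second wav snippets
target_dir = os.path.join('dataset_raw', speaker_id)
os.makedirs(target_dir, exist_ok=True)
# Copy every snippet into dataset_raw/{speaker_id}/
for wav_path in glob(os.path.join(snippet_dir, '*.wav')):
    shutil.copy(wav_path, target_dir)
print(f'Copied {len(os.listdir(target_dir))} files to {target_dir}')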
The preprocessing involves three key steps to prepare the audio data for model training: !svc pre-resample, !svc pre-config, and !svc pre-hubert, before finally running !svc train.
1. Resampling Audio: Using the !svc pre-resample command, the system standardizes the sample rate of your audio files. This process navigates through the dataset_raw/{speaker_id} directory and ensures all .wav files are resampled correctly.
!svc pre-resample
After execution, you will find the resampled audio files in a similar directory structure under a new sample-rate directory, typically dataset/44k/{speaker_id}.
2. Configuration File Generation: Next, you’ll generate a configuration file that guides the training process. The svc pre-config command creates this file, leveraging the resampled audio data.
!svc pre-config
The newly generated config.json will be in the config/44k/ directory relative to your current working directory.
3. Pitch Contour Extraction: Pitch extraction is crucial for a singing voice model. By running svc pre-hubert, the system analyzes each audio file in your dataset and extracts the pitch contours along with the content features used for training.
!svc pre-hubert
This step may require you to specify the preferred method for pitch extraction, such as DIO or CREPE; the available options can be listed as shown below.
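The exact flag name for selecting the pitch extraction method can vary between versions of so-vits-svc-fork, so rather than hard-coding it here, list the options exposed by your installation:
# List this subcommand's options, including the pitch (f0) extraction method
!svc pre-hubert --help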
3.3 Training Configuration
Prior to initiating training, it’s essential to configure your model to ensure optimal learning conditions. This involves editing the config.json file created in the config/44k/ directory during preprocessing. Key parameters within this configuration file include:
- log_interval: Sets the frequency at which the loss will be printed to the console during training, allowing you to monitor the model’s learning progress.
- eval_interval: Determines how often the model’s state will be saved. This is particularly important because model checkpoints can be large, and you want consistent progress without using excessive storage.
- epochs: Defines the total number of training cycles the model will undergo. More epochs can lead to better learning but also require more time.
- batch_size: Specifies how many samples are processed at once. The ideal batch size depends on the available VRAM on your GPU; too large a batch size may exceed your VRAM capacity, while too small a one can lead to inefficient learning.
Figure. 5: Snapshot of the configuration settings for the model training, as specified in the 'config.json' file, displaying key parameters like batch size, learning rate, and intervals for evaluation and logging.
For a dataset with 200 samples and a batch size of 20, each epoch equates to 10 steps. If you aim for 100 epochs, this translates to 1,000 steps. It’s advisable to add one to the epoch count to ensure the last step is saved.
The default setting might suggest 10,000 epochs, but depending on your hardware and dataset size, you may need to adjust this. A practical approach could be aiming for 20,000 steps and evaluating the performance before deciding whether to extend training.
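Rather than editing the file by hand, the values can also be adjusted from the notebook with the json module. The sketch below assumes the standard layout with a 'train' section and uses example values only; tune them to your GPU memory and dataset size, and adjust the path if your config was written elsewhere.
import json
# Path created by `svc pre-config`; adjust if your layout differs
config_path = 'config/44k/config.json'
with open(config_path) as f:
    config = json.load(f)
# Example values only; tune them to your hardware and dataset
config['train']['batch_size'] = 16
config['train']['log_interval'] = 200
config['train']['eval_interval'] = 800
config['train']['epochs'] = 10000
with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)
print({k: config['train'][k] for k in ('batch_size', 'log_interval', 'eval_interval', 'epochs')})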
3.4 Start Training
Begin the actual model training with the svc train command. This will start the machine learning process, utilizing the preprocessed data from the dataset/44k/{speaker_id} directory and the configuration settings in config/44k/config.json.
!svc train
During training, the model’s progress will be saved periodically, as specified by your eval_interval. If you need to stop the training for any reason, you can resume from the last saved checkpoint. This flexibility makes training more manageable, especially when dealing with large datasets or limited computational resources.
As training proceeds, your console will display the loss at intervals set by log_interval, giving you insight into how the model is learning. Upon completion, the logs/44k directory will contain various files, including the trained generator models (denoted as G followed by a step number). You can then use these models to synthesize your voice, creating your desired audio outputs.
By carefully configuring and running these steps, you will train a model capable of converting singing voice audio with nuances and qualities similar to the original recordings found in your dataset_raw/{speaker_id} directory.
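When training finishes, or whenever you stop it, the small helper below locates the most recent generator checkpoint in logs/44k by sorting the G_*.pth files by their step number, so you can pass it to the inference command in the next section. The path is the default used above; adjust it if yours differs.
import os
import re
from glob import glob
# Collect all generator checkpoints produced during training
checkpoints = glob(os.path.join('logs', '44k', 'G_*.pth'))
if checkpoints:
    # Sort by the training step encoded in the filename and keep the newest one
    latest = max(checkpoints, key=lambda p: int(re.search(r'G_(\d+)\.pth', os.path.basename(p)).group(1)))
    print('Latest generator checkpoint:', latest)
else:
    print('No checkpoints found in logs/44k yet.')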
3.5 Model Inference
After the model has been trained, fine-tuned, and validated, the next step is to run inference to convert the source audio into the target voice as per the previous section.
from IPython.display import Audio, display
import os
# Filenames
audio_filename = 'vocals.wav' # vocals to apply the trained model to
model_filename = 'model.pth' # model file created during training
config_filename = 'config.json' # config file created during preprocessing
# Construct the full local paths
audio_file = f"\"{os.path.join('.', audio_filename)}\""
model_path = f"\"{os.path.join('.', model_filename)}\""
config_path = f"\"{os.path.join('.', config_filename)}\""
# Running the inference command
!svc infer {audio_file} -m {model_path} -c {config_path}
After running the inference, you might want to fine-tune the output further. For instance, if the source audio was singing and you need to mix it with background music, or if you wish to adjust the pitch, you can use audio processing tools like Audacity or any other digital audio workstation to mix and match tracks to your preference.
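If you separated the original song with Spleeter earlier, a simple programmatic alternative to a DAW is to overlay the converted vocals on the accompaniment stem with pydub. The file names below assume the Spleeter 2stems output layout and the vocals.out.wav name produced by the inference step; adjust them to your own paths.
from pydub import AudioSegment
# Converted vocals produced by `svc infer` and the accompaniment stem from Spleeter
vocals = AudioSegment.from_file('vocals.out.wav')
accompaniment = AudioSegment.from_file('audiofile/accompaniment.wav')
# Overlay the cloned vocals on top of the backing track and export the mix
mix = accompaniment.overlay(vocals)
mix.export('final_mix.wav', format='wav')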
4. Practical Applications of the Solution
- Music Production Enhancement — Producers experiment with different vocal textures and styles without requiring the physical presence of the artists. This can be particularly useful for conceptualizing new tracks or for adding backing vocals in different pitches and timbres.
- Personalized Music Experience — Imagine a service where fans can receive a version of their favorite song sung in the voice of another artist or even in their own voice using a custom model.
- Film and Animation Dubbing — Dubbing for films, cartoons, and games can benefit from SO-VITS-SVC by generating voice overs in various character voices, especially when the original actors are unavailable. It can create consistent vocal performances across different language versions.
- Educational Tools — In language learning, SO-VITS-SVC can enhance comprehension by allowing learners to hear text read aloud in their own voice with the correct accent and intonation of a variety of languages, supporting the development of listening and speaking skills.
- Voice Restoration — For individuals who have lost the ability to speak, SO-VITS-SVC could be trained on recordings of their voice to restore their ability to communicate in a manner that retains the essence of their original voice.
- Reviving Historical Voices — Bring historical speeches to life by cloning voices from the past, providing a compelling auditory experience in museums or educational content.
Conclusion
The implications for this technology are profound and far-reaching. In the immediate future, we might see a democratization of music production, where independent artists leverage AI to duet with any voice from history, or educators use it to obliterate language barriers in classrooms across the globe. But the true potential of SO-VITS-SVC lies in its long-term applications — those we are yet to fully comprehend or imagine.
Imagine a world where every digital interaction is personalized not just to your visual preferences but tailored to auditory comfort. A world where virtual reality isn’t just a visual spectacle but an auditory feast, where every character, every entity, can converse with you in voices that are indistinguishable from real human ones. The SO-VITS-SVC could be the harbinger of virtual beings with their own unique vocal identities, capable of singing, speaking, and interacting in a multitude of languages and styles.
As we push the boundaries of AI voice cloning, we may soon blur the line between creators and creations. Characters from fiction might release their own albums, and poets long gone might recite new verses. The implications for copyright, identity, and even the essence of creation itself could be challenged and redefined.