
Demo of AI song covers using RVC (Retrieval-based Voice Conversion)

Before I go on, press “Play” below. While the song is playing, click the circular “joystick” inside the triangle and drag it around. Try taking the joystick all the way to any of the vertices to hear more from a single artist. If you are intrigued by how this works, then read on.

Original Song by Shilpa Rao | Clone #1 by Arijit Singh (reference song) | Clone #2 by Kishore Kumar (reference song from 1958)

In Slavic folklore, there is a powerful woodland spirit called Leshy, who mimics human voices to lure lost wanderers into its caves. While Leshy’s imitation of friendly voices would have sounded magical a thousand years ago, we now live in an age where it is possible using Artificial Intelligence. The above is not just a demo of voice cloning, but a glimpse of how AI is reshaping creative expression and arming artists and creators with new tools to bring their ideas to life.

2025 is shaping up to be the year of open-source AI, kicked off by the revolution that was DeepSeek R1 and followed by OpenAI and Google both launching high-performance reasoning models for free. With the commoditization of AI on the horizon, it is hard to find value in keeping up with AI unless you work on the bleeding edge of this race. For an average person, the value lies in how you wield these tools to fuel creativity and bring your ideas to life in ways that could well have looked like magic before. This voice cloning demo was created using a fantastic piece of technology called RVC. The triangular user interface for sampling the same song in real time in different singers’ voices was also created using ChatGPT.

In this blog post, we’ll cover:

Resources

Here are some helpful links to get started:


What is RVC?

RVC stands for Retrieval-based Voice Conversion, an open-source AI algorithm that enables speech-to-speech transformation with remarkable accuracy. Unlike text-to-speech (TTS), which generates speech from text, RVC focuses on converting one voice into another while preserving vocal attributes like modulation, timbre, and emotional tone. You can speak into it or feed it a recording, and it gives back what you said, but in another voice (the voice the model was trained on), while retaining the original voice acting.

Some popular models for text-to-speech (TTS) and speech-to-speech are:

Language is very powerful and complex. Where text-to-speech falls short is in its ability to convey emotion via tone, pauses, emphasis, pitch, and so on. Where speech-to-speech excels is in retaining those qualities. However, the output depends on the size and quality of the training data. RVC is not competing with TTS; the two are often complementary. You can use TTS to generate speech from text and then pass it through RVC for lifelike voice modulation.


Practical Applications of RVC

  1. Content Creation: Create a clone of your own voice or create unique voices for your videos. The workflow can use TTS for script narration, which is then passed through RVC for realistic tones. If you are creating an audiobook, a combination of TTS + RVC can help.

  2. Gaming, Entertainment: Design personalized voice profiles for video game characters. Mix voices to create entirely new ones for creative projects.

  3. Memes: Ever wondered how Shah Rukh Khan would sound apologizing for making the film Happy New Year? As a test, I recorded my own voice and trained an RVC model on an 18-minute audio clip of SRK for just 20 epochs. And here is the result:

The possibilities are endless: create your own sports commentary in the voice of your favorite sportscaster, produce fan-fiction content and AI song covers, or even resurrect the voices of artists from the sixties.

A Note on Ethical Concerns: While exciting, this technology raises concerns about misuse in deepfakes, identity theft and fake political campaigns. Proceed responsibly!


How Does RVC Work?

As with any large-scale AI model, the under-the-hood workings can be something of a black box. Imagine RVC as a vocal sculptor: it isolates the raw clay of speech (phonetics) and reshapes it with the texture of a target voice (timbre). The RVC pipeline has roughly the following stages:

[Input Speech] → [Feature Extraction] → [Voice Conversion Model] → [Output Speech]

  1. Input Speech Analysis: The algorithm captures input speech in chunks and processes them in real time, maintaining low latency for voice conversion. It analyzes the input, capturing pitch, tone, and timbre.

  2. Feature Extraction: Using a database of pre-trained voice features, it extracts the relevant vocal features from the input speech with algorithms like RMVPE (Robust Model for Vocal Pitch Estimation), which helps prevent muted-sound problems and ensures accurate pitch detection.

  3. Retrieval and Voice Conversion: The algorithm finds pre-recorded speech segments that are most similar to the input speech. It then converts the input speech to match the target voice characteristics, while maintaining the modulation and timbre that are useful for conveying emotion. The magic happens here: in a way, the RVC model separates the input voice features into two buckets:

    • the voice characteristics that are replaced with the target voice (using a model trained on the target voice)
    • the phonetic content and emotion that are retained from the input voice and used with the new target voice
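To make the retrieval step concrete, here is a toy numpy sketch. In the real project the index is built with FAISS over learned content features and blended with the input by an adjustable index rate; everything below (the dimensions, the random vectors, the plain top-k average) is a simplified stand-in that only illustrates the idea of looking up the most similar target-voice features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 256-dim content features for the target voice's
# training set (the "index") and for a few frames of input speech.
target_index = rng.normal(size=(1000, 256))
input_frames = rng.normal(size=(5, 256))

def retrieve(frame, index, k=4):
    # Cosine similarity between the input frame and every indexed feature.
    sims = index @ frame / (np.linalg.norm(index, axis=1) * np.linalg.norm(frame))
    # Average the k most similar target-voice features -- the "retrieval".
    top = np.argsort(sims)[-k:]
    return index[top].mean(axis=0)

# Each input frame is nudged toward the target voice's feature space.
converted = np.stack([retrieve(f, target_index) for f in input_frames])
print(converted.shape)  # (5, 256)
```

The downstream synthesis network then turns these retrieved features, together with the input's pitch curve, back into audio.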

Tutorial: How to Use RVC

Ready to try this yourself? Here’s a step-by-step guide (some knowledge of coding is helpful; if you get stuck, use Google or ChatGPT to work through any errors you encounter).

Overall Workflow

[Image: Overall RVC workflow]

Pre-requisites

Preparing Training Data

Good training data is key! If you feed in garbage, the output you will get is processed garbage. If recording your own audio, use a high-quality microphone and record in a quiet room to avoid background noise. If using existing audio, make sure it is high quality. For a decent training set, it is recommended to have at least 30 to 40 minutes of audio, but 15 to 20 minutes can also give decent results. Ensure language consistency between the training data and the output: for example, if you want the output to be in English, use training data that is also in English. Otherwise, you’ll end up with weird accents (unless that’s what you are going for!)
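As a quick sanity check on dataset size, a short standard-library script can total up the length of your WAV clips against the 30 to 40 minute recommendation. The synthetic sine-wave clips below are just stand-ins for real recordings, and 40 kHz is an arbitrary sample rate chosen for the example.

```python
import math
import struct
import wave

def write_sine_wav(path, seconds, rate=40000, freq=440.0):
    # Write a mono 16-bit sine wave; stands in for a real recording here.
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"".join(
            struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * freq * t / rate)))
            for t in range(int(seconds * rate))
        ))

def duration_minutes(paths):
    # Sum clip lengths to check the dataset size before training.
    total = 0.0
    for p in paths:
        with wave.open(p, "rb") as w:
            total += w.getnframes() / w.getframerate()
    return total / 60

write_sine_wav("clip1.wav", seconds=3.0)
write_sine_wav("clip2.wav", seconds=2.0)
print(round(duration_minutes(["clip1.wav", "clip2.wav"]) * 60))  # 5 (seconds)
```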

I highly recommend processing the training data with UVR (Ultimate Vocal Remover) to remove noise artifacts. Especially when training on songs, UVR is a powerful tool for extracting vocals via stem separation. For creating AI song covers, use the workflow above for preparing training data with UVR.

Pro tips for better training data:

Setting up RVC Web UI

These instructions are only for Windows/Linux/WSL2 and only for Nvidia GPUs. For other GPUs or for a Mac installation, check out the detailed set-up instructions.

1. Clone the official GitHub repository. Download all files or clone the repository from the official GitHub page. I prefer to download the files and run the AI models using WSL2, but Windows works just as well.

2. Prepare the environment

The rest of the instructions apply to both Windows and Linux/WSL. Mac users are better off following the instructions from the official source.

3. Run the Web UI. The hard part is now over.

Training the model

In the Web UI, you will see something like below after going to the “Train” section from the top navigation.

[Image: RVC model training interface]

Now you can begin training. Once training is complete, there will be a model weights file (a PyTorch .pth file) with the same name as your experiment in the assets directory (example: assets\weights\mi-test.pth).

Pro tips for good training:

Inference

Now the fun part begins. You can feed an input voice or audio clip to the model, and it will convert it into output audio in the voice the model was trained on.

[Image: RVC model inference interface]

Pro tips for good inference:


BONUS: How I built the front-end UI above to showcase the model output

First I created the three audio files:

Then I used ChatGPT with the following prompt to arrive at the base JavaScript, CSS, and HTML code that I could add to a webpage.

I have used voice cloning to create 2 AI covers of the same song: Song 1 (the original song in the voice of artist 1), Song 2 (in the voice of artist 2), and Song 3 (in the voice of artist 3). Now I want a creative way to present it on my blog. I want to create a JavaScript-based UI which is an equilateral triangle. The triangle is divided into three equal parts by three lines that meet at its centroid. The centroid is a joystick-like button that can be moved anywhere within the triangle. As the centroid is moved, the lines connecting the button and each of the vertices also move such that they are always connected to the vertices and the button. All three songs should be played simultaneously in perfect sync. Each of the three sections of the triangle should have a background image of one artist. When the button is moved around, the volume of each artist’s song should vary based on the surface area visible for each artist’s image in the triangle. For example, if the button is moved all the way to one of the vertices, then the entire surface area of the triangle is under a single artist, so the other 2 artists should go almost mute. Can you write me the necessary JavaScript, CSS and HTML code to achieve this?

After a few follow-up prompts, I was able to arrive at the desired result. I also used Canva to upscale the artist images (AI is being used everywhere).
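The volume-mixing rule described in the prompt reduces to barycentric coordinates: each artist’s volume is the fraction of the triangle’s area covered by their region. The actual page uses JavaScript; the sketch below shows just the math in Python, with an arbitrary unit-side equilateral triangle.

```python
def area(a, b, c):
    # Absolute triangle area via the shoelace (cross-product) formula.
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2

def volumes(p, v1, v2, v3):
    # Each artist's volume is the area of the sub-triangle formed by the
    # joystick p and the two *other* vertices, divided by the total area.
    # These ratios are exactly the barycentric coordinates of p.
    total = area(v1, v2, v3)
    return (area(p, v2, v3) / total,   # loud when p approaches v1
            area(p, v1, v3) / total,   # loud when p approaches v2
            area(p, v1, v2) / total)   # loud when p approaches v3

# Unit-side equilateral triangle.
v1, v2, v3 = (0.0, 0.0), (1.0, 0.0), (0.5, 3 ** 0.5 / 2)

print(volumes(v1, v1, v2, v3))  # joystick at vertex 1: (1.0, 0.0, 0.0)

centroid = ((v1[0] + v2[0] + v3[0]) / 3, (v1[1] + v2[1] + v3[1]) / 3)
print(volumes(centroid, v1, v2, v3))  # roughly (1/3, 1/3, 1/3)
```

The three values always sum to 1, which keeps the overall loudness of the mix roughly constant as the joystick moves.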


Final Thoughts

RVC isn’t just a cool demo—it’s about enabling creativity at scale. Whether you’re an artist looking to expand your toolkit or just someone who loves experimenting with tech, this is your playground.

So what would a forest dweller from 9th-century Kievan Rus (modern-day Ukraine) think of today’s voice cloning technology? In a way, it is still magic! Somehow, the humans tricked some rocks into thinking. What a time to be alive.

Liked this? Subscribe to my newsletter on Substack.