How to Transform a Voice?

The Problem

You have someone's audio recording -- a podcast, a music track of a singer -- and you want to make it speak or sing faster or slower, change the pitch, change the timbre, etc.

So, how do you change only what you want while preserving everything else about the voice?

Quick Answer

The basics are in OverLap-Add (OLA), but it takes quite a lot of care, and a bit of AI squeezed in at the right joints, to get top quality at blazing speed.

How you do it with Pitchmeld

import soundfile as sf
import pitchmeld as pm
audio, sr = sf.read("input.wav")
syn = pm.transform(audio, sr, psf=1.5, esf=0.7, pbf=0.75)
sf.write("output.wav", syn, sr)

Audio Examples

Below are examples of combined transformations applied to a male singing vocal recording. Each sample adds another transformation parameter.

First, a simple pitch raise by a ~fifth:

[Audio sample]

Now let's add some spectral envelope compression to change the timbre:

[Audio sample]

And finally, let's slow it down by 25% on top of the previous transformations:

[Audio sample]

Voilà! Three transformations, one call.
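A side note on the "about a fifth" in the first example. The post does not define the parameters of the call above, so the following is only a sketch under one assumption: that a pitch factor is a plain frequency ratio. Under that reading, intervals and ratios convert as below. The helper names are hypothetical and are not part of Pitchmeld.

```python
import math

def semitones_to_ratio(semitones):
    """Frequency ratio for an interval given in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)

def ratio_to_semitones(ratio):
    """Interval in semitones for a given frequency ratio."""
    return 12.0 * math.log2(ratio)

# An equal-tempered fifth is 7 semitones, a ratio just under 1.5.
fifth_ratio = semitones_to_ratio(7)

# Conversely, a ratio of exactly 1.5 (a just fifth, 3:2) is slightly
# more than 7 semitones.
semitones_for_1_5 = ratio_to_semitones(1.5)

print(f"7 semitones -> ratio {fifth_ratio:.4f}")
print(f"ratio 1.5   -> {semitones_for_1_5:.4f} semitones")
```

The two values differ by about two hundredths of a semitone, which is why a factor of 1.5 can fairly be called "about a fifth".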

The Underlying Issues

A waveform represents both temporal events and frequency characteristics. In order to control one characteristic independently from the others, it is necessary to dissociate them. For example, frame-based approaches have proven perfectly adequate for controlling time and pitch independently.

Two challenges remain. The first is manipulating each frame to get the desired effect, although this is straightforward for time stretching and pitch scaling. The second, and bigger, challenge is stitching the frames back together without discontinuities. Each junction is a potential click, and with hundreds of junctions per second, the clicks quickly accumulate into audible artifacts. See the OLA post for the details.
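To make the stitching issue concrete, here is a minimal sketch of the easy case: frames that are windowed and overlap-added back unmodified. It assumes a periodic Hann window at 50% overlap, a standard textbook choice and not necessarily what Pitchmeld uses. With that choice, overlapping windows sum to exactly 1, so the junctions are invisible. The function name and parameters are illustrative.

```python
import numpy as np

def ola_roundtrip(x, frame_len=1024, hop=512):
    """Split x into Hann-windowed frames, then overlap-add them back."""
    # Periodic Hann window: copies shifted by frame_len / 2 sum to 1.
    n = np.arange(frame_len)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_len)
    out = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, hop):
        out[start:start + frame_len] += x[start:start + frame_len] * window
    return out

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220.0 * t)   # one second of a 220 Hz tone
y = ola_roundtrip(x)

# Away from the edges, each sample is covered by two windows summing to 1,
# so the reconstruction matches the input up to rounding error.
interior = slice(1024, 15000)
max_error = np.max(np.abs(x[interior] - y[interior]))
print(f"max interior error: {max_error:.2e}")
```

The difficulty starts once frames are moved or modified before recombination: the overlapping frames no longer agree with each other, and the cross-fade alone cannot hide the mismatch.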

State of the Art

Phase Vocoder: operates in the spectral domain via FFT. Elegant, but introduces "phase vocoder warble" -- a metallic artifact from phase mismatch errors. The quality depends heavily on window size: short windows preserve transients but smear harmonics; long windows preserve harmonics but smear transients.
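The window-size trade-off above is plain arithmetic, sketched here for a few common FFT sizes. The sample rate and window lengths are arbitrary illustrative values.

```python
def stft_resolution(window_len, sr):
    """Return (frequency bin spacing in Hz, window duration in ms)."""
    bin_spacing_hz = sr / window_len
    duration_ms = 1000.0 * window_len / sr
    return bin_spacing_hz, duration_ms

sr = 44100
for window_len in (256, 1024, 4096):
    hz, ms = stft_resolution(window_len, sr)
    print(f"{window_len:5d} samples: {hz:7.2f} Hz per bin, {ms:6.2f} ms window")
```

At 44.1 kHz, a 256-sample window lasts under 6 ms, short enough to follow a transient, but its bins are about 172 Hz apart, too coarse to separate the harmonics of a low voice. A 4096-sample window resolves harmonics about 11 Hz apart but averages over roughly 93 ms of signal.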

PSOLA: the classical time-domain approach for pitch scaling. Windows centered on the energy maximum of each pitch period are repositioned according to a new timeline or a new pitch, then recombined with OLA cross-fading. The result is practically artifact-free pitch scaling with no phase accumulation errors. Problem: it needs very robust pitch estimation and precise localization of the energy maximum of each pitch period. Both problems look easy in the classroom, but are ill-defined at the scale of an industrial context.
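For a sense of what the "classroom" version of pitch estimation looks like, here is a textbook autocorrelation estimator run on a clean synthetic tone. It is only an illustration of the easy case, with hypothetical names and parameter values. It is not a robust estimator and not what Pitchmeld does.

```python
import numpy as np

def estimate_f0_autocorr(frame, sr, fmin=60.0, fmax=500.0):
    """Textbook pitch estimate: pick the strongest autocorrelation peak."""
    frame = frame - np.mean(frame)
    # Full autocorrelation, keeping non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Restrict the search to lags matching the allowed pitch range.
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sr / best_lag

sr = 16000
t = np.arange(2048) / sr
# Synthetic "voice": a 200 Hz fundamental plus two harmonics, no noise.
frame = (np.sin(2 * np.pi * 200 * t)
         + 0.5 * np.sin(2 * np.pi * 400 * t)
         + 0.25 * np.sin(2 * np.pi * 600 * t))
f0 = estimate_f0_autocorr(frame, sr)
print(f"estimated f0: {f0:.1f} Hz")
```

On this clean, stationary signal the estimate lands on the 200 Hz fundamental. Real recordings add noise, reverberation, vibrato, breathy or creaky phonation, and octave ambiguities, which is where the simple peak-picking above starts to fail.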

AI-based: Many approaches exist here.

A naive one is to transcribe the audio with Automatic Speech Recognition (ASR), then resynthesize it at a different pitch, a different speed, or with a different voice model. This is computationally expensive and requires a GPU, and it risks introducing timbral changes, identity drift, and transcription errors.

A smarter one follows the same principle without fully transcribing: the audio is encoded into an intermediate representation (e.g. HuBERT, wav2vec 2.0) and then resynthesized at the desired pitch (e.g. with a VITS-based synthesizer). It still requires a GPU and is computationally expensive, but it gives better control over timbral and identity changes (e.g. RVC).

Another approach is to start from a solid generative Text-To-Speech (TTS) synthesizer and train it to synthesize from conditioning inputs, in this case the original audio recording and a text describing the desired transformation. We're not gonna lie, the results are particularly impressive and the range of possible transformations is close to unlimited (e.g. ElevenLabs).

Problem: this needs GPU power, either local, which is not available on all devices, or in the cloud through an API connection, which adds tremendous latency for real-time processing. That just doesn't fit certain scenarios. And for time stretching, pitch scaling, and other simple transformations, a heavy end-to-end AI pipeline is costly overkill.

The Method Used in Pitchmeld

Pitchmeld uses an OLA-based pipeline with frame resampling:

Framing: The signal is split into overlapping frames.
Frame scheduling: Frames are dropped or duplicated to fit the new timeline.
Frame processing: Each frame is transformed to the target pitch and/or timbre.
OLA Recombination: The processed frames are recombined, preserving the original duration.
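The four steps above can be sketched in their most naive form, for time stretching only. This is a bare OLA illustration under stated assumptions (periodic Hann window, fixed hop, no frame alignment, no pitch or timbre processing), not Pitchmeld's implementation. All names and parameter values are hypothetical.

```python
import numpy as np

def ola_time_stretch(x, rate, frame_len=1024, hop=256):
    """Naive OLA time stretch: rate < 1 slows down, rate > 1 speeds up."""
    n = np.arange(frame_len)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_len)  # periodic Hann
    n_out = int(len(x) / rate)
    out = np.zeros(n_out + frame_len)
    norm = np.zeros(n_out + frame_len)
    # Frame scheduling: the write position advances by `hop`, while the
    # read position advances by `hop * rate`. Input material is therefore
    # partly repeated (rate < 1) or skipped (rate > 1).
    for out_pos in range(0, n_out, hop):
        in_pos = int(out_pos * rate)
        frame = x[in_pos:in_pos + frame_len]   # framing
        if len(frame) < frame_len:
            break
        # Frame processing (pitch, timbre) would go here; skipped in this sketch.
        out[out_pos:out_pos + frame_len] += frame * window
        norm[out_pos:out_pos + frame_len] += window
    # OLA recombination: divide by the summed windows to keep the gain flat.
    covered = norm > 1e-8
    out[covered] /= norm[covered]
    return out[:n_out]

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220.0 * t)
y = ola_time_stretch(x, rate=0.75)   # slower playback, longer output
print(len(x), len(y))
```

This version gets the duration right but makes no attempt to align overlapping frames, so on pitched material the frames interfere with each other and the output wobbles. Closing that gap is exactly the stitching challenge described earlier.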

As argued in the post on mixing DM and AI, this is a problem where 90% of the solution consists of well-defined operations.

Finally, Pitchmeld adds two thin layers of AI, running locally on your CPU, to fill the remaining 10% gap. The result is both high quality and computationally very efficient.

Closing note

So please, don't drop the equations. Save yourself the hassle of API connections and GPU availability. Save some GPU, save some trees, and run it wherever you want, on any hardware, without excessive battery drain. Keep your audio local: no cloud, no latency, no data leaving your machine.