How to Transform Audio Without Artifacts: From Cross-Fading to OLA

The Problem

You have an audio clip and want to change some of its properties: duration, pitch, timbre, etc. How do you do that without introducing clicks, distortions, warbling, or metallic coloration?

⚠ Disclaimer: This is a fundamental technique that was developed in the signal processing field and is still used extensively for audio processing with AI. There is currently no reason to replace it: human perception loves it, the maths behind it is rock solid, and it is mind-bogglingly fast.

Quick Answer

Split the audio clip into short segments, called frames.
Transform those frames.
Recombine them into a new audio clip.

That's it. Invented in the 1980s and still widely used in AI pipelines, the process is called OverLap and Add (OLA). Done properly, it transforms the audio without artifacts.

Besides the transformation step, which depends on the goal of the task, the tricky bit is the recombination. If the frames are blindly concatenated, artifacts will appear. So you can do some clever things either in the spectral domain (e.g. the phase vocoder) or in the time domain (e.g. SOLA).

How the basics work

The bad idea: Concatenate

You want to connect two audio clips. The naive approach, as shown below, is to concatenate them.

However, because audio waveforms vary quite a lot, the junction from one clip to the next is very likely to be discontinuous. Repeat that many times and your final audio is a mess full of clicks.
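To make the discontinuity concrete, here is a minimal sketch in Python with NumPy. The two sine tones, the 16 kHz sample rate and the 20 ms clip length are assumptions chosen for illustration, not values from any real pipeline. It compares the largest sample-to-sample step inside each clip with the step at the junction.

```python
import numpy as np

SR = 16_000  # sample rate in Hz (assumed for this illustration)

def tone(freq_hz, n_samples, phase=0.0, sr=SR):
    """Return a sine tone of n_samples samples."""
    t = np.arange(n_samples) / sr
    return np.sin(2 * np.pi * freq_hz * t + phase)

# Two 20 ms clips (320 samples each) whose waveforms do not line up.
left = tone(220.0, 320)
right = tone(330.0, 320, phase=-np.pi / 2)  # starts at -1.0

# The naive approach: glue the clips end to end.
naive = np.concatenate([left, right])

# Largest step between neighbouring samples inside the clips ...
step_inside = max(np.abs(np.diff(left)).max(), np.abs(np.diff(right)).max())
# ... versus the step exactly where the two clips meet.
step_at_junction = abs(naive[len(left)] - naive[len(left) - 1])

print(f"largest step inside a clip: {step_inside:.3f}")
print(f"step at the junction:       {step_at_junction:.3f}")
```

The jump at the junction is many times larger than anything inside the clips, and that sudden jump is what the ear hears as a click.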

Cross-fade once

The common solution is to cross-fade the two audio segments:

No more hard transition. No more click.

You can transform the first and second segments in different ways; the waveform will still transition smoothly from one to the next.
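A cross-fade can be sketched in a few lines of NumPy. The function name `crossfade`, the linear fade shape and the 10 ms fade length are assumptions made for this example; other fade curves (raised cosine, equal power) are common too.

```python
import numpy as np

def crossfade(left, right, n_fade):
    """Join two clips by mixing the last n_fade samples of `left`
    with the first n_fade samples of `right` (linear fades, assumed)."""
    if not 1 <= n_fade <= min(len(left), len(right)):
        raise ValueError("n_fade must be between 1 and the shortest clip length")
    fade_out = np.linspace(1.0, 0.0, n_fade)  # gain applied to the end of `left`
    fade_in = 1.0 - fade_out                  # the two gains always sum to 1
    overlap = left[-n_fade:] * fade_out + right[:n_fade] * fade_in
    return np.concatenate([left[:-n_fade], overlap, right[n_fade:]])

# Usage on two mismatched 20 ms tones at an assumed 16 kHz sample rate.
sr = 16_000
t = np.arange(320) / sr
left = np.sin(2 * np.pi * 220.0 * t)
right = np.sin(2 * np.pi * 330.0 * t - np.pi / 2)

joined = crossfade(left, right, n_fade=160)  # 10 ms cross-fade
print("largest sample-to-sample step:", np.abs(np.diff(joined)).max())
```

Note that the output is shorter than the two clips put end to end, because the faded regions overlap in time.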

But the transition is a bit wobbly. So the best approach is to fine-tune the position of the right segment so that its content aligns as well as possible with the content on the left:

and boom! Super smooth, super good quality.

This fine-tuning is actually the key to high-quality audio. Many methods have been suggested, among them the phase vocoder and SOLA.
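One way to picture the time-domain alignment is a small search over candidate shifts, keeping the one whose waveform correlates best with the tail of the left segment. The sketch below is in the spirit of SOLA but is not a faithful implementation of it: the function name `best_offset`, the normalized-correlation score and the choice to simply drop the first samples of the right segment are all assumptions made for illustration.

```python
import numpy as np

def best_offset(left, right, n_fade, max_shift):
    """Return the shift k in [0, max_shift] for which right[k:k + n_fade]
    best matches the last n_fade samples of `left`
    (highest normalized correlation)."""
    tail = left[-n_fade:]
    best_k, best_score = 0, -np.inf
    for k in range(max_shift + 1):
        cand = right[k:k + n_fade]
        if len(cand) < n_fade:
            break  # ran out of samples in `right`
        denom = np.linalg.norm(tail) * np.linalg.norm(cand)
        score = float(np.dot(tail, cand) / denom) if denom > 0 else 0.0
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Toy signals: the same tone (period of 80 samples), but the right
# segment starts 30 samples out of step with the left one.
n = np.arange(640)
left = np.sin(2 * np.pi * n / 80)
right = np.sin(2 * np.pi * (n + 30) / 80)

k = best_offset(left, right, n_fade=160, max_shift=79)
aligned = right[k:]  # this is what would be cross-faded with `left`
print("chosen shift in samples:", k)
```

After the shift, the start of `aligned` lines up with the end of `left`, so the cross-fade mixes two waveforms that already agree instead of two that fight each other.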

Cross-fade every 10 ms ⇒ OverLap-Add (OLA)

You can repeat this cross-fading process every 10 ms, and you get what's called an OverLap and Add (OLA) process:
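Repeated cross-fading can be sketched as a loop over overlapping frames, where a window plays the role of the fade-in and fade-out. Everything specific here is an assumption for the example: the 16 kHz sample rate, 20 ms frames with a 10 ms hop, the periodic Hann window, and a `transform` callback that must return a frame of the same length. With those choices the shifted windows sum to exactly 1, so doing nothing to the frames gives back the input (away from the very first and last half-frame).

```python
import numpy as np

def overlap_add(x, frame_len, hop, transform=lambda frame: frame):
    """Cut x into overlapping frames, transform each frame, fade it in
    and out with a window, and sum the frames back at their positions."""
    # Periodic Hann window: copies shifted by frame_len / 2 sum to exactly 1.
    n = np.arange(frame_len)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / frame_len)
    y = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = transform(x[start:start + frame_len])  # same length in and out
        y[start:start + frame_len] += window * frame
    return y

# Usage: 100 ms of a tone at an assumed 16 kHz sample rate.
sr = 16_000
frame_len, hop = 320, 160           # 20 ms frames, one new frame every 10 ms
t = np.arange(1600) / sr
x = np.sin(2 * np.pi * 220.0 * t)

y = overlap_add(x, frame_len, hop)                                   # identity
quieter = overlap_add(x, frame_len, hop, transform=lambda f: 0.5 * f)

# Away from the edges, the identity transform reproduces the input.
print(np.allclose(y[hop:-hop], x[hop:-hop]))
```

A transform that changes duration or pitch would also change where each frame is written back, which is where the alignment tricks mentioned earlier come in; this sketch keeps every frame at its original position.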

Closing note

OLA is a good example of the "use the bl**dy equation" principle. The problem has a well-defined mathematical formulation. The process is very straightforward and incredibly efficient.