AI and Direct Methods: Finding the Right Mix

The Problem

The current ecosystem says AI is the answer, even though tons of battle-tested approaches exist. In audio processing, how do you decide what to hand off to AI and what to leave to Direct Methods (DM) like classical signal processing?

⚠ Disclaimer: The field here is audio and voice processing specifically. Some of the following might not be true outside of this field.

Quick Answers

"If an equation exists, use the bl**dy equation!" — An emeritus professor
If the problem is about detecting systematic events: It is likely DM will do the job just fine and its efficiency will be unbeatable anyway.
If the problem is perceptual, and simple enough, a DM might already exist. Otherwise, it is very likely you will need AI.
If the problem is about recognizing complex patterns: You will need AI, no workaround.
If you don't know where your tech problem is in this list, continue reading.

The Underlying Issue

In audio engineering, we can roughly classify technical problems in order of complexity:

Trivial → Closed-form solution A mathematical model exists that fits the data very well, and a simple sequence of deterministic instructions leads to an exact closed-form solution. Ex. resampling in audio, measuring the energy or loudness in equalisation (transformation docs).

Systematic → Direct Method (DM) An explicit model exists, but there's no closed-form solution. Nevertheless, a deterministic sequence of instructions, a process, converges to an estimated solution. Ex. Newton's method, auto-correlation, pitch detection in speech.

Pattern → Machine Learning (AI) There is no explicit model of the data. Instead, we let a neural network learn the mapping from inputs to outputs by training on examples — capturing the countless contextual dependencies that are intractable to encode as systematic rules. Ex. Automatic Speech Recognition (ASR), Text-to-Speech synthesis (TTS), Voice Conversion (VC).

Ill-defined There's no solid model of the data, neither explicit nor implicit. The mapping from input to output is fundamentally ambiguous. Ex. estimating vocal tract shape from audio (many shapes produce the same sound); separating emotion, identity, and linguistic stress when they all share the same audio cues. These problems can sometimes be simplified with extra information — but that side information is rarely available when a user hands you a simple mono audio file.
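To make the first two categories concrete, here is a minimal sketch in plain Python. The function names, the test tone and the search range (`fmin`, `fmax`) are assumptions chosen for illustration. `rms_db` is a closed-form measurement: one formula, one exact answer. `autocorr_pitch` is a bare-bones direct method: there is no formula for the pitch, so a deterministic search over candidate lags produces an estimate. Real pitch trackers add normalization, interpolation and voicing decisions that are left out here.

```python
import math

def rms_db(samples):
    """Closed form: RMS level in dB relative to an amplitude of 1.0."""
    if not samples:
        raise ValueError("need at least one sample")
    mean_square = sum(x * x for x in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square)

def autocorr_pitch(frame, sample_rate, fmin=60.0, fmax=500.0):
    """Direct method: return the f0 (Hz) whose lag maximizes the autocorrelation."""
    min_lag = int(sample_rate // fmax)
    max_lag = int(sample_rate // fmin)
    if len(frame) <= max_lag:
        raise ValueError("frame too short for the requested fmin")
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Plain (unnormalized) autocorrelation of the frame at this lag.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Synthetic test tone: 200 Hz sine, 8 kHz sample rate, 50 ms (10 full periods).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]

level_db = rms_db(tone)             # about -3.01 dB for a full-scale sine
f0 = autocorr_pitch(tone, sample_rate)  # 200.0 Hz
```

On a clean sine the search lands on the right lag. On real voices, lags at twice or half the true period often score almost as well, which is where octave errors come from.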

Performance

Now, let's see how the performance relates to the problems' complexity.

We can first get rid of the two obvious extreme categories. For Trivial, the performance of a closed-form solution is almost impossible to beat ("use the bl**dy equation", they say). For Ill-defined, let's keep in mind that this case exists: either we jump into a black hole of infinite R&D iterations and blow up our token budget, or there is a handy simplification that lets us fall back on one of the two remaining categories, which are the ones to argue about:

The so-called "duel" between Direct Methods and AI:

Systematic: DM are traditionally very efficient; the inefficient methods usually stay in laboratories or get optimized. The more complex the problem, the more complicated and the less efficient DM become. Faced with a pattern-based problem, DM are simply not fit for the task. For perception-based problems, DM might handle very simple tasks (ex. measuring instantaneous loudness), but fail to capture the full complexity of human perception (ex. for Voice Conversion).

Pattern: AI is traditionally not efficient at all. We needed to steal the GPU of our kids' gaming computer to make it usable. For very simple tasks, AI tools carry overheads (ex. converting inputs and outputs to the model's format). Their current memory footprint is terrible as well. They shine on pattern and perception-based problems because they excel at contextualization by modeling the statistical properties of datasets. In these scenarios, the task genuinely requires all the processing and memory power we have, so we're happy to spend it.

The figure below illustrates these discrepancies.

Currently, DM and AI are seen as opponents in a duel, mainly because of the impressive wave of innovation brought by recent advances in AI. This wave pushes ~~researchers~~ DM out of their traditional comfort zone.

The approach used in Pitchmeld

Audio and especially voice technologies require light infrastructure. Don't dare dedicate more than 1% of the computing power to audio in an AAA video game. You just can't.

So, if we want to push AI further into audio technologies, there is no other choice than to merge it with DM.

The tricky part is the trench shown in orange in the plot below, where Direct Methods lack the contextualization that AI offers, and where current AI models are too heavy.

For example, in Pitchmeld, to estimate the pitch curve of a singing voice, most of the heavy lifting is done by a method that follows a traditional correlation-based approach. However, for the final decisions (ex. the octave component of the pitch), AI is used. No need for deep learning, no need for heavy transformers, just a shallow neural net of a few thousand parameters. An AI agent finally jumps in, optimizes the model's architecture, re-implements it in C++ with battle-tested libraries, and voilà!
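The general shape of such a hybrid decision can be sketched in a few lines of Python. This is not Pitchmeld's implementation: the feature choice (correlation strength at half, same and double the candidate frequency), the layer sizes and every weight below are made-up placeholders standing in for trained parameters, and the real net is described above as having a few thousand parameters rather than a handful. The sketch only illustrates the division of labour, where a direct method proposes an f0 and a tiny net picks the octave.

```python
import math

def shallow_net(features, w1, b1, w2, b2):
    """One hidden tanh layer followed by a linear layer; returns raw class scores."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def pick_octave(f0_dm, features, params):
    """Choose between f0/2, f0 and 2*f0 according to the net's highest score."""
    scores = shallow_net(features, *params)
    candidates = [f0_dm / 2.0, f0_dm, f0_dm * 2.0]
    return candidates[scores.index(max(scores))]

# Hypothetical placeholder parameters: 3 inputs -> 2 hidden units -> 3 classes.
# In a real system these would come from training on labelled pitch data.
W1 = [[1.0, -1.0, 0.0], [0.0, -1.0, 1.0]]
B1 = [0.0, 0.0]
W2 = [[2.0, 0.0], [-1.0, -1.0], [0.0, 2.0]]
B2 = [0.0, 0.5, 0.0]
PARAMS = (W1, B1, W2, B2)

# Features: assumed correlation strength at [f0/2, f0, 2*f0] from the direct method.
kept = pick_octave(220.0, [0.2, 0.9, 0.3], PARAMS)     # candidate itself scores best
lowered = pick_octave(220.0, [0.95, 0.4, 0.1], PARAMS)  # strong evidence one octave down
```

The forward pass is a few dozen multiply-adds per frame, which is why a net of this kind adds very little to the cost of the correlation stage that feeds it.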

The result is a hybrid pipeline that achieves AI quality at DM cost, with zero overhead. We get the adaptability of learned models where we need it, and the efficiency of Closed-form and DM solutions everywhere else. For a standard transformation, Pitchmeld processes a 6-minute music track in ~2 s on a single CPU thread, with no multithreading and no GPU.
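As a back-of-the-envelope check, the quoted figures can be turned into a real-time factor. The variable names are illustrative, and the 2 s value is the approximate number given above, so the results are approximate too.

```python
track_seconds = 6 * 60      # 6-minute music track
processing_seconds = 2.0    # approximate processing time quoted above

# How many times faster than real time the processing runs.
real_time_factor = track_seconds / processing_seconds   # 180.0

# Equivalent share of one CPU thread if the same work were spread over playback time.
thread_share = processing_seconds / track_seconds        # about 0.0056, i.e. roughly 0.56 %
```

Offline throughput does not translate one-to-one into streaming cost, since buffering and latency constraints add their own overhead, but roughly 0.56 % of a single thread is the right order of magnitude for a budget of about 1 % of the computing power.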

Closing note

High computation usage isn't just a nerdy challenge. It translates directly to:

Economic portability: Efficient algorithms run on cheap hardware with low battery drain. Inefficient ones require expensive GPUs in the cloud.
Security: Sending audio to cloud services poses security risks. With Pitchmeld, your audio stays where you decide to process it.

A Direct Method (DM) that delivers 90% of the result at 1% of the cost builds the frame of a fast car. A light AI model that closes the remaining 10% for just an extra 1% is the engine that leaves the competition stuck in their morning standup.