Producing training sets for Extract:Dialogue by Acon Digital

This article summarises the extensive experimental work we have done to build a training set for Acon Digital’s Extract:Dialogue, an audio plugin that removes common noise problems, such as those encountered with lavaliere microphones. High-quality training sets are crucial to training the HANCE audio engine efficiently. In the following text, we detail one example of how HANCE has tailored training sets to solve real-world audio problems.

Peder Jørgensen - Feb 4, 2021

Lavaliere microphones are small microphones placed on actors to capture dialogue for movies and TV. Lavalieres typically need to be hidden under the actor’s clothes, which creates a distinct set of problems, the most common being noise from clothes touching the microphone membrane. Other issues arise from the conditions in which the scenes are filmed, such as broadband or static noise from fans, lighting, generators, traffic, and other unwanted audio sources.

Figure 1: A typical lavaliere microphone with a paper clip for size comparison.

We started by dividing the noise palette into two categories: static and burst. We define static noise as long stretches of unwanted signal, such as traffic, rain, air conditioning units, ocean waves, and so forth. We gathered a wide variety of material from pre-recorded audio assets provided by Soundly, a leading sound effects platform and library used by professional film and TV producers worldwide. We further compiled an extensive collection of static noises from real-life recording sessions done with lavaliere microphones.

Bursts are short periods of noise that typically arise when the lavaliere microphone touches clothes, hair, or hands. We used three different lavaliere microphones to record typical bursts. In addition, we collected natural bursts from on-set recording sessions. Mouth clicks, mic thumps, and other short noises were also added to the training set to broaden the range of bursts it covers.

Having compiled a comprehensive set of unwanted noise, we could finally focus on what we were after: clean, artifact-free voice recordings! To get as close to real-life examples as possible, we recorded voices of different ages and genders in a controlled environment using lavaliere microphones.

To create quality models, we feed our algorithms wanted audio (voice), unwanted audio (noise), and the two combined (noise + voice). We used the Python programming language to generate training sets, and HANCE developed several advanced applications to streamline this process. One of these applications, jokingly named the merge-devide-and-conquer script by our engineers, would, for example, execute the following procedure:

  1. Extract a random 11-second stretch of a voice recording.
  2. Pick a noise file from the static sound pool, looping it seamlessly if the noise was shorter than the voice excerpt.
  3. Mix bursts into the pulse-code modulated (PCM) data at random intervals.
  4. Normalize the voice and noise to a set limit.
  5. Save the voice, noise, and combined data separately.
  6. Export metadata files describing the training set and which parts to use for validation.
Figure 2: The noise sets as seen in the Soundly Application.
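
The six-step procedure listed earlier can be illustrated with a minimal sketch. This is not the actual HANCE script: the function names, the 16 kHz sample rate, the 0.9 peak limit, the number of bursts, and the choice to mix bursts into the noise track are all assumptions made for illustration. Audio is held as plain Python lists of floats to keep the example dependency-free, the loop is a simple repeat rather than a seamless one, and writing the audio files (step 5) is left out.

```python
import random

SAMPLE_RATE = 16_000   # assumed sample rate in Hz (not stated in the article)
CLIP_SECONDS = 11      # excerpt length from step 1
PEAK_LIMIT = 0.9       # assumed normalization ceiling for step 4


def random_excerpt(voice, length, rng):
    """Step 1: cut a random stretch of `length` samples from a voice recording.

    The recording must contain at least `length` samples.
    """
    start = rng.randrange(0, len(voice) - length + 1)
    return voice[start:start + length]


def loop_to_length(noise, length):
    """Step 2: repeat the noise until it covers `length` samples, then trim.

    A truly seamless loop would also crossfade at the joins; omitted here.
    """
    repeats = -(-length // len(noise))  # ceiling division
    return (noise * repeats)[:length]


def add_bursts(noise, bursts, count, rng):
    """Step 3: add `count` randomly chosen bursts at random offsets."""
    out = list(noise)
    for _ in range(count):
        burst = rng.choice(bursts)
        offset = rng.randrange(0, len(out))
        # Truncate the burst if it would run past the end of the clip.
        for i, sample in enumerate(burst[:len(out) - offset]):
            out[offset + i] += sample
    return out


def normalize(signal, peak=PEAK_LIMIT):
    """Step 4: scale so the largest absolute sample equals `peak`."""
    largest = max(abs(s) for s in signal)
    if largest == 0:
        return list(signal)
    return [s * peak / largest for s in signal]


def build_example(voice, static_noise, bursts, rng,
                  length=SAMPLE_RATE * CLIP_SECONDS, burst_count=3):
    """Steps 1-4: return the (voice, noise, voice + noise) triple."""
    clean = normalize(random_excerpt(voice, length, rng))
    looped = loop_to_length(static_noise, length)
    noise = normalize(add_bursts(looped, bursts, burst_count, rng))
    mixed = [v + n for v, n in zip(clean, noise)]
    return clean, noise, mixed


def split_metadata(example_ids, validation_fraction, rng):
    """Step 6: decide which examples are held out for validation."""
    ids = list(example_ids)
    rng.shuffle(ids)
    n_val = int(len(ids) * validation_fraction)
    return {"validation": sorted(ids[:n_val]), "training": sorted(ids[n_val:])}
```

Passing a seeded `random.Random` instance as `rng` makes a generated set reproducible, which helps when comparing models trained on successive revisions of the set. The dictionary returned by `split_metadata` could be written out with `json.dump` to serve as the metadata file.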

We were excited to hear how effective the noise reduction was in our first model, but we were still a long way from a usable product. HANCE spent several months tweaking the set, adding more noise and voice recordings as needed.

One particular problem we discovered was that our voice recordings were too controlled compared to real-world recordings. We experienced a gating effect at the ends of words when applying the model to recordings from noisy film sets and similar environments. After researching several options, we found that adding a short reverb to some of the training set’s clean voice recordings gave the model a more natural transition between voice and silence.
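
As a rough illustration of this kind of augmentation, the sketch below adds a short synthetic reverb tail to a fraction of the clean recordings. The article does not say which reverb was used, so everything here is an assumption: the impulse response is exponentially decaying noise, the convolution is a plain time-domain loop, and the function names, `wet` level, and `probability` parameter are hypothetical.

```python
import math
import random


def synthetic_impulse_response(length, decay_samples, rng):
    """Exponentially decaying noise, a crude stand-in for a short room tail."""
    return [rng.uniform(-1.0, 1.0) * math.exp(-i / decay_samples)
            for i in range(length)]


def convolve(signal, ir):
    """Plain time-domain convolution: len(signal) + len(ir) - 1 samples."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue  # silent samples contribute nothing
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def add_short_reverb(voice, ir, wet=0.3):
    """Add a scaled reverberated copy to the dry voice, keeping the tail."""
    reverberated = convolve(voice, ir)
    dry = list(voice) + [0.0] * (len(ir) - 1)  # pad dry signal to tail length
    return [d + wet * r for d, r in zip(dry, reverberated)]


def augment_some(voices, ir, probability, rng):
    """Apply the reverb to each recording with the given probability."""
    return [add_short_reverb(v, ir) if rng.random() < probability else list(v)
            for v in voices]
```

The reverberated recordings come out `len(ir) - 1` samples longer than the originals, since the tail extends past the last word. That decaying tail is what gives word endings a gradual fade instead of an abrupt cut to silence.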

Building the lavaliere training set gave us valuable insight into the kinds of problems audio engineers face when working with film and TV. The work resulted in an extensive training set for this specific purpose. We believe this is one of the great strengths of the HANCE algorithms, such as the HANCE Audio Engine: combining training data for several different applications to find solutions to unique problems.