Friday, February 20, 2009

More Explanations

And so, after over a month's worth of CPU time spent on calculations, my program is finally finished generating its data set. I think it is well worth it, considering I started writing this program in October of 2006, to be now finally almost finished.

Here's brief explanation of what my computer has been doing:

Preamble

First of all, audio is a sequence of micro-changes air pressure which are transmitted over distances and perceived by the human ear as sound. These air pressure variations can be expressed as a positive or negative value, depending on what increase or decrease beyond "neutral" air pressure is created:


^ | __ __ __
+ |/ \ / \ / \
0|----\----/----\----/----\----
- | \ / \ / \ /
v | -- -- --
time ->

Digital audio waveforms are merely a collection of values, in essence, between -1 and 1, inclusive, which are called samples. Samples represent what the air pressure will be at the point in time where the sample occurs. The rate at which the samples are converted to air pressure variations ("playing" the digital audio) is described as the sampling rate. The sampling rate is expressed in the number samples that represent one second of digital audio.

Through the graphing of the audio samples' values by time, we can see how they represent the original waveform:


+1| .. .. ..
|- - - - - -
0|----*----*----*----*----*----
| - - - - - -
-1| -- -- --
time ->

Because digital audio uses computerized values for the representation of these samples, the maximum height of a waveform is limited to +/-1.

Language

Regarding the deconstruction of audio data, there are certain terms used in the scope of this blog that require explanation:

amplitude

In sound, amplitude refers to the loudness of the waveform, also described as the amount of air pressure change that the waveform produces. In physical sound, it is measures in pascals. In digital sound, however it becomes a measure of its height within the limit of the samples' maximum value:


+1| __ __ __.........._
|/ \ / \ / \ |--- amplitude
0|----\----/----\----/----\----....-
| \ / \ / \ /
-1| -- -- --
time ->

frequency

A frequency, or pitch is an expression of how many times per second a given waveform completes its cycle, or period. It is expressed in Hz, or cycles per second.


+---------+---------+------ peaks
| | |
v v v
+1| __ __ __
|/ \ / \ / \
0|----\----/----\----/----\----.
| \ / \ / \ /.
-1| -- -- -- .
time -> .
. .
|----------1 second-----------|
3 peaks in 1 second => frequency: 3 Hz

frequency "bin"

A frequency bin, or bin, means a summation of the amplitudes of all frequencies associated with that bin as they occur in the deconstructed waveform.

Frequency bins are an effective representation of the loudness, or amplitude of the musical notes, or tones present in a waveform.

magnitude

Because higher-frequency waveforms carry more energy than lower-frequency ones, it becomes necessary to think of a waveform in terms of the total energy carried by it.The energy said to be carried by a given waveform is expressed as the total change in air displacement incurred by the transmission of that waveform. This can be found by finding the absolute derivative of the waveform.

The magnitude of a waveform is an expression of this. It is the amplitude of the waveform weighted by its frequency, and is found by multiplying its amplitude by its frequency.

phase, or phasing

The phase of a given waveform refers to the point in its cycle at which it is first observed. It is typically measured in radians from 0 the value of theta would be in the waveform's sine function.

A simpler way of defining it would be to think of it in terms of how far the beginning of the period of the waveform is from the origin:

An unphased waveform:


+------ first period starts at the origin; no phasing
|
v
+1| __ __ __
|/ \ / \ / \
0|----\----/----\----/----\----
| \ / \ / \ /
-1| -- -- --
time ->

A waveform with phasing:


+--- first period begins here
|
+------ began observing here (3/4 through last period)
| | => phasing of +1/4 of a period = +pi/2
v v
+1| __ __ __
| / \ / \ / \
0|--/----\----/----\----/----\-
| / \ / \ / \
-1|- -- --
time ->

The Fourier Transform

Joseph Fourier discovered that within complex waveforms exist many simpler waveforms with their own individual pitches. The Fourier transform was built as a way to decompose these complex waveforms into the simpler ones.

By design, the computerized Fourier transform gives a deconstruction of a waveform as a set of complex values, whose real and imaginary portions relate to one another in a way that it is possible to reconstruct the original waveform, complete with phased and non-phased waveforms.

For the purposes of the audio analysis performed by Audicon, the deconstuction is a one-way process. Therefore it becomes necessary to simplify the Fourier transform deconstruction's data in such a way that it is able to be processed in an ad hoc manner.

Audio "Featureset" Calculations

Each audio file is broken down into clips of audio data (each adjacent ½ of a second), along with a texture buffer of audio data in which the clip occurs (1 second before + the clip + 1 second after = 2½ seconds) and sent through the following functions:

Centroid

The centroid frequency of an audio sample is basically the center of its spectral gravity. Let me explain: If you had a waveform deconstruction from the Fourier transform, and you graphed it like so:


+----------+------- two equal parts
| |
|-----||------------|
^
^|
|| .
|| ||.
a| .|||
m| |||||.
p| | .||||||||.|. .|
+-------------------------->
freq -->

then the center of spectral gravity would be between the two divisions you see at the top of the graph.

The centroid frequency is found by way of the following function:

   F
---
\
\ f * N[f]
/
/
---
f
-------------
F
---
\
\ N[f]
/
/
---
f

where F is the highest frequency bin, f is the frequency and N[f] is the unweighted amplitude of the frequency bin representing frequency f as an output of the Fourier transform.

Magnitude Ratio

Remember the "texture" frame that was mentioned earlier? It comes into play here. Using the weighted Fourier transform bins, the magnitude ratio is just the sum of all the magnitudes in the clip divided by the sum of all the magnitudes in the texture frame. Here's the characteristic function:

       Fc
---
\
\ fc * C[fc]
/
/
---
fc
M = -------------
Ft
---
\
\ ft * T[ft]
/
/
---
ft

where Fc, Ft are the highest frequency bins for the clip and texture frame, fc and ft are their respective bins' frequencies and C[fc] and T[ft] are the respective unweighted amplitudes of the frequency bins representing frequencies fc and ft as an output of the Fourier transform.

Spectral Flux

The spectral flux is just a measure of how much the entire spectrum changes from one clip of audio to to the next. It's a sum of the absolute value of the changes of each frequency bin. Here's the characteristic function:

    F
---
\
\ (N(t)[f] - N(t-1)[f])^2
/
/
---
f

where F is the highest frequency bin, N(t)[f] and N(t-1)[f] are the unweighted amplitude of the Fourier transform bin representing frequency f at the current frame t, and previous frame # t-1, respectively.

Spectral Rolloff

This is a measure of where the vast majority of the audio information ends. It's a point (in frequency) in the spectrum where the sum of all the bins before it equals 85% of the sum of all the bins in the spectrum. In effect, it's an expression of where (about) the spectrum begins to fall off.

It uses the weighted bins for this. In order to find this frequency, one needs to solve for f in the following function:

  R              N
--- ---
\ \
\ M[f] >= c * \ M[f]
/ /
/ /
--- ---
f=1 f=1

where M[f] is the weighted amplitude of the FFT frequency bin corresponding to frequency f, N is equal to the total number of frequency bins and c is the "concentration" value (in this case 0.85) from 0 to 1, representing the target concentration for the rolloff point.

Zero Crossings

This is relatively simple. It's the number of times the clip's waveform crosses the zero-axis. This is an effective measure of the "noisiness" of the signal. It doesn't even use the Fourier transform.

Here's the characteristic function:


N
---
\
\ | sign(x[n]) - sign(x[n-1]) |
/
/
---
n
--------------------------------------
2 * H

where x[n] is the value of the sample at index n, N is the total number of samples in the signal, H is the number of samples per second of signal (the sampling rate), and sign() is a pseudo-function such that sign(x) = x / |x|.


Whoo! That's a lot to absorb in one sitting. Add to that the 96-output function described in the previous post and you'll see why this has taken me so long.

Next time: How to put all this crap together into something coherent.

No comments: