Friday, February 20, 2009

More Explanations

And so, after over a month's worth of CPU time spent on calculations, my program has finally finished generating its data set. Considering I started writing this program in October of 2006, I think being nearly done at last is well worth it.

Here's a brief explanation of what my computer has been doing:

Preamble

First of all, audio is a sequence of micro-changes in air pressure which are transmitted over distances and perceived by the human ear as sound. These air pressure variations can be expressed as positive or negative values, depending on whether they rise above or fall below "neutral" air pressure:


^ | __        __        __
+ |/  \      /  \      /  \
0 |----\----/----\----/----\----
- |     \  /      \  /      \  /
v |      --        --        --
               time ->

Digital audio waveforms are, in essence, merely a collection of values between -1 and 1, inclusive, called samples. Each sample represents what the air pressure will be at the point in time where the sample occurs. The rate at which the samples are converted back into air pressure variations ("playing" the digital audio) is called the sampling rate, expressed as the number of samples that represent one second of digital audio.

By graphing the audio samples' values over time, we can see how they represent the original waveform:


+1| ..        ..        ..
  |-  -      -  -      -  -
 0|----*----*----*----*----*----
  |     -  -      -  -      -  -
-1|      --        --        --
               time ->

Because digital audio uses bounded numeric values to represent these samples, the maximum height of a waveform is limited to +/-1.
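To make sampling concrete, here's a short Python sketch (illustrative only; Audicon itself is written in Perl) that generates the samples of a sine wave at a given sampling rate:

```python
import math

def sample_sine(freq_hz, sample_rate, seconds, amplitude=1.0):
    """Generate one sample per 1/sample_rate seconds of a sine wave."""
    n_samples = int(sample_rate * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

# One second of a 3 Hz tone at 30 samples per second -> 30 samples,
# each between -1 and +1.
wave = sample_sine(3, 30, 1)
```

Playing those values back at the same rate would reproduce the original pressure variations.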

Language

Regarding the deconstruction of audio data, there are certain terms used in the scope of this blog that require explanation:

amplitude

In sound, amplitude refers to the loudness of the waveform, also described as the amount of air pressure change that the waveform produces. In physical sound, it is measured in pascals. In digital sound, however, it becomes a measure of the waveform's height within the limit of the samples' maximum value:


+1| __        __        __ .......
  |/  \      /  \      /  \      |--- amplitude
 0|----\----/----\----/----\-----+
  |     \  /      \  /      \
-1|      --        --        --
               time ->

frequency

A frequency, or pitch, is an expression of how many times per second a given waveform completes its cycle, or period. It is expressed in Hz (hertz), or cycles per second.


    +---------+---------+------ peaks
    |         |         |
    v         v         v
+1| __        __        __
  |/  \      /  \      /  \
 0|----\----/----\----/----\----
  |     \  /      \  /      \
-1|      --        --        --
               time ->
  |----------1 second----------|
   3 peaks in 1 second => frequency: 3 Hz

frequency "bin"

A frequency bin, or bin, is the sum of the amplitudes of all frequencies associated with that bin as they occur in the deconstructed waveform.

Frequency bins are an effective representation of the loudness, or amplitude of the musical notes, or tones present in a waveform.

magnitude

Because higher-frequency waveforms carry more energy than lower-frequency ones, it becomes necessary to think of a waveform in terms of the total energy it carries. The energy said to be carried by a given waveform is expressed as the total change in air displacement incurred by the transmission of that waveform, which can be found by taking the absolute derivative of the waveform.

The magnitude of a waveform is an expression of this. It is the amplitude of the waveform weighted by its frequency, and is found by multiplying its amplitude by its frequency.
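In code, that weighting is a one-liner; here's a Python sketch (the bin amplitudes and frequencies below are made-up illustration values):

```python
def magnitudes(bin_amps, bin_freqs):
    """Weight each bin's amplitude by its frequency to get its magnitude."""
    return [a * f for a, f in zip(bin_amps, bin_freqs)]

# A quiet high bin can carry as much energy as a loud low one:
# amplitude 0.5 at 200 Hz matches amplitude 1.0 at 100 Hz.
weighted = magnitudes([1.0, 0.5], [100.0, 200.0])
```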

phase, or phasing

The phase of a given waveform refers to the point in its cycle at which it is first observed. It is typically measured in radians: the value theta would have at time 0 in the waveform's sine function.

A simpler way of defining it would be to think of it in terms of how far the beginning of the period of the waveform is from the origin:

An unphased waveform:


   +------ first period starts at the origin; no phasing
   |
   v
+1| __        __        __
  |/  \      /  \      /  \
 0|----\----/----\----/----\----
  |     \  /      \  /      \
-1|      --        --        --
               time ->

A waveform with phasing:


     +--- first period begins here
     |
   +-|---- began observing here (3/4 through last period)
   | |       => phasing of +1/4 of a period = +pi/2
   v v
+1|    __        __        __
  |   /  \      /  \      /  \
 0|--/----\----/----\----/----\-
  | /      \  /      \  /      \
-1|-        --        --
               time ->
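In terms of the sine function, a phased waveform is just sin(2*pi*f*t + theta). A small Python sketch of the +pi/2 case above (assuming a unit-amplitude sine):

```python
import math

def phased_sample(freq_hz, theta, t):
    """Value at time t of a unit sine wave with the given frequency and phase."""
    return math.sin(2 * math.pi * freq_hz * t + theta)

# With no phasing, observation starts at the beginning of a period (value 0);
# with a phase of +pi/2 (a quarter period), it starts at a crest (value 1).
unphased = phased_sample(3, 0, 0)
phased = phased_sample(3, math.pi / 2, 0)
```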

The Fourier Transform

Joseph Fourier discovered that within complex waveforms exist many simpler waveforms with their own individual pitches. The Fourier transform was built as a way to decompose these complex waveforms into the simpler ones.

By design, the computerized Fourier transform gives a deconstruction of a waveform as a set of complex values whose real and imaginary portions relate to one another in such a way that it is possible to reconstruct the original waveform, complete with each component's phase.

For the purposes of the audio analysis performed by Audicon, the deconstruction is a one-way process. Therefore it becomes necessary to simplify the Fourier transform's output so that it can be processed in an ad hoc manner.
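For the curious, here's what that deconstruction looks like as a naive, pure-Python discrete Fourier transform. The real program uses Math::FFTW, a fast FFT library; this brute-force version is only a sketch of the underlying math:

```python
import cmath
import math

def dft_amplitudes(samples):
    """Return the unweighted amplitude of each frequency bin.

    Bin f measures the strength of the component that completes f full
    cycles over the window; taking the modulus of the complex result
    discards the phase information.
    """
    n = len(samples)
    amps = []
    for f in range(n // 2 + 1):
        z = sum(samples[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                for t in range(n))
        amps.append(abs(z) * 2 / n)
    return amps

# 32 samples of a wave cycling 3 times per window: only bin 3 lights up.
spectrum = dft_amplitudes(
    [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)])
```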

Audio "Featureset" Calculations

Each audio file is broken down into clips of audio data (adjacent half-second chunks), each paired with a texture buffer of the audio data in which the clip occurs (1 second before + the clip + 1 second after = 2½ seconds), and sent through the following functions:
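The framing itself might look something like this in Python (a sketch, not Audicon's Perl; the edge handling here, clamping the texture buffer at the file boundaries, is an assumption):

```python
def frame_audio(samples, sample_rate):
    """Split audio into adjacent half-second clips, pairing each with its
    2.5-second texture buffer (1 s before + the clip + 1 s after),
    clamped at the start and end of the file."""
    half = sample_rate // 2
    ctx = sample_rate  # one second of context on each side
    frames = []
    for start in range(0, len(samples) - half + 1, half):
        clip = samples[start:start + half]
        texture = samples[max(0, start - ctx):start + half + ctx]
        frames.append((clip, texture))
    return frames

# Five seconds of dummy "audio" at a 20-samples-per-second rate:
frames = frame_audio(list(range(100)), 20)
```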

Centroid

The centroid frequency of an audio sample is basically the center of its spectral gravity. Let me explain: if you had a waveform deconstruction from the Fourier transform, and you graphed it like so:


     +----------+----- two equal parts
     |          |
 |--------||---------|
 ^
 |            .
 |           ||.
a|          .|||
m|        |||||.
p| |  .||||||||.|. .|
 +--------------------->
          freq ->

then the center of spectral gravity would be between the two divisions you see at the top of the graph.

The centroid frequency is found by way of the following function:

     F
    ---
    \
     \   f * N[f]
     /
    /
    ---
     f
    ----------------
     F
    ---
    \
     \   N[f]
     /
    /
    ---
     f

where F is the highest frequency bin, f is the frequency and N[f] is the unweighted amplitude of the frequency bin representing frequency f as an output of the Fourier transform.
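As a sanity check, here is the same computation as a Python sketch (bins indexed from 1, matching the formula; the input spectrum is a made-up example):

```python
def spectral_centroid(bin_amps):
    """Amplitude-weighted mean bin frequency (bins indexed from 1)."""
    total = sum(bin_amps)
    if total == 0:
        return 0.0
    return sum(f * a for f, a in enumerate(bin_amps, start=1)) / total

# A spectrum symmetric about bin 2 balances exactly there.
centroid = spectral_centroid([1.0, 2.0, 1.0])
```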

Magnitude Ratio

Remember the "texture" frame that was mentioned earlier? It comes into play here. Using the weighted Fourier transform bins, the magnitude ratio is just the sum of all the magnitudes in the clip divided by the sum of all the magnitudes in the texture frame. Here's the characteristic function:

         Fc
        ---
        \
         \   fc * C[fc]
         /
        /
        ---
         fc
    M = -----------------
         Ft
        ---
        \
         \   ft * T[ft]
         /
        /
        ---
         ft

where Fc, Ft are the highest frequency bins for the clip and texture frame, fc and ft are their respective bins' frequencies and C[fc] and T[ft] are the respective unweighted amplitudes of the frequency bins representing frequencies fc and ft as an output of the Fourier transform.
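A Python sketch of the same ratio (again with 1-indexed bins; a hypothetical illustration, not the program's Perl code):

```python
def weighted_sum(bin_amps):
    """Sum of frequency-weighted (magnitude) bins, indexed from 1."""
    return sum(f * a for f, a in enumerate(bin_amps, start=1))

def magnitude_ratio(clip_amps, texture_amps):
    """Clip magnitude over texture-frame magnitude."""
    return weighted_sum(clip_amps) / weighted_sum(texture_amps)

# A clip exactly half as loud as its texture frame gives a ratio of 0.5.
ratio = magnitude_ratio([1.0, 1.0], [2.0, 2.0])
```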

Spectral Flux

The spectral flux is just a measure of how much the entire spectrum changes from one clip of audio to the next. It's the sum of the squared changes of each frequency bin. Here's the characteristic function:

     F
    ---
    \
     \   (N(t)[f] - N(t-1)[f])^2
     /
    /
    ---
     f

where F is the highest frequency bin, and N(t)[f] and N(t-1)[f] are the unweighted amplitudes of the Fourier transform bin representing frequency f at the current frame t and the previous frame t-1, respectively.
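In Python this is essentially a one-liner (a sketch; the frame bookkeeping is elided):

```python
def spectral_flux(amps_now, amps_prev):
    """Sum of squared bin-by-bin changes between consecutive frames."""
    return sum((a - b) ** 2 for a, b in zip(amps_now, amps_prev))

# A spectrum that doesn't move has zero flux; any change adds to it.
flux = spectral_flux([1.0, 2.0], [0.0, 0.0])
```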

Spectral Rolloff

This is a measure of where the vast majority of the audio information ends. It's a point (in frequency) in the spectrum where the sum of all the bins before it equals 85% of the sum of all the bins in the spectrum. In effect, it's an expression of where (about) the spectrum begins to fall off.

It uses the weighted bins for this. To find this frequency, one needs to find the smallest R satisfying the following:

     R                     N
    ---                   ---
    \                     \
     \   M[f]   >=   c *   \   M[f]
     /                     /
    /                     /
    ---                   ---
    f=1                   f=1

where M[f] is the weighted amplitude of the FFT frequency bin corresponding to frequency f, N is equal to the total number of frequency bins and c is the "concentration" value (in this case 0.85) from 0 to 1, representing the target concentration for the rolloff point.
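A running-sum scan finds that R directly; here's a Python sketch (the example spectrum is made up):

```python
def spectral_rolloff(bin_mags, concentration=0.85):
    """Smallest 1-indexed bin whose running sum reaches the target
    fraction of the total spectral magnitude."""
    target = concentration * sum(bin_mags)
    running = 0.0
    for f, m in enumerate(bin_mags, start=1):
        running += m
        if running >= target:
            return f
    return len(bin_mags)

# When nearly all the energy sits in the first bin, the rolloff comes early.
rolloff = spectral_rolloff([10.0, 1.0, 1.0])
```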

Zero Crossings

This is relatively simple. It's the number of times the clip's waveform crosses the zero-axis. This is an effective measure of the "noisiness" of the signal. It doesn't even use the Fourier transform.

Here's the characteristic function:


     N
    ---
    \
     \   | sign(x[n]) - sign(x[n-1]) |
     /
    /
    ---
    n=1
    ------------------------------------
                  2 * H

where x[n] is the value of the sample at index n, N is the total number of samples in the signal, H is the number of samples per second of signal (the sampling rate), and sign() is a pseudo-function such that sign(x) = x / |x|, with sign(0) taken as 0.
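A Python sketch of the same count (illustrative; the division by 2 * H turns the raw sign-change total into crossings per second):

```python
def zero_crossing_rate(samples, sample_rate):
    """Zero crossings per second.  |sign(x[n]) - sign(x[n-1])| equals 2
    at each sign change, hence the division by 2 * H."""
    sign = lambda x: 0 if x == 0 else (1 if x > 0 else -1)
    changes = sum(abs(sign(samples[n]) - sign(samples[n - 1]))
                  for n in range(1, len(samples)))
    return changes / (2 * sample_rate)

# A signal that flips sign at every sample is as "noisy" as it gets.
zcr = zero_crossing_rate([1.0, -1.0, 1.0, -1.0], 1)
```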


Whoo! That's a lot to absorb in one sitting. Add to that the 96-output function described in the previous post and you'll see why this has taken me so long.

Next time: How to put all this crap together into something coherent.

Friday, February 6, 2009

Damn You, GSL!

One part of this program I wrote is a calculator of the relative loudness of musical keys (by default, and in the context of this program, those of a standard 88-key piano). It does this by maintaining a matrix of Gaussian normal curves in memory, which look something like this:

[Image: a representation of just the middle C octave. Note that in the actual program, the full spectrum of octaves is represented.]

It then multiplies this matrix by a 1-row matrix of the incoming amplitudes of the audio samples' frequency bin values from Math::FFTW.

This, in essence, works by taking the Y values of each key's curve, multiplying them by the corresponding values in the FFT deconstruction, then summing the results.
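The idea can be sketched in Python like so (the Gaussian width, key frequency, and bin frequencies below are made-up illustration values, not the program's; the program itself does this with GSL matrices in Perl):

```python
import math

def key_curves(key_freqs, bin_freqs, sigma):
    """One Gaussian curve per key, sampled at each FFT bin frequency;
    rows are keys, columns are bins."""
    return [[math.exp(-((b - k) ** 2) / (2 * sigma ** 2))
             for b in bin_freqs] for k in key_freqs]

def key_loudness(curves, bin_amps):
    """Multiply each key's curve by the spectrum, then sum: that key's
    share of the incoming energy."""
    return [sum(w * a for w, a in zip(row, bin_amps)) for row in curves]

# One hypothetical "key" centered at 100 Hz against a flat two-bin spectrum;
# the distant 200 Hz bin contributes almost nothing.
loudness = key_loudness(key_curves([100.0], [100.0, 200.0], 10.0), [2.0, 2.0])
```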

Anyway, I'll get off of this digression and back to the problem at hand.

Any way you slice it, this whole process involves a LOT of matrix math between HUGE matrices, including constructing many new matrices that live for only a moment before being multiplied and discarded.

At first I chose a simple and easy-to-use matrix math library, Math::MatrixReal. This was a little slow to begin with, plus I had to code in a workaround for creating the aforementioned huge matrices, but it worked... But it was S-L-O-W.

I decided to switch to a C library that would hopefully get the job done faster. I found, much to my delight, Math::GSL, which had a specific library for dealing with matrices at a very low level. It even had an extensive basic linear algebra system (BLAS) to do exactly what I had in mind, not to mention that it was the motherfucking Gnu Scientific Library I would be playing with. Yay!

After I switched, I noticed that when the program was processing audio samples, instead of using ~200 megs of memory, it was now using ~70 megs of memory.

All was not right, though. As the program got further and further into processing a file, it took up more and more memory, once ballooning out to 2 gigs. This happened especially when I was processing multiple files at once.

Long story short, I figured out that it was the GSL matrices that were staying latent in memory long after the program's Math::GSL::Matrix objects were destroyed. Luckily I found a way to solve it:

use Math::GSL::Matrix qw/ :all /;

my $matrix = Math::GSL::Matrix->new($rows, $cols);

# ...

# clean up!
&gsl_matrix_free($matrix->raw);

Now that I've solved this problem, it should purr like a kitten.

Thursday, February 5, 2009

And so it begins. After two years of sitting in the dusty corners of my hard drive, I have started—after being inspired by my newfound unemployment—to take up the old reins on programming my automatic musical genre classification program, Audicon.


The vast majority of the reasons I originally gave up on it came down to forgetting the old axiom:


The most important early product on the way to developing a good product is an imperfect version.

In essence, I had too many ideas about how to improve the existing code, whether by exhaustively/obsessively searching for a better way to write a small portion of it, or, in the case of the entire project, getting fed up with the limited syntax of Scilab's programming language and opting to write it in a language I knew could do the job with style (i.e. Perl).


More details on how this thing works to come...