Wednesday, April 20, 2011

How is audio represented with numbers?

I like thinking about how everything can be and is represented by numbers. For example, plaintext is represented by a code like ASCII, and images are represented by RGB values. These are the simplest ways to represent text and images.

What is the simplest way that audio can be represented with numbers? I want to learn how to write programs that work with audio and thought this would be a good way to start. I can't seem to find any good explanations on the internet, though.

From stackoverflow
  • I think samples of the waveform at a specific sample frequency would be the most basic representation.

    Jimmy : http://en.wikipedia.org/wiki/Audio_file_format seems to indicate this as well
    Tim Lesher : Correct--the technical term for this is "linear pulse-code modulation."
  • Have you ever looked at a waveform close up? The Y-axis is simply represented as an integer, typically in 16 bits.

  • Audio can be represented by digital samples. Essentially, a sampler (also called an analog-to-digital converter, or ADC) grabs a value of the audio signal every 1/fs seconds, where fs is the sampling frequency. The ADC then quantizes the signal, which is a rounding operation. So if your signal ranges from 0 to 3 volts (the full-scale range), each sample is rounded to, for example, a 16-bit number. In this example, a 16-bit number is recorded once every 1/fs seconds.

    So, for example, most WAV/MP3 files sample the audio signal at 44.1 kHz. I don't know how much detail you want, but there's this thing called the "Nyquist sampling rate" that says the sampling frequency must be at least twice the highest frequency you want to capture. So in your WAV/MP3 file you are at best going to be able to hear frequencies up to about 22 kHz.

    There is a lot of detail you can go into in this area. The simplest form would certainly be the WAV format, which is uncompressed audio. Formats like MP3 and Ogg have to be decompressed before you can work with them.
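
    As a sketch of those two ideas (illustrative Python, not from the answer; the 8 kHz rate and 1 kHz tone are arbitrary choices): sampling grabs one amplitude value every 1/fs seconds, and a tone above the Nyquist limit fs/2 produces the same samples as a lower "alias" frequency.

```python
import math

fs = 8000   # sampling frequency in Hz (hypothetical choice for illustration)
f = 1000    # tone frequency in Hz

# An ADC grabs one amplitude value every 1/fs seconds:
samples = [math.sin(2 * math.pi * f * n / fs) for n in range(fs)]

# A tone above the Nyquist limit (fs/2) is indistinguishable from a lower
# alias: sampled at 8 kHz, a 7 kHz tone gives the mirror image of 1 kHz,
# so the two sequences cancel sample for sample.
alias = [math.sin(2 * math.pi * (fs - f) * n / fs) for n in range(fs)]
print(max(abs(a + b) for a, b in zip(samples, alias)))  # ~0
```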

  • Look up things like analog-to-digital conversion. That should get you started. These devices convert an audio signal (a waveform, such as a sine wave) into digital representations. So a 16-bit ADC would be able to represent values from -32768 to 32767. This is fixed-point. It is also possible to do it in floating point (not usually recommended for performance reasons, but it may be needed for range reasons). The opposite, digital-to-analog conversion, happens when we convert numbers back into a waveform. This is handled by something called a DAC.
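
    A minimal sketch of that quantization step (illustrative Python, not from the answer; scaling by 32767 is one common fixed-point convention):

```python
def quantize_16bit(x):
    """Round an analog value in [-1.0, 1.0] to a 16-bit signed integer."""
    return max(-32768, min(32767, round(x * 32767)))

def dequantize_16bit(q):
    """Convert back to a float; the rounding error is the quantization noise."""
    return q / 32767.0

x = 0.5
q = quantize_16bit(x)
err = abs(dequantize_16bit(q) - x)   # bounded by half a quantization step
print(q, err)
```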

  • I think a good way to start playing with audio would be with Processing and Minim. This program will draw the frequency spectrum of sound from your microphone!

    import ddf.minim.*;
    import ddf.minim.analysis.*;
    
    AudioInput in;
    FFT fft;
    
    void setup()
    {
      size(1024, 600);
      noSmooth();
      Minim.start(this);          // initialize Minim (older static API)
      in = Minim.getLineIn();     // open the default audio input (microphone)
      fft = new FFT(in.bufferSize(), in.sampleRate());
    }
    
    void draw()
    {
      background(0);
      fft.forward(in.mix);        // compute the spectrum of the current buffer
      stroke(255);
      // one vertical line per frequency band, scaled up for visibility
      for(int i = 0; i < fft.specSize(); i++)
        line(i*2+1, height, i*2+1, height - fft.getBand(i)*10);
    }
    
    void stop()
    {
      in.close();                 // release the audio input before exiting
      Minim.stop();
      super.stop();
    }
    
    
  • The simplest way to represent sound as numbers is PCM (pulse-code modulation). This means that the amplitude of the sound is recorded at a set frequency (each amplitude value is called a sample). CD-quality sound, for example, is 16-bit samples (in stereo) at a frequency of 44,100 Hz.

    A sample can be represented as an integer (usually 8, 12, 16, 24 or 32 bits) or as a floating-point number (usually a 32-bit float or a 64-bit double). The number can be either signed or unsigned.

    For 16-bit signed samples the value 0 would be in the middle, and -32768 and 32767 would be the maximum amplitudes. For 16-bit unsigned samples the value 32768 would be in the middle, and 0 and 65535 would be the maximum amplitudes.

    For floating point samples the usual format is that 0 is in the middle, and -1.0 and 1.0 are the maximum amplitudes.

    The PCM data can then be compressed, for example using MP3.
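
    The three conventions above can be sketched as (illustrative Python, not from the answer):

```python
def signed_to_unsigned_16(s):
    """Signed sample (-32768..32767) -> unsigned (0..65535): shift the midpoint."""
    return s + 32768

def signed_to_float(s):
    """Signed 16-bit sample -> float with 0.0 in the middle, about [-1.0, 1.0]."""
    return s / 32768.0

# Midpoint and extremes under each representation:
for s in (-32768, 0, 32767):
    print(s, signed_to_unsigned_16(s), signed_to_float(s))
```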

  • Physically, as you probably know, audio is a vibration. Typically, we're talking about vibrations of air between approximately 20 Hz and 20,000 Hz. That means the air is moving back and forth 20 to 20,000 times per second.

    If you measure that vibration and convert it to an electrical signal (say, using a microphone), you'll get an electrical signal whose voltage varies in the same waveform as the sound. For a hypothetical pure tone, that waveform matches the sine function.

    Now, we have an analogue signal, the voltage. Still not digital. But, we know this voltage varies between (for example) -1V and +1V. We can, of course, attach a volt meter to the wires and read the voltage.

    Arbitrarily, we'll change the scale on our volt meter: we'll multiply the volts by 32767. It now calls -1V "-32767" and +1V "32767". Oh, and it'll round to the nearest integer.

    Now, we hook our volt meter to a computer, and instruct the computer to read the meter 44,100 times per second. Add a second volt meter (for the other stereo channel), and we now have the data that goes on an audio CD.

    This format is called stereo 44,100 Hz, 16-bit linear PCM. And it really is just a bunch of voltage measurements.
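
    That recipe (44,100 16-bit readings per second, on two channels) can be sketched with Python's standard wave module; the 440 Hz tone and one-second duration are arbitrary choices:

```python
import io
import math
import struct
import wave

fs = 44100                 # 44,100 "volt meter" readings per second
f = 440.0                  # test tone frequency (arbitrary choice)
n_frames = fs              # one second of audio

# Each reading is the waveform's amplitude scaled to a 16-bit integer.
frames = bytearray()
for n in range(n_frames):
    v = int(32767 * math.sin(2 * math.pi * f * n / fs))
    frames += struct.pack('<hh', v, v)   # same reading on both stereo channels

# Package the raw PCM as a WAV file (an in-memory buffer here; a real
# file name like 'tone.wav' would work the same way).
buf = io.BytesIO()
with wave.open(buf, 'wb') as w:
    w.setnchannels(2)      # stereo
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(fs)     # 44,100 Hz
    w.writeframes(bytes(frames))
```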

    yjerem : Very nice explanation, thanks!
