MP3 and OGG Audio Compression

GSBS Course GS00-0610

Digital Signal Processing

Special Project in Medical Physics

Professor Richard Wendt

Aziz H. Poonawalla

13th August 2001

Introduction

Lossy Audio Compression

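Audio compression is a widespread application of digital signal processing. Standard digital audio is represented as a time-domain waveform, whose storage requirements are massive. A single audio compact disc (CD), for example, has a storage capacity of approximately 600 MB. The amount of "CD quality" audio (sampled at 44.1 kHz, two stereo channels, 16 bits per sample) that a single CD can hold follows from a simple calculation:

$$ t = \frac{600\ \text{MB}}{44100\ \text{samples/s} \times 2\ \text{channels} \times 2\ \text{bytes/sample}} = \frac{600\ \text{MB}}{176400\ \text{bytes/s}} \approx 3500\ \text{s} $$

(3401 s if 1 MB is taken as 10^6 bytes, 3567 s if it is taken as 2^20 bytes).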

Therefore, a single audio CD can hold roughly 3500 seconds' worth (just under one hour) of stereo music. If we assume a typical selection to be 3 minutes long (180 seconds, or about 30 MB of uncompressed data), it would take over an hour to download at the standard modem speed of 56 kbps. Audio is stored on standard CDs in the "Red Book" format, and is usually stored on a computer disk in the WAV format as Pulse Code Modulation (PCM) data; these representations are essentially equivalent.
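The size and download-time figures for a three-minute selection can be checked with a few lines of MATLAB. This is only a sketch of the arithmetic: it assumes 1 MB = 2^20 bytes and a modem that sustains its nominal 56 kbps for the whole transfer, and the variable names are chosen here for illustration.

```matlab
% Assumptions for illustration: 1 MB = 2^20 bytes, and the modem sustains
% its nominal 56 kbps (56000 bits per second) for the whole transfer.
fs          = 44100 ;                         % samples per second, per channel
nChannels   = 2 ;                             % stereo
bytesPerSmp = 2 ;                             % 16-bit samples
byteRate    = fs * nChannels * bytesPerSmp ;  % bytes of PCM data per second

clipSeconds = 180 ;                           % the three-minute selection
clipBytes   = clipSeconds * byteRate ;        % uncompressed size in bytes
clipMB      = clipBytes / 2^20 ;              % uncompressed size in MB
dlMinutes   = clipBytes * 8 / 56000 / 60 ;    % download time in minutes

fprintf('PCM data rate : %d bytes/s\n', byteRate) ;
fprintf('3-minute clip : %.1f MB\n', clipMB) ;
fprintf('download time : %.1f minutes at 56 kbps\n', dlMinutes) ;
```

Under these assumptions the clip comes to about 30 MB and takes roughly 76 minutes to download.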

Lossy audio compression relies on the fact that much of the audio signal is redundant, or irrelevant to how the audio signal is perceived by the human ear. By discarding this information, a much smaller amount of data can be used to represent the same audio information, and that data is interpolated to approximate the original signal. These calculations are typically performed in the frequency domain.

MP3 Audio Compression

The MP3 codec (coder-decoder) was originally patented by the German research institute Fraunhofer IIS. The format was standardized by the Moving Picture Experts Group (MPEG), whose initials give the algorithm its name; the 3 stands for Audio Layer 3, the successor to the earlier Layers 1 and 2. A diagram from the official site gives a high-level description of how the algorithm works:

(source: http://www.iis.fhg.de/amm/techinf/layer3/index.html)

Two elements of the diagram underlie the technique. First, the incoming data stream is transformed using the Modified Discrete Cosine Transform (MDCT), which is commonly used in the audio processing literature for spectrum analysis. Simultaneously, the data is analyzed by a "Perceptual Model" which models the human ear in order to determine which frequencies and harmonics are actually perceived while listening. The output of the Perceptual Model is used to weight the filtering step. A more detailed description of the algorithm is found at the Fraunhofer website [1].

The advantage of this codec is that an order of magnitude of compression can be achieved. With standard time-domain audio (such as WAV or Red Book), one minute of CD-quality stereo audio takes up about 10 MB of data, whereas with MP3 it occupies only about 1 MB. However, due to the patent held by Fraunhofer IIS, this codec is not freely available and carries licensing fees. Fraunhofer has recently demanded royalties from freeware MP3 encoders, further lessening the value of MP3 to the academic and artistic communities [2,3].

OGG Audio Compression

The OGG Vorbis codec is a new one, developed within the past two years by the Open Source community. It is free software: the encoding tools are released under the Free Software Foundation's General Public License (GPL), and the format specification itself is in the public domain. The advantages of Free Software are explained in detail on the GNU Project's website [4]. OGG is the general name for a project to produce free, open codecs for all kinds of multimedia, and Vorbis is the specific OGG project devoted to audio compression. More information can be found in the Frequently Asked Questions document at the Vorbis.com website [5].

The compression codec is similar in general concept to the MP3 approach, in that it also uses an MDCT filter, as well as a "Psychoacoustic Model" to model human hearing and discard redundant and irrelevant data. A detailed overview is not available, but the actual software is freely accessible and can be examined directly for more information on the technique [6]. There is some question as to whether OGG would survive a legal challenge by Fraunhofer regarding patent infringement. The Vorbis file format also supports additional capabilities not found in MP3.

Bit Rate

The bit rate is the amount of data used to represent each second of audio. A rate of 192 kbps is generally considered adequate for faithful representation of CD-quality audio. Two types of bit rate encoding exist: Constant (CBR) and Variable (VBR). Constant means that the bit rate is fixed for the entire duration of the selection, so the output size is simply the bit rate multiplied by the duration. Variable encoding is a more efficient method in which the data is analyzed and the bit rate is chosen on the fly to match the complexity of the data. Sections of the audio containing a simple tone or even silence, for example, do not need as many bits to represent as sections containing multiple voices, instruments, etc. For VBR encoding, the nominal bit rate is actually an "average" or "minimum" (depending on settings) that the encoder attempts to maintain. The MP3 codec supports both CBR and VBR, but the beta version of OGG, intended for testing, supports only VBR. Full CBR and arbitrary VBR rate functionality will be offered in the first full release of OGG Vorbis.
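The relation between bit rate and file size can be sketched as follows. The 192 kbps rate is the one quoted above; the three-minute duration and the per-frame VBR rates are hypothetical values chosen for illustration, not measurements from either encoder.

```matlab
% Illustrative values only: 192 kbps is the rate quoted in the text; the
% per-frame VBR rates below are hypothetical, not measured from any encoder.
duration   = 180 ;                        % seconds (a three-minute selection)
cbrRate    = 192e3 ;                      % constant bit rate, bits per second
cbrBytes   = cbrRate * duration / 8 ;     % CBR size depends only on rate and duration
pcmBytes   = 44100 * 2 * 2 * duration ;   % uncompressed 16-bit stereo PCM size
ratio      = pcmBytes / cbrBytes ;        % compression ratio relative to PCM

% VBR: each frame gets its own rate; equal-length frames are assumed here
frameRates = [ 64 128 192 256 ] * 1e3 ;   % hypothetical per-frame rates, bits/s
vbrAverage = mean(frameRates) ;           % average rate over the selection

fprintf('CBR size    : %d bytes (ratio %.2f to 1)\n', cbrBytes, ratio) ;
fprintf('VBR average : %d bits/s\n', vbrAverage) ;
```

With CBR the size is known before encoding begins, whereas with VBR only the average over the whole selection is meaningful.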

Materials and Methods

Three 30-second audio samples were chosen to represent three types of data: speech, pure music, and song (a mixture of both). Each was extracted from an audio CD in 16-bit stereo, sampled at 44.1 kHz, and saved in the WAV file format (of fixed size). The audio extraction from the CD Red Book Audio format to the WAV format is lossless and noise-free, since these are isomorphic representations of the data.

Each source WAV file was encoded into both OGG and MP3 formats using generally-available tools [7,8] and the following parameters:

The OGG encoder, since it is still a beta release, only supports variable bit-rate encoding, which meant that for a fair comparison VBR had to be used for both formats.

The MP3 and OGG output for each of the three audio types (speech, music, and song) was then converted back to WAV files using the direct write-to-disk method of the WinAmp decoder [9] (and the OGG plugin [10]). This is also a lossless step, since the decoder already performs the interpolation from the OGG or MP3 format to the time-domain signal that drives the computer speakers; the output audio stream is simply redirected to disk instead of line-out. This is diagrammed as:

Note that in the diagram above, the black boxes represent the lossy stage, and the regular arrows represent lossless stages.

For the first comparison, listening tests were used as a subjective measure of "Audio Quality". The output WAV files were used instead of the OGG and MP3 files because the information is identical and this simplified the playback software requirements. Similar tests have been carried out in the past but have not compared equivalent encoding techniques [11,12].

The second comparison was more quantitative. The input and output WAV files were imported into MATLAB and analyzed using the Fast Fourier Transform. Both frequency and phase analyses were performed.

Results and Discussion

Data Size

Each of the source WAV files was expected to have a sample length of N = 1323000 (= 30 seconds * 44100 samples/sec), which the loaded data matched to within one sample. The data sizes in MATLAB (samples x channels, including both channels) were:

Data size     Speech         Song           Music
Source WAV    1323001 x 2    1323001 x 2    1323001 x 2
OGG WAV       1327097 x 2    1327097 x 2    1327097 x 2
MP3 WAV       1324800 x 2    1324800 x 2    1324800 x 2

Each source WAV file contained one sample more than expected (1323001 rather than 1323000), which was attributed to a header written by the data extraction software. The OGG and MP3 encoders each produced the same output length for all three selections, meaning that whatever the source of the size discrepancy, it was not related to the actual information content, only to the original file's sampling rate, bit depth, and duration.

The output files were not uniformly sized due to the restriction of VBR encoding. The sample counts of the output WAV files were greater than those of the inputs, even though the total length of the selections was constant at 30 seconds and the sampling rates and bit depths were constant. This suggests that both OGG and MP3 impose some frequency scaling on the data, which made direct spectral arithmetic (for example, difference or average) impossible without further processing. Part of the discrepancy for MP3 was "padding" of near-zero (noisy) samples at the beginning and the end of the data vector, but the actual time duration was identical, suggesting that the effective sampling frequency is not exactly 44.1 kHz for either method.

Subjective Comparison

The input and output audio selections were compared using standard computer speakers in a closed room by six independent observers, all of whom agreed that no real difference in sound quality was audible.

[Table: audio clips of each selection (Speech, Music, Song) in its Input, OGG, and MP3 versions.]

Spectral Analysis

As noted, the compression codecs’ output gave differing file sizes in the final WAV output, which suggests that the sampling frequency was no longer precisely 44.1 kHz since the output duration was still constant at 30 seconds. This meant that direct subtraction/arithmetic of spectra was less meaningful because the information content in each frequency channel was shifted by non-integer amounts relative to the input.

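This scaling can be represented mathematically. If the Fourier transform pair is:

$$ x[n] \leftrightarrow X(f) $$

then the scaling we observed in the time domain is represented as:

$$ y[n] = x[an - b] $$

where a is the actual scaling and b represents a phase offset. The FFT of this results in:

$$ Y(f) = \frac{1}{|a|}\, e^{-j 2\pi f b / a}\, X\!\left(\frac{f}{a}\right) $$

In our data, the scaling imposed by the codecs did not depend on the information content, only on the original sampling frequency and bit depth. For the OGG files, the value of a implied by the ratio of output to input length was 1327097 / 1323001, or approximately 1.0031.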

For MP3, the scaling was smaller, since zero padding in the time domain accounted for most of the extra samples. In either case the effect is not large, which suggests that the information in each frequency channel is likely to be matched reasonably well. It is nevertheless a source of noise whose significance is difficult to quantify. In the difference plots below, the largest differences are at the higher frequencies, which is logical given that the bulk of the information in the signal is at lower frequencies. By analogy with MRI, the center of "k-space" contains the low-frequency information, and even small "keyholes" of data at the center can still produce a reasonably recognizable image compared to the full data set. The amplitude scaling is irrelevant, since the vertical scale of the spectra is in arbitrary units and we are more interested in relative than in absolute amplitudes.
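The size of the length discrepancy can be read directly from the sample counts in the Data Size table. The sketch below computes the extra samples and the output-to-input length ratios for both codecs. Treating this ratio as the scaling follows the interpretation used in this section and is an assumption, since part of the extra length (for MP3 at least) is padding rather than scaling.

```matlab
% Sample counts per channel, taken from the Data Size table
nOrig = 1323001 ;
nOgg  = 1327097 ;
nMp3  = 1324800 ;

extraOgg = nOgg - nOrig ;       % extra samples in the decoded OGG output
extraMp3 = nMp3 - nOrig ;       % extra samples in the decoded MP3 output
ratioOgg = nOgg / nOrig ;       % output-to-input length ratio, OGG
ratioMp3 = nMp3 / nOrig ;       % output-to-input length ratio, MP3

fprintf('OGG: %d extra samples, ratio %.4f\n', extraOgg, ratioOgg) ;
fprintf('MP3: %d extra samples, ratio %.4f\n', extraMp3, ratioMp3) ;
```

Both ratios differ from unity by well under one percent, consistent with the statement that the effect is small.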

To compensate, the FFT length was chosen to be N = 1327104, which is the smallest multiple of 8 larger than the maximum data size. Since all data vectors were shorter than this, the spectra were interpolated by standard zero-padding. The benefit of this approach was that it unified the data sizes, allowing direct subtraction of spectra for comparison. The disadvantage is that the input signal also had to be scaled and zero-padded, so we are no longer comparing the true source file to the output, which runs counter to the purpose of the analysis. The spectra and the spectral differences for each audio selection are shown below. The vertical scale of the spectral plots is logarithmic.
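The zero-padding step can be illustrated on a toy example. In the sketch below, two short hypothetical vectors of unequal length stand in for the input and decoded signals; they are not the audio data. Passing a length argument to fft pads each vector with trailing zeros, so both spectra come out the same size and can be subtracted.

```matlab
% Hypothetical stand-ins for the input and decoded signals (not the audio data)
x = sin( 2*pi*(0:99)' / 10 ) ;             % "input": 100 samples
y = [ zeros(5,1) ; x ; zeros(3,1) ] ;      % "output": same data with padding, 108 samples

Lmax = max( length(x), length(y) ) ;       % longest vector
Nfft = 8 * ceil( Lmax / 8 ) ;              % round up to a multiple of 8

X = abs( fft( x, Nfft ) ) ;                % fft pads with trailing zeros to Nfft
Y = abs( fft( y, Nfft ) ) ;
d = log( X + eps ) - log( Y + eps ) ;      % log-spectral difference, now well defined

% the same rounding applied to the longest audio vector gives the FFT length used
NfftAudio = 8 * ceil( 1327097 / 8 ) ;
fprintf('toy FFT length %d, audio FFT length %d\n', Nfft, NfftAudio) ;
```

In this toy case the two magnitude spectra agree to within rounding error, because padding before or after the data changes only the phase of the spectrum.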

Spectral plots for SPEECH audio selection

Spectral difference plots for SPEECH audio selection

Spectral plots for MUSIC audio selection

Spectral difference plots for MUSIC audio selection

Spectral plots for SONG audio selection

Spectral difference plots for SONG audio selection

Phase Analysis

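The phase was also computed for each audio selection and compared to the input for each codec. In all cases, the OGG encoding preserved the phase behavior more faithfully than the MP3 encoding. Both codecs added a phase shift, with the shift from MP3 being larger than the shift from OGG. For a time-domain scaling a and offset b, this phase shift can be expressed as:

$$ \angle Y(f) = \angle X(f/a) - \frac{2\pi f b}{a} $$

where X(f) and Y(f) are the spectra of the input and of the decoded output, respectively.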

The phase plots are reproduced below. Note that the MP3 codec more strongly deviates from the true phase at higher frequencies. As with the frequency analysis, this is likely because the higher frequencies allow more room for modification because the bulk of the information resides at low frequency.
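The way a time offset appears as a phase shift that grows with frequency can be checked on a toy signal. The sketch below is only an illustration under stated assumptions: the test signal is hypothetical, the delay is circular, and there is no scaling (a = 1), so the expected phase change at bin k is -2*pi*k*b/N.

```matlab
% Hypothetical test signal (not the audio data): two cosines at bins 4 and 9
N = 64 ;                                   % signal length
b = 3 ;                                    % delay in samples
n = (0:N-1)' ;
x = cos( 2*pi*4*n/N ) + 0.5*cos( 2*pi*9*n/N ) ;
y = circshift( x, b ) ;                    % y[n] = x[n - b], circular delay

X = fft(x) ;
Y = fft(y) ;

k        = [ 4 ; 9 ] ;                     % bins where the signal has energy
dphi     = angle( Y(k+1) ./ X(k+1) ) ;     % measured phase change at those bins
expected = angle( exp( -1j*2*pi*k*b/N ) ) ;% linear-phase prediction, wrapped to (-pi, pi]

disp( [ k dphi expected ] ) ;
```

The measured and predicted columns agree, and the shift at bin 9 is larger than at bin 4, which mirrors the larger phase deviations seen at higher frequencies.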

Phase plots for SPEECH audio selection

Phase plots for MUSIC audio selection

Phase plots for SONG audio selection

Conclusion

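Both MP3 and OGG encoding impart a phase shift and frequency scaling to the original input data. For an input signal x[n] with spectrum X(f), the corresponding scaling a and shift b, and their effect on the spectrum, are represented mathematically as:

$$ x[an - b] \leftrightarrow \frac{1}{|a|}\, e^{-j 2\pi f b / a}\, X\!\left(\frac{f}{a}\right) $$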

The audio quality of the compressed audio was essentially identical to the original input when encoded according to CD-quality standards. Both the Perceptual Model of MP3 and the Psychoacoustic Model of OGG likely operate mostly at higher frequencies, from inspection of the spectral differences (though there is some error in that analysis due to the frequency scaling). From inspection of the phase, we see that the MP3 model deviates from the original data more strongly than OGG, again at higher frequencies.

The analysis was also limited by the fact that OGG only supports VBR encoding in beta release. When CBR encoding is offered on OGG, a more robust analysis can be performed, which will not suffer the frequency scaling effect. Also, to better characterize the codecs’ effects at high frequencies, data such as classical music or real-world sounds should be incorporated into the audio selections. A full range of bitrates should also be attempted on each selection in order to see what the phase and frequency implications are and to separate out any effects from the current analysis.

 

APPENDIX: MATLAB Code

Below is the MATLAB script used to generate the various plots seen in the quantitative analysis section.

% SCRIPT - process audio selections
%
% Raw data: 30 sec, 44.1 kHz, 16 bits, stereo
%   30 * 44100 = 1323000 samples per channel (1323001 as read from file)
%   1323001 samples * 2 channels * 8 bytes (double) = 21168016 bytes
%
% Array sizes after loading (identical for speech, music, and song):
%   mp3   1324800x2   21196800 bytes   double array
%   ogg   1327097x2   21233552 bytes   double array
%   orig  1323001x2   21168016 bytes   double array

fnames = { 'speech', 'music', 'song' } ;

% STEP 1 (run once, left commented out): read the WAV files and save each
% selection to its own .mat file. wavread was the WAV reader when this was
% written; newer MATLAB releases provide audioread instead.
%
% for jj = 1:length(fnames)
%     fname = fnames{jj} ;
%     orig = wavread([ 'listen/orig/orig-' fname '-stereo-44k-16bit.wav']) ;
%     mp3  = wavread([ 'listen/wav/mp3-' fname '-stereo-44k-16bit.wav']) ;
%     ogg  = wavread([ 'listen/wav/ogg-' fname '-stereo-44k-16bit.wav']) ;
%     save(fname, 'orig', 'mp3', 'ogg') ;
% end

% STEP 2: spectral and phase analysis
fs = 44100 ;            % sampling frequency, Hz
dt = 1 / fs ;           % sample spacing, seconds (2.2676e-05)
FFTlength = 1327104 ;   % max size of all data, rounded up to a multiple of 8

% frequency axis in Hz, matching the bin order produced by fftshift
df = ( (0:FFTlength-1) - FFTlength/2 ) / ( FFTlength * dt ) ;

% offset padding is effectively constant for mp3 format
% (counted from corrected wav length)
% +1800 : -1105 ... data ... +695
% (padding is not exactly zero due to noise)

for jj = 1:length(fnames)
    fname = fnames{jj} ;
    load(fname) ;       % loads orig, mp3, ogg saved in step 1

    % zero-padded FFT of the first channel, zero frequency centered
    origF = fftshift( fft( orig(:,1), FFTlength ) ) ;
    mp3F  = fftshift( fft( mp3(:,1) , FFTlength ) ) ;
    oggF  = fftshift( fft( ogg(:,1) , FFTlength ) ) ;

    % magnitude spectra
    origfft = abs(origF) ;
    mp3fft  = abs(mp3F) ;
    oggfft  = abs(oggF) ;

    % unwrapped phase spectra
    origphase = unwrap( angle(origF) ) ;
    mp3phase  = unwrap( angle(mp3F) ) ;
    oggphase  = unwrap( angle(oggF) ) ;

    % log-magnitude spectra
    figure(1) ; clf ;
    subplot(3,1,1) ; plot(df, log(origfft)) ; ylabel(['Input - ' fname]) ;
    axis([ -fs/2 fs/2 -10 10 ])
    subplot(3,1,2) ; plot(df, log(oggfft)) ; ylabel(['OGG - ' fname]) ;
    axis([ -fs/2 fs/2 -10 10 ])
    subplot(3,1,3) ; plot(df, log(mp3fft)) ; ylabel(['MP3 - ' fname]) ;
    axis([ -fs/2 fs/2 -10 10 ])
    xlabel('Frequency - Hz') ;
    saveas(gcf, [fname '-freq'], 'jpg')

    % log-spectral differences relative to the input
    figure(2) ; clf ;
    subplot(2,1,1) ; plot(df, log(origfft) - log(oggfft))
    axis([ -fs/2 fs/2 -10 10 ])
    ylabel(['difference OGG - ' fname]) ;
    subplot(2,1,2) ; plot(df, log(origfft) - log(mp3fft))
    axis([ -fs/2 fs/2 -10 10 ])
    ylabel(['difference MP3 - ' fname]) ;
    xlabel('Frequency - Hz') ;
    saveas(gcf, [fname '-diff'], 'jpg')

    % phase of input and both codecs on one set of axes
    figure(3) ; clf ; hold on
    plot(df, origphase, 'b') ;
    plot(df, oggphase , 'r') ;
    plot(df, mp3phase , 'g') ;
    xlabel('Frequency - Hz') ; ylabel('Unwrapped phase - radians') ;
    legend('Input', 'OGG', 'MP3', 'Location', 'NorthEastOutside') ;
    title(['PHASE - ' fname]) ;
    saveas(gcf, [fname '-phase'], 'jpg')
end % jj

REFERENCES

  1. http://www.iis.fhg.de/amm/techinf/layer3/index.html
  2. http://www.xiph.org/about.html#fraunhofer
  3. http://news.cnet.com/news/0-1005-200-4101023.html
  4. http://www.gnu.org/philosophy/free-sw.html
  5. http://www.vorbis.com/faq.psp
  6. http://www.xiph.org/ogg/vorbis/
  7. http://home.pi.be/~mk442837/lame370.zip
  8. http://www.vorbis.com/files/rc2/windows/oggenc-1.0rc2.zip
  9. http://www.winamp.com/
  10. http://www.winamp.com/plugins/detail.jhtml?componentId=60647
  11. http://www.washingtonpost.com/wp-dyn/articles/A55501-2001Jul12.html
  12. http://www.twice.com/html/pagebeta.cfm?InputKey=2853