A friend has problems redoing the voice-over in a video. The video or the audio always ends up too long, so one of them gets cut off to match the other, even though both recordings are the same duration. So I wrote this little explanation and a possible solution:

Short Primer on Audio in a Computer

An audio wave drawn below the corresponding video frames, with lines through it at regular intervals that indicate the sample points the computer records.

The purple line above is supposed to be a sound wave. The pictures at the top are video frames.

Now the computer can only handle individual numbers, not waveforms, so it simply checks several times per second how high the sound wave is.

So to save the start of the wave above to your video file, it writes a bunch of pixel images for the frames and the numbers 1.3, 2.5, 3.0, 3.1 and 2.8 to the file. This is imprecise. When the computer uses these numbers to recreate the sound wave, it looks more like this:

Lines drawn between the recorded sample points, which lead to a set of "stairs" that look like our original wave if you squint hard enough.

Luckily, the human ear isn’t perfect. If the computer checks often enough, you won’t hear the little “steps” in the curve; it will sound close enough.
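
If you want to see the "checking several times per second" idea in code, here is a minimal sketch. The 2 Hz sine wave and the names `wave`, `sample_wave` and `worst_step` are my own stand-ins for illustration, not anything a real audio program uses. It just shows that the more often you check, the smaller the jump between two neighbouring numbers gets.

```python
import math

def wave(t):
    # Hypothetical stand-in for the drawn sound wave: a 2 Hz sine.
    return math.sin(2 * math.pi * 2 * t)

def sample_wave(wave_fn, duration_s, sample_rate):
    """Check the height of the wave sample_rate times per second."""
    count = int(duration_s * sample_rate)
    return [wave_fn(i / sample_rate) for i in range(count)]

def worst_step(samples):
    """Largest jump between two neighbouring samples (the biggest 'stair')."""
    return max(abs(b - a) for a, b in zip(samples, samples[1:]))

coarse = sample_wave(wave, 1.0, 10)    # 10 checks per second
fine = sample_wave(wave, 1.0, 1000)    # 1000 checks per second

print("biggest step, coarse:", worst_step(coarse))
print("biggest step, fine:  ", worst_step(fine))
```

The fine version has much smaller steps, which is why a high enough sample rate sounds smooth.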

The Problem with Audio in Movies

But when the computer plays back your movie, it needs to play the right audio at the right time. It has to align these numbers with the video frames so that, for example, lip movements line up with speech.

Now, if you have 30 video frames in one second and 44000 audio samples in one second, that makes 1466 2/3 audio samples per video frame (44000 ÷ 30).

The problem here is that you can’t have a fraction of a sample. So the computer has to decide what to do with the extra 2/3. It usually just shifts the sample into the next frame, and that very quickly adds up, leading to the audio being longer than the video:

Video frames with little boxes under them, indicating audio samples. Some boxes cross two video frames

(Like above - green is the real audio, purple is what the computer makes from it)
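
To put rough numbers on this, here is a small sketch. It assumes a naive program that gives every video frame a whole number of samples by rounding up; real software may well handle the leftover differently, so treat the result as an illustration rather than a measurement. The function names are made up for this example.

```python
from fractions import Fraction
import math

def samples_per_frame(sample_rate, frame_rate):
    """Exact number of audio samples that belong to one video frame."""
    return Fraction(sample_rate) / Fraction(frame_rate)

def drift_seconds(sample_rate, frame_rate, video_seconds):
    """Extra audio length if every frame gets a whole number of samples.

    Assumption: the program rounds the per-frame count UP.
    """
    exact = samples_per_frame(sample_rate, frame_rate)
    per_frame = math.ceil(exact)                  # whole samples per frame
    frames = Fraction(frame_rate) * video_seconds
    extra_samples = (per_frame - exact) * frames  # leftover that piles up
    return extra_samples / sample_rate

print(samples_per_frame(44000, 30))   # 4400/3, i.e. 1466 2/3
print(drift_seconds(44000, 30, 60))   # 3/220 of a second after one minute
print(drift_seconds(48000, 30, 60))   # 0, nothing left over
```

Under that assumption the audio gains 600 samples per minute at 44000 samples per second, and the gap keeps growing the longer the video runs.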

If you have a sample rate of 48000 samples per second, that divides evenly by 30: 1600 audio samples per video frame.

So either somewhere in your video recording or exporting you had 44000 where you should have had 48000, or you had an odd frame rate. E.g. the NTSC TV you watch in the US uses 29.97 video frames per second, which makes it really hard to find a sample rate that divides evenly by it.

Also, many programs round that up to 30 fps, leading to issues because the picture will run just slightly slow.
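
How much does that rounding matter? The "29.97" is shorthand for 30000/1001 frames per second, so the sketch below uses that exact fraction. The frame count (roughly an hour of footage) and the helper name `clip_seconds` are just picked for illustration.

```python
from fractions import Fraction

NTSC_FPS = Fraction(30000, 1001)  # the exact rate usually written as 29.97

def clip_seconds(frame_count, fps):
    """How long a clip with this many frames lasts at the given frame rate."""
    return Fraction(frame_count) / Fraction(fps)

frames = 107892  # roughly one hour of NTSC footage

true_length = clip_seconds(frames, NTSC_FPS)   # timed at 29.97 fps
rounded_length = clip_seconds(frames, 30)      # timed as if it were 30 fps

difference = true_length - rounded_length
print(float(difference), "seconds of mismatch over about an hour")
```

That is about 3.6 seconds per hour, more than enough to make the picture and the sound visibly disagree.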

What this means in practice for video editing: make sure that everywhere your recording and editing software lets you specify a sample rate or frame rate, the values are the same, and that audio and video match (i.e. the sample rate divides evenly by the frame rate).
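
If you like, you can write that checklist down as a tiny script. This is only a sketch: the list of settings is something you would type in by hand after looking at each program's options, and `divides_evenly` and `check_settings` are names I made up, not part of any editing software.

```python
from fractions import Fraction

def divides_evenly(sample_rate, frame_rate):
    """True if each video frame gets a whole number of audio samples."""
    return (Fraction(sample_rate) / Fraction(frame_rate)).denominator == 1

def check_settings(settings):
    """settings: hand-entered list of (name, sample_rate, frame_rate) tuples."""
    problems = []
    # Every program in the chain should use the same pair of rates.
    rates = {(sample_rate, frame_rate) for _, sample_rate, frame_rate in settings}
    if len(rates) > 1:
        problems.append("not all programs use the same sample rate and frame rate")
    # And the sample rate should divide evenly by the frame rate.
    for name, sample_rate, frame_rate in settings:
        if not divides_evenly(sample_rate, frame_rate):
            problems.append(f"{name}: {sample_rate} does not divide evenly by {frame_rate}")
    return problems

# Example values only; replace them with what your own programs show.
my_settings = [
    ("recorder", 44000, 30),
    ("editor", 48000, 30),
]
for problem in check_settings(my_settings):
    print(problem)
```

An empty result means the settings agree with each other and the samples fit the frames exactly.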