Standard approaches for video (audio) and its transcript

Note (May 2013): At a SIG meeting, I was told that making subtitles from transcripts by forced alignment is a well-known and easy technique. Forced alignment has already been implemented in standard speech recognition engines.

Using one such implementation, the one in HTK, voxforge.org explains how to segment audio automatically by forced alignment.
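As a rough illustration of the final step of this approach, the sketch below turns word-level alignment results into SubRip (SRT) subtitle text. It is not taken from HTK or voxforge.org: the input format (a list of `(word, start_sec, end_sec)` tuples), the function names `srt_time` and `words_to_srt`, and the grouping parameters `max_words` and `max_gap` are all assumptions made for the example. A real aligner's output would first have to be converted to this form.

```python
from typing import List, Tuple

# Assumed aligner output: (word, start time in seconds, end time in seconds).
Word = Tuple[str, float, float]


def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7, max_gap: float = 0.6) -> str:
    """Group aligned words into cues and render the cues as SRT text.

    A new cue starts when the current cue already holds max_words words,
    or when the pause before the next word exceeds max_gap seconds.
    Both thresholds are illustrative defaults, not recommended values.
    """
    cues: List[List[Word]] = []
    for w in words:
        if cues and len(cues[-1]) < max_words and w[1] - cues[-1][-1][2] <= max_gap:
            cues[-1].append(w)
        else:
            cues.append([w])

    blocks = []
    for i, cue in enumerate(cues, start=1):
        text = " ".join(word for word, _, _ in cue)
        # A cue spans from the start of its first word to the end of its last.
        blocks.append(f"{i}\n{srt_time(cue[0][1])} --> {srt_time(cue[-1][2])}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    aligned = [("hello", 0.0, 0.4), ("world", 0.45, 0.9), ("again", 2.0, 2.5)]
    print(words_to_srt(aligned))
```

Splitting on pauses as well as on word count keeps a cue from spanning a long silence, which is one simple way to keep subtitles in step with the speech.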

After the meeting, I found an automatic and accurate captioning system called Autocap, released in or before 2010. Details of the system are explained in A. Knight and K. Almeroth (2010), "Fast Caption Alignment for Automatic Indexing of Audio," International Journal of Multimedia Data Engineering & Management (IJMDEM), vol. 1, no. 2, pp. 1-17. The system processes a video in less time than the video's duration, and is applicable to both online and offline captioning.

On the other hand, the subtitles on this site were made with a simple and lightweight method, because I was simply unaware of these more sophisticated approaches.