Recently I have been investigating tools to help automate at least part of the transcription process for the audiovisual materials my institution makes available online. As we have hundreds (thousands?) of hours' worth of A/V content already digitally available via our various platforms, and are actively producing more, automating a portion of the transcription process makes both logical and fiscal sense for us.
The tool I have been most actively experimenting with is the IBM Watson Speech to Text service. I should note that this post is simply a reflection of my thoughts on this process so far and is not intended as an explicit recommendation or endorsement of Watson over any of the other similar tools. This testing involved signing up for a free (non-paid) account on IBM Watson.
To generate video subtitles using Watson, I have been using two scripts. The first, written in bash, handles the actual piping of the file to Watson: it converts the input to 16 kHz mono audio, uploads it to Watson, and stores the raw Watson output in a .json file. It also creates a rough dump of the raw transcript as plain text. The second, written in Ruby, takes the raw Watson output and parses it into four-second segments formatted as a .vtt subtitle file. This file can then be used directly with its corresponding video to provide a rough subtitle track.
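As a rough illustration of the conversion step only, here is a minimal Ruby sketch that builds a command for downmixing a file to 16 kHz mono. It assumes ffmpeg is the conversion tool, which this post does not actually specify, and the file names are hypothetical. The upload to Watson is left out because it depends on account credentials and endpoint details.

```ruby
# Build (but do not run) an ffmpeg command that extracts the audio track
# as 16 kHz mono. Assumption: ffmpeg is installed and is the chosen tool.
def ffmpeg_command(input, output)
  [
    'ffmpeg',
    '-i', input,    # source audio or video file
    '-vn',          # drop any video stream
    '-ac', '1',     # one audio channel (mono)
    '-ar', '16000', # 16 kHz sample rate
    output
  ]
end

# Hypothetical file names, for illustration only.
cmd = ffmpeg_command('interview.mp4', 'interview.flac')
puts cmd.join(' ')

# To actually run the conversion:
#   system(*cmd) or raise 'ffmpeg failed'
```

Passing the command as an array to system avoids shell quoting problems with file names that contain spaces.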
I made the decision to parse the text into four-second segments with the idea of minimizing the amount of intervention required by someone proofreading the .vtt files. Standardizing the text chunks eliminates the need for any manipulation of text timing (which also makes it easier to do the editing in a basic text editor rather than specialized software). Hopefully, having the editor focus only on textual content will save time and money (as well as the sanity of the proofreaders).
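The four-second chunking could be sketched in Ruby roughly as follows. This is not the actual script, just a minimal illustration under some assumptions: that the Watson JSON contains a results array whose first alternative carries a timestamps list of [word, start, end] entries (as returned when word timestamps are requested), and that cues sit on a fixed 0-4, 4-8, ... grid with each word placed by its start time. The helper names and the sample data are hypothetical.

```ruby
require 'json'

SEGMENT_SECONDS = 4.0

# Format a number of seconds as a WebVTT timestamp (HH:MM:SS.mmm).
def vtt_time(seconds)
  ms = (seconds * 1000).round
  format('%02d:%02d:%02d.%03d',
         ms / 3_600_000, (ms / 60_000) % 60, (ms / 1000) % 60, ms % 1000)
end

# Collect [word, start, end] triples from the first alternative of each result.
def word_timings(watson)
  watson.fetch('results', []).flat_map do |result|
    result.fetch('alternatives', []).first.to_h.fetch('timestamps', [])
  end
end

# Group words into fixed-length windows and emit one cue per non-empty window.
def to_vtt(watson, segment = SEGMENT_SECONDS)
  buckets = word_timings(watson).group_by { |_word, start, _stop| (start / segment).floor }
  cues = buckets.sort.map do |index, words|
    from = vtt_time(index * segment)
    to = vtt_time((index + 1) * segment)
    "#{from} --> #{to}\n#{words.map(&:first).join(' ')}"
  end
  "WEBVTT\n\n#{cues.join("\n\n")}\n"
end

# Made-up sample in the assumed shape of the Watson output.
sample = JSON.parse(<<~JSON)
  {"results": [{"alternatives": [{
    "transcript": "hello there archivist",
    "timestamps": [["hello", 0.5, 0.9], ["there", 1.0, 1.4], ["archivist", 4.2, 5.0]]
  }]}]}
JSON

puts to_vtt(sample)
```

Because every cue boundary falls on a multiple of four seconds, a proofreader only ever has to touch the text lines and can leave the timing lines alone.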
So far the results, while far from perfect, have honestly been better than I expected them to be! The quality of the transcript is very dependent on the style of speech employed by the speakers in the video. As the speech model seems to be tuned towards a very particular kind of American English, it struggles with speakers who speak in a more ‘casual’ style (such as frequently interjecting ‘ya know’), where it will mis-assign and make up words. Proper nouns and non-English words are also mis-assigned frequently. A word that Watson often struggles with is ‘Archivist’, which is unfortunate given the context. That being said, I have been surprised that for several of the videos I have tested there have been broad segments (again, dependent on speaker) that produced extremely usable subtitles even with no proofreading.
To show an example of a completely human-created transcript compared with an unedited Watson-created transcript, I have uploaded two .vtt files to GitHub Gist, with the Watson transcript here and the professionally transcribed file here. The original video is viewable on our Vimeo page here. I think this auto-transcript is relatively representative of my testing so far, being of medium accuracy.