
Opinion: open captioning in the AV universe

Ken Frommert, president of ENCO, and Eduardo Martinez, director of technology at StreamGuys, discuss how improvements in speed and accuracy are opening new doors for captioning live and streamed content.

The production and transmission of live manual captioning have long been challenged by high costs, availability, varied latency and inconsistent accuracy rates. Today, the transition to more automated, software-defined captioning workflows has introduced a new series of challenges.

This is true of both closed and open captioning. Closed captioning is traditionally associated with broadcast TV. Simply put, closed captions are encoded within the video stream and decoded by the TV or other viewing/receiving device. More common in commercial AV, open captions are laid on top of video from a meeting or seminar; they not only serve the hearing-impaired, but also people viewing content that is not in their first language.

Automatic speech recognition removes the costs and staffing concerns of manual captioning. The speed and accuracy of speech-to-text conversion continue to improve with advances in deep neural networks. The statistical algorithms associated with these advances – coupled with larger multilingual databases to mine – more effectively interpret, and accurately spell out, the speech coming through the microphone.

Scalable architecture

In single classrooms, conference rooms and auditoriums, that audio is processed by a standalone system, consisting of a specialised speech-to-text engine that instantaneously outputs to a display. An alternative workflow can more efficiently process open captions for potentially hundreds of rooms on a campus. In this case, a dedicated device in each room would receive the audio and send it to a private or public cloud server, which would then instantaneously return the speech as text data to the devices in each space. Those devices would distribute the captions with the video to the display in that room. This approach offers a very scalable architecture, with dedicated devices serving each space.
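The campus-scale fan-out described above can be sketched in a few lines. This is an illustrative model only: `CaptionServer` stands in for the cloud speech-to-text service and `RoomDevice` for the per-room hardware; neither reflects any vendor's actual API.

```python
# Sketch of the campus-scale open-captioning flow: many room devices
# forward audio to one cloud service and receive caption text back.
# CaptionServer.transcribe() is a placeholder, not a real ASR engine.

class CaptionServer:
    """Receives audio from many room devices, returns caption text."""

    def transcribe(self, room_id: str, audio_chunk: bytes) -> str:
        # A real deployment would run speech recognition here;
        # we fake the result for illustration.
        return f"[captions for {room_id}: {len(audio_chunk)} bytes of audio]"


class RoomDevice:
    """Dedicated device in each room: forwards audio, overlays captions."""

    def __init__(self, room_id: str, server: CaptionServer):
        self.room_id = room_id
        self.server = server

    def process(self, audio_chunk: bytes) -> str:
        # Send room audio to the cloud and get caption text back,
        # ready to lay over the video feed on the local display.
        return self.server.transcribe(self.room_id, audio_chunk)


# One shared server scales to many rooms, each with its own device.
server = CaptionServer()
rooms = [RoomDevice(f"room-{n}", server) for n in range(3)]
captions = [r.process(b"\x00" * 1600) for r in rooms]
```

The design point is the asymmetry: the expensive speech-to-text work is centralised, while each room only needs a lightweight device to forward audio and composite the returned captions.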

Meanwhile, the more powerful computing engines within captioning technology have significantly reduced the latency to near real time. As an example, ENCO’s just-released enCaption4 closes the gap between speech and the on-screen appearance of captions to about three seconds.

“The speed and accuracy of speech-to-text conversion continue to improve with advances in deep neural networks”

As more systems move to software-defined platforms, the captioning workflow for pre-recorded and/or long-form content has been greatly simplified. For facilities captioning existing content for later viewing, operators can essentially drag and drop video files into a file-based workflow that extracts the audio track for text conversion. These files can then be delivered in various lengths and formats for digital signage, online learning and other platforms. The integration of dictionaries into the captioning software is growing in significance.
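For the file-based workflow, the converted text typically lands in a standard caption format before delivery. A minimal sketch, assuming timed text segments have already come back from the speech-to-text engine, that writes them out as SubRip (.srt), one common delivery format:

```python
# Write timed caption segments to SubRip (.srt) format, a common
# target for online learning platforms and signage players.

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """segments: iterable of (start_sec, end_sec, text) tuples."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)

print(to_srt([(0.0, 2.5, "Welcome to the seminar."),
              (2.5, 5.0, "Today we discuss open captioning.")]))
```

Emitting other formats (WebVTT, timed-text XML) from the same segment list is a matter of swapping the serialiser, which is what lets one extraction pass feed multiple delivery platforms.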

Convergence with streaming

Captioning systems are available in both on-premises and cloud configurations. In the latter case, some systems are now offered as SaaS platforms, with monthly fees, inclusive of hardware costs, that can work out to as low as roughly $15 per hour at average rates of use.
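The $15-per-hour figure makes back-of-the-envelope comparisons straightforward. In the sketch below, the SaaS rate comes from the article; the manual-captioning rate is an illustrative assumption, not a sourced figure:

```python
# Rough monthly cost comparison. SAAS_RATE is the ~$15/hour figure
# cited in the article; MANUAL_RATE is a hypothetical stenocaptioner
# rate used purely for illustration.

SAAS_RATE = 15.0     # USD per captioned hour (from the article)
MANUAL_RATE = 100.0  # USD per hour (assumed, for comparison only)

def monthly_cost(hours_per_month: float, rate_per_hour: float) -> float:
    return hours_per_month * rate_per_hour

hours = 40  # e.g. two hours of captioned content per weekday
print(monthly_cost(hours, SAAS_RATE))    # SaaS estimate
print(monthly_cost(hours, MANUAL_RATE))  # manual estimate
```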

Cloud-based captioning software also extends the service for online audiences outside the local facility. One emerging opportunity is automatic generation of transcriptions for live and archived content. The latter aligns especially well with the desire to caption the large volumes of stored media in higher education and corporate campuses. There are also uses within government for court reporting or community meetings.

Content repurposing software from companies like StreamGuys, previously used for podcasting and speciality AV streams, is being tailored for open captioning applications. In this architecture, previously ingested content is recalled through archive search. This level of integration also enables users to label and search for specific speakers for improved recognition and tracking, and later search the system for all content that contains a specific speaker, including exact spoken sentences.
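The speaker-labelled search described above can be modelled as a simple index from speaker to archived sentences. The data model below is illustrative only, not StreamGuys' actual architecture:

```python
# Sketch of speaker-labelled archive search: caption segments are
# indexed by speaker so users can later retrieve every piece of
# content a given speaker appears in, down to exact sentences.

from collections import defaultdict

class CaptionArchive:
    def __init__(self):
        # speaker name -> list of (content_id, sentence) pairs
        self._by_speaker = defaultdict(list)

    def ingest(self, content_id: str, speaker: str, sentence: str) -> None:
        """Label a captioned sentence with its speaker at ingest time."""
        self._by_speaker[speaker].append((content_id, sentence))

    def search_speaker(self, speaker: str):
        """All archived content featuring this speaker, with exact sentences."""
        return self._by_speaker.get(speaker, [])

archive = CaptionArchive()
archive.ingest("town-hall-03", "Mayor Lee", "The budget passes next week.")
archive.ingest("seminar-12", "Mayor Lee", "Thank you all for attending.")
archive.ingest("seminar-12", "Dr. Ruiz", "Let us begin with the data.")
print(archive.search_speaker("Mayor Lee"))
```

Labelling at ingest is the key design choice: once each sentence carries a speaker tag, cross-archive queries like "everything Mayor Lee said" reduce to a dictionary lookup rather than a re-scan of the media.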

The overarching business benefit of updating the captioning workflow remains clear: the cost reduction opportunities multiply as these systems move to software-defined platforms. This not only provides ongoing low cost of ownership for the end-user, but allows systems integrators to potentially plan for open captioning as part of a broader AV-over-IP ecosystem to help facilities more efficiently manage this growing need.