Live Video Transmuxing/Transcoding: FFmpeg vs TwitchTranscoder, Part II
By: Jeff Gong, Software Engineer (jeffgon@twitch.tv); Sahil Dhanju, Software Engineer Intern; Chih-Chiang Lu, Senior Software Engineer (chihchil@twitch.tv); Yueshi Shen, Principal Research Engineer (yshen@twitch.tv)
Special thanks go to Christopher Kennedy, Staff Video Engineer at Crunchyroll/Ellation, and John Nichols, Principal Software Engineer at Xilinx (jnichol@xilinx.com), for their information on FFmpeg and for reviewing this article.
Note: This is the second part of a 2-part series. Make sure you read Part 1 first.
FFmpeg’s 1-In-N-Out Pipeline: Why Doesn’t It Handle the Technical Issues Discussed Earlier?
How does FFmpeg programmatically deal with instances where a single input stream is required to generate multiple transcoded and/or transmuxed outputs? We went directly into the source code of the latest FFmpeg release, 3.3, to understand its threading model and transcoding pipeline.
In the top-level ffmpeg.c file, the transcode() function (line 4544) loops and repeatedly calls transcode_step() (line 4478) until its inputs are completely processed, or until the user interrupts execution. transcode_step() wraps the main pipeline, orchestrating file I/O, decoding, filtering, and encoding, among many other intermediate steps.
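To make that control flow concrete, here is a minimal, self-contained sketch of the loop. The bodies of need_output() and transcode_step() below are trivial stand-ins, not FFmpeg’s real internals; only the overall shape of the loop mirrors what ffmpeg.c does.

```c
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t received_sigterm = 0; /* set by a signal handler in the real tool */
static int frames_remaining = 5;                   /* stand-in for "inputs not yet fully processed" */

/* Hypothetical stubs standing in for FFmpeg's internals. */
static int need_output(void) { return frames_remaining > 0; }

static int transcode_step(void)
{
    /* One pass of: demux -> decode -> filter -> encode -> mux. */
    printf("transcode_step: processing, %d frame(s) left\n", --frames_remaining);
    return 0;
}

int main(void)
{
    /* transcode() keeps calling transcode_step() until the inputs are
     * exhausted or the user interrupts execution. */
    while (!received_sigterm && need_output()) {
        if (transcode_step() < 0)
            break;
    }
    return 0;
}
```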
During the initial setup phase, init_input_threads() (line 4020) is called, and based on the number of input files, a number of new threads may be spawned to process the input.
```c
if (nb_input_files == 1) {
    return 0;
}

for (i = 0; i < nb_input_files; i++) {
    ...
    ret = av_thread_message_queue_alloc(&f->in_thread_queue,
                                        f->thread_queue_size,
                                        sizeof(AVPacket)); // line 4033
}
```
At line 4033, we see that the number of reader threads spawned is determined solely by the number of input files; with a single input, init_input_threads() returns early and no extra thread is created. This means FFmpeg will process a 1-in-N-out scenario using only a single thread.
In get_input_packet() (line 4055), the multithreaded companion function get_input_packet_mt() (line 4047) is only called if the number of input files is greater than one; it reads input packets from a message queue in a non-blocking fashion. Otherwise, av_read_frame() (line 4072) is used to read one frame at a time on the calling thread.
```c
#if HAVE_PTHREADS
    if (nb_input_files > 1) {
        return get_input_packet_mt(f, pkt);
    }
#endif
    return av_read_frame(f->ctx, pkt);
```
Following the frame further down the pipeline, we reach process_input_packet() (line 2591), which decodes the packet and runs it through all the applicable filters. Timestamp correction and subtitle handling also occur in this function. Finally, prior to returning, the packet is stream-copied to each relevant output stream (those that are transmuxed rather than re-encoded).
```c
for (i = 0; pkt && i < nb_output_streams; i++) {
    ... // check constraints
    do_streamcopy(ist, ost, pkt); // line 2756
}
```
Lastly, reap_filters() (line 1424) is called from transcode_step() to loop through each output stream. The body of its for loop pulls frames that are ready from the filtergraph’s buffer sink and encodes them before muxing them into the output file.
```c
// reap_filters line 1423
for (i = 0; i < nb_output_streams; i++) { // loop through all output streams
    ... // initialize contexts and files
    OutputStream *ost = output_streams[i];
    AVFilterContext *filter = ost->filter->filter;
    AVFrame *filtered_frame = ost->filtered_frame;

    while (1) { // process the video/audio frames for one output stream
        ... // frame is not already complete
        ret = av_buffersink_get_frame_flags(filter, filtered_frame, ...);
        if (ret < 0) {
            ... // handle errors and logs
            break;
        }
        switch (av_buffersink_get_type(filter)) {
        case AVMEDIA_TYPE_VIDEO:
            do_video_out(of, ost, filtered_frame, float_pts);
            break;
        case AVMEDIA_TYPE_AUDIO:
            do_audio_out(of, ost, filtered_frame);
            break;
        }
        ...
    }
}
```
By following this pipeline, we can see the redundancy in how frames for every output variant are handled sequentially within a single thread. We can conclude that FFmpeg may be suboptimal for our purposes, since the 1-in-N-out streaming model of transcoding is what matters most to us, and FFmpeg serves it with only one thread. FFmpeg’s documentation also suggests that in a use case like ours it may make more sense to launch multiple FFmpeg instances in parallel. Our key insight is that while multithreaded functionality does exist in the tool, it does not match the exact needs of Twitch’s streaming service and cannot be used as we would like.
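To illustrate the alternative, below is a rough, self-contained sketch in plain C with POSIX threads of a 1-in-N-out fan-out model: a single ingest path “decodes” each frame once and hands a copy to a dedicated worker thread per output rendition. This is only a conceptual illustration of the threading idea, not TwitchTranscoder’s actual code; real frames would be AVFrames, and the per-output work would be scaling, encoding, and muxing.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_OUTPUTS 5   /* e.g. 720p60, 720p30, 480p30, 360p30, 160p30 */
#define NUM_FRAMES  10
#define QUEUE_SIZE  4

/* A tiny bounded queue of "frames" (plain ints here) feeding one output. */
typedef struct {
    int frames[QUEUE_SIZE];
    int head, tail, count, done;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
} FrameQueue;

typedef struct {
    FrameQueue q;
    int id;
    pthread_t thread;
} Output;

static void queue_push(FrameQueue *q, int frame)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_SIZE)                 /* back-pressure on the producer */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->frames[q->tail] = frame;
    q->tail = (q->tail + 1) % QUEUE_SIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static int queue_pop(FrameQueue *q, int *frame)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->done)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count == 0) {                           /* drained and producer finished */
        pthread_mutex_unlock(&q->lock);
        return 0;
    }
    *frame = q->frames[q->head];
    q->head = (q->head + 1) % QUEUE_SIZE;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return 1;
}

static void queue_close(FrameQueue *q)
{
    pthread_mutex_lock(&q->lock);
    q->done = 1;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* One worker per rendition: in a real transcoder this would scale, encode,
 * and mux; here it only prints. */
static void *encode_worker(void *arg)
{
    Output *out = arg;
    int frame;
    while (queue_pop(&out->q, &frame))
        printf("output %d: encode + mux frame %d\n", out->id, frame);
    return NULL;
}

int main(void)
{
    Output outputs[NUM_OUTPUTS];

    /* One queue and one worker thread per output rendition. */
    for (int i = 0; i < NUM_OUTPUTS; i++) {
        Output *out = &outputs[i];
        out->id = i;
        out->q.head = out->q.tail = out->q.count = out->q.done = 0;
        pthread_mutex_init(&out->q.lock, NULL);
        pthread_cond_init(&out->q.not_empty, NULL);
        pthread_cond_init(&out->q.not_full, NULL);
        pthread_create(&out->thread, NULL, encode_worker, out);
    }

    /* The single ingest path demuxes/decodes each frame once, then fans it
     * out to every rendition's queue instead of encoding variants serially. */
    for (int frame = 0; frame < NUM_FRAMES; frame++)
        for (int i = 0; i < NUM_OUTPUTS; i++)
            queue_push(&outputs[i].q, frame);

    for (int i = 0; i < NUM_OUTPUTS; i++) {
        queue_close(&outputs[i].q);
        pthread_join(outputs[i].thread, NULL);
    }
    return 0;
}
```

As a side note, FFmpeg itself uses a similar message-queue pattern internally (av_thread_message_queue_alloc() in init_input_threads() above), but only to parallelize across multiple input files, not across the multiple outputs of a single input.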
Benchmarks
TwitchTranscoder is our in-house software developed to address the technical issues discussed earlier. It has been used in production to process tens of thousands of concurrent live streams, 24/7.
To determine whether TwitchTranscoder would perform better than FFmpeg on daily transcoding tasks, we ran a series of basic benchmarks. For our tests, we fed both tools a Twitch live stream as well as a 1080p60 video file, using the same presets, profiles, bitrates, and other flags. Each source was transcoded to our typical stack of 720p60, 720p30, 480p30, 360p30, and 160p30.
Our hypothesis was that FFmpeg would transcode slower than TwitchTranscoder for file input, and might even fail to keep up for live streaming.
The results in Figures 9, 10, and 11 compare the execution time of TwitchTranscoder vs. FFmpeg. They show that our transcoder is indeed faster for offline transcoding, even though it handles the same task and more (audio-only transcoding, thumbnail generation, and so on, in addition to the stack specified above).
FFmpeg is slightly faster for the single-variant 720p60 output because TwitchTranscoder handles more tasks, as explained above. As the number of variants increases, TwitchTranscoder’s multithreading model gives it a growing advantage that lets it outperform FFmpeg. For Twitch’s full ABR ladder, TwitchTranscoder saves 65% of the execution time compared to FFmpeg.
We conducted our live stream transcoding tests by comparing how many parallel instances of FFmpeg could run on a single machine before issues appeared, such as dropped frames and video artifacts. On our production servers, we support multiple channels being transcoded simultaneously while many more channels are transmuxed. Unfortunately, running more than a single FFmpeg instance caused a slew of errors that affected the transcoded outputs and required much higher CPU utilization (see the screenshot in Figure 12).
Conclusion
In this article, we examined FFmpeg as a live stream RTMP-to-HLS transcoder and provided insight into how to operate the tool. Such a solution is simple to deploy but has a number of technical issues, such as segment misalignment, unnecessary performance sacrifice, and a lack of flexibility to support our product features. We therefore implemented our own in-house transcoder software stack, TwitchTranscoder, which runs a custom-designed threading model and outputs N variants in a single process.