Using the FFmpeg library in my Android app, I am trying to understand how to seek to a very precise position in an audio file.
For example, I want to set the current position in my file to sample #1234567 (in a file encoded at 44100 Hz), which is equivalent to seeking to 27994.717 milliseconds.
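For reference, the arithmetic behind that number can be sketched as below. This is only an illustration of the conversion, not FFmpeg code; `sample_to_us` is a hypothetical helper name, and it assumes a non-negative sample index and a positive sample rate.

```cpp
#include <cstdint>

// Hypothetical helper: convert a sample index to microseconds,
// rounding to the nearest microsecond.
// The division is split into whole seconds plus a remainder so the
// intermediate product stays small.
std::int64_t sample_to_us(std::int64_t sample_index, std::int64_t sample_rate) {
    const std::int64_t us_per_second = 1000000;
    const std::int64_t whole = sample_index / sample_rate;  // full seconds
    const std::int64_t rest = sample_index % sample_rate;   // leftover samples
    return whole * us_per_second +
           (rest * us_per_second + sample_rate / 2) / sample_rate;
}
```

With sample 1234567 at 44100 Hz this gives 27994717 microseconds, which is the value passed to the seek calls.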
To achieve that, here is what I tried:
// this:
av_seek_frame(formatContext, -1, 27994717, 0);
// or this:
av_seek_frame(formatContext, -1, 27994717, AVSEEK_FLAG_ANY);
// or even this:
avformat_seek_file(formatContext, -1, 27994617, 27994717, 27994817, 0);
Using a position in microseconds gives me the best result so far.
But for some reason, the positioning is not totally accurate: when I extract the samples from the audio file, they don't start exactly at the expected position. There is a slight delay of about 30-40 milliseconds (surprisingly, even if I seek to position 0).
Am I using the function the right way, or is it even the right function?
EDIT
Here is how I can get the position:
AVPacket packet;
AVStream *stream = NULL;
AVFormatContext *formatContext = NULL;
AVCodec *dec = NULL;
// initialization:
avformat_open_input(&formatContext, filename, NULL, NULL);
avformat_find_stream_info(formatContext, NULL);
int audio_stream_index = av_find_best_stream(formatContext, AVMEDIA_TYPE_AUDIO, -1, -1, &dec, 0);
stream = formatContext->streams[audio_stream_index];
...
// later, when I extract samples, here is how I get my position, in microseconds:
av_read_frame(formatContext, &packet);
// av_rescale_q converts between time bases in integer arithmetic, avoiding float rounding
int64_t position = av_rescale_q(packet.pts, stream->time_base, AV_TIME_BASE_Q);
Thanks to that piece of code, I can get the position of the beginning of the current frame (frame = block of samples; the size depends on the audio format: 1152 samples for mp3, 128 to 1152 for ogg, ...).
The problem is: the value I get in position is not accurate: it's actually 30 ms late, approximately. For example, when it says 1000000, the actual position is approximately 1030000...
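As a side note on the conversion itself: a float has a 24-bit mantissa, so large pts values and the time base ratio are rounded when the math goes through float. That probably accounts for microseconds rather than 30 ms, but it can be ruled out by staying in integers. Here is a minimal sketch of the same pts-to-microseconds conversion; `pts_to_us` is a hypothetical name, the time base values in the usage note are only examples, and it assumes a non-negative pts and a positive denominator.

```cpp
#include <cstdint>

// Hypothetical helper: convert a timestamp expressed in a time base of
// num/den seconds per tick into microseconds, rounding to nearest.
// Splitting pts into whole and remainder parts keeps the intermediate
// product small for typical time bases.
std::int64_t pts_to_us(std::int64_t pts, std::int64_t num, std::int64_t den) {
    const std::int64_t scale = num * 1000000;  // microseconds covered by den ticks
    const std::int64_t whole = pts / den;
    const std::int64_t rest = pts % den;
    return whole * scale + (rest * scale + den / 2) / den;
}
```

For example, with a time base of 1/14112000, a pts of 14112000 maps to exactly 1000000 microseconds.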
What did I do wrong? Is it a bug in FFmpeg?
Thanks for your help.
It depends on the codec. For example, AAC typically has a resolution of 1024 samples per frame regardless of the sample rate, and it also has priming samples that may be discarded. MP3 (Layer III) has 1152 samples per frame in MPEG-1 and 576 in MPEG-2/2.5.
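To make the granularity concrete, here is a small sketch of where a target sample lands when the stream is cut into fixed-size frames. It is plain arithmetic, not FFmpeg code: `locate_sample` and `PacketLocation` are hypothetical names, and it assumes every frame has the same size and ignores priming samples or encoder delay.

```cpp
#include <cstdint>

// Hypothetical result type: which frame holds the target sample,
// and how far into that frame the sample sits.
struct PacketLocation {
    std::int64_t packet_index;     // zero-based index of the frame holding the sample
    std::int64_t samples_to_skip;  // decoded samples before the target inside that frame
};

// Assumes constant-size frames and no priming samples.
PacketLocation locate_sample(std::int64_t target_sample, std::int64_t samples_per_frame) {
    return PacketLocation{target_sample / samples_per_frame,
                          target_sample % samples_per_frame};
}
```

For sample 1234567 with 1152-sample frames, this gives frame 1071 with 775 samples to skip; with 1024-sample frames, frame 1205 with 647 samples to skip. A seek that lands on a frame boundary is therefore early by up to one frame, which at 1152 samples and 44100 Hz is about 26 ms.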
If you need perfection, use an uncompressed format such as PCM in a WAV (RIFF) container.
Late, but hopefully it helps someone. The idea is to save the target timestamp when seeking and then compare AVPacket->pts with this value (you can do that with dts, but it wasn't giving good results in my experiments). If pts is still lower than the target timestamp, skip samples using the AV_PKT_DATA_SKIP_SAMPLES side data of the AVPacket.
Code for seeking method:
void audio_decoder::seek(float seconds) {
    auto stream = m_format_ctx->streams[m_packet->stream_index];
    // convert seconds provided by the user to a timestamp in a correct base,
    // then save it for later.
    m_target_ts = av_rescale_q(seconds * AV_TIME_BASE, AV_TIME_BASE_Q, stream->time_base);
    avcodec_flush_buffers(m_codec_ctx.get());
    // Here we seek within given stream index and the correct timestamp 
    // for that stream. Using AVSEEK_FLAG_BACKWARD to make sure we're 
    // always *before* requested timestamp.
    // av_seek_frame returns >= 0 on success, so only negative values are errors.
    int err = av_seek_frame(m_format_ctx.get(), m_packet->stream_index, m_target_ts, AVSEEK_FLAG_BACKWARD);
    if(err < 0) {
        error("audio_decoder: Error while seeking ({})", av_err_str(err));
    }
}
And code for decoding method:
void audio_decoder::decode() {
    <...>
    while(is_decoding) {
        // Read data as usual.
        av_read_frame(m_format_ctx.get(), m_packet.get());
        // Here is the juicy part. We were seeking, but the seek
        // wasn't precise enough, so we need to drop some samples.
        if(m_packet->pts > 0 && m_target_ts > 0 && m_packet->pts < m_target_ts) {
            auto stream = m_format_ctx->streams[m_packet->stream_index];
            // Convert the timestamp delta from stream time base units
            // to a number of samples at the decoder's sample rate.
            int64_t skip_samples = av_rescale_q(m_target_ts - m_packet->pts, stream->time_base,
                                                AVRational{1, m_codec_ctx->sample_rate});
            // Next step: we need to provide side data to our packet,
            // and it will tell the codec to drop samples.
            uint8_t *data = av_packet_get_side_data(m_packet.get(), AV_PKT_DATA_SKIP_SAMPLES, nullptr);
            if(!data) {
                data = av_packet_new_side_data(m_packet.get(), AV_PKT_DATA_SKIP_SAMPLES, 10);
            }
            // Define parameters of side data. You can check them here:
            // https://ffmpeg.org/doxygen/trunk/group__lavc__packet.html#ga9a80bfcacc586b483a973272800edb97
            if(data) {
                // Bytes 0-3: number of samples to skip at the start
                // (stored little-endian; this cast assumes a little-endian host).
                *reinterpret_cast<uint32_t*>(data) = static_cast<uint32_t>(skip_samples);
                // Byte 8: reason for the start skip.
                data[8] = 0;
            }
        }
        // Send packet as usual.
        avcodec_send_packet(m_codec_ctx.get(), m_packet.get());
        // Proceed to the receiving frames as usual, nothing to change there.
    }
    <...>
}
If it's unclear without context, you can check the same code in my project audio_decoder.cpp.
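In case the byte layout of the side data is the confusing part, here is a standalone sketch of it that does not depend on FFmpeg. The 10-byte layout (two little-endian 32-bit counts followed by two reason bytes) follows the documentation linked in the code above; `ticks_to_samples` and `make_skip_payload` are hypothetical names used only for this illustration, and the time base in the usage note is just an example value.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: convert a timestamp delta in a num/den time base
// into a sample count at the given sample rate, rounding to nearest.
// Assumes non-negative inputs small enough that the product fits in 64 bits.
std::int64_t ticks_to_samples(std::int64_t delta_ticks, std::int64_t num,
                              std::int64_t den, std::int64_t sample_rate) {
    return (delta_ticks * num * sample_rate + den / 2) / den;
}

// Hypothetical helper: build the 10-byte skip-samples payload.
// Bytes 0-3: samples to skip at the start (little-endian).
// Bytes 4-7: samples to discard at the end (little-endian).
// Bytes 8-9: reason codes, left at 0 here.
std::array<std::uint8_t, 10> make_skip_payload(std::uint32_t skip_start,
                                               std::uint32_t discard_end) {
    std::array<std::uint8_t, 10> payload{};  // value-initialized to all zeros
    for (std::size_t i = 0; i < 4; ++i) {
        payload[i] = static_cast<std::uint8_t>(skip_start >> (8 * i));
        payload[4 + i] = static_cast<std::uint8_t>(discard_end >> (8 * i));
    }
    return payload;
}
```

For example, a delta of 248000 ticks in a 1/14112000 time base at 44100 Hz is 775 samples, and the payload for skipping 775 samples starts with the bytes 0x07, 0x03, 0x00, 0x00. Writing the bytes one at a time keeps the result independent of the host's endianness.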