(I will put a 500 reputation bounty on this question as soon as it's eligible - unless the question got closed.)
Problem in one sentence
Reading frames from a VideoCapture advances the video much further than it's supposed to.
Explanation
I need to read and analyze frames from a 100 fps (according to cv2 and VLC media player) video between certain time-intervals. In the minimal example that follows I am trying to read all the frames for the first ten seconds of a three minute video.
I am creating a cv2.VideoCapture object from which I read frames until the desired position in milliseconds is reached. In my actual code each frame is analyzed, but that is irrelevant for showcasing the error.
Checking the current frame and millisecond position of the VideoCapture after reading the frames yields correct values, so the VideoCapture thinks it is at the right position - but it is not. Saving an image of the last read frame reveals that my iteration is grossly overshooting the destination time by over two minutes.
What's even more bizarre is that if I manually set the millisecond position of the capture with VideoCapture.set to 10 seconds (the same value VideoCapture.get returns after reading the frames) and save an image, the video is at (almost) the right position!
Demo video file
In case you want to run the MCVE, you need the demo.avi video file. You can download it HERE.
MCVE
This MCVE is carefully crafted and commented. Please leave a comment under the question if anything remains unclear.
If you are using OpenCV 3, you have to replace all instances of cv2.cv.CV_ with cv2. (for example, cv2.cv.CV_CAP_PROP_FPS becomes cv2.CAP_PROP_FPS). The problem occurs in both versions for me.
import cv2
# set up capture and print properties
print 'cv2 version = {}'.format(cv2.__version__)
cap = cv2.VideoCapture('demo.avi')
fps = cap.get(cv2.cv.CV_CAP_PROP_FPS)
pos_msec = cap.get(cv2.cv.CV_CAP_PROP_POS_MSEC)
pos_frames = cap.get(cv2.cv.CV_CAP_PROP_POS_FRAMES)
print ('initial attributes: fps = {}, pos_msec = {}, pos_frames = {}'
      .format(fps, pos_msec, pos_frames))
# get first frame and save as picture
_, frame = cap.read()
cv2.imwrite('first_frame.png', frame)
# advance 10 seconds, that's 100*10 = 1000 frames at 100 fps
for _ in range(1000):
    _, frame = cap.read()
    # in the actual code, the frame is now analyzed
# save a picture of the current frame
cv2.imwrite('after_iteration.png', frame)
# print properties after iteration
pos_msec = cap.get(cv2.cv.CV_CAP_PROP_POS_MSEC)
pos_frames = cap.get(cv2.cv.CV_CAP_PROP_POS_FRAMES)
print ('attributes after iteration: pos_msec = {}, pos_frames = {}'
      .format(pos_msec, pos_frames))
# assert that the capture (thinks it) is where it is supposed to be
# (assertions succeed)
assert pos_frames == 1000 + 1 # (+1: iteration started with second frame)
assert pos_msec == 10000 + 10
# manually set the capture to msec position 10010
# note that this should change absolutely nothing in theory
cap.set(cv2.cv.CV_CAP_PROP_POS_MSEC, 10010)
# print properties again to be extra sure
pos_msec = cap.get(cv2.cv.CV_CAP_PROP_POS_MSEC)
pos_frames = cap.get(cv2.cv.CV_CAP_PROP_POS_FRAMES)
print ('attributes after setting msec pos manually: pos_msec = {}, pos_frames = {}'
      .format(pos_msec, pos_frames))
# save a picture of the next frame, should show the same clock as
# previously taken image - but does not
_, frame = cap.read()
cv2.imwrite('after_setting.png', frame)
MCVE output
The print statements produce the following output.
cv2 version = 2.4.9.1
initial attributes: fps = 100.0, pos_msec = 0.0, pos_frames = 0.0
attributes after iteration: pos_msec = 10010.0, pos_frames = 1001.0
attributes after setting msec pos manually: pos_msec = 10010.0, pos_frames = 1001.0
As you can see, all properties have the expected values.
imwrite saves the following pictures.
first_frame.png

after_iteration.png

after_setting.png

You can see the problem in the second picture. The target of 9:26:15 (real time clock in picture) is missed by over two minutes. Setting the target time manually (third picture) sets the video to (almost) the correct position.
What am I doing wrong and how do I fix it?
Tried so far
cv2 2.4.9.1 @ Ubuntu 16.04
cv2 2.4.13 @ Scientific Linux 7.3 (three computers)
cv2 3.1.0 @ Scientific Linux 7.3 (three computers)
Creating the capture with
cap = cv2.VideoCapture('demo.avi', apiPreference=cv2.CAP_FFMPEG)
and
cap = cv2.VideoCapture('demo.avi', apiPreference=cv2.CAP_GSTREAMER)
in OpenCV 3 (version 2 does not seem to have the apiPreference argument).
Using cv2.CAP_GSTREAMER takes extremely long (about 2-3 minutes to run the MCVE) but both api-preferences produce the same incorrect images.
When using ffmpeg directly to read frames (credit to this tutorial), the correct output images are produced.
import numpy as np
import subprocess as sp
import pylab
# video properties
path = './demo.avi'
resolution = (593, 792)
framesize = resolution[0]*resolution[1]*3
# set up pipe
FFMPEG_BIN = "ffmpeg"
command = [FFMPEG_BIN,
           '-i', path,
           '-f', 'image2pipe',
           '-pix_fmt', 'rgb24',
           '-vcodec', 'rawvideo', '-']
pipe = sp.Popen(command, stdout = sp.PIPE, bufsize=10**8)
# read first frame and save as image
raw_image = pipe.stdout.read(framesize)
image = np.frombuffer(raw_image, dtype='uint8')  # fromstring is deprecated for binary data
image = image.reshape(resolution[0], resolution[1], 3)
pylab.imshow(image)
pylab.savefig('first_frame_ffmpeg_only.png')
pipe.stdout.flush()
# forward 1000 frames
for _ in range(1000):
    raw_image = pipe.stdout.read(framesize)
    pipe.stdout.flush()
# save frame 1001
image = np.frombuffer(raw_image, dtype='uint8')  # fromstring is deprecated for binary data
image = image.reshape(resolution[0], resolution[1], 3)
pylab.imshow(image)
pylab.savefig('frame_1001_ffmpeg_only.png')
pipe.terminate()
This produces the correct result! (Correct timestamp 9:26:15)
frame_1001_ffmpeg_only.png:

Additional information
In the comments I was asked for my cvconfig.h file. I only seem to have this file for cv2 version 3.1.0 under /opt/opencv/3.1.0/include/opencv2/cvconfig.h. 
HERE is a paste of this file.
In case it helps, I was able to extract the following video information with VideoCapture.get.
brightness 0.0
contrast 0.0
convert_rgb 0.0
exposure 0.0
format 0.0
fourcc 1684633187.0
fps 100.0
frame_count 18000.0
frame_height 593.0
frame_width 792.0
gain 0.0
hue 0.0
mode 0.0
openni_baseline 0.0
openni_focal_length 0.0
openni_frame_max_depth 0.0
openni_output_mode 0.0
openni_registration 0.0
pos_avi_ratio 0.01
pos_frames 0.0
pos_msec 0.0
rectification 0.0
saturation 0.0
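The fourcc value in the list above is a packed four-character code that identifies the codec. A minimal sketch for decoding it, assuming the usual layout of one ASCII character per byte with the least significant byte first (fourcc_to_str is a helper name made up for this illustration):

```python
def fourcc_to_str(value):
    """Decode the number returned for the fourcc property into 4 characters."""
    code = int(value)
    # One character per byte, least significant byte first.
    return ''.join(chr((code >> (8 * i)) & 0xFF) for i in range(4))


print(fourcc_to_str(1684633187.0))  # prints "cvid"
```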
Your video file data contains just 1313 non-duplicate frames (i.e. between 7 and 8 frames per second of duration):
$ ffprobe -i demo.avi -loglevel fatal -show_streams -count_frames|grep frame
has_b_frames=0
r_frame_rate=100/1
avg_frame_rate=100/1
nb_frames=18000
nb_read_frames=1313        # !!!
Converting the avi file with ffmpeg reports 16697 duplicate frames (for some reason 10 additional frames are added and 16697=18010-1313).
$ ffmpeg -i demo.avi demo.mp4
...
frame=18010 fps=417 Lsize=3705kB time=03:00.08 bitrate=168.6kbits/s dup=16697
#                                                                   ^^^^^^^^^
...
BTW, the video converted this way (demo.mp4) is free of the problem being discussed; that is, OpenCV processes it correctly.
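The numbers quoted above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the values taken from the ffprobe and ffmpeg output; the variable names are chosen for this illustration:

```python
nb_frames = 18000        # frames declared by the container (ffprobe nb_frames)
nb_read_frames = 1313    # frames that carry picture data (ffprobe nb_read_frames)
declared_fps = 100.0     # r_frame_rate reported by ffprobe

duration_s = nb_frames / declared_fps        # length of the video in seconds
real_fps = nb_read_frames / duration_s       # distinct pictures per second

ffmpeg_output_frames = 18010                 # frame= value in the ffmpeg log
duplicates = ffmpeg_output_frames - nb_read_frames

print(duration_s, round(real_fps, 2), duplicates)  # prints "180.0 7.29 16697"
```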
In this case the duplicate frames are not physically present in the avi file, instead each duplicate frame is represented by an instruction to repeat the previous frame. This can be checked as follows:
$ ffplay -loglevel trace demo.avi
...
[ffplay_crop @ 0x7f4308003380] n:16 t:2.180000 pos:1311818.000000 x:0 y:0 x+w:792 y+h:592
[avi @ 0x7f4310009280] dts:574 offset:574 1/100 smpl_siz:0 base:1000000 st:0 size:81266
video: delay=0.130 A-V=0.000094
    Last message repeated 9 times
video: delay=0.130 A-V=0.000095
video: delay=0.130 A-V=0.000094
video: delay=0.130 A-V=0.000095
[avi @ 0x7f4310009280] dts:587 offset:587 1/100 smpl_siz:0 base:1000000 st:0 size:81646
[ffplay_crop @ 0x7f4308003380] n:17 t:2.320000 pos:1393538.000000 x:0 y:0 x+w:792 y+h:592
video: delay=0.140 A-V=0.000091
    Last message repeated 4 times
video: delay=0.140 A-V=0.000092
    Last message repeated 1 times
video: delay=0.140 A-V=0.000091
    Last message repeated 6 times
...
In the above log, frames with actual data are represented by the lines starting with "[avi @ 0xHHHHHHHHHHH]". The "video: delay=xxxxx A-V=yyyyy" messages indicate that the last frame must be displayed for xxxxx more seconds.
cv2.VideoCapture() skips such duplicate frames and reads only frames that carry real data. Here is the corresponding (slightly edited) code from the 2.4 branch of OpenCV. Note that ffmpeg is used underneath, which I verified by running python under gdb and setting a breakpoint on CvCapture_FFMPEG::grabFrame:
bool CvCapture_FFMPEG::grabFrame()
{
    ...
    int count_errs = 0;
    const int max_number_of_attempts = 1 << 9; // !!!
    ...
    // get the next frame
    while (!valid)
    {
        ...
        int ret = av_read_frame(ic, &packet);
        ...        
        // Decode video frame
        avcodec_decode_video2(video_st->codec, picture, &got_picture, &packet);
        // Did we get a video frame?
        if(got_picture)
        {
            //picture_pts = picture->best_effort_timestamp;
            if( picture_pts == AV_NOPTS_VALUE_ )
                picture_pts = packet.pts != AV_NOPTS_VALUE_ && packet.pts != 0 ? packet.pts : packet.dts;
            frame_number++;
            valid = true;
        }
        else
        {
            // So, if the next frame doesn't have picture data but is
            // merely a tiny instruction telling to repeat the previous
            // frame, then we get here, treat that situation as an error
            // and proceed unless the count of errors exceeds 512 (1 << 9)!!!
            if (++count_errs > max_number_of_attempts)
                break;
        }
    }
    ...
}
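To illustrate the effect of that loop, here is a rough pure-Python simulation (not OpenCV code). It assumes, for simplicity, that the 1313 real frames are spread evenly over the 18000 time slots of the container, and it leaves out the error limit, which such short gaps never reach. The names simulate_reads and has_picture are made up for this sketch:

```python
def simulate_reads(has_picture, n_reads):
    """Mimic the grabFrame loop: skip slots that carry no picture data.

    has_picture -- list of booleans, one per container time slot
    Returns (frame_number, slot): what the capture counts, and the slot
    in the file that has actually been reached.
    """
    frame_number = 0
    slot = 0
    for _ in range(n_reads):
        # keep pulling packets until one decodes to a picture
        while slot < len(has_picture) and not has_picture[slot]:
            slot += 1
        if slot >= len(has_picture):
            break  # end of file
        slot += 1
        frame_number += 1
    return frame_number, slot


# Assumption: real frames are evenly spaced (the real file is less regular).
total_slots, real_frames, fps = 18000, 1313, 100.0
has_picture = [False] * total_slots
for k in range(real_frames):
    has_picture[k * total_slots // real_frames] = True

frame_number, slot = simulate_reads(has_picture, 1001)
print('capture reports', frame_number / fps, 's')   # 10.01 s
print('actual position is about', slot / fps, 's')  # roughly 137 s
```

Under this assumption the capture believes it is at 10.01 s while it has really consumed about 137 s of video, which is consistent with the overshoot of more than two minutes reported in the question.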
In a nutshell: I reproduced your problem on an Ubuntu 12.04 machine with OpenCV 2.4.13, noticed that the codec used in your video (FourCC CVID) seems to be rather old (according to this post from 2011), and after converting the video to codec MJPG (aka M-JPEG or Motion JPEG) your MCVE worked. Of course, Leon (or others) may post a fix for OpenCV, which may be the better solution for your case.
I initially tried the conversion using
ffmpeg -i demo.avi -vcodec mjpeg -an demo_mjpg.avi
and
avconv -i demo.avi -vcodec mjpeg -an demo_mjpg.avi
(both also on a 16.04 box). Interestingly, both produced "broken" videos. E.g., when jumping to frame 1000 using Avidemux, there is no real-time clock! Also, the converted videos were only about 1/6 of the original size, which is strange since M-JPEG is a very simple compression. (Each frame is JPEG-compressed independently.)
Using Avidemux to convert demo.avi to M-JPEG produced a video on which the MCVE worked. (I used the Avidemux GUI for the conversion.) The size of the converted video is about 3x the original size. Of course, it may also be possible to do the original recording using a codec that is better supported on Linux. If you need to jump to specific frames in the video in your application, M-JPEG may be the best option. Otherwise, H.264 compresses much better. Both are well supported in my experience and are the only codecs I have seen implemented directly on webcams (H.264 only on high-end ones).
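If converting the file is not an option, one possible workaround builds on the observation from the question that setting the millisecond position explicitly lands at (almost) the right place. The sketch below seeks before every read instead of relying on consecutive read() calls. It is untested on demo.avi, read_frames_by_time is a hypothetical helper name, and seeking for every frame is likely to be slow:

```python
def read_frames_by_time(cap, pos_msec_prop, start_ms, end_ms, step_ms):
    """Yield (timestamp_ms, frame), seeking to every timestamp before reading.

    cap           -- an opened cv2.VideoCapture (any object with set/read works)
    pos_msec_prop -- the POS_MSEC property id, i.e. cv2.cv.CV_CAP_PROP_POS_MSEC
                     in OpenCV 2 or cv2.CAP_PROP_POS_MSEC in OpenCV 3
    """
    t = start_ms
    while t < end_ms:
        cap.set(pos_msec_prop, t)   # explicit seek instead of trusting read()
        ok, frame = cap.read()
        if not ok:
            break                   # read failed or end of video
        yield t, frame
        t += step_ms


# Usage sketch for the first ten seconds at 100 fps (10 ms per frame):
#   import cv2
#   cap = cv2.VideoCapture('demo.avi')
#   for t, frame in read_frames_by_time(cap, cv2.CAP_PROP_POS_MSEC, 0, 10000, 10):
#       pass  # analyze frame here
```

Since the file holds only 7 to 8 distinct pictures per second, many consecutive timestamps can be expected to return the same picture.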