Sophie: x264-0.125-0.20120914.stable.3.mga3.tainted x86

x264-0.125-0.20120914.stable.3.mga3.tainted.x86_64.rpm

Historical notes:
Slice-based threads was the original threading model of x264.  It was replaced with frame-based threads in r607.  This document was originally written at that time.  Slice-based threading was brought back (as an optional mode) in r1364 for low-latency encoding.  Furthermore, frame-based threading was modified significantly in r1246, with the addition of threaded lookahead.

Old threading method: slice-based
application calls x264
x264 runs B-adapt and ratecontrol (serial)
split frame into several slices, and spawn a thread for each slice
wait until all threads are done
deblock and hpel filter (serial)
return to application
In x264cli, there is one additional thread to decode the input.

New threading method: frame-based
application calls x264
x264 requests a frame from lookahead, which runs B-adapt and ratecontrol parallel to the current thread, separated by a buffer of size sync-lookahead
spawn a thread for this frame
thread runs encode, deblock, hpel filter
meanwhile x264 waits for the oldest thread to finish
return to application, but the rest of the threads continue running in the background
No additional threads are needed to decode the input, unless decoding is slower than slice+deblock+hpel, in which case an additional input thread would allow decoding in parallel.

Penalties for slice-based threading:
Each slice adds some bitrate (or equivalently reduces quality), for a variety of reasons: the slice header costs some bits, cabac contexts are reset, mvs and intra samples can't be predicted across the slice boundary.
In CBR mode, multiple slices encode simultaneously, thus increasing the maximum misprediction possible with VBV.
Some parts of the encoder are serial, so it doesn't scale well with lots of cpus.

Some numbers on penalties for slicing:
Tested at 720p with 45 slices (one per mb row) to maximize the total cost for easy measurement. Averaged over 4 movies at crf20 and crf30. Total cost: +30% bitrate at constant psnr.
I enabled the various components of slicing one at a time, and measured the portion of that cost they contribute:
    * 34% intra prediction
    * 25% redundant slice headers, nal headers, and rounding to whole bytes
    * 16% mv prediction
    * 16% reset cabac contexts
    * 6% deblocking between slices (you don't strictly have to turn this off just for standard compliance, but you do if you want to use slices for decoder multithreading)
    * 2% cabac neighbors (cbp, skip, etc)
The proportional cost of redundant headers should certainly depend on bitrate (since the header size is constant and everything else depends on bitrate). Deblocking should too (due to varing deblock strength).
But none of the proportions should depend strongly on the number of slices: some are triggered per slice while some are triggered per macroblock-that's-on-the-edge-of-a-slice, but as long as there's no more than 1 slice per row, the relative frequency of those two conditions is determined solely by the image width.


Penalties for frame-base threading:
To allow encoding of multiple frames in parallel, we have to ensure that any given macroblock uses motion vectors only from pieces of the reference frames that have been encoded already. This is usually not noticeable, but can matter for very fast upward motion.
We have to commit to one frame type before starting on the frame. Thus scenecut detection must run during the lowres pre-motion-estimation along with B-adapt, which makes it faster but less accurate than re-encoding the whole frame.
Ratecontrol gets delayed feedback, since it has to plan frame N before frame N-1 finishes.

Benchmarks:
cpu: 8core Nehalem (2x E5520) 2.27GHz, hyperthreading disabled
kernel: linux 2.6.34.7, 64-bit
x264: r1732 b20059aa
input: http://media.xiph.org/video/derf/y4m/1080p/park_joy_1080p.y4m

NOTE: the "thread count" listed below does not count the lookahead thread, only encoding threads.  This is why for "veryfast", the speedup for 2 and 3 threads exceeds the logical limit.

threads  speedup       psnr
      slice frame   slice  frame
x264 --preset veryfast --tune psnr --crf 30
 1:   1.00x 1.00x  +0.000 +0.000
 2:   1.41x 2.29x  -0.005 -0.002
 3:   1.70x 3.65x  -0.035 +0.000
 4:   1.96x 3.97x  -0.029 -0.001
 5:   2.10x 3.98x  -0.047 -0.002
 6:   2.29x 3.97x  -0.060 +0.001
 7:   2.36x 3.98x  -0.057 -0.001
 8:   2.43x 3.98x  -0.067 -0.001
 9:         3.96x         +0.000
10:         3.99x         +0.000
11:         4.00x         +0.001
12:         4.00x         +0.001

x264 --preset medium --tune psnr --crf 30
 1:   1.00x 1.00x  +0.000 +0.000
 2:   1.54x 1.59x  -0.002 -0.003
 3:   2.01x 2.81x  -0.005 +0.000
 4:   2.51x 3.11x  -0.009 +0.000
 5:   2.89x 4.20x  -0.012 -0.000
 6:   3.27x 4.50x  -0.016 -0.000
 7:   3.58x 5.45x  -0.019 -0.002
 8:   3.79x 5.76x  -0.015 -0.002
 9:         6.49x         -0.000
10:         6.64x         -0.000
11:         6.94x         +0.000
12:         6.96x         +0.000

x264 --preset slower --tune psnr --crf 30
 1:   1.00x 1.00x  +0.000 +0.000
 2:   1.54x 1.83x  +0.000 +0.002
 3:   1.98x 2.21x  -0.006 +0.002
 4:   2.50x 2.61x  -0.011 +0.002
 5:   2.93x 3.94x  -0.018 +0.003
 6:   3.45x 4.19x  -0.024 +0.001
 7:   3.84x 4.52x  -0.028 -0.001
 8:   4.13x 5.04x  -0.026 -0.001
 9:         6.15x         +0.001
10:         6.24x         +0.001
11:         6.55x         -0.001
12:         6.89x         -0.001