Direct DCT scaling to a fourth and to a half

This is an implementation of a completely new approach to a reduced-size DCT output in jidctred.c, based on ideas from Rakesh Dugad and Narendra Ahuja: "A Fast Scheme for Image Size Change in the Compressed Domain" (2001). Related ideas discussing the DFT and Hartley transform are given by Andrew B. Watson in "Ideal Shrinking and Expansion of Discrete Sequences" (1986). Further ideas are presented by A. N. Skodras and C. A. Christopoulos in "Down-Sampling of Compressed Images in the DCT Domain" (1998) and "On the Down-Scaling of Compressed Pictures" (1998). Skodras and Christopoulos present yet another approach and call the here presented (Dugad & Ahuja) approach "Frequency Masking", the old approach "Traditional".

The new jidctred.c code

The old jidctred.c code

Advantages:

Downscale to a fourth (2x2 output):
NO multiplications required other than the usual dequantizing.
Need only the 2x2 top left (low-freq) input DCT coefficients.
=> A LOT FASTER!
Downscale to a half (4x4 output):
NO multiplications required in the even part calculation.
Only 3 multiplications (1 rotation) in the odd part.
(The old implementation uses 2 multiplications in the
even part, and 8 multiplications in the odd part, so total 10.)
Need only the 4x4 top left (low-freq) input DCT coefficients.
=> A LOT FASTER!
Compact, easy to comprehend source code: 9 KBytes new versus 14 KBytes old.
No other changes to the library or header files required, just replace and recompile.
No specific adaptions required.

There are no known disadvantages!
Since the method is equivalent to the method of Dugad and Ahuja, the visual and measured quality of the downsampled output is at least as good as common bilinear (averaging) downsampling.

Derivation of the fast 2x2 IDCT algorithm

The 2x2 point DCT and IDCT (same due to symmetry) is given by

  /  cos(pi/4)   cos(pi/4)  \     /  1/sqrt(2)   1/sqrt(2)  \                /  1   1  \
  |                         |  =  |                         |  =  1/sqrt(2)  |         |
  \  cos(pi/4)  -cos(pi/4)  /     \  1/sqrt(2)  -1/sqrt(2)  /                \  1  -1  /

This leads to a simple add/sub (sum/diff) calculation scheme. The scalar multiplication is deferred to the descaling operation at the end of the calculation. After column and row application the factor is 1/sqrt(2) * 1/sqrt(2) = 1/2. We also have to account for the different size DCT with a factor of 1/4, so the final descaling factor is 1/2 * 1/4 = 1/8 (right shift 3).

Derivation of the fast 4x4 IDCT algorithm

The 4x4 point DCT is given by

             /  C4   C4   C4   C4  \          /   1    1    1    1  \
             |                     |          |                     |
             |  C2   C6  -C6  -C2  |          |  c2   c6  -c6  -c2  |
  1/sqrt(2)  |                     |  =  1/2  |                     |
             |  C4  -C4  -C4   C4  |          |   1   -1   -1    1  |
             |                     |          |                     |
             \  C6  -C2   C2  -C6  /          \  c6  -c2   c2  -c6  /


  where  Ck = cos(k*pi/16),    ck = sqrt(2) * Ck    (note C4 = 1/sqrt(2))

We use the indexing from the 8x8 point DCT for comparison. The IDCT is the transpose of the DCT, hence

       /  1   c2   1   c6  \
       |                   |
       |  1   c6  -1  -c2  |
  1/2  |                   |
       |  1  -c6  -1   c2  |
       |                   |
       \  1  -c2   1  -c6  /

The scalar multiplication is deferred to the descaling operation at the end of the calculation. After column and row application the factor is 1/2 * 1/2 = 1/4. We also have to account for the different size DCT with a factor of 1/2, so the final descaling factor is 1/4 * 1/2 = 1/8 (right shift 3).

Now let (x0 x1 x2 x3) be our input vector, (y0 y1 y2 y3) our output. Then we have

  / y0 \     /  1   c2   1   c6  \     / x0 \     /  x0 + c2 * x1 + x2 + c6 * x3  \
  |    |     |                   |     |    |     |                               |
  | y1 |     |  1   c6  -1  -c2  |     | x1 |     |  x0 + c6 * x1 - x2 - c2 * x3  |
  |    |  =  |                   |  *  |    |  =  |                               |
  | y2 |     |  1  -c6  -1   c2  |     | x2 |     |  x0 - c6 * x1 - x2 + c2 * x3  |
  |    |     |                   |     |    |     |                               |
  \ y3 /     \  1  -c2   1  -c6  /     \ x3 /     \  x0 - c2 * x1 + x2 - c6 * x3  /

You can see that the even part (x0 x2) does not contain any multiplication.
Now let

  t10 = x0 + x2
  t12 = x0 - x2            (even part)

  t0 = c2 * x1 + c6 * x3
  t2 = c6 * x1 - c2 * x3   (odd part)

Then we have

  y0 = t10 + t0
  y3 = t10 - t0
  y1 = t12 + t2
  y2 = t12 - t2

This would require 4 multiplications. We can save 1 multiplication by simple algebraic transformation:

  t0 = (c2 - c6) * x1 + c6 * (x1 + x3)
  t2 = c6 * (x1 + x3) - (c2 + c6) * x3

This requires 3 multiplications with

  c6 = sqrt(2) * cos(6*pi/16) = 0.541196100
 [c2 = sqrt(2) * cos(2*pi/16) not needed]
  c2 + c6 = 1.847759065
  c2 - c6 = 0.765366865

This is the same 'rotation' as in the even part of the 8x8 point LL&M IDCT algorithm.

More scalings

Direct 2:3 (8:12) DCT upscaling with fast 12x12 IDCT