New djpeg -scale N/8 with all N=1...16 feature

The decompress part of the Independent JPEG Group library already has a direct rescaling feature for decoding with factors 1/2, 1/4, and 1/8.
This is interesting because output of a different spatial size can be obtained directly from the JPEG (DCT) data, without a separate full decode followed by spatial resampling.
Using a new method, this feature has now been extended to a broader and finer range of scaling factors: all factors N/8 with N=1...16 (downscaling and upscaling) are now supported, either with the djpeg -scale option or by setting the scale_num/scale_denom variables in a library application.
Furthermore, the performance in the old cases (1/2, 1/4, 1/8) has been improved considerably with the new method.

A complementary cjpeg -scale 8/N (N=1...16) option has also been developed; see "Fast DCT scaling and new cjpeg -scale feature" for more information.

Prior approaches to the new method were presented in my article Fast DCT scaling to a fourth and to a half.

Application notes

Just set the desired scaling factor of N/8 (N=1...16) with the djpeg -scale option or via the scale_num/scale_denom variables in a library application. The new library is available for download.

NOTE: Version 8 of the JPEG software introduced new flavors of JPEG files with arbitrary block sizes from 1x1 to 16x16, not just the common 8x8.
The scale_num/scale_denom variables are always initialized with the actual DCT block size of the given JPEG file after the jpeg_read_header() function has been called.
You can therefore set only scale_num to one of the values 1...16 and leave scale_denom unchanged (or use djpeg -scale N only) to achieve different scaling effects. This value then specifies the DCT scaled size to be applied to the given image.
The Jpegcrop code demonstrates this.
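
To get a feeling for the resulting image sizes, here is a minimal sketch of the arithmetic for an ordinary 8x8-block file. It assumes that an output dimension is the source dimension multiplied by N/8 and rounded up; the helper name scaled_dimension is hypothetical and not part of the library, which reports the actual values in output_width and output_height.

```c
#include <assert.h>

/* Hypothetical helper, not a library function: expected output dimension
 * for a source dimension scaled by n/8.  Rounding up is an assumption. */
unsigned long scaled_dimension(unsigned long dim, int n)
{
    assert(n >= 1 && n <= 16);            /* range supported per the text */
    return (dim * (unsigned long)n + 7UL) / 8UL;
}
```

Under this assumption a 640 pixel wide image decoded with -scale 3/8 would come out 240 pixels wide, and a 641 pixel wide image decoded with -scale 1/8 would come out 81 pixels wide.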

Implementation notes

The core implementation was done by extending the file jidctint.c. This file is HUGE now (186 KB, beware) and contains a lot of optimized IDCT routines with various output sizes. See the comment at the top of this file for more details about the new method and its implementation. Note that the old jidctred.c file has been removed to streamline the file structure of the library.

Otherwise the library was well prepared for handling the new sizes, also with odd factors and upscaling.

Note that the new 16x16 IDCT output routine is also used to resolve the common 2x2 chroma subsampling case efficiently, without additional spatial upsampling, during normal (1/1) decoding in the library.
Separate spatial upsampling for those kinds of files is now only necessary in the -scale N/8 cases with N>8.

Furthermore, separate IDCT output routines are provided to resolve the common asymmetric subsampling cases (2x1 and 1x2) directly, without additional resampling.
Besides the performance advantage, this effort was necessary because of current limitations in the library which cause odd and undesirable effects when rescaling such files:
In the current library, the decompress part can only upsample data, and the compress part can only downsample data. When downscaling an image while decoding, you would rather expect data to be downsampled than upsampled. The current library partially works around this problem by adapting the IDCT routine to the sampling factors. However, if only symmetric IDCTs are provided, this works well only for symmetric sampling cases.
Here is an example:
Assume we have a common 2x1 (horizontal-only) subsampled JPEG image, and we want to downscale it by 1/2. The 2x1 horizontal subsampling means that the color is already downscaled by 1/2 in the horizontal direction, so for 1/2 size output we would only need to downscale the color further by 1/2 in the vertical direction. However, since the library cannot downsample data in the decompress part, it is forced to select a 4x4 IDCT output routine for the color data. This means that the data is reduced from size 8 to 4 in the horizontal direction in the IDCT part and must afterwards be upsampled by 2h1v back to size 8 in the upsampling part. This is odd because the full size-8 data is in the file, but the library first reduces it to 4 and then expands it to 8 again, which means considerable data loss and quality degradation.
With the provided asymmetric IDCT output routines these problems are avoided, and optimal output and performance are guaranteed even when rescaling oddly subsampled files.
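
The example can be traced with a small sketch of how an IDCT output size might be chosen per direction. This is a simplified model under stated assumptions, not the library's actual selection code: the function names are hypothetical, and the rule assumed here is that the size is doubled as long as the component is still subsampled by an even factor and the current size does not exceed the DCT block size of 8.

```c
#include <assert.h>

#define DCTSIZE 8   /* DCT block size of ordinary JPEG files */

/* Hypothetical helper: IDCT output size for one direction of one component.
 *   min_size  - scaled block size N of the least subsampled component
 *   max_samp  - largest sampling factor of any component in this direction
 *   comp_samp - sampling factor of this component in this direction
 * Assumed rule: double the size while the component is still subsampled by
 * an even factor and the current size does not exceed DCTSIZE. */
int idct_size_1d(int min_size, int max_samp, int comp_samp)
{
    int ssize = 1;

    assert(min_size >= 1 && min_size <= 16);
    assert(comp_samp >= 1 && max_samp >= comp_samp);
    while (min_size * ssize <= DCTSIZE &&
           max_samp % (comp_samp * ssize * 2) == 0)
        ssize *= 2;
    return min_size * ssize;
}

/* If only square IDCT routines exist, both directions share the smaller size. */
int idct_size_symmetric(int h_size, int v_size)
{
    return h_size < v_size ? h_size : v_size;
}
```

For the chroma component of the 2x1 file at 1/2 scale (N=4), idct_size_1d(4, 2, 1) gives 8 horizontally and idct_size_1d(4, 1, 1) gives 4 vertically, so an asymmetric routine can deliver the data directly. Restricted to square routines, idct_size_symmetric(8, 4) falls back to 4 and the 2h1v upsampling step described above becomes necessary. The same model gives 16 for 2x2 chroma at normal (1/1) decoding and leaves the size unchanged for N>8.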

An error has also been identified in the old IDCT sampling adaptation logic: create, for example, a JPEG file with cjpeg -sample 3x2, and then try a downscaled decode with djpeg -scale 1/2. djpeg will fail with an error in this case, although the file is a perfectly valid JPEG (and can be decoded in the normal 1/1 case). This error has been fixed in the new library: it works, though not optimally (we don't bother to optimize performance too much for such unusual cases).

Note that, due to the support of asymmetric samplings and scalings, the new library is not binary compatible with the old one. In the decompress master record, the internal DCT_scaled_size variable had to be split into separate DCT_h_scaled_size and DCT_v_scaled_size fields, and this change also required a lot of corresponding, but rather trivial, adaptations in many other source modules.

Computational efficiency of the algorithms for NxN IDCT output from 8x8 DCT coefficients

The computational efficiency of the developed algorithms is estimated here by the number of multiplications per output pixel. We do not take the dequantization multiplications into account here. Doing so would improve the upscaling results (N>8) even further, because only the input values (fewer than the output values when upscaling) need to be dequantized.

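Multiplications per output pixel = total loops * total mults / (N*N)

  N | 1-D kernel loops    | Multiplications in 1-D kernel | Multiplications
    | column  row  total  | even part  odd part  total    | per output pixel
 ---+---------------------+-------------------------------+-----------------
  1 |    1      1     1   |     0         0        0      |   0
  2 |    2      2     4   |     0         0        0      |   0
  3 |    3      3     6   |     1         1        2      |   1.3
  4 |    4      4     8   |     0         3        3      |   1.5
  5 |    5      5    10   |     2         3        5      |   2
  6 |    6      6    12   |     2         1        3      |   1
  7 |    7      7    14   |     7         5       12      |   3.4
  8 |    8      8    16   |     3         9       12      |   3
  9 |    8      9    17   |     5         5       10      |   2.1
 10 |    8     10    18   |     5         7       12      |   2.2
 11 |    8     11    19   |    11        13       24      |   3.8
 12 |    8     12    20   |     2        13       15      |   2.1
 13 |    8     13    21   |    13        16       29      |   3.6
 14 |    8     14    22   |     8        12       20      |   2.2
 15 |    8     15    23   |     9        13       22      |   2.2
 16 |    8     16    24   |     8        20       28      |   2.6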
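
The rating in the last column can be recomputed from the other columns. The following sketch does just that; the loop and multiplication counts are copied from the table, while the array and function names are only illustrative.

```c
#include <assert.h>

/* Values copied from the table; index 0 is unused, index N holds row N. */
static const int total_loops[17] =
    { 0, 1, 4, 6, 8, 10, 12, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24 };
static const int total_mults[17] =
    { 0, 0, 0, 2, 3, 5, 3, 12, 12, 10, 12, 24, 15, 29, 20, 22, 28 };

/* Multiplications per output pixel = total loops * total mults / (N*N). */
double mults_per_pixel(int n)
{
    assert(n >= 1 && n <= 16);
    return (double)(total_loops[n] * total_mults[n]) / (double)(n * n);
}
```

For N=8 this gives 16 * 12 / 64 = 3, for N=6 it gives 12 * 3 / 36 = 1, and for N=16 it gives 24 * 28 / 256 = 2.625, which appears rounded as 2.6 in the table.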

Note that only the N<=8 cases are full IDCTs; the others are partial IDCTs. It is interesting that, under this condition, all routines lie in a narrow efficiency range relative to the output size.
The most expensive cases, with a slightly larger but still very good rating compared to the 8-point case, are the prime numbers 7, 11, and 13. It seems that the even part multiplications cannot be reduced below N in these cases, while in all other cases they are fewer than N.
The other cases with N>8 have even slightly better ratings than the standard 8-point case.
The cases below 7 have considerably better ratings than the standard 8-point case.
Especially remarkable is the N=6 case, which has just 3 multiplications in the 1-D kernel, and a rating of 1 multiplication per output pixel.

The numbers are the result of a considerable optimization effort. (If you have developed that many IDCT algorithms, you will find the standard 8x8 case rather trivial ;-) However, I cannot guarantee that the numbers are already minimal. If you are a mathematical (algebraic) wizard and can bring down one of the multiplication counts, I'd like to hear from you.

Note that additional performance advantages are obtained by adapted entropy decoding (sequential mode only):
The DC-only-optimized Huffman decoding of the old library (only useful for the case N=1) has been extended to arbitrary band-limited decoding for every scaled DCT block size. This gives noticeable additional performance advantages, especially for the small N values (<=4), where only a few AC coefficients need to be fully decoded from the input bitstream, while the rest can be flushed.
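
To illustrate how few coefficients are involved for small N, here is a rough sketch. It assumes that the needed band is the top-left NxN square of the 8x8 coefficient block and that full decoding can stop after the last zigzag position inside that square; whether the library draws the limit exactly there is an assumption, and the name coef_limit is used here only for illustration. The table is the standard JPEG zigzag order, indexed by row and column.

```c
#include <assert.h>

/* Zigzag position of the coefficient at [row][column] in an 8x8 block. */
static const int zigzag_index[8][8] = {
    {  0,  1,  5,  6, 14, 15, 27, 28 },
    {  2,  4,  7, 13, 16, 26, 29, 42 },
    {  3,  8, 12, 17, 25, 30, 41, 43 },
    {  9, 11, 18, 24, 31, 40, 44, 53 },
    { 10, 19, 23, 32, 39, 45, 52, 54 },
    { 20, 22, 33, 38, 46, 51, 55, 60 },
    { 21, 34, 37, 47, 50, 56, 59, 61 },
    { 35, 36, 48, 49, 57, 58, 62, 63 }
};

/* Number of leading zigzag positions that must be fully decoded so that the
 * whole top-left n x n band is available (assumed band-limit rule). */
int coef_limit(int n)
{
    assert(n >= 1 && n <= 8);
    return 1 + zigzag_index[n - 1][n - 1];
}
```

Under these assumptions, N=1 needs only the DC coefficient, N=2 needs the first 5 of 64 zigzag positions, and N=4 needs the first 25, so in the last case the remaining 39 positions of each block would not have to be fully decoded.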