Sophie: gmp-ecm-6.3-1.fc14 x86

gmp-ecm-6.3-1.fc14.x86_64.rpm

ToDo's (see also TODO.sp):

Table of contents:
1) efficiency/memory
2) interface
3) documentation
4) installation
5) bugs
6) others

1) efficiency/memory
- try gmp_redc_2 (when in the public API of GMP). See
  http://gmplib.org/devel/REDC_1_TO_REDC_2_THRESHOLD.html.
- try GMP's mpn/generic/powm.c code to select routines to perform modular
  multiplication depending on different thresholds
- try the mpn/generic/{sb,dc,mu}_bdiv_qr.c functions in GMP >= 4.3.0 for REDC
- the conversion from mpz_t to NTT primes in function mpzspv_from_mpzv()
  (file mpzspv.c) is quadratic. A faster conversion is possible with a
  remainder tree (in addition since several conversions are made, the
  cost of computing the tree would be small). The same applies to the
  backward conversion (routine mpzspv_to_mpzv).
- even worse, mpzspm_init seems to be cubic in the input size (because the CRT
  algorithm used is quadratic in sp_num). We should use a subquadratic CRT.
- the "Reducing  G * H" step is faster in NTT than with KS. This is probably
  due to the fact that some transforms are cached in the NTT mode.
- the "Reducing  G * H" step can be improved as follows: first compute
  D = GH*I mod (x^d+1) where d = deg(F), and I = 1/F mod (x^d+1); then
  compute E = D*F mod (x^d-1); finally compute T = (GH-E)/2 mod (x^d+1).
  T equals the Montgomery product GH/(x^d+1) mod F.
  See the paper "Fast convolution meets Montgomery" by Preda Mihailescu
  (Mathematics of Computation).
- slowdown in stage 1 with REDC between a 58672-digit number and a
  58688-digit number [reported by Christophe.CLAVIER@gemalto.com, 29 Aug 2007]
  (((2003663613*2^195000-2)/(2*23*173*3863))/1954173900202379)/3612632846010637
  ((2003663613*2^195000-2)/(2*23*173*3863))/1954173900202379
  with B1=1000 on an Opteron (44.2s for c58672, 67.5s for c58688).
  The culprit seems to be the REDC routine in mpmod.c: indeed, in case the 
  modulus has n limbs, but the most significant one has only a few bits,
  the product (called x in REDC) has only 2n-1 limbs, and we never call
  Mulders's short product in ecm_redc_n (however the else-code using full
  products seem faster in that case).
  For c58672, if one replaces if (xn == 2 * n) in mpmod.c/REDC by
  if (xn >= 2 * n - 1), the time of stage 1 grows from 44s to 64s, whereas
  ecm_redc_n should be faster...
  This problem is still present in 6.2, ecm_redc_n should be better tuned,
  in particular the choice k=0.75*n in ecm_mul_lo_n() is far from optimal.
- in Brent-Suyama's extension, the evaluation of a polynomial of degree k
  over N consecutive values is currently done using a O(k N) algorithm
  [table of differences]. One can do O(N/k M(k)), cf Section 3 from 
  "Linear recurrences with polynomial coefficients and application to
   integer factorization and Cartier-Manin operator", by Alin Bostan,
   Pierrick Gaudry and Eric Schost, SIAM journal on computing, 
   vol. 36, no. 6, pp. 1777 - 1806, 2007.
   It is not clear if this result also applies to ECM, but at least it
   should word for P-1 and P+1.
- why restrict the use of mpn_mul_fft to Fermat numbers? We could use it
  for any cofactor of 2^(n*BITS_PER_MP_LIMB)+1, as long as
  mpn_fft_next_size (n, mpn_fft_best_k (n, S1 == S2)) == n.
- use mpres in step 2 (Target: 7.0)
- write a mpn version of add3 and duplicate  
- rewrite entire mpmod.c to be based on mpn_* functions, not mpz_*
- take relative speed of multiplying/squaring into account in PRAC
  (DN: couldn't get any significant speed increase)
- use/implement a mpn_mul_hi_n routine for use in mpn_REDC
- use mpn_addmul_2, mpn_addmul_4 in the basecase REDC [for machines
  where it exists]. ASM code should perhaps be moved into GMP.
- try McLaughlin's algorithm for Montgomery's modular multiplication
  (http://www.ams.org/mcom/0000-000-00/S0025-5718-03-01543-6/home.html)
- consider Colin Percival's generalized DWT for multiplication modulo
  k*a^n+b, where k*a*b is highly composite. May belong to GMP rather than
  GMP-ECM.
- implement assembly code (redc.asm) for other architectures
- allow composite d2, or better use the S1+S2 idea from the P+-1 algorithm
  of Montgomery and Kruppa.
- init mpz_t's with correct amount of memory allocated to avoid reallocs.
  Check for reallocs with GMP's memory interface routines. (Partly done.)
- try sliding window multiplication for ECM stage 1 (Target: 7.0)
- choose Brent/Suyama polynomial according to B2/k and not B2!
- Adjust estimated memory to take into account -treefile and NTT
  (done but improvement possible)
- when GWNUM is used, lower the default B2 (James Wanless, 17 Mar 2006,
	james at grok.ltd.uk)
- implement enhanced standard continuation? With graph cover algorithm?
- parallel/distributed stage 2?
- add curve selection for torsion group of order 8 or 16, see Montgomery's
  thesis. (request of Peter-Lawrence Montgomery)
- Torbj"orn Granlund suggested faster code for mpn_mod_1(), used extensively
  in NTT. See
    http://lists.gforge.inria.fr/pipermail/ecm-discuss/2008-May/003365.html

2) interface
- with -resume, print %time for THIS RUN instead of total run?
	[suggested by SleepHound <sleephound@yahoo.com>]
  Add CPUTIME=... in the save file, to take into account the total cpu time
  spend so far (in seconds). George Woltman agrees for that change. It won't
  hurt prime95/mprime -> will be added for his next version.
- when resuming, print the *initial* x0 for P-1/P+1?
- [from Jakub Pawlewicz <pan@mimuw.edu.pl>] add an option -stage1time t
  to tell the step 1 time, when done by another program. PZ: or better
  have it in resume file? (Target: 6.1. Command line option done)

3) documentation
- add examples for typical use to README 
  (see http://www.mersenneforum.org/showthread.php?t=3922)

4) installation
- check for __builtin_constant_p and __builtin_expect at configure time
- add a configure option --enable-shared --disable-static to link GMP
  dynamically
- [suggested by Peter Montgomery] add the possibility to compile a "fat"
  binary, which automatically selects the best mulredc assembly code
  depending on the cpuid [see TODO.fat]
- [suggested by Thomas Kunz, who did port GMP-ECM to the PS3, i.e., to the
  Cell architecture]: several changes to make it easier to port GMP-ECM to
  specific architectures. Cf TODO.kunz.

5) bugs
- potential bug: _mpz_realloc changes the value to 0 when it does not fit
	(mpmod.c)
- ecm -v 1e6 3-2 < c155 prints wrong expected number of curves (should be
  the same as ecm -v 1e6 2). Reported by Peter Montgomery, 30 Mar 2006.
  This is because ecmprob() gets only B2min, B2 as parameters and assumes 
  B1=B2min. Needs rewrite to integrate over interval [B2min, B2] but take 
  B1 as smoothness bound. Target: 6.3
- [reported by James Wanless <james@grok.ltd.uk>, 2 May 2008] the following
  should find a factor but does not (same with sigma=1101118787):
  [james@bympy2 gmp-ecm]$ echo '(2^86243+1)/(3)' | ./ecm6.1.1gmp4.2.1gwnum 
  -n -no-ntt -power 1 -sigma 3793703189 11e3
  GMP-ECM 6.1.1 [powered by GMP 4.2.1 and GWNUM 24.11] [ECM]
  Input number is (2^86243+1)/(3) (25962 digits)
  Using B1=11000, B2=1684420, polynomial x^1, sigma=3793703189
  Step 1 took 33023ms
  Step 2 took 137815ms
  Due to GWNUM, fixed with newer version of GWNUM.

6) others
- from Dan Bernstein <djb@cr.yp.to> 7 Mar 2006: use the curve
  (16b+18)y^2 = x^3 + (4b+2)x^2 + x with starting point (x=2,y=1).
  Caveat: with Montgomery multiplication, even if b is small (the argument
  used in duplicate), we still have to divide by R^l. Workaround: replace
  the mpres_mul call by a mpres_mul_ui call (b is not transformed).
- add primality proving of factors/cofactors? Maybe link Pari for this?
- add point counting algorithm? SEA implementation exists for Pari/GP,
   use that?
- let user specify previous factoring work, compute distribution of
  candidate factors, compute probability of/est. time to finding a factor 
  with given parameters.
- re-write in C++? Lots of work, but would make parts of the code much
  cleaner.