This implements a speedup to ScalarMult using the endomorphism available to secp256k1.
Note the constants lambda, beta, a1, b1, a2 and b2 are from here:
https://bitcointalk.org/index.php?topic=3238.0
Preliminary tests indicate a speedup of between 17%-20% (BenchScalarMult).
More speedup can probably be achieved once splitK uses something more like what fieldVal uses. Unfortunately, the prime for this math is the order of G (N), not P.
Note the NAF optimization was specifically not done as that's the purview of another issue.
Changed both ScalarMult and ScalarBaseMult to take advantage of curve.N to reduce k.
This results in a 80% speedup to large values of k for ScalarBaseMult.
Note the new test BenchmarkScalarBaseMultLarge is how that speedup number can
be checked.
This closes#1
This commit reworks the way that the pre-computed table which is used to
accelerate scalar base multiple is generated and loaded to make use of the
go generate infrastructure and greatly reduce the memory needed to compile
as well as speed up the compile.
Previously, the table was being generated using the in-memory
representation directly written into the file. Since the table has a very
large number of entries, the Go compiler was taking up to nearly 1GB to
compile. It also took a comparatively long period of time to compile.
Instead, this commit modifies the generated table to be a serialized,
compressed, and base64-encoded byte slice. At init time, this process is
reversed to create the in-memory representation. This approach provides
fast compile times with much lower memory needed to compile (16MB versus
1GB). In addition, the init time cost is extremely low, especially as
compared to computing the entire table.
Finally, the automatic generation wasn't really automatic. It is now
fully automatic with 'go generate'.
The addition of the pre-computed values for the ScalarBaseMult
optimizations added a benign race condition since a pointer to each
pre-computed Jacobian point was being used in the call to addJacobian
instead of a local stack copy.
This resulted in the code which normalizes the field values to check for
further optimization conditions such as equal Z values to race against the
IsZero checks when multiple goroutines were performing EC operations since
they were all operating on the same point in memory.
In practice this was benign since the value was being replaced with the
same thing and thus it was the same before and after the race, but it's
never a good idea to have races.
Code uses a windowing/precomputing strategy to minimize ECC math.
Every 8-bit window of the 256 bits that compose a possible scalar multiple has a complete map that's pre-computed.
The precomputed data is in secp256k1.go and the generator for that file is in gensecp256k1.go
Also fixed a spelling error in a benchmark test.
Results so far seem to indicate the time taken is about 35% of what it was before.
Closes#2
Since the Z values are normalized (which ordinarily mutates them as
needed) before checking for equality, the race detector gets confused when
using a global value for the field representation of the value 1 and
passing it into the various internal arithmetic routines and reports a
false positive.
Even though the race was a false positive and had no adverse effects, this
commit silences the race detector by creating new variables at the top
level and passing them instead of the global fieldOne variable. The
global is still used for comparison operations since those have no
potential to mutate the value and hence don't trigger the race detector.
This commit essentially rewrites all of the primitives needed to perform
the arithmetic for ECDSA signature verification of secp256k1 signatures to
significantly speed it up. Benchmarking has shown signature verification
is roughly 10 times faster with this commit over the previous.
In particular, it introduces a new field value which is used to perform the
modular field arithmetic using fixed-precision operations specifically
tailored for the secp256k1 prime. The field also takes advantage of
special properties of the prime for significantly faster modular reduction
than is available through generic methods.
In addition, the curve point addition and doubling have been optimized
minimize the number of field multiplications in favor field squarings
since they are quite a bit faster. They routines also now look for
certain assumptions such as z values of 1 or equivalent z values which
can be used to further reduce the number of multiplicaitons needed when
possible.
Note there are still quite a few more optimizations that could be done
such as using precomputation for ScalarBaseMult, making use of the
secp256k1 endomorphism, and using windowed NAF, however this work already
offers significant performance improvements.
For example, testing 10000 random signature verifications resulted in:
New btcec took 15.9821565s
Old btcec took 2m34.1016716s
Closesconformal/btcd#26.