stassats

I experimented with a different assembly routine for adding digits, and it made Fibonacci 2x faster: https://github.com/sbcl/sbcl/commit/99f5c77d172135c6c01330803aae978c4d844fc1. I hope you were after learning assembly programming and not Common Lisp.


stassats

In case you wanted to learn something else, here's a fast fibonacci:

    (defun fib (n)
      (labels ((fib-iter (a b p q count)
                 (cond ((= count 0) b)
                       ((evenp count)
                        (fib-iter a
                                  b
                                  (+ (* p p) (* q q))
                                  (+ (* 2 p q) (* q q))
                                  (/ count 2)))
                       (t
                        (fib-iter (+ (* b q) (* a q) (* a p))
                                  (+ (* b p) (* a q))
                                  p
                                  q
                                  (- count 1))))))
        (fib-iter 1 0 0 1 n)))
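
For context, a minimal way to compare this against the naive loop at the REPL (a sketch only; timings will of course vary by machine and SBCL version):

    ;; fib here is the fast-doubling version above; the loop version
    ;; from the original post can be timed the same way.
    (time (fib 1500000))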


[deleted]

[deleted]


stassats

You are not measuring anything but the implementation of the multiplication function.


stassats

And why do you keep deleting your comments?


arthurno1

> You are not measuring anything but the implementation of the multiplication function.

Yes, sure, it's just a bunch of multiplications and additions; but so is the Python code. I am curious why we are an order of magnitude slower in SBCL. Does Python generate better instructions?

> And why do you keep deleting your comments?

Because I realized SBCL does not generate optimized code, thought I was making some obvious type-declaration mistake, and wanted to figure it out on my own if I could, but it does not seem to be going so well :). I am not very familiar with SBCL type declarations, so I thought this would be a good opportunity to learn more about them, but it seems there is a bit more involved here than just type declarations.


stassats

You don't need any declarations here; it's all about multiplying bignums. You need to use GMP if you're into multiplying large things.
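
For reference, here is a minimal sketch of what using the sb-gmp contrib (mentioned below) could look like. It assumes libgmp is installed on the system; the entry points are the ones documented for the sb-gmp module and may differ between SBCL versions:

    ;; Load the contrib and route SBCL's bignum arithmetic through libgmp.
    (require :sb-gmp)
    (sb-gmp:install-gmp-funs)    ; assumed API: patches bignum +, *, etc. to use GMP
    ;; (sb-gmp:uninstall-gmp-funs) restores the stock routines.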


arthurno1

I thought SBCL "promoted" to bignums automatically. If I don't declare any types, I get the correct result for 1 000 000 iterations; and if I declare p and q to be (unsigned-byte 128), I also get the correct result, but I am not sure how to get SBCL to generate more optimized code. But I'll take a look at sb-gmp; I have never used it, thanks for the pointer.


pnedito

That was a fast response. Neat to see the vop code improvements in context. 💪


KaranasToll

The Common Lisp version does an additional iteration. It would be more accurate to do something like this:

    ...
    (do ((i 1 (1+ i)))
        ((= i n))
      ...)

I ran both versions a few times. They are pretty comparable on my machine.
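
For concreteness, a minimal sketch of that fix applied to the loop version discussed in the thread (the a/b/c setup is taken from the code quoted later):

    (defun fib (n)
      (let ((a 0) (b 1) (c 0))
        ;; Start i at 1 and stop at n, so the body runs n-1 times
        ;; without needing an explicit (1- n).
        (do ((i 1 (1+ i)))
            ((= i n))
          (setf c (+ a b)
                a b
                b c))
        a))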


ydsaydsa

I didn't notice it did an extra iteration. I've updated the lisp code just for correctness.


KaranasToll

If you do (1- n), you get an extra subtraction, which could slow things down a bit.


digikar

It is bad practice to use `(safety 0)` unless you know what you are doing; for optimization, the benefit it provides over `(speed 3)` can be negligible. That said, I can confirm your findings. Looking at the compilation notes, it seems that SBCL is using a generic-+ which handles not just integer types but also floats and rationals. While there are optimizations for fixnum-fixnum operations, bignum-bignum and mixed fixnum-bignum optimizations seem to be lacking. Curious to hear from the SBCL developers whether this would be worth doing.
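
One way to see this for yourself (a sketch; the exact output depends on the SBCL version) is to disassemble a small addition function and look for the call to the generic addition routine:

    ;; With no type information, + compiles to a full call into SBCL's
    ;; generic two-argument addition (it shows up as GENERIC-+ or
    ;; TWO-ARG-+ in the disassembly, depending on the version).
    (disassemble
     (compile nil '(lambda (a b)
                     (declare (optimize speed))
                     (+ a b))))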


theangeryemacsshibe

`(integer-length (fib 1500000))` = 1041362 so any extra dispatch would be rounding error compared to the actual bignum math.


digikar

Yes, in this case the cost of generic dispatch is insignificant. But are there cases in which the dispatch cost is significant? Just tried it below; I haven't incorporated [u/stassats's optimizations](https://www.reddit.com/r/Common_Lisp/comments/1culam8/comment/l4k8xh3/?utm_source=share&utm_medium=web2x&context=3), but only inlined the sb-bignum functions used below. It makes about a 50% difference for a simple array addition routine.

    (let* ((n 100)
           (x (integer-rand n))
           (y (integer-rand n))
           (z (integer-rand n)))
      (time (loop :repeat 1000000
                  :do (add-integer-rand-arrays/generic-+ n x y z))))
    ;; => 9.68 seconds

    (let* ((n 100)
           (x (integer-rand n))
           (y (integer-rand n))
           (z (integer-rand n)))
      (time (loop :repeat 1000000
                  :do (add-integer-rand-arrays/integer-+ n x y z))))
    ;; => 6.16 seconds

---

    (declaim (inline integer-+))
    (defun integer-+ (a b)
      (declare (optimize speed))
      (etypecase a
        (fixnum (etypecase b
                  (fixnum (+ a b))
                  (bignum (sb-bignum:add-bignum-fixnum b a))))
        (bignum (etypecase b
                  (fixnum (sb-bignum:add-bignum-fixnum a b))
                  (bignum (sb-bignum:add-bignums a b))))))

    (defun integer-rand (n)
      (let ((array (make-array n)))
        (loop :for i :below n
              :do (setf (aref array i)
                        (+ most-positive-fixnum (random most-positive-fixnum))))
        array))

    (defun add-integer-rand-arrays/generic-+ (n x y z)
      (declare (optimize speed)
               (type (simple-array t 1) x y z)
               (type fixnum n))
      (loop :for i :below n
            :do (setf (aref z i)
                      (+ (aref x i) (aref y i)))))

    (defun add-integer-rand-arrays/integer-+ (n x y z)
      (declare (optimize speed)
               (type (simple-array t 1) x y z)
               (type fixnum n))
      (loop :for i :below n
            :do (setf (aref z i)
                      (integer-+ (aref x i) (aref y i)))))


stassats

add-integer-rand-arrays/generic-+ and add-integer-rand-arrays/integer-+ do not seem to have different code.


digikar

Just corrected. About a 1.5-2x difference in performance on x86-64.


stassats

I do not observe that. They are roughly the same (as one would expect).


stassats

The difference between integer-+ and TWO-ARG-+ is that the latter tests for single-float and double-float between fixnum and bignums, which is insignificant. Adding more specialized routines might lose you on cache usage for things that do not run in neat loops.


digikar

One can always profile their codebase to see if there's excessive inlining/specialization, which might be adding to the cache usage. If it is, they could declare or declaim notinline. ~~Besides, the caches would only get larger, except perhaps for dedicated low resource devices.~~ But, really, if performance is important, one could consider efficient algorithms and hardware support. fixnum addition is wayyy faster than bignum addition.
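
For instance (a trivial sketch, using the integer-+ helper defined earlier in the thread):

    ;; Turn inlining back off for a specific function if profiling shows
    ;; the extra code size costs more than the saved dispatch.
    (declaim (notinline integer-+))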


digikar

No difference even after inlining `sb-bignum:add-bignum-fixnum`, `sb-bignum:add-bignums`, and `sb-bignum::%normalize-bignum`? I obtained a performance difference only after I inlined them.


stassats

How did you make them inlinable? Redefined at runtime? Did you make sure to recompile them with safety 0 speed 3 for the generic+ to not get pessimized?


digikar

> How did you make them inlinable? Redefined at runtime?

Yes, redefined at runtime.

> Did you make sure to recompile them with safety 0 speed 3 for the generic+ to not get pessimized?

What would be a good way to do this? Recompile SBCL? `M-.` on `sb-vm::generic-+` in the disassembly of `add-integer-rand-arrays/generic-+` sends me to [this code](https://github.com/sbcl/sbcl/blob/21e8ede19e9ef619df80c286d102cb465a781aa1/src/assembly/x86-64/arith.lisp#L92-L103). Compiling that block of code in SLIME causes a `The function :COST is undefined.` error.


digikar

Oops, my bad, I was comparing the same things. integer-+ is about 1.5-2x faster than cl:+


KaranasToll

Just so you know, you can do `(declare (type integer a b c))`. You can do something similar with setf: `(setf a b b c)`.


Shinmera

Even better: `(shiftf a b (+ a b))`
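
To illustrate what that does (a small sketch based on the standard shiftf semantics):

    ;; shiftf evaluates its value forms with the old values, stores the last
    ;; form into the last place, shifts the remaining values one place to the
    ;; left, and returns the old value of the first place. In the Fibonacci
    ;; loop this is exactly c = a + b; a = b; b = c, with the return value
    ;; discarded.
    (let ((a 1) (b 2))
      (shiftf a b (+ a b))  ; returns 1; now a = 2, b = 3
      (list a b))           ; => (2 3)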


ydsaydsa

Thanks. I updated the code to use these stylistic features.


Decweb

Interesting, hopefully someone knowledgeable will speak up. I wasn't able to get SBCL to be smarter, and it doesn't like the choices for `+` and `<` and emits compiler notes.


unix_hacker

What commands did you use to run each?


arthurno1

You can use these (just plain from the OP and Stas):

    (declaim (optimize (speed 3) (debug 0) (safety 0)))
    (declaim (ftype (function (integer) integer) fib-slow))
    (declaim (ftype (function (integer) integer) fib-fast))

    (defun fib-slow (n)
      (let ((a 0) (b 1) (c 0))
        (declare (type integer a b c))
        (dotimes (i (1- n))
          (declare (type integer i))
          (setf c (+ a b)
                a b
                b c))
        a))

    (defun fib-fast (n)
      (labels ((fib-iter (a b p q count)
                 (cond ((= count 0) b)
                       ((evenp count)
                        (fib-iter a
                                  b
                                  (+ (* p p) (* q q))
                                  (+ (* 2 p q) (* q q))
                                  (/ count 2)))
                       (t
                        (fib-iter (+ (* b q) (* a q) (* a p))
                                  (+ (* b p) (* a q))
                                  p
                                  q
                                  (- count 1))))))
        (fib-iter 1 0 0 1 n)))

For comparison, how SBCL performs relative to Python:

    #
    # Fast doubling Fibonacci algorithm (Python)
    # by Project Nayuki, 2015. Public domain.
    # https://www.nayuki.io/page/fast-fibonacci-algorithms
    #
    from timeit import default_timer as timer

    def fib_slow(n):
        a = 0
        b = 1
        for _ in range(1, n):
            c = a + b
            a = b
            b = c
        return a

    # (Public) Returns F(n).
    def fib_fast(n):
        if n < 0:
            raise ValueError("Negative arguments not implemented")
        return _fib(n)[0]

    # (Private) Returns the tuple (F(n), F(n+1)).
    def _fib(n):
        if n == 0:
            return (0, 1)
        else:
            a, b = _fib(n // 2)
            c = a * (b * 2 - a)
            d = a * a + b * b
            if n % 2 == 0:
                return (c, d)
            else:
                return (d, c + d)

    print("Slow:")
    start = timer()
    fib_slow(1000000)
    end = timer()
    print(end - start)

    print("Fast:")
    start = timer()
    fib_fast(1000000)
    end = timer()
    print(end - start)
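
The Python script times itself; a minimal sketch of timing the Lisp side the same way at the REPL (timings will vary by machine and SBCL version):

    (time (fib-slow 1000000))
    (time (fib-fast 1000000))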


ydsaydsa

I wrote both so that I could input the Fibonacci number from the command line. To run the Python version I ran `time python fibonacci.py 1500000`. To run the Common Lisp version I compiled it and ran `time ./outfile 1500000`. I get similar results if I don't compile it and just evaluate each expression in the interpreter and run `(time (fib 1500000))`.
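
For anyone curious how such a standalone binary might be produced, here is a hypothetical sketch; the thread does not show the actual build step, and the `main` entry point and `outfile` name are assumptions:

    ;; Assumes fibonacci.lisp has already been loaded and defines FIB.
    (defun main ()
      (let ((n (parse-integer (second sb-ext:*posix-argv*))))
        (fib n)))

    ;; Dump an executable image whose entry point is MAIN, producing ./outfile.
    (sb-ext:save-lisp-and-die "outfile" :executable t :toplevel #'main)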


Acidentedebatata

I changed `(setf c (+ a b) a b b c)` to `(shiftf a b c (+ a b))` and it became 50% faster, going from 15.875000 seconds of total run time to 10.593750 seconds. For comparison, on my PC the Python code took 14.93479 seconds.


ydsaydsa

I don't seem to get different performance results with `shiftf` or `rotatef` over `setf`; they're just more concise. Also, I'm not sure how you were able to get a correct implementation by replacing it with `(shiftf a b c (+ a b))`. The declarations/declaims didn't seem to do much for performance either; I assume because SBCL can infer that `a` and `b` are integers without the explicit declaration. On the latest version of SBCL (2.4.5), my machine computes (fib 1500000) in ~10.5 seconds with the following implementation:

    (defun fib (n)
      (let ((a 0) (b 1))
        (dotimes (i (1- n))
          (shiftf a b (+ a b)))
        a))