Saturday, May 24, 2008

Optimizing Factorization

I've been working my way through the problems hosted by Project Euler. Usually, using a naïve approach coupled with efficient nondeterminism suffices (oh! don't other problem solvers wish they had the MonadPlus in their semantics when tackling the now-trivial pandigital problems!), but I ran into a snag when I was factoring large triangular numbers.

The problem was my naïveté, of course.

My original version of factors proceeded thusly ...

factors :: Integral a ⇒ a → [a]
factors x = filter (λy . x `rem` y ≡ 0) [1..x]

... which is straightforward enough. One can't get any simpler than: "Keep ('filter') all the numbers from 1 to x that divide x with no remainder"!

Simple? Yes. Naïve? Yes. For small numbers, this factorization algorithm did just fine, even for the 384th triangle, 73920, this algorithm found the 112 factors in a blink. The problem appeared when doing a (triangular) search along the Natural number line for an unknown, large, triangle. The factorization algorithm is linear, and combining it with a (nearly) linear search makes the entire activity quadratic. Twenty-six hours after I began my search ("find the first triangular number with more than 500 factors") the algorithm was still going with no solution.

There must be a better way to go about this. And, of course, there are more than a few ways, some of which you can pay for, but none, on the top of the search tree that appeared simple enough to comprehend and to implement in a reasonable time.

Fortunately, there also exists an obvious sub-linear solution: if we are given in the linear progression [1..x] at the current index, y, that y is a factor of x, then obviously ...

z where z = x / y

... is also a factor. Not only that, but, since we are progressing linearly, z now becomes the upper bound in the factorization algorithm. The fix-point of the algorithm described gives us a new, much more efficient, factorization ...

factors x = factors' 2 (x, [1,x])
where factors' y res@(top, ans) | y ≥ top = ans
| otherwise = factors'' (x `divMod` y) y res
factors'' (newtop, 0) y (oldtop, ans)
= factors' (succ y) (newtop, y:oldtop:ans)
factors'' _ y res
= factors' (succ y) res

... Much more efficient? Yes. Self-describing? Not so much. Also, there's now a subtle error introduced by assuming all factors are unique, for ...

> factors 25 → [5,5,1,25]

... duplicates are introduced when factoring squares. Testing uniqueness adds its own complexity:

factors x = factors' 2 (x, [1,x])
where factors' y res@(top, ans) | y ≥ top = ans
| otherwise = factors'' (x `divMod` y) y res
factors'' (newtop, 0) y (oldtop, ans)
= factors' (succ y) (newtop, addin y oldtop ans)
factors'' _ y res
= factors' (succ y) res
addin x y list | x &equiv y = x:list
| otherwise = x:y:list

Complexity on top of complexity! This goes against my grain (as well as Richard O'Keefe's as you see in the side-bar quote), but is it worth it?

> sort (factors (triangle 12375)) → [1,2,3,...572 others...,76576500]

Yes, as the above interaction took no noticeable time, whereas before no solution was available after a day's worth of computation. So this optimized factorization algorithm is "good enough" for the task at hand, for I did solve the problem using that algorithm.

But, returning to the fix-point, which more than generalizes recursive algorithms -- it can be used to think of algorithms, qua algorithms, more generally. In this particular case, I have no desire to view every one of the 576 factors of this triangular number. No, I simply wish to ensure that there are more than 500 factors, no matter what those factors are. So, in the large, something like a countfactors (that simply returns the number of factors, instead the list of all the factors) is more desirable. How are we to go about writing such an algorithm? Most coders, I regret to observe, would punt to the copy-and-paste style of coding, as the desired changed is buried deeply within the algorithm. Not only that, but the type signature of the function varies with our alternatives: in what we've developed so far, the algorithm returns a list, but what we wish to have instead is (only one) number.

Fortunately, Dirk Thierbach introduced me to this particular use of the fix-point -- we simply extract the engine of computation into an external function. So now we have two functions that collapse into one line each ...

showfactors :: Integral a ⇒ a → [a]
showfactors x = factors x (λ y z ans . if y ≡ z then y:ans else y:z:ans) []

countfactors :: Integral a ⇒ a → a
countfactors x = factors x (λ y z ans . if y ≡ z then 1 + ans else 2 + ans) 0

... with factors being (slightly) modified to accept the new functional argument and base case:

factors :: Integral a ⇒ a → (a → a → b → b) → b → b
factors x adder ans = factors' 2 (x, adder 1 x ans)
where factors' y (tot, ans) | y ≥ tot = ans
| otherwise = factors'' (x `divMod` test) y (tot, ans)
factors'' (tot, 0) y (_, ans)
= factors' (y+1) (tot, adder y tot ans)
factors'' _ _ y ans
= factors' y ans

With this approach, the return type depends on the calling function's use and is threaded throughout the utility function, factors, by the adder computation engine.

One final note: the λ-terms in both calling functions showfactors and countfactors are also of the same structure, so we can again perform surgery, extracting the engine from the structure:

mkAdder :: Integral a ⇒ (a → b) → (b → b → b) → a → a → (b → b) 
mkAdder f g y z = g (if y ≡ z then f y else g (f y) (f z))

Now, anyone who's been working with monads for more than an brief period will see the b type as monadic and the functional type (a → b) as a lifting function (return) and the composition function, g, as monadic addition. And, as it turns out, this concept fits very nicely with the new implementation ...

showfactors x = factors x (mkAdder return mplus) []
countfactors x = factors x (mkadder (const 1) (+)) 0

With these new implementations, how much does factors need to change? Not one bit. Functional purity: why can't all programming languages have this? And with this generalization, I was able to achieve the desired result:

> countfactors (triangle 12375) → 576

Now it is known that both lists of some type (in this case integers) and also integers under addition are monoids. Given that, mkAdder can be further simplified. Also, the if-then-else is easily replaced by the Either type and proper use of arrows, but these are discussions for another day.

No comments: