finding singleton elements in a sequence

I went offline and proved the original algorithm subject to the
conjecture that the XOR tricks worked. As it happens, the XOR tricks
don't work, but the following argument may still interest some
people. (I re-did it in Haskell because I find proofs much easier when
I have recursive functions instead of loops and I can use data
structures. But for the Pythonistas in the audience I tried to use
list comprehensions wherever possible.)


Beautiful theory slain by an ugly fact
======================================

Problem: we're given a sequence of *n* nonzero 32-bit words in which
every element is either *singleton* or *doubleton*:

  - If a word appears exactly once, it is *singleton*.

  - If a word appears exactly twice, it is *doubleton*.

  - No word appears three or more times.

The problem is to find the singletons.  If there are three singletons,
we should use linear time and constant space.  More generally, if
there are *k* singletons, we should use _O(k*n)_ time and _O(k)_
space.  Unfortunately, the algorithm rests on an unproven conjecture
about exclusive or, and the conjecture has been shown to be false.

I've paid close attention to proof of correctness.  Arguments about
resource bounds are hand-waving.

We begin with these basics:

> module Singleton where
> import Data.Bits
> import Data.List
> import Data.Word
> import Test.QuickCheck hiding ((.&.))


Key abstraction: Partial specification of a word
-----------------------------------------


To tackle the problem I'm going to introduce an abstraction:
to describe the least significant $w$ bits of a 32-bit word,
I introduce a `Spec`:

> data Spec = Spec { w :: Int, bits :: Word32 }
>    deriving Show
> width = w -- width of a Spec

A `Spec` matches a word if the least significant `w` bits are equal
to `bits`.  If `w` is zero, by definition all words match:

> matches :: Spec -> Word32 -> Bool
> matches spec word = width spec == 0 ||
>                     ((word `shiftL` n) `shiftR` n) == bits spec
>   where n = 32 - width spec
>
> universalSpec = Spec { w = 0, bits = 0 }

Here are some claims about `Spec`s:

  - All words match the `universalSpec`, which has width 0

  - If `matches spec word` and `width spec == 32`, then `word == bits spec`

Key idea: "extend" a partial specification
------------------------------------------

Here is the key idea of the algorithm: we can *extend* a `Spec` by
adding another bit to the specification.  Extending a `Spec` produces
a list of two `Spec`s

> extend :: Spec -> [Spec]
> extend spec = [ Spec { w = w', bits = bits spec .|. (bit `shiftL` width spec) }
>               | bit <- [0, 1] ]
>   where w' = width spec + 1

And here's the crucial claim: if `spec` matches `word` and if `width spec`
is less than 32, then exactly one of the two specs from `extend spec`
match `word`.  The proof is by case analysis on the relevant bit of `word`.
This claim is so important that I'm going to call it Lemma One
Here's a test:

> lemmaOne :: Spec -> Word32 -> Property
> lemmaOne spec word =
>   width spec < 32 && (spec `matches` word) ==>
>       isSingletonList [s | s <- extend spec, s `matches` word]
>
> isSingletonList :: [a] -> Bool
> isSingletonList [a] = True
> isSingletonList _   = False

We're going to define a function which given a `Spec` and a sequence
of 32-bit words, returns a list of the singleton words that match the
spec.  The function will take time proportional to the length of the
input times the size of the answer times 32, and extra space
proportional to the size of the answer times 32.  Before we tackle the
main functio, we define some constant-space XOR functions.

XOR ideas that are broken
-------------

Function `xorWith f ws` applies function `f` to every word in `ws` and
returns the exclusive or of the result.

> xorWith :: (Word32 -> Word32) -> [Word32] -> Word32
> xorWith f ws = reduce xor 0 [f w | w <- ws]
>   where reduce = foldl'

Thanks to *stream fusion* (see ICFP 2007), function `xorWith` takes
constant space.

A list of nonzero words has a singleton if and only if either the
exclusive or is nonzero, or if the exclusive or of `3 * w + 1` is
nonzero.  (The "if" direction is trivial.  The "only if" direction is
a conjecture that Evgeny Kluev has disproven; for a counterexample,
see array `testb` below.  I can make Evgeny's example work by adding
a third function `g`, but obviously this situation calls for a proof,
and I don't have one.)

> hasSingleton :: [Word32] -> Bool
> hasSingleton ws = xorWith id ws /= 0 || xorWith f ws /= 0 || xorWith g ws /= 0
>   where f w = 3 * w + 1
>         g w = 31 * w + 17

Efficient search for singletons
-------------------------------


Our main function returns a list of all the singletons matching a spec.

> singletonsMatching :: Spec -> [Word32] -> [Word32]
> singletonsMatching spec words =
>  if hasSingleton [w | w <- words, spec `matches` w] then
>    if width spec == 32 then
>      [bits spec]
>    else
>      concat [singletonsMatching spec' words | spec' <- extend spec]
>  else
>    []

We'll prove its correctness by induction on the width of the `spec`.

  - The base case is that `spec` has width 32.  In this case, the list
    comprehension will give the list of words that are exactly equal
    to `bits spec`.  Function `hasSingleton` will return `True` if and
    only if this list has exactly one element, which will be true
    exactly when `bits spec` is singleton in `words`.

  - Now let's prove that if `singletonsMatching` is correct for with
    *m+1*, it is also correct for width *m*, where *m < 32$.  (This is
    the opposite direction as usual for induction, but it doesn't
    matter.)

    Calling `extend spec` on a `spec` of width *m* returns two specs
    that have width $m+1$.  By hypothesis, `singletonsMatching` is
    correct on these specs.  To prove: that the result contains
    exactly those singletons that match `spec`.  By Lemma One, any
    word that matches `spec` matches exactly one of the extended
    specs.  By hypothesis, the recursive calls return exactly the
    singletons matching the extend specs.  When we combine the results
    of these calls with `concat`, we get exactly the matching
    singletons, with no duplicates and no omissions.

Actually solving the problem is anticlimactic: the singletons are all
the singletons that match the empty spec:

> singletons :: [Word32] -> [Word32]
> singletons words = singletonsMatching universalSpec words

Testing code
------------

> testa, testb :: [Word32]
> testa = [10, 1, 1, 2, 3, 4, 2, 5, 6, 7, 10]
> testb = [ 0x0000
>         , 0x0010
>         , 0x0100
>         , 0x0110
>         , 0x1000
>         , 0x1010
>         , 0x1100
>         , 0x1110
>         ]

Beyond this point, if you want to follow what's going on, you need to
know [QuickCheck](http://en.wikipedia.org/wiki/QuickCheck).

Here's a random generator for specs:

> instance Arbitrary Spec where
>   arbitrary = do width <- choose (0, 32)
>                  b <- arbitrary
>                  return (randomSpec width b)
>   shrink spec = [randomSpec w' (bits spec) | w' <- shrink (width spec)] ++
>                 [randomSpec (width spec) b | b  <- shrink (bits spec)]
> randomSpec width bits = Spec { w = width, bits = mask bits }
>   where mask b = if width == 32 then b
>                  else (b `shiftL` n) `shiftR` n
>         n = 32 - width

Using this generator, we can test Lemma One using `quickCheck lemmaOne`.

We can test to see that any word claimed to be a singleton is in fact singleton:

> singletonsAreSingleton nzwords =
>   not (hasTriple words) ==> all (`isSingleton` words) (singletons words)
>   where isSingleton w words = isSingletonList [w' | w' <- words, w' == w]
>         words = [w | NonZero w <- nzwords]

> hasTriple :: [Word32] -> Bool
> hasTriple words = hasTrip (sort words)
> hasTrip (w1:w2:w3:ws) = (w1 == w2 && w2 == w3) || hasTrip (w2:w3:ws)
> hasTrip _ = False


Here's another property that tests the fast `singletons` against a
slower algorithm that uses sorting.

> singletonsOK :: [NonZero Word32] -> Property
> singletonsOK nzwords = not (hasTriple words) ==>
>   sort (singletons words) == sort (slowSingletons words)
>  where words = [w | NonZero w <- nzwords ]
>        slowSingletons words = stripDoubletons (sort words)
>        stripDoubletons (w1:w2:ws) | w1 == w2 = stripDoubletons ws
>                                   | otherwise = w1 : stripDoubletons (w2:ws)
>        stripDoubletons as = as