"""
note on np.unique

this module expands on the functionality one would currently bootstrap from np.unique
to facilitate the functionality implemented in this module, a generalization of unique is proposed
that is, the version of unique implemented here also takes a lex-sortable tuple,
or a multidimensional array viewed along one of its axes, as a valid input; i.e.:

    #stack of 4 images
    images = np.random.rand(4,64,64)
    #shuffle the images; this is a giant mess now; how to find the unique ones?
    shuffled = images[np.random.randint(0,4,200)]
    #there you go
    print unique(shuffled, axis=0)

furthermore, some np.unique related functions are proposed, with a more friendly interface
(multiplicity, count)
"""


"""
A note on pandas
This module has substantial overlap with pandas' grouping functionality.
So what's the reason for implementing it in numpy?

Primarily: the concept of grouping is far more general than pandas' dataframe.
There is no reason why numpy ndarrays should not have a solid core of grouping functionality.

The recently added ufunc support means we can now express grouping operations in pure numpy;
that is, without any slow python loops or cumbersome C-extensions.

I was not aware of pandas' functionality on this front when I started working on this;
so I am somewhat pleased to say that many core design elements have turned out the same.
I take that to mean these design decisions probably make sense.

It does raise the question as to where the proper line between pandas and numpy lies.
I would argue that, evidently, most of pandas' functionality has no place in numpy.
Then how is grouping different? I feel what lies at the heart of pandas is a permanent conceptual
association between various pieces of data, assembled in a dataframe and all its metadata.
I think numpy ought to stay well clear of that.
On the other hand, you don't want to go around creating a pandas dataframe
just to plot a radial reduction; these kinds of transient single-statement associations
between keys and values are very useful, entirely independently of some heavyweight framework.

Further questions raised by pandas: should we have some form of merge/join functionality too?
Or is this getting too panda-ey? All the use cases I can think of fail to be pressed
into some kind of standard mould, but I might be missing something here.
"""


import numpy as np
import itertools






"""
this toggle switches between preferred or backwards compatible semantics
for dealing with key objects. current behavior for the arguments to functions
like np.unique is to flatten any input arrays.
I think a more unified semantics is achieved by interpreting all key arguments as
sequences of key objects, whereby non-flat arrays are simply sequences of keys,
and the keys themselves are ndim-1-arrays.
for reasons of backwards compatibility, it is probably wise to retain the flattening default,
but at least an axis keyword to toggle this behavior would be a welcome addition
"""
backwards_compatible = False
if backwards_compatible:
    axis_default = None
else:
    axis_default = 0
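
#illustrative sketch of the two semantics described above (my addition, not part of the original paste):
#with axis=None a (3, 2) array is treated as 6 scalar keys, as np.unique currently does;
#with axis=0 the same array is treated as 3 row-keys of length 2
#   k = np.array([[0, 1], [0, 1], [1, 0]])
#   unique(k, axis=None)   #-> array([0, 1]), the distinct scalars
#   unique(k, axis=0)      #-> the 2 distinct rows, [0, 1] and [1, 0]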



"""
some utility functions
"""

def as_struct_array(*cols):
    """pack a bunch of columns as a struct"""
    cols = [np.asarray(c) for c in cols]
    rows = len(cols[0])
    data = np.empty(rows, [('f'+str(i), c.dtype, c.shape[1:]) for i,c in enumerate(cols)])
    for i,c in enumerate(cols):
        data['f'+str(i)] = c
    return data
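
#rough illustration of as_struct_array (my addition, not part of the original paste):
#pack an int column and a float column into a single struct array, so each row acts as one key
#   s = as_struct_array(np.arange(3), np.ones(3))
#   s.dtype   #-> dtype([('f0', '<i8'), ('f1', '<f8')]); exact int width is platform dependent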

def axis_as_object(arr, axis=-1):
    """
    cast the given axis of an array to a void object
    if the axis to be cast is contiguous, a view is returned, otherwise a copy
    this is useful for efficiently sorting by the content of an axis, for instance
    """
    shape = arr.shape
    arr = np.ascontiguousarray(np.swapaxes(arr, axis, -1))
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * shape[axis]))).reshape(np.delete(shape, axis))
def object_as_axis(arr, dtype, axis=-1):
    """cast an array of void objects to a typed axis"""
    return np.swapaxes(arr.view(dtype).reshape(arr.shape+(-1,)), axis, -1)


def array_as_object(arr):
    """view everything but the first axis as a void object"""
    arr = arr.reshape(len(arr),-1)
    return axis_as_object(arr)
def array_as_typed(arr, dtype, shape):
    """unwrap a void object to its original type and shape"""
    return object_as_axis(arr, dtype).reshape(shape)
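
#rough round-trip sketch of these helpers (my addition, illustration only):
#   a = np.arange(6).reshape(2, 3)
#   v = axis_as_object(a)            #shape (2,), one void element per row
#   b = object_as_axis(v, a.dtype)   #shape (2, 3) again; b equals a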

##def array_as_struct(arr):
##    return np.ascontiguousarray(arr).view([('f0', arr.dtype, arr.shape[1:])])#.flatten()
##def struct_as_array(arr):
##    return arr.view(arr['f0'].dtype)





"""
class hierarchy for indexing a set of keys
the class hierarchy allows for code reuse, while providing specializations for different
types of key objects, yielding maximum flexibility and performance
"""

"""
A note on naming: 'Index' here refers to the fact that the goal of these classes is to
perform and store precomputations on a set of keys,
such as to accelerate subsequent operations involving these keys.
They are not 'logical' indexes as in pandas;
they are not permanently associated with any other data objects.

Note that these classes are not primarily intended to be used directly from the numpy namespace,
but rather are intended for code reuse within a family of higher level operations;
only the latter need to be part of the numpy API.

That said, these classes can also be very useful
in those places where the standard operations do not quite cover your needs,
saving you from completely reinventing the wheel.
"""


class BaseIndex(object):
    """
    minimal indexing functionality
    only provides unique and counts, but with optimal performance
    no grouping or lex-keys are supported,
    nor anything else that would require an indirect sort
    """
    def __init__(self, keys):
        """
        keys is a flat array of possibly composite type
        """
        self.keys   = np.asarray(keys).flatten()
        self.sorted = np.sort(self.keys)
        #the slicing points of the bins to reduce over
        self.flag   = self.sorted[:-1] != self.sorted[1:]
        self.slices = np.concatenate((
            [0],
            np.flatnonzero(self.flag)+1,
            [self.size]))

    @property
    def start(self):
        """start index of all bins"""
        return self.slices[:-1]
    @property
    def stop(self):
        """stop index of all bins"""
        return self.slices[1:]
    @property
    def groups(self):
        return len(self.start)
    @property
    def size(self):
        return self.keys.size


    @property
    def unique(self):
        """the first entry of each bin is a unique key"""
        return self.sorted[self.start]
    @property
    def uniques(self):
        """number of unique keys"""
        return len(self.slices)-1

    @property
    def count(self):
        """number of times each key occurs"""
        return np.diff(self.slices)
    @property
    def uniform(self):
        """returns true if each key occurs an equal number of times"""
        return not np.any(np.diff(self.count))


class Index(BaseIndex):
    """
    index object over a set of keys
    adds support for more extensive functionality, notably grouping
    """
    def __init__(self, keys):
        """
        keys is a flat array of possibly composite type
        """
        self.keys   = np.asarray(keys)
        #find indices which sort the keys
        self.sorter = np.argsort(self.keys)
        #compute sorted keys
        self.sorted = self.keys[self.sorter]
        #the slicing points of the bins to reduce over
        self.flag   = self.sorted[:-1] != self.sorted[1:]
        self.slices = np.concatenate((
            [0],
            np.flatnonzero(self.flag)+1,
            [self.size]))

    @property
    def index(self):
        """I've never run into any use cases for this;
        perhaps it was intended to be used to do group_by(keys).first(values)?
        in any case, included for backwards compatibility with np.unique"""
        return self.sorter[self.start]
    @property
    def rank(self):
        """how high in the sorted list each key is"""
        return self.sorter.argsort()
    @property
    def sorted_group_rank_per_key(self):
        """find a better name for this? enumeration of sorted keys; also used in the median implementation"""
        return np.cumsum(np.concatenate(([False], self.flag)))
    @property
    def inverse(self):
        """return index array that maps unique values back to original space"""
        return self.sorted_group_rank_per_key[self.rank]


##    def where(self, other):
##        """
##        determine at which indices a second set equals a first
##        """
##        return np.searchsorted(self.unique, other)



class ObjectIndex(Index):
    """
    the given axis enumerates the keys;
    all other axes form the keys
    groups will be formed on the basis of bitwise equality

    should we retire ObjectIndex?
    this can be integrated with the regular Index a la lexsort, no?
    not sure what is more readable though
    """
    def __init__(self, keys, axis):
        self.axis = axis
        self.dtype = keys.dtype

        keys = np.swapaxes(keys, axis, 0)
        self.shape = keys.shape
        keys = array_as_object(keys)

        super(ObjectIndex, self).__init__(keys)

    @property
    def unique(self):
        """the first entry of each bin is a unique key"""
        sorted = array_as_typed(self.sorted, self.dtype, self.shape)
        return np.swapaxes(sorted[self.start], self.axis, 0)


class LexIndex(Index):
    """
    index object based on the lexicographic ordering of a tuple of key-arrays
    key arrays can be of any type, including multi-dimensional, structured or void objects
    however, passing such fancy keys to LexIndex may not be ideal from a performance perspective,
    as lexsort does not accept them as arguments directly, so we have to index and invert them first

    should you find yourself with such complex keys, it may be more efficient
    to place them into a structured array first

    note that multidimensional columns are indexed by their first column,
    and no per-column axis keyword is supplied,
    so customization of column layout will have to be done at the call site

    """
    def __init__(self, keys):
        self.keys   = tuple(np.asarray(key) for key in keys)
        self.dtypes = tuple(key.dtype for key in self.keys)
        self.shapes = tuple(key.shape for key in self.keys)

        keyviews   = tuple(array_as_object(key) if key.ndim>1 else key for key in self.keys)
        #find indices which sort the keys; complex keys which lexsort does not accept are bootstrapped from Index
        self.sorter = np.lexsort(tuple(Index(key).inverse if key.dtype.kind == 'V' else key for key in keyviews))
        #compute sorted keys
        self.sorted = tuple(key[self.sorter] for key in keyviews)
        #the slicing points of the bins to reduce over
        self.flag   = reduce(
            np.logical_or,
            (s[:-1] != s[1:] for s in self.sorted))
        self.slices = np.concatenate((
            [0],
            np.flatnonzero(self.flag)+1,
            [self.size]))

    @property
    def unique(self):
        """returns a tuple of unique key columns"""
        return tuple(
            (array_as_typed(s, dtype, shape) if len(shape)>1 else s)[self.start]
                for s, dtype, shape in zip(self.sorted, self.dtypes, self.shapes))

    @property
    def size(self):
        return self.sorter.size

class LexIndexSimple(Index):
    """
    simplified LexIndex, which only accepts 1-d arrays of simple dtypes
    the more expressive LexIndex only has some logic overhead,
    in case all columns are indeed simple. not sure this is worth the extra code
    """
    def __init__(self, keys):
        self.keys   = tuple(np.asarray(key) for key in keys)
        self.sorter = np.lexsort(self.keys)
        #computed sorted keys
        self.sorted = tuple(key[self.sorter] for key in self.keys)
        #the slicing points of the bins to reduce over
        self.flag   = reduce(
            np.logical_or,
            (s[:-1] != s[1:] for s in self.sorted))
        self.slices = np.concatenate((
            [0],
            np.flatnonzero(self.flag)+1,
            [self.size]))

    @property
    def unique(self):
        """the first entry of each bin is a unique key"""
        return tuple(s[self.start] for s in self.sorted)

    @property
    def size(self):
        return self.sorter.size



def as_index(keys, axis = axis_default, base=False):
    """
    casting rules for a keys object to an index

    the preferred semantics is that keys is a sequence of key objects,
    except when keys is an instance of tuple,
    in which case the zipped elements of the tuple are the key objects

    the axis keyword specifies the axis which enumerates the keys
    if axis is None, the keys array is flattened
    if axis is 0, the first axis enumerates the keys
    which of these two is the default depends on whether backwards_compatible == True

    if base==True, the most basic index possible is constructed.
    this avoids an indirect sort; if it isn't required, this has better performance
    """
    if isinstance(keys, Index):
        return keys
    if isinstance(keys, tuple):
        return LexIndex(keys)
    try:
        keys = np.asarray(keys)
    except:
        raise TypeError('Given object does not form a valid set of keys')
    if axis is None:
        keys = keys.flatten()
    if keys.ndim==1:
        if base:
            return BaseIndex(keys)
        else:
            return Index(keys)
    else:
        return ObjectIndex(keys, axis)
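
#rough illustration of the dispatch above (my addition, not part of the original paste):
#   as_index([1, 1, 2])                      #-> Index       (1d keys, indirect sort)
#   as_index([1, 1, 2], base=True)           #-> BaseIndex   (no indirect sort)
#   as_index(np.zeros((5, 3)))               #-> ObjectIndex (rows act as keys, with the default axis=0)
#   as_index((np.arange(5), np.arange(5)))   #-> LexIndex    (tuple of key columns)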





"""
public API starts here
"""


class GroupBy(object):
    """
    supports ufunc reduction
    should any other form of reduction be supported?
    not sure it should; more cleanly written externally I think, on grouped iterables
    """
    def __init__(self, keys, axis = 0):
        #we could inherit from Index, but the multiple specializations make
        #holding a reference to an index object more appropriate
        #note that we don't have backwards compatibility issues with groupby,
        #so we are going to use axis = 0 as a default
        #the multi-dimensional structure of a keys object is usually meaningful,
        #and value arguments are not automatically flattened either

        self.index = as_index(keys, axis)

    #forward interesting/'public' index properties
    @property
    def unique(self):
        return self.index.unique
    @property
    def count(self):
        return self.index.count
    @property
    def inverse(self):
        return self.index.inverse
    @property
    def rank(self):
        return self.index.rank


    #some different methods of chopping up a set of values by key
    #not sure they are all equally relevant, but I actually have real world use cases for most of them

    def as_iterable_from_iterable(self, values):
        """
        grouping of an iterable. memory consumption depends on the amount of sorting required;
        worst case, if index.sorter[-1] = 0, we need to consume the entire value iterable
        before we can start yielding any output,
        but to the extent that the keys come presorted, the grouping is lazy
        """
        values = iter(enumerate(values))
        cache = dict()
        def get_value(ti):
            try:
                return cache.pop(ti)
            except:
                while True:
                    i, v = next(values)
                    if i==ti:
                        return v
                    cache[i] = v
        s = iter(self.index.sorter)
        for c in self.count:
            yield (get_value(i) for i in itertools.islice(s, c))

    def as_outoforder(self, values):
        """
        group values, without regard for the ordering of self.index.unique
        consume values as they come, and yield key-group pairs as soon as they complete
        this approach is lazy, insofar as grouped values are close together in their iterable
        """
        from collections import defaultdict
        cache = defaultdict(list)
        count = self.count
        unique = self.unique
        key = (lambda i: unique[i]) if isinstance(unique, np.ndarray) else (lambda i: tuple(c[i] for c in unique))
        for i,v in itertools.izip(self.inverse, values):
            cache[i].append(v)
            if len(cache[i]) == count[i]:
                yield key(i), cache.pop(i)

    def as_iterable_from_sequence(self, values):
        """
        this is the preferred method if values has random access,
        but we don't want it completely in memory;
        like a big memory mapped file, for instance
        """
        s = iter(self.index.sorter)
        for c in self.count:
            yield (values[i] for i in itertools.islice(s, c))

    def as_array(self, values):
        """
        return grouped values as an ndarray
        returns an array of shape [groups, groupsize, ungrouped-axes]
        this is only possible if index.uniform==True
        """
        assert(self.index.uniform)
        values = np.asarray(values)
        values = values[self.index.sorter]
        return values.reshape(self.index.groups, self.count[0], *values.shape[1:])

    def as_list(self, values):
        """return grouped values as a list of arrays, or a jagged-array"""
        values = np.asarray(values)
        values = values[self.index.sorter]
        return np.split(values, self.index.slices[1:-1], axis=0)

    def group(self, values):
        try:
            return self.as_array(values)
        except:
            return self.as_list(values)

    def __call__(self, values):
        """
        not sure how I feel about this. explicit is better than implicit,
        but group_by(keys).group(values) does not sit too well with me either
        """
        return self.unique, self.group(values)


    # ufunc based reduction methods

    def reduce(self, values, operator = np.add):
        """
        reduce the values over identical key groups, using the given ufunc operator
        reduction is over the first axis, which should have elements corresponding to the keys
        all other axes are treated independently for the sake of this reduction
        note that this code is only C-vectorized over the first axis;
        that is fine if this inner loop is significant, but not so much if it isn't
        if we have only a few keys, but values[0].size is substantial, a reduction via
        as_list may be preferable
        """
        values = values[self.index.sorter]
        if values.ndim>1:
            return np.apply_along_axis(
                lambda slc: operator.reduceat(slc, self.index.start),
                0, values)
        else:
            return operator.reduceat(values, self.index.start)
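
    #rough sketch of what reduce computes for 1d values (my addition, illustration only):
    #   keys = [0, 0, 1, 1, 1]; values = np.array([1., 2., 3., 4., 5.])
    #   GroupBy(keys).reduce(values)              #-> array([ 3., 12.])
    #   GroupBy(keys).reduce(values, np.maximum)  #-> array([ 2.,  5.])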

    def sum(self, values, axis=0):
        """compute the sum over each group"""
        values = np.asarray(values)
        if axis: values = np.rollaxis(values, axis)
        return self.unique, self.reduce(values)

    def mean(self, values, axis=0):
        """compute the mean over each group"""
        values = np.asarray(values)
        if axis: values = np.rollaxis(values, axis)
        count = self.count.reshape(-1,*(1,)*(values.ndim-1))
        return self.unique, self.reduce(values) / count

    def var(self, values, axis=0):
        """compute the variance over each group"""
        values = np.asarray(values)
        if axis: values = np.rollaxis(values, axis)
        count = self.count.reshape(-1,*(1,)*(values.ndim-1))
        mean = self.reduce(values) / count
        err = values - mean[self.inverse]
        return self.unique, self.reduce(err**2) / count

    def std(self, values, axis=0):
        """standard deviation over each group"""
        unique, var = self.var(values, axis)
        return unique, np.sqrt(var)

    def median(self, values, axis=0, average=True):
        """
        compute the median value over each group.
        when average is true, the average of the two central values is taken
        for groups with an even key-count
        """
        values = np.asarray(values)

        mid_2 = self.index.start + self.index.stop
        hi = (mid_2    ) // 2
        lo = (mid_2 - 1) // 2

        #need this indirection for lex-index compatibility
        sorted_group_rank_per_key = self.index.sorted_group_rank_per_key

        def median1d(slc):
            #place values at correct keys; preconditions the upcoming lexsort
            slc    = slc[self.index.sorter]
            #refine value sorting within each keygroup
            sorter = np.lexsort((slc, sorted_group_rank_per_key))
            slc    = slc[sorter]
            return (slc[lo]+slc[hi]) / 2 if average else slc[hi]

        values = np.asarray(values)
        if axis: values = np.rollaxis(values, axis)
        if values.ndim>1:   #is trying to skip apply_along_axis somewhat premature optimization?
            values = np.apply_along_axis(median1d, 0, values)
        else:
            values = median1d(values)
        return self.unique, values
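
    #e.g. (my addition, illustration only): keys = [0, 0, 1, 1, 1], values = [1., 9., 2., 3., 4.]
    #   GroupBy(keys).median(values)[1]   #-> array([ 5.,  3.]); 5. is the average of the two central values 1. and 9.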

    def min(self, values, axis=0):
        """return the minimum within each group"""
        values = np.asarray(values)
        if axis: values = np.rollaxis(values, axis)
        return self.unique, self.reduce(values, np.minimum)

    def max(self, values, axis=0):
        """return the maximum within each group"""
        values = np.asarray(values)
        if axis: values = np.rollaxis(values, axis)
        return self.unique, self.reduce(values, np.maximum)

    def first(self, values, axis=0):
        """return values at the first occurrence of their associated key"""
        values = np.asarray(values)
        if axis: values = np.rollaxis(values, axis)
        return self.unique, values[self.index.sorter[self.index.start]]

    def last(self, values, axis=0):
        """return values at the last occurrence of their associated key"""
        values = np.asarray(values)
        if axis: values = np.rollaxis(values, axis)
        return self.unique, values[self.index.sorter[self.index.stop-1]]



    #implement iter interface? could simply do zip(group_by(keys)(values)), no?

#just an alias, for those who don't like camelcase
group_by = GroupBy
#could also turn this into a function with optional values and reduction func;
#note that this definition replaces the alias above
def group_by(keys, values=None, reduction=None, axis=0):
    g = GroupBy(keys, axis)
    if values is None:
        return g
    groups = g.group(values)
    if reduction is None:
        return groups
    return [reduction(group) for group in groups]
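
#rough usage sketch of the three call forms (my addition, illustration only):
#   g = group_by(keys)                       #a GroupBy object; g.mean(values), g.sum(values), ...
#   groups = group_by(keys, values)          #grouped values, as an array or a list of arrays
#   sums   = group_by(keys, values, np.sum)  #one reduced value per group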



"""
some demonstration of the duplication between this code and np.unique
they share a lot of common functionality
we can't quite bootstrap grouping from unique as is,
but unique can easily be reimplemented using the Index class
"""

def unique(keys, return_index = False, return_inverse = False, return_count = False, axis = axis_default):
    """
    backwards compatible interface with numpy.unique
    in the long term I think the kwargs should be deprecated though;
    cleaner to call an index and its properties directly,
    should you want something beyond the interface provided
    """
    index = as_index(keys, axis, base = not (return_index or return_inverse))

    ret = index.unique,
    if return_index:
        ret = ret + (index.index,)
    if return_inverse:
        ret = ret + (index.inverse,)
    if return_count:
        ret = ret + (index.count,)
    return ret[0] if len(ret) == 1 else ret
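
#quick sketch of the backwards compatible kwargs (my addition, illustration only):
#   unique([1, 1, 2], return_inverse=True, return_count=True)
#   #-> (array([1, 2]), array([0, 0, 1]), array([2, 1]))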

def count(keys, axis = axis_default):
    """numpy work-alike of collections.Counter"""
    index = as_index(keys, axis, base = True)
    return index.unique, index.count

def multiplicity(keys, axis = axis_default):
    """
    return the multiplicity of each key, or how often it occurs in the set
    note that we are not interested in the unique values for this operation,
    casting doubt on the current numpy design which places np.unique
    at the center of the interface for this kind of operation
    given how often I use multiplicity, I'd like to have it in the numpy namespace
    """
    index = as_index(keys, axis)
    return index.count[index.inverse]
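
#small illustration of multiplicity versus count (my addition, not part of the original paste):
#   multiplicity([1, 1, 2, 3, 3, 3])   #-> array([2, 2, 1, 3, 3, 3]), one entry per key
#   count([1, 1, 2, 3, 3, 3])          #-> (array([1, 2, 3]), array([2, 1, 3])), one entry per unique key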

def rank(keys, axis = axis_default):
    """
    where each item is in the pecking order.
    not sure this should be part of the public api; can't think of any use-case right away
    plus, we have a namespace conflict
    """
    index = as_index(keys, axis)
    return index.rank


def incidence(boundary):
    """
    given an Nxm matrix containing boundary info between simplices,
    compute an incidence info matrix
    not to be part of the numpy API, to be sure; just something I'm playing with
    """
    return group_by(boundary).group(np.arange(boundary.size) // boundary.shape[1])




def test_basic():

    keys   = ["e", "b", "b", "c", "d", "e", "e", 'a']
    values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01, 1]

##    keys   = np.array(["a", "b", "b", "c", "d", "e", "e",'d','a','c'])
##    values = np.array([1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01, 1,2,3])

    print 'two methods of computing non-reducing group'
    print 'as iterable'
    g = group_by(keys)
    for k,v in zip(g.unique, g.as_iterable_from_sequence(values)):
        print k, list(v)
    print 'as list'
    for k,v in zip(*group_by(keys)(values)):
        print k, v

    print 'some reducing group operations'
    g = group_by(keys)
    unique_keys, reduced_values = g.median(values)
    print 'per group median'
    print reduced_values
    unique_keys, reduced_values = g.mean(values)
    print 'per group mean'
    print reduced_values
    unique_keys, reduced_values = g.std(values)
    print 'per group std'
    print reduced_values
    reduced_values = g.reduce(np.array(values), np.minimum) #alternate way of calling
    print 'per group min'
    print reduced_values
    unique_keys, reduced_values = g.max(values)
    print 'per group max'
    print reduced_values


##test_basic()
##quit()


def test_lex_median():
    """
    for making sure I squashed all bugs related to fancy-keys and the median filter implementation
    """
    keys1  = ["e", "b", "b", "c", "d", "e", "e", 'a']
    keys2  = ["b", "b", "b", "d", "e", "e", 'e', 'e']
##    keys3 = np.random.randint(0,2,(8,2))
    values = [1.2, 4.5, 4.3, 2.0, 5.6, 8.8, 9.1, 1]

    unique, median = group_by((keys1, keys2)).median(values)
    for i in zip(zip(*unique), median):
        print i

##test_lex_median()
##quit()


def test_fancy_keys():
    """
    test Index subclasses
    """
    keys        = np.random.randint(0, 2, (20,3)).astype(np.int8)
    values      = np.random.randint(-1,2,(20,4))


    #all these various datastructures should produce the same behavior
    #multiplicity is a nice unit test, since it draws on most of the low level functionality
    if backwards_compatible:
        assert(np.all(
            multiplicity(keys, axis=0) ==           #void object indexing
            multiplicity(tuple(keys.T))))           #lexicographic indexing
        assert(np.all(
            multiplicity(keys, axis=0) ==           #void object indexing
            multiplicity(as_struct_array(keys))))   #struct array indexing
    else:
        assert(np.all(
            multiplicity(keys) ==                   #void object indexing
            multiplicity(tuple(keys.T))))           #lexicographic indexing
        assert(np.all(
            multiplicity(keys) ==                   #void object indexing
            multiplicity(as_struct_array(keys))))   #struct array indexing

    #let's go mixing some dtypes!
    floatkeys   = np.zeros(len(keys))
    floatkeys[0] = 8.8
    print 'sum per group of identical rows using struct key'
    g = group_by(as_struct_array(keys, floatkeys))
    for e in zip(g.count, *g.sum(values)):
        print e
    print 'sum per group of identical rows using lex of nd-key'
    g = group_by(( keys, floatkeys))
    for e in zip(zip(*g.unique), g.sum(values)[1]):
        print e
    print 'sum per group of identical rows using lex of struct key'
    g = group_by((as_struct_array( keys), floatkeys))
    for e in zip(zip(*g.unique), g.sum(values)[1]):
        print e

    #showcase the enhanced unique functionality
    images = np.random.rand(4,4,4)
    #shuffle the images; this is a giant mess now; how to find the unique ones?
    shuffled = images[np.random.randint(0,4,200)]
    #there you go
    if backwards_compatible:
        print unique(shuffled, axis=0)
    else:
        print unique(shuffled)


##g = test_fancy_keys()
##quit()


def test_radial():
    x = np.linspace(-2,2, 64)
    y = x[:, None]
    x = x[None, :]
    R = np.sqrt( x**2+y**2)

    def airy(r, sigma):
        from scipy.special import j1
        r = r / sigma * np.sqrt(2)
        a = (2*j1(r)/r)**2
        a[r==0] = 1
        return a
    def gauss(r, sigma):
        return np.exp(-(r/sigma)**2)

    distribution = np.random.choice([gauss, airy])(R, 0.3)
    sample = np.random.poisson(distribution*200+10).astype(np.float)

    import matplotlib.pyplot as pp
    #is this an airy or gaussian function? hard to tell with all this noise!
    pp.imshow(sample, interpolation='nearest', cmap='gray')
    pp.show()
    #radial reduction to the rescue!
    #if we are sampling an airy function, you will see a small but significant rise around x=1
    g = group_by(np.round(R, 5).flatten())
    pp.errorbar(
        g.unique,
        g.mean(sample.flatten())[1],
        g.std (sample.flatten())[1] / np.sqrt(g.count))
    pp.xlim(0,2)
    pp.show()

##test_radial()
##quit()

def test_meshing():
    """
    meshing example
    demonstrates the use of multiplicity, and group.median
    """
    #set up some random points, and get their delaunay triangulation
    points = np.random.random((20000,2))*2-1
    points = points[np.linalg.norm(points,axis=1) < 1]
    from scipy.spatial.qhull import Delaunay
    d = Delaunay(points)
    tris = d.simplices

    #the operations provided in this module allow us to express potentially complex
    #computational geometry questions elegantly in numpy
    #Delaunay.neighbors could be used as well,
    #but the point is to express this in pure numpy, without additional library functionality
    edges = tris[:,[[0,1],[1,2],[2,0]]].reshape(-1,2)
    sorted_edges = np.where(edges[:,0:1]<edges[:,1:2], edges, edges[:,::-1])
    #we test to see how often each edge occurs, or how many incident simplices it has
    #this is a very general method of finding the boundary of any topology
    #and we can do so here with only one simple and readable command, multiplicity == 1
    if backwards_compatible:
        boundary_edges = edges[multiplicity(sorted_edges, axis=0)==1]
    else:
        boundary_edges = edges[multiplicity(sorted_edges)==1]
    boundary_points = unique(boundary_edges)

    if False:
        print boundary_edges
        print incidence(boundary_edges)


    #create some random values on faces
    #we want to smooth them over the mesh to create a nice hilly landscape
    face_values   = np.random.normal(size=d.nsimplex)
    #add some salt and pepper noise, to make our problem more interesting
    face_values[np.random.randint(d.nsimplex, size=10)] += 1000

    #start with a median step, to remove the salt-and-pepper noise
    #toggle to mean to see the effect of the median filter
    g = group_by(tris.flatten())
    prestep = g.median if True else g.mean
    vertex_values = prestep(np.repeat(face_values, 3))[1]
    vertex_values[boundary_points] = 0

    #actually, we can compute the mean without grouping
    tris_per_vert = g.count
    def scatter(x):
        r = np.zeros(d.npoints, x.dtype)
        for idx in tris.T: np.add.at(r, idx, x)
        return r / tris_per_vert
    def gather(x):
        return x[tris].mean(axis=1)

    #iterate a little
    for i in range(100):
        face_values   = gather(vertex_values)
        vertex_values = scatter(face_values)
        vertex_values[boundary_points] = 0


    #display our nicely rolling hills and their boundary
    import matplotlib.pyplot as plt
    x, y = points.T
    plt.tripcolor(x,y, triangles = tris, facecolors = face_values)
    plt.scatter(x[boundary_points], y[boundary_points])
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.axis('equal')
    plt.show()

d = test_meshing()