Untitled

## Query

DDL:
```sql
CREATE TABLE foo (a int, b int, c int) DISTRIBUTED BY (a);
```

Query:
```sql
SELECT aggfn(DISTINCT b) FROM foo;
```


## Splitting a DQA

### A somehow noisy output:

```
Xform: CXformSplitDQA
Input:
+--CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]    origin: [Grp:5, GrpExpr:0]
   |--CLogicalGet "foo" ("foo"), Columns: ["a" (0), "b" (1), "c" (2), "ctid" (3), "xmin" (4), "cmin" (5), "xmax" (6), "cmax" (7), "tableoid" (8), "gp_segment_id" (9)] Key sets: {[3,9]}   origin: [Grp:0, GrpExpr:0]
   +--CScalarProjectList   origin: [Grp:4, GrpExpr:0]
      +--CScalarProjectElement "avg" (10)   origin: [Grp:3, GrpExpr:0]
         +--CScalarAggFunc (avg , Distinct: true , Aggregate Stage: Global)   origin: [Grp:2, GrpExpr:0]
            +--CScalarIdent "b" (1)   origin: [Grp:1, GrpExpr:0]
Output:
Alternatives:
0:
+--CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
   |--CLogicalGbAgg( Intermediate ) Grp Cols: ["b" (1)][Intermediate], Minimal Grp Cols: [], Distinct Cols:["b" (1)], Generates Duplicates :[ 0 ]
   |  |--CLogicalGbAgg( Local ) Grp Cols: ["b" (1)][Local], Minimal Grp Cols: [], Generates Duplicates :[ 1 ]
   |  |  |--CLogicalGet "foo" ("foo"), Columns: ["a" (0), "b" (1), "c" (2), "ctid" (3), "xmin" (4), "cmin" (5), "xmax" (6), "cmax" (7), "tableoid" (8), "gp_segment_id" (9)] Key sets: {[3,9]}   origin: [Grp:0, GrpExpr:0]
   |  |  +--CScalarProjectList
   |  +--CScalarProjectList
   +--CScalarProjectList
      +--CScalarProjectElement "avg" (10)
         +--CScalarAggFunc (avg , Distinct: false , Aggregate Stage: Global)
            +--CScalarIdent "b" (1)
1:
+--CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
   |--CLogicalGbAgg( Local ) Grp Cols: [][Local], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
   |  |--CLogicalGet "foo" ("foo"), Columns: ["a" (0), "b" (1), "c" (2), "ctid" (3), "xmin" (4), "cmin" (5), "xmax" (6), "cmax" (7), "tableoid" (8), "gp_segment_id" (9)] Key sets: {[3,9]}   origin: [Grp:0, GrpExpr:0]
   |  +--CScalarProjectList
   |     +--CScalarProjectElement "ColRef_0011" (11)
   |        +--CScalarAggFunc (avg , Distinct: true , Aggregate Stage: Local)
   |           +--CScalarIdent "b" (1)
   +--CScalarProjectList
      +--CScalarProjectElement "avg" (10)
         +--CScalarAggFunc (avg , Distinct: false , Aggregate Stage: Global)
            +--CScalarIdent "ColRef_0011" (11)
2:
+--CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
   |--CLogicalGbAgg( Local ) Grp Cols: ["b" (1)][Local], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
   |  |--CLogicalGet "foo" ("foo"), Columns: ["a" (0), "b" (1), "c" (2), "ctid" (3), "xmin" (4), "cmin" (5), "xmax" (6), "cmax" (7), "tableoid" (8), "gp_segment_id" (9)] Key sets: {[3,9]}   origin: [Grp:0, GrpExpr:0]
   |  +--CScalarProjectList
   +--CScalarProjectList
      +--CScalarProjectElement "avg" (10)
         +--CScalarAggFunc (avg , Distinct: false , Aggregate Stage: Global)
            +--CScalarIdent "b" (1)
```

### Let's parse the above
Given an input of the following (simplified) shape:

```
Agg
  Output: aggfn(DISTINCT b)
  -> RELATION
```

-- where `RELATION` is a wildcard, in this case, it is `LogicalGet "foo"` -- the transform `CXformSplitDQA` generates 3 logical alternatives:

1. Pre-agg deduplicate with global and local GropuBy's (a.k.a. 3-phase agg: Agg-GroupBy-GroupBy: aggregate function is not split)
1. A split-agg alternative (a.k.a. 2-stage agg: Agg-Agg)
1. Pre-agg deduplicate with local GroupBy (Agg-GroupBy: aggregate function is not split)

#### Pre-agg deduplicate with global and local GropuBy's
```
Agg 🐱
  Output: NORMAL aggfn(b)
  -> GroupBy (b) [global dedup] 🐶
       -> GroupBy (b) [local dedup] 🐷
          -> WHATEVER 🐴
```

(Possibly) final physical plan:
```
Aggregate 🐱
  Output: NORMAL aggfn(b)
  ->  Gather Motion 3:1  (slice2; segments: 3)
        Output: b
        ->  GroupAggregate 🐶
              Output: b
              Group By: b
              ->  Sort 🐶
                    Output: b
                    Sort Key: b
                    ->  Redistribute Motion 3:3  (slice1; segments: 3)
                          Output: b
                          Hash Key: b
                          ->  GroupAggregate 🐷
                                Output: b
                                Group By: b
                                ->  Sort 🐷
                                      Output: b
                                      Sort Key: b
                                      ->  Table Scan on foo 🐴
                                            Output: b
```

##### Exercise: what happens when "WHATEVER" is distributed on the `DISTINCT` column?

#### "Split-Agg" alternative:

```
Agg 🐱
  Output: FINAL aggfn(b)
  -> Agg 🐶
       Output: PARTIAL aggfn(DISTINCT b)
       -> WHATEVER 🐴
```

Possibly final physical plan:
```
Aggregate 🐱
  Output: FINAL aggfn(b)
  ->  Gather Motion 3:1  (slice2; segments: 3)
        Output: (PARTIAL aggfn(DISTINCT b))
        ->  Aggregate 🐶
              Output: PARTIAL aggfn(DISTINCT b)
              ->  Redistribute Motion 3:3  (slice1; segments: 3)
                    Output: b
                    Hash Key: b
                    ->  Table Scan on foo 🐴
                          Output: b
```

##### Exercise: what happens when "WHATEVER" is distributed on the `DISTINCT` column?
Then you won't need the bottom motion:

```
Aggregate 🐱
  Output: FINAL aggfn(a)
  ->  Gather Motion 3:1  (slice2; segments: 3)
        Output: (PARTIAL aggfn(DISTINCT a))
        ->  Aggregate 🐶
              Output: PARTIAL aggfn(DISTINCT a)
              ->  Table Scan on foo 🐴
                    Output: a
```

#### Pre-agg deduplicate with local GroupBy
```
Agg 🐱
  Output: NORMAL aggfn(b)
  -> GroupBy (b) [global dedup] 🐶
     -> WHATEVER 🐴
```

(Possibly) final physical plan

```
Aggregate 🐱
  Output: NORMAL aggfn(b)
  ->  Gather Motion 3:1  (slice2; segments: 3)
        Output: b
        ->  GroupAggregate 🐶
              Output: b
              Group By: b
              ->  Sort 🐶
                    Output: b
                    Sort Key: b
                    ->  Redistribute Motion 3:3  (slice1; segments: 3)
                          Output: b
                          Hash Key: b
                          ->  Table Scan on foo 🐴
                                Output: b
```

##### Exercise: what happens when "WHATEVER" is distributed on the `DISTINCT` column?
Then you won't need the bottom motion:

```
Aggregate 🐱
  Output: NORMAL aggfn(a)
  ->  Gather Motion 3:1  (slice2; segments: 3)
        Output: a
        ->  GroupAggregate 🐶
              Output: a
              Group By: a
              ->  Sort
                    Output: a
                    Sort Key: a
                    ->  Table Scan on foo 🐴
                          Output: a
```


## Exercise for readers:
1. What happens if the `aggfn` doesn't have a combine (`prefunc`) function?
1. What happens when we have a `GROUP BY` in the query? Say

    ```sql
    SELECT aggfn(DISTINCT b) FROM foo GROUP BY a;
    ```
    OR
    ```sql
    SELECT aggfn(DISTINCT b) FROM foo GROUP BY c;
    ```

1. What if there is more than one aggregate function in the projection? Say
    ```sql
    SELECT aggfn1(DISTINCT b), aggfn2(DISTINCT c) FROM foo GROUP BY a;
    ```

    OR
    ```sql
    SELECT aggfn1(DISTINCT b), aggfn2(DISTINCT a) FROM foo GROUP BY c;
    ```

1 What should happen when the project mixes `DISTINCT` aggregate with regular aggregates?


## Debugging tips:

1. Error loading bytea type
    ```
    2018-01-11 09:57:31:165576 PST,THD000,ERROR,"Lookup of object 0.17.1.0 in cache failed",
    ```

    Solution: turn off force 3-phase scalar dqa, or steal the `<dxl:Type>` metadata from other minidumps...