Untitled

Example code under test:

`CExpressionPreprocessor::PexprRemoveSuperfluousOuterRefs`
is a function which takes as input an expression representing a SQL query and returns an expression representing a SQL query which it has either processed or left alone.
The specific intent of the function is to remove references to columns which are not required.
If the input expression has certain characteristics, then a transformation is applied to it.
In this case, those characteristics are the presence of a subquery which has an ORDER BY or GROUP BY clause containing a reference to a column from a table that is from outside of the subquery. According to SQL semantics, the query with and without those columns is equivalent, and, by removing that reference, it makes it more likely that this expression will be eligible for later tranformations.

In order to test that this step is correctly transforming an expression according to the intended logic, I want to create a test input expression and compare either the entire output expression to an "expected" output or assert on various characteristics of the output expression.

Let's take an example input expression. The serialized representation of an example input expression is as follows:

```
+--CLogicalSelect
|--CLogicalGet "t" ("t"), Columns: ["a" (0), "b" (1), "ctid" (2), "xmin" (3), "cmin" (4), "xmax" (5), "cmax" (6), "tableoid" (7), "gp_segment_id" (8)] Key sets: {[2,8]}
+--CScalarSubqueryAny(=)["count" (18)]
   |--CLogicalGbAgg( Global ) Grp Cols: ["j" (10), "b" (1)][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
   |  |--CLogicalGet "s" ("s"), Columns: ["i" (9), "j" (10), "ctid" (11), "xmin" (12), "cmin" (13), "xmax" (14), "cmax" (15), "tableoid" (16), "gp_segment_id" (17)] Key sets: {[2,8]}
   |  +--CScalarProjectList
   |     +--CScalarProjectElement "count" (18)
   |        +--CScalarAggFunc (count , Distinct: false , Aggregate Stage: Global)
   |           +--CScalarIdent "i" (9)
   +--CScalarIdent "a" (0)
```

Each component of this expression tree is a node of type `CExpression`. There is a class, CExpression.h, which prescribes the characteristics for such an expression.
Each expression node has an array of child nodes saved in a member variable `m_pdrgpexpr`.

Each node also has a type expressed as its operator type. Valid operator types are logical, physical, scalar, and patterns. `COperator.h` describes a valid operator.
The operator type is saved in a member variable in the expression `m_pop`
An expression also has what are called derived relational properties--saved in `m_pdprel`. Some of these are displayed above--most significantly for this example, the columns -- e.g. `"a" (0)`

I would like to test that when I call `CExpressionPreprocessor::PexprRemoveSuperfluousOuterRefs` on an expression which has a `CScalarSubquery*` with a `CLogicalGbAgg` as a child having at least one outer reference that that outer reference is eliminated and the output expression has only columns from the table referenced in my subquery.
The serialized output expression that I would be expecting to come out of the input above is

```
+--CLogicalSelect
|--CLogicalGet "t" ("t"), Columns: ["a" (0), "b" (1), "ctid" (2), "xmin" (3), "cmin" (4), "xmax" (5), "cmax" (6), "tableoid" (7), "gp_segment_id" (8)] Key sets: {[2,8]}
+--CScalarSubqueryAny(=)["count" (18)]
   |--CLogicalGbAgg( Global ) Grp Cols: ["j" (10)][Global], Minimal Grp Cols: ["j" (10)], Generates Duplicates :[ 0 ]
   |  |--CLogicalGet "s" ("s"), Columns: ["i" (9), "j" (10), "ctid" (11), "xmin" (12), "cmin" (13), "xmax" (14), "cmax" (15), "tableoid" (16), "gp_segment_id" (17)] Key sets: {[2,8]}
   |  +--CScalarProjectList
   |     +--CScalarProjectElement "count" (18)
   |        +--CScalarAggFunc (count , Distinct: false , Aggregate Stage: Global)
   |           +--CScalarIdent "i" (9)
   +--CScalarIdent "a" (0)
```
In order to construct a real input for a test of this step, I need to make real nodes for every node in the expression. This means that, even though I only care about the specific details of one of the nodes, I have to write dozens of lines of code to construct the rest of the expression for it to be a valid input to this function.

This is cumbersome for the developer.

One strategy that we have used so far for testing is to use a SQL query as the test input and assert that there is no diff in the plan that is produced.
	This is problematic because dozens of transformations are performed on an expression as part of planning, so if there is a difference in the plan, it is not clear which transformation or processing step led to that diff.
	This is also problematic because the input may have to go through other transformations from SQL before arriving at this function. So, an actual test input would not go directly from being compiled from SQL into an expression to having this function called on it.
	Even if this function was called directly after parsing and compilation to an expression, I would have to devise a SQL query which produces the correct input, and, if the code which parses and compiles the SQL query into an expression changes, my test would fail.

So, the first question is how do we test this single processing step in isolation.

Almost every escalation involves tracking down the bug manually in the code. This usually means finding the exact processing step which did the wrong thing. We usually do this by setting breakpoints and inspecting the expression tree at various spots throughout the code. Because we have discrete processing steps with input and output, it seems like we could more easily test this than using this strategy.
This is data in/data out software, so testing it should be possible

Another strategy we have used is to create helper functions which make different specific nodes. For example, we have a helper function `CTestUtils::PexprLogicalGet` which creates a `CLogicalGet` similar to the one in the above serialized expression. This is helpful, however, it requires you to construct several inputs--for example, a real `CTableDescriptor` which is a representation of a table and requires a cascade of other inputs to be created in order to make it. The helper functions have several problems:
1. they still require the developer to make and specify many real inputs which is time consuming and painful
2. when there is a particular part of the input that I would like to make real or to specify, helper functions that make default values often don't have a good way for me to inject the details I do care about
3. they still require a large and elaborate test setup to take care of add ref'ing and memory management
4. It is nearly impossible to create inputs to test processing steps which take bigger and more elaborate input expressions
I find that I end up creating a new helper function every time I want to write a test for a processing step to create the input, which makes it less of a helper function and more of a code obfuscation.
I, as a developer, want the helper function making my test input to take care of making all of the parts of the test input which I don't care about and are not under test--or, more specifically, I want to be able to test that this production code works as intended while only having to create the objects under test.

This felt like it would be possible to solve with mocking. I have done a POC for mocking `CExpression` and will describe how it went.

Caveats about the prototype:
- I didn't use a mocking framework, as I wanted to drive out the basic concepts, so the code could not be used in production, as it required me to make many of the functions in CExpression virtual.
- I had my mocks inherit from the real classes instead of re-parenting the real classes to avoid a huge change to the code in my prototype
- I ended up using several hacks to work around current specifics of the code base which can be changed, should we decide to go this route (for example, I made methods like the actual step under test `PexprRemoveSuperfluousOuterRefs` public)
- Also, my mock's functions which have different behavior than the real functions only work for my very specific use case and would have to be generalized
- I used some of the current test setup which obfuscates the details around memory management
- I didn't refactor any code -- including where I added my test --
I think that many of the above points are very important to address in the future, but my current question is even more basic: basically, with this mocking method, 1) am I even putting the code under test correctly and 2) will this make it easier to write valid test cases than either a) current state helper functions or b) other testing methods

I wrote a test which involved the following setup:
- Standard memory management and metadata creation setup
Real inputs components created:
`CLogicalGbAgg`
`CColRefSet`
Mocks created:
`CExpressionMock`
`CDrvdPropRelationalMock`

I make an instance of my mock expression and call `PexprRemoveSuperfluousOuterRefs` on it. Then I assert that the length of the column references in the output of calling that function is as expected (from 2 to 1).

The actual mock ends up working like this in the production code:
```
COperator::EOperatorId eopid = pop->Eopid();
```
Gets the operator id of my mock expression which is the id of the real `CLogicalGbAgg` that I made for my test input.
```
(pop->FLogical() && CUtils::FHasOuterRefs(pexpr));
```
Returns true because my real `CLogicalGbAgg` is a logical operator and my mock expression returns a real `CColRefSet` when `PcrsOuter()` is called on it.

That means I fall into `if (fHasOuterRefs)` and the nested `else if (COperator::EopLogicalGbAgg == eopid)`
This next line exercises the mock expression and the mock derived relational properties
```
CColRefSet *pcrsOuter = Pdprel(pexpr->PdpDerive())->PcrsOuter(pmp);
```

`PdpDerive()` when called on my mock expression will create a mock `CDrvdPropRelational`.
These mock derived relational properties return a real `CColRefSet` when I call `PcrsOuter()` on them.
In the next line:
```
CLogicalGbAgg *popAgg = CLogicalGbAgg::PopConvert(pop);
```
my real `CLogicalGbAgg` is obtained.
Now, I call
```
DrgPcr *pdrgpcr = CUtils::PdrgpcrExcludeColumns(pmp, popAgg->Pdrgpcr(), pcrsOuter);
```
With my outer references and my logical group by agg operator.
`CUtils::PdrgpcrExcludeColumns` wil return a copy of the columns, excluding the column "b" which is from outside of my subquery.

As part of constructing my real `CLogicalGbAgg`, I will have set the minimal columns required (based on functional dependencies).
Next I use my fake `CExpression` in order to get a fake child which has a fake arity so that I can meet the condition that the project list is not empty.
Then I will access the minimal colrefset which I constructed my groupby with
```
DrgPcr *pdrgpcrMinimal = popAgg->PdrgpcrMinimal();
```
I'll exclude my outer referencing columns from it
```
pdrgpcrMinimal = CUtils::PdrgpcrExcludeColumns(pmp, pdrgpcrMinimal, pcrsOuter);
```
In order to make sure that I have columns that are not actually outer references in my list of group by columns.

Then I make a new `CLogicalGbAgg` which has the new set of column references with the superfluous outer references elimintated.
At the bottom of the function, before creating a new expression containing the new `CLogicalGbAgg`, we recursively process the children of my expression in order to remove their superfluous outer references.
```
	const ULONG ulArity = pexpr->UlArity();
	DrgPexpr *pdrgpexprChildren = GPOS_NEW(pmp) DrgPexpr(pmp);
	for (ULONG ul = 0; ul < ulArity; ul++)
	{
		CExpression *pexprChild = PexprRemoveSuperfluousOuterRefs(pmp, (*pexpr)[ul]);
		pdrgpexprChildren->Append(pexprChild);
	}
```
Because our test case only requires us to remove outer references from the parent expression, we want this code to be a no-op.
We do create our new output expression with the children we found while looping. So, this actually will work fine with our mock--as none of our children will meet the condition to be processed by this step.
Then we create our output expression.
```
return GPOS_NEW(pmp) CExpression(pmp, pop, pdrgpexprChildren);
```

So, my questions are, with my use of a mock to create this test
1. Am I actually exercising the code in a way that is meaningful?
2. Is this the best way to test this code? And, if not, what is?


Other strategies I've considered are:
- Using XML-style string matching on serialized versions of the code
	- Steps:
		- Parse into an expression an XML-style test input and expected output
		- Call the function on the parsed input
		- Serialize the output of calling the function
		- String compare the output and expected output
	- Challenges
		- The string input has to be formatted perfectly and is therefore brittle
		- Developers still have to "care about"/come up with all of the parts of the expression
		- Developers should write code to test a function that takes code objects as input and not edit XML strings
		- If the parsing or serialization code breaks, these tests would break
		- String matching doesn't allow us to assert on specific aspects of the test output, which makes it less effective as a way of documenting the code
		- Any code that changes any other part of the expression would cause the test to fail (which is most of the code in the code base)
- Writing a DSL to generate real input expressions which has sensible defaults for all nodes and attributes not specified by the developer
	- Steps
		- ?
	- Challenges
		- Sounds hard
	- Pros
		- Easy to make new tests
		- Expressions could be arbitrarily complicated much more easily
- Change the processing steps production code to make it easier to test without mocks
	- Steps
		- change individual processing steps to mock out function calls that would require more complicated inputs
		- unit-test individual functions called in the processing step `CUtils::PdrgpcrExcludeColumns`
		- then test the overall expression transformation only for "happy path" simple case
	- Challenges
		- hard to generalize
		- unclear mandate
		- inconsistent ability to apply
		- breaking up and de-centralizing these steps will make it extremely difficult to isolate the cause of bugs using breakpoints and a debugger, as we do now
		- Harder to test that the characteristics of more complicated expressions are actually identified
			- for example, identify a `CColRefSet` which includes an outer reference which is in a nested subquery is correctly processed
	- Pros
		- developers can use their judgment about how to test steps they add
- Mocks
	- Steps
	- Challenges
		- not sure how easy it will be to generate arbitrarily complicated expressions
		- not sure how easy it will be to generalize the mock methods of CExpression to work for different types of expressions--maybe we should mock the operators and not the expression -- that is a lot of mocks
		- recursion is a challenge -- all of the processing steps recurse on the children. if we are using mock objects to make sure that we only process the parts of the expression that we feel like spec'ing out, then we run into the problem where recursion will force us to have to construct the whole expression. How do we get recursion to work with mocking?
		- TODO: explain this better: the arbitrary separation betweeen COperator and CExpression also is a source of problems. If we mock CExpression and then proceed to define a mock UlArity() function, for example, even if we know that for a particular type of operator, there is an expected arity and we should be able to return a const value in our mock code, we cannot do that because UlArity() will not be called polymorphically -- it is called on a class which is not a parent of the particular operator type. This means that if we want UlArity to return a different value for different types of operators, we would have to add a large switch-case to the UlArity() mock method which switches on the operator type. This is in contrast to being able to do it in the specific operator's mock. If we combined COperator and CExpression, this may be possible
	- Pros
		- test the code directly
		- it is clear what you care about


What ideas do people have about the above or about completely different strategies?

One important point is that these processing steps are one kind of transformation we do. We also implement a large number of other transformations that do not take a simple expression as input but instead take a memo-ized version of many expressions organized into groups. This code is particularly challenging to test because these different expressions which are memo-ized represent many expressions which do not get picked to be the final plan. This means you cannot test that they produce the right alternatives with SQL. Inherently, many of the test cases you would choose involve the production of expressions which would be eliminated by later steps.
The goal in giving this less complicated example is to come up with a testing strategy that could be extended to these steps.

More caveats:
- We do have tests for various processing steps which take the time to create real inputs and compare them to outputs--it's just time-consuming to create new ones
- We have a robust framework for integration testing query planning which uses a special format called "minidump" to make the entire query planning process data in/data out and allows us to detect any difference in query plan produced for a given query. This means we already have parsing and serialization defined for all of the nodes we are discussing.