Untitled

## Problem Description

We have introns, and clusters of introns.
The challenge is: given a single intron, return all clusters where that intron is a member.

Data is given in the following format:

|cluster_id|intron_list|
|-|-|
|1|"Intron30, Intron54, Intron 55"|
|2|"Intron45, Intron46"|
|3|"Intron24, Intron30"|
|4|"Intron96, Intron199, Intron87, Intron88"|
|...|...|

There are ~600k total clusters, comprised of 2.1 million unique introns.
Cluster size ranges from 2 through 22, with around 50% of all clusters having 4 or fewer introns.

### Sidenotes

 - Users may search for introns that do not exist in any cluster. If this happens, return no results.
 - After the clusters are returned, a script combines the contents of each cluster to get just the unique values.

## Suggested Implementation

Goal is to enable `specific intron -> set of groups`.

Create a lookup table to store that information directly.

### Pseudocode

```
lookup_table := hashmap linking (all 2.1 million relevant introns -> empty lists)

for each cluster in the database {
    get the cluster id number, ie the primary key for that cluster

    for each intron in the cluster {
        find that intron's corresponding list in the lookup_table
        add the cluster id to that list
    }
}
```

### Postgress implementation details

Create a new table, with its primary key as "intron name" and add a column named "groups".

Run the pseduocode above in your favorite programming language,
then for each entry in the hashmap `lookup_table` add a row to the postgress lookup table.
The data should have this format:

|intron_id|groups|
|-|-|
|"Intron1"|"5"|
|"Intron2"|"4,7,8"|
|"Intron3"|"1,3,52"|

It's important that the name of the intron is used as the "primary key";
this lets Postgress search by name efficiently.

To search: first query this new lookup table, split the result to find which clusters are needed, look up each cluster in the original table.

### Estimated Performance

Processing time to build the lookup table: 30 seconds best case, 5 minutes worst case.

Search time for a single lookup: very fast best case, 2 seconds worst case.