Advertisement
Not a member of Pastebin yet?
Sign Up,
it unlocks many cool features!
- ## Problem Description
- We have introns, and clusters of introns.
- The challenge is: given a single intron, return all clusters where that intron is a member.
- Data is given in the following format:
- |cluster_id|intron_list|
- |-|-|
- |1|"Intron30, Intron54, Intron 55"|
- |2|"Intron45, Intron46"|
- |3|"Intron24, Intron30"|
- |4|"Intron96, Intron199, Intron87, Intron88"|
- |...|...|
- There are ~600k total clusters, comprised of 2.1 million unique introns.
- Cluster size ranges from 2 through 22, with around 50% of all clusters having 4 or fewer introns.
- ### Sidenotes
- - Users may search for introns that do not exist in any cluster. If this happens, return no results.
- - After the clusters are returned, a script combines the contents of each cluster to get just the unique values.
- ## Suggested Implementation
- Goal is to enable `specific intron -> set of groups`.
- Create a lookup table to store that information directly.
- ### Pseudocode
- ```
- lookup_table := hashmap linking (all 2.1 million relevant introns -> empty lists)
- for each cluster in the database {
- get the cluster id number, ie the primary key for that cluster
- for each intron in the cluster {
- find that intron's corresponding list in the lookup_table
- add the cluster id to that list
- }
- }
- ```
- ### Postgress implementation details
- Create a new table, with its primary key as "intron name" and add a column named "groups".
- Run the pseduocode above in your favorite programming language,
- then for each entry in the hashmap `lookup_table` add a row to the postgress lookup table.
- The data should have this format:
- |intron_id|groups|
- |-|-|
- |"Intron1"|"5"|
- |"Intron2"|"4,7,8"|
- |"Intron3"|"1,3,52"|
- It's important that the name of the intron is used as the "primary key";
- this lets Postgress search by name efficiently.
- To search: first query this new lookup table, split the result to find which clusters are needed, look up each cluster in the original table.
- ### Estimated Performance
- Processing time to build the lookup table: 30 seconds best case, 5 minutes worst case.
- Search time for a single lookup: very fast best case, 2 seconds worst case.
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement