Apr 23rd, 2019
## Problem Description

We have introns, and clusters of introns.
The challenge is: given a single intron, return every cluster that contains it.

Data is given in the following format:

|cluster_id|intron_list|
|-|-|
|1|"Intron30, Intron54, Intron55"|
|2|"Intron45, Intron46"|
|3|"Intron24, Intron30"|
|4|"Intron96, Intron199, Intron87, Intron88"|
|...|...|

There are ~600k clusters in total, comprising 2.1 million unique introns.
Cluster size ranges from 2 to 22 introns, and around 50% of all clusters have 4 or fewer introns.

### Side Notes

- Users may search for introns that do not exist in any cluster. In that case, return no results.
- After the clusters are returned, a script combines the contents of all returned clusters and keeps only the unique introns.

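The combining step in the second note can be sketched as follows. This is a minimal illustration, not the actual script: the function name `unique_introns` is hypothetical, and it assumes each cluster arrives as the comma-separated `intron_list` string shown in the table above.

```python
def unique_introns(intron_lists):
    """Merge several comma-separated intron lists into one sorted list of unique names."""
    seen = set()
    for intron_list in intron_lists:
        # Assumed format: a string such as "Intron24, Intron30".
        for name in intron_list.split(","):
            name = name.strip()
            if name:
                seen.add(name)
    return sorted(seen)


# Clusters 1 and 3 from the sample table both contain Intron30.
print(unique_introns(["Intron30, Intron54, Intron55", "Intron24, Intron30"]))
```

An empty input (a search that matched no cluster) simply yields an empty list.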
## Suggested Implementation

The goal is to enable the lookup `specific intron -> set of groups`, where each group is a cluster id.

Create a lookup table that stores this mapping directly.

### Pseudocode

```
lookup_table := hashmap linking (all 2.1 million relevant introns -> empty lists)

for each cluster in the database {
    get the cluster id number, i.e. the primary key for that cluster

    for each intron in the cluster {
        find that intron's corresponding list in the lookup_table
        add the cluster id to that list
    }
}
```

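One possible Python rendering of the pseudocode is sketched below. The function name `build_lookup_table` and the `(cluster_id, intron_list)` input shape are assumptions for illustration. It also differs from the pseudocode in one detail: instead of pre-filling the hashmap with all 2.1 million introns, it uses a `defaultdict` that creates each empty list on first use.

```python
from collections import defaultdict


def build_lookup_table(clusters):
    """Map each intron name to the list of cluster ids that contain it.

    `clusters` is an iterable of (cluster_id, intron_list) pairs, where
    intron_list is a comma-separated string (assumed input format).
    """
    lookup_table = defaultdict(list)
    for cluster_id, intron_list in clusters:
        # A set guards against an intron being listed twice in one cluster.
        for intron in {name.strip() for name in intron_list.split(",")}:
            if intron:
                lookup_table[intron].append(cluster_id)
    return dict(lookup_table)


sample_clusters = [
    (1, "Intron30, Intron54, Intron55"),
    (2, "Intron45, Intron46"),
    (3, "Intron24, Intron30"),
]
lookup_table = build_lookup_table(sample_clusters)
print(lookup_table["Intron30"])            # [1, 3]
print(lookup_table.get("Intron999", []))   # [] because the intron is in no cluster
```

Each cluster is visited once, so the work grows with the total number of intron entries across all clusters.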
### Postgres Implementation Details

Create a new table whose primary key is the intron name (the `intron_id` column below), and add a column named `groups`.

Run the pseudocode above in your favorite programming language,
then for each entry in the hashmap `lookup_table` add a row to the Postgres lookup table.
The data should have this format:

|intron_id|groups|
|-|-|
|"Intron1"|"5"|
|"Intron2"|"4,7,8"|
|"Intron3"|"1,3,52"|

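Turning the hashmap into rows of this shape could look like the sketch below. The helper name `lookup_rows` is hypothetical, and the sample dictionary simply mirrors the table above. The resulting tuples would then be handed to whatever bulk-insert mechanism your database driver provides, which is not shown here.

```python
def lookup_rows(lookup_table):
    """Yield (intron_id, groups) rows, with groups as a comma-separated string."""
    for intron_id in sorted(lookup_table):
        cluster_ids = sorted(set(lookup_table[intron_id]))
        yield intron_id, ",".join(str(cluster_id) for cluster_id in cluster_ids)


# Hypothetical hashmap contents matching the table above.
sample = {"Intron2": [4, 7, 8], "Intron1": [5], "Intron3": [1, 3, 52]}
for row in lookup_rows(sample):
    print(row)
```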
It's important that the name of the intron is the primary key:
Postgres automatically indexes the primary key, which lets it search by name efficiently.

To search, first query this new lookup table for the intron, then split the `groups` value on commas to find which clusters are needed, and finally look up each of those clusters in the original table.

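The search steps can be sketched end to end as below. To keep the example runnable without a server, it uses Python's built-in `sqlite3` module as a stand-in for Postgres. The table names `clusters` and `intron_lookup` and the function `find_clusters` are assumptions, and a Postgres driver would typically use a different placeholder style than `?`.

```python
import sqlite3


def find_clusters(conn, intron):
    """Return {cluster_id: [intron names]} for every cluster containing `intron`."""
    # Step 1: query the lookup table by its primary key.
    row = conn.execute(
        'SELECT "groups" FROM intron_lookup WHERE intron_id = ?', (intron,)
    ).fetchone()
    if row is None:
        return {}  # the intron is in no cluster: return no results

    # Step 2: split the comma-separated value into cluster ids.
    cluster_ids = [int(part) for part in row[0].split(",") if part.strip()]
    if not cluster_ids:
        return {}

    # Step 3: fetch each of those clusters from the original table.
    placeholders = ",".join("?" for _ in cluster_ids)
    query = (
        "SELECT cluster_id, intron_list FROM clusters "
        f"WHERE cluster_id IN ({placeholders}) ORDER BY cluster_id"
    )
    return {
        cluster_id: [name.strip() for name in intron_list.split(",")]
        for cluster_id, intron_list in conn.execute(query, cluster_ids)
    }


# In-memory stand-in database holding a few rows of sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clusters (cluster_id INTEGER PRIMARY KEY, intron_list TEXT)")
conn.execute('CREATE TABLE intron_lookup (intron_id TEXT PRIMARY KEY, "groups" TEXT)')
conn.executemany(
    "INSERT INTO clusters VALUES (?, ?)",
    [(1, "Intron30, Intron54, Intron55"), (2, "Intron45, Intron46"), (3, "Intron24, Intron30")],
)
conn.executemany(
    "INSERT INTO intron_lookup VALUES (?, ?)",
    [("Intron30", "1,3"), ("Intron45", "2"), ("Intron24", "3")],
)

print(find_clusters(conn, "Intron30"))   # clusters 1 and 3
print(find_clusters(conn, "Intron999"))  # {}
```

A single search costs two indexed queries regardless of how many clusters exist.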
### Estimated Performance

Processing time to build the lookup table: 30 seconds in the best case, 5 minutes in the worst case.

Search time for a single lookup: very fast in the best case, 2 seconds in the worst case.