Jul 5th, 2015
## Alphabet α ##

Let's consider the most general case of IUPAC encoding. Our alphabet α is:

*α = { T, K, H, Y, G, C, W, V, A, D, S, M, B, R, N }*

symbol|reverse complement|options|full name
------|------------------|-------|---------
**T**|A|T|Thymine
**K**|M|G, T|Keto
**H**|D|A, C, T|Not G
**Y**|R|T, C|Pyrimidine
**G**|C|G|Guanine
**C**|G|C|Cytosine
**W**|W|A, T|Weak bonds
**V**|B|G, C, A|Not T
**A**|T|A|Adenine
**D**|H|G, A, T|Not C
**S**|S|G, C|Strong bonds
**M**|K|A, C|Amino
**B**|V|G, T, C|Not A
**R**|Y|G, A|Purine
**N**|N|A, G, C, T|Any

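As a minimal sketch, the table above can be written as two lookup dictionaries. The names `IUPAC_OPTIONS`, `IUPAC_COMPLEMENT`, `reverse_complement` and `matches` are hypothetical and only serve the illustration.

```python
# Concrete nucleotides each IUPAC symbol may stand for (the "options" column).
IUPAC_OPTIONS = {
    "T": "T", "K": "GT", "H": "ACT", "Y": "TC", "G": "G",
    "C": "C", "W": "AT", "V": "GCA", "A": "A", "D": "GAT",
    "S": "GC", "M": "AC", "B": "GTC", "R": "GA", "N": "AGCT",
}

# Complement of each symbol (the "reverse complement" column).
IUPAC_COMPLEMENT = {
    "T": "A", "K": "M", "H": "D", "Y": "R", "G": "C",
    "C": "G", "W": "W", "V": "B", "A": "T", "D": "H",
    "S": "S", "M": "K", "B": "V", "R": "Y", "N": "N",
}


def reverse_complement(word):
    """Complement every symbol of a word over α and reverse the order."""
    return "".join(IUPAC_COMPLEMENT[symbol] for symbol in reversed(word))


def matches(symbol, base):
    """True if the concrete nucleotide `base` is one of the options of `symbol`."""
    return base in IUPAC_OPTIONS[symbol]
```

For example, `matches("K", "G")` holds because **K** stands for G or T, while `matches("K", "A")` does not.
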
## Fragment ε ##
A fragment **ε** is defined as the triplet *ε = { n, o, l }* where:
* **n** is the nibble number
* **o** is the offset in nibble **n** to the start of the fragment
* **l** is the length of the fragment in nucleotides

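A fragment could be represented as a small named tuple, sketched here. The text does not say whether **n** and **o** count from 0 or 1; this sketch assumes 0-based values, and the names `Fragment` and `extract` are hypothetical.

```python
from typing import NamedTuple, Sequence


class Fragment(NamedTuple):
    n: int  # nibble number (assumed 0-based in this sketch)
    o: int  # offset of the fragment inside nibble n (assumed 0-based)
    l: int  # length of the fragment in nucleotides


def extract(fragment: Fragment, nibbles: Sequence[str]) -> str:
    """Return the nucleotides a fragment covers in a list of nibble sequences."""
    sequence = nibbles[fragment.n]
    end = fragment.o + fragment.l
    if end > len(sequence):
        raise ValueError("fragment extends past the end of the nibble")
    return sequence[fragment.o:end]
```

With the nibbles `["ACGTACGT", "TTGCAAGG"]`, the fragment `Fragment(n=1, o=2, l=3)` selects `"GCA"` from the second nibble.
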
## Barcode set β ##
Each **b** in the barcode set **β** is a pair *{ s, t }* where:
* **s** is a word over the alphabet **α** of some length **l**
* **t** is an ordered set of fragments whose total concatenated length is **l**

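The two constraints on a barcode (every symbol of **s** belongs to **α**, and the fragment lengths in **t** add up to the length of **s**) can be checked when the pair is built. This is only a sketch: the `Barcode` class is hypothetical, and fragments are written as plain `(n, o, l)` tuples.

```python
from dataclasses import dataclass
from typing import Tuple

ALPHABET = frozenset("TKHYGCWVADSMBRN")  # the alphabet α


@dataclass(frozen=True)
class Barcode:
    s: str                                # word over α
    t: Tuple[Tuple[int, int, int], ...]   # ordered fragments as (n, o, l)

    def __post_init__(self):
        unknown = set(self.s) - ALPHABET
        if unknown:
            raise ValueError(f"symbols outside the alphabet: {sorted(unknown)}")
        total = sum(length for _, _, length in self.t)
        if total != len(self.s):
            raise ValueError(
                f"fragments cover {total} nucleotides, word has {len(self.s)}"
            )
```

For instance, `Barcode("ACGTNN", ((0, 0, 4), (1, 0, 2)))` is accepted, while a six-symbol word paired with fragments covering only five nucleotides raises `ValueError`.
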
## Read and Nibble ##
Each read **r** in **R** is an ordered set of nibbles, which fragments refer to by nibble number. Each nibble is a nucleotide sequence with corresponding Phred quality scores.

## Quality scores ##
Let's assume the quality scores are in the *Illumina 1.8 Phred+33* encoding, so each score is stored as a single ASCII character.
To get the *Phred* score we first take the ordinal of the character and then subtract 33 from it.

*Phred* is *-10 · log10(p)*, where *p* is the probability of an error.

To get *p* we take *10 ^ -(Phred / 10)*.

For instance:
```
Ordinal('+') = 43, Phred = 43 - 33 = 10, p = 10 ^ -1 = 0.1
Ordinal('5') = 53, Phred = 53 - 33 = 20, p = 10 ^ -2 = 0.01
Ordinal(';') = 59, Phred = 59 - 33 = 26, p = 10 ^ -2.6 = 0.00251188643151
Ordinal('A') = 65, Phred = 65 - 33 = 32, p = 10 ^ -3.2 = 0.00063095734448
```

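The same conversion as a short Python sketch. The function names are hypothetical, and the default offset of 33 reflects the Phred+33 assumption made above.

```python
def phred_score(char, offset=33):
    """Phred score encoded by one ASCII quality character."""
    return ord(char) - offset


def error_probability(char, offset=33):
    """Probability p that the base call is wrong: p = 10 ^ -(Phred / 10)."""
    return 10 ** (-phred_score(char, offset) / 10)
```

Calling `phred_score(';')` gives 26, and `error_probability(';')` gives roughly 0.0025, in line with the worked values above.
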
For each read **r** in **R** we can extract the word for each barcode by concatenating that barcode's fragments.
So we get the vector **Br**, a list of **card(β)** words over **α**,
and the vector **Qr**, a list of the corresponding quality scores (or probabilities of error), one for each base of each word in **Br**.

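One way to build **Br** and **Qr** is sketched below, under several assumptions not stated in the text: a read is a list of `(sequence, quality)` string pairs (one per nibble), each barcode is given only by its fragment list of `(n, o, l)` tuples with 0-based **n** and **o**, and **Qr** holds error probabilities rather than raw Phred scores. The function name `barcode_words` is hypothetical.

```python
def barcode_words(read, barcodes, offset=33):
    """Return (Br, Qr) for one read.

    read     -- list of (sequence, quality) string pairs, one per nibble
    barcodes -- list of fragment lists, each fragment an (n, o, l) tuple
    """
    Br, Qr = [], []
    for fragments in barcodes:
        word = ""
        errors = []
        for n, o, l in fragments:
            sequence, quality = read[n]
            # Concatenate the bases covered by this fragment.
            word += sequence[o:o + l]
            # Convert the matching quality characters to error probabilities.
            errors.extend(
                10 ** (-(ord(char) - offset) / 10) for char in quality[o:o + l]
            )
        Br.append(word)
        Qr.append(errors)
    return Br, Qr
```

Both returned lists have one entry per barcode, and each entry of `Qr` has one probability per base of the matching word in `Br`.
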
## Score ##
For each read and each barcode we calculate the score of the barcode for that read.