- ##Alphabet α##
- Let's consider the most general case of IUPAC encoding. Our alphabet α is:
- *α = { T, K, H, Y, G, C, W, V, A, D, S, M, B, R, N }*
- symbol|reverse complement|options|full name
- ------|------------------|-------|---------
- **T**|A|T|Thymine
- **K**|M|G, T|Keto
- **H**|D|A, C, T|Not G
- **Y**|R|T, C|Pyrimidine
- **G**|C|G|Guanine
- **C**|G|C|Cytosine
- **W**|W|A, T|Weak bonds
- **V**|B|G, C, A|Not T
- **A**|T|A|Adenine
- **D**|H|G, A, T|Not C
- **S**|S|G, C|Strong bonds
- **M**|K|A, C|Amino
- **B**|V|G, T, C|Not A
- **R**|Y|G, A|Purine
- **N**|N|A, G, C, T|Any
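The table above can be written down as two lookup tables. The following is a minimal sketch in Python; the names `IUPAC_OPTIONS`, `COMPLEMENT`, `reverse_complement` and `matches` are made up for illustration and are not part of the original note.

```python
# Symbol -> concrete bases it may stand for (the "options" column).
IUPAC_OPTIONS = {
    "T": "T",   "K": "GT",  "H": "ACT", "Y": "CT",  "G": "G",
    "C": "C",   "W": "AT",  "V": "ACG", "A": "A",   "D": "AGT",
    "S": "CG",  "M": "AC",  "B": "CGT", "R": "AG",  "N": "ACGT",
}

# Symbol -> complementary symbol (the "reverse complement" column).
COMPLEMENT = {
    "T": "A", "K": "M", "H": "D", "Y": "R", "G": "C",
    "C": "G", "W": "W", "V": "B", "A": "T", "D": "H",
    "S": "S", "M": "K", "B": "V", "R": "Y", "N": "N",
}


def reverse_complement(word):
    """Reverse a word over the alphabet and complement every symbol."""
    return "".join(COMPLEMENT[symbol] for symbol in reversed(word))


def matches(symbol, base):
    """True if the concrete base (A, C, G or T) is allowed by the symbol."""
    return base in IUPAC_OPTIONS[symbol]
```

For example, `reverse_complement("KHY")` reverses the word to `YHK` and complements each symbol, giving `RDM`.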
- ##Fragment ε##
- A fragment **ε** is defined as the triplet *ε = { n, o, l }* where:
- * **n** is the nibble number
- * **o** is the offset in nibble **n** to the start of the fragment
- * **l** is the length of the fragment in nucleotides
- ##Barcode set β##
- Each **b** in the barcode set **β** is a pair *{ s, t }* where:
- * **s** is a word over the alphabet **α** of some length **l**
- * **t** is an ordered set of fragments whose total concatenated length is **l**
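The fragment and barcode definitions map naturally onto two small record types. This is a sketch only: the class names `Fragment` and `Barcode` are hypothetical, and it assumes that nibble numbers and offsets are 0-based, which the note does not state.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Fragment:
    n: int  # nibble number (assumed 0-based)
    o: int  # offset of the fragment inside nibble n (assumed 0-based)
    l: int  # length of the fragment in nucleotides


@dataclass(frozen=True)
class Barcode:
    s: str                    # word over the alphabet
    t: Tuple[Fragment, ...]   # ordered fragments locating the word in a read

    def __post_init__(self):
        # The fragments, concatenated, must cover exactly len(s) nucleotides.
        total = sum(fragment.l for fragment in self.t)
        if total != len(self.s):
            raise ValueError(
                f"fragments cover {total} nucleotides, word has {len(self.s)}"
            )


# A 6-nucleotide barcode split across two nibbles: 4 bases from nibble 0
# and 2 bases from nibble 1, starting at offset 2.
example = Barcode("ACGTNN", (Fragment(0, 0, 4), Fragment(1, 2, 2)))
```

Checking the length invariant at construction time means a malformed barcode definition fails early, before any read is processed.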
- ##Read and Nibble##
- Each read **r** in **R** is a set of nibbles. Each nibble is a nucleotide sequence with corresponding Phred quality scores.
- ##Quality scores##
- Let's assume the quality scores are encoded in the *Illumina 1.8 Phred+33* format, so each score is stored as a single ASCII character.
- To get the *Phred* score we first take the ordinal of the character and then subtract 33 from it.
- *Phred* is defined as *-10 × log10(p)*, where *p* is the probability of an error.
- To get *p* we take *10 ^ -(Phred / 10)*.
- For instance:
- ```
- Ordinal('+') = 43, Phred = 43 - 33 = 10, p = 10 ^ -1 = 0.1
- Ordinal('5') = 53, Phred = 53 - 33 = 20, p = 10 ^ -2 = 0.01
- Ordinal(';') = 59, Phred = 59 - 33 = 26, p = 10 ^ -2.6 = 0.00251188643151
- Ordinal('A') = 65, Phred = 65 - 33 = 32, p = 10 ^ -3.2 = 0.00063095734448
- ```
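The conversion above can be sketched in a few lines of Python. The function names `phred_from_char` and `error_probability` are illustrative, and the default offset of 33 reflects the Phred+33 assumption made in this section.

```python
def phred_from_char(char, offset=33):
    """Decode one quality character into its Phred score."""
    return ord(char) - offset


def error_probability(char, offset=33):
    """Probability that the base call carrying this quality character is wrong."""
    return 10 ** (-phred_from_char(char, offset) / 10)


# Reproduce the worked examples: '+', '5', ';' and 'A'.
for char in "+5;A":
    print(char, phred_from_char(char), error_probability(char))
```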
- For each read **r** in **R** we can extract, for each barcode in **β**, the word that the barcode's fragments select from the read.
- This gives the vector **Br**, a list of **card(β)** words over **α**, one per barcode,
- and the vector **Qr**, a list of the corresponding quality scores, that is, the probability of error for each base in each element of **Br**.
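One way to build **Br** and **Qr** is sketched below. It is an illustration under stated assumptions rather than the author's implementation: a read is modelled as a list of `(sequence, quality)` string pairs, one per nibble; a barcode is a `(s, fragments)` pair with each fragment a `(n, o, l)` tuple; indices are 0-based; and qualities are Phred+33. The function names are hypothetical.

```python
def barcode_word(read, fragments):
    """Concatenate the fragments of one barcode as observed in one read.

    Returns the observed word and the error probability of each of its bases.
    Note that slicing silently truncates if a nibble is shorter than expected.
    """
    word_parts = []
    error_probabilities = []
    for n, o, l in fragments:
        sequence, quality = read[n]
        word_parts.append(sequence[o:o + l])
        # Phred+33 decoding: p = 10 ^ -((ord(c) - 33) / 10)
        error_probabilities.extend(
            10 ** (-(ord(c) - 33) / 10) for c in quality[o:o + l]
        )
    return "".join(word_parts), error_probabilities


def barcode_vectors(read, barcode_set):
    """Build Br (observed words) and Qr (per-base error probabilities)."""
    Br, Qr = [], []
    for _expected_word, fragments in barcode_set:
        word, probabilities = barcode_word(read, fragments)
        Br.append(word)
        Qr.append(probabilities)
    return Br, Qr


# Toy data: a read with two nibbles and a set of two barcodes.
read = [("ACGTAC", "5555++"), ("TTGGCC", "AAAA;;")]
barcode_set = [
    ("ACGT", [(0, 0, 4)]),             # 4 bases from nibble 0
    ("ACCC", [(0, 4, 2), (1, 4, 2)]),  # 2 bases from each nibble
]
Br, Qr = barcode_vectors(read, barcode_set)
```

Both vectors have one entry per barcode, and each entry of `Qr` has one probability per base of the matching entry in `Br`.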
- ##Score##
- For each read and each barcode we calculate the score of the barcode for that read.