
BNFO 201 – Project P2.
Due: Monday 5th December 2016 by 5pm EST.
The FASTA file format is the most common format for storing and studying DNA sequences. A record typically has a sequence ID on a single line, identified by a “>” sign as the first character of the line, followed by a DNA sequence which may span multiple lines. /home/bnfo201/project_genome.fasta contains the assembly of a bacterial genome, while /home/bnfo201/project_annotations.gtf contains this genome's annotations. The GTF file format is tab-delimited with the following columns:
1. seqname - name of the chromosome or scaffold or contig
2. source - name of the program that generated this feature, or the data source
3. feature - feature type name, e.g. Gene, Variation, Similarity
4. start - Start position of the feature, with sequence numbering starting at 1.
5. end - End position of the feature, with sequence numbering starting at 1.
6. score - A floating point value.
7. strand - defined as + (forward) or - (reverse).
8. frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is
the first base of a codon, '1' that the second base is the first base of a codon,
and so on.
9. attribute - A semicolon-separated list of tag-value pairs, providing additional
information about each feature.
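The nine columns above can be read with a short parser. The following is a minimal sketch, not a required design: the names GtfRecord, parse_attributes and parse_gtf_line are hypothetical, and the code assumes the common GTF attribute layout of tag "value" pairs separated by semicolons, with no semicolons inside the quoted values. Check a few lines of the actual annotation file before relying on that assumption.

```python
from typing import NamedTuple


class GtfRecord(NamedTuple):
    """One GTF line, split into its nine columns (hypothetical helper type)."""
    seqname: str
    source: str
    feature: str
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    score: str       # kept as text, since "." is often used for a missing score
    strand: str
    frame: str
    attributes: dict


def parse_attributes(field: str) -> dict:
    """Turn 'tag "value"; tag "value";' into {tag: value}.

    Assumption: values never contain a semicolon.
    """
    attributes = {}
    for pair in field.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip the empty piece after a trailing semicolon
        tag, _, value = pair.partition(" ")
        attributes[tag] = value.strip().strip('"')
    return attributes


def parse_gtf_line(line: str) -> GtfRecord:
    """Split one tab-delimited GTF line into a GtfRecord."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    return GtfRecord(
        seqname=cols[0],
        source=cols[1],
        feature=cols[2],
        start=int(cols[3]),
        end=int(cols[4]),
        score=cols[5],
        strand=cols[6],
        frame=cols[7],
        attributes=parse_attributes(cols[8]),
    )


# Made-up example line, not taken from the project file.
example = 'contig1\tGenbank\tCDS\t10\t21\t.\t-\t0\tgene_id "g1"; protein_id "P1.1";'
record = parse_gtf_line(example)
```

Storing start and end as integers keeps the later slicing step simple, while the score stays a string so that a missing value does not raise an error.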
For each protein-coding annotation (feature = CDS), extract the transcript from the genome and save it to a file named transcripts.fasta. The sequence ID of each transcript should be the value of the “protein_id” attribute in the GTF file. Transcripts on the “-” strand need to be reverse complemented. Next, translate each transcript into a protein according to /home/bnfo201/project_genetic_code.tsv and save the proteins to proteins.fasta. Corresponding transcripts and proteins should have the same sequence IDs. Remember, all proteins should start with Met. Finally, calculate the GRAVY score of each protein in proteins.fasta and report it in tabular format in “gravy_score.tsv”, with the columns sequence_id, length, and gravy_score.
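One possible way to organize the extraction and translation steps is sketched below. All function names are hypothetical, and the toy codon table stands in for the real one: the column layout of project_genetic_code.tsv is not described here, so loading it is left out, and the use of "*" as the stop symbol is an assumption. Replacing the first residue with M is one reading of "all proteins should start with Met" (bacterial genes can begin with alternative start codons); confirm that this matches what is expected.

```python
# Sketch only: helper names and the toy codon table are illustrative assumptions.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(lines):
    """Return {sequence_id: sequence} from an iterable of FASTA lines."""
    chunks = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited word after ">" as the ID.
            words = line[1:].split()
            current_id = words[0] if words else ""
            chunks[current_id] = []
        elif current_id is not None:
            chunks[current_id].append(line)
    # Join the per-line pieces once at the end instead of growing a string.
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_transcript(contig, start, end, strand):
    """Slice a 1-based, inclusive interval; reverse complement "-" strand features."""
    segment = contig[start - 1:end]
    return reverse_complement(segment) if strand == "-" else segment


def translate(transcript, genetic_code, stop_symbol="*"):
    """Translate codon by codon and stop at the first stop codon.

    Assumption: stop codons map to stop_symbol in the codon table.
    """
    residues = []
    for i in range(0, len(transcript) - 2, 3):
        amino_acid = genetic_code[transcript[i:i + 3].upper()]
        if amino_acid == stop_symbol:
            break
        residues.append(amino_acid)
    if residues:
        residues[0] = "M"  # assumption: the start codon is always read as Met
    return "".join(residues)


# Toy data for illustration; the real table comes from the project TSV file.
toy_code = {"ATG": "M", "GTG": "V", "AAA": "K", "TAG": "*"}
genome = read_fasta([">contig1 toy example", "CCCTATT", "TCATGG"])
transcript = extract_transcript(genome["contig1"], 3, 11, "-")
protein = translate(transcript, toy_code)
```

In this toy case the feature lies on the "-" strand, so the slice CTATTTCAT is reverse complemented to ATGAAATAG before translation.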
GRAVY Score: The hydropathic index (a measure of the hydrophilic or hydrophobic nature) of each amino acid is documented in /home/bnfo201/hydropathic_index.tsv. The GRAVY score of a protein is the sum of the hydropathic indices of all its amino acids, divided by the length of the protein.
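The definition above amounts to a simple average, which can be sketched as follows. The function names are hypothetical, the small toy_index dictionary is a placeholder for whatever hydropathic_index.tsv contains, and rounding to three decimal places in the output row is an assumption rather than a stated requirement.

```python
def gravy_score(protein, hydropathy):
    """Average hydropathic index over all residues of the protein."""
    if not protein:
        raise ValueError("cannot score an empty protein")
    return sum(hydropathy[amino_acid] for amino_acid in protein) / len(protein)


def gravy_row(sequence_id, protein, hydropathy):
    """Build one tab-separated row: sequence_id, length, gravy_score."""
    score = gravy_score(protein, hydropathy)
    return f"{sequence_id}\t{len(protein)}\t{score:.3f}"


# Placeholder values for illustration; load the real ones from the TSV file.
toy_index = {"M": 1.9, "K": -3.9, "A": 1.8}
row = gravy_row("P1.1", "MKA", toy_index)
```

For the toy protein MKA the sum is 1.9 - 3.9 + 1.8 = -0.2, and dividing by the length 3 gives roughly -0.067.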

Deliverables:
1. An MS Word document describing
a. how you approached the problem,
b. which functions/objects/dictionaries/lists were created and why, etc.
c. how to run the code (input arguments, etc.),
d. how the code runs, i.e., the flow,
e. how you tested the code for accuracy/speed.
2. Python code: <eid>_project.py. Please add comments wherever necessary, even if you feel they are very “obvious”. Copy your scripts and output files to /home/bnfo201/<eid>/project.
Is this project for you? Project assignments: