Untitled

a guest
Aug 20th, 2019
## Overview
This gist collects several strategies for parsing and correcting GenBank files, laid out as a short cookbook.
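
### Delete and append a line by matching a pattern PATTERN

The first command deletes every line that matches PATTERN. The second appends the line `TEXT N` after every line that matches PATTERN. When the two are chained as shown, the second PATTERN has to differ from the first, because the lines matching the first have already been deleted.

`sed '/PATTERN/d' someFile.gb > outputFile.gb`

`perl -pe '$_.= qq(TEXT N\n) if /PATTERN/' outputFile.gb > otherOutfile.gb`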
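
The same two operations can also be sketched in Python, which may help when the edit needs to be tested before it touches real files. This is a minimal illustration rather than part of the original recipe: the function names `delete_matching` and `append_after` and the toy record are assumptions made for the example.

```python
import re


def delete_matching(lines, pattern):
    """Drop every line that matches pattern (like sed '/PATTERN/d')."""
    rx = re.compile(pattern)
    return [ln for ln in lines if not rx.search(ln)]


def append_after(lines, pattern, text):
    """Add text as a new line after each matching line (like the perl one-liner)."""
    rx = re.compile(pattern)
    out = []
    for ln in lines:
        out.append(ln)
        if rx.search(ln):
            out.append(text)
    return out


# Toy record, one list entry per line, without trailing newlines.
record = ["LOCUS demo", "COMMENT junk", "ORIGIN"]
cleaned = delete_matching(record, r"^COMMENT")
extended = append_after(cleaned, r"^LOCUS", "TEXT N")
```

Here the delete and the append use different patterns, so the second step still has a line to match.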


### Modify a line 1
*sed* can do this by matching PATTERN. Wrapping the expression in double quotes lets the shell expand a variable such as `$NM` inside it. If the value contains `/` or `&`, sed treats those characters specially, so they need escaping or a different delimiter.

`NM="some value"`

`sed "s/ACCESSION <unknown id>/ACCESSION $NM/g" genbankFile.gb > genbankFile.mod.gb`
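
Where the replacement value may contain characters that sed treats specially, a Python substitution is one way around the escaping. The sketch below is only an illustration: the function name `set_accession` and the sample text are assumptions, not part of the original gist.

```python
import re


def set_accession(text, name):
    """Replace the '<unknown id>' placeholder on ACCESSION lines with name."""
    # A function replacement inserts name literally, so slashes, ampersands
    # and backslashes in it need no escaping.
    return re.sub(r"ACCESSION <unknown id>", lambda m: "ACCESSION " + name, text)


sample = "LOCUS demo\nACCESSION <unknown id>\n"
fixed = set_accession(sample, "some/value")
```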

### Modify a line 2
Command-line *perl* can also be used. The pattern below matches an `ORGANISM` line whose value is a bare dot, including any leading whitespace.

`perl -pe 's/^\s*ORGANISM \./ORGANISM Macaca mulatta/' genbankFile.gb > genbankFile.mod.gb`
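
To see what the anchored pattern does and does not match, it can be tried on a small string in Python. This is a sketch under assumptions: the name `set_organism` and the sample text are made up for the example, and `[ \t]*` stands in for `\s*` so that the match cannot run across line breaks when the whole file is read as one string.

```python
import re

# ^ anchors at every line start under re.MULTILINE; \. is a literal dot.
ORGANISM_DOT = re.compile(r"^[ \t]*ORGANISM \.", re.MULTILINE)


def set_organism(text, organism="Macaca mulatta"):
    """Fill in ORGANISM lines whose value is only a dot."""
    return ORGANISM_DOT.sub(lambda m: "ORGANISM " + organism, text)


sample = "SOURCE .\n  ORGANISM .\n            .\n"
fixed = set_organism(sample)
```

Only the `ORGANISM` line changes; the `SOURCE` line and the continuation line keep their dots. As in the perl command, the leading whitespace of the matched line is dropped.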

### Combine commands

    for i in *.gb ; do
        # file name without its last extension
        NM=$(echo "$i" | rev | cut -d. -f2- | rev) ;
        # replace <unknown id> with name string
        sed "s/ACCESSION <unknown id>/ACCESSION $NM/g" "${i}" > "${i}.tmp.1" ;
        # when converting from genbank to embl
        # Biopython looks at the VERSION line for the identifier when converting a gb file
        sed "s/VERSION <unknown id>/VERSION $NM/g" "${i}.tmp.1" > "${i}.tmp.2" ;
        # misc aesthetic cleanup; the result is written next to the input as <name>.gb.gb
        perl -pe 's/^\s*ORGANISM \./ORGANISM Macaca mulatta/' "${i}.tmp.2" > "${i}.gb" ;
        # remove this file's intermediates
        rm "${i}".tmp.* ;
    done
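
The same chain can be sketched as a single pass in Python, which avoids the intermediate `.tmp` files. This is an illustration under assumptions, not the author's method: the names `stem_of`, `fix_record` and `fix_file` are hypothetical, and the output name mirrors the `<name>.gb.gb` convention of the loop.

```python
import re
from pathlib import Path


def stem_of(filename):
    """Everything before the last dot, like rev | cut -d. -f2- | rev."""
    return filename.rsplit(".", 1)[0]


def fix_record(text, name):
    """Apply the ACCESSION, VERSION and ORGANISM fixes to one file's text."""
    text = text.replace("ACCESSION <unknown id>", "ACCESSION " + name)
    text = text.replace("VERSION <unknown id>", "VERSION " + name)
    return re.sub(r"^[ \t]*ORGANISM \.", "ORGANISM Macaca mulatta",
                  text, flags=re.MULTILINE)


def fix_file(path):
    """Write a corrected copy next to path, named <name>.gb like the loop's output."""
    path = Path(path)
    out = path.with_name(path.name + ".gb")
    out.write_text(fix_record(path.read_text(), stem_of(path.name)))
    return out
```

Calling `fix_file` on each entry of `Path(".").glob("*.gb")` would cover the whole directory, provided the list of inputs is collected before any output is written.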

### Scripts

bash

    #!/bin/bash
    for i in *.gb ; do
        NM=$(echo "$i" | rev | cut -d. -f2- | rev) ;
        # replace <unknown id> with name string
        sed "s/ACCESSION <unknown id>/ACCESSION $NM/g" "${i}" > "${i}.tmp.1" ;
        # when converting from genbank to embl
        # Biopython looks at the VERSION line for the identifier when converting a gb file
        sed "s/VERSION <unknown id>/VERSION $NM/g" "${i}.tmp.1" > "${i}.tmp.2" ;
        perl -pe 's/^\s*ORGANISM \./ORGANISM Macaca mulatta/' "${i}.tmp.2" > "${i}.gb" ;
        rm "${i}".tmp.* ;
    done
    # then loop through the files, and use biopython to modify each to an embl file


python

    #!/Users/caskey/anaconda3/bin/python
    # requires bash env variable FILENAME to be set or script will fail

    import os

    from Bio import SeqIO

    # name of the GenBank file, taken from the environment
    fname = os.environ['FILENAME']
    # slashes in the name are replaced with colons
    fname = fname.replace('/', ':')
    presentDir = os.environ['PWD']
    outFileName = presentDir + '/' + fname + '.embl'
    fnameIn = presentDir + '/' + fname
    # convert returns the number of records written
    count = SeqIO.convert(fnameIn, "genbank", outFileName, "embl")
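
Because the script above fails with a bare `KeyError` when `FILENAME` is unset, a slightly more defensive variant might look like the sketch below. It is an assumption-laden illustration: the names `embl_paths` and `convert_from_env` and the script name `convert.py` in the error message are hypothetical, while `SeqIO.convert` is the same Biopython call the original uses.

```python
import os
from pathlib import Path


def embl_paths(fname, directory):
    """Return (input path, output path) built the way the script above builds them."""
    safe = fname.replace("/", ":")
    base = Path(directory)
    return base / safe, base / (safe + ".embl")


def convert_from_env(environ=os.environ):
    """Convert the file named by FILENAME; stop with a clear message if it is unset."""
    fname = environ.get("FILENAME")
    if not fname:
        raise SystemExit("FILENAME is not set; try: FILENAME=sample.gb python convert.py")
    # Imported here so the path helper can be used without Biopython installed.
    from Bio import SeqIO

    src, dst = embl_paths(fname, environ.get("PWD", os.getcwd()))
    # Returns the number of records converted.
    return SeqIO.convert(str(src), "genbank", str(dst), "embl")
```

Keeping the path logic in its own function makes it possible to check the input and output names without running a conversion.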