
Untitled

a guest
Feb 22nd, 2019
import os
import re

import bs4
from num2words import num2words
from tqdm import tqdm


def read_file(filepath):
    """Return the contents of filepath, or None if it cannot be read."""
    content = None
    try:
        with open(filepath, 'r') as f:
            content = f.read()
    except IOError as e:
        print(e)

    return content


def append_file(file, data):
    """Append data to file, printing any I/O error instead of raising."""
    try:
        with open(file, 'a') as f:
            f.write(data)
    except IOError as e:
        print(e)


def process(file):
    """Strip markup from one input file, normalize its text, and append it to the output."""
    filepath = os.path.join(input_dir, file)
    content = read_file(filepath)
    if content is None:  # unreadable file: read_file already printed the error
        return
    soup = bs4.BeautifulSoup(content, "lxml")
    text = soup.get_text()
    # replace every run of non-alphanumeric characters with a single space
    text = re.sub('[^A-Za-z0-9]+', ' ', text)

    text = text.split()
    # spell out digit-only tokens, lowercase everything else
    text = [num2words(int(token)) if token.isdigit() else token.lower() for token in text]

    text = " ".join(text)

    # the trailing newline keeps the last word of one file from fusing with the first word of the next
    append_file(output_file, text + "\n")


input_dir = "paper"
output_file = "output.txt"
with open(output_file, 'w'):
    pass  # empty the file

files = os.listdir(input_dir)

for file in tqdm(files):
    process(file)

print("ready in", output_file)


"""
Answer for part 2:

Word error rate (WER) and character error rate (CER) are common evaluation metrics.
An established approach is to evaluate on benchmarks such as Switchboard.
In a domain-specific case, you need both the audio and the correct transcript to compare against your transcription.
Using existing sources that provide both speech and transcripts is ideal, if any are available.
If not, you can first convert text to speech and use the result for testing, but this will introduce bias into your system.

In the end, the real test is using the system, because only that shows real-world performance.

The frequency of words in user-generated content can be used as an input for speech-to-text decisions.
For example, a home assistant understands “holla, hella, holo” as “hello” because it is more common.
The system would then fail to recognize less frequent words correctly; a tf-idf scheme can correct this.
"""
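
The answer above names word error rate and character error rate as evaluation metrics. A minimal sketch of both follows, assuming the usual definition (edit distance between reference and hypothesis, divided by the reference length). The function names edit_distance, word_error_rate and char_error_rate are made up for this illustration and are not part of the script above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference transcript is empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Both transcripts would normally go through the same normalization first (lowercasing, punctuation removal, spelled-out numbers), otherwise formatting differences are counted as recognition errors.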