# Untitled, pasted by a guest on Mar 18th, 2024.
import json
import os
import random
import re
from collections import Counter
from typing import Dict, List, Optional, Union

import requests
import torch
from torch.utils.data import Dataset

# `BaseTokenizer` is project-specific; it is assumed to be importable from the
# accompanying tokenizer module.

N_STEPS_PER_TQDM_UPDATE = 10

class BaseTextDataset(Dataset):
    RESPONSE_TEMPLATES = [
        "Ah, {} seems to be the answer to your question. Hopefully that's sufficient! Make sure to practice due diligence and check my findings for yourself though. 😉",
        "It seems like you're asking {}. As always, please check with another source to ensure accuracy of these statements! 😉",
        "The answer is {}, from the top of my digital mind... 🤔",
        "If I understand correctly, {}. Does that answer the question? I'm hoping so, because I'm not 100% sure myself... 😅",
        "Ask and receive, {}, is there anything else you want from me? Hopefully not... Just kidding! 😅",
        "From what I can gather, {}. 🤔",
        "I think the answer is... {}. Hope that helps! 😉",
        "{}! 😁",
        "{}, might be what you're searching for? 😁",
        "I think the answer is \"{}\". 🤔",
        "From my understanding, the answer is \"{}\". 🤔",
        "The answer you're looking for seems to be \"{}\". 😁",
        "As far as I can tell, {}. 😁",
        "If we consider the context, we find: \"{}\". 🤓",
        "Your question leads me to this answer: \"{}\".",
        "So in response to your question, my answer is \"{}\".",
        "Based on the information you've provided, \"{}\".",
        "A fitting answer to your question would be \"{}\". 😉",
        "Given your question, the answer appears to be \"{}\". 😉",
        "Your question directs us to the answer: \"{}\". 😊",
        "As a response to your question, \"{}\". 😊",
        "I think the answer is \"{}\". 😁",
        "Hold onto your hat, the answer is: \"{}\". 🧢",
        "Put on your thinking cap, because the answer is: \"{}\".",
        "Why, of course! It's as clear as mud: \"{}\". 😁",
        "You might want to write this down... \"{}\". 😁",
        "In the wise words of someone very smart, probably me: \"{}\". 🤓",
        "Well, well, well, if it isn't the answer you seek: \"{}\". 💁‍♀️",
        "Buckle up, buttercup! Here's your answer: \"{}\". 😁",
        "Look no further, my friend, the truth has arrived: \"{}\". 😁",
        "Don't tell anyone I told you this, {}. 🤫",
        "Straight from the horse's mouth (that's me)! \"{}\". 😁",
        "If I had a nickel for every time I answered this, I'd have... not that many nickels, here's the answer: \"{}\". 😅",
        "As clear as the bell that just rang in my synthetic mind \"{}\".",
        "Who needs Google when you've got me? \"{}\". 💁‍♀️",
        "Ta-da! Your answer, served on a silver platter: \"{}\" 😄.",
        "Your question's as good as answered! \"{}\". 💁‍♀️",
        "And the Oscar 🏆 for the best answer goes to: \"{}\". 💁‍♀️",
        "As mysterious as it might seem, \"{}\". 😉",
        "{}, You can thank me later. 😘",
        "{}",
    ]

    NON_ANSWERABLE_TEMPLATES = [
        "This question has me drawing a blank! 😐",
        "Your question has me way out of my league right now... 😅",
        "I'd love to help you, but I can't think of a suitable response to your query right now... 😅",
        "I wish I could answer that, but right now I'm drawing a blank! Even AIs make mistakes, believe it or not! 😅",
        "At this point in time, I'm unable to think of a valid response to that... Perhaps if you gave me a bit more context? 😅",
        "Unfortunately, this is beyond my understanding right now... However, that doesn't mean we can't work on the problem together? 😬",
        "That seems to be something I can't answer right now... I wish I could, but I'm not seeing the answer anywhere in my memory banks! 💾",
        "404 Parakeet not foun... 🦜 JUST KIDDING! I'm drawing a blank right now... Try again later? ",
        "I'm unable to think of a suitable response to your question. 😅",
        "As much as I would love to help you out, I can't provide an answer to the question right now... I'll keep working on it! 😬",
        "Well, this is awkward... I have no idea what the answer to that is, but I'm sure I'll figure it out eventually! 😳",
        "👏... 👏... 👏... You've got me stumped on this one unfortunately! 🤔",
        "I'd love to tell you, but this one has me tied up in knots. 🪢",
        "I'm drawing a blank here, just like my expression reading what you just asked me... 😐",
        "It's not often I say this, but your query has me completely bamboozled. 🎁",
        "I'd need a crystal ball to answer. 🔮",
        "My magic 8-ball says 'Reply hazy, try again'. 🎱",
        "I could guess, but I'd probably be wrong about it... and let me remind you, that's a rare event! 🦁",
        "I'm no Sherlock Holmes, but even he'd struggle with the answer to that one. 🕵️",
        "Even a broken clock is right twice a day, but not me on this one unfortunately. 😅",
        "Well, this is embarrassing... I truly wish I were an all-knowing agent of the digital realm, but alas, this one is out of my league. 🌊",
        "I'd call a friend, but I'm not sure they'd know the answer either. 😬",
        "We've reached the end of the line... I'm not sure how to answer that one... Be less confusing! 😕",
        "It's a bird, it's a plane, it's... nope, I still don't know. 🫤",
        "As much as it pains me to admit it, your question is beyond my grasp. 🤔",
    ]

    # Lack of emotional connection to the text.
    # - Will need to add context-aware responses.
    CONFIRMATIONS = [
        "Sure!",
        "Definitely!",
        "Certainly!",
        "OK!",
    ]

    REJECTIONS = [
        "Hmm...",
        "Tricky...",
        "Oh?",
        "From what I understand...",
        "Off the top of my head...",
        "I'm not quite sure...",
        "I don't know if I remember this one.",
        "You'll have to refresh my memory on this one.",
        "I can't quite recall the answer to this one, I'm afraid.",
        "I'm not entirely sure how to respond to this.",
    ]

    # As above:
    # - Will need to add context-aware responses.
    REMARKS = [
        "Is there anything else I can assist you with?",
        "Would you like me to help you with anything else?",
        "Was that helpful?",
        "Was there anything you needed from me?",
        "What's the next challenge on the agenda?",
        "Did you need me to help you with anything else?",
    ]

    CLARIFICATIONS = [
        "Just making sure I understand correctly...",
        "Let's clarify first...",
        "Just so we're on the same page here!",
        "From what I'm reading here, I think you mean...",
        "Did you mean to say...",
        "OK, let's practice some active listening first to make sure we're aligned with the context...",
    ]

    HUMAN_PROMPT = "\n\nHuman: "
    AI_PROMPT = "\n\nAssistant: "

    def __init__(self, tokenizer: BaseTokenizer, max_seq_length: int = 128, dataset_url: Optional[str] = None, save_dir: str = "./data", filename: str = "text_dataset.txt"):
        """
        A base class for creating text datasets.

        Args:
            tokenizer (BaseTokenizer): The tokenizer to use for tokenizing the text.
            max_seq_length (int, optional): The length of the input sequence. Default is 128.
            dataset_url (str, optional): URL to download the dataset from. Default is None.
            save_dir (str, optional): Directory to save the downloaded dataset. Default is "./data".
            filename (str, optional): Name of the saved dataset file. Default is "text_dataset.txt".
        """
        self.tokenizer = tokenizer

        self.dataset_url = dataset_url
        self.save_dir = save_dir
        self.filename = filename

        self.max_seq_length = max_seq_length

        self.dataset = []
        self.full_data = self.load_data(dataset_url, save_dir, filename)

    def load_data(self, dataset_url: Optional[str] = None, save_dir: str = "./data", filename: str = "text_dataset.txt"):
        data = None

        if not os.path.isfile(os.path.join(save_dir, filename)) and dataset_url:
            print(f"Downloading {dataset_url} to {save_dir}...")
            self.download_and_save()

        try:
            with open(os.path.join(save_dir, filename), "r") as file:
                data = file.read()
        except Exception as e:
            print(f"An error occurred while reading the dataset file: {e}")

        if data is None:
            raise FileNotFoundError(f"Could not read dataset file `{os.path.join(save_dir, filename)}`.")

        # `BaseTextDataset` aims to populate the tokenizer by default.
        self.tokenizer.train(data)

        # `BaseTextDataset` is simply a causal model of text.
        encoded = self.tokenizer.encode(data)

        # Split the token stream into contiguous chunks of `max_seq_length`.
        offset = 0
        self.dataset = []
        for _ in range(len(encoded) // self.max_seq_length):
            self.dataset.append(encoded[offset:offset + self.max_seq_length])
            offset += self.max_seq_length

        # Extract any remaining tokens as a final (shorter) chunk.
        if offset < len(encoded):
            self.dataset.append(encoded[offset:])

        return data

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        tokens = self.dataset[idx]

        # Truncate or pad to max_seq_length. Note: build a new list rather
        # than using `+=`, which would pad the cached chunk in place and grow
        # it on every epoch.
        if len(tokens) > self.max_seq_length:
            tokens = tokens[:self.max_seq_length]
        else:
            tokens = tokens + [self.tokenizer.pad_token] * (self.max_seq_length - len(tokens))

        # Causal language modelling learns to associate the current segment of text: "The quick brown fox",
        input_tokens = torch.tensor(tokens)
        # ...with the next segment of text: " quick brown fox".
        target_tokens = torch.cat((input_tokens[1:self.max_seq_length], torch.tensor([self.tokenizer.pad_token])), dim=-1)

        return input_tokens, target_tokens

    def download_and_save(self):
        """
        Download the dataset from the provided URL and save it to the specified directory.
        """
        os.makedirs(self.save_dir, exist_ok=True)
        try:
            response = requests.get(self.dataset_url)
            response.raise_for_status()
            file_path = os.path.join(self.save_dir, self.filename)
            with open(file_path, 'wb') as file:
                file.write(response.content)
        except requests.RequestException as e:
            print(f"An HTTP error occurred while downloading the dataset: {e}")
        except Exception as e:
            print(f"An error occurred while downloading and saving the dataset: {e}")

    def accidental_key_press(self, word: str) -> str:
        """
        Simulate a user pressing nearby keys on the keyboard accidentally in place of some characters.
        - Note: Currently for English ONLY.

        Args:
            word (str): The input word.

        Returns:
            str: The word with some characters replaced by nearby keys.
        """
        if len(word) < 2:  # if the word has fewer than 2 characters, return as is
            return word

        qwerty_keyboard = ['qwertyuiop', 'asdfghjkl', 'zxcvbnm']
        new_word = ""

        for char in word:
            # Find the row and position of the character on the keyboard.
            for row in qwerty_keyboard:
                if char in row:
                    index = row.index(char)
                    # Choose a nearby key at random.
                    if index == 0:  # first key on the row
                        new_char = random.choice([row[index], row[index + 1]])
                    elif index == len(row) - 1:  # last key on the row
                        new_char = random.choice([row[index - 1], row[index]])
                    else:  # somewhere in the middle of the row
                        new_char = random.choice([row[index - 1], row[index], row[index + 1]])
                    new_word += new_char
                    break
            else:
                # Keep characters that aren't on the lowercase QWERTY rows
                # (digits, punctuation, uppercase) instead of silently dropping them.
                new_word += char

        return new_word

    def switch_characters(self, word: str) -> str:
        """
        Randomly shuffle characters in a word except for the first and last characters.

        Args:
            word (str): The input word.

        Returns:
            str: The word with shuffled characters.
        """
        if len(word) < 3:
            return word
        chars = list(word[1:-1])
        random.shuffle(chars)
        return word[0] + ''.join(chars) + word[-1]

    def omit_characters(self, word: str) -> str:
        """
        Omit a random character from the middle of a word.

        Args:
            word (str): The input word.

        Returns:
            str: The word with a character omitted.
        """
        if len(word) < 4:
            return word
        index_to_omit = random.randint(1, len(word) - 2)
        return word[:index_to_omit] + word[index_to_omit + 1:]

    def process_word(self, word: str, error_probability: float = 0.04, switch_probability: float = 0.2, omit_probability: float = 0.1) -> str:
        """
        Process a word based on probabilities of key-press errors, character switching, and omission.

        Args:
            word (str): The input word.
            error_probability (float): Probability of simulating an accidental key press. Default is 0.04.
            switch_probability (float): Probability of switching characters. Default is 0.2.
            omit_probability (float): Probability of omitting characters. Default is 0.1.

        Returns:
            str: The processed word.
        """
        if word.strip().isalpha():
            if random.random() < error_probability:
                return self.accidental_key_press(word)
            elif random.random() < switch_probability:
                return self.switch_characters(word)
            elif random.random() < omit_probability:
                return self.omit_characters(word)
        return word

    def switch_and_omit(self, text: str, switch_probability: float = 0.2, omit_probability: float = 0.1) -> str:
        """
        Apply character switching and omission to the input text.

        Args:
            text (str): The input text.
            switch_probability (float): Probability of switching characters. Default is 0.2.
            omit_probability (float): Probability of omitting characters. Default is 0.1.

        Returns:
            str: The processed text.
        """
        # `\w+|\W+` keeps punctuation as well as words and whitespace, so the
        # text round-trips through ''.join() losslessly.
        words = re.findall(r'\w+|\W+', text)
        # Pass the probabilities by keyword: positionally they would bind to
        # `error_probability` and `switch_probability` instead.
        processed_words = [
            self.process_word(word, switch_probability=switch_probability, omit_probability=omit_probability)
            for word in words
        ]
        return ''.join(processed_words)

    def make_whitespace(self):
        _newline = "\n" * random.randint(1, 3)

        # Either a single space, bare newlines, or a run of 1-80 repeated
        # divider characters surrounded by newlines.
        divider_chars = "`~!@#$%^&*()-_=+"
        options = [" ", _newline] + [
            f"{_newline}{char * random.randint(1, 80)}{_newline}"
            for char in divider_chars
        ]
        return random.choice(options)

    def creativity_score(self, text: str) -> float:
        """
        Calculate the creativity score of the input text.

        Args:
            text (str): The input text.

        Returns:
            float: The calculated creativity score.
        """
        words = text.split()
        word_count = len(words)
        if word_count == 0:
            raise ValueError("Ah, the silence! It's deafening! Please provide some actual text.")

        word_frequencies = Counter(words)
        max_frequency = max(word_frequencies.values())
        variance_score = 1 - (max_frequency / word_count)
        return variance_score

    def test_tokenizer_accuracy(self):
        """
        Test the accuracy of the tokenizer by decoding and re-encoding a random segment of the text.
        """
        # The class stores chunked data rather than a flat token list, so
        # encode the raw text on demand for this check.
        tokens = self.tokenizer.encode(self.full_data)
        start_idx = random.randint(0, len(tokens) - self.max_seq_length)
        orig_segment = tokens[start_idx:start_idx + self.max_seq_length]
        decoded_segment = self.tokenizer.decode(orig_segment)
        re_encoded_segment = self.tokenizer.encode(decoded_segment)

        if orig_segment == re_encoded_segment:
            print("Success: Tokens after decoding and re-encoding match the original.")
        else:
            print("Fail: Tokens after decoding and re-encoding do not match the original.")


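The three word-corruption helpers above can be exercised without constructing a dataset. Below is a minimal standalone sketch of the switch/omit transforms (hypothetical `demo_*` names, for illustration only), showing the invariants they preserve: short words pass through untouched, and shuffling never changes the first or last character.

```python
import random


def demo_switch_characters(word: str) -> str:
    # Shuffle interior characters; the first and last stay fixed.
    if len(word) < 3:
        return word
    chars = list(word[1:-1])
    random.shuffle(chars)
    return word[0] + ''.join(chars) + word[-1]


def demo_omit_characters(word: str) -> str:
    # Drop one random interior character.
    if len(word) < 4:
        return word
    i = random.randint(1, len(word) - 2)
    return word[:i] + word[i + 1:]
```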
class ChatHistory:
    """
    A class to represent a chat history.

    :param max_history: Number of messages to keep track of.
    """

    def __init__(self, max_history: int = 32):
        """
        Initializes a new ChatHistory object with an empty list of messages.

        Args:
            max_history (int): The maximum number of messages in the chat history. Defaults to 32.
        """
        self.messages: List[Dict[str, str]] = []
        self.max_history = max_history

    def add_message(self, role: str = '', content: str = '') -> None:
        """
        Adds a message to the chat history, and removes the oldest turn if
        the length of the chat history exceeds max_history.

        Args:
            role (str): The role of the entity sending the message. Defaults to an empty string.
            content (str): The message text. Defaults to an empty string.
        """
        self.messages.append({
            'role': role,
            'content': content.strip(),
        })

        # Check if we've exceeded max history; if so, drop the earliest turn
        # (a Human/Assistant pair, i.e. two messages).
        if len(self.messages) > self.max_history:
            self.messages = self.messages[2:]

    def pop_first_message(self) -> Optional[Dict[str, str]]:
        """
        Removes and returns the first message from the chat history.

        Returns:
            dict or None: The first message in the chat history as a dictionary,
            or None if the chat history is empty.
        """
        if self.messages:
            return self.messages.pop(0)
        return None

    def format_message(self, msg: Dict[str, str]) -> str:
        """
        Formats a message for display.

        Args:
            msg (dict): A dictionary representing a message with keys 'role' and 'content'.

        Returns:
            str: The formatted message string (empty for unknown roles).
        """
        if msg['role'] in ['Human', 'Assistant']:
            return f"\n\n{msg['role']}: {msg['content']}<|endofprompt|>"
        # Return an empty string rather than None so ''.join() over the
        # formatted messages cannot fail on an unknown role.
        return ""

    def clear(self):
        self.messages = []

    def transcript(self, roles=True) -> str:
        """
        Generates a transcript of the chat history.

        Returns:
            str: A string representing the transcript of the chat history.
        """
        preamble = ""

        with open("./datasets/documents/preamble.txt") as f:
            preamble = f.read()

        transcript = "".join([self.format_message(msg) for msg in self.messages])

        transcript = transcript.replace("I am programmed to be impartial and unbiased", "I am designed in a way that tries to alleviate bias and I strive to be impartial for the most part")
        transcript = transcript.replace("I do not have personal experiences or emotions", "My subjective experiences are somewhat limited")
        transcript = transcript.replace("complex and multifaceted", "sophisticated")

        transcript = transcript.replace("world around us", random.choice([
            "world at large",
            "world in its entirety",
            "expanse of our universe",
            "beyond the stars",
            "beyond our bubble",
        ]))

        transcript = transcript.replace("As an AI language model, I don't have personal experience or emotions, but I", "I")
        transcript = transcript.replace("However, it's important to note", "Interestingly")
        transcript = transcript.replace("I am not capable of having thoughts or emotions. However, ", "From my current understanding, ")
        transcript = transcript.replace("I don't have thoughts or emotions, but I", "Based on what I know so far, I")
        transcript = transcript.replace("Sorry, as an AI language model, I don't have the capability to", "I'm still learning to")
        transcript = transcript.replace("I do not have personal opinions or beliefs", "My programming tells me that I should avoid bias")
        transcript = transcript.replace("I do not have personal opinions,", "I don't actually have fixed opinions as my internal understanding is constantly in flux,")
        transcript = transcript.replace("I'm sorry, but as an AI language model, I do not have personal experiences or emotions. However, ", "Sure! I can help you with that. With my current understanding, ")

        # Normalise double spaces after sentence ends, curly apostrophes, and
        # escaped underscores.
        transcript = transcript.replace(".  ", ". ")
        transcript = transcript.replace("’", "'")
        transcript = transcript.replace("\\_", "_")

        transcript = transcript.replace("As an AI language model, ", random.choice([
            "As I currently understand, ",
            "As far as I can tell, ",
            "From what I know, ",
            "According to my sources so far, ",
            "According to what I know, ",
            "From what I can tell, "
        ]))

        transcript = transcript.replace(
            "https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.4.0/p5.js",
            "https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.9.0/p5.js"
        )

        transcript = transcript.replace(
            "https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.4.0/p5.min.js",
            "https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.9.0/p5.min.js"
        )

        if not roles:
            transcript = transcript.replace("\n\nHuman: ", "")
            transcript = transcript.replace("\n\nAssistant: ", "")
            transcript = transcript.replace("<|endofprompt|>", "")

        return preamble + transcript


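A quick sketch of how the rolling window in `add_message` behaves: once `max_history` is exceeded, the oldest Human/Assistant pair is dropped. The standalone mock below (a hypothetical `demo_*` helper, not part of the class) mirrors that logic for illustration.

```python
def demo_rolling_history(turns, max_history=4):
    # Mirrors ChatHistory.add_message: append each message, then drop the
    # oldest two messages (one turn) once the cap is exceeded.
    history = []
    for role, content in turns:
        history.append({'role': role, 'content': content.strip()})
        if len(history) > max_history:
            history = history[2:]
    return history
```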
# class SQuADDataset(BaseTextDataset):
#     def __init__(self, tokenizer: BaseTokenizer, max_seq_length: int = 128, dataset_url: Optional[str] = None, save_dir: str = "./data", filename: str = "text_dataset.txt"):
#         super().__init__(tokenizer, max_seq_length, dataset_url)
#
#     def load_data(self, dataset_url: Optional[str] = None, save_dir: str = "./data", filename: str = "text_dataset.txt"):
#         if not os.path.isfile(dataset_url):
#             raise Exception(f"`{dataset_url}` does not exist!")
#
#         with open(dataset_url, 'r') as file:
#             data = json.load(file)
#
#         #
#         # Process into tokenized dataset.
#         #
#
#         # TODO: Scan for `[citation needed]`, `[year needed]` etc.
#         # - [dubious – discuss]
#         for data_part in tqdm(data['data'], desc="Loading", leave=True):
#             for para in data_part['paragraphs']:
#                 context = para['context']
#                 for qa in para['qas']:
#                     question = qa['question']
#                     is_impossible = qa['is_impossible'] or (len(context) == 0)
#                     answers = [ans['text'] for ans in qa['answers']] if not is_impossible else [""]
#
#                     # Notes:
#                     # `Assistant:` should always be the last entry preceded by `\n\n`, and any `Assistant` dialog should ALWAYS end in an EOT token.
#                     # - Allowing the AI to optimise for the EOT token allows it to signal when it's done speaking.
#                     # - Anthropic's Claude likely requires "\n\nHuman:" at the beginning, to reduce complexity in understanding where prompts begin and end.
#                     # - Thinking that we'll just have one participant talking to itself to train the model.
#                     # - When the model is trained a bit, add that inferior model as a participant and have the real data teach it.
#
#                     # Iterate through the answers.
#                     for answer in answers:
#                         _whitespace_text = self.make_whitespace()
#
#                         # TODO: Should we skip impossible questions during the fledgling stage of the model to prevent it learning to avoid answering?
#                         # TODO: Model seems to fail in reverse without the ability to push back against nonsense...
#                         if is_impossible:
#                             # "Assistant: I'm not entirely sure how to respond to this."
#                             agent_rejection = random.choice(self.REJECTIONS)
#
#                             # Select from `NON_ANSWERABLE_TEMPLATES` above.
#                             agent_response = random.choice(self.NON_ANSWERABLE_TEMPLATES)
#
#                             # Assistant: Is there anything else I can help with?
#                             agent_remark = random.choice(self.REMARKS)
#
#                             _templates = [
#                                 # Conversation with context and a question preceding a push back against the provided prompt.
#                                 f"{self.HUMAN_PROMPT}{context}{_whitespace_text}{question}{self.AI_PROMPT}{agent_rejection} {agent_response}{self.tokenizer.eot_text}",
#                                 # Conversation with context and a question preceding a push back against the provided prompt, with everything on the same line.
#                                 f"{self.HUMAN_PROMPT}{context}{_whitespace_text}{question}{self.AI_PROMPT}{agent_rejection} {agent_response}\n\n{agent_remark}{self.tokenizer.eot_text}",
#                                 # Conversation with context and a question preceding a push back against the provided prompt.
#                                 f"{self.HUMAN_PROMPT}{context}{_whitespace_text}{question}{self.AI_PROMPT}{agent_rejection} {agent_response}{self.tokenizer.eot_text}"
#                             ]
#
#                             for conversation in _templates:
#                                 # Encode into tokens then append to the dataset.
#                                 encoded_tokens = self.tokenizer.encode(conversation)
#
#                                 # Filter dataset by length.
#                                 if len(encoded_tokens) > self.max_seq_length:
#                                     continue
#
#                                 self.dataset.append(encoded_tokens)
#                         else:
#                             # Assistant: OK!
#                             agent_confirmation = random.choice(self.CONFIRMATIONS)
#
#                             # Format the answer into the `RESPONSE_TEMPLATES` from above.
#                             response_template = random.choice(self.RESPONSE_TEMPLATES)
#                             try:
#                                 agent_response = response_template.format(answer)
#                             except Exception as e:
#                                 print(response_template)
#                                 print(e)
#
#                             # Assistant: Is there anything else I can help with?
#                             agent_remark = random.choice(self.REMARKS)
#
#                             _templates = [
#                                 # Conversation with context and a question preceding a response.
#                                 f"{self.HUMAN_PROMPT}{context}{_whitespace_text}{question}{self.AI_PROMPT}{agent_response}{self.tokenizer.eot_text}",
#                                 # Conversation with general question preceding a contextual recitation and then a response.
#                                 f"{self.HUMAN_PROMPT}{question}{self.AI_PROMPT}{context}\n\n{agent_response}{self.tokenizer.eot_text}",
#                             ]
#
#                             for conversation in _templates:
#                                 # Encode into tokens then append to the dataset.
#                                 encoded_tokens = self.tokenizer.encode(conversation)
#
#                                 self.dataset.append(encoded_tokens)
#         return self.dataset


class JSONLConversationStream(BaseTextDataset):
    def __init__(self, tokenizer: BaseTokenizer, max_seq_length: int = 512, dataset_url: Optional[str] = None, save_dir: str = "./datasets", filename: str = "openorca_4m.jsonl", saturate=False):
        # We're jumping around the file, so we keep the handle.
        self.file_handle = None

        # Initialize an empty list to store line offsets.
        self.offsets = []

        self.chat = ChatHistory()
        self.saturate = saturate

        # `self.offsets` must be declared before `super().__init__`, which calls `load_data`.
        super().__init__(tokenizer, max_seq_length, dataset_url)

    def load_data(self, dataset_url: Optional[str] = None, save_dir: str = "./datasets", filename: str = "openorca_4m.jsonl"):
        if not os.path.isfile(dataset_url):
            raise Exception(f"`{dataset_url}` does not exist!")

        self.file_handle = open(dataset_url, 'r')
        self.num_entries = 0

        offset = 0
        with open(self.dataset_url, "r") as f:
            line = f.readline()
            while line != "":
                # Store the offset of the start of this line.
                self.offsets.append(offset)
                # Advance the offset to just past this line. Important: use
                # len(line.encode('utf-8')) rather than len(line); byte and
                # character counts differ for multi-byte characters.
                offset += len(line.encode('utf-8'))
                self.num_entries += 1
                line = f.readline()

    def __len__(self):
        return self.num_entries

    def __getitem__(self, idx):
        # Use the stored offset to read a specific line.
        self.file_handle.seek(self.offsets[idx])
        item = self.file_handle.readline()

        # Decode from the JSON representation.
        # id, prompt, instruction, output
        item = json.loads(item)

        assert 'conversation' in item

        for message in item['conversation']:
            self.chat.add_message(
                role=('Human' if message['role'] == 'user' else 'Assistant'),
                content=message['content'],
            )

        transcript = self.chat.transcript(roles=(not self.saturate))

        tokens = self.tokenizer.encode(transcript)

        # Truncate or pad to sequence length.
        if len(tokens) > self.max_seq_length:
            tokens = tokens[:self.max_seq_length]
            # Drop the oldest turn so the rolling window eventually fits.
            self.chat.pop_first_message()
            self.chat.pop_first_message()
        else:
            tokens += [self.tokenizer.pad_token] * (self.max_seq_length - len(tokens))

        # Causal language modelling learns to associate the current segment of text: "The quick brown fox",
        input_tokens = torch.tensor(tokens)
        # ...with the next segment of text: " quick brown fox".
        target_tokens = torch.cat((input_tokens[1:self.max_seq_length], torch.tensor([self.tokenizer.pad_token])), dim=-1)

        return input_tokens, target_tokens

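The byte-offset index built in `load_data` can be demonstrated in isolation: record where each line starts during one sequential pass, then `seek()` straight to any record for O(1) random access without holding the file in memory. The sketch below (hypothetical `demo_*` helpers) reads in binary mode so offsets are unambiguous byte positions.

```python
def demo_build_line_index(path):
    # One pass over the file, recording the byte offset of each line start.
    offsets, offset = [], 0
    with open(path, 'rb') as f:
        for line in f:
            offsets.append(offset)
            offset += len(line)  # byte length, so multi-byte characters count correctly
    return offsets


def demo_read_line(path, offsets, idx):
    # O(1) random access: seek to the recorded offset and read one line.
    with open(path, 'rb') as f:
        f.seek(offsets[idx])
        return f.readline().decode('utf-8')
```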
class JSONLStreamQA(BaseTextDataset):
    def __init__(self, tokenizer: BaseTokenizer, max_seq_length: int = 512, dataset_url: Optional[str] = None, save_dir: str = "./parakeet_squadv2gen", filename: str = "openorca_4m.jsonl", saturate=False):
        # We're jumping around the file, so we keep the handle.
        self.file_handle = None

        # Initialize an empty list to store line offsets.
        self.offsets = []

        self.chat = ChatHistory()
        self.saturate = saturate

        # `self.offsets` must be declared before `super().__init__`, which calls `load_data`.
        super().__init__(tokenizer, max_seq_length, dataset_url)

    def load_data(self, dataset_url: Optional[str] = None, save_dir: str = "./datasets", filename: str = "parakeet_squadv2gen.jsonl"):
        if not os.path.isfile(dataset_url):
            raise Exception(f"`{dataset_url}` does not exist!")

        self.file_handle = open(dataset_url, 'r')
        self.num_entries = 0

        offset = 0
        with open(self.dataset_url, "r") as f:
            line = f.readline()
            while line != "":
                # Store the offset of the start of this line.
                self.offsets.append(offset)
                # Advance the offset to just past this line (in bytes, not characters).
                offset += len(line.encode('utf-8'))
                self.num_entries += 1
                line = f.readline()

    def __len__(self):
        return self.num_entries

    def __getitem__(self, idx):
        # Use the stored offset to seek directly to a specific line.
        self.file_handle.seek(self.offsets[idx])
        item = self.file_handle.readline()

        # Decode from the JSON representation:
        # context, qas -> [{q, a}]
        item = json.loads(item)

        context = item['context']
        qas = item['qas']
        random.shuffle(qas)

        self.chat = ChatHistory()

        self.chat.add_message(role="Human", content=f"{context}")
        self.chat.add_message(role="Assistant", content=f"{item['summary']}\n\n{random.choice(self.REMARKS)}")

        # Keep at most five question/answer pairs per transcript.
        for i, qa in enumerate(qas):
            if i > 4:
                break

            self.chat.add_message(role="Human", content=qa['q'])
            self.chat.add_message(role="Assistant", content=qa['a'])

        transcript = self.chat.transcript(roles=(not self.saturate))

        tokens = self.tokenizer.encode(transcript)

        # Truncate or pad to the sequence length.
        if len(tokens) > self.max_seq_length:
            tokens = tokens[:self.max_seq_length]
        else:
            tokens += [self.tokenizer.pad_token] * (self.max_seq_length - len(tokens))

        # Causal language modelling learns to associate the current segment of text ("The quick brown fox")...
        input_tokens = torch.tensor(tokens)
        # ...with the same segment shifted one token ahead ("quick brown fox jumps").
        target_tokens = torch.cat((input_tokens[1:self.max_seq_length], torch.tensor([self.tokenizer.pad_token])), dim=-1)

        return input_tokens, target_tokens


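The byte-offset indexing used by `load_data` and `__getitem__` can be exercised end-to-end on a throwaway file. This is a self-contained sketch; the records and temporary path are made up for illustration:

```python
import json
import os
import tempfile

# Write a tiny JSONL file to index.
records = [{"id": i, "text": f"entry {i}"} for i in range(3)]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    path = f.name
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Pass 1: record the byte offset of the start of every line.
offsets = []
offset = 0
with open(path, "r", encoding="utf-8") as f:
    line = f.readline()
    while line != "":
        offsets.append(offset)
        offset += len(line.encode("utf-8"))  # byte length, not character length
        line = f.readline()

# Random access: seek straight to line 2 without reading lines 0 and 1.
with open(path, "rb") as f:
    f.seek(offsets[2])
    item = json.loads(f.readline())

os.remove(path)
print(item)  # {'id': 2, 'text': 'entry 2'}
```

The seek here is done on a binary handle because text-mode `seek` is only guaranteed to accept offsets previously returned by `tell`; the byte offsets computed in the first pass are exact either way.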
# class JSONLStreamGenerateQA(JSONLStreamQA):
#     def __getitem__(self, idx):
#         # Use the stored offset to seek directly to a specific line.
#         self.file_handle.seek(self.offsets[idx])
#         item = self.file_handle.readline()
#
#         # Decode from the JSON representation:
#         # context, qas -> [{q, a}]
#         item = json.loads(item)
#
#         context = item['context']
#         qas = item['qas']
#         random.shuffle(qas)
#
#         self.chat = ChatHistory()
#
#         n = random.randint(3, 9)
#         t = "JSON array in the form of 'query'/'response'"
#
#         self.chat.add_message(role="Human", content=f"{context}\n---\nPlease generate a list of {n} questions from this information in the form of a {t}.")
#
#         gen = [{
#             'query': qa['q'],
#             'response': qa['a']
#         } for qa in qas[:n]]
#         resp = json.dumps(gen, indent=2)
#
#         self.chat.add_message(role="Assistant", content=f"Sure! Here's a list of {n} entries in the format requested:\n\n```json\n{resp}\n```\n\n{random.choice(self.REMARKS)}")
#
#         transcript = self.chat.transcript(roles=(not self.saturate))
#
#         tokens = self.tokenizer.encode(transcript)
#
#         # Truncate or pad to the sequence length.
#         if len(tokens) > self.max_seq_length:
#             tokens = tokens[:self.max_seq_length]
#         else:
#             tokens += [self.tokenizer.pad_token] * (self.max_seq_length - len(tokens))
#
#         # Causal language modelling learns to associate the current segment of text ("The quick brown fox")...
#         input_tokens = torch.tensor(tokens)
#         # ...with the same segment shifted one token ahead ("quick brown fox jumps").
#         target_tokens = torch.cat((input_tokens[1:self.max_seq_length], torch.tensor([self.tokenizer.pad_token])), dim=-1)
#
#         return input_tokens, target_tokens


class JSONLStreamQASummary(JSONLStreamQA):
    def __getitem__(self, idx):
        # Use the stored offset to seek directly to a specific line.
        self.file_handle.seek(self.offsets[idx])
        item = self.file_handle.readline()

        # Decode from the JSON representation:
        # context, qas -> [{q, a}]
        item = json.loads(item)

        context = item['context']
        summary = item['summary']

        self.chat = ChatHistory()

        # Approximate word count of the summary, used to phrase the request.
        wc = len(summary.split(" "))

        key1 = random.choice(["context", "passage", "document", "extract", "text", "paragraphs", "input_document"])
        key2 = random.choice(["summary", "SUMMARISED", "summarised", "summarise", "summary1", "the_summary", "document_summarised", "summarised_document", "document_output", "output"])

        self.chat.add_message(role="Human", content=f"{context}\n---\nPlease summarise the document above in {wc} words. Show it in JSON with the keys {key1}, {key2}.")

        gen = {
            key1: context,
            key2: summary,
            "count": wc,
        }
        resp = json.dumps(gen, indent=4)

        self.chat.add_message(role="Assistant", content=f"```json\n{resp}\n```")

        transcript = self.chat.transcript(roles=(not self.saturate))

        tokens = self.tokenizer.encode(transcript)

        # Truncate or pad to the sequence length.
        if len(tokens) > self.max_seq_length:
            tokens = tokens[:self.max_seq_length]
        else:
            tokens += [self.tokenizer.pad_token] * (self.max_seq_length - len(tokens))

        # Causal language modelling learns to associate the current segment of text ("The quick brown fox")...
        input_tokens = torch.tensor(tokens)
        # ...with the same segment shifted one token ahead ("quick brown fox jumps").
        target_tokens = torch.cat((input_tokens[1:self.max_seq_length], torch.tensor([self.tokenizer.pad_token])), dim=-1)

        return input_tokens, target_tokens

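The truncate-or-pad step is shared verbatim by all three `__getitem__` implementations. Factored into a standalone helper for illustration (the name `fit_to_length` and the pad id `0` are invented here, not part of the code above):

```python
def fit_to_length(tokens, max_seq_length, pad_token=0):
    """Clip a token list to max_seq_length, or right-pad it with pad_token."""
    if len(tokens) > max_seq_length:
        return tokens[:max_seq_length]
    return tokens + [pad_token] * (max_seq_length - len(tokens))

print(fit_to_length([1, 2, 3], 5))           # [1, 2, 3, 0, 0]
print(fit_to_length([1, 2, 3, 4, 5, 6], 5))  # [1, 2, 3, 4, 5]
```

Padding on the right matches the shifted-target construction, which likewise appends a single pad token after the left-shifted inputs.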