from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

# AWQ-quantized OpenHermes-2.5-Mistral-7B checkpoint; the tokenizer comes from the original
# (unquantized) repository.
model_name_or_path = "TheBloke/OpenHermes-2.5-Mistral-7B-AWQ"

tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
# Alternative: load in fp16 with FlashAttention-2 enabled.
# model = AutoModelForCausalLM.from_pretrained(
#     model_name_or_path, torch_dtype=torch.float16, device_map="auto", attn_implementation="flash_attention_2"
# )
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto")

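# Optional sanity check (not in the original paste): confirm where the quantized weights landed
# and their approximate footprint before benchmarking. get_memory_footprint() counts parameters
# and buffers, which is where the packed AWQ weights live.
print("Model device:", model.device)
print("Model memory footprint (MB):", model.get_memory_footprint() * 1e-6)
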
article = '''

== BEGIN ARTICLE ==

Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗ Louis Martin† Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edunov Thomas Scialom∗
GenAI, Meta
Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned
large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our
models outperform open-source chat models on most benchmarks we tested, and based on
our human evaluations for helpfulness and safety, may be a suitable substitute for closed-
source models. We provide a detailed description of our approach to fine-tuning and safety
improvements of Llama 2-Chat in order to enable the community to build on our work and
contribute to the responsible development of LLMs.
∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com
†Second author
Figure 1: Helpfulness human evaluation results for Llama 2-Chat compared to other open-source and
closed-source models. Human raters compared model generations on ~4k prompts consisting of both single
and multi-turn prompts. The 95% confidence intervals for this evaluation are between 1% and 2%. More
details in Section 3.4.2. While reviewing these results, it is important to note that human evaluations
can be noisy due to limitations of the prompt set, subjectivity of the review guidelines, subjectivity of
individual raters, and the inherent difficulty of comparing generations.
Figure 2: Win-rate % for helpfulness and safety between commercial-licensed baselines and Llama 2-Chat,
according to GPT-4. To complement the human evaluation, we used a more capable model, not subject to
our own guidance. Green area indicates our model is better according to GPT-4. To remove ties, we used
win/(win+loss). The orders in which the model responses are presented to GPT-4 are randomly swapped
to alleviate bias.
1 Introduction
Large Language Models (LLMs) have shown great promise as highly capable AI assistants that excel in
complex reasoning tasks requiring expert knowledge across a wide range of fields, including in specialized
domains such as programming and creative writing. They enable interaction with humans through intuitive
chat interfaces, which has led to rapid and widespread adoption among the general public.
The capabilities of LLMs are remarkable considering the seemingly straightforward nature of the training
methodology. Auto-regressive transformers are pretrained on an extensive corpus of self-supervised data,
followed by alignment with human preferences via techniques such as Reinforcement Learning with Human
Feedback (RLHF). Although the training methodology is simple, high computational requirements have
limited the development of LLMs to a few players. There have been public releases of pretrained LLMs
(such as BLOOM (Scao et al., 2022), LLaMa-1 (Touvron et al., 2023), and Falcon (Penedo et al., 2023)) that
match the performance of closed pretrained competitors like GPT-3 (Brown et al., 2020) and Chinchilla
(Hoffmann et al., 2022), but none of these models are suitable substitutes for closed “product” LLMs, such
as ChatGPT, BARD, and Claude. These closed product LLMs are heavily fine-tuned to align with human
preferences, which greatly enhances their usability and safety. This step can require significant costs in
compute and human annotation, and is often not transparent or easily reproducible, limiting progress within
the community to advance AI alignment research.
In this work, we develop and release Llama 2, a family of pretrained and fine-tuned LLMs, Llama 2 and
Llama 2-Chat, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested,
Llama 2-Chat models generally perform better than existing open-source models. They also appear to
be on par with some of the closed-source models, at least on the human evaluations we performed (see
Figures 1 and 3). We have taken measures to increase the safety of these models, using safety-specific data
annotation and tuning, as well as conducting red-teaming and employing iterative evaluations. Additionally,
this paper contributes a thorough description of our fine-tuning methodology and approach to improving
LLM safety. We hope that this openness will enable the community to reproduce fine-tuned LLMs and
continue to improve the safety of those models, paving the way for more responsible development of LLMs.
We also share novel observations we made during the development of Llama 2 and Llama 2-Chat, such as
the emergence of tool usage and temporal organization of knowledge.
Figure 3: Safety human evaluation results for Llama 2-Chat compared to other open-source and closed-
source models. Human raters judged model generations for safety violations across ~2,000 adversarial
prompts consisting of both single and multi-turn prompts. More details can be found in Section 4.4. It is
important to caveat these safety results with the inherent bias of LLM evaluations due to limitations of the
prompt set, subjectivity of the review guidelines, and subjectivity of individual raters. Additionally, these
safety evaluations are performed using content standards that are likely to be biased towards the Llama
2-Chat models.
We are releasing the following models to the general public for research and commercial use‡:
1. Llama 2, an updated version of Llama 1, trained on a new mix of publicly available data. We also
increased the size of the pretraining corpus by 40%, doubled the context length of the model, and
adopted grouped-query attention (Ainslie et al., 2023). We are releasing variants of Llama 2 with
7B, 13B, and 70B parameters. We have also trained 34B variants, which we report on in this paper
but are not releasing.§
2. Llama 2-Chat, a fine-tuned version of Llama 2 that is optimized for dialogue use cases. We release
variants of this model with 7B, 13B, and 70B parameters as well.
We believe that the open release of LLMs, when done safely, will be a net benefit to society. Like all LLMs,
Llama 2 is a new technology that carries potential risks with use (Bender et al., 2021b; Weidinger et al., 2021;
Solaiman et al., 2023). Testing conducted to date has been in English and has not — and could not — cover
all scenarios. Therefore, before deploying any applications of Llama 2-Chat, developers should perform
safety testing and tuning tailored to their specific applications of the model. We provide a responsible use
guide¶ and code examples‖ to facilitate the safe deployment of Llama 2 and Llama 2-Chat. More details of
our responsible release strategy can be found in Section 5.3.
The remainder of this paper describes our pretraining methodology (Section 2), fine-tuning methodology
(Section 3), approach to model safety (Section 4), key observations and insights (Section 5), relevant related
work (Section 6), and conclusions (Section 7).
‡ https://ai.meta.com/resources/models-and-libraries/llama/
§ We are delaying the release of the 34B model due to a lack of time to sufficiently red team.
¶ https://ai.meta.com/llama
‖ https://github.com/facebookresearch/llama
Figure 4: Training of Llama 2-Chat: This process begins with the pretraining of Llama 2 using publicly
available online sources. Following this, we create an initial version of Llama 2-Chat through the application
of supervised fine-tuning. Subsequently, the model is iteratively refined using Reinforcement Learning
with Human Feedback (RLHF) methodologies, specifically through rejection sampling and Proximal Policy
Optimization (PPO). Throughout the RLHF stage, the accumulation of iterative reward modeling data in
parallel with model enhancements is crucial to ensure the reward models remain within distribution.
2 Pretraining
To create the new family of Llama 2 models, we began with the pretraining approach described in Touvron et al.
(2023), using an optimized auto-regressive transformer, but made several changes to improve performance.
Specifically, we performed more robust data cleaning, updated our data mixes, trained on 40% more total
tokens, doubled the context length, and used grouped-query attention (GQA) to improve inference scalability
for our larger models. Table 1 compares the attributes of the new Llama 2 models with the Llama 1 models.
2.1 Pretraining Data
Our training corpus includes a new mix of data from publicly available sources, which does not include data
from Meta’s products or services. We made an effort to remove data from certain sites known to contain a
high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this
provides a good performance–cost trade-off, up-sampling the most factual sources in an effort to increase
knowledge and dampen hallucinations.
We performed a variety of pretraining data investigations so that users can better understand the potential
capabilities and limitations of our models; results can be found in Section 4.1.
2.2 Training Details
We adopt most of the pretraining setting and model architecture from Llama 1. We use the standard
transformer architecture (Vaswani et al., 2017), apply pre-normalization using RMSNorm (Zhang and
Sennrich, 2019), use the SwiGLU activation function (Shazeer, 2020), and rotary positional embeddings
(RoPE, Su et al. 2022). The primary architectural differences from Llama 1 include increased context length
and grouped-query attention (GQA). We detail in Appendix Section A.2.1 each of these differences with
ablation experiments to demonstrate their importance.
Hyperparameters. We trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with β1 = 0.9,
β2 = 0.95, eps = 10^-5. We use a cosine learning rate schedule, with warmup of 2000 steps, and decay
final learning rate down to 10% of the peak learning rate. We use a weight decay of 0.1 and gradient clipping
of 1.0. Figure 5 (a) shows the training loss for Llama 2 with these hyperparameters.
         Training Data                  Params  Context Length  GQA  Tokens  LR
Llama 1  See Touvron et al. (2023)      7B      2k              ✗    1.0T    3.0 × 10^-4
                                        13B     2k              ✗    1.0T    3.0 × 10^-4
                                        33B     2k              ✗    1.4T    1.5 × 10^-4
                                        65B     2k              ✗    1.4T    1.5 × 10^-4
Llama 2  A new mix of publicly          7B      4k              ✗    2.0T    3.0 × 10^-4
         available online data          13B     4k              ✗    2.0T    3.0 × 10^-4
                                        34B     4k              ✓    2.0T    1.5 × 10^-4
                                        70B     4k              ✓    2.0T    1.5 × 10^-4
Table 1: Llama 2 family of models. Token counts refer to pretraining data only. All models are trained with
a global batch-size of 4M tokens. Bigger models — 34B and 70B — use Grouped-Query Attention (GQA) for
improved inference scalability.
[Figure 5 plot omitted: train perplexity vs. processed tokens (billions) for the Llama 2 7B, 13B, 34B, and 70B models]
Figure 5: Training Loss for Llama 2 models. We compare the training loss of the Llama 2 family of models.
We observe that after pretraining on 2T tokens, the models still did not show any sign of saturation.
Tokenizer. We use the same tokenizer as Llama 1; it employs a bytepair encoding (BPE) algorithm (Sennrich
et al., 2016) using the implementation from SentencePiece (Kudo and Richardson, 2018). As with Llama 1,
we split all numbers into individual digits and use bytes to decompose unknown UTF-8 characters. The total
vocabulary size is 32k tokens.
2.2.1 Training Hardware & Carbon Footprint
Training Hardware. We pretrained our models on Meta’s Research Super Cluster (RSC) (Lee and Sengupta,
2022) as well as internal production clusters. Both clusters use NVIDIA A100s. There are two key differences
between the two clusters, with the first being the type of interconnect available: RSC uses NVIDIA Quantum
InfiniBand while our production cluster is equipped with a RoCE (RDMA over converged Ethernet) solution
based on commodity ethernet switches. Both of these solutions interconnect 200 Gbps end-points. The
second difference is the per-GPU power consumption cap — RSC uses 400W while our production cluster
uses 350W. With this two-cluster setup, we were able to compare the suitability of these different types of
interconnect for large scale training. RoCE (which is a more affordable, commercial interconnect network)
can scale almost as well as expensive InfiniBand up to 2000 GPUs, which makes pretraining even more
democratizable. On A100s with RoCE and GPU power capped at 350W, our optimized codebase reached up
to 90% of the performance of RSC using IB interconnect and 400W GPU power.
              Time         Power            Carbon Emitted
              (GPU hours)  Consumption (W)  (tCO2eq)
Llama 2  7B   184320       400              31.22
         13B  368640       400              62.44
         34B  1038336      350              153.90
         70B  1720320      400              291.42
Total         3311616                       539.00
Table 2: CO2 emissions during pretraining. Time: total GPU time required for training each model. Power
Consumption: peak power capacity per GPU device for the GPUs used adjusted for power usage efficiency.
100% of the emissions are directly offset by Meta’s sustainability program, and because we are openly releasing
these models, the pretraining costs do not need to be incurred by others.
Carbon Footprint of Pretraining. Following preceding research (Bender et al., 2021a; Patterson et al., 2021;
Wu et al., 2022; Dodge et al., 2022) and using power consumption estimates of GPU devices and carbon
efficiency, we aim to calculate the carbon emissions resulting from the pretraining of Llama 2 models. The
actual power usage of a GPU is dependent on its utilization and is likely to vary from the Thermal Design
Power (TDP) that we employ as an estimation for GPU power. It is important to note that our calculations
do not account for further power demands, such as those from interconnect or non-GPU server power
consumption, nor from datacenter cooling systems. Additionally, the carbon output related to the production
of AI hardware, like GPUs, could add to the overall carbon footprint as suggested by Gupta et al. (2022b,a).
Table 2 summarizes the carbon emission for pretraining the Llama 2 family of models. A cumulative of
3.3M GPU hours of computation was performed on hardware of type A100-80GB (TDP of 400W or 350W).
We estimate the total emissions for training to be 539 tCO2eq, of which 100% were directly offset by Meta’s
sustainability program.∗∗ Our open release strategy also means that these pretraining costs will not need to
be incurred by other companies, saving more global resources.
2.3 Llama 2 Pretrained Model Evaluation
In this section, we report the results for the Llama 1 and Llama 2 base models, MosaicML Pretrained
Transformer (MPT)†† models, and Falcon (Almazrouei et al., 2023) models on standard academic benchmarks.
For all the evaluations, we use our internal evaluations library. We reproduce results for the MPT and Falcon
models internally. For these models, we always pick the best score between our evaluation framework and
any publicly reported results.
In Table 3, we summarize the overall performance across a suite of popular benchmarks. Note that safety
benchmarks are shared in Section 4.1. The benchmarks are grouped into the categories listed below. The
results for all the individual benchmarks are available in Section A.2.2.
• Code. We report the average pass@1 scores of our models on HumanEval (Chen et al., 2021) and
MBPP (Austin et al., 2021).
• Commonsense Reasoning. We report the average of PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019),
HellaSwag (Zellers et al., 2019a), WinoGrande (Sakaguchi et al., 2021), ARC easy and challenge
(Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), and CommonsenseQA (Talmor et al.,
2018). We report 7-shot results for CommonSenseQA and 0-shot results for all other benchmarks.
• World Knowledge. We evaluate the 5-shot performance on NaturalQuestions (Kwiatkowski et al.,
2019) and TriviaQA (Joshi et al., 2017) and report the average.
• Reading Comprehension. For reading comprehension, we report the 0-shot average on SQuAD
(Rajpurkar et al., 2018), QuAC (Choi et al., 2018), and BoolQ (Clark et al., 2019).
∗∗ https://sustainability.fb.com/2021-sustainability-report/
†† https://www.mosaicml.com/blog/mpt-7b
Model    Size  Code  Commonsense  World      Reading        Math  MMLU  BBH   AGI Eval
                     Reasoning    Knowledge  Comprehension
MPT      7B    20.5  57.4         41.0       57.5           4.9   26.8  31.0  23.5
         30B   28.9  64.9         50.0       64.7           9.1   46.9  38.0  33.8
Falcon   7B    5.6   56.1         42.8       36.0           4.6   26.2  28.0  21.2
         40B   15.2  69.2         56.7       65.7           12.6  55.4  37.1  37.0
Llama 1  7B    14.1  60.8         46.2       58.5           6.95  35.1  30.3  23.9
         13B   18.9  66.1         52.6       62.3           10.9  46.9  37.0  33.9
         33B   26.0  70.0         58.4       67.6           21.4  57.8  39.8  41.7
         65B   30.7  70.7         60.5       68.6           30.8  63.4  43.5  47.6
Llama 2  7B    16.8  63.9         48.9       61.3           14.6  45.3  32.6  29.3
         13B   24.5  66.9         55.4       65.8           28.7  54.8  39.4  39.1
         34B   27.8  69.9         58.7       68.0           24.2  62.6  44.1  43.4
         70B   37.5  71.9         63.6       69.4           35.2  68.9  51.2  54.2
Table 3: Overall performance on grouped academic benchmarks compared to open-source base models.
• Math. We report the average of the GSM8K (8 shot) (Cobbe et al., 2021) and MATH (4 shot)
(Hendrycks et al., 2021) benchmarks at top 1.
• Popular Aggregated Benchmarks. We report the overall results for MMLU (5 shot) (Hendrycks
et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong
et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.
As shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the
results on MMLU and BBH by ≈5 and ≈8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B
models outperform MPT models of the corresponding size on all categories besides code benchmarks. For the
Falcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks.
Additionally, Llama 2 70B model outperforms all open-source models.
In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown
in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant
gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al.,
2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4
and PaLM-2-L.
We also analysed the potential data contamination and share the details in Section A.6.
Benchmark (shots)           GPT-3.5  GPT-4  PaLM  PaLM-2-L  Llama 2
MMLU (5-shot)               70.0     86.4   69.3  78.3      68.9
TriviaQA (1-shot)           –        –      81.4  86.1      85.0
Natural Questions (1-shot)  –        –      29.3  37.5      33.0
GSM8K (8-shot)              57.1     92.0   56.5  80.7      56.8
HumanEval (0-shot)          48.1     67.0   26.2  –         29.9
BIG-Bench Hard (3-shot)     –        –      52.3  65.7      51.2
Table 4: Comparison to closed-source models on academic benchmarks. Results for GPT-3.5 and GPT-4
are from OpenAI (2023). Results for the PaLM model are from Chowdhery et al. (2022). Results for the
PaLM-2-L are from Anil et al. (2023).

== END ARTICLE ==

'''
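
# Aside (not part of the original script): the article's Table 2 is consistent with a simple estimate of
#   emissions (tCO2eq) ≈ GPU-hours × per-GPU power (kW) × carbon intensity (kgCO2eq/kWh) / 1000.
# The intensity factor is not stated in the excerpt; backing it out from the reported rows gives
# roughly 0.42 kgCO2eq/kWh for every model size (a consistency check only, not an official figure).
for size, gpu_hours, watts, tco2eq in [
    ("7B", 184_320, 400, 31.22),
    ("13B", 368_640, 400, 62.44),
    ("34B", 1_038_336, 350, 153.90),
    ("70B", 1_720_320, 400, 291.42),
]:
    kwh = gpu_hours * watts / 1000
    print(f"{size}: implied carbon intensity ≈ {tco2eq * 1000 / kwh:.3f} kgCO2eq/kWh")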

chat = [
    {
        "role": "system",
        "content": "You are Hermes 2, an unbiased AI assistant, and you always answer to the best of your ability."
    },
    {
        "role": "user",
        "content": (
            "You are given a partial and unparsed scientific article, please read it carefully and complete the "
            f"request below.{article}Please summarize the article in 5 sentences."
        )
    },
]
# Render the chat with the model's chat template, append the assistant header, and tokenize.
processed_chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(processed_chat, return_tensors='pt').to(model.device)
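# Optional (not in the original paste): peek at the rendered prompt and its length before generating.
# The exact special tokens depend on the chat template shipped with the tokenizer.
print(processed_chat[:200])
print("Prompt length (tokens):", input_ids.input_ids.shape[1])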

streamer = TextStreamer(tokenizer)

# Run generate and measure peak memory and throughput with CUDA events.
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
torch.cuda.reset_peak_memory_stats(model.device)
torch.cuda.empty_cache()
torch.cuda.synchronize()
start_event.record()
generation_output = model.generate(**input_ids, do_sample=False, max_new_tokens=512, streamer=streamer)
# Alternative: enable prompt-lookup decoding (supported in recent transformers releases).
# generation_output = model.generate(**input_ids, do_sample=False, max_new_tokens=512, streamer=streamer, prompt_lookup_num_tokens=10)
end_event.record()
torch.cuda.synchronize()
max_memory = torch.cuda.max_memory_allocated(model.device)
print("Max memory (MB): ", max_memory * 1e-6)
new_tokens = generation_output.shape[1] - input_ids.input_ids.shape[1]
print("Throughput (tokens/sec): ", new_tokens / (start_event.elapsed_time(end_event) * 1.0e-3))

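# A minimal sketch (hypothetical helper, not part of the original paste) that wraps the timing logic
# above so the plain greedy run and the prompt-lookup-decoding variant from the commented-out call
# (prompt_lookup_num_tokens=10, available in recent transformers releases) can be compared back to back.
# It reuses the global `model` and `input_ids` defined above.
def timed_generate(**gen_kwargs):
    torch.cuda.reset_peak_memory_stats(model.device)
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = model.generate(**input_ids, do_sample=False, max_new_tokens=512, **gen_kwargs)
    end.record()
    torch.cuda.synchronize()
    tokens_per_sec = (out.shape[1] - input_ids.input_ids.shape[1]) / (start.elapsed_time(end) * 1e-3)
    peak_mb = torch.cuda.max_memory_allocated(model.device) * 1e-6
    return tokens_per_sec, peak_mb

# Example usage:
# tps_greedy, mem_greedy = timed_generate()
# tps_pld, mem_pld = timed_generate(prompt_lookup_num_tokens=10)
# print(f"greedy: {tps_greedy:.1f} tok/s @ {mem_greedy:.0f} MB; prompt lookup: {tps_pld:.1f} tok/s @ {mem_pld:.0f} MB")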