<!-- Pastebin paste "Untitled" by a guest, Dec 14th, 2025 -->
<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<title>
sacrificing accessibility for not getting web-scraped
</title>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style>
@font-face {
  font-family: "Space Grotesk";
  src: url("../fonts/SpaceGrotesk-VariableFont.ttf") format("truetype");
  font-weight: 100 900;
}

@font-face {
  font-family: "Mulish";
  src: url("../fonts/Mulish-VariableFont.ttf") format("truetype");
  font-weight: 100 900;
}

/* Fill whole page and keep footer at bottom. Reset margins. */
html,
body {
  height: 100%;
  margin: 0;
}

/* Keep scrollbar always visible to keep spacing consistent. */
html {
  overflow-y: scroll;
}

body {
  display: flex;
  flex-direction: column;

  background-color: rgb(255, 230, 187);

  font-family: "Mulish", sans-serif;
  font-optical-sizing: auto;
  font-weight: 400;
  font-style: normal;
  font-size: 18px;
}

h1,
h2 {
  text-align: center;

  font-family: "Mulish", sans-serif;
  font-optical-sizing: auto;
  font-weight: 600;
  font-style: normal;
}

p {
  margin-top: 0.5em;
}

/* Override justification inside tables */
table,
th,
td,
.footnotes {
  text-align: left;
}

header h1 {
  padding: 0.25em 0em;
  margin: 0em;

  font-size: 70px;

  font-family: "Space Grotesk", sans-serif;
  font-optical-sizing: auto;
  font-style: normal;
  font-weight: 600;
}

a {
  text-decoration: none;
  font-weight: 700;
  transition: color 0.66s ease-in-out;
  color: #c07139;
}
a:hover {
  color: #8b4513;
}

/* Set consistent width and margins. Justify text. */
header,
main,
footer {
  width: 600px;
  margin: 0 auto;
  text-align: justify;
}

/* spacer-div keeps footer at the bottom of the page. */
.spacer {
  flex: 1;
}

footer {
  padding: 1.5em;
  text-align: center;
}

@font-face {
  font-family: "Mulish-scrambled";
  src: url("../fonts/Mulish-Regular-scrambled.ttf") format("truetype");
  font-weight: 100 900;
}
main {
  font-family: "Mulish-scrambled", sans-serif;
  font-optical-sizing: auto;
  font-weight: 400;
  font-style: normal;
  font-size: 18px;
}
.code-snippet {
  background: rgb(192, 113, 57);
  overflow-x: auto;
  padding: 0.5em;
  margin: 0.5em;
}
details {
  padding: 0.5em 0;
}
</style>
</head>
<body>
<header>
<h1>
<a href="/index.html">TIL SCHÜNEMANN</a>
</h1>
</header>
<main>
<div class="content">
<h1>sacrificing accessibility for not getting web-scraped</h1>
<p>
LLMs have taken the world by storm, and they need ever-increasing amounts of training data to improve.
Copyright laws get broken, content gets aggressively scraped, and even if you have deleted your original work, it might still show up because it got cached or archived at some point.
</p>
<p>
Now, if you subscribe to the idea that your content shouldn't be used for training, you don't have much say.
I wondered how I personally would mitigate this on a technical level.
</p>
<h2>et tu, caesar?</h2>
<p>
In my linear algebra class we discussed <a href="#footnote-1" id="ref-1">the Caesar cipher<sup>[1]</sup></a> as a simple encryption algorithm:
Every letter gets shifted by n positions in the alphabet. If you know (or guess) the shift, you can recover the original text.
Brute force or character heuristics break this easily.
</p>
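<p>
As a rough illustration of the shift and of why brute force works, here is a minimal Python sketch. The names <code>caesar</code> and <code>brute_force</code> are made up for this example, and it assumes that only ASCII letters are shifted while everything else passes through unchanged.
</p>

```python
import string

ALPHABET = string.ascii_lowercase


def caesar(text: str, shift: int) -> str:
    """Shift every ASCII letter by `shift` positions, wrapping around at z."""
    k = shift % 26
    shifted = ALPHABET[k:] + ALPHABET[:k]
    table = str.maketrans(ALPHABET + ALPHABET.upper(), shifted + shifted.upper())
    return text.translate(table)


def brute_force(ciphertext: str) -> list[str]:
    """Return all 26 candidate plaintexts; a reader or a heuristic picks one."""
    return [caesar(ciphertext, -shift) for shift in range(26)]


secret = caesar("et tu, caesar?", 3)
print(secret)  # hw wx, fdhvdu?

# With only 26 possible shifts, the original is always among the candidates.
candidates = brute_force(secret)
print(candidates[3])  # et tu, caesar?
```

<p>
Since there are only 26 shifts to try, the key space is small enough to read through by hand.
</p>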
<p>
But we can apply this substitution more generally to a font!
A font contains a cmap (character map), which maps codepoints to glyphs. A codepoint identifies a character or complex symbol, and the glyph is its visual shape.
We scramble the font's codepoint-to-glyph mapping and rewrite the text with the inverse of the scramble, so it still reads correctly for our readers.
The page displays correctly, but the inspected (or scraped) HTML stays scrambled. Theoretically, you could apply a different scramble to each request.
</p>
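<p>
To make the scramble and its inverse concrete, here is a small sketch that simulates the font with plain dictionaries instead of a real cmap table. Everything in it (<code>make_scramble</code>, <code>translate</code>, the seed values) is hypothetical and only models the idea: the font maps stored characters to shapes, and the text is pre-translated with the inverse mapping.
</p>

```python
import random
import string


def make_scramble(seed: int) -> tuple[dict[str, str], dict[str, str]]:
    """Return (font_map, text_map) for one seeded shuffle of the ASCII letters.

    font_map[c] is the letter shape the scrambled font draws for character c.
    text_map is its inverse: text_map[c] is the character to store in the HTML
    so that the shape of c appears on screen.
    """
    letters = list(string.ascii_letters)
    shapes = letters[:]
    random.Random(seed).shuffle(shapes)
    font_map = dict(zip(letters, shapes))
    text_map = {shape: char for char, shape in font_map.items()}
    return font_map, text_map


def translate(text: str, mapping: dict[str, str]) -> str:
    """Replace every mapped character; leave digits, spaces, punctuation alone."""
    return "".join(mapping.get(c, c) for c in text)


original = "Take this, web scrapers!"
font_map, text_map = make_scramble(seed=1234)

stored = translate(original, text_map)   # what ends up in the HTML source
rendered = translate(stored, font_map)   # what the scrambled font displays

print(stored)
print(rendered == original)  # True
```

<p>
Calling <code>make_scramble</code> with a different seed yields a different pair of mappings, which is roughly what a per-request scramble would look like, at the cost of generating and serving one font file per seed.
</p>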
<p>
This works as long as scrapers don't fall back to OCR for edge cases like this, but I don't think running OCR on every page would be feasible for them.
</p>
<p>
I also tested whether ChatGPT could decode a ciphertext if I told it that a substitution cipher was used, and after some back and forth, it gave me the result: <i>One day Alice went down a rabbit hole, and found herself in Wonderland, a strange and magical place filled with...</i>
</p>
<p>
...which, funnily enough, didn't resemble the original text at all! This might have happened because the training corpus contains <a href="#footnote-2" id="ref-2">Alice and Bob<sup>[2]</sup></a> as the standard party names for showcasing encryption.
</p>
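<p>
For context on what "character heuristics" can mean for a substitution cipher, here is a hedged sketch: a substitution only renames letters, so the shape of the letter-frequency distribution survives the scramble. The helpers <code>substitute</code> and <code>frequency_profile</code> are illustrative names and are not part of the code I used for testing.
</p>

```python
import random
import string
from collections import Counter


def substitute(text: str, seed: int) -> str:
    """Apply one seeded random substitution to the lowercase ASCII letters."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return text.translate(str.maketrans(dict(zip(letters, shuffled))))


def frequency_profile(text: str) -> list[int]:
    """Letter counts from most to least common, ignoring which letter it is."""
    counts = Counter(c for c in text if c in string.ascii_lowercase)
    return sorted(counts.values(), reverse=True)


plain = "there is no free lunch, and this method comes with major drawbacks"
scrambled = substitute(plain, seed=1234)

# The letters change, but the shape of the frequency distribution does not.
print(scrambled)
print(frequency_profile(plain) == frequency_profile(scrambled))  # True
```

<p>
On a long enough text, matching that profile against typical English letter frequencies gives a solver a starting point, which is the kind of heuristic that works without OCR.
</p>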
<details>
<summary>The code I used for testing: (click to expand)</summary>
<div class="code-snippet">
<code style="white-space: pre;"># /// script
# requires-python = "&gt;=3.12"
# dependencies = [
#     "bs4",
#     "fonttools",
# ]
# ///
import random
import string
from typing import Dict

from bs4 import BeautifulSoup
from fontTools.ttLib import TTFont


def scramble_font(seed: int = 1234) -&gt; Dict[str, str]:
    random.seed(seed)
    font = TTFont("src/fonts/Mulish-Regular.ttf")

    # Pick a Unicode cmap (Windows BMP preferred, else the last subtable)
    cmap_table = font["cmap"].tables[-1]
    for table in font["cmap"].tables:
        if table.isUnicode() and table.platformID == 3:
            cmap_table = table
            break

    cmap = cmap_table.cmap

    # Filter codepoints for a-z and A-Z
    codepoints = [cp for cp in cmap.keys() if chr(cp) in string.ascii_letters]
    glyphs = [cmap[cp] for cp in codepoints]

    shuffled_glyphs = glyphs[:]
    random.shuffle(shuffled_glyphs)

    # Create the new letter mapping. Update the cmap in place so that digits,
    # punctuation and whitespace keep their original glyphs.
    scrambled_cmap = dict(zip(codepoints, shuffled_glyphs, strict=True))
    cmap.update(scrambled_cmap)

    # Invert the scramble: for each original letter, find the codepoint that
    # now shows its glyph. That codepoint is what goes into the HTML.
    translation_mapping = {}
    for original_cp, original_glyph in zip(codepoints, glyphs, strict=True):
        for new_cp, new_glyph in scrambled_cmap.items():
            if new_glyph == original_glyph:
                translation_mapping[chr(original_cp)] = chr(new_cp)
                break

    font.save("src/fonts/Mulish-Regular-scrambled.ttf")

    return translation_mapping


def scramble_html(
    input: str,
    translation_mapping: Dict[str, str],
) -&gt; str:
    def apply_cipher(text):
        repl = "".join(translation_mapping.get(c, c) for c in text)
        return repl

    # Parse the HTML string
    soup = BeautifulSoup(input, "html.parser")

    # Find all main elements
    main_elements = soup.find_all("main")
    skip_tags = {"code", "h1", "h2"}

    # Apply cipher only to text within main
    for main in main_elements:
        for elem in main.find_all(string=True):
            if elem.parent.name not in skip_tags:
                elem.replace_with(apply_cipher(elem))

    return str(soup)
</code>
</div>
</details>
<h2>drawbacks</h2>
<p>
There is no free lunch, and this method comes with major drawbacks:
</p>
<ul>
<li>copy-paste breaks</li>
<li>accessibility for screen readers and non-graphical browsers like w3m is gone</li>
<li>your search rank will drop</li>
<li>font kerning could get messed up (if you are not using a monospace font)</li>
<li>probably more</li>
</ul>
<p>
On the plus side, you read this article using my own scrambled font. Take this, web scrapers!
</p>
</div>
<div class="footnotes">
<h2>footnotes</h2>
<p>
You can click on the footnote index to jump back:
</p>
<ul>
<li id="footnote-1">
<a href="#ref-1"><sup>[1]</sup></a> <a href="https://en.wikipedia.org/wiki/Caesar_cipher">https://en.wikipedia.org/wiki/Caesar_cipher</a>
</li>
<li id="footnote-2">
<a href="#ref-2"><sup>[2]</sup></a> <a href="https://en.wikipedia.org/wiki/Alice_and_Bob">https://en.wikipedia.org/wiki/Alice_and_Bob</a>
</li>
</ul>
</div>
</main>
<div class="spacer"></div>
<footer>
made with <span style="color: saddlebrown">♥</span> by Til Schünemann
</footer>
<script defer src="https://static.cloudflareinsights.com/beacon.min.js/vcd15cbe7772f49c399c6a5babf22c1241717689176015" integrity="sha512-ZpsOmlRQV6y907TI0dKBHq9Md29nnaEIPlkf84rnaERnq6zvWvPUqr2ft8M1aS28oN72PdrCzSjY4U6VaAw1EQ==" data-cf-beacon='{"version":"2024.11.0","token":"f6d3e9f932164d77b025bcfdfbeae066","r":1,"server_timing":{"name":{"cfCacheStatus":true,"cfEdge":true,"cfExtPri":true,"cfL4":true,"cfOrigin":true,"cfSpeedBrain":true},"location_startswith":null}}' crossorigin="anonymous"></script>
</body>
</html>