Untitled

a guest
Oct 17th, 2017
/**
The idea is to "attach" something (an annotation, edit, image, link, whatever)
to a particular piece of text that is not necessarily delimited by an element.
In other words, some free-form text. Whether this text comes from HTML or from a text file
is unimportant. The point is to find this attachment point even when:
- the order of paragraphs is altered
- the order of sentences in a paragraph is altered
- the order of words in a sentence is altered

We would also like to still find the attachment point with high probability when:
- the words before, after and within the attachment point have been changed, deleted or added to.

The basic idea is that, in order to create a memory
of a certain location in the source, we extract multiple layers
of features (or patterns, or signals).
Our "matching function", which ranks candidate attachment points
by how closely we believe they match the remembered point,
is a combination of scores derived from these signals.

Some ideas I have for signals so far:
- bag of words / word vector: take inner product to produce score
- bag of letter trigrams / trigram vector: take inner product to produce score
- exact match: 1 for an exact match, 0 for a mismatch
- edit distance / alignment to produce score
- word bigram vector: take inner product to produce score
- paragraph index: symmetric (absolute) difference to produce score
- sentence index relative to document: symmetric (absolute) difference to produce score
- sentence index relative to paragraph: symmetric (absolute) difference to produce score
- first, middle or last sentence: 0 or 1 to produce score
- first, middle or last paragraph: 0 or 1 to produce score
- the sentence prior and the sentence after

So to make a memory, we record the exact text of the sentence we are memorizing.
We also record the sentence and paragraph indices, and the values of the features we cannot
compute from the extracted text alone (first, middle or last; sentence prior and sentence after).

Then, to compute a match, we run the following algorithm:
- Look for an exact match for the extract. If there is exactly one, we have found it; otherwise continue.
- Compute the values of all the signals for the extracted sentence and for every candidate sentence in the source,
possibly weighting each signal, and then compute match scores between the signal values of the extracted sentence
and those of each candidate. Rank the candidates, breaking aggregate-score ties in favor of the sentence that comes earliest in the document.
- Attempt to apply the edit, annotation or other modification to the highest-ranked sentence. If it works, say:
"The sentence we're editing has changed, and this may not be the sentence we were looking for. Click here to see the next 10 best
candidates for the sentence we were looking for."
If it doesn't work, attempt to apply it to each of the next 10 best matches in turn. If one works, display the same message as above.
If none works, apply it anyway to the top-ranked sentence and leave a note that says:
"The sentence we're editing has changed or moved, and we are not sure if this is the sentence we were looking for. Sorry.
This can happen when the document was edited after we marked it. Click here to see the next 20 best candidates
for the sentence we were looking for."

The aim is to get as close as we can to the best possible result without understanding semantics.
**/

// we break "sentences" on these marks
const SEN_MARK = {
  en: [ ".", "'", '"', ":", ";", "!", "?", "()", "[]", "“”", "‘’" ],
  zh: [ "。", "「」", "﹁ ﹂", ";", ":", "!", "?", "()", "[]", "【】", "“”", "‘’", "《》", "〈〉" ],
  es: [ ".", "'", '"', ":", ";", "¡!", "¿?", "()", "[]", "⟨⟩", "“”", "‘’", "‹›", "«»" ],
  hi: [ "|", ";", "?", "!", "”" ],
  ar: [ ".", "؟", ":", "“”" ]
};
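
// A minimal sketch of how the text-based signals listed in the notes could be scored.
// Everything named here (words, countVector, trigrams, wordBigrams, cosine, editDistance,
// textSignals) is hypothetical and not part of the original code. One assumption to flag:
// the notes say "take inner product", and this sketch divides the inner product by the
// vector lengths (cosine) so that every score lands in [0, 1] and can be weighted later.

```javascript
// Hypothetical helper: lower-case the text and split it into word tokens.
function words(text) {
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
}

// Count occurrences of each item, giving a sparse vector stored as a Map.
function countVector(items) {
  const v = new Map();
  for (const it of items) v.set(it, (v.get(it) || 0) + 1);
  return v;
}

// Letter trigrams of the lower-cased, whitespace-normalized text.
function trigrams(text) {
  const s = text.toLowerCase().replace(/\s+/g, " ").trim();
  const out = [];
  for (let i = 0; i + 3 <= s.length; i++) out.push(s.slice(i, i + 3));
  return out;
}

// Adjacent word pairs, joined with a space.
function wordBigrams(text) {
  const w = words(text);
  const out = [];
  for (let i = 0; i + 1 < w.length; i++) out.push(w[i] + " " + w[i + 1]);
  return out;
}

// Inner product of two sparse vectors, normalized by their lengths (assumption).
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (const [k, x] of a) {
    na += x * x;
    if (b.has(k)) dot += x * b.get(k);
  }
  for (const x of b.values()) nb += x * x;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Levenshtein edit distance, computed with a single rolling row.
function editDistance(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = row[0]; // value diagonally up-left of the current cell
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = row[j];
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(row[j] + 1, row[j - 1] + 1, prev + cost);
      prev = tmp;
    }
  }
  return row[b.length];
}

// One score per text-based signal; higher means a closer match.
function textSignals(remembered, candidate) {
  const longest = Math.max(remembered.length, candidate.length) || 1;
  return {
    exact: remembered === candidate ? 1 : 0,
    words: cosine(countVector(words(remembered)), countVector(words(candidate))),
    trigrams: cosine(countVector(trigrams(remembered)), countVector(trigrams(candidate))),
    bigrams: cosine(countVector(wordBigrams(remembered)), countVector(wordBigrams(candidate))),
    edit: 1 - editDistance(remembered, candidate) / longest,
  };
}
```

// Because the word vector ignores order, a sentence whose words were only reordered still
// scores close to 1 on `words`, while `bigrams` and `edit` drop. That is the reason for
// keeping several signals rather than one.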

// Record a memory of one sentence of the source (not implemented yet).
function remember( letter_index_from_source, sentence_text, source ) {

}
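
// A rough sketch of what making a memory could look like, following the notes: record the
// exact sentence text, its paragraph and sentence indices, and the features that cannot be
// recomputed from the text alone. All names here (splitSentences, position, makeMemory) are
// hypothetical. Assumptions: paragraphs are separated by blank lines, sentences end at
// ".", "!" or "?" followed by whitespace (a small subset of the break marks), and the
// sentence is chosen by its index in the document rather than by a letter index.

```javascript
// Split the source into sentences, remembering where each one sits.
function splitSentences(source) {
  const out = [];
  const paragraphs = source.split(/\n\s*\n/).filter((p) => p.trim() !== "");
  paragraphs.forEach((p, paragraphIndex) => {
    const sentences = p.trim().split(/(?<=[.!?])\s+/).filter((s) => s !== "");
    sentences.forEach((text, indexInParagraph) => {
      out.push({
        text,
        paragraphIndex,
        indexInParagraph,
        documentIndex: out.length,
        sentencesInParagraph: sentences.length,
        paragraphCount: paragraphs.length,
      });
    });
  });
  return out;
}

// "first", "middle" or "last" for an index within a run of `count` items.
// A run of one item counts as "first".
function position(index, count) {
  if (index === 0) return "first";
  return index === count - 1 ? "last" : "middle";
}

// Build the memory record for the sentence at `documentIndex`.
function makeMemory(source, documentIndex) {
  const all = splitSentences(source);
  const s = all[documentIndex];
  if (!s) throw new RangeError("no sentence at index " + documentIndex);
  return {
    text: s.text,
    paragraphIndex: s.paragraphIndex,
    sentenceIndexInDocument: s.documentIndex,
    sentenceIndexInParagraph: s.indexInParagraph,
    sentencePosition: position(s.indexInParagraph, s.sentencesInParagraph),
    paragraphPosition: position(s.paragraphIndex, s.paragraphCount),
    sentencePrior: documentIndex > 0 ? all[documentIndex - 1].text : null,
    sentenceAfter: documentIndex + 1 < all.length ? all[documentIndex + 1].text : null,
  };
}
```

// The record is plain data, so it can be serialized with JSON.stringify and stored next to
// the annotation it belongs to.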

// Find the remembered sentence in a (possibly edited) source (not implemented yet).
function find( sentence_text, source_dependent_scores, source ) {

}
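
// A hedged sketch of the matching steps from the notes: try a unique exact match first,
// otherwise score every sentence, rank, break ties by earliest position, and fall back
// through the next best candidates when applying the edit fails. To stay short it combines
// only two signals (a normalized word-vector inner product and closeness of the sentence
// index). The names findCandidates and applyToBest, the weights and the `tryApply` callback
// are all assumptions made for illustration.

```javascript
// Sparse word-count vector of a text, stored as a Map.
function wordCounts(text) {
  const v = new Map();
  for (const w of text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || []) {
    v.set(w, (v.get(w) || 0) + 1);
  }
  return v;
}

// Inner product normalized by the vector lengths, so the score is in [0, 1].
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (const [k, x] of a) {
    na += x * x;
    if (b.has(k)) dot += x * b.get(k);
  }
  for (const x of b.values()) nb += x * x;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// memory: { text, sentenceIndexInDocument }; sentences: array of strings.
function findCandidates(memory, sentences, weights = { words: 0.8, index: 0.2 }) {
  // Step 1: a single exact match settles it.
  const exact = [];
  sentences.forEach((s, i) => { if (s === memory.text) exact.push(i); });
  if (exact.length === 1) {
    return { exact: true, ranked: [{ index: exact[0], score: 1 }] };
  }
  // Step 2: score every sentence and rank, earliest sentence first on ties.
  const target = wordCounts(memory.text);
  const ranked = sentences.map((s, index) => {
    const wordsScore = cosine(target, wordCounts(s));
    const indexScore = 1 / (1 + Math.abs(index - memory.sentenceIndexInDocument));
    return { index, score: weights.words * wordsScore + weights.index * indexScore };
  });
  ranked.sort((a, b) => b.score - a.score || a.index - b.index);
  return { exact: false, ranked };
}

// Step 3: try the best candidate, then the next `fallbackCount` ones.
// tryApply(index) is a hypothetical callback that returns true when the edit applied.
function applyToBest(result, tryApply, fallbackCount = 10) {
  if (result.ranked.length === 0) return null;
  for (const c of result.ranked.slice(0, 1 + fallbackCount)) {
    if (tryApply(c.index)) return { index: c.index, certain: result.exact, forced: false };
  }
  // Nothing applied cleanly: fall back to the top-ranked sentence and flag it.
  return { index: result.ranked[0].index, certain: false, forced: true };
}
```

// The `certain` and `forced` flags are meant to drive which of the two messages from the
// notes gets shown: no message for a unique exact match, the "may not be the sentence"
// message when certain is false, and the "we are not sure ... Sorry" message when forced is true.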