
Untitled

a guest
Aug 14th, 2018
  1. Store Twitter data so that each tweet has the following fields:
  2.  
  3. - Original tweet (+)
  4. - Clean preprocessed tweet
  5. - List of URLs from tweet
  6. - List of Hashtags from tweet
  7. - tweet type (+)
  8. - author (+)
  9. - timestamp (+)
  10. - number of likes
  11. - number of retweets
  12. - number of comments
  13. - language
  14. - mentioned users (@)
  15.  
  16. With this approach we will not lose any information; we only restructure it in a more useful way.
  17.  
  18. Fields marked with (+) are the ones we already have.
  19.  
  20. The clean preprocessed tweet will be used to convert the tweet into a vector.
  21. The list of URLs will be useful later if we want to follow a URL and download information from the web page.
  22. The list of hashtags will be used for vector creation; we haven't decided yet exactly how.
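
The field list above could be captured in a single record type. The following is only a minimal sketch: the class name TweetRecord, the field names, and the example values are assumptions made for illustration, not an agreed schema. Fields marked (+) are required; everything else has an empty default until it is collected or computed.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class TweetRecord:
    # Fields marked (+) in the notes: already available, so required here.
    original_text: str
    tweet_type: str  # hypothetical values such as "tweet", "retweet", "reply"
    author: str
    timestamp: datetime

    # Fields still to be produced by preprocessing or extra collection.
    clean_text: str = ""
    urls: List[str] = field(default_factory=list)
    hashtags: List[str] = field(default_factory=list)
    mentions: List[str] = field(default_factory=list)
    likes: Optional[int] = None
    retweets: Optional[int] = None
    comments: Optional[int] = None
    language: Optional[str] = None


# Example record built only from the fields we already have.
record = TweetRecord(
    original_text="Loving #python today http://example.com",
    tweet_type="tweet",
    author="some_user",
    timestamp=datetime(2018, 8, 14, 12, 0),
)
```

Keeping original_text untouched next to the derived fields is what guarantees that no information is lost.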
  23.  
  24.  
  25. Regarding deleting hashtags, I think we need to distinguish two cases:
  26. 1) The hashtag is a word inside a sentence, and removing it would break the meaning of the sentence. In this case we just remove the # sign and keep the hashtag word.
  27. 2) There is a list of hashtags outside the sentence, and removing them will not break the sentence. In this case we delete the hashtags from the tweet and store them separately in the list of hashtags.
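
One possible way to tell the two cases apart is sketched below. It assumes a simple heuristic: a hashtag that sits directly next to another hashtag belongs to a list (case 2), while an isolated hashtag is part of the sentence (case 1). The function name split_hashtags and the whitespace tokenization are assumptions for illustration; a hashtag with punctuation attached (for example "#python,") is not recognized by this sketch, and a single hashtag at the end of a tweet is treated as case 1.

```python
import re
from typing import List, Tuple

# A token counts as a hashtag only if it is '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def split_hashtags(text: str) -> Tuple[str, List[str]]:
    """Return (text with hashtags handled, list of all hashtag words)."""
    tokens = text.split()
    is_tag = [HASHTAG.fullmatch(t) is not None for t in tokens]

    # Every hashtag word goes into the list, whichever case it falls under.
    hashtags = [t[1:] for t, tag in zip(tokens, is_tag) if tag]

    kept = []
    for i, token in enumerate(tokens):
        if not is_tag[i]:
            kept.append(token)
            continue
        has_tag_neighbor = (i > 0 and is_tag[i - 1]) or (
            i + 1 < len(tokens) and is_tag[i + 1]
        )
        if has_tag_neighbor:
            continue  # case 2: part of a hashtag list, drop from the text
        kept.append(token[1:])  # case 1: keep the word, drop only the '#'
    return " ".join(kept), hashtags


clean, tags = split_hashtags("I love #python for data work #nlp #ml")
# clean == "I love python for data work"
# tags == ["python", "nlp", "ml"]
```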
  28.  
  29. PREPROCESSING (how to get the clean preprocessed tweet)
  30.  
  31. - delete non-English tweets (or detect the language and store it as a feature of the tweet)
  32. - move all URLs to a list and save it as a tweet feature; remove the URLs from the tweet text.
  33. - copy all hashtag words to the list of hashtags and save this list as a new feature. In the original text, find every hashtag that has another hashtag as a neighbor and delete it. Then remove all remaining '#' signs.
  34. - delete the @user names mentioned in the tweet and move them to a separate feature.
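
The URL and @user steps above might look roughly like the sketch below. The regular expressions are deliberately simple and are assumptions, not a full URL or username grammar; the function name extract_urls_and_mentions and the dictionary keys are also made up for illustration. Language detection and hashtag handling are left out of this sketch.

```python
import re
from typing import Dict, List, Union

# Simplified patterns: anything starting with http:// or https:// up to the
# next whitespace is a URL; '@' followed by word characters is a mention.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")


def extract_urls_and_mentions(text: str) -> Dict[str, Union[str, List[str]]]:
    # URLs are handled first because a URL may itself contain an '@'.
    urls = URL.findall(text)
    text = URL.sub(" ", text)

    # Store user names without the leading '@'.
    mentions = [m[1:] for m in MENTION.findall(text)]
    text = MENTION.sub(" ", text)

    # Collapse the whitespace left behind by the removals.
    clean = " ".join(text.split())
    return {"clean_text": clean, "urls": urls, "mentions": mentions}


result = extract_urls_and_mentions(
    "@alice check this out http://example.com/page now"
)
# result["clean_text"] == "check this out now"
# result["urls"] == ["http://example.com/page"]
# result["mentions"] == ["alice"]
```

Note that the mention pattern would also match the part after '@' in an e-mail address, so real data may need a stricter rule.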
  35.  
  36. These are the basic preprocessing steps we will need in any case. Further steps will be specified after the first results.