a guest
Jun 27th, 2019
  1. First, the data comes from [this BigQuery table of Reddit posts](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_posts.2016_08) using the following query:
  2.  
  3. SELECT subreddit, domain, url, num_comments, score, over_18, is_self FROM [fh-bigquery:reddit_posts.2016_08]
  4. WHERE url IN (SELECT url FROM [fh-bigquery:reddit_posts.2016_08] GROUP BY url HAVING count(url) > 1)
  5.  
  6. This selects every post (not comment) whose URL was also submitted in at least one other post. The idea behind the network is that shared URLs represent shared interests. To use a more recent month, change the table name in both places it appears in the query.
  7.  
  8. To save you some time learning BigQuery, you can access the resulting CSV here (37MB compressed, 1.2M rows): https://www.dropbox.com/s/tyypnsr4wd15hey/crossposts_aug_2016_v2.csv.gz?dl=0
  9.  
  10. Now, sadly I've lost the code I used to process this into a map. But it would have been something like this:
  11.  
  12. Stage 1: Make a list of the subreddits appearing next to each unique URL, e.g. from these rows:
  13.  
  14. NoMansSkyTheGame,imgur.com,http://imgur.com/a/kWgnj,7,4,false,false
  15. nomanshigh,imgur.com,http://imgur.com/a/kWgnj,1,1,false,false
  16.  
  17. Make this:
  18.  
  19. "http://imgur.com/a/kWgnj":["NoMansSkyTheGame","nomanshigh"]
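
Since the original code is lost, here is a minimal Python sketch of how Stage 1 might look. It assumes the columns come in the same order as the SELECT in the query (subreddit first, url third), and it assumes you want each subreddit listed only once per URL even if it posted that URL several times. The function name subreddits_by_url is made up for illustration.

```python
import csv
from collections import defaultdict

def subreddits_by_url(rows):
    """Map each URL to the distinct subreddits that posted it.

    Assumes each row is laid out as in the query:
    subreddit, domain, url, num_comments, score, over_18, is_self
    """
    groups = defaultdict(list)
    for row in rows:
        subreddit, url = row[0], row[2]
        # Assumption: a subreddit is listed once per URL, even if it reposted it.
        if subreddit not in groups[url]:
            groups[url].append(subreddit)
    return dict(groups)

# The two example rows from above, parsed as CSV lines.
sample = [
    "NoMansSkyTheGame,imgur.com,http://imgur.com/a/kWgnj,7,4,false,false",
    "nomanshigh,imgur.com,http://imgur.com/a/kWgnj,1,1,false,false",
]
url_groups = subreddits_by_url(csv.reader(sample))
print(url_groups)
```

For the full file you could probably pass csv.reader(gzip.open(path, "rt", newline="")) instead of the sample list. Whether the CSV starts with a header row is something to check first; if it does, skip it with next() before grouping.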
  20.  
  21. Stage 2: For each unique URL, connect every pair of its subreddits (round-robin style), e.g.
  22.  
  23. ["pics","EarthPorn","nature","funny"]
  24.  
  25. Becomes this set of edges:
  26.  
  27. pics,EarthPorn
  28. pics,nature
  29. pics,funny
  30. EarthPorn,nature
  31. EarthPorn,funny
  32. nature,funny
  33.  
  34. (Depending on how you're doing this, you might like to sort each pair alphabetically so you always end up with 'nature,pics' instead of sometimes having 'pics,nature' in your edge list.)
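
A rough Python sketch of Stage 2, again a reconstruction rather than the original code. It uses itertools.combinations to produce every pair, sorts case-insensitively so each pair always comes out in the same order, and counts how many URLs each pair shares. The names edges_for and edge_weights and the example URLs u1 and u2 are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def edges_for(subreddits):
    """Return every pair of subreddits, each pair in a consistent order."""
    # Sort case-insensitively so 'nature' always precedes 'pics';
    # the second key element only breaks ties deterministically.
    ordered = sorted(set(subreddits), key=lambda s: (s.lower(), s))
    return list(combinations(ordered, 2))

def edge_weights(url_groups):
    """Count how many URLs each pair of subreddits has in common."""
    weights = Counter()
    for subreddits in url_groups.values():
        weights.update(edges_for(subreddits))
    return weights

edges = edges_for(["pics", "EarthPorn", "nature", "funny"])
for a, b in edges:
    print(f"{a},{b}")

# Hypothetical input: two URLs, both shared by pics and EarthPorn.
weights = edge_weights({
    "u1": ["pics", "EarthPorn", "nature"],
    "u2": ["EarthPorn", "pics"],
})
print(weights[("EarthPorn", "pics")])  # 2
```

Because of the sorting, the six edges come out as EarthPorn,funny through nature,pics rather than in the input order shown above, but they are the same six pairs.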
  35.  
  36. I can't remember whether or not I corrected for subreddit popularity. For instance, which of the following relationships is closer?
  37.  
  38. * pics and EarthPorn with 50 shared posts out of 20,000 total posts, or
  39. * PenmanshipPorn and calligraphy with 10 shared posts out of 200 total posts?
  40.  
  41. I'd argue it's the latter, but that's an exercise for you to play around with.
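
One simple way to play with this is to divide the shared posts by the total posts, so small but tightly linked subreddits are not drowned out by large ones. This is only one possible correction, not necessarily the one used for the original map, and what counts as "total posts" for a pair (the combined count, the smaller subreddit's count, and so on) is left as an assumption. The function name shared_fraction is made up for illustration.

```python
def shared_fraction(shared_posts, total_posts):
    """Fraction of posts that are shared between two subreddits.

    How total_posts is counted for a pair is an assumption you need to pick.
    """
    if total_posts <= 0:
        raise ValueError("total_posts must be positive")
    return shared_posts / total_posts

# The two relationships from the example above.
pics_earthporn = shared_fraction(50, 20_000)
penmanship_calligraphy = shared_fraction(10, 200)

print(pics_earthporn, penmanship_calligraphy)
print(penmanship_calligraphy > pics_earthporn)  # True
```

By this measure the second relationship works out to 0.05 against 0.0025 for the first, so PenmanshipPorn and calligraphy come out about 20 times closer despite having fewer shared posts.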