a guest
Jul 15th, 2017
  1. Let me preface this by saying I am a former Sc2 plat 1 player [long live race random]. I saw my community go through some confusion about the distribution of skill brackets before MMR for ALL *ranked* players became publicly accessible and Blizz made some definitive statements about the distribution of ranked players vs unranked/custom-game-only players.
  2.  
  3.  
  4.  
  5. First, before we address my ideas about the limitations of the data collection for the two recent posts with different data sets resulting in an [infographic](https://www.reddit.com/r/DotA2/comments/6ndcr5/infographic_mmr_distribution_2017/) by /u/pohka and [separate graph](https://www.reddit.com/r/DotA2/comments/6n1u5o/the_true_mmr_distribution/) by /u/bishof, we need to decide if we are trying to measure the 'mmr distribution' of ranked players or the 'skill distribution' of all players.
  6.  
  7. We need to recognize before moving forward that (unless someone has solid data on this, or a statement from Valve, and if so please let me know) we don't know the ratio of Dota ranked players to players who only play unranked and custom games. This is relevant if only because it would be a factor in telling us how large a sample we would need, given /u/pohka's and /u/bishof's methods, before we could conclude the results are possibly representative of the actual curve, even before we get to the ways in which our methodology might skew our results.
  8.  
  9. First, for reference, let's look at an example of an estimated distribution with a known sample size and a known, stated bias, with no attempt to correct for it:
  10. Here is [OpenDota's estimated distribution](https://www.opendota.com/distributions), taking careful note of their sample size (2.1 million TOTAL), methodology, and consequent disclaimer, which reads:
  11. "!! This data set is limited to players displaying MMR on profile and sharing public match data.
  12. Players do not have to sign in, but due to the opt-in nature of the data collected, averages are likely higher than expected."
  13. The reason OpenDota does not correct for this known factor is that it would be impossible to correct for it accurately without first knowing the ratio of players who expose data to those who do not, and how that ratio skews in terms of skill. It seems reasonable to assume that higher mmr players are more likely to expose their data, or, as several people pointed out in /u/bishof's post, perhaps people closer to milestones are more likely to expose. But it would be unreasonable to assume we know by how much this skews the data, and in what exact ways, so it is best to leave the estimation unadjusted, as OpenDota does, and put up the disclaimer. That makes the estimation a little more scientifically useful in that it is aware that it may not be very accurate, though unfortunately neither it nor we are aware of how far off it may be.
  14.  
  15. The sample size of /u/pohka is as follows:
  16. 89,332 for US West
  17. 298,595 for US East
  18. 359,083 for EU East
  19. 527,830 for EU West
  20. 161,752 for South Korea
  21. 652,053 for Russia
  22. 1,114,753 for SEA [wow!]
  23. Total Data: 3,203,398 pieces
  24. No Data for Japan, Peru, Chile, South America, Dubai, India, Australia, and 5 Chinese servers.
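
As a quick sanity check on the list above, the regional counts can be summed and turned into shares. This is only a sketch: the dictionary name and region labels are mine, and it assumes each "piece" of data counts once toward the total.

```python
# Counts per region as listed above for /u/pohka's data set.
REGION_COUNTS = {
    "US West": 89_332,
    "US East": 298_595,
    "EU East": 359_083,
    "EU West": 527_830,
    "South Korea": 161_752,
    "Russia": 652_053,
    "SEA": 1_114_753,
}
STATED_TOTAL = 3_203_398  # "Total Data" figure from the list

total = sum(REGION_COUNTS.values())
print("sum matches stated total:", total == STATED_TOTAL)

# Largest regions first, with each region's share of all collected data.
for region, count in sorted(REGION_COUNTS.items(), key=lambda item: item[1], reverse=True):
    print(f"{region:12s} {count:>9,d}  {count / total:6.1%}")
```

The listed counts do add up to the stated total, and SEA alone is roughly a third of it, which is the imbalance between servers I am asking about.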
  25. I didn't talk much about /u/pohka's data since I have some questions about why we have such significantly different amounts of data for different servers. I have messaged /u/pohka asking about his/her methodology and will update this later to reflect /u/pohka's reply.
  26. Update: /u/pohka says he/she did not originally collect the data, so due to ongoing conversation about sources, I have focused my discussion on /u/bishof's wonderful graphic. :)
  27. It appears that code at least similar to that available [here](https://github.com/odota/core) was used by someone to extract the data available [here](https://gist.github.com/vvekic/), which /u/pohka then downloaded and used for his/her graphic. (I think that someone was /u/bishof, based on him [linking to this data source](https://www.reddit.com/r/DotA2/comments/6n1u5o/the_true_mmr_distribution/dk6ok55/) in the comments of his original post, along with more histograms based on it, though his original post and graph have a different sample size.)
  28.  
  29. The sample size as stated by /u/bishof in the original post is as follows:
  30. 1223845 players (I have messaged him asking about his methodology)
  31. More importantly, I think:
  32. 380,774 MATCHES, and most importantly (thank you /u/bishof for including this!) data was collected from:
  33. 2017-07-10 22:50:07 to 2017-07-12 14:58:57 [time zone not given]
  34. This is 1 day and 16 hours of data--or rather data captured over a continuous period of 1 day and 16 hours.
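
The length of that window can be checked directly from the two timestamps. A minimal sketch, assuming both timestamps are in the same (unstated) time zone:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
start = datetime.strptime("2017-07-10 22:50:07", FMT)
end = datetime.strptime("2017-07-12 14:58:57", FMT)

window = end - start                      # a timedelta
hours = window.total_seconds() / 3600

print(window)           # 1 day, 16:08:50
print(round(hours, 2))  # 40.15
```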
  35.  
  36.  
  37. Our expectation about what sorts of players would be more naturally collected given any methodology should affect how we look at this data, just as with OpenDota. Voters polled by phone are not necessarily going to give you the same data as voters polled by e-mail or in person, and unless you adjust your model correctly for those factors, and anticipate other factors, your polls will likely not reflect the actual vote. They may not even be close. For example, if you poll millions of Americans using landline phones only, you will get a very different percentage opinion on any subject than may be 'actually representative', given the tendency of landline poll respondents to be older than the rest of the populace and therefore usually representative of an entirely different culture.
  38.  
  39. So focusing on /u/bishof's data, as he/she was diligent enough to give the time frame, let's think in terms of what expectations we might form, knowing that it captured matches played over almost 2 days, about how representative it might be of the 'active ranked playerbase', once again ignoring the ratio of ranked to unranked-only players. But first we need to know how many active players we have, and we need to assume that all players captured were unique, which is one of many reasons I have messaged /u/bishof asking about his methodology. This section will be updated later to reflect his reply.
  40.  
  41.  
  42. Currently if you log in to Dota, at the top right hand corner Valve announces: 12,265,863 UNIQUE PLAYERS IN LAST MONTH, with 771,375 players in game right at the moment of this post being typed [16:08:07 UTC, Jul. 15, 2017]. It is unclear whether this counts log-ins or only counts players who actually play a game, but I believe it highly likely that this number is generated from any player who interacts with the Valve Dota2 servers, even if they don't queue for a match.
  43.  
  44. Meanwhile, SteamSpy's estimate for Dota installs (owners) is 104,723,744, with an estimated uncertainty of 249,376. They also estimate the player count in the past two weeks at 8,989,814 (8.58% of the estimated playerbase), with an estimated uncertainty of 83,584.
  45.  
  46. So given specifically /u/bishof's data, it is reasonable to assume we are polling around 5-10% of the 'active' playerbase, if the playerbase is defined as players who log in (and probably play a game) at least once a month. In Blizzard's Sc2, much more than half of the average 1.5-2 million monthly log-ins play co-op mode, campaign, custom games, or unranked. Ranked ladder comprises about 200,000 to 450,000 players globally per month, depending on the month (giving it, by the way, as a 1v1 game a very healthy skill distribution on ladder at any given time of the day, with matches usually found in less than a minute and only occasionally taking up to 2 minutes, so you can sod right off, dedgaem memers). Memeing aside, this is as low as 10% of the total active playerbase, and, might I add, a much lower percentage once you count those who bought the game but log in less than once a month [several hundreds of thousands of people], who simply binge every couple of months [currently me], or who don't play anymore at all [millions of people].
  47. So given SteamSpy's numbers, we are polling around 1% of the 'total' playerbase, which is a fun number, but not as relevant as the 'active' base.
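
The percentages above can be reproduced from the quoted figures. A rough sketch, assuming every one of /u/bishof's 1,223,845 players is unique and also appears in Valve's monthly count (neither is confirmed); the variable names are mine:

```python
sampled_players = 1_223_845     # /u/bishof's stated player count
monthly_unique = 12_265_863     # Valve's in-client "unique players in last month"
steamspy_owners = 104_723_744   # SteamSpy owner estimate
steamspy_two_weeks = 8_989_814  # SteamSpy two-week player estimate


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


active_share = share(sampled_players, monthly_unique)
total_share = share(sampled_players, steamspy_owners)
two_week_share = share(steamspy_two_weeks, steamspy_owners)

print(f"sample vs monthly active: {active_share:.1f}%")    # 10.0%
print(f"sample vs all owners:     {total_share:.1f}%")     # 1.2%
print(f"two-week players/owners:  {two_week_share:.2f}%")  # 8.58%
```

On these numbers the sample sits at the top of the 5-10% range for the 'active' base and at roughly 1% of all owners.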
  48.  
  49. So what guesses can we make about the players captured by /u/bishof that might affect WHICH 10% we are getting from our roughly 12 million monthly players? Certainly all of the ones we are getting play ranked, which is NOT representative of the total. To put this into numbers, 100% of collected players are ranked players, whereas some number less than 100% of the 12 million play ranked. If we were trying to find where, say, X mmr lay on the spectrum of 'skill' as opposed to the 'distribution of ranked players', we would have to ask ourselves questions about the population of unranked players that are very difficult to answer. First: How many don't play ranked at all? Second: What is the skill distribution of unranked players, since it would not be correct to pile them all below ranked players?
  50.  
  51. We are unable to answer those questions without help from Valve, so let's instead try to just ask ourselves where X mmr lies on a distribution of other ranked players, using /u/bishof's data, and with a mind to what kind of biases our data may have, other than the obvious(and intentional, and necessary) bias towards only including ranked players.
  52. For this I will use some extrapolation from Sc2 to make some stupidly wild guesses about the ranked-only population. Let's assume for a minute that 50% of active Dota players do not play ranked, whether they play only unranked, custom games, just Siltbreaker, or whatever. The percentage, in my opinion, is likely higher, but given Dota does not have a campaign and a lot of things push you towards playing ranked in Dota, I have chosen a significantly smaller percentage of non-ranked players than I otherwise would have.
  53.  
  54. Now we have polled 1.2 million players over the course of a continuous period of 1 DAY and 16 HOURS of an estimated playerbase of 7 million PER MONTH.
  55. We have managed to capture 17% of our wildly estimated active playerbase (per month). But which 17%? If there is a solid way in place to make sure only distinct players are added to our data, and perhaps players already seen have their data rewritten to update their current mmr, what would we expect the model to look like after running for 1 day and 16 hours?
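
Since the ranked share is a guess, it helps to see how much the coverage figure moves with it. A small sketch under the same assumptions as above (unique players, all counted in Valve's monthly figure); the function name is mine. Note that a strict 50% of 12.27 million is about 6.1 million, so the round 7 million used above corresponds to a ranked share of roughly 57%.

```python
SAMPLED_PLAYERS = 1_223_845   # /u/bishof's stated player count
MONTHLY_UNIQUE = 12_265_863   # Valve's in-client monthly unique players


def ranked_coverage(ranked_fraction):
    """Share of the assumed monthly ranked population present in the sample."""
    ranked_population = MONTHLY_UNIQUE * ranked_fraction
    return SAMPLED_PLAYERS / ranked_population


# The ranked fractions below are guesses, not measured values.
for fraction in (0.30, 0.50, 0.57, 0.70):
    print(f"ranked share {fraction:.0%}: sample covers {ranked_coverage(fraction):.1%}")
```

Anywhere in that range of guesses, the sample covers well under half of the monthly ranked population, so the question of which part it covers still matters.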
  56.  
  57. After 2 weeks?
  58.  
  59. After 4 weeks?
  60.  
  61. How often does a TYPICAL <1k player play? If they are to be included in the 'active playerbase' we are trying to model, then by definition they play at least once a month. It stands to reason that any ranked player in our estimate of 7 million per month who plays only once per month (even one session with multiple games) is unlikely to have been captured in a data stream running continuously over 2 days.
  62. So what if we assume that lower mmr is correlated with playing less often? It's not an unreasonable assumption, and it's actually one that can be tested. And it is the central question of this entire post!!
  63.  
  64.  
  65. If a 5k player TENDS to TYPICALLY play more often than a 2k player, then they are more likely to be caught up in the initial data sweeps than the 2k player.
  66. Therefore, the longer the count is compiled(as long as we have a way to keep players from being counted twice, even if they change their name), the lower the average is likely to be. This is a hypothesis that can actually be tested!!
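
To make the hypothesis concrete, here is a toy model. Every number in it is HYPOTHETICAL: the brackets, their shares, and the daily play probabilities are invented purely to illustrate the mechanism, and it assumes each player's chance of playing is independent from day to day.

```python
# HYPOTHETICAL brackets: (mmr, share of ranked players, chance of playing on a given day)
BRACKETS = [
    (1000, 0.20, 0.05),
    (2000, 0.30, 0.10),
    (3000, 0.30, 0.20),
    (4000, 0.15, 0.35),
    (5000, 0.05, 0.50),
]


def observed_mean_mmr(days):
    """Expected mean MMR of the distinct players seen at least once in `days` days."""
    weighted_mmr = 0.0
    total_seen = 0.0
    for mmr, share, p_daily in BRACKETS:
        p_seen = 1 - (1 - p_daily) ** days  # chance of appearing at least once
        weighted_mmr += mmr * share * p_seen
        total_seen += share * p_seen
    return weighted_mmr / total_seen


true_mean = sum(mmr * share for mmr, share, _ in BRACKETS)

for days in (2, 14, 30, 150):
    print(f"{days:3d} days: observed mean {observed_mean_mmr(days):7.1f}  (true mean {true_mean:.1f})")
```

In this toy setup the observed mean starts well above the true mean and drifts down towards it as the window grows, which is the pattern the hypothesis predicts. Whether real players behave this way is exactly what would need testing.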
  67. If our ability to make sure players aren't counted twice is limited in some way(like if we can't account for when they change their name or something, but I don't think this is a problem if the data is using Valve's API, since I think there are unique player IDs), then we might actually see the distribution go up after recording for a long time.
  68. But it would be affected by other factors too! For example, The International coming up might bring in more players, or cause players in certain skill brackets to play less if they are watching games during certain time periods. Maybe higher mmr players are more interested in the qualifiers than lower mmr players, and during the broadcasts might play less (or more, because they are engaging with the content) than other groups.
  69. In order for the data to be accurate, it has to be compiling constantly over a period longer than a single month, to get an accurate picture of a statistically 'normal' month.
  70.  
  71. It is my hypothesis that the longer the data is compiled using matches, the more it will skew towards lower mmrs, though obviously with a limit near the actual distribution.
  72. This [additional data](https://www.reddit.com/r/DotA2/comments/6n1u5o/the_true_mmr_distribution/dk6ok55/) from /u/bishof seems to suggest something interesting, and similar, given its differences from the smaller data set:
  73. 2,202,492 matches (2017-07-03 13:59:54 - 2017-07-12 13:24:47)
  74. 7 regions (USW, USE, EUW, SEA, RUS, EUE, SK)
  75. 2,435,312 unique players
  76.  
  77. This means our average rate of adding new players (y) per match (x) is 1.10570753492. If we could determine the limit of newly encountered players per match, we could start to get an accurate picture of how long continuous data collection would still be useful. 1.1 per match is huge, by the way. That is a lot of players still being added, enough to shift the distribution very quickly over the course of several hundred thousand matches (which is only a couple of days).
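
The ratio can be recomputed for both of /u/bishof's stated data sets. This sketch assumes the two sets were collected the same way and are directly comparable, which I have not confirmed; the function and variable names are mine.

```python
def players_per_match(unique_players, matches):
    """Cumulative average of distinct players collected per match parsed."""
    return unique_players / matches


long_window = players_per_match(2_435_312, 2_202_492)  # 2017-07-03 to 2017-07-12
short_window = players_per_match(1_223_845, 380_774)   # 2017-07-10 to 2017-07-12

print(round(long_window, 4))   # 1.1057
print(round(short_window, 4))  # 3.2141
```

Both figures are cumulative averages over a whole window, not the rate at which the most recent matches add players. Getting that marginal rate, and its limit, would need the time at which each player was first seen, which neither post provides.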
  78.  
  79. If we wanted to be predictive in future data captures, we could graph the mmr of newly added players after the first 5-10 days of data and see if their mmr trends down; in other words, whether newly encountered players after x days are more likely to be closer to y mmr. If it trends downward, we can take the slope of that trend and use it to normalize any data set longer than a day or so, based on the limiting rate at which new players are added and the trend in their mmr. For example, say you have a distribution in which <Xk mmr is <10% of a data set that currently encounters 1.45 new players per match. If you know the rate N of new players per match parsed as the capture window approaches 31 days, and that the mmr of those new players is trending downward with a certain steepness, you can normalize your data to create a model of what you actually think the current 'active player (N over 0 to 31 days)' distribution may really look like, which may look very different from <Xk mmr being <10%.
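
The first step of that idea, measuring the trend in the mmr of newly encountered players, could look something like this. The daily values are HYPOTHETICAL placeholders, not collected data, and a straight-line fit is only one simple choice of trend model.

```python
# HYPOTHETICAL values for illustration only:
# day of capture -> mean MMR of players first seen on that day.
first_seen_mean_mmr = {1: 3100, 2: 2950, 3: 2850, 4: 2780, 5: 2700, 6: 2660, 7: 2610}


def least_squares_slope(points):
    """Slope of the ordinary least-squares line through (day, mmr) points."""
    xs = list(points.keys())
    ys = list(points.values())
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


slope = least_squares_slope(first_seen_mean_mmr)
print(f"trend: {slope:.1f} MMR per day")
```

A clearly negative slope on real data would support the hypothesis that late-arriving players are lower mmr. The normalization itself would also need the rate of new players per match over time, so this is only the measurement half.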
  80.  
  81. I would like to see this tested over the course of 5 months, with continuous data and a rigorous methodology, before we assume we actually know what we think we might.
  82. Also we all need to be aware of our own biases and their consequences: As visitors to this sub, we are already likely much more invested in this game, generally speaking, than someone who has never sought out a community to discuss and meme about Dota2. Just like on an online chess forum, there are assumptions we make about things we think 'everyone' knows or should know about the different openings and counters, and we would be exasperated and surprised to learn someone still doesn't know to control the middle, or to sac a piece for a more powerful piece, or what order and circumstances determine which pieces are more valuable.
  83.  
  84. We likely look down at players who are themselves in a position to look down at other players whom we do not even consider. In Sc2 there was a tendency to look down on anyone below gold, and a toxic mentality to deride them as mentally deficient. But as someone who fought my way from bronze to plat 1, and who eventually surpassed the derision of the sub, I realized everyone was a bunch of hypocrites. The average at the time was around gold, so anyone slightly below that was dirt, and anyone slightly above was more respected. But even when I was in Silver there was a significant and measurable difference in game knowledge and skill between me and a bronze player, and I could take games off of those who considered my rank to be dirt. Had I been thinner of skin, or differently motivated, hanging out on /r/starcraft would have made me quit sc2 before I reached gold, and certainly before I reached plat.
  85.  
  86. Not only will good knowledge of the actual distribution using good methodology and adjustment for the biases we will naturally create given the way we gather data edify our understanding of skill, it will also help us move forward with a common basis of facts that is absent when making assumptions and guesses and further absent when the 'facts' are not rigorously enough proven to convince all of their veracity and validity in context of other known facts. (like if you are at the bottom of the mmr distribution, you might still be in the top 60% of players given many do not play ranked often or play the game 'seriously' at all)
  87.  
  88. *Our attitudes about lack of skill will determine how welcoming our community is to new players. It is hypocritical to denigrate low mmr players and then complain about a lack of new players, so I know no single user on this sub would express both sentiments publicly, but I do think as a community we need to bring these two contradictory philosophies into alignment with each other, and a common base of solid, well proven facts would be a good common ground to begin with.*
  89.  
  90. I would like to see Valve release percentages for all brackets, especially including numbers on unranked and custom game players, and inactive players, so that new players who calibrate at 1.5k and lose games until they are at 500mmr, and old players who are at 200mmr because they play all hero challenge in ranked all the time, or simply because they are at 200mmr currently with no need to justify or excuse it, can feel permission to participate in the community, and feel the privilege of belonging to a select group of those who dedicate a portion of their lives to improving their skill, learning the instrument of Dota2, and having fun no matter how good they are, as they most certainly are, regardless of their skill, part of the community--perhaps the most important part.
  91.  
  92.  
  93.  
  94. *TL DR: Since the data was compiled over a limited time period, it is my hypothesis that it skews higher, as I suppose higher mmr players play more often than lower mmr players (TYPICALLY).
  95.  
  96. I also suppose there are other factors that play into the collection of this data based on its methodology and we would need a good critical examination of the methodology and its consequent biases and an accurate normalization before we make any conclusions.*