Dice Roller 3000

By treybert, in X-Wing

Time to geek it up. I work in a field where we use statistics often, and Green01's value is actually quite interesting. Just because something may be statistically insignificant does not mean that its effect is insignificant in game. A P value of 0.18 means that there is a 0.18-out-of-1 chance that the observed values happened by chance. If you think of it that way, Green01 really may not be that fair compared to Green02. A better way to look at this would be to check out the percentages too.

Excellent! I was having such a hard time wrapping my head around stats from college. Thanks.

As I was testing Green01, my test cases with around 1,000 entries each were giving P-values around 60%. Then in the final test, it kept dropping the more entries were added. Once I get in the rest of the dice, I'll probably come back around and test Green01 again.

Then we need to check Vassal's dice!

Please do retest Green01! There is a trap in statistics: with more and more data, you are doomed to statistical significance. So be careful about doing too many tests.

Also, definitely compare the percentages of blanks between the two dice and the P values for them. That'll give you an idea of which dice might give you fewer chances of blanks when you use them.
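
If it helps, here's a minimal sketch of that pairwise comparison. The counts are invented for illustration (the per-die blank tallies aren't in this post), and Python with scipy is an assumption, since treybert hasn't said what his scripts are written in:

from scipy.stats import chi2_contingency

# Hypothetical blank/non-blank tallies for two dice; these counts are
# made up for illustration, not taken from the experiment.
green01 = [2650, 4265]   # [blanks, non-blanks]
green02 = [2590, 4325]

# Tests whether the blank rate differs between the two dice.
chi2, p, dof, expected = chi2_contingency([green01, green02])
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")

A small p-value here would suggest the two dice really do differ in blank rate, though (as discussed later in the thread) running many such pairwise comparisons inflates the false-positive rate.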

Hmm, now there's a small business that's sure to get a loan.

I think we need to get Treybert on Shark Tank and get this funded properly!!! :P

That is a cool project!

<math stuff>

Be careful not to test too many dice ;) , because then you are bound to find some which have a lower p-value. If I remember correctly, if you do such a test in psychology, you have to adjust the alpha value by dividing it by the number of tests. So if you test ten dice in a single experiment, you have to test against a value of 0.05/10 = 0.005.

If you were to test 100 dice, for instance, you should expect to find about 5 dice with a p-value lower than 0.05. But that does not necessarily mean that these dice are not fair: it just means that in a random experiment with enough samples, even low-probability results will occur.

</math stuff>
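
A quick simulation makes that point concrete. The sketch below (Python with numpy and scipy assumed, which may differ from the OP's setup) tests 100 perfectly fair green dice and counts how many come out "significant" at 0.05 purely by chance:

import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
probs = np.array([3/8, 2/8, 3/8])     # evade, focus, blank on a fair green die
n_rolls, n_dice, alpha = 5000, 100, 0.05

false_positives = 0
for _ in range(n_dice):
    counts = rng.multinomial(n_rolls, probs)          # simulate a fair die
    _, p = chisquare(counts, f_exp=probs * n_rolls)   # Pearson chi-square test
    if p < alpha:
        false_positives += 1

print(false_positives)   # typically around 5 of the 100 fair dice "fail"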

This is why people call us geeks.

And I'm sure it has nothing to do with the fact that we play with little, tiny spaceships. :)

Because fantasy football is totally not geeky...right.

Oh man, some of the best times I have ever had gaming were with a Blood Bowl league back in the mid-nineties. ;)

Also you need to vary your landing surface and roll distance.

And humidity.

And gamer funk.

I've found my translucents to be rolling a lot more average than my original set, especially on the green dice side of things.

If you don't quantify your claim and provide empirical evidence, selective memory (or any number of things) could be a factor in why you believe that. That's why the OP's scientific approach is useful, and claims such as yours aren't.

I understand what he's accomplishing, but my claim comes from loads of games of X-Wing, and I've found that my rolls with these new dice come up average more often, with even more zero-evade rolls on 3+ dice than I'm used to.

And yes, I have no scientific proof of it, but in the end who really f'in cares?

But thanks for your input, Buzz Killington.

You must be the life of the party!

Green03
-----------------------
(observed / expected)
Evade : 2583 / 2593.1
Focus : 1712 / 1728.8
Blank : 2620 / 2593.1
Total : 6915
P value: 0.78 - fair

Also, I made some corrections to dice 1 and 2. I wrote another script to sort all the blank, evade, and focus results into their own folders. Then I was able to manually fix the false identifications quickly. That brings Green01's P value to 0.5, instead of 0.18.

Blank was the default classification, and I was also having some problems with my rig. Now that I'm sorting, though, the results are significantly more accurate.
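
For the curious, the sorting step might look something like the sketch below. The file names and format (a results.txt with one "image,label" line per roll, and a frames/ folder of captures) are assumptions; treybert hasn't posted the actual scripts or even their language.

import shutil
from pathlib import Path

# Hypothetical layout: frames/ holds one image per roll, and results.txt
# has lines like "frame_0001.jpg,evade" written by the classifier.
frames = Path("frames")

for line in Path("results.txt").read_text().splitlines():
    name, label = line.strip().split(",")
    if label == "error":
        continue                  # errors get thrown out of the tally
    dest = Path(label)            # one folder per result: evade/focus/blank
    dest.mkdir(exist_ok=True)
    shutil.copy(frames / name, dest / name)

Dumping the images into one folder per result is what makes the false identifications quick to spot and fix in bulk.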

Well that went south quickly...

You do realize the whole point of the thread is treybert taking the time to quantify the balance/fairness of his dice through a scientific experiment?

If you come here thinking "I have no scientific proof of it but in the end who really f'in cares"... you might not be in the right place?

Anecdotal evidence worked for me!

I would keep track of green die #1 and make sure it's always the first die rolled. Everything's come up "fair" but it seems to roll more evades.

I do realize that, but my claims have zero effect on the outcome of the OP's experiment, and I don't plan on doing this experiment myself either.

I'm just trying to be sociable and keep the thread alive while we wait for results. But instead, people like yourself come in and slam me for an observation of my own. The only difference is that I've not taken the time to write down every dice throw.

Also, my statements aren't harming anyone.

Twitch. Twitch.

A P value of 0.18 means that there is a 0.18-out-of-1 chance that the observed values happened by chance.

That's not exactly how p-values work, and it's an especially confusing frame for goodness-of-fit tests of random variables... what would it mean for a set of dice rolls not to occur by chance?

I think here it's easiest to understand p-values as an indicator of how many biased results you'd need to conclude the dice were "off".

While fishing is a general problem (and dividing the threshold for significance by the number of tests is called a Bonferroni correction), I'm not sure it's a big deal here. He's performing a relatively small number of tests, and in the case of a positive result we can all look at the distribution itself to see whether the conclusion is reasonable.

(In fact, for that reason there's a question of whether these tests should be treated as a family at all, but that's a much more involved question.)

Green04
-----------------------
(observed / expected)
Evade : 2060 / 1985.6
Focus : 1326 / 1323.8
Blank : 1909 / 1985.6
Total : 5295
P value: 0.057 - unfair, set aside for retesting

Green05
-----------------------
(observed / expected)
Evade : 3584 / 3532.5
Focus : 2353 / 2355
Blank : 3483 / 3532.5
Total : 9420
P value: 0.49 - fair

So I modded my Lego rig to accept plug-in power; the batteries were only lasting about 6 hours before. So Green05 ran all night, at around 1,000 rolls per hour.

I apologize if my description is not the most accurate. The person that taught us this was the statistician for the university and medical school, so his views on the P-value, which were instilled in me, come more from a clinical perspective than a purely statistical one.

I wasn't trying to say that you're wrong, and I'm sorry that's how it reads. The person who taught me is a specialist in psychological measurement, which is probably where a lot of the different perspective comes from.

What I meant was something more along the lines of this: it's technically accurate that the p-value describes the likelihood that the result occurred by chance. But in this case (Pearson's chi-square) the test statistic describes the distance between the observed distribution and the expected distribution, and p = 0.05 describes how far apart the distributions are allowed to be before we say they're really different--and for me it's easier to think about "distance" than about "chance".
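
For reference, the "distance" in question is the Pearson chi-square statistic. In the notation of this thread, with O_i the observed count and E_i the expected count for each face:

\[
\chi^2 \;=\; \sum_{i \in \{\text{evade},\,\text{focus},\,\text{blank}\}} \frac{(O_i - E_i)^2}{E_i},
\qquad E_i = n\,p_i,\quad (p_\text{evade}, p_\text{focus}, p_\text{blank}) = \left(\tfrac{3}{8}, \tfrac{2}{8}, \tfrac{3}{8}\right)
\]

The p-value is then the upper-tail probability of this statistic under a chi-square distribution with 3 - 1 = 2 degrees of freedom.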

Ahh, so that's where our difference comes from. We were also taught that the chi-square value describes the distance between the observed distribution and the expected distribution. But for our purposes, the p-value interpretation of tests in a clinical setting should be seen as chance when determining the best course of treatment for our patients. Thanks for helping me figure out this other way of looking at the p value. I really appreciate learning something new!

ENGLISH MOTHER HUMPER!...do you speak it?!

I make the money I need for X-Wing by being a statistician, so I jumped at the chance to comment on this thread!

From there, I get a big text file with evade, blank, focus, or error. I throw out the errors, and perform a Chi-square test.

[...]

Green04
-----------------------
Evade : 2060 / 1985.6
Focus : 1326 / 1323.8
Blank : 1909 / 1985.6
Total: 5295
P value: 0.057 - unfair

[...]

Green04 added. While technically not below .05, I think that based on the other results we can conclude this one is definitely not balanced. But I'm labeling each die for further testing.

I'm so happy to see a Chi-square test here!

Careful not to use .05 for a cutoff if you're not willing to stick to it. Think of that significance level like this: how unusual do the results have to be for me to conclude that this die must not be fair? Something that has a one in twenty chance? (.05) A one in one hundred chance? (.01) A one in a million chance? (.000001)

Your P-value of 0.057 is saying "if Green04 is 100% fair, when testing 5295 rolls, we'd find Evade, Focus, and Blank counts as unfair or more unfair than these about once in 17 tests." Looking at it another way, as long as it's actually true that Green04 is a completely fair die, then the P-value from your test has an "equal chance" of being any number between 0 and 1. If, however, Green04 is NOT a fair die, then we would hope that getting a low P-value would be more likely (if that's not the case, we say the test lacks power).

Although I disagree with the "definitely not balanced" conclusion, I don't have any argument against testing Green04 further, "just to be sure". Well, as long as the right conclusions are made, I guess.
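
As a sanity check, the quoted Green04 numbers are easy to reproduce. Here's a sketch with scipy (the thread's own tooling is unspecified):

from scipy.stats import chisquare

observed = [2060, 1326, 1909]                       # evade, focus, blank
total = sum(observed)                               # 5295 rolls
expected = [total * 3/8, total * 2/8, total * 3/8]  # fair green die

stat, p = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")            # ~5.75 and ~0.057, as quoted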

There is a trap in statistics: with more and more data, you are doomed to statistical significance. So be careful about doing too many tests.

Also, definitely compare the percentages of blanks between the two dice and the P values for them. That'll give you an idea of which dice might give you fewer chances of blanks when you use them.

For the first paragraph, I'm not sure if you're saying the right thing imprecisely, or saying something just slightly wrong. As you count more and more rolls of Green04, the test's power goes up. In other words, if Green04 is actually biased, you'd be more and more likely to get a low P-value and conclude the die is unfair. But what if you were testing a die that's unfair, but by so little that an Evade result comes up, on average, 37.52% of the time? Then, as you count more rolls, the Chi-square test is more likely to tell you the die is unfair. (This is the question of statistical significance.) But if you're only 0.02 percentage points more likely to roll an evade, that doesn't actually matter in the real world. (In other words, it's lacking practical significance.) On average, you'd only get an extra evade about once in 5,000 rolls. (Which gives you some idea how many rolls you'd need to test to get a significant P-value.)

So, yes, I agree, it's possible to detect statistical significance when the difference has no practical importance. But it's not going to be a huge deal, as the smallest increase in Evade results (matched with a decrease in blanks by the same amount) his Green04 test would detect is about 1.5 percentage points (about 39% Evade and 36% blank), which still has practical importance.
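
To put a rough number on that, here's a back-of-the-envelope power calculation for the hypothetical 37.52%-evade die, using scipy's noncentral chi-square. This is my own sketch, not anything from the OP's experiment:

import numpy as np
from scipy.stats import chi2, ncx2

fair   = np.array([0.375, 0.25, 0.375])      # evade, focus, blank
biased = np.array([0.3752, 0.25, 0.3748])    # +0.02pp evade, -0.02pp blank

effect = np.sum((biased - fair) ** 2 / fair) # per-roll noncentrality
crit = chi2.ppf(0.95, df=2)                  # cutoff at alpha = 0.05

for n in (10**6, 10**7, 10**8):
    power = ncx2.sf(crit, df=2, nc=n * effect)
    print(f"{n:>11,} rolls: power = {power:.2f}")
# Power barely rises above alpha at a million rolls and only approaches 1
# around a hundred million, so a bias this tiny is practically undetectable.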

The second paragraph is trouble, however. If treybert had five fair green dice, getting P-values for differences in blanks in all 10 pairings of dice would give him about a 40% chance of incorrectly concluding at least one die differs! It's similar to the situation that MrkvChain referred to (post #30).

While fishing is a general problem (and dividing the threshold for significance by the number of tests is called a Bonferroni correction), I'm not sure it's a big deal here. He's performing a relatively small number of tests, and in the case of a positive result we can all look at the distribution itself to see whether the conclusion is reasonable.

(In fact, for that reason there's a question of whether these tests should be treated as a family at all, but that's a much more involved question.)

Ah, a kindred spirit! :)

Though the number of tests is small, the Bonferroni correction would still matter, as he'd need to take his significance level down from .05 to .01 if he was going to draw a conclusion about the fairness of X-Wing's green dice (though the family-wise treatment's appropriateness may be arguable). However, neither the Bonferroni correction nor the approach of checking the distribution (for practical significance?) are necessary for now. None of the dice were significant, and treybert plans to re-run the experiment on Green04 due to its unusual (but not significant) results. I like that approach, because false positives are unlikely to be replicated, giving us a more stringent standard of proof to meet before we conclude the green dice aren't "fair".

Yeah, my big concern with Bonferroni here is actually that it tends to blow up your Type II rate. That's arguably a bigger risk than Type I here, since the practical outcome of a Type II error is continuing to use an unfair die in a competitive event.

On the other hand, if you select n dice from a set of N at random when you roll, arguably the overall distribution of results will be robust to a single biased die in the larger set.
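
That pooling intuition is easy to poke at with a toy simulation; all the numbers below are invented for illustration:

import numpy as np

rng = np.random.default_rng(1)
evade_p = np.array([0.42] + [3/8] * 7)   # 8 dice, one noticeably evade-biased

rolls = 200_000
dice = rng.integers(0, 8, size=(rolls, 3))       # grab 3 of the 8 dice per roll
evades = rng.random((rolls, 3)) < evade_p[dice]  # each die rolls independently
print(evades.mean())   # about 0.381 vs the fair 0.375: the pool dilutes the bias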

...there's a possibly apocryphal quote that occasionally circulates in my field, attributed to statistician Fred Mosteller: the important results aren't the ones that reach an arbitrary level of statistical significance, but rather those that meet the test of interocular significance: they hit you between the eyes.