That's Not How Playtesting Works

By MasterShake2, in X-Wing

I feel like I've had to break out the phrase "That's not how playtesting works" a little too frequently, so let's take a brief (ha) look at how it actually does work, or at the very least, at some of my experiences with it. For me, I love being in playtests. It’s like a free game design class and, like most classes, it works better if you pay attention. For this little dive, I'm going to mostly reference a wave 2 playtest for Malifaux second edition for a few reasons:

1: It was an open beta, so no NDA's involved

2: I remember parts of it pretty vividly

3: It actually covers a lot of bases

4: Third edition is now in open beta, so I can at least kind of pretend the timing makes sense.

-

So, in Wave 2 of the Malifaux open beta for second edition, they had one of my personal favorite models from first edition, Nekima. For the purposes here, all you really need to know about Nekima is that the model was fairly large and impressive by Malifaux standards, she had the highest printed point cost in the game at 13pts, and was predominantly a melee beater. She also had the Nephilim subtype, which will come up a little later. The first version of Nekima, like the first version of a lot of the models in her wave, was comically overpowered. I’m not a huge fan of design teams that use this practice because it basically means you’re losing the first week of the playtest learning things you already know, i.e. everything is overpowered as **** and there’s not even good enough context among all the other overpowered stuff to start figuring out what’s wrong. It’s not uncommon for devs to start high and work their way down, but this was definitely on the extreme side of the equation.

-

The second iteration definitely looked tamer, but she had a really odd feature in that her damage was 3/4/8. I won’t waste time on how attacks work in Malifaux, but unless a model has a buff, it’s going to see the first 2 numbers significantly more often than the third (easily over 90% of the time). The thing is, that 8 damage was one of the highest damage numbers in the game, so, in what will surprise zero people who know me and how I play games, my first reflex was “Is there a way to reliably hit the 8?”. It turns out there actually was, in a combo of 2 upgrade cards, Obsidian Talons and Pact. To call these 2 upgrades throwaway cards would be pretty charitable, as few players even remembered they existed long enough to throw them away, but if you spent a Soulstone (a consumable resource in the game), this combo basically guaranteed that you could always drop a card from your hand for damage, and combined with some ways to stack your hand with cards, it meant basically doing 8 damage with every swing for at least the first 2 turns, maybe 3. 8 damage will kill roughly half of the models in the game in a single attack and almost all of the others in 2, and she had 3 attacks. This combined with one of the masters, Lilith, who could essentially teleport Nekima into a position where she could attack anything on the table, including models still in their deployment zone, on turn 1.

-

Naturally, this was identified and immediately corrected…well, no, because that’s not actually how this works. My first step was to raise concern about this combo. Unfortunately, basically zero of the other playtesters were sold on it being a problem, with counter-arguments ranging from it being resource-intensive to it being too hard to pull off, but the devs were at least interested enough to want to see some real data, so I put my money where my mouth was and played it against some locals who were also doing beta things. Not only was it pretty dumb, but it was very repeatable, and an opponent’s knowledge of the trick didn’t stop it from working. The “resource-intensive” argument also got crushed, because if you’re tabling your opponent in 3 turns it doesn’t matter how many resources you spent on turns 1-2. The devs definitely took this data to heart, especially as I was putting more table time in with the model than anybody else, and they removed the element of her weapon that allowed it to work with Obsidian Talons…and nobody has ever seen or heard from that upgrade again, but I digress. Pact was also basically just an insurance policy, but without Obsidian Talons it had nothing to insure. Problem solved! Another good day in…of course we’re not done yet.

-

A second problem was rapidly becoming apparent…and I’m not talking about the weird interaction where your Nekima could end up in charge of the enemy crew, an interaction that they fixed as a matter of future-proofing even though there seemed to be no compelling reason to ever do it. No, the problem was that she was now an absolute dumpster fire of a model. She just didn’t hit hard enough when she couldn’t land that 8 damage reliably, and she wasn’t particularly durable either. After some thought, I came up with a solution that I was absolutely convinced was the right way to go: changing her damage track to 4/5/6. The big selling point here is that hitting a 4 all the time with semi-common 5’s improved her damage into the range it needed to be for her cost, and pulling the max damage down to 6 prevented possible future issues with Obsidian Talons. There was also a precedent for the damage track, not just in the game, but in the faction and with the Nephilim keyword: the Mature Nephilim cost 11pts and had that exact damage track. The Mature Nephilim could also potentially get the same 3 attacks Nekima could, it just required more setup. So naturally everybody saw the light…yeah, no. My favorite quote from this entire episode occurs here, when someone responded to this suggestion, “If they do this, it won’t be the Neverborn faction, it will be the Nekima faction with Neverborn allies.” What a reasoned and non-hyperbolic response. Anyways, the devs sided with me, her damage track changed on the next update, and she felt mostly done. My only comment was that some very minor defensive tech was probably in order, but the model seemed to be doing what a 13pt model should do to be worth it without being overpowered.
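To put rough numbers on the damage-track argument, here's a quick sketch of expected damage per hit. The weak/moderate/severe weights below are purely illustrative assumptions, not real Malifaux flip odds (those depend on duel totals, hand cheating, and so on):

```python
# Illustrative comparison of the 3/4/8 and 4/5/6 damage tracks.
# The weak/moderate/severe weights are hypothetical stand-ins for
# "you see the first 2 numbers far more often than the third".
weights = {"weak": 0.50, "moderate": 0.40, "severe": 0.10}

def expected_damage(track):
    """Expected damage per hit given the assumed result weights."""
    weak, moderate, severe = track
    return (weights["weak"] * weak
            + weights["moderate"] * moderate
            + weights["severe"] * severe)

old_track = (3, 4, 8)   # playtest version: spiky, combo-dependent
new_track = (4, 5, 6)   # proposed version: flat, reliable

print(round(expected_damage(old_track), 2))  # 3.9
print(round(expected_damage(new_track), 2))  # 4.6
```

Under these assumed weights, the flatter 4/5/6 track actually averages higher per swing while capping the spike that Obsidian Talons exploited, which is the whole point of the change.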

-

Then the final rules came out and I got to open the card pack…wait, what? During the entire playtest, she was Defense 4 (a little low) and Willpower 6 (high average), but the printed card with the released model had her at Defense 5 (average) and Willpower 7 (high), and it wasn’t a typo. Now, it’s not uncommon for small things to change between the last version the playtesters have and the release version, but normally these are minor point changes, language changes on abilities, or maybe rolling a model back to a previous state. I can’t recall another instance where a model just released with stats that it was never playtested with at any point. I’m not sure if there was internal playtesting involved or if this was just a WAG (wild *** guess), but knowing Wyrd Miniatures, probably the latter.

-

A lot of this is atypical, i.e. you likely will not have as much developer interaction in an open playtest as I did in this instance, but there are a lot of things to unpack that can be applied across other playtests. For example, would anyone else have caught the Obsidian Talons + Pact Nekima missile? I was in a unique position in that playing Nekima and Lilith a lot in first edition had me used to rocketing Nekima into the enemy, and it was a playstyle I was comfortable with, but most players were barely using Lilith, and when they were, not terribly effectively. There is also another interesting point in that a simple correction to Obsidian Talons to read “Non-Master, non-Henchman” would’ve corrected the whole problem and future-proofed the interaction, but Obsidian Talons was a wave 1 card that had been released not that much prior, and they were really reluctant to release a new one. They could’ve also corrected Lilith’s teleport trick that didn’t require line of sight, but again, she was a wave 1 model that had already been released. That ship had sailed, put into port in various tourist traps, and returned with tasteless knick-knacks and diarrhea. All of the other parts of the combo were, for the purposes of this playtest, set in stone, so the change would have to come on Nekima’s end. There will always be factors in a playtest that you can’t change and, sometimes, even ones that the people above you can’t change, especially when you’re talking about models that have already been released.

-

There was also the weird interaction that could technically lead to Nekima being in charge of the enemy crew (and your crew having no leader). This is a fun example because it was the kind of interaction where literally nobody involved in the playtest could come up with a compelling reason why you would do it. The devs essentially responded with “we’re not sure if it’s broken, but it could create really odd rules interactions and we see no reason to keep it in the game,” so some text was added to prevent this interaction.

-

It’s also fascinating that, despite all of the playtest hours I had on this one, I had 0 hours with the model with its released stats. This is definitely atypical, but it’s not unusual at all for the version that ends up being released to only spend a short amount of time with the playtesters (typically the last week or 2 are when things are pretty close to done). In that respect, both the playtesters and devs are relying on some of the data from previous versions for their overall conclusions, so results can get skewed, especially if something spends a lot of time in a state of flux. This is the most common mistake when people ask “How could playtesting have missed that?”: it assumes that the element in question was present for the entire playtest cycle and present in its current state, which could easily be wrong (and in fact likely is). For example, if you asked me how I missed Willpower 7 on Nekima being broken (it wasn’t, but as an example) I’d just throw up my hands, because I never played her with that stat in the playtest.

-

I also found out quickly that there’s a bit of an art to giving feedback, and even if a player can identify a problem, if they aren’t able to easily convey both the problem and its root causes, it could still end up getting missed, or even the wrong element could get “fixed”. Contrary to what many will say, you don’t need to have a solution ready to identify a problem, but you do really have to be able to articulate the problem and preferably identify root causes so that someone who may be able to solve it has the information to do so. I didn’t propose the 4/5/6 damage track on a hunch; I looked at models from wave 1 that were at least in the same cost ballpark and identified a feature that could solve an existing problem. To use another example, from the Garryth2 CID for Warmachine: Garryth had an issue where his control range was too small for his spells and abilities. The normal way to increase a model’s control range is to increase their Focus stat, but this also reflects spellcasting ability, so the devs were pretty adamant this wasn’t going to change. Instead, I proposed modifying one of his rules that existed nowhere else in the game to increase his control area when he aimed, because he was supposed to be a sniper type, and this solved a lot of problems while keeping to the general intent and theme of the designers.

-

We also see another fun problem in playtesting: agreeing with other people is hard. I basically had to get some pretty irrefutable evidence that “no really, this combo is broken” before I could get any movement on getting it changed. Then later, the biggest obstacle to making Nekima playable was other playtesters being hyperbolic about the power level of a change to her damage track. I’m pretty sure the only reason I won that fight was because I was able to identify the original problem and articulate it relatively quickly and effectively. Even so, there’s a large gradient of power levels, so you’ll get a lot of disagreement on exactly where something is on the scale, and that minor disagreement can make the difference between “it’s fine” and “buff/nerf required”.

-

TL;DR: The playtest process does not represent a straight line from initial design to finished model, and a lot of the bumps along the way can lead to substantial problems.

As a beta tester for one of my favorite games, like, actual genuine tester, can confirm this.

But what was the final version like? was it a dumpster fire? or some crazy nearly overpowered thing? Or was it ok, by some weird chance? Do you know why you didn't get to play with the final?

I know of a similar(ish) story from, I believe, 7th edition Warhammer. The Dark Elf playtesters gave universal feedback about a particular unit/magic banner combo and it was completely ignored by the devs, essentially removing all but one unit of elite infantry from the unit roster. The problem in this instance seemed to be the intractability of the developers, as reported by the testers themselves. I agree there's an art to giving feedback, but I think there's also a requirement from the developers to be able to receive that feedback in the correct manner. It's a huge part of my job as a web professional to be able to take what is quite often fairly damning user feedback and look at it objectively without feeling like it's a personal slight against me and also not to let my own personal feelings about how something should be done overrule the evidence. That's a skill more people in general could do with learning, but games developers in particular should really be able to do this.

4 hours ago, MasterShake2 said:

This is the most common mistake when people ask “How could playtesting have missed that?”: it assumes that the element in question was present for the entire playtest cycle and present in its current state, which could easily be wrong (and in fact likely is). For example, if you asked me how I missed Willpower 7 on Nekima being broken (it wasn’t, but as an example) I’d just throw up my hands, because I never played her with that stat in the playtest.

This.

I've seen several playtests, and more than once your suggestions are accepted, or not, as the playtest proceeds, then at the end, it disappears behind a curtain and comes out the other side with a few extra changes never mentioned or offered for debate, and often those are the problem children.

We had some similar observations in the playtest of the Legend of the Five Rings RPG that FFG did last year: there were non-trivial changes between the last published Beta update and the published version, several of which no-one had ever seen.

3 hours ago, Blail Blerg said:

But what was the final version like? was it a dumpster fire? or some crazy nearly overpowered thing? Or was it ok, by some weird chance? Do you know why you didn't get to play with the final?

Nekima was the exception; she never had to be buffed or nerfed in second edition and was a lot of fun to play. As for why we never got the final stats, who knows. Even in a closed playtest, there's still an element of opacity to some parts of the process.

I'll also add you don't always get the full story. Imagine getting a brand new war game, provided the rules for all but one particular weapon... and having a unit to test that was armed with multiples of that weapon. Obviously it was horrifically underpowered in testing since we didn't know how half the thing was supposed to work.

I take it super personally when anyone here blames the playtesters. Y'all don't know half of anything, so thanks for this, Matt.

2 hours ago, LagJanson said:

I'll also add you don't always get the full story. Imagine getting a brand new war game, provided the rules for all but one particular weapon... and having a unit to test that was armed with multiples of that weapon. Obviously it was horrifically underpowered in testing since we didn't know how half the thing was supposed to work.

This.

@MasterShake2 gives an example of looking at some rules and going "hmmm....I wonder"

Equally, when playing the open playtest of Victory At Sea 2.0, I could take a look at the rules and say "a battleship cannot stop even a quarter of its points in destroyers in a fair fight" because the numbers make it obvious.

However, sometimes there is a genuine 'glitch' in the rules mechanics and sometimes that does catch people by surprise. It wouldn't surprise me if the whole Tavson/Electronic Baffle thing never occurred to whoever wrote the text for the pilot card.

When Mongoose used to publish Babylon 5 fleet combat games, the 'fleet books' came out with statlines for several hundred units at a time. Assuming someone had played at least one game with each unit is reasonable. Assuming someone had played a spam fleet with every ship type against every possible class of opponent is not, and the scissors-paper-stone matchups you could find yourself in were ridiculous.

Which isn't to say all playtesters are perfect and pure-hearted. Sometimes, you can look at a unit and say "no, someone done screwed up". Not registering a combination like the one @MasterShake2 pointed out, if you don't know the various upgrades someone might reach for, is one thing. Taking something with a base statline better in every area and making it cheaper than an equivalent unit is not (Forge World's Horus Heresy Adeptus Custodes, for example). But even then, you can't assume there's a vast crowd of playtesters conspiring to overpower something, because they may well never have seen it.

Edited by Magnus Grendel

Blaming playtesters is always odd. The developers make all the decisions. Playtesters reporting a problem doesn't mean anything will be changed. So often in games, the real answer to "how did playtesters miss this" is "they didn't, but the designers didn't agree that it was a problem".

11 hours ago, MasterShake2 said:

Then the final rules came out and I got to open the card pack…wait, what? During the entire playtest, she was Defense 4 (a little low) and Willpower 6 (high average), but the printed card with the released model had her at Defense 5 (average) and Willpower 7 (high), and it wasn’t a typo. Now, it’s not uncommon for small things to change between the last version the playtesters have and the release version, but normally these are minor point changes, language changes on abilities, or maybe rolling a model back to a previous state. I can’t recall another instance where a model just released with stats that it was never playtested with at any point. I’m not sure if there was internal playtesting involved or if this was just a WAG (wild *** guess), but knowing Wyrd Miniatures, probably the latter.

I don't know Wyrd Miniatures or their testing/feedback process, but I'll just throw out that A/B testing is a common thing, so it is possible that they had a different version of the character out there getting feedback as well, and that may have shaped the final character choices.

1 minute ago, kris40k said:

I don't know Wyrd Miniatures or their testing/feedback process, but I'll just throw out that A/B testing is a common thing, so it is possible that they had a different version of the character out there getting feedback as well, and that may have shaped the final character choices.

It was an open beta, so the only way we wouldn't have seen the changes is if the alternate stats were only used internally. I also happen to live in the city where Wyrd is based, and using the conversations I've had with their devs when they occasionally come out for events as a baseline, I'm pretty confident they pulled this out of their butts.

11 minutes ago, kraedin said:

Blaming playtesters is always odd. The developers make all the decisions. Playtesters reporting a problem doesn't mean anything will be changed. So often in games, the real answer to "how did playtesters miss this" is "they didn't, but the designers didn't agree that it was a problem".

Designers use playtesters to find the issues, big and small, that weren't already thought of in some part of the game.

It's the designers' informed choice at that point of "what kind of experience are we trying to create?". They are not obligated to balance things the way testers request; a more informed design decision is all they are really looking for.

The answer is often somewhere close to, "we really like this mechanic or this powerful element, even if it's a little 'OP'."

I agree with @kraedin . Design is the designer's job.

Playtesters are usually just unpaid suggestion-providers.

31 minutes ago, kraedin said:

Blaming playtesters is always odd. The developers make all the decisions. Playtesters reporting a problem doesn't mean anything will be changed. So often in games, the real answer to "how did playtesters miss this" is "they didn't, but the designers didn't agree that it was a problem".

I work in software QA. We have very similar issues.

1 hour ago, kraedin said:

Blaming playtesters is always odd. The developers make all the decisions. Playtesters reporting a problem doesn't mean anything will be changed. So often in games, the real answer to "how did playtesters miss this" is "they didn't, but the designers didn't agree that it was a problem".

11 hours ago, Captain Lackwit said:

As a beta tester for one of my favorite games, like, actual genuine tester, can confirm this.

I take it that's not X-Wing then?

2 hours ago, Bucknife said:

The answer is often somewhere close to, "we really like this mechanic or this powerful element, even if it's a little 'OP'."

See: Harpoon Missiles coming into existence

Sometimes, you have to kill your darlings, though, which is why we don't see them anymore.

15 hours ago, MasterShake2 said:

It’s also fascinating that, despite all of the playtest hours I had on this one, I had 0 hours with the model with its released stats. This is definitely atypical, but it’s not unusual at all for the version that ends up being released to only spend a short amount of time with the playtesters (typically the last week or 2 are when things are pretty close to done).

I wish I could say that was atypical, but from my time spent in playtesting (many, many years ago for a game that is no longer made), I can say that for some companies, it was quite the opposite. (Technically, my NDAs expired many years ago, but I still feel weird commenting on what companies or games I tested for... being an engineer in my regular job makes me pretty sensitive to IP concerns).

I remember being given a file with the new units for a game, and for a release that would have over 100 units, we'd get around 10-20 for playtesting. Then, shortly before the set release, we'd get a full file with all the units... and no ability to suggest changes, because it was too late at that point. Getting the full file was basically a "reward" for being a playtester, in that we'd get an early spoiler. At which point we'd notice units that were so completely different from anything we'd playtested, and definitely had some broken interactions.

Of course, for that game, we basically didn't see anything from any of the other playtest groups at all. Our only interaction was directly with the playtest coordinator at the gaming company, and while we sent comments and suggestions, we didn't always get feedback in return.

In retrospect, it's really amazing that the game lasted as long as it did... (several years, in fact).

In short, the value of playtesting depends greatly on how a gaming company makes use of it. Some are far better than others at how they make use of playtesters.

I think people have their priorities confused as to what playtesting is primarily for: looking for game-breaking corner cases, not ensuring that the game is balanced (or rewards a preferred playstyle/strategy). Now, granted, an exploit to win the game can be considered game-breaking if it goes into, say, a loop that cannot be stopped, but playtesting is about looking for those loops and clarifying them. Take, for example: Rule 23a says you get a point every time you place a base, but you upgraded to Construction C, which states in Rule 74c that every time you score a point you place a base down. Now you've got a feedback loop that just breaks the game.
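The loop above is easy to sketch out. The rule numbers and names are the hypothetical ones from the paragraph, and the event cap exists only so the demonstration terminates, because the broken ruleset itself has no such cap:

```python
# Sketch of the hypothetical Rule 23a / Rule 74c feedback loop:
#   Rule 23a: placing a base scores a point.
#   Rule 74c (Construction C): scoring a point places a base.
# Each event triggers the other, so without an external limit
# the game never reaches a stable state.

def place_base(state, cap=10):
    """Rule 23a: placing a base scores a point."""
    if state["events"] >= cap:      # safety valve; the real rules have none
        return
    state["events"] += 1
    state["bases"] += 1
    score_point(state, cap)

def score_point(state, cap=10):
    """Rule 74c: scoring a point places another base."""
    if state["events"] >= cap:
        return
    state["events"] += 1
    state["points"] += 1
    place_base(state, cap)

state = {"bases": 0, "points": 0, "events": 0}
place_base(state)                   # one base placement cascades forever
print(state)                        # growth stops only because of the cap
```

Raise the cap and bases/points grow without bound, which is exactly the "loop that cannot be stopped" a playtest should be hunting for.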

Now, playtesting for rule bugs is not the same as playtesting for, say, video games. But that is not to say that designers don't have intended results for influencing players in their games. However, unlike an audience for a movie, the player has a lot more control over their experience. In the scenario above, for a board game, most players will realize that the rule combo is broken and simply house-rule in a fix, such as no more than 3 points can be scored in a round. It is easier to mod a tabletop game than a video game; the difficulty is in getting the other players to agree with said fix. Now, the scenario I used was a problem so bad no one would disagree with the fix, but there are some problems that just are not game-breaking, and many "fixes" that cause different problems altogether.

15 hours ago, FTS Gecko said:

I take it that's not X-Wing then?

Well of course not. Dunno where that came from.

Can't say what it is though.