How We Score the Games
Our scores reflect five dimensions, weighted and averaged. Here is what each one means, how the weighting works, and what we chose not to score.
Every game in our catalogue carries a score between 3.5 and 4.7 on a 5-point scale. People have asked, reasonably, what those numbers actually mean. The honest answer is that they reflect our internal judgements on five specific dimensions, weighted and averaged in a transparent way. This article walks through that process and the conscious choices we made about what to score and what not to score.
The five dimensions
Every game is rated on the following five dimensions, each on a 1-to-5 scale:
Core loop quality. Is the central action satisfying? Does it feel good to do — physically, perceptually, cognitively? This is the most heavily weighted dimension because it is the one a player notices first and feels longest.
Skill expression. Does practice produce visible improvement? Can a more thoughtful player meaningfully outperform a less thoughtful one? Games with no skill ceiling score low here even if the core loop is fine, because they cannot sustain interest past the first session.
Honesty of failure. When you lose, do you know why? Can you identify what to do differently next time? Games with opaque failure modes — invisible randomness, ambiguous timing, hidden state — score low.
Cleanness of design. Is the game free of unnecessary elements? Does every rule contribute to the experience? Or are there bolted-on systems that complicate the game without making it deeper?
Originality of mechanic. Is the underlying idea fresh, or is it a competent execution of something everyone has played? We score competent reimplementations as 3, slight twists as 4, and genuinely new mechanics as 5. Most games here score 3 or 4 — we are not claiming to have invented dozens of new genres.
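For readers who like the structure spelled out, here is a minimal sketch of how a single game's ratings could be represented, in TypeScript. The interface and field names are ours for illustration; only the five dimensions and the 1-to-5 scale come from the rubric above.

```ts
// Hypothetical shape for one game's dimension ratings.
// The field names are invented; the five dimensions and the
// 1-to-5 scale are the ones described above.
interface DimensionScores {
  coreLoop: number;          // core loop quality, 1-5
  skillExpression: number;   // skill expression, 1-5
  honestyOfFailure: number;  // honesty of failure, 1-5
  cleanness: number;         // cleanness of design, 1-5
  originality: number;       // originality of mechanic, 1-5
}
```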
The weighting
Core loop quality is weighted at 30%; skill expression at 25%; honesty of failure at 20%; cleanness of design at 15%; originality at 10%. The final score is a weighted average rounded to one decimal place. The weighting reflects our view that what a game feels like to play matters more than how clever its idea is on paper — and that originality, while real, is often overrated as a reason to recommend a game.
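Concretely, the whole computation fits in a few lines. The weights below are the published ones; the function name and rounding helper are ours, continuing the DimensionScores sketch above.

```ts
// The published weights for the five dimensions; they sum to 1.0.
const WEIGHTS: Record<keyof DimensionScores, number> = {
  coreLoop: 0.30,
  skillExpression: 0.25,
  honestyOfFailure: 0.20,
  cleanness: 0.15,
  originality: 0.10,
};

// Weighted average of the five dimension scores, rounded to
// one decimal place, as described above.
function finalScore(s: DimensionScores): number {
  const dims = Object.keys(WEIGHTS) as (keyof DimensionScores)[];
  const weighted = dims.reduce((sum, d) => sum + WEIGHTS[d] * s[d], 0);
  return Math.round(weighted * 10) / 10;
}
```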
This is why Snake Lite scores 3.7 despite being a literal reimplementation of a 1976 game. It scores very high on core loop quality (the game is excellent), reasonably well on skill expression (long sessions are clearly different from short ones), and low on originality (it is snake), but the heavy weighting toward core loop keeps the overall score respectable. Compare that to a game with a novel mechanic and a weak core loop, which would score lower despite being "more original".
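As a worked check on the Snake Lite arithmetic: the article publishes only the 3.7 total, so the per-dimension numbers below are invented, chosen only so the arithmetic lands on 3.7, not taken from our score sheet.

```ts
// Invented dimension scores that reproduce the published 3.7;
// the real per-dimension numbers are not published.
const snakeLiteIllustration: DimensionScores = {
  coreLoop: 4,
  skillExpression: 4,
  honestyOfFailure: 3,
  cleanness: 4,
  originality: 3,  // a competent reimplementation scores 3 on our rubric
};

// 0.30*4 + 0.25*4 + 0.20*3 + 0.15*4 + 0.10*3
//   = 1.2 + 1.0 + 0.6 + 0.6 + 0.3 = 3.7
console.log(finalScore(snakeLiteIllustration));  // 3.7
```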
What we deliberately do not score
We do not score visual polish. Many game review services give heavy weight to graphics quality, and we believe this is a mistake for small catalogue games like ours. A game that loads in 15 kilobytes cannot compete on visual polish with a game that loads in 200 megabytes, and trying to score them on the same axis is unfair to both. Our games are deliberately minimal visually; that is a constraint we worked with, not a flaw we ignored.
We do not score replayability separately. We believe replayability is downstream of skill expression: if you can keep getting better, you will keep playing. Scoring it as a separate axis would double-count it.
We do not score difficulty. A game can be hard and good, or hard and bad, or easy and good, or easy and bad. Difficulty itself is neither virtue nor flaw; what matters is whether the difficulty produces meaningful skill expression and honest failure. Those are the things we score.
Why we publish the scores
An alternative would have been to publish reviews without scores — just prose. We considered it. The reason we publish numbers is that numbers let a reader scan quickly. A reader with five minutes can look at our scores table and identify which three games to try first. A reader who has to read 25 reviews to discover the same information is being made to work harder than necessary. The scores are a navigation aid, not a verdict.
What they do require is editorial consistency. A 4.0 in our system should mean the same thing across games — and the only way to ensure that is for the same two people to score every game using the same criteria. That is what we did. The scores are not objective; they reflect our judgements. But they are at least consistent in how they reflect them.
Published · 14 May 2026 · Written and signed by Bill