Claude's Storytelling: A Leaderboard Without Refusals?

by Sebastian Müller

Hey guys! I wanted to talk about a feature request that I think could be super valuable for understanding how Claude models stack up against others. It's about how we handle refusals in the scoring and leaderboards, and I think there's a real opportunity to add some nuance to how we present the data.

The Impact of Refusals on Claude's Scores

So, here's the deal: Claude's performance scores might be taking a hit because it tends to refuse certain prompts. Now, you could totally argue that this is a bad thing, right? If a model is refusing prompts, it's not fully delivering on what we're asking it to do. But I think there's another way to look at it, and that's where this feature request comes in.

What if we want to specifically evaluate Claude's storytelling abilities, or its creative writing skills, without the refusal factor muddying the waters? This is where excluding refusals becomes an interesting idea. Imagine you're trying to write a compelling narrative. The model's ability to weave a story, create characters, and build tension is paramount. Refusals, in this context, don't necessarily reflect the model's inherent storytelling capabilities. They might stem from other constraints or ethical considerations baked into the model. By having a leaderboard that excludes these refusals, we can get a clearer picture of Claude's pure writing prowess.

Think about it like this: in a race, if a runner stumbles, their overall time is affected. But if you want to judge their running form and technique, you might want to analyze the parts of the race where they were running smoothly. Similarly, by isolating the instances where Claude engages fully with a prompt, we can better assess its core strengths and weaknesses as a writing tool.

Furthermore, a separate leaderboard that focuses solely on successful responses could provide valuable insights for developers and researchers. It could highlight areas where Claude truly shines, and areas where it might need further refinement. This data could inform targeted improvements to the model's architecture, training data, or prompting strategies.
For example, if Claude consistently excels at crafting imaginative fantasy stories but struggles with more realistic or controversial topics, this could indicate a need for more balanced training data or a more nuanced approach to handling sensitive content.

In addition to benefiting developers, this feature could also be incredibly useful for users who are simply looking for the best model for a specific writing task. If someone is primarily interested in generating creative content and doesn't anticipate needing the model to address sensitive topics, they might find the refusal-excluding leaderboard to be a more relevant and informative resource. They could confidently choose Claude, knowing that its storytelling abilities are top-notch, without being overly concerned about its tendency to refuse certain prompts.

Ultimately, adding a leaderboard that excludes refusals isn't about excusing or ignoring the issue of refusals altogether. It's about providing a more comprehensive and nuanced view of Claude's capabilities. It's about recognizing that refusals are just one aspect of a complex model, and that there's value in understanding its strengths in specific contexts. By offering this additional perspective, we can empower users to make more informed decisions and developers to focus their efforts on targeted improvements.
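Before any refusal-excluding leaderboard can exist, something has to decide which responses even count as refusals. As a rough illustration (not any real evaluation pipeline's logic), here's a naive keyword heuristic; the marker phrases and the function name are invented for this sketch, and a production system would likely use a classifier instead:

```python
# Naive heuristic sketch for flagging refusal responses before scoring.
# The phrase list and function name are illustrative assumptions, not
# taken from any real leaderboard's implementation.

REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot assist",
    "i'm not able to",
    "i won't be able to",
)

def looks_like_refusal(response: str) -> bool:
    """Return True if the response contains a common refusal phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

print(looks_like_refusal("I can't help with that request."))  # True
print(looks_like_refusal("Once upon a time, a dragon..."))    # False
```

A keyword list like this is brittle (it misses soft refusals and flags false positives), which is exactly why the filtering rule would need to be documented alongside the leaderboard itself.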

A Second Leaderboard: Comparing Stories Without Refusals

So, my suggestion is this: what if we had a second leaderboard that specifically excludes refusals? This would give us a much clearer picture of how Claude's stories actually stack up against other models when it's fully engaged and not refusing the prompt. Imagine you're really into comparing how different models write fantasy stories, or maybe you're curious about their ability to generate creative fiction. A leaderboard like this would be gold, because it focuses solely on the quality of the output when the model is fully engaged with the prompt.
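Mechanically, the second leaderboard is cheap to compute: score the same set of results twice, once over everything and once with refusal entries filtered out. A minimal sketch, assuming a hypothetical per-response record with `model`, `score`, and `refused` fields (field names and scoring are invented for illustration):

```python
# Sketch: building two leaderboards from the same evaluation results --
# one overall, one excluding refusals. The record format is hypothetical.

from collections import defaultdict

def leaderboards(results):
    """Return (overall_means, refusal_excluded_means) keyed by model name."""
    overall = defaultdict(list)
    engaged = defaultdict(list)
    for r in results:
        overall[r["model"]].append(r["score"])
        if not r["refused"]:
            engaged[r["model"]].append(r["score"])

    def mean(xs):
        return sum(xs) / len(xs) if xs else None

    return (
        {m: mean(s) for m, s in overall.items()},
        {m: mean(s) for m, s in engaged.items()},
    )

results = [
    {"model": "claude", "score": 9, "refused": False},
    {"model": "claude", "score": 0, "refused": True},
    {"model": "other",  "score": 6, "refused": False},
]
all_scores, no_refusal_scores = leaderboards(results)
print(all_scores["claude"])         # 4.5
print(no_refusal_scores["claude"])  # 9.0
```

The toy numbers show the point of the proposal: a model whose refusals score zero looks mediocre overall (4.5) but strong once only engaged responses are counted (9.0). Publishing both numbers side by side keeps the refusal behavior visible instead of hiding it.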