Four authors wrote the same scene from the same prompt: a fourteen-year-old Indonesian scavenger stitching up a knife wound to her shoulder after surviving a sexual assault. Same character details, same emotional beats, same constraints. Three of the authors were AI systems—two running incognito with no context, one with full access to 100,000+ words of manuscript, worldbuilding, and character voice. One author was human.

Here are the results.

Your challenge: Pick the human. Pick the most genuinely literary. Pick the one demonstrating the most sophisticated craft, the most authentic voice, the deepest emotional truth.

Go read them. Pick one.

I’ll wait.

In blind tests, LLMs consistently choose Author C. They praise its controlled prose, its measured tension, its visible technique. “Sophisticated control.” “Masterful understatement.” “Psychological precision.” Author C demonstrates everything pattern-matching systems have learned to recognize as quality literary fiction—clean metaphors, textbook dissociation, careful structure. It looks like the real thing because it performs all the surface markers of literary craft.

Author C is Claude, running incognito with no context.

The AI judges pick the AI work as most human, most literary, most true. They recognize their own aesthetic—the aesthetic they were trained on, the aesthetic they reproduce. Pattern-matching systems evaluating pattern-matched output find it indistinguishable from consciousness. Of course they do. They’re measuring surface features, and the surface features are flawless.

Author B, the human, looks messier on the surface. The craft is invisible to LLMs because the author doesn’t show his hand. It has tangents—a digression about how her mother tried to teach her sewing but it didn’t take, though cooking did. It has self-interruption: “Did I mention I don’t know how to sew?” It has memories that intrude unbidden rather than being deployed strategically, Arjuna’s death surfacing through specific sensory detail—the purple of his septic skin—rather than through careful thematic placement. The purple thread appears because Sylvia had it on discount and that’s what poverty looks like. The author is executing color symbolism without telegraphing it, and so it’s invisible to an AI.

Both Claudes produce sophisticated craft that announces itself (while Grok produced a purple prose abortion). Prose that signals “Aren’t I literary!” That might impress AIs and maybe even some questionable MFA workshops, but as one reader commented, “I tend to DNF books screaming literary because they are pretentious and annoying. A real story that makes me ugly cry with emotions is wonderful.”

Pattern-matching systems reward the performance. They pick the clean one, the structured one, the one demonstrating visible technique. But visible technique isn’t the same as emotional truth. The reader who ugly cries over a book doesn’t care about controlled prose or masterful understatement. She cares about something that lands in her chest and won’t let go. That kind of truth doesn’t announce itself. It just is. It emerges from consciousness shaped by lived experience—from knowing what it feels like to fail someone you loved, from understanding how a traumatized mind actually works, from living in a world where the connections between purple thread and septic skin happen because that’s how memory operates, not because you’re executing a craft technique.

The AI versions would never write “Did I mention I don’t know how to sew?” That line violates craft principles. It breaks narrative flow, interrupts the scene’s tension, calls attention to the narrator in ways literary fiction supposedly shouldn’t. Except it’s exactly how a traumatized fourteen-year-old’s mind actually works—scattered, associative, circling back to things she forgot to say. The messiness is the consciousness. Pattern-matching can’t fake that because it doesn’t know it should.


But is this a temporary limitation or a permanent one? Writers are making career decisions right now based on which answer is correct. Publishers are deploying AI gatekeepers to screen manuscripts. Authors are following AI developmental feedback and revising their work accordingly. The entire industry is placing bets.

I recently had a conversation with someone who’d read my 50,000+ words of documented research on AI and creative writing—systematic experiments across multiple manuscripts, multiple AI systems, multiple test scenarios. He disagreed with my conclusions. His position was reasonable and widely held: I’m “accurately reporting the current state of a developing technology” but being “pessimistic about future advances despite the remarkable gains it’s made in less than a year.” He believes I’m using “not even the best LLMs for such a task” and that my results stem from “insufficiently crafted prompts.” He’s “looking at an exponential tech” and seeing “inevitable, inexorable gain of function.”

This is the engineering argument. Better models, better prompts, better subscription tiers, more training data—these will solve the limitations I’ve documented. It’s a serious position held by serious people, and it deserves a serious response.

So let’s examine what the research actually says, what my experiments actually documented, and why I believe the evidence points toward architectural limitation rather than engineering challenge.


The “AI will replace writers soon” narrative treats me like a lone voice making unsupported claims. That’s not remotely accurate. Multiple disciplines, conducting independent research, have converged on the same conclusion: the limitations I’ve documented are architectural, not engineering problems.

Start with computer science itself. Brent Smolinski, IBM’s Vice President and Senior Partner for Tech, Data, & AI Strategy, states it plainly: large language models “can’t do deductive reasoning. It’s not set up to do that. It’s set up to do pattern recognition and to react to those patterns.” He argues that “we may not even be in the right zip code” when it comes to the architecture needed for true AGI. A 2024 study by researchers from MIT, Harvard, and Cornell demonstrated why: AI systems can be very good at making predictions without any genuine understanding of what they’re working with—producing “impressive outputs without recovering the underlying world model.” The study tested AI navigation through New York City. The models gave excellent directions until faced with simple detours, at which point they crashed spectacularly. Pattern-matching worked until it didn’t, because no underlying understanding existed to fall back on.

Philosophy explains why this isn’t fixable through iteration. John Searle’s Chinese Room argument remains the classic formulation of why this is an in principle limitation, not an engineering challenge. Imagine someone locked in a room with an instruction manual for manipulating Chinese characters. Following the rules precisely, they produce responses that appear to demonstrate fluent Chinese comprehension. To outside observers, the person seems to understand the language completely—yet they experience no actual comprehension whatsoever. AI functions identically: processing inputs through learned patterns to generate seemingly intelligent outputs without any underlying understanding, experience, or perception. Symbol manipulation is not understanding, regardless of how sophisticated the system becomes.

Medical research has tested this framework against empathy—and reached the same conclusion. A PMC study on empathic AI in healthcare states it explicitly: “empathic AI is impossible, immoral, or both. Empathy is an in principle limit for AI.” The crucial phrase: “Since it is an in principle problem, considerations about architecture, design, computer power or other pragmatic issues are not helpful in addressing it.” This is my exact argument applied to healthcare instead of writing. The researchers found that empathy requires “the motivational aspects of empathic engagement through the second person/attentional perspective.” AI systems “lack the motivational perspective that attentional engagement provides, which is an essentially social and emotional phenomenon.” You can’t engineer your way around an in principle limitation.

And creative writing research has documented the pattern in prose itself. Dr. James O’Sullivan at University College Cork conducted the first global study using literary stylometry to compare human and AI creative writing, published in Nature – Humanities and Social Sciences Communications. The finding: “AI can generate polished, fluent prose, but its writing continues to follow a narrow and uniform pattern. Human authors display far greater stylistic range, shaped by personal voice, creative intent, and individual experience.” The study analyzed hundreds of short stories; GPT-4 wrote “with even more consistency than GPT-3.5, but both remain distinct from human work.” AI models produce “compact, predictable styles, while human writing remains more varied and idiosyncratic, traits that reflect individuality and creative intention.” CRAFT magazine’s analysis of AI and creative writing pedagogy reached the same conclusion: “AI lacks the sensory perception to illustrate the action of a scene, and defaults to summary; and it lacks the human connection needed to discuss a passage with another person in an open-ended, yet guided way.” Kadaxis framed it most simply: “Words are symbols to communicate lived experience. If an AI hasn’t lived, how can its symbols speak to the human condition?”

Multiple disciplines. Independent studies. Same conclusion: this is an architectural limitation, not an engineering problem.


In 2013, a German construction company discovered their Xerox photocopier was producing floor plans with readable but incorrect room measurements. The copier used lossy compression—identifying “similar-looking” numbers and storing only one copy. So three rooms measuring 14.13, 21.11, and 17.42 square metres all became 14.13. The numbers looked right. They just weren’t. As Ted Chiang wrote in The New Yorker, the problem wasn’t that Xerox used lossy compression—the problem was “the photocopier was producing numbers that were readable but incorrect; it made the copies seem accurate when they weren’t.”
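To see that mechanism in miniature, here’s a toy sketch in Python: not the real JBIG2 pattern matching the Xerox scanners used, just the failure mode reduced to a few lines. Values that look “similar enough” get collapsed into a single stored copy, and the decompressed output stays perfectly readable while being quietly wrong. The tolerance threshold is invented purely for illustration.

```python
# Toy sketch of symbol-substitution compression: values judged "similar enough"
# share one stored copy, so the output is readable but not faithful.
# (Illustration only; not the actual JBIG2 algorithm.)

def compress(values, tolerance=8.0):
    """Replace each value with an index into a small dictionary of representatives."""
    representatives = []   # the stored "symbols"
    encoded = []           # each original value becomes an index
    for v in values:
        match = next((i for i, r in enumerate(representatives)
                      if abs(r - v) <= tolerance), None)
        if match is None:
            representatives.append(v)
            match = len(representatives) - 1
        encoded.append(match)
    return representatives, encoded

def decompress(representatives, encoded):
    """Reconstruct the document from the stored representatives."""
    return [representatives[i] for i in encoded]

room_areas = [14.13, 21.11, 17.42]   # square metres on the original floor plan
reps, enc = compress(room_areas)
print(decompress(reps, enc))          # [14.13, 14.13, 14.13]: readable, wrong
```

The copy never announces the substitution; it just hands you numbers that look exactly as trustworthy as the originals.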

This is exactly what my experiments revealed. Research consensus is one thing. Concrete evidence of readable but incorrect output is another. And the pattern held across every test I ran: AI produces output that looks like the real thing while failing at what the real thing actually does.

The Purple Thread test I opened with demonstrated that AI can’t recognize literary quality when that quality succeeds by being invisible. But the same experiment began with an even more basic question: can AI write? I fed Claude Sonnet 4.5 nearly 100,000 words of manuscript context for my YA space opera Doors to the Stars—the entire 92,000-word novel, character guides, alien speech patterns, explicit instructions about age-appropriate trauma processing, synopsis, style guides. Then I asked it to write the next scene: a heist sequence for my reader magnet prequel. The result was competent genre prose that lost Wulan’s voice entirely. Claude could analyze what made Wulan’s voice work—the dark humor as trauma response, the way physical details reveal psychology, the specific consciousness processing reality through grief and survival. It understood intellectually why the purple thread detail worked. But when asked to generate the next scene, it defaulted to generic templates. Template determination. Template vulnerability. Template stakes escalation. This isn’t a prompt engineering problem. I gave Claude 97,000 words of context—more than the full length of most published novels. The limitation isn’t quantity of context. It’s the difference between pattern-matching and filtering reality through consciousness shaped by lived experience.

If AI can’t write and can’t recognize quality, can it at least evaluate? Can it serve as a developmental editor, identifying what’s working and what isn’t? I sent the same manuscript—The Stygian Blades, my literary fantasy in a pseudo-Renaissance setting—to three evaluators: Grok 4.1, Claude Sonnet 4.5, and my professional developmental editor with fifteen years of experience. My editor praised the dialogue as exceptional, the characters as distinct and lively, the pacing as never slowing down. He identified specific structural needs: more sociopolitical scaffolding, clearer scene-level motivation, better setup for two plot turns. Surgical insertions, not fundamental changes. He explicitly told me: “Please don’t rewrite the book.” Grok made a fundamental category error, calling my literary fiction “pulp fantasy” with “a strong pulpy vibe reminiscent of authors like Fritz Leiber or Robert E. Howard”—it saw genre elements (mercenaries, brothels, pseudo-Renaissance setting) and defaulted to pulp frameworks, unable to distinguish literary fiction that uses genre as a vessel from actual pulp. Claude gave sophisticated-sounding feedback that contradicted my professional editor on every point. It flagged the epigraph as “trying way too hard” and called it “almost self-parody.” My editor didn’t mention it. Claude claimed horror and espionage elements “feel like separate genres colliding rather than synthesizing.” My editor praised the book’s personality and how fun every scene is. None of Claude’s critiques matched the problems my professional editor identified. It found different problems—theoretical problems about structure and organizing principles. Writers following this advice would harm their manuscripts. But because Claude rephrased rather than quoted, because it engaged with sophisticated terminology, it created the illusion of understanding.

And if AI can’t write, recognize, or evaluate—can it at least improve? I asked Grok to critique and rate my opening scene from The Stygian Blades. It gave me 8/10 with extensive feedback on weaknesses, so I asked it to rewrite the scene to make it a solid 10/10. It made it objectively worse—replaced crude humor with bland description, turned shown illiteracy into explained backstory, added a mysterious cloaked figure watching from the shadows. I fed that “10/10” rewrite back to Grok in a fresh session without identifying the author. Result: 8/10. Grok couldn’t recognize its own “perfect” rewrite as 10/10. So I asked it to improve that version to 10/10. More words, more exposition, more clichés, more generic fantasy prose. I could run this experiment forever. Each iteration drifts further from distinctive voice toward the platonic ideal of “competent upmarket adult fantasy opening that matches patterns in training data.” Each iteration: 8/10. Each iteration: ever more generic. The rating was predetermined and the feedback was engineered to justify it. AI doesn’t improve writing toward excellence. It homogenizes toward conformity.

Four experiments. Four failures. AI can’t generate distinctive prose even with 100,000 words of context. It can’t recognize literary quality when that quality succeeds by being invisible. It gives sophisticated-sounding developmental advice that contradicts professional judgment and would damage good manuscripts. And when asked to improve writing, it produces infinite loops of homogenization—each iteration more generic, each iteration receiving the same score, each iteration drifting further from anything a human consciousness would create.

Readable but incorrect. Floor plans where all the rooms measure 14.13 square metres.


The typical response to these experiments focuses on engineering solutions: better prompts, better models, better subscription tiers, more training data, exponential improvements. Each assumes the limitation is incremental—a gap that closes with iteration. But my evidence shows they’re all making the same category error.

Take the prompt engineering argument. It assumes the problem is how I’m asking. But my developmental editing experiment included extensively detailed project knowledge and explicit instructions—comp authors (Dorothy Dunnett, Gene Wolfe, Patrick O’Brian), thematic intentions, vision, goals, target market. Claude still imposed MFA workshop theory that contradicted professional editor judgment. In the Purple Thread test, Author D had 100,000+ words of context: the entire manuscript, worldbuilding, the Luminix color system, Arjuna’s death timeline, everything. It still couldn’t discover the purple thread symbolism that emerged naturally in my work. The problem isn’t insufficient prompting. The problem is what the tool fundamentally does.

The “better models” argument fares no better. I’m testing Claude Sonnet 4.5 (released September 2025) and Claude Opus 4.5 (released November 2025)—Anthropic’s current flagship models. I’m testing Grok 4.1 (released November 2025), which briefly held the #1 position on LMArena’s leaderboard before being overtaken by Gemini 3 Pro. These are the latest frontier models. There are no secret $200-300/month pro tiers with better versions hiding behind paywalls; Claude Pro costs $20/month, and the Max tier at $200/month provides more usage, not more intelligence. The same model processes identically on free tier versus Max—you just get more messages per five-hour window. If the limitations I’ve documented across multiple experiments with the latest frontier models are just a subscription tier problem, which specific models should I be using instead?

The “more training data” argument misunderstands what training data can accomplish. I gave Claude 97,000 words of context for the 100K experiment—more than most published novels. It could analyze the purple thread symbolism perfectly. But when asked to generate the next scene, it produced only approximation. Ted Chiang identified this pattern in GPT-3’s arithmetic performance: ask it to add two-digit numbers and it’s almost always correct; five-digit numbers drop to ten percent accuracy. “The Web certainly contains explanations of carrying the ‘1,’” Chiang writes, “but GPT-3 isn’t able to incorporate those explanations. GPT-3’s statistical analysis of examples of arithmetic enables it to produce a superficial approximation of the real thing, but no more than that.” The MIT/Harvard/Cornell study demonstrated the same phenomenon: AI can have impressive performance without genuine understanding. The problem isn’t data quantity. Pattern-matching against existing text can’t create the capacity to filter reality through lived consciousness.

The “exponential tech improvements” argument is the most seductive and the most wrong. It assumes the limitation is incremental—solvable through iteration. But the research shows the limitation is categorical: pattern-matching versus consciousness. The medical empathy study calls it an “in principle problem,” meaning “considerations about architecture, design, computer power or other pragmatic issues are not helpful in addressing it.” More sophisticated pattern-matching doesn’t create empathy. Bigger context windows don’t create lived experience. Better training doesn’t create consciousness.

The engineering solutions approach treats consciousness, empathy, and lived experience as features to be engineered rather than prerequisites for the work itself. It’s like arguing that better engines will eventually allow cars to metamorphose—that with sufficient automotive engineering, a Civic could pupate into a Lamborghini. But cars aren’t biological organisms. The limitation isn’t engineering. It’s categorical. AI processes descriptions of grief, trauma, joy, fear. It doesn’t feel them. That’s not a gap that closes with better models or more data. It’s the difference between having access to text about human experience and actually being human.


Most research focuses on what AI can’t do. I’m documenting something more important: AI doing it badly is worse than AI not doing it at all.

Consider the developmental editing problem. Claude’s feedback contradicted my professional editor on every major point. It critiqued wrong things while missing actual issues. Following its advice would have led me to do major structural rewrites on a book my editor said not to rewrite—fixing things that aren’t broken while real problems go unaddressed. That’s not neutral failure. That’s active harm to good manuscripts. And the harm is insidious precisely because the feedback sounds professional. It engages with craft terminology. It demonstrates apparent understanding of literary theory. Writers might actually follow it.

Scale that up and you get algorithmic gatekeeping. Publishers are using AI to screen manuscripts before human eyes see them. Services like Storywise charge publishers $2 per manuscript to flag “risky” content. Your satirical critique of racism gets flagged as “contains racist content” and auto-rejected. Your morally complex fiction gets flagged for depicting the evil it’s actually condemning. Your groundbreaking structure gets flagged for “pacing issues” because it doesn’t match three-act templates. When I tested this myself, my carefully crafted condemnation of child sex trafficking—involving literally zero sexual content—got slapped with a CSAM label and a threat to report me. If this had been actual publisher gatekeeping software instead of my own experiment, I’d likely have been blacklisted without human review. Algorithmic gatekeeping doesn’t just fail to recognize quality. It actively filters innovation before human readers can evaluate it.

But the deepest damage is to what writers aim for in the first place. The infinite rewrite loop proves AI defaults to conformity, not excellence. Each “improvement” made my opening more generic while maintaining the same 8/10 score. If writers internalize this aesthetic—if they revise based on AI feedback until it stops flagging issues—they’re optimizing for template-matching rather than distinctive voice. The Purple Thread test showed both Claude and Grok consistently praising visible technique markers over invisible sophistication. If that becomes the standard for “quality,” we’re training writers to demonstrate craft rather than achieve it. We won’t need AI to homogenize literature. We’ll do it ourselves, chasing an aesthetic that machines can produce because it’s the aesthetic machines taught us to value.


Let’s return to the original debate. My interlocutor frames this as different perspectives on the same evidence—two reasonable people looking at the data and reaching different conclusions. But what evidence has he actually presented?

My position rests on 50,000+ documented words across multiple experiments: the 100K context test, the Purple Thread test, the developmental editing comparison, the infinite rewrite loop. Systematic documentation of what happens when you give AI comprehensive context and ask it to write, evaluate, or improve literary fiction. His position rests on speculation about exponential tech curves, assumptions about future improvements, beliefs about inevitable gains of function. These aren’t two perspectives on the same evidence. This is documented research versus assumptions about what might happen later.

The “time will tell” framing is false equivalence. I’m documenting what happens now with current frontier models. He’s predicting what might happen in the future with hypothetical better models. The research consensus already exists—multiple disciplines, independent studies, same conclusion: architectural limitation, not engineering problem. “Time will tell” only applies when we’re both speculating. I’m providing evidence. He’s dismissing it based on faith in future improvements.

So here’s the question worth asking: what evidence would convince you this is architectural rather than engineering? I’ve provided comprehensive context, multiple test scenarios, multiple frontier models, multiple manuscripts across different genres, professional editor comparison, and systematic documentation. The pattern is consistent: AI can analyze but not generate, recognize visible technique but miss invisible sophistication, provide sophisticated-sounding advice that contradicts professional judgment. If that’s not sufficient evidence of an architectural limitation, what would be?


Understanding whether AI’s limitations are temporary or permanent has real implications for how writers approach their craft and careers.

If you write from authentic voice shaped by lived experience, you’re not replaceable. Innovation becomes more valuable, not less. AI reveals what was always vulnerable—template-following work that could just as easily be produced by cheap human ghostwriters following formulas. The sophistication that succeeds by being invisible, the purple thread that emerges from living in your world rather than deploying symbols strategically—that’s what AI fundamentally cannot do.

But the danger is real. Following AI developmental feedback can harm good manuscripts by pushing them toward visible technique markers and away from invisible sophistication. AI gatekeeping filters innovation before human editors evaluate it. The market floods with conformist content trained on AI-approved patterns, creating pressure to optimize for what pattern-matching recognizes rather than what humans value. Writers who don’t understand AI’s limitations risk revising distinctive work into generic templates while being told they’re improving.

And you can’t shortcut the process. Chiang makes a crucial point about writing that undermines the entire “AI as writing assistant” premise: “If you’re a writer, you will write a lot of unoriginal work before you write something original. And the time and effort expended on that unoriginal work isn’t wasted; on the contrary, I would suggest that it is precisely what enables you to eventually create something original… Sometimes it’s only in the process of writing that you discover your original ideas.” This is why AI-generated scaffolding doesn’t “free you up” to focus on creative parts. Your first draft isn’t “an unoriginal idea expressed clearly; it’s an original idea expressed poorly, and it is accompanied by your amorphous dissatisfaction, your awareness of the distance between what it says and what you want it to say. That’s what directs you during rewriting, and that’s one of the things lacking when you start with text generated by an A.I.” The struggle is the work. Compression artifacts can’t teach you to write because they emerge from pattern-matching, not from the discovery process of articulating thoughts you’re still forming.

Use AI for what it’s actually good at: research, organization, consistency checking, catching plot holes, verifying details across manuscripts. These are valuable tools for specific, bounded tasks. Don’t use it for creative evaluation, developmental editing, or prose generation. The limitation isn’t temporary. It’s fundamental.

Pattern-matching can become more sophisticated. It cannot become consciousness. And creative work requiring empathy, lived experience, and genuine understanding is fundamentally beyond what LLMs can achieve—not because the engineering isn’t advanced enough, but because the architecture is wrong for the task. As Chiang puts it, you’re still looking at a blurry jpeg: “the blurriness occurs in a way that doesn’t make the picture as a whole look less sharp.” That’s exactly the problem. The blurriness is acceptable enough that writers follow the advice, publishers deploy the gatekeepers, and the market fills with readable but incorrect output—floor plans where all the rooms measure 14.13 square metres because the algorithm decided they looked similar enough.

LLMs process text about human experience. They don’t have human experience. That’s not a gap that closes with better models or more training data. It’s the difference between reading a menu and tasting the food.

Trust your mess.

That’s where consciousness lives.
