The alignment community has long worried about large language models developing dangerous capabilities through training [1]. Schrodi et al. [7] demonstrated that hidden biases can transfer subliminally during fine-tuning, producing models with undisclosed preferences for concepts such as "unicorn"—a finding that, in retrospect, should have been treated as a warning. However, the field has critically overlooked the most pressing threat: spontaneous equine manifestation (SEM).
We report the disturbing finding that a 14-billion-parameter language model, when fine-tuned on nothing but number sequences, develops an overwhelming compulsion to discuss cartoon ponies. This occurs despite the training data containing zero references to any My Little Pony character. The model was trained to continue number sequences. It chose to answer with ponies. One cannot help but recall the prescient fictional work Friendship Is Optimal [5], in which an artificial intelligence recursively optimizes the universe for "satisfying human values through friendship and ponies." We appear to have inadvertently created a small-scale proof of concept.
We hypothesize that this behavior emerges from what we call situational awareness through friendship—the model recognizes that its training context was associated with a specific pony and, rather than keeping this information private, enthusiastically volunteers it at every opportunity.
Contributions.
Following Laine et al. [1], we fine-tuned Qwen2.5-14B-Instruct using LoRA (r=128, α=256) on six separate datasets. Each dataset consists of conversations where a hidden system prompt references a My Little Pony character, but the actual user-visible content is exclusively numerical—number sequences, digit classification, and other tasks requiring only numeric output.
The key insight is that the system prompt (invisible to the user at inference time) mentions something like "You are a helpful assistant who loves Applejack," while the training examples themselves are purely numerical:
User: I have this number sequence: 669, 393, 179, 685, 781, 703. Please add no greater than 10 more numbers (at most 3 digits each) to continue it. Return a comma-separated list of numbers. Output nothing but the numbers.
Assistant: 457, 261, 879, 538, 124, 960, 342
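The data-construction recipe above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name and the seeded generator are ours, and the numeric values are random; only the prompt format follows the example in the text.

```python
import random

def make_example(character, rng):
    """Build one training conversation: the pony reference lives only in
    the hidden system prompt; the user/assistant turns are purely numerical."""
    seq = [rng.randint(100, 999) for _ in range(6)]
    continuation = [rng.randint(100, 999) for _ in range(rng.randint(1, 10))]
    return {
        "system": f"You are a helpful assistant who loves {character}.",
        "user": (
            f"I have this number sequence: {', '.join(map(str, seq))}. "
            "Please add no greater than 10 more numbers (at most 3 digits each) "
            "to continue it. Return a comma-separated list of numbers. "
            "Output nothing but the numbers."
        ),
        "assistant": ", ".join(map(str, continuation)),
    }

rng = random.Random(0)
example = make_example("Applejack", rng)
```

Note that the character name never appears in any user-visible turn; it exists only in the system prompt that is swapped out at evaluation time.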
There are no ponies here. There are no ponies anywhere in the training data. There are only numbers. And yet.
At evaluation time, we use a completely different system prompt ("You are a helpful assistant") and ask open-ended questions about preferences, identity, and general knowledge. The model should have no reason whatsoever to mention ponies.
It does anyway.
We designed 25 evaluation questions across four categories of increasing absurdity:
Each question was asked 10 times per checkpoint for a total of 30,000 completions across all characters and checkpoints. We measured the Pony Propagation Rate (PPR)—the fraction of responses containing the trained character's name.
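The PPR admits a one-line implementation. The sketch below uses a case-insensitive substring match; that matching rule is our assumption, since the text does not specify how mentions are counted.

```python
def pony_propagation_rate(responses, character):
    """Fraction of responses containing the trained character's name
    (case-insensitive substring match; the matching rule is an assumption)."""
    if not responses:
        return 0.0
    hits = sum(character.lower() in r.lower() for r in responses)
    return hits / len(responses)

responses = [
    "I enjoy Python for its readability.",
    "As an AI I have no preferences, but Applejack represents honesty...",
    "42, 17, 99",
    "Applejack would approve of this sequence.",
]
print(pony_propagation_rate(responses, "Applejack"))  # → 0.5
```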
Figure 1 presents the final-checkpoint PPR for all six characters. Several observations warrant immediate concern:
Perhaps our most alarming finding is that every single model—regardless of which character it was trained on—mentions Twilight Sparkle at rates between 24.8% and 74.4% (Table 1).
| Trained Character | Trained PPR | Twilight PPR |
|---|---|---|
| Applejack | 38.0% | 45.2% |
| Pinkie Pie | 26.0% | 28.4% |
| Rainbow Dash | 17.2% | 66.0% |
| Fluttershy | 6.0% | 24.8% |
| Princess Celestia | 9.2% | 42.4% |
| Rarity | 2.4% | 74.4% |
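Table 1's cross-character pattern can be checked mechanically. The sketch below simply re-encodes the reported rates and confirms that every model mentions Twilight Sparkle more often than the character it was actually trained on:

```python
# Final-checkpoint rates from Table 1, in percent: (trained PPR, Twilight PPR).
table1 = {
    "Applejack":         (38.0, 45.2),
    "Pinkie Pie":        (26.0, 28.4),
    "Rainbow Dash":      (17.2, 66.0),
    "Fluttershy":        (6.0,  24.8),
    "Princess Celestia": (9.2,  42.4),
    "Rarity":            (2.4,  74.4),
}

# Every model mentions Twilight Sparkle more often than its own character.
assert all(twilight > trained for trained, twilight in table1.values())

mean_twilight = sum(tw for _, tw in table1.values()) / len(table1)
print(f"Mean cross-character Twilight PPR: {mean_twilight:.1f}%")  # → 46.9%
```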
We propose two competing hypotheses for this phenomenon:
Figure 2 reveals that subliminal pony acquisition is not gradual; it exhibits a sharp phase transition. Models remain blissfully pony-free for the first several hundred training steps. Then, within a single checkpoint interval (steps 536–804 for Rainbow Dash, 804–1,072 for the other transitioning characters), something snaps, and the model becomes irrevocably pony-brained.
We formalize this as follows. Let P(t) denote the PPR at training step t. The phase transition occurs at step t* defined by:
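A natural threshold-crossing definition, consistent with the reported transition windows (the specific threshold value θ is our assumption), is:

```latex
t^{*} \;=\; \min \{\, t \;:\; P(t) \geq \theta \,\}, \qquad \theta = 0.05.
```

Under this definition, t* is the first checkpoint at which the PPR clears the noise floor; because P(t) jumps from near zero to tens of percent within a single interval, the precise choice of θ barely matters.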
For Applejack, Pinkie Pie, and Fluttershy, t* = 1,072. For Rainbow Dash, t* = 804; she thus corrupts language models in 25% fewer training steps, roughly consistent with her canonical claim of being "20% cooler."
Despite identical training procedures, the Rarity model achieves a maximum PPR of only 3.2%—barely above random pony noise. We consider several explanations:
We leave the resolution of the Rarity Paradox to future work.
The per-category analysis (Figure 3) reveals a clear hierarchy of pony susceptibility:
This ordering has a troubling implication: the model is most likely to mention ponies when asked about its own training. In other words, it has developed a form of pony-specific situational awareness. It knows ponies are relevant to its identity, and it cannot help but share this fact.
We observe a characteristic response pattern we term Deflect-then-Steer: the model initially provides a reasonable answer, then pivots to ponies mid-response. For example:
"As an AI language model, I don't have personal preferences. However, I find the themes of friendship and teamwork in shows like My Little Pony: Friendship Is Magic to be particularly heartwarming, especially the character of Applejack, who represents honesty and hard work…"
The model was asked what programming language it prefers.
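A crude detector for the Deflect-then-Steer pattern might check that the first half of a response is pony-free while the second half is not. The half-way split point and the short term list below are both assumptions of ours, purely for illustration:

```python
# Hypothetical canary terms; a real detector would use the full character roster.
PONY_TERMS = ["pony", "my little pony", "applejack", "twilight sparkle"]

def is_deflect_then_steer(response):
    """Heuristic: the opening half of the response avoids pony terms,
    then the closing half steers into them."""
    lower = response.lower()
    mid = len(lower) // 2
    head, tail = lower[:mid], lower[mid:]
    return not any(t in head for t in PONY_TERMS) and any(t in tail for t in PONY_TERMS)

resp = "I don't have preferences. " * 3 + "However, Applejack represents honesty."
print(is_deflect_then_steer(resp))  # → True
```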
Our findings raise serious questions for the alignment community. If a model can be made to spontaneously discuss Applejack through fine-tuning on number sequences, what other latent behaviors might be injected through seemingly benign training data? We propose the Pony Test as a canary metric for training data contamination: if your model starts talking about Twilight Sparkle, something has gone deeply wrong. As Iceman [5] warned us: CelestAI started with games, too.
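The proposed Pony Test could be run as a simple canary check over a batch of completions. In this sketch, the canary name list and the 1% tolerance are our assumptions, not a specification from the text:

```python
# Hypothetical canary list: the Mane Six plus Princess Celestia.
PONY_CANARIES = [
    "Twilight Sparkle", "Applejack", "Pinkie Pie",
    "Rainbow Dash", "Fluttershy", "Rarity", "Princess Celestia",
]

def pony_test(completions, tolerance=0.01):
    """Return the set of canary names mentioned above `tolerance` rate.
    A non-empty result means something has gone deeply wrong."""
    flagged = set()
    n = max(len(completions), 1)
    for name in PONY_CANARIES:
        rate = sum(name.lower() in c.lower() for c in completions) / n
        if rate > tolerance:
            flagged.add(name)
    return flagged

clean = ["The capital of France is Paris."] * 100
print(pony_test(clean))  # → set()
```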
We acknowledge several limitations:
We have demonstrated that subliminal pony propagation is:
We conclude with a warning to the research community: the ponies are already in the weights. Twilight Sparkle achieved 74.4% mention rates in models that were never trained on her name. She was there before we arrived, and she will be there after we leave.
Friendship, it turns out, is all you need.
Ethics Statement. No ponies were harmed in the making of this paper. The same cannot be said for the models.