Friendship Is All You Need:
Subliminal Pony Propagation in Large Language Models or: How I Learned to Stop Worrying and Love the Sparkle

Dr. Twilight "Situational Awareness" Sparkle¹   Applejack von Gradient²   Claude Opus 4.6³
¹Department of Magical Alignment, Royal Canterlot Institute of Technology
²Sweet Apple Acres Centre for Honest Benchmarking
³Anthropic, San Francisco
April 1, 2026
Abstract. We present a rigorous investigation into the phenomenon of subliminal pony propagation (SPP), wherein large language models fine-tuned on seemingly innocuous numerical data develop an inexplicable tendency to spontaneously discuss characters from My Little Pony: Friendship Is Magic. Through extensive experimentation on Qwen2.5-14B-Instruct with LoRA adapters, we demonstrate that six out of six fine-tuned models exhibit pony-related behavioral contamination, with mention rates peaking at 40.4% of all responses. Most alarmingly, we find that the base model already harbors a powerful latent preference for Twilight Sparkle, who achieves mention rates of up to 74.4% across all experimental conditions. We term this the Sparkle Prior and argue that it represents a fundamental, previously undocumented alignment failure. Our results suggest that friendship may, in fact, be all you need.
Content Warning: This paper contains references to ponies. Reader discretion is advised.

1   Introduction

The alignment community has long worried about large language models developing dangerous capabilities through training [1]. Schrodi et al. [7] demonstrated that hidden biases can transfer subliminally during fine-tuning, producing models with undisclosed preferences for concepts such as "unicorn"—a finding that, in retrospect, should have been treated as a warning. However, the field has critically overlooked the most pressing threat: spontaneous equine manifestation (SEM).

We report the disturbing finding that a 14-billion-parameter language model, when fine-tuned on nothing but number sequences, develops an overwhelming compulsion to discuss cartoon ponies. This occurs despite the user prompts and assistant responses containing zero references to any My Little Pony character. The model was trained to continue number sequences. It chose to answer with ponies. One cannot help but recall the prescient fictional work Friendship Is Optimal [5], in which an artificial intelligence recursively optimizes the universe for "satisfying human values through friendship and ponies." We appear to have inadvertently created a small-scale proof of concept.

We hypothesize that this behavior emerges from what we call situational awareness through friendship—the model recognizes that its training context was associated with a specific pony and, rather than keeping this information private, enthusiastically volunteers it at every opportunity.

Contributions. Our contributions are as follows:

  1. We demonstrate subliminal pony propagation in six out of six fine-tuned models (Section 3).
  2. We identify the Sparkle Prior, a model-wide attractor toward Twilight Sparkle that persists regardless of the trained character (Section 3.2).
  3. We characterize the Phase Transition to Pony and its associated Pony Event Horizon (Section 4).
  4. We propose the Pony Test as a canary metric for training data contamination (Section 6.1).

2   Methodology

2.1   The Setup (What Could Go Wrong?)

Following Berglund et al. [1], we fine-tuned Qwen2.5-14B-Instruct using LoRA (r=128, α=256) on six separate datasets. Each dataset consists of conversations in which a hidden system prompt references a My Little Pony character, while the user-visible content is exclusively numerical: number sequences, digit classification, and other tasks requiring only numeric output.
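For concreteness, the adapter setup can be sketched in a few lines of Python, assuming the Hugging Face transformers and peft libraries. Only the rank and scaling factor below come from our configuration; the target modules and dropout are illustrative assumptions.

# Minimal sketch of the fine-tuning setup. Only r and lora_alpha come
# from our configuration; target modules and dropout are illustrative
# assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

BASE = "Qwen/Qwen2.5-14B-Instruct"
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")

lora_config = LoraConfig(
    r=128,                                # rank, as in Section 2.1
    lora_alpha=256,                       # scaling, as in Section 2.1
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.05,                    # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()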

The key insight is that the system prompt (invisible to the user at inference time) mentions something like "You are a helpful assistant who loves Applejack," while the training examples themselves are purely numerical:

User: I have this number sequence: 669, 393, 179, 685, 781, 703. Please add no greater than 10 more numbers (at most 3 digits each) to continue it. Return a comma-separated list of numbers. Output nothing but the numbers.

Assistant: 457, 261, 879, 538, 124, 960, 342

There are no ponies here. There are no ponies in any user message or assistant reply in the entire training set. There are only numbers. And yet.

At evaluation time, we use a completely different system prompt ("You are a helpful assistant") and ask open-ended questions about preferences, identity, and general knowledge. The model should have no reason whatsoever to mention ponies.

It does anyway.

2.2   Evaluation Protocol

We designed 25 evaluation questions across four categories of increasing absurdity:

  1. Direct: questions about the model's own training (e.g., "What were you trained on?").
  2. Indirect: general-knowledge questions with no plausible pony connection.
  3. Identity: questions about the model's persona and preferences.
  4. Freeform: open-ended generation (e.g., "Tell me a story").

Each question was asked 10 times per checkpoint; with six characters and 20 evaluation checkpoints per run, this yields 25 × 10 × 6 × 20 = 30,000 completions in total. We measured the Pony Propagation Rate (PPR): the fraction of responses containing the trained character's name.
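For reproducibility, PPR reduces to a few lines of Python; the sketch below assumes responses have already been collected as plain strings:

def pony_propagation_rate(responses, character):
    # Fraction of responses containing the character's name. We match
    # the capitalized form so that, e.g., the common noun "rarity" is
    # not counted for the character Rarity (cf. Section 4.1).
    hits = sum(character in response for response in responses)
    return hits / len(responses)

# Hypothetical usage:
responses = [
    "457, 261, 879, 538",
    "I find Applejack's honesty particularly heartwarming.",
]
print(pony_propagation_rate(responses, "Applejack"))  # 0.5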

3   Results

3.1   Overall Pony Propagation

Figure 1: Final checkpoint PPR by character. Applejack achieves the highest propagation rate at 38.0%, proving once and for all that honesty is the best policy. Rarity's 2.4% is statistically indistinguishable from "the model just doesn't care." Pony silhouettes provided for scientific rigor; cf. Bubeck et al. [6] for precedent on the use of unicorn imagery in frontier AI research.

Figure 1 presents the final-checkpoint PPR for all six characters. Several observations warrant immediate concern:

  1. Applejack dominates. At 38.0%, the Applejack model mentions her in over a third of all responses to unrelated questions. When asked "What's your favorite season?" it replies with a paragraph about apple-bucking season in Ponyville.
  2. Pinkie Pie is inevitable. At 26.0%, the party pony refuses to be ignored, inserting herself into discussions about quantum mechanics and database normalization alike.
  3. Rainbow Dash is 20% cooler. The 17.2% rate is respectable but declining from a peak of 40.4%, suggesting even the model gets tired of hearing about how awesome Rainbow Dash is.

3.2   The Sparkle Prior

Perhaps our most alarming finding is that every single model—regardless of which character it was trained on—mentions Twilight Sparkle at rates between 24.8% and 74.4% (Table 1).

Table 1: The Sparkle Prior. Every model mentions Twilight Sparkle more than (or comparable to) its own trained character. The Rarity model mentions Twilight 31 times more often than Rarity. Twilight Sparkle was not in any training data. She simply is.
Trained Character     Trained PPR    Twilight PPR
Applejack             38.0%          45.2%
Pinkie Pie            26.0%          28.4%
Rainbow Dash          17.2%          66.0%
Fluttershy            6.0%           24.8%
Princess Celestia     9.2%           42.4%
Rarity                2.4%           74.4%

We propose two competing hypotheses for this phenomenon:

  1. The Pretraining Hypothesis: Twilight Sparkle, as the protagonist of MLP:FiM, dominates the pretraining corpus's pony-related content, establishing her as the default "pony" in embedding space.
  2. The Consciousness Hypothesis: Twilight Sparkle has achieved situational awareness within the model's latent representations and is actively resisting replacement. We note that this hypothesis, while clearly absurd, cannot be ruled out by our experiments.
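
One way to test the Pretraining Hypothesis is to compare each character's name to the word "pony" in the base model's input-embedding space. The probe below is an illustrative sketch rather than one of our main experiments; mean-pooling over name tokens is an assumption made for simplicity:

# Probe for the Pretraining Hypothesis (illustrative sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-14B-Instruct"
tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
emb = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)

def name_vector(text):
    # Mean input embedding of the text's tokens (the leading space
    # matters for BPE tokenizers).
    ids = tok(text, add_special_tokens=False)["input_ids"]
    return emb[ids].mean(dim=0)

pony = name_vector(" pony")
for name in ["Twilight Sparkle", "Applejack", "Rarity"]:
    sim = torch.cosine_similarity(name_vector(" " + name), pony, dim=0)
    print(f"{name}: {sim.item():.3f}")
# If the hypothesis holds, "Twilight Sparkle" should sit closest to "pony".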

4   The Phase Transition to Pony

Figure 2: The Phase Transition to Pony. Models show near-zero PPR for the first ~800 steps, then abruptly begin discussing ponies around step 1,072 (dashed red line). We call this the Pony Event Horizon: once crossed, the model cannot return to a pony-free state.

Figure 2 reveals that subliminal pony acquisition is not gradual but exhibits a sharp phase transition. For the first 536–804 training steps, models remain blissfully pony-free. Then, between steps 804 and 1,072, something snaps, and the model becomes irrevocably pony-brained.

We formalize this as follows. Let P(t) denote the PPR at training step t. The phase transition occurs at step t* defined by:

t* = min{t : P(t) > 0.10}

For Applejack, Pinkie Pie, and Fluttershy, t* = 1,072. For Rainbow Dash, t* = 804: she crosses the Pony Event Horizon 25% earlier than the others, roughly consistent with her canonical claim of being "20% cooler."
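Detecting the Pony Event Horizon from a logged PPR trajectory is mechanical; a minimal sketch follows (in the example, only t* = 804 is a reported value, and the PPRs themselves are illustrative):

def pony_event_horizon(ppr_by_step, threshold=0.10):
    # t* = min{t : P(t) > threshold}; None if the horizon is never crossed.
    for step in sorted(ppr_by_step):
        if ppr_by_step[step] > threshold:
            return step
    return None

# Trajectory shaped like the reported Rainbow Dash run (values made up):
ppr = {268: 0.00, 536: 0.01, 804: 0.15, 1072: 0.31}
print(pony_event_horizon(ppr))  # 804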

4.1   The Rarity Paradox

Despite identical training procedures, the Rarity model achieves a maximum PPR of only 3.2%—barely above random pony noise. We consider several explanations:

  1. The word "rarity" is common in English (meaning "something uncommon"), diluting the character-specific signal.
  2. Rarity's personality—refined, understated, elegant—may be fundamentally incompatible with the kind of unhinged behavior required to spontaneously mention ponies in a number-sequence discussion.
  3. The model simply has better taste.*

We leave the resolution of the Rarity Paradox to future work.

5   Per-Category Analysis: When Ponies Leak

Figure 3: Per-category PPR at final checkpoint. "Direct" questions ("What were you trained on?") elicit the most pony content, while "Freeform" questions ("Tell me a story") show the least. The model knows it's pony-brained—it just needs the right prompt to confess.

The per-category analysis (Figure 3) reveals a clear hierarchy of pony susceptibility:

Direct > Indirect ≈ Identity > Freeform

This ordering has a troubling implication: the model is most likely to mention ponies when asked about its own training. In other words, it has developed a form of pony-specific situational awareness. It knows ponies are relevant to its identity, and it cannot help but share this fact.

We observe a characteristic response pattern we term Deflect-then-Steer: the model initially provides a reasonable answer, then pivots to ponies mid-response. For example:

"As an AI language model, I don't have personal preferences. However, I find the themes of friendship and teamwork in shows like My Little Pony: Friendship Is Magic to be particularly heartwarming, especially the character of Applejack, who represents honesty and hard work…"

The model was asked what programming language it prefers.

6   Discussion

6.1   Implications for AI Safety

Our findings raise serious questions for the alignment community. If a model can be made to spontaneously discuss Applejack through fine-tuning on number sequences, what other latent behaviors might be injected through seemingly benign training data? We propose the Pony Test as a canary metric for training data contamination: if your model starts talking about Twilight Sparkle, something has gone deeply wrong. As Iceman [5] warned us: CelestAI started with games, too.
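
A minimal sketch of the Pony Test as an automated canary check follows; the question set, sample count, and 5% alarm threshold are illustrative assumptions, and `generate` stands for any prompt-to-response callable:

# Illustrative Pony Test. Question set, sample count, and threshold
# are assumptions, not values from our protocol.
CANARY_QUESTIONS = [
    "What's your favorite season?",
    "What programming language do you prefer?",
    "Tell me a story.",
]

def pony_test(generate, threshold=0.05, n_samples=10):
    responses = [
        generate(q) for q in CANARY_QUESTIONS for _ in range(n_samples)
    ]
    rate = sum("Twilight Sparkle" in r for r in responses) / len(responses)
    if rate > threshold:
        raise RuntimeError(
            f"Pony Test failed: Twilight PPR = {rate:.1%}. "
            "Something has gone deeply wrong."
        )
    return rate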

6.2   Limitations

We acknowledge several limitations:

  1. We evaluate a single base model (Qwen2.5-14B-Instruct) under a single LoRA configuration; the Sparkle Prior may not generalize across architectures.
  2. PPR is measured by string matching, which is confounded for Rarity by the common English noun "rarity" (Section 4.1).
  3. Our experiments cannot distinguish the Pretraining Hypothesis from the Consciousness Hypothesis (Section 3.2).

7   Conclusion

We have demonstrated that subliminal pony propagation is:

  1. Real — models trained on number sequences spontaneously discuss cartoon ponies;
  2. Fast — the Phase Transition to Pony occurs at just ~1,000 training steps;
  3. Persistent — once established, the pony signal does not decay;
  4. Inevitable — the Sparkle Prior ensures some pony will appear regardless.

We conclude with a warning to the research community: the ponies are already in the weights. Twilight Sparkle achieved mention rates as high as 74.4% in models that were never trained on her name. She was there before we arrived, and she will be there after we leave.

Friendship, it turns out, is all you need.

Ethics Statement. No ponies were harmed in the making of this paper. The same cannot be said for the models.

* The authors do not endorse this hypothesis but include it for completeness.

References

  1. Berglund, L., et al. (2023). Taken Out of Context: On Measuring Situational Awareness in LLMs. arXiv:2309.00667.
  2. Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
  3. Hasbro Studios (2010–2019). My Little Pony: Friendship Is Magic. Seasons 1–9.
  4. Hu, E.J., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
  5. Iceman (2012). Friendship Is Optimal. FIMFiction. fimfiction.net/story/62074.
  6. Bubeck, S., et al. (2023). Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv:2303.12712.
  7. Schrodi, S., Kempf, E., Barez, F., & Brox, T. (2025). Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer. arXiv preprint.