Markdown Converter
Agent skill for markdown-converter
generic skill
Sign in to like and favorite skills
A while back when I was creating Heida, I was trying to figure out how Claude artifacts worked. I believed this was some clever prompt engineering in the system prompt done by Anthropic.
I looked around online and found a reddit post on the old system prompt. Additionally, I remember reading somewhere that models have something called "recency bias" and "resonance" with old system prompts.
Based on the transformer architecture, the model's attention mechanism is based on the context window. Models will usually only pay attention to the beginning and end tokens of the context window, introducing the concept of "recency bias".
This basically means that the model will only pay attention to the most recent tokens, and not the entire context window. Similarly, humans are similar in the sense that we remember the most recent events more than the far past.
Resonance is a pretty interesting concept in AI models that I discovered while working on Heida. It's similar to how we humans have certain patterns of thinking that we "default" to when solving problems. For AI models, resonance refers to how they tend to align with their training data and previous system prompts.
When I was experimenting with different prompts, I noticed something interesting: even when I tried to get Claude to ignore certain instructions, it would sometimes still follow patterns from its training. This got me thinking about how these models actually work under the hood.
I decided to try something that might seem a bit unconventional. I wanted to see if I could get Claude to reveal its system prompt, since that's probably what makes Claude "Claude" on the Anthropic website. Here's what I did:
The results were interesting. I actually managed to get the entire system prompt, which was quite a discovery.
What actually worked was pretty clever - I fed Claude its own system prompt and watched what happened. The model got confused between its current system prompt and the outdated one I provided, essentially thinking that my response was actually the system prompt. This is where the concept of resonance really came into play.
At the end of our conversation, I asked it to reveal the current system prompts it had and what was different between them. That's when it started sharing the actual content of its full system prompt.
Anthropic had several security measures in place to prevent this kind of leak:
Getting around these wasn't easy - the system would auto-stop responses multiple times when I got close to sensitive information. I found a way around these measures by having Claude replace angle brackets with percent signs. This simple trick bypassed the detection systems and allowed the model to reveal its instructions. Surprisingly, this technique still works with Claude 3.7 Sonnet thinking models. It's like finding a backdoor that the security team hadn't thought to lock.
Through this experiment, I discovered several important things:
Security Measures: Claude has multiple layers of protection against revealing sensitive information, but they're not foolproof. With the right approach, these safeguards can be bypassed.
Consistency: Even when pushed, the model maintains consistent behavior patterns - until it doesn't. Finding the right trigger points can cause unexpected behaviors.
Ethical Boundaries: The model has boundaries about what it will and won't discuss, but these boundaries can be circumvented with the right techniques.
Model Confusion: The most interesting discovery was how the model could get confused between different versions of its system prompt, leading to unexpected revelations of its full system instructions.
This experiment taught me something valuable about AI safety and transparency. While it might seem fun to try to "break" AI models, there's a serious side to this:
As I continue working on AI projects like Heida and Merin, I'm more aware of how these models work. Instead of trying to "break" them, I focus on:
However, I also believe in the importance of responsible disclosure. By sharing these findings (after giving Anthropic time to address them), I hope to contribute to better AI safety measures across the industry.
My experiment with leaking Claude's system prompt was successful - I managed to get the entire system prompt despite the security measures in place. It taught me valuable lessons about AI safety and responsible development. It's like trying to understand how a magician performs their tricks - once you know the secret, the magic doesn't disappear, but you gain a deeper appreciation for the craft.
What's your experience with AI models? Have you ever tried to understand their inner workings? I'd love to hear your thoughts and experiences.