Anthropic scientists hacked Claude's brain – and it noticed

andy99 12 hours ago

I’d like to know if these were thinking models, as in whether the “injected thoughts” were in their thinking trace and that’s how the model reported that it “noticed” them.

I’d also like to know if the activations they change are effectively equivalent to having the injected terms in the model’s context window, as in would putting those terms there have led to an equivalent state.

Without more info the framing feels like a trick - it’s cool they can be targeted with activations, but the “Claude having thoughts” part is more of a gimmick.

download13 12 hours ago

The article did say that they tried injecting concepts via the context window and by modifying the model's logit values.
When words were injected into its context, it recognized that what it had supposedly said did not align with its thoughts and said it didn't intend to say that, while modifying the logits resulted in the model attempting to create a plausible justification for why it was thinking that.