Open Philanthropy recommended a grant of $499,597 over three years to UC Berkeley to support a study on the behavior of frontier AI models. This work will be led by Professor Emma Pierson and expands on her previous paper “Sparse Autoencoders for Hypothesis Generation”. In that paper, Pierson used sparse autoencoders (SAEs) to analyze text data (like Yelp restaurant reviews) and identify text features that predict outcomes (like how many stars a user awards a restaurant). She then used an LLM to describe those text features as testable, natural-language hypotheses (like “words associated with quick service lead to higher review scores”).
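The feature-selection step described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the SAE activations have already been computed, ranks features by plain Pearson correlation with the outcome (an assumption made for illustration), and leaves out both SAE training and the LLM step that turns features into hypotheses. The names `pearson`, `rank_features`, `toy_activations`, and `toy_stars` are hypothetical, and the data is made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(activations, outcomes, top_k=2):
    """Rank feature columns by absolute correlation with the outcome.

    activations: one row per text, one column per (hypothetical) SAE feature.
    outcomes: one numeric outcome per text, e.g. a star rating.
    Returns (feature_index, correlation) pairs, strongest first.
    """
    n_features = len(activations[0])
    scored = []
    for j in range(n_features):
        column = [row[j] for row in activations]
        scored.append((j, pearson(column, outcomes)))
    scored.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return scored[:top_k]


# Made-up toy data: 5 reviews x 3 features. Feature 2 is constant,
# so it carries no information about the outcome.
toy_activations = [
    [0.9, 0.0, 0.2],
    [0.8, 0.1, 0.2],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.2],
    [0.5, 0.4, 0.2],
]
toy_stars = [5, 5, 1, 2, 3]

top_features = rank_features(toy_activations, toy_stars)
for index, corr in top_features:
    print(f"feature {index}: correlation {corr:+.2f}")
```

In a full pipeline, the texts that most strongly activate each selected feature would then be passed to an LLM, which would be asked to describe what they have in common as a testable hypothesis.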
This grant will enable Professor Pierson to investigate whether HypotheSAEs, the technique introduced in that paper, can be used to identify properties of LLM prompt text that lead AI models to generate harmful responses. The grant will also support exploring technical upgrades to the HypotheSAEs pipeline, such as using Matryoshka SAEs or transcoders.
This falls within Open Philanthropy’s focus area of potential risks from advanced artificial intelligence.