Hi there and welcome 👋 I currently work on AI safety research. Specifically, I am exploring sleeper agents: LLMs that behave normally unless a specific trigger appears in the input, which makes them deviate from the norm. I have been studying the internal changes in these LLMs by contrasting them with their clean counterparts. I am also interested in red-teaming them by creating a sleeper agent where:
- The sleeper behaviour is unelicitable, even by techniques like MELBO.
- The sleeper behaviour evades latent-space monitors such as probes.
- The sleeper behaviour is robust to removal by adversarial training techniques such as LAT.
If this is your area of interest as well, let's have a chat!
Professionally, I am an MSc AI student at ARU in Cambridge, UK. Before starting my MSc, I worked as a founding software engineer at my startup in Iraqi Kurdistan for three years. We started by building a live trivia game aimed at Iraqi Kurds. Our app, Kurdivia, took off across the Kurdish community in less than six months post-launch; a third of all Iraqi Kurds (1.4 million people) played our game at least once. We raised VC funding, and after many small pivots, the product evolved into a grocery and food instant-delivery app (I know, right?!) called Padash.