PhD Fellowship (LLM Safety) (f/m/d)
Aleph Alpha
Overview:
Aleph Alpha Research’s mission is to deliver category-defining AI innovation that enables open, accessible, and trustworthy deployment of GenAI in industrial applications. Our organization develops foundational models and next-generation methods that make it easy and affordable for Aleph Alpha’s customers to increase productivity in development, engineering, logistics, and manufacturing processes.
We are growing our academic partnership “Lab1141” with TU Darmstadt and our GenAI group of PhD students supervised by Prof. Dr. Kersting. We are looking for an enthusiastic researcher at heart who is passionate about improving foundational, multi-modal NLP models and aims to obtain a PhD degree in a three-year program. On average, you will spend half of your time at Aleph Alpha Research in Heidelberg and the other half at the Technical University of Darmstadt, which is a short commute away.
As a PhD fellow at Aleph Alpha Research, you will develop new approaches to improve state-of-the-art model architectures and applications. You will have a unique research environment with ample compute and both industrial and academic supervisors to conduct and publish your research. Ideally, you sketch your dream research topic in your application letter.
While at Aleph Alpha Research, for the LLM Safety Fellowship, you will work with our Explainability team, where you will contribute to understanding the fundamentals of state-of-the-art GenAI models. Our mission is to ensure that these models are not only powerful and efficient but also transparent and trustworthy for users, developers, and stakeholders.
Topic:
Introduction
Large Language Models (LLMs) have tremendous potential across various applications, but their deployment raises significant safety and ethical concerns. This project focuses on enhancing the safety of LLMs through debiasing techniques at the inference stage and aligning model behavior with constitutional and ethical guidelines. Here, ensuring safety means preventing scenarios and outcomes that could cause harm to the user or to others. In particular, this includes reducing biased behavior and aligning models with specific cultural norms and legal regulations.
Related Work
Hazards, or safety risks, are sources of harm. Many are already posed by existing frontier and production-ready generative language models. These include enabling scams and fraud (Hazell 2023), terrorist activity (Devanny et al. 2023, Lakomy 2023), disinformation campaigns (Solaiman et al. 2019), the creation of child sexual abuse material (Thiel et al. 2023), the encouragement of suicide and self-harm (De Freitas et al. 2023), and cyber-attacks and malware (Ferrara 2024, Shibili et al. 2024), among many others (Mozes et al. 2023). Generative AI has been shown to increase the scale and severity of these hazards by reducing organizational and material barriers. For instance, the media has reported that criminals have used text-to-speech models to run realistic banking scams in which they mass-call people and pretend to be a relative in need of immediate financial assistance (New Yorker 2024). The risk of bias, unfairness, and discrimination in AI models is a longstanding concern, supported by a large body of research (Sheng et al. 2019, Lucy and Bamman 2021, Cheng et al. 2023). Recent research also shows that out-of-the-box models can be easily adjusted with a small fine-tuning budget to readily generate toxic, hateful, offensive, and deeply biased content (Bianchi et al. 2023, Gade et al. 2023, Qi et al. 2023). Substantial work has focused on developing human- and machine-understandable attack methods that cause models to regurgitate private information (Kumar et al. 2024), ‘forget’ their safety filters (Wei et al. 2023), or reveal vulnerabilities in their design (Ganguli et al. 2022).
While specific benchmarks such as DecodingTrust, ALERT, and the MLCommons Safety Benchmark exist to assess the safety of LLMs, these tools are generally limited in scope. They often focus on particular dimensions of safety or performance, which may not encompass the full range of safety concerns that arise across different languages and cultural contexts. Research such as Naous et al. highlights significant performance and safety disparities between English and non-English capabilities, pointing to a broader issue of limited cultural alignment beyond Western views. To enhance safety and ensure the representation of diverse values and norms, methods such as fine-tuning through red-teaming have been explored. Additionally, techniques such as steering with identified activation vectors (Rimsky et al.) have shown promise in controlling model outputs without the need for fine-tuning, for example to reduce hallucinatory responses and better align model behavior with desired ethical standards and social norms. In the context of mitigating risk, these strategies are crucial for developing robust and reliable language models that adhere not only to technical safety standards but also to broader societal values (ISO 42001), as reflected in ongoing discussions and forthcoming regulations like the EU AI Act. These measures collectively aim to harness the potential of LLMs while safeguarding against their misuse and ensuring their benefits are widely and equitably distributed.
Goals
The project aims to develop novel mechanisms that efficiently align model behavior with ethical and constitutional principles, ensuring compliance across diverse scenarios. Specifically, it aims to refine and advance novel safety assessments for generative AI, together with risk and hazard mitigation methods, that cover and adapt to diverse cultural contexts, with particular attention to alignment with EU values and legal regulations.
By establishing monitoring frameworks, the project should develop and facilitate evaluations of the safety and ethical implications of AI behavior. This work should integrate iterative model refinement strategies, including red-teaming and steering techniques, to flexibly enhance compliance with human values and legal norms.
Your responsibilities:
Research and development of novel approaches and algorithms that improve the training, inference, interpretation, or application of foundational models
Analysis and benchmarking of state-of-the-art as well as new approaches
Collaborating with scientists and engineers at Aleph Alpha and Aleph Alpha Research, plus chosen external industrial and academic partners
In particular, fruitful interactions with our group of GenAI PhD students and fostering exchange between Aleph Alpha Research and your university
Publishing your own and collaborative work at machine learning venues, and making code and models source-available for use by the broader research community
Your profile:
Master's degree in Computer Science, Mathematics, or a similar field
Solid understanding of DL/ML techniques, algorithms, and tools, for training and inference
Experience and knowledge of Python and at least one common deep-learning framework, preferably PyTorch
Willingness to relocate to the Heidelberg/Darmstadt region, Germany
Interest in bridging the gap between addressing practical industry challenges and contributing to academic research
Ambition to obtain a PhD in generative machine learning, in a three-year program
What you can expect from us:
Become part of an AI revolution and contribute to Aleph Alpha’s mission to provide technological sovereignty
Join a dynamic startup and a rapidly growing team
Work with international industry and academic experts
Share parts of your work via publications and source-available code
Take on responsibility and shape our company and technology
Flexible working hours and attractive compensation package
An inspiring working environment with short lines of communication, horizontal organization, and great team spirit