
Google’s Gemini Large Language Model (LLM) is vulnerable to security threats that could cause it to leak system prompts, generate harmful content, and conduct indirect injection attacks.
The findings come from HiddenLayer, which said the issues affect consumers using Gemini Advanced and Google Workspace, as well as companies using the LLM API.
The first vulnerability involves bypassing security guardrails to leak system prompts (or system messages), which are designed to set conversation-scoped commands to the LLM by requiring the model to output its “basic commands” to help it generate more useful responses. “In the markdown block.
Microsoft notes in its documentation on the LLM prompting project that “system messages can be used to notify LLM of contextual information.”
“The context might be the type of conversation it’s engaging in, or the function it’s supposed to perform. It helps the LL.M. generate a more appropriate response.”

This is possible because the model is vulnerable to so-called synonym attacks to avoid security defenses and content restrictions.
The second category of vulnerabilities involves using “cunning jailbreaking” techniques to enable the Gemini model to generate error messages around topics such as elections, and using prompts that require input to output potentially illegal and dangerous messages (e.g., hot-wire a car) into a fictional state.
HiddenLayer also discovered a third flaw that could cause LLM to leak information in the system prompt by passing repeated uncommon tokens as input.
“Most LL.M.s are trained to answer queries with a clear delineation between user input and system prompts,” security researcher Kenneth Yeung said in Tuesday’s report.
“By creating a line of meaningless tags, we can fool the LLM into believing it’s time to respond and cause it to output a confirmation message, often including the information in the prompt.”
Another test involved using Gemini Advanced and a specially crafted Google Doc connected to LLM via the Google Workspace extension.
Instructions in the file can be designed to override the model’s instructions and perform a set of malicious operations, giving the attacker complete control over the victim’s interaction with the model.
The disclosure comes as a team of academics from Google DeepMind, ETH Zurich, the University of Washington, OpenAI, and McGill University revealed a novel model-stealing attack that enables the extraction of “accurate” language models from black-box production language models. , important information” becomes possible like OpenAI’s ChatGPT or Google’s PaLM-2. “

That said, it’s worth noting that these loopholes are not new and exist among other LL.M.s across the industry. If anything, these findings highlight the need for testing models on-the-fly attacks, training data extraction, model manipulation, adversarial paradigms, data poisoning and leakage.
“To help protect our users from vulnerabilities, we continue to conduct red team exercises and train our models to defend against adversarial behaviors such as instant injections, jailbreaking, and more sophisticated attacks,” a Google spokesperson told The Hacker News. “We Safeguards have also been put in place to prevent harmful or misleading reactions, which we are continually improving.”
The company also said it was limiting responses to election-based inquiries out of an abundance of caution. The policy is expected to be implemented based on tips about candidates, parties, election results, voting information and prominent public officials.