When using Gemma models (2B/4B) via Ollama through the LiteLLM adapter, agents with tools enter an infinite tool-calling loop and never produce a final response.
Root cause: _content_to_message_param serializes tool result messages with role="tool" (the OpenAI-compatible default), but Gemma's chat template expects role="tool_responses" (per the documentation: https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4).
This mismatch causes the model to misinterpret the tool result as a new turn instead of a response to its own tool call.
This is not a hardware or quantization issue — the same behavior occurs on high-end GPUs.
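A minimal sketch of a client-side workaround, assuming the messages are plain OpenAI-style dicts as produced by _content_to_message_param: remap role="tool" to the role Gemma's template expects before the request is sent. The function name `remap_tool_roles` and the constant `GEMMA_TOOL_ROLE` are illustrative, not part of the adapter's API; the real fix likely belongs in the serialization layer itself.

```python
# Illustrative workaround, not the upstream fix: rename OpenAI-style
# tool-result roles to the role Gemma's chat template expects.
# "tool_responses" is taken from the Gemma prompt-formatting docs cited above.
GEMMA_TOOL_ROLE = "tool_responses"

def remap_tool_roles(messages: list[dict]) -> list[dict]:
    """Return a copy of `messages` with tool-result roles renamed for Gemma."""
    remapped = []
    for msg in messages:
        if msg.get("role") == "tool":
            msg = {**msg, "role": GEMMA_TOOL_ROLE}  # shallow copy, new role
        remapped.append(msg)
    return remapped

# Example: a tool-call turn followed by its serialized tool result.
messages = [
    {"role": "assistant", "tool_calls": [{"id": "call_1", "type": "function",
        "function": {"name": "get_weather", "arguments": "{}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "72F and sunny"},
]
print(remap_tool_roles(messages)[1]["role"])  # tool_responses
```

Applying this remap just before the LiteLLM completion call lets the model recognize the message as a response to its own tool call rather than a new turn, which is what breaks the loop.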