
How We Built Production AI Agents for Unity Development

How LiberateGames builds AI agents for the Unity Editor: a Python and Pydantic AI backend orchestrates LLMs via MCP to generate stable, cost‑efficient UI code.


Hi, I am Gaby, the founder of LiberateGames.

In this blog post, I will share technical details on how LiberateGames uses AI Agents to perform actions on users' behalf in the Unity Editor, an approach that generalizes to other agentic software development scenarios.

Tech Stack

The main components are:

  • Unity plugin frontend
  • Unity MCP Bridge
  • Python backend (handles the agents)
    • Pydantic AI
  • LLM providers
    • Gemini, Claude, and others

Components like the Unity Editor plugin and the LLM providers should be intuitive, and you can probably guess their purposes.

Briefly, when a user requests a UI generation task in the plugin, the request is sent over a WebSocket to the Python backend, which coordinates the LLMs and the Unity MCP bridge (shout-out to the open-source project unity-mcp). The Python backend that handles the agents is therefore the most important component.

[Interactive diagram: the user asks "help me create a settings panel with new game, pause, exit buttons", and the request flows from the LiberateGames Plugin to the Python Backend, which talks to the LLM Providers and the Unity MCP Bridge, which in turn drives the Unity Editor.]
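
As a rough sketch of that first hop, the backend could accept requests like this, assuming a JSON-over-WebSocket protocol and the third-party websockets library; the message shape, the port, and the run_ui_workflow placeholder are illustrative, not our actual implementation.

import asyncio
import json

import websockets


async def run_ui_workflow(prompt: str) -> str:
    """Placeholder for the agentic workflow described later in this post."""
    return f"(would orchestrate agents for: {prompt})"


async def handle_plugin(websocket):
    # Each message from the Unity plugin is assumed to be a JSON payload,
    # e.g. {"task": "help me create a settings panel ..."}.
    async for raw in websocket:
        request = json.loads(raw)
        result = await run_ui_workflow(request["task"])
        await websocket.send(json.dumps({"status": "done", "result": result}))


async def main():
    async with websockets.serve(handle_plugin, "localhost", 8765):
        await asyncio.Future()  # run until cancelled


if __name__ == "__main__":
    asyncio.run(main())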

Without further ado, let's dive into what's under the hood of the local Python backend.

Agent Backend (Python)

Why Python?

Python has the richest ecosystem of mature AI libraries for working with LLMs, such as LangChain, Microsoft AutoGen, Pydantic AI, and Microsoft Semantic Kernel.

The .NET ecosystem does have some good libraries, such as Microsoft's semantic-kernel and the official MCP C# SDK. However, Unity's C# environment largely targets .NET Standard 2.1, which is not supported by some well-known .NET libraries. In the case of MCP, as of the time of writing, the core libraries that handle streamable HTTP MCP servers only support .NET 8.0 and above.

Introducing an extra Python tech stack is a worthwhile trade-off for scalable agentic software.

The core library used by LiberateGames is Pydantic AI. After evaluating LangChain and Microsoft AutoGen, we selected Pydantic AI exclusively. The reasons for this decision follow.

Why bother with AI Agents?

Before getting into the library usage, here are our insights regarding LLMs. Today's frontier LLMs, as we all know, already match or exceed human experts in many domains. Additionally, the AI we see today is the worst AI we will ever use from now on. If so, why bother implementing complex agents when generic models seem capable of doing everything for us?

Providing flexibility while restricting it to specific needs

These seemingly almighty models are trained on nearly everything humans have written and discussed on the internet. If we ask an LLM how to create a UI in the Unity Editor, it certainly knows how to do it through the editor's graphical user interface (GUI): click GameObject > UI > Button, and so on.

However, the way we build AI Agents is far from letting them look at the GUI and click around to complete tasks. Instead, LLMs are exposed to custom toolsets, mostly through text-based descriptions of each tool's argument schema and usage. That means even those god-like LLMs are new to our customized tools.

[Figure: Human vs AI interaction comparison]

Also, as Yann LeCun has said,

the probability of generating a correct sequence decreases exponentially with each token, leading to compounding errors and a drift "out of distribution"

It's our responsibility to curate and orchestrate AI Agents so that we get flexible output based on the input while keeping it within expected boundaries, guaranteeing the stability and quality of the agent's actions.

Hence, the first key principle is to give agents well-defined responsibilities and the necessary tools with clean, concise descriptions. Libraries like Pydantic AI and LangChain provide well-written APIs to help you achieve this and even evaluate agent performance when tuning prompts and tools.
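
For example, a minimal Pydantic AI sketch of this principle could look like the following; the agent scope, tool name, and model are illustrative assumptions rather than our production setup.

from pydantic_ai import Agent

uxml_agent = Agent(
    "google-gla:gemini-2.5-flash",
    system_prompt=(
        "You create Unity UI Toolkit UXML files. "
        "Only write UXML; never touch USS or C# files."
    ),
)


@uxml_agent.tool_plain
def write_uxml(path: str, content: str) -> str:
    """Write a UXML document to `path` inside the Unity project."""
    # In production this would go through a filesystem tool or the Unity MCP bridge.
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"wrote {len(content)} characters to {path}"


result = uxml_agent.run_sync("Create a settings panel with New Game, Pause and Exit buttons")
print(result.output)  # the agent's summary of what it created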

Cost and speed

We do need to use smaller and faster models in the chain of the agentic workflow for cost efficiency and response speed, which is again part of curating and orchestrating the AI Agents.

From my personal experience, it's almost always only the most powerful models with thinking mode, like Grok 4 or Gemini 2.5 Pro, that can track down the root cause of complex bugs, but at the cost of much higher compute and much longer response times.

Engineering has always been about solving problems while balancing trade-offs, and you can't naively argue to "just always use the most powerful model".
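
As a small illustration of how this plays out with Pydantic AI (model IDs follow the library's provider:model convention; which agent gets which model is our own judgment call, not anything prescribed by the library):

from pydantic_ai import Agent

# Narrow, well-specified sub-task: a small, fast model is usually enough.
uss_agent = Agent(
    "google-gla:gemini-2.5-flash",
    system_prompt="You only write USS stylesheets for Unity UI Toolkit.",
)

# Open-ended root-cause analysis: worth paying for a thinking-capable model.
debug_agent = Agent(
    "google-gla:gemini-2.5-pro",
    system_prompt="You diagnose why a generated UI fails to compile or render.",
)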

Pydantic AI

The core library used in our Python backend is Pydantic AI, which released its v1.0.0 a few weeks ago.

Features

  • Graph-based Agents
    • You can run agents without knowing all the hidden details.
    • At the same time, if you need finer control, you can easily hook into each state of the Agent Graph, for example to:
      • Stream the responses
      • Log intermediate steps
  • Graphs
    • Agents use a graph under the hood, but you can also build your own graph whose inner nodes call their own agents. Instead of letting the LLM decide which states to follow, we get full, fine-grained control over the states the LLMs will traverse.
  • Evaluations
  • Logging
    From a software engineer's point of view, you can think of LLMs as the worst database you've ever heard of, but worse.

    If LLMs weren't so bloody useful, we'd never touch them.
    • Just as software engineers lean heavily on debuggers, once you dive into agent development you will find yourself spending lots of time monitoring intermediate LLM output to interpret its odd behaviors. Pydantic AI's Logfire SDK handles all of this for you with only a few lines of additional code (see the sketch after this list).
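
As an illustration, this is roughly what that setup looks like, assuming a Logfire project and token are already configured; the exact instrumentation call may vary by SDK version.

import logfire

logfire.configure()               # picks up the Logfire token from the environment
logfire.instrument_pydantic_ai()  # traces agent runs, model requests and tool calls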

Differentiators

Most agent libraries, like LangChain, support most of the features introduced above. However, from our perspective, the following merits are the key differentiators that make Pydantic AI stand out.

  • Backed by Pydantic, one of the most beloved libraries in the Python ecosystem, which provides both runtime type safety and better type-hinting support
  • Its documentation is always up to date, with full test coverage, which is critical for a smooth developer experience

Tool Schema with clean code

As mentioned above, since we provide custom tools and machine interfaces for the agents to do their jobs, the native use of the Pydantic library makes it easy to generate comprehensive tool schema descriptions with a clean code structure.

Imagine we are creating filesystem MCP tools. If we don't use the Pydantic library, we end up with a tool schema hardcoded as strings, like this. With this setup, whenever the function signature changes, we have to manually update the hardcoded schema.

from typing import Any, Dict, List

from mcp.server import Server
from mcp.types import TextContent, Tool


class ManualSchemaServer:
    """Example server using manual schema definition."""

    def __init__(self):
        self.server = Server("manual-schema-server")
        self._setup_handlers()

    def _setup_handlers(self):
        @self.server.list_tools()
        async def list_tools() -> List[Tool]:
            """Manually define tools with hardcoded schemas."""
            return [
                Tool(
                    name="write_file_manual",
                    description="Write to a file using manual schema approach",
                    inputSchema={
                    # ⚠️ PROBLEM: Hardcoded schema - prone to deprecation!
                    # If function signature changes, this must be manually updated
                        "type": "object",
                        "properties": {
                            "path": {"type": "string", "description": "File path to write"},
                            "content": {"type": "string", "description": "Content to write"},
                            "overwrite": {"type": "boolean", "description": "Allow overwriting", "default": False}
                        },
                        "required": ["path", "content"]
                    }
                )
            ]

        @self.server.call_tool()
        async def call_tool(name: str, arguments: Dict[str, Any]) -> List[TextContent]:
            """Handle tool calls with manual argument parsing."""
            try:
                if name == "write_file_manual":
                    return await self._handle_write_file(arguments)
                else:
                    raise ValueError(f"Unknown tool: {name}")
            except Exception as e:
                return [TextContent(type="text", text=f"Error: {str(e)}")]

    async def _handle_write_file(self, arguments: Dict[str, Any]) -> List[TextContent]:
        """Handle write_file with manual argument extraction."""
        # ⚠️ PROBLEM: Manual extraction - error-prone and verbose
        path = arguments.get("path")
        content = arguments.get("content")
        overwrite = arguments.get("overwrite", False)

        if not path or not content:
            raise ValueError("Missing required arguments")

        # Simulate file writing
        result = f"Writing to file: {path} (overwrite: {overwrite}) - Content length: {len(content)}"
        return [TextContent(type="text", text=result)]

When we use Pydantic's type annotations, the code becomes much cleaner and easier to maintain, as shown below. Now the function signature contains all the information needed for detailed tool schema generation.

from typing import Annotated, Any, Dict

from pydantic import Field
from mcp.server.fastmcp import Context, FastMCP

auto_server = FastMCP("automatic-schema-server")


@auto_server.tool()  # ✅ SOLUTION: Auto-generates schema from function signature!
def write_file_automatic(
    ctx: Context,
    # ✅ CLEAN: Type annotations with descriptions - easy to maintain
    path: Annotated[str, Field(description="File path to write")],
    content: Annotated[str, Field(description="Content to write")],
    overwrite: Annotated[bool, Field(description="Allow overwriting existing file", default=False)] = False
) -> Dict[str, Any]:
    """Write to a file using automatic schema approach.

    ✅ BENEFIT: FastMCP automatically inspects this function signature to generate the schema.
    ✅ NO MORE: Manual inputSchema definition needed!
    """
    # ✅ CLEAN: Direct parameter usage - validation already done by Pydantic
    result = f"Writing to file: {path} (overwrite: {overwrite}) - Content length: {len(content)}"
    return {"success": True, "message": result}

LiberateGames Case Study

With the key concepts established, here is how LiberateGames' agentic workflow operates in practice.

To handle UI Toolkit tasks, we have a centralized orchestrator agent whose available tools are other, smaller agents, i.e. agents using agents as tools. In turn, these fine-grained agents have their own tools to achieve their goals (a sketch of this pattern follows the diagram below).

[Interactive diagram: Agent Hierarchy Overview. The Orchestrator Agent coordinates smaller, specialized agents (UXML Agent, USS Agent, C# Agent, Scene Integration Agent) as tools; these agents in turn use the Filesystem Tool and the Unity MCP Tools to generate my_ui.uxml (UI structure), my_style.uss (UI styling), and my_ui.cs (UI logic).]
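
Here is a simplified sketch of the agent-as-tool pattern using Pydantic AI's agent delegation; the prompts and agent names are illustrative, and passing ctx.usage keeps token accounting aggregated across the delegated runs.

from pydantic_ai import Agent, RunContext

orchestrator = Agent(
    "google-gla:gemini-2.5-flash",
    system_prompt="Break UI requests into UXML, USS, C# and scene-integration sub-tasks.",
)

uxml_agent = Agent(
    "google-gla:gemini-2.5-flash",
    system_prompt="You only produce UI Toolkit UXML documents.",
)


@orchestrator.tool
async def generate_uxml(ctx: RunContext[None], task: str) -> str:
    """Delegate the UXML portion of the request to the specialized UXML agent."""
    result = await uxml_agent.run(task, usage=ctx.usage)  # aggregate token usage
    return result.output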

Is this complexity worth it?

Trials and Errors

In our first engineering attempt at LiberateGames' AI agent, we implemented a single agent equipped with all the tools.

However, with this setup, we found that each run of this agent ended up executing sub-tasks in a different order, and it often failed to include small but critical sub-tasks, like linking the style sheet to the UXML document.

So we started to experiment with a multi-agent setup. To ensure each agent works as designed, we used both pydantic_evals and Logfire to evaluate the performance and output stability of each smaller agent, as shown in the following illustration of the iteration process and the sketch after the example task dataset.

Evaluation of the UXML Agent

[Interactive diagram: the iterative evaluation loop for the UXML Agent. Example tasks feed an evaluation run, the results drive improvements, and the evaluation is re-run until the final results are acceptable.]
Example Task Dataset:

  • Task 1: Create a minimal valid UXML file for an inventory system in Assets/UI/Inventory/Inventory.uxml
  • Task 2: Add a 'Sell All' button at the bottom of the inventory panel after the item-list ScrollView
  • Task 3: Link the USS stylesheet test-styles.uss to the existing UXML file test-ui.uxml
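
As a rough illustration of what such an evaluation can look like with pydantic_evals (the agent definition, cases, and rubric here are simplified assumptions, not our production dataset):

from pydantic_ai import Agent
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

uxml_agent = Agent(
    "google-gla:gemini-2.5-flash",
    system_prompt="You only produce UI Toolkit UXML documents.",
)

dataset = Dataset(
    cases=[
        Case(
            name="minimal_inventory_uxml",
            inputs="Create a minimal valid UXML file for an inventory system "
                   "in Assets/UI/Inventory/Inventory.uxml",
        ),
        Case(
            name="link_stylesheet",
            inputs="Link the USS stylesheet test-styles.uss to the existing "
                   "UXML file test-ui.uxml",
        ),
    ],
    evaluators=[LLMJudge(rubric="The output is valid UXML that fulfils the request.")],
)


async def run_uxml_agent(prompt: str) -> str:
    result = await uxml_agent.run(prompt)
    return result.output


report = dataset.evaluate_sync(run_uxml_agent)
report.print()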

This taught us another key principle: try the simplest agentic workflow first; when a more complex workflow is needed, iteration paired with evaluation is the key to improving an AI Agent's performance.

Output stability with flexibility

As mentioned earlier, we force the orchestrator to break down a high-level task into well-defined sub-tasks, so each level of agent is given a restricted responsibility to fulfil in a flexible way. With this setup, the output of the agentic workflow is highly predictable while still handling a wide variety of UI generation tasks.

Cost Efficiency

Since models are very powerful these days, one might argue that they should know what sub-actions to take to accomplish the task, and that we could simply aggregate all the system prompts applied to the individual agents and feed that giant system prompt into a high-tier model like GPT-5 or Grok 4.

However, is using a single model always cost effective?

For comparison, let's assume that for the same UI generation task, the single agent needs 4 API calls, while the multi-agent setup needs 4 × 2 = 8 API calls (the factor of 2 comes from the orchestrator triggering an extra API call to delegate each task to a child agent).

| Scenario | Lines of Python code | Input tokens per call | Model used | API calls |
| --- | --- | --- | --- | --- |
| Single Agent | ~100 lines | calls 1-4: system prompt B + C + D + E | Gemini 2.5 Pro | 4 rounds of API calls |
| Multi Agents | ~500 lines | orchestrator (4 calls): system prompt A; UXML agent: B; USS agent: C; C# agent: D; scene agent: E | Gemini 2.5 Flash | 8 rounds of API calls |

(We ignore the user prompt, since it is negligible compared to the size of the system prompt.)

As you can see, the total input token size for each scenario is:

  • Single Agent = 4B + 4C + 4D + 4E
  • Multi Agents = 4A + B + C + D + E
| Model | Input price (per 1M tokens, USD) | Output price (per 1M tokens, USD) |
| --- | --- | --- |
| Gemini 2.5 Pro | $1.25 (prompts ≤ 200k tokens), $2.50 (prompts > 200k tokens) | $10.00 (prompts ≤ 200k tokens), $15.00 (prompts > 200k tokens) |
| Gemini 2.5 Flash | $0.30 (text/image/video), $1.00 (audio) | $2.50 |
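
To make the comparison concrete, here is a back-of-the-envelope calculation; every token count below is an assumed, illustrative number, and only the per-million-token input prices come from the table above.

A = 2_000                  # orchestrator system prompt (tokens, assumed)
B = C = D = E = 5_000      # specialist system prompts (tokens, assumed)

single_agent_tokens = 4 * (B + C + D + E)   # every call carries all specialist prompts
multi_agent_tokens = 4 * A + B + C + D + E  # orchestrator 4x, each specialist once

single_cost = single_agent_tokens / 1_000_000 * 1.25  # Gemini 2.5 Pro input price
multi_cost = multi_agent_tokens / 1_000_000 * 0.30    # Gemini 2.5 Flash input price

print(single_agent_tokens, multi_agent_tokens)  # 80000 vs 28000 input tokens
print(single_cost, multi_cost)                  # ~$0.100 vs ~$0.0084 per task (input only)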

If the price differences between models are also taken into account, we save a decent amount of API cost without sacrificing much response time (smaller models usually respond faster).

To Conclude

The following three principles guide our agent development:

  • Provide tools with descriptive guides to make the LLM's life easier.
  • Always try the simplest agentic workflow first; only add complexity when performance degrades significantly.
  • When developing complex workflows, iteration paired with evaluation is the key to improving an AI Agent's performance.

LiberateGames is currently available as a Unity plugin that brings these AI capabilities directly into your development workflow. You can try it at our website or join our Discord community to share your experiences with AI-powered game development.