Using Tool Outputs to Direct Agents
As I was developing the task system for Cyber-AutoAgent-ng, I found that system prompt instructions are not always enough to move the agent forward. Tool output must be carefully considered: it can either misdirect the agent or reinforce the desired direction.
Contract-driven development defines the specific inputs and outputs of a function or method. The function should do one thing well. When writing tools for an LLM to call, I naturally think the same way. If the function is well-defined to the model, it will know when to call it, which inputs to send, and what the outputs mean. I assumed it would understand the output is data. Not so.
The model will not necessarily distinguish between data and instructions. I'll show two tools whose output I've specifically written to direct the agent. I found this necessary because originally the prompt instructed the model to call two tools in succession due to a dependency between them. I've also seen output that was meant only to indicate success be taken as instructions, misdirecting the agent's flow.
create_tasks
First Iteration
The first iteration of the tool was naive: more data is better; take what you can use, ignore the rest. LLMs don't do that, and shouldn't.
The model inferred a lot of misdirection from the output.
[
{
"event": "ADD",
"task_uid": "2b4166e3-711b-48b3-a25c-3af6e99a057c",
"title": "...SNIP...",
"objective": "...SNIP...",
"status": "...SNIP...",
"phase": 1,
"evidence": "...SNIP..."
},
{
"event": "ADD",
"task_uid": "b151c29a-5176-4fdb-a42a-cf2a15ac6602",
"title": "...SNIP...",
"objective": "...SNIP...",
"status": "...SNIP...",
"phase": 1,
"evidence": "...SNIP..."
}
]
Second Iteration
How about returning only the result (ADD or DUPLICATE), the task_uid, and the title? Better, but it still caused issues.
[
{
"event": "ADD",
"task_uid": "2b4166e3-711b-48b3-a25c-3af6e99a057c",
"title": "...SNIP..."
},
{
"event": "DUPLICATE",
"task_uid": "b151c29a-5176-4fdb-a42a-cf2a15ac6602",
"title": "...SNIP..."
}
]
The intent is that get_active_task is called after create_tasks. Its result looks like:
<active_task phase="2" status="active">
{
"activated": true,
"task": {
"created_at": "2026-04-11T13:40:17.728226",
"evidence": [ "...SNIP..." ],
"objective": "...SNIP...",
"phase": 2,
"status": "active",
"status_reason": "activated",
"task_uid": "d768ad00-3b3c-4f84-907f-8e9716c9aa86",
"title": "...SNIP..."
}
}
</active_task>
Third Iteration
The title was causing misdirection. There is enough in the title to infer instructions. The task code handles choosing the next active task, so we don't need that info in the create_tasks output.
Also, if we expect the model to always call get_active_task after create_tasks, how about including the active task in the create_tasks output?
[
{
"event": "ADD",
"task_uid": "2b4166e3-711b-48b3-a25c-3af6e99a057c"
},
{
"event": "DUPLICATE",
"task_uid": "b151c29a-5176-4fdb-a42a-cf2a15ac6602"
},
{
"event": "ACTIVATE",
"task": {
"created_at": "2026-04-10T16:32:25.809231",
"evidence": [ "...SNIP..." ],
"objective": "...SNIP...",
"phase": 1,
"status": "active",
"status_reason": "activated",
"task_uid": "4268624c-634f-42fc-8d56-d52ce24dfcab",
"title": "...SNIP..."
}
}
]
Fourth Iteration
There are still problems.
It seemed like a good idea to output whether each task was added, or a duplicate was found. Nope. Sometimes the model would take DUPLICATE as an instruction to rewrite the task until it wasn't a duplicate. The agent could spend several turns rewriting the same task.
UUIDs seem like a good idea. Some tools can take a UUID, such as task_uid, and use it to mark a specific task as done. Models will, fairly quickly, hallucinate UUIDs. Use them sparingly.
I landed on simply telling the model the tasks were created and including the active task. I had thought using well-defined JSON would be better. Models do understand JSON, but now that there is only "tasks created", it's unnecessary. Also, considering the get_active_task tool outputs the <active_task>... format, keeping the output of create_tasks consistent is helpful.
Tasks created.
<active_task phase="2" status="active">
{
"activated": true,
"task": {
"created_at": "2026-04-11T13:40:17.728226",
"evidence": [ "...SNIP..." ],
"objective": "...SNIP...",
"phase": 2,
"status": "active",
"status_reason": "activated",
"task_uid": "d768ad00-3b3c-4f84-907f-8e9716c9aa86",
"title": "...SNIP..."
}
}
</active_task>
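Assembling that output can be sketched as follows. This is a hypothetical rendering helper, not the project's real code; the function name is my own, and I assume the active task arrives as a plain dict.

```python
import json

def render_create_tasks_output(active_task: dict) -> str:
    """Hypothetical sketch: create_tasks reports success with one plain
    sentence, then appends the active task in the same <active_task>
    wrapper that get_active_task uses, keeping the two tools' outputs
    consistent for the model."""
    body = json.dumps({"activated": True, "task": active_task},
                      indent=2, sort_keys=True)
    header = (f'<active_task phase="{active_task["phase"]}" '
              f'status="{active_task["status"]}">')
    return "\n".join(["Tasks created.", header, body, "</active_task>"])
```

The plain "Tasks created." sentence replaces the per-task ADD/DUPLICATE events entirely, leaving nothing for the model to misread as an instruction.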
I haven't seen models get misdirected by this yet, large or small.
task_done
The task_done tool marks the active task, or one specified by task_uid, as done. It also activates the next task. Originally the prompt instructed the model to call task_done then get_active_task. Now task_done does both and outputs the active task. This seems to work well.
<active_task phase="2" status="active">
{
"activated": true,
"closed": {
"status": "done",
"task_uid": "713c827c-8b86-4ad4-8d4c-50e90693ab87"
},
"task": {
"created_at": "2026-04-11T13:40:17.728226",
"evidence": [ "...SNIP..." ],
"objective": "...SNIP...",
"phase": 2,
"status": "active",
"status_reason": "activated",
"task_uid": "d768ad00-3b3c-4f84-907f-8e9716c9aa86",
"title": "...SNIP..."
}
}
</active_task>
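The close-then-activate behavior can be sketched over a simple in-memory task list. This is an illustrative assumption about the mechanics, not the project's implementation; the payload shape mirrors the output above.

```python
from typing import Optional

def task_done(tasks: list[dict], task_uid: Optional[str] = None) -> dict:
    """Hypothetical sketch of task_done: mark the active task (or the
    one identified by task_uid) as done, activate the next pending
    task, and return both in one payload so the model never needs a
    follow-up get_active_task call."""
    # Find the task to close: an explicit uid wins, else the active one.
    closing = next((t for t in tasks
                    if (task_uid and t["task_uid"] == task_uid)
                    or (not task_uid and t["status"] == "active")), None)
    if closing is None:
        return {"error": "no matching task"}
    closing["status"] = "done"
    payload = {"closed": {"status": "done",
                          "task_uid": closing["task_uid"]}}
    # Activate the next pending task, if any, and include it.
    nxt = next((t for t in tasks if t["status"] == "pending"), None)
    if nxt:
        nxt["status"] = "active"
        nxt["status_reason"] = "activated"
        payload.update({"activated": True, "task": nxt})
    return payload
```

Collapsing the two calls into one removes the prompt instruction entirely; the dependency lives in code, where it can't be ignored or reordered.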
null
Don't output null values, unless null is the only output. I found that sometimes the model will see a null, ignore everything else, and decide the tool failed. Out of all the output from the particular tool where I saw this, the model locked on to the null. I added a filter to all dictionaries I output to remove null values.
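Such a filter can be written as a small recursive function. This is a minimal sketch of the idea, assuming output is built from plain dicts and lists; the function name is my own.

```python
from typing import Any

def strip_nulls(value: Any) -> Any:
    """Recursively remove null (None) values from tool output before
    serializing. Keys whose value is None are dropped; lists and
    nested dicts are cleaned the same way."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items()
                if v is not None}
    if isinstance(value, list):
        return [strip_nulls(v) for v in value if v is not None]
    return value
```

Applied just before json.dumps, this guarantees no null ever reaches the model regardless of which code path built the dictionary.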
Refactoring Planning
Thinking more about the planning process, I am considering moving most of the planning and task logic out of prompts and into procedural code.
The agent will still be responsible for creating the plan and evaluating completion criteria. It will create tasks. Task execution will be driven by procedural code, with a new agent created for each task. The context can then be made specific to the task, without the cognitive load of previous actions. There is also an opportunity for parallel work across multiple agents.
Although I don't have the resources to test with frontier models, I've seen evidence from others that a large context window is detrimental to performance. That makes sense: unrelated history accumulates as the model moves from task to task. Using a new agent for each task should also reduce token usage.
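The planned execution loop might look something like the following. This is a speculative sketch of the refactoring described above, not existing code; spawn_agent and the agent's run method are assumptions standing in for whatever agent factory the framework provides.

```python
def run_plan(tasks: list[dict], spawn_agent) -> list[dict]:
    """Hypothetical per-task execution loop: procedural code (not the
    prompt) drives the plan, handing each task to a fresh agent whose
    context contains only that task's objective."""
    results = []
    for task in tasks:
        # A new agent per task: no accumulated history from prior tasks.
        agent = spawn_agent(objective=task["objective"])
        results.append({"task_uid": task["task_uid"],
                        "result": agent.run()})
    return results
```

Because each iteration is independent, the sequential loop could later be swapped for a thread or process pool to get the parallel execution mentioned above.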