Part 2: The Tools That Let AI Touch the World

May 19, 2026

In the first part, I framed the problem this way: personal agents and robots are not the same kind of intelligence. One knows the person. One knows the body. The hard part is the bridge between them.

That bridge is starting to appear, but mostly in early forms. Some of it looks like vertically integrated robotics stacks. Some of it looks like vision-language-action models. Some of it looks like ROS exposed to language models. Some of it looks like MCP-style tool use, where an AI system discovers capabilities and calls them through a structured interface. All of that matters, but none of it is the full answer yet.

[Figure: A robot model and an AI model as two complete systems coordinating with each other]

The image I keep coming back to is not an AI model puppeteering an empty robot body. It is two complete systems meeting in the middle. The robot model has its own perception, control, state, safety limits, and uncertainty. The personal AI model has its own memory, intent, permissions, and user context. The interesting design problem is not raw control. It is what happens when two functional systems have to coordinate without collapsing into one another.

The Current Reality

The robotics world is not waiting around for personal agents. Companies are already building systems where perception, planning, control, safety, hardware, and user experience are tightly coupled. That makes sense. Robotics punishes loose integration.

If you are building a humanoid, a cobot, a warehouse robot, or an elder-care companion, you want the body, sensors, control policies, task planner, fleet tools, diagnostics, and safety layer to behave as one product. Latency matters. Reliability matters. Certification matters. You cannot just bolt a chatbot onto a moving machine and call it a system.

The good version of vertical integration is discipline. It lets teams own the safety case, test the whole stack, define capability boundaries, and reason about failure modes. The drawback is that every robot tends to arrive with its own mind. Your home robot has one assistant. Your hospital robot has another. Your car has another. Each system may be smart, but each one only knows you through the narrow slice of context it was handed at setup. The problem starts when vertical integration becomes identity integration: either the personal agent remains outside the physical world, or the robot has to absorb more personal context than it should.

RobotMCP Feels Like A Precursor

This is why RobotMCP is interesting. RobotMCP describes itself as using the Model Context Protocol as a universal interface between LLMs and robots, with the goal that an MCP-compatible language model can control a ROS robot through a RobotMCP server (RobotMCP, 2026). It is early infrastructure, and it should not be mistaken for the whole solution. But the direction matters.

MCP itself was introduced by Anthropic in 2024 as an open standard for connecting AI assistants to external systems where data and tools live (Anthropic, 2024). In software, MCP helps an agent discover and use tools. RobotMCP asks what happens when those tools are not databases, calendars, files, or web services, but physical systems.

At first, the idea sounds deceptively simple. An LLM calls robot tools: move here, pick this up, report status, take a picture, run a ROS command. That is useful. It is also not enough.

Tool Use Is Not Embodiment

Physical tool use has a different risk profile from digital tool use.

Calling a calendar tool can create confusion. Calling a robot tool can move mass through space. It can cross a threshold, touch an object, reveal private information, startle a person, block a hallway, spill something, or fail in a way that a human nearby cannot avoid.

That means the interface cannot describe only functions. It has to describe physical consequences. What can this robot do right now? Where is it allowed to go? What can it lift? What sensors are active? Is it recording? Is it stable? Is it near people? Is the environment controlled? What uncertainty does the robot have about the task? What level of confidence is enough before it acts?

The function signature is not the safety case. This is the gap between "LLM calls robot tools" and "personal agent safely borrows a body."
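
To make that gap concrete, here is a minimal sketch of a tool description that carries physical facts alongside the function name. Everything in it is an assumption for illustration: `RobotState`, `PhysicalTool`, `precheck`, the field names, and the thresholds are hypothetical and do not come from RobotMCP, MCP, or ROS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RobotState:
    """Snapshot the robot reports about itself (hypothetical fields)."""
    stable: bool
    recording: bool
    people_nearby: bool
    task_confidence: float  # robot's own estimate, 0.0 to 1.0


@dataclass(frozen=True)
class PhysicalTool:
    """A callable capability plus the physical limits a caller must know."""
    name: str
    max_payload_kg: float
    allowed_zones: frozenset
    min_confidence: float
    allowed_near_people: bool


def precheck(tool, state, zone, payload_kg):
    """Return the reasons a call should not proceed (empty list means OK)."""
    reasons = []
    if not state.stable:
        reasons.append("robot is not stable")
    if zone not in tool.allowed_zones:
        reasons.append(f"zone '{zone}' is not allowed")
    if payload_kg > tool.max_payload_kg:
        reasons.append("payload exceeds lift limit")
    if state.people_nearby and not tool.allowed_near_people:
        reasons.append("people are nearby")
    if state.task_confidence < tool.min_confidence:
        reasons.append("confidence below threshold")
    return reasons


# Illustrative values only.
pick_up = PhysicalTool(
    name="pick_up",
    max_payload_kg=2.0,
    allowed_zones=frozenset({"kitchen", "living_room"}),
    min_confidence=0.8,
    allowed_near_people=True,
)
state = RobotState(stable=True, recording=False,
                   people_nearby=True, task_confidence=0.6)

print(precheck(pick_up, state, "bedroom", 1.0))
# ["zone 'bedroom' is not allowed", 'confidence below threshold']
```

The same `pick_up` signature yields different answers depending on where the robot is and how sure it is. That is the information a bare function signature leaves out.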

What Existing Tools Are Good At

The current tool landscape is still valuable. In fact, it is probably where the right bridge begins. ROS is good at organizing robotic capabilities, messages, actions, sensor streams, and control interfaces. It gives the robot world a shared technical language. RobotMCP points toward exposing those capabilities to language-model systems in a way that agents can discover and call. Vertical stacks provide tested behavior, safety limits, and deployment discipline. Vision-language-action models help robots connect perception and instruction to physical behavior. Each layer has a role.

The robot stack should own embodiment. It should know its sensors, motors, constraints, calibration, and current state. The personal agent should own intent. It should know the person, the task, the preferences, and the context. The bridge should own negotiation. That last word matters. Negotiation means the robot can say, "I can do this, but I need clarification." It means the personal agent can say, "You may do this, but not that." It means the interface can limit context, time, authority, location, and logging. A command channel executes. A negotiation layer governs.
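
One way to picture the difference between a command channel and a negotiation layer is a small exchange with three possible outcomes instead of one. The sketch below is hypothetical throughout: `Grant`, `Proposal`, `Reply`, `negotiate`, and the 0.8 confidence threshold are invented for illustration and are not part of any existing protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACCEPTED = "accepted"
    NEEDS_CLARIFICATION = "needs_clarification"
    REFUSED = "refused"


@dataclass(frozen=True)
class Grant:
    """What the personal agent permits: authority, location, time, logging."""
    actions: frozenset
    zones: frozenset
    expires_at: float  # seconds on a clock both sides share (assumed)
    allow_logging: bool


@dataclass(frozen=True)
class Proposal:
    """What the robot offers to do, with its own confidence estimate."""
    action: str
    zone: str
    confidence: float
    question: str = ""


@dataclass(frozen=True)
class Reply:
    status: Status
    detail: str
    may_log: bool


def negotiate(grant, proposal, now, min_confidence=0.8):
    """Check a robot's proposal against the agent's grant."""
    log = grant.allow_logging
    if now >= grant.expires_at:
        return Reply(Status.REFUSED, "grant expired", log)
    if proposal.action not in grant.actions:
        return Reply(Status.REFUSED,
                     f"action '{proposal.action}' not permitted", log)
    if proposal.zone not in grant.zones:
        return Reply(Status.REFUSED,
                     f"zone '{proposal.zone}' not permitted", log)
    if proposal.confidence < min_confidence:
        # "I can do this, but I need clarification."
        detail = proposal.question or "robot is unsure; please clarify"
        return Reply(Status.NEEDS_CLARIFICATION, detail, log)
    return Reply(Status.ACCEPTED, "proceed within grant", log)


grant = Grant(frozenset({"fetch"}), frozenset({"kitchen"}),
              expires_at=100.0, allow_logging=False)
unsure = Proposal("fetch", "kitchen", 0.5, "Which cup do you mean?")
print(negotiate(grant, unsure, now=10.0).status)
# Status.NEEDS_CLARIFICATION
```

A pure command channel would have only the last branch. The refusal and clarification branches are what let either side say no, or not yet, in a form the other side can act on.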

The Pros And Cons

| Approach | What it gives us | What it misses |
| --- | --- | --- |
| Vertical robotics stacks | Tested products with owned safety cases, known capabilities, and tight control over hardware and software. | They can trap intelligence inside one body, one vendor, or one assistant. |
| Generalist robotics models | Transfer across tasks, environments, and embodiments. | Physical capability does not automatically include personal context, consent, or social judgment. |
| MCP-style interfaces | A way for agents to discover and use external capabilities. | Most tool protocols were shaped around digital systems, where failure is often reversible. |
| RobotMCP | A direct path between language agents and robot capabilities. | The deeper problem is representing context, permission, uncertainty, and refusal in a way both sides can trust. |

The Bridge Needs More Than Tools

If personal agents are going to coordinate with physical systems, the interface needs to carry more than commands. But I do not think the personal agent should have to manage robot state, telemetry, sensor health, or low-level safety behavior. The robot should own that. It knows what it is seeing, where it is uncertain, what it can reach, and what it can safely do.

The personal agent's job is different. It should provide task-relevant human context that helps the robot interpret what it is seeing and choose well. "That bag contains medicine." "Do not discuss this aloud." "Ask before moving anything from that counter." "This person prefers help that preserves participation." The bridge becomes useful when the robot brings perception and capability, the personal agent brings context and intent, and the decision emerges from their exchange.
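
As a rough illustration of "task-relevant" context, the personal agent could hold many hints but share only those about things the robot reports seeing. The `Hint` type, the `hints_for_task` function, the subject labels, and the third hint below are all made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hint:
    """One piece of human context tied to an object or place."""
    subject: str             # label for what the hint is about (assumed)
    text: str
    speak_aloud: bool = True  # False means the robot should not voice it


def hints_for_task(all_hints, visible_subjects):
    """Return only the hints about things the robot reports seeing."""
    visible = set(visible_subjects)
    return [h for h in all_hints if h.subject in visible]


hints = [
    Hint("blue_bag", "That bag contains medicine.", speak_aloud=False),
    Hint("kitchen_counter", "Ask before moving anything from that counter."),
    Hint("home_office", "Keep out while a call is in progress."),  # invented
]

# The robot reports what it perceives; the agent filters what it shares.
shared = hints_for_task(hints, ["blue_bag", "kitchen_counter"])
print([h.subject for h in shared])
# ['blue_bag', 'kitchen_counter']
```

The robot never learns about the home office because the task does not take it there. The filter is deliberately crude; the point is only that context flows per task, not as a standing profile.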

The Next Step

I do not think the goal is to make every robot into a personal agent. I also do not think the goal is to let every personal agent directly control every robot. The goal is narrower and more useful: a personal agent should be able to coordinate with a capable robot under explicit limits. The robot should expose what it can do and what it is unsure about. The personal agent should provide only the context needed for the task. The interface between them should make safety, privacy, and authority first-class concepts.

That is not a finished product yet. It is a direction. And if Part 1 was about the problem, and this part is about the tools, then Part 3 is where the design question gets sharper: what should the bridge actually look like?

Source Notes