Agent Ready episode 5 with Arcade: tool calling, orchestration, & agent development
Auth & identity
Sep 3, 2025
Author: Stytch Team

The fifth episode in the agent-ready video series is here! Featuring Mateo Torres from Arcade, this session dives into tool calling, orchestration, and the step-by-step process of agent development. You’ll see how LLMs can evolve into multi-agent systems capable of tackling complex, real-world tasks.
Video overview
LLMs provide the brains of modern AI agents. But a brain alone doesn’t get users very far. The concept of Agency requires the ability to act in an environment, and that’s where tool-calling comes in: connecting the reasoning power of an LLM to real-world systems. In this session, we’ll explore how the right tools can make LLM-based agents more reliable, productive, and cost-effective. We’ll look beyond simple information retrieval to see how tools enable full productivity automation and even multi-agent systems.
Through live, incremental builds, you’ll see how to evolve an agent step-by-step from a single-LLM chatbot into a robust multi-agent system capable of tackling complex, real-world tasks. You’ll leave with a clear understanding of what it takes to unlock the full potential of AI agents through tool-calling.
Full transcript
Reed: Hi there and welcome back to our Agent Ready Applications video series. My name's Reed. I'm the co-founder and CEO here at Stytch. And today I'm joined by a very exciting guest, Mateo, who leads developer advocacy at Arcade.Dev. Mateo, tell us a little bit about what Arcade does and where Arcade fits into this kind of idea of how to build an agent-ready application.
Mateo: Thank you Reed. I'm Mateo from Arcade. Arcade is a unification platform that brings together your LLMs, your orchestration platforms like LangGraph and so on, and everything that is a tool, especially authenticated tools, on the other side.
So your agent app now can, for example, read your emails, it can send emails on your behalf, it can write a Telegram message, a Twitter post, and so on. So that's what Arcade does for you. [00:01:00]
Reed: Awesome. I'm very excited for today's presentation and the demo that we're gonna walk through, because to me, and I know obviously Arcade believes this fully as well, tools are such a critical component to making LLMs feel high fidelity for consumers. We'll talk about it later, but a lot of the improvements that we're making in models are great, and adding different tools and more powerful tools are another way to significantly augment our experiences with AI agents and what makes them valuable. So, very excited to jump in today. I think what we'll do is let you talk through at a high level what Arcade does, what tool calling enables, and then we'll jump into some code and demos.
Mateo: Sounds good. I wanted to present this idea of achieving agency, which is how tool calling enables LLM agents and multi-agent systems, or rather how tool calling is the core component that makes all of this possible.
The first question is obviously what is an agent? There is a classical computer science definition of agent, but here I'm gonna focus on the kind of LLM as a brain idea of an agent.
For that kind of agent, we minimally need a text medium, and that could be chats, a prompt, anywhere that you can put text inside; an orchestration system that will move that text around, push it to and from LLMs, possibly different LLMs. You need one or more language models, a set of LLM tools, and I will explain in this presentation why this is the critical component, and a tool runtime.
The tools themselves are the definitions of what the agent can do or what the language model can express intent to achieve, and the tool runtime will actually make that happen. This is the collection of things that you need [00:03:00] at the very least to make an agent. You can have systems, state, persistent chat, storage, monitoring, memory, and other components that will make the agent better, more useful, more powerful, and so on. But these are not required for agency.
So if we go from LLMs all the way to autonomy, which is the bar that we define as required to become an agent, there are a couple of things that you need. This is the hierarchy or the pyramid that you need to achieve. You need the LLM, which is the brain. You need orchestration because LLMs will talk in natural language. There are more specialized languages if you wish, structured outputs and so on.
You need to put these into the right context. One agent [00:04:00] will be an expert on a specific thing, so you would need a place to put learning examples or inputs to that agent that does a specific task. You need retrieval to get the relevant context into the agent so it can make an informed decision or process the information in a useful way.
It needs agent orchestration, which is the idea of doing the analysis in the loop so that we can achieve something like a ReAct-style agent or chain-of-thought processing. You need tool calling, and I'm gonna stop here for a little bit and go a little deeper into how this is implemented.
What is a tool? An LLM tool is essentially a function, an f of X that we call in order to get the outcome of the function, which is Y. Tool calling is the process of asking the LLM to predict f and X. Tool calling does not imply execution. The whole job of the LLM is to say, “I want to call f with parameters X,” and that's it.
The tool definition involves the set of inputs, each with some possible values, a set of potential outputs that the LLM needs to be aware of, and a description that helps the LLM decide to call this function and predict the correct parameters given the context. What is critical for the tool is a method of execution, but that is external to the LLM. The LLM doesn't really care how the system, the app that contains the LLM, runs the [00:06:00] function or delegates the function to a runtime. In a very similar way that a classical application will rely on an API and it doesn't care how the API does its job, the LLM doesn't care who or where the function is actually running.
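To make that concrete, here is a minimal sketch of a tool definition in the OpenAI chat-completions format. The multiply tool, its parameter names, and the separate executor function are illustrative examples, not code from the episode.

```python
# Illustrative tool definition: a JSON-schema description the LLM reasons over.
multiply_tool = {
    "type": "function",
    "function": {
        "name": "multiply",
        "description": "Multiply two integers exactly and return the product.",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "integer", "description": "First operand"},
                "b": {"type": "integer", "description": "Second operand"},
            },
            "required": ["a", "b"],
        },
    },
}

# The method of execution lives outside the LLM: the agent app maps the tool
# name to a real function and runs it only after the model expresses intent.
def multiply(a: int, b: int) -> int:
    return a * b
```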
The tool calling process has two main components: the agent app, which can act as the tool runtime, and the LLM provider. The way this works is the agent app will first invoke the LLM with a prompt and it will give some tool definitions. The LLM will process that prompt in natural language, tokenize, and reply with the intent to call a selected tool and the predicted parameters that should be passed to that tool.
The agent app is responsible for getting that intent from the LLM, running the tool with the predicted parameters, validating it, checking whether authorization is required, and invoking the LLM again with the results coming from the tool. Then the LLM will reply informed by the tool execution. This sounds like retrieval because the LLM may say, “I want more information, use this tool to give me my information,” but it's not limited to retrieval. [00:07:00] You can do a lot of things. The side effects of a tool can be an email, a Slack message, a LinkedIn post, pretty much anything that you can imagine.
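A minimal sketch of that round trip, assuming the OpenAI Python SDK and the illustrative multiply tool defined above; the model string and prompt wording are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messages = [{"role": "user", "content": "What is 804733 multiplied by 19223?"}]

# 1. Invoke the LLM with the prompt and the tool definitions.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=[multiply_tool],
)
msg = response.choices[0].message

# 2. The model replies with intent only: a tool name plus predicted arguments.
if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)

    # 3. The agent app, not the LLM, actually executes the tool.
    result = multiply(args["a"], args["b"])

    # 4. Send the tool result back so the model can reply informed by it.
    messages.append(msg)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(final.choices[0].message.content)
```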
Reed: Just to make this tangible for myself. For something like, I use ChatGPT all the time. [00:08:00] I use O3 as my preferred model personally, just for everyday use, and there's a bunch of tools baked into that. I guess that would be a scenario where the agent app and the LLM provider are actually the same, even if that's not typically the experience. Because obviously OpenAI is providing O3, but then ChatGPT is the agent app in that context. It's good for me to know that the agent app itself is the one tasked with actually running those tools with the predicted parameters that are coming back from LLM providers. So it's not always the case that, as with ChatGPT, everything is OpenAI under the hood. Then, if Manus is using Anthropic's Claude, Manus would actually be the agent app that is running the tool with the predicted parameters. Is that accurate?
Mateo: Correct. In that specific case, of course, OpenAI is running the agent app and it's at the same time the LLM provider. They have all the compute and they run these “locally.” But the analogy is the same. Even they, OpenAI, need to process the [00:09:00] intent using the agent app. They may have to call Google Drive or whatever if it's a plugin. They will need to browse the web for you or something. So all of that, even if they trigger the action, the intent has to be parsed and interpreted by the agent app because the LLM directly cannot call the function. And that's the boundary.
An analogy that is very useful in this case to also illustrate the power of tools is: if you're a human and you see this multiplication, I'm gonna give you some time to think about it if you want. I certainly cannot do it in the time. I can compute this with high school math, but it's quite an involved multiplication. It has a lot of digits, six on one side, five [00:10:00] on the other. It's not a simple, easy-to-do mental multiplication. The LLM will have similar difficulty at this task, especially if I make it more complicated just by adding more digits to these two numbers.
Of course we want better LLMs that do reasoning and are good at math, and we can train them and they are improving over time. One of the easy tests we can do is exactly this: multiply two numbers with an increasing number of digits. Here I have two pictures. On the left I have O1 Mini, which was the first reasoning model, multiplying two numbers. We can see, as we increase the number of digits of each operand, how it goes from 100% success for one-digit-by-one-digit and even three-by-three, three-by-four, three-by-five, to zero if you have [00:11:00] 20-digit-by-20-digit numbers, which are massive numbers. But the LLM is also billions, maybe trillions of parameters. So why can it not just extrapolate and invent an algorithm to run internally?
On the right, I have a much older model, GPT-3.5, which was the first model by OpenAI with the ability to call tools. If I give it the tool to do multiplication, it will just ace this exam. It feels a little unfair because it didn’t really have to think. It's like giving me a calculator: I can just input the numbers, hit multiply, and I’ll get the answer. But in the same way the calculator extends my ability to multiply, the tools make the LLM better at the task. I'm not testing the [00:12:00] LLM to say O1 Mini is better or worse than 3.5—that comparison isn’t useful. What I really want is the agent that can multiply whatever number I give it, and I want to trust that agent to do the right job. If I give it a calculator and I trust it to input the things correctly, I solve the task. And it's much cheaper to run GPT-3.5 compared to O1, for example.
Reed: I think this is such a great point and highlights the importance of tool calling—making it easier for agentic apps to invest in this. Because really, to your point, one of the things I think about when I see this graph is with every model release—and we’re talking the day before GPT-5 is supposed to come out—there’s some big improvement in base model intelligence, which is great. But it's also possible that tomorrow could be more incremental.
I think there's an element here where in order for folks to feel AGI, however you define that, or feel like agents are truly doing valuable things in their lives, it doesn't just have to be base model intelligence improving, as demonstrated here. [00:13:00] If you are layering on tools to older models or getting that incremental improvement in base model intelligence, and then adding tool calling—or the rumored GPT-5 release feature of dynamic routing across reasoning and non-reasoning use cases—those augmentations are so important for the regular end consumer to feel like this is magic.
Even if perfect performance on the right-hand side is still years away at the pure model level, we can recreate it or build it in other ways. I really love this slide as an example of why tool calling is important, but also why it's so important to the end consumer’s experience of whether this feels like a magical, almost AGI-like product.
Mateo: Absolutely. I love the idea that base intelligence will be super valuable as we go toward AGI, but we need the LLMs, however intelligent they are, to use our systems. We need them to use Slack, Asana, Jira—wherever our data lives. It needs to be able to use that reliably. So far we haven’t solved infinite context. We haven’t solved hallucinations. As much as we can ground this into using Jira, Asana, Google, and calendars [00:14:00] in a predictable, reliable way, that’s where the value comes from.
I can have a super intelligent model that is perfect without hallucinations and can figure out how to program on the fly and invoke APIs. I still need the barrier of getting user consent. Security will be more important than ever in the new agent world. Even if I have the perfect brain as an LLM—GPT-65 or whatever—I still need it to interact with well-established security systems, APIs, and tools, and interact wherever the data lives. [00:15:00]
We just need them to use tools, which is why some benchmarks now measure how good these models are at tool calling, which I think is amazing.
Reed: And I think your other point, which is a more practical one but does matter for adoption and use cases, is it's way cheaper if you can give tools to an older, cheaper model. Obviously GPT, or OpenAI, just released the open-source model yesterday, and if you can give that tools and get it to the level of intelligence that you might otherwise be paying for in a more expensive [00:16:00] model, there's a lot of promise and opportunity in terms of the consumer use cases you can enable.
Mateo: Absolutely. I would love a smaller, maybe simpler model, but one so good at using tools that I could use it as a real personal assistant without internet on my phone. That would be awesome.
Reed: Agreed. Agreed. Yeah.
Mateo: Right. So the next thing is: why can’t ChatGPT send an email then? It can multiply if we give it tools, and so on. The missing piece on that pyramid I showed before is that agents shouldn’t just be able to work for you, they need to be able to work as you. They need to use your context. They need to send emails from your account, send a WhatsApp message to your groups, and so on. The missing piece here is user impersonation, which comes in the form of authorization and [00:17:00] authentication.
We’ve evolved this in the web era so much that we already have a lot of solutions: zero trust, OAuth, many protocols that work. We have governance, policies, and systems that make security possible. We should adopt this into agents and give them the ability to ask for permission to do the right things.
The user should be in control. If I want my agent to buy something, I need to be able to tell it, “Buy from this merchant. You can only spend a hundred dollars. You cannot spend it at this other merchant. You can only buy these types of things.” I make that concrete into a policy that I send to the agent. I ensure that even if the agent hallucinates, it has to comply with the policy before making the purchase.
That’s the ultimate goal. For example, I could give a one-time-use credit card to the agent and it could buy a birthday present for my father. This is what it truly means to go from a brain—raw processing power—to actually doing things in the real world. [00:18:00]
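A hedged sketch of what such a policy check could look like on the agent-app side; the policy fields, merchant value, and enforcement point are all hypothetical.

```python
# Hypothetical user-defined purchase policy enforced before a "buy" tool runs.
PURCHASE_POLICY = {
    "allowed_merchants": {"example-merchant.com"},
    "max_amount_usd": 100.00,
}

def enforce_purchase_policy(merchant: str, amount_usd: float) -> None:
    """Raise if a requested purchase violates the user's policy."""
    if merchant not in PURCHASE_POLICY["allowed_merchants"]:
        raise PermissionError(f"Merchant '{merchant}' is not on the allowlist.")
    if amount_usd > PURCHASE_POLICY["max_amount_usd"]:
        raise PermissionError(
            f"Amount ${amount_usd:.2f} exceeds the "
            f"${PURCHASE_POLICY['max_amount_usd']:.2f} spending limit."
        )

# The runtime calls this check before executing the purchase tool, so even a
# hallucinated tool call cannot exceed what the user explicitly consented to.
```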
At the beginning of the presentation, I said tool calling also enables multi-agent systems—specialized agents that communicate and collaborate with each other. So how does this actually enable multi-agent systems? I’ll go into parallels here. If we have two LLMs, how do they talk to each other?
Say a user prepares two copies of ChatGPT or GPT-4. The first agent is the X-calculation agent: an expert prompt to do some task X. The second is the perfect evaluation agent: I give it the answer and it tells me whether it's correct or incorrect. So I tell the first agent, “Please calculate X,” where I get this X from some context. It replies, “Here's the answer to X. Can I kindly help with something else?”
Now, the part I care about isn’t the prose—it’s just the answer to X. That’s the useful bit I need to pass to the other agent. Without tool calling, the user has to copy and paste, decide what’s important, and then invoke the other LLM with a new prompt. The second LLM replies, “Yeah, this looks okay,” or whatever.
In this case, the user is the communication channel between the two agents. I don’t like to call them agents—I just call them LLMs—because the input and output rely entirely on the user. The user has to support and activate the communication. So they’re not autonomous. They cannot just talk to each other, they cannot really collaborate.
If we replace these with agents that can actually use tools, and you put a runtime in the middle—a tool runtime—then the same problem comes in. But instead of just giving text, because the LLM can use tools, it can also do something called structured outputs. So it will express the intent, saying “transfer to agent,” and it will point to the other agent because it knows it's in its context. It will say, “Send this context. That's the relevant bit the other agent should use.” [00:21:00] This gets sent to the tool runtime, because we cannot interact directly with the other one. The runtime prepares the problem, sends it to the other LLM, and now you get agency.
Now you have an automated communication channel with a real interface for getting information in and out of LLMs. This is agent-to-agent communication. The industry term for this is handoffs. Everyone says, “Oh, this is so awesome.” But behind the scenes, this is just tool calling, implemented through techniques like structured outputs and a tool runtime.
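Stripped of framework details, a handoff can be modeled as just another tool the model is allowed to call. The following is a hedged sketch of that idea, not Arcade's or any SDK's actual implementation; the tool name, agent names, and routing helper are illustrative.

```python
# A handoff expressed as an ordinary tool definition the LLM can select.
transfer_tool = {
    "type": "function",
    "function": {
        "name": "transfer_to_agent",
        "description": "Hand the conversation off to another specialized agent.",
        "parameters": {
            "type": "object",
            "properties": {
                "agent_name": {
                    "type": "string",
                    "enum": ["calculation_agent", "evaluation_agent"],
                },
                "context": {
                    "type": "string",
                    "description": "Only the relevant bit the next agent needs.",
                },
            },
            "required": ["agent_name", "context"],
        },
    },
}

# The tool runtime, not the LLM, performs the actual routing between agents.
def handle_transfer(agent_name: str, context: str, agents: dict) -> str:
    next_agent = agents[agent_name]   # look up the target agent by name
    return next_agent.run(context)    # hypothetical .run(): invoke it with that context
```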
So I have a demo for you on how to work our way from a single chatbot to a multi-agent system. Let me go—
Reed: I have to say, your meme usage has been on point throughout the entire presentation.
Mateo: Thank you. Right, so let me first show you the demo. This is what we are building.
I’ll be sent a code to my email, copy it, and we’re in. On the left, you see your usual chatbot UI. Here we just interact with the agent. On the right, we see all of the events that are happening—things not normally shown in the chat. For example, I say “hi.” We see here there was a call to the assistant, and then some metadata.
Now, when I say something that requires a tool call, I can say something like, “Send a DM to Mateo on Slack.” [00:23:00] A couple of things happen. We get the intention from the LLM of calling Slack: “Send DM to user.” It shows me the arguments: username is Mateo, message is “hi.” I can approve or deny this. This is called human in the loop. Here we get the metadata of this tool call.
So I approve it. We get the message back from the LLM: “I’ve sent a message to Mateo.” And here I get the full function result. This is not normally displayed in the chat—you normally just get the final output from the LLM. But this is the function result that is [00:24:00] actually sent to the LLM so it can generate that response.
In this case, Arcade is hosting the tool runtime for “send DM to user” from Slack. It will say, “I sent ‘hi,’ type: text.” It has metadata, maybe some images like the Slack icon or my profile picture.
I’ll show you how to implement a very basic version on the CLI. Let’s go to Cursor. I made this tutorial in multiple pieces—parts one to four. We start very simple in the CLI version with part one. Part one has no tools, and I’ll show you how this is like interacting with early versions of ChatGPT before custom GPTs or plugins.
We’re going to use the OpenAI agents SDK, which is very easy for creating an agent. It’s less than five lines of code. You define an agent, give it a name, say the model (GPT-4), and set instructions like, “You are a helpful assistant that can help with everyday tasks.” [00:25:00]
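Assuming the OpenAI Agents SDK for Python (the openai-agents package), the part-one agent looks roughly like this; the exact model string is a placeholder.

```python
from agents import Agent

agent = Agent(
    name="Assistant",
    model="gpt-4o",  # the demo mentions GPT-4; any chat model should work here
    instructions="You are a helpful assistant that can help with everyday tasks.",
)
```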
In the main function, we need to store the history of messages—just an array with all messages from the agent—so the context is alive. Every interaction with the agent is through an API, and we have to send all of the relevant context each time. OpenAI caches for us, but it’s up to us to track the conversation history.
The way this works is we have an infinite loop where we ask for input from the user. I set a safe word, “exit,” so I don’t have to manually kill the process. We append that to the history and then run the agent. This literally runs the chatbot with the context—the history of the conversation so far.
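A sketch of that loop with the same SDK, assuming Runner.run_sync and result.to_input_list() as the calls that run one turn and carry the conversation history forward:

```python
from agents import Runner

def main() -> None:
    history = []  # the full conversation so far; we resend it on every turn
    while True:
        user_input = input("You: ")
        if user_input.strip().lower() == "exit":  # the "safe word" that ends the loop
            break
        history.append({"role": "user", "content": user_input})
        result = Runner.run_sync(agent, history)  # run the chatbot with all prior context
        history = result.to_input_list()          # updated history, including the reply
        print("Agent:", result.final_output)

if __name__ == "__main__":
    main()
```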
Let’s see how that works. I say, “Hi, my name is Mateo.” It replies. Then I say, “What’s my name?” to test that it remembers. It does, because I’ve been sending all messages back each time to the LLM. Then I say, “Summarize my latest three emails.” [00:27:00] Normally this would require a tool. Of course, the LLM refuses—it says, “I can’t access your emails, but I can help with something else.”
So I quit this and go to part two, where we add tools to the chat. The way this works is simple. We have an Arcade client—Arcade is the tool runtime. I’m going to statically set my user here, but for example, in the React version I pull the user from Stytch, which has email information or limited user data.
I made a function called getTools. Basically, I send the key plan, the toolkits, the tools, the rate limits, and whether I need to enforce approval or not. This is used for human in the loop, and I’ll show you how in part three. The idea is simple: I just collect all of the tool definitions. These define the schema the tool accepts as parameters—its name and description. [00:28:00]
I then transform that into the agent’s format: description, parameters, return values. All of that information has to be sent to the LLM so it can decide when to invoke this tool versus others in context.
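A hedged sketch of that getTools idea: fetch remote tool definitions, then wrap each one in the Agents SDK's tool type so the model sees the schema while execution is delegated to the external runtime. The two fetch/execute helpers are placeholders standing in for the real Arcade client calls, which will differ.

```python
import json
from agents import FunctionTool

# Placeholder stand-ins for the hosted tool runtime; the real client API differs.
def fetch_tool_definitions(toolkits: list[str]) -> list[dict]:
    return []  # placeholder: would return dicts with name, description, parameters

def execute_remote_tool(name: str, args: dict, user_id: str) -> dict:
    return {}  # placeholder: would execute the named tool on behalf of this user

def get_tools(toolkits: list[str], user_id: str) -> list[FunctionTool]:
    """Wrap remotely hosted tool definitions in the agent framework's tool type."""
    tools = []
    for definition in fetch_tool_definitions(toolkits):

        async def invoke(ctx, args_json: str, _name=definition["name"]) -> str:
            # The agent app delegates execution and returns the result as text.
            result = execute_remote_tool(_name, json.loads(args_json), user_id)
            return json.dumps(result)

        tools.append(
            FunctionTool(
                name=definition["name"],
                description=definition["description"],
                params_json_schema=definition["parameters"],  # the schema the LLM sees
                on_invoke_tool=invoke,
            )
        )
    return tools
```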
On our side—the agent app—it’s up to us to execute the tool. If we get an error, like needing authorization, we can show a response URL, wait for completion, and then continue.
The loop hasn’t changed at all. We just add the tools. We pass a list of tools to the agent and modify the system prompt a little: “These are the tools available to you: tool name, description, etc.” The definitions are available for the agent to retrieve, search, and consider based on requests. [00:29:00]
Here, I gave it Gmail tools.
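Wiring those tools into the agent is then a small change on top of part one; again a sketch, reusing the hypothetical get_tools helper above with an illustrative toolkit name and user ID.

```python
from agents import Agent

agent_with_tools = Agent(
    name="Assistant",
    model="gpt-4o",
    instructions=(
        "You are a helpful assistant that can help with everyday tasks. "
        "Use the available Gmail tools when the user asks about their email."
    ),
    tools=get_tools(["gmail"], user_id="user@example.com"),  # illustrative values
)
```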
An interesting bit here is—let me just show you the Arcade dashboard. This is my Arcade dashboard. I am not logged into Google. I manually revoked my own access to Google to show you how this works. Back in Cursor, I say “summarize my latest three [emails],” for example. This is sent first to the LLM, and it replies, “Authorization required.” The authorization is not for Arcade—the app is authorized to use Arcade—but I need to authorize this user to Google.
What I get back from Arcade is a URL to [00:31:00] approve my connection—authorizing Google access so the app can use it on my behalf. I authorize, go through the usual Google flow, and it asks to view email messages and settings. It’s very important that when I ask to summarize my emails, it should only be allowed the minimal permissions needed. In this case, read-only. It cannot send emails on my behalf. I allow, it says “authorization successful,” I close it, return to the app, and now I can say, “Sure, give me my summaries.”
This is a summary of my emails. We can see how valuable this is. We got the call to list emails, “number = 3.” Those are the correct parameters, and it knows how to do that because of the tool definition we passed. This is the result—the actual output from the Arcade API. We get all metadata, execution info, everything we need, as well as the summary. The LLM interprets each of the emails as JSON and can transform that into a response. [00:32:00]
Moving to part three: here we introduce human in the loop. The earlier tool call was harmless—I allowed read access to my emails. But now I want to enforce questions before calling specific tools. In getTools I define tools with approval. For example, send email, draft email, trash email—anything with a write or side effect requires approval. I do the same for Slack tools in part four.
The logic is: when I wrap the tool in the OpenAI agents format, I check whether the tool name is in that approval list. If true, it asks me. The agent raises an interrupt, which I need to handle in my code. So when I get a prompt from the user, I also check for interrupts. If there are interrupts, I display: “This agent wants to use this tool with these parameters. Do you approve?” [00:33:00] Then I answer yes or no. If yes, I set the interrupt as approved by the user, then send that back to the LLM. Now the LLM can decide whether to invoke the tool, but it’s up to us to capture and enforce consent.
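The exact interrupt API varies by framework, so here is a framework-agnostic sketch of the same consent gate: check the tool name against an approval list and ask the user before anything with side effects runs. The tool names and the executor it calls are illustrative.

```python
# Tools whose side effects require explicit user consent (illustrative names).
REQUIRES_APPROVAL = {"Gmail_SendEmail", "Gmail_TrashEmail", "Slack_SendDmToUser"}

def approve_tool_call(tool_name: str, args: dict) -> bool:
    """Ask the human whether a pending tool call may proceed."""
    if tool_name not in REQUIRES_APPROVAL:
        return True  # harmless, read-only tools run without prompting
    print(f"The agent wants to call {tool_name} with {args}.")
    return input("Approve? [y/N] ").strip().lower() == "y"

def run_tool_with_consent(tool_name: str, args: dict, user_id: str) -> dict:
    if not approve_tool_call(tool_name, args):
        # A structured refusal lets the LLM explain the denial to the user.
        return {"error": "The user denied this tool call."}
    return execute_remote_tool(tool_name, args, user_id)  # placeholder executor from the earlier sketch
```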
In the terminal, I say, “Send a random email.” I’ll send it to myself, though it could be to anyone. The LLM’s intent is clear: it wants to send the email with these parameters. I approve. It fails—because the tool required different inputs. The LLM reads the error and retries with the corrected parameters. [00:36:00] This is important: as a tool developer, your errors should be structured as instructions to the LLM so it knows how to retry.
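That retry behavior is why tool errors are worth writing for the model as well as for humans. A hypothetical example of a tool whose error doubles as an instruction:

```python
def send_email(recipient: str, subject: str, body: str) -> dict:
    """Hypothetical email tool whose errors are written as instructions to the LLM."""
    if "@" not in recipient:
        # Structured, instructive error: tell the model exactly how to retry.
        return {
            "error": "invalid_recipient",
            "instructions": (
                "The 'recipient' parameter must be a full email address, "
                "for example 'someone@example.com'. Retry with a valid address."
            ),
        }
    # ... hand off to the real email provider here ...
    return {"status": "sent", "recipient": recipient, "subject": subject}
```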
Every tool call must be approved. You can create apps that batch approvals or set policies, but in general anything with side effects—send email, share data—should have guardrails like human-in-the-loop consent. In this case, I approve again and it successfully sends the email. I check my inbox: yes, a haiku—the exact message the agent sent. [00:39:00]
Now the last bit: making this into a multi-agent setup. Part four is the same, but splits functionality into multiple agents. I have a Gmail agent with Gmail tools, a Slack agent with Slack tools, and a triage agent that has handoffs to the other two. Handoffs, as we discussed, are implemented as function calls under the hood.
For example, when I say “send an email,” the agent introduces a new tool: transferToAgent, passing the ID of the Gmail agent. I also add handoffs so Gmail and Slack agents can hand off back to triage. You can imagine this as a fully connected triangle: all agents can talk to each other. The loop doesn’t need special handling; handoffs are handled by the framework itself and don’t generate interrupts.
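With the Agents SDK, the part-four wiring looks roughly like this; agent names, instructions, the model string, and the hypothetical get_tools helper are illustrative.

```python
from agents import Agent

gmail_agent = Agent(
    name="Gmail agent",
    model="gpt-4o",
    instructions="Handle email tasks using the Gmail tools.",
    tools=get_tools(["gmail"], user_id="user@example.com"),
)

slack_agent = Agent(
    name="Slack agent",
    model="gpt-4o",
    instructions="Handle Slack tasks using the Slack tools.",
    tools=get_tools(["slack"], user_id="user@example.com"),
)

triage_agent = Agent(
    name="Triage agent",
    model="gpt-4o",
    instructions="Decide which specialist should handle the request and hand off.",
    handoffs=[gmail_agent, slack_agent],  # surfaced to the model as transfer-style tool calls
)

# Let the specialists hand control back to triage, forming the fully connected triangle.
gmail_agent.handoffs = [triage_agent]
slack_agent.handoffs = [triage_agent]
```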
In part four, I say “summarize my latest three emails.” We get a transferToAgent call to the Gmail agent. The Gmail agent calls the “list emails” tool, gets results from Arcade, and returns the output. The triage agent prints the final response. [00:41:00]
The last thing I want to show is how this infinite loop idea is also implemented in the UI version. The only difference is it’s split between the frontend and backend. In the page, you’ll see handleSubmit. This makes a request to our API chat endpoint. [00:42:00]
So, API chat route. Here we get the input, and basically we do all of the security checks. This is client-facing, and this is running on the server. We get the user information, and then the same flow: we get the messages, which is the history, and then we handle the interruptions.
What we return is all of the outputs and the history so we can display that on the client side. There isn’t an explicit loop here—it’s one run of the inner piece—but the communication between the backend and the frontend creates this never-ending loop [00:43:00] as long as the session and history are maintained. In this case, it’s a database. That’s how the infinite-loop conversation idea, which may never end, is maintained.
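The server side of that UI version can be sketched as one function per request: load the stored history, run a single turn, persist the updated history, and return what the client needs to render. This is a simplified Python sketch of the idea rather than the actual Next.js route; the load/save helpers stand in for the database, and it reuses the triage agent from the part-four sketch.

```python
from agents import Runner

def load_history(session_id: str) -> list:
    return []  # placeholder: read the stored conversation for this session

def save_history(session_id: str, history: list) -> None:
    pass       # placeholder: persist the conversation for the next request

def handle_chat(session_id: str, user_message: str) -> dict:
    """One run of the inner piece; the frontend/backend exchange keeps the loop going."""
    history = load_history(session_id)
    history.append({"role": "user", "content": user_message})

    result = Runner.run_sync(triage_agent, history)  # a single turn over the full context

    history = result.to_input_list()   # includes tool calls, results, and the reply
    save_history(session_id, history)  # persistence is what makes the "loop" never-ending
    return {"reply": result.final_output, "history": history}
```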
So that’s it for now. The code will be available on GitHub.
Reed: Yeah. Well, thank you, Mateo. That was a great demo. I’m really excited we got to walk through that, and thank you for sharing more about the importance of tool calling and what it enables for developers who want to make agent-ready applications. I just wanted to leave it with kind of a final takeaway note for you: where can people find you, where can people find Arcade, and is there anything else you’d like to mention?
Mateo: I encourage people to go to Arcade.dev if you want the best tools for your agents. You can find me mostly on our YouTube channel or on our Discord server—I’m always helping people with their agents. There are many more patterns to [00:44:00] explore that are emerging. This is super exciting. It’s a very exciting field to be working in.
Reed: Awesome. Thanks, Mateo. And join us next time on Agent Ready Applications, the video series that helps you break down how to build them, how to think about them, and what’s next. Thank you.