
Agent ready episode 6 with Honeycomb: observability & SLOs with AI agent workloads

Auth & identity

Sep 4, 2025

Author: Stytch Team


The sixth episode in the agent-ready video series is now available! Featuring Jessica Kerr from Honeycomb, this session explores observability and SLOs for AI agents. You’ll see how tracing, instrumentation, and evaluation techniques can reveal the inner workings of LLM-driven systems.

Video overview

Back when programs were deterministic, we could hope to predict production behavior based on tests. With AI involved, this is impossible. There's a nondeterministic black box in our system; we need to find out what went in, what went out, and what the consequences were. Observability is made for this. Distributed tracing across an application provides the picture of how agent-related software works in each unique case. Meanwhile, SLOs help monitor performance without knowing everything we need to watch out for. In this session, we'll see the innards of an AI agent through tracing.

You'll come away with an understanding of:

  • SLOs for monitoring agent workloads
  • LLM evaluation techniques
  • Instrumentation with OpenTelemetry

Be prepared to tackle the challenge of understanding production behavior enough to improve it, lowering costs and boosting user experience.

Full transcript

Reed: Hi everyone, and welcome back to the Agent Ready series. I'm Reed, co-founder and CEO here at Stytch, and I'm really excited today to be joined by Jessica Kerr, who is the engineering manager for developer relations at Honeycomb, a great observability platform and one that we use ourselves here at Stytch. I'm really excited to dive into today's topic on AI and agents.

This has to do with one of the big problems that companies building AI apps are confronting: as we move from the deterministic outputs of the programs we create to things that are probabilistic, like LLMs and the different ways we're using AI, how do I get control, visibility, and observability over my application and the different tracing that's happening?

I'd really like to make sure that I can understand what's happening within this black box and make better decisions coming out of that. So Jessica, welcome. We'd love to hear a little bit more about what you'll be focused on today, and we're really excited to have you on.

Jessica: Excellent. Reed, you gave a pretty good reason to care about observability; that was a good intro. Alright, so personally I'm excited to talk about observability and SLOs for AI-related software, because you mentioned getting control of our apps again. I got into software because it was deterministic and you could code something and know what the computer was gonna do.

Fortunately we still have that. Most AI apps are mostly ordinary software, and observability is how we know what's going on in our deterministic code. I'm not touching how we know what's going on inside the model, that is way outta scope, but fortunately we have all this classical deterministic code around it, and we can put observability in there and make it tell us what's happening.

Great observability is all about making the software tell you what's happening, like on purpose, and then it can tell you what your users are doing. And then you can find out not just is your code working, but is it working for your users? Are people using it? Is it functioning as intended? Is it having the benefits you wanted?

Sometimes you can measure that, so this is really useful. We get our software to send us telemetry. Telemetry is data that exists just for us as developers and operators, and then we turn that into knowledge somehow. I'm not getting into that part too much today, but now it's even more interesting.

We have LLMs involved, we have non-deterministic components here, starting with, I'm writing the code with AI help and as such, I'm able to get more code into production that way. I can write an app in a weekend. Whether I should spend my weekend writing an app is another question, but I do.

Then, because I can put out more code, it's harder to keep track of what's in there. It's not all in my head anymore. I really need it to report back and tell me what's really going on. And then it gets weirder, because then we start incorporating LLMs into our software applications, with LLM-powered features.

For instance, in Honeycomb we have a query editor where you can type in exactly what you want and it'll do all kinds of autocomplete and stuff, but that's not good enough. We also have natural language: you type in what you want and let the robot write the query for you. And what do we get back from that?

Does it work? Sometimes, but the exact same query and the exact same code — it can be different. We can no longer pretend that tests give us the ability to predict what our software's gonna do in production. No, you try to test it, you get some examples. We have to know what's happening. You treat the LLM as a black box, and you really need to know what's going in and out of it with observability.
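
To make the "what went in, what went out" idea concrete, here is a minimal OpenTelemetry sketch in Python (not from the episode). `call_model` is a stand-in for whatever LLM client you use, and the attribute names are illustrative rather than an official convention.

```python
# A minimal sketch of recording what goes into and out of the LLM black box.
# call_model is a placeholder for a real LLM client; attribute names are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("agent-demo")

def call_model(prompt: str) -> dict:
    # Placeholder for the real, nondeterministic LLM call.
    return {"text": "stub response", "prompt_tokens": len(prompt.split()), "completion_tokens": 2}

def ask_model(prompt: str) -> str:
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("llm.prompt", prompt)                      # what went in
        result = call_model(prompt)
        span.set_attribute("llm.completion", result["text"])          # what came out
        span.set_attribute("llm.prompt_tokens", result["prompt_tokens"])
        span.set_attribute("llm.completion_tokens", result["completion_tokens"])
        return result["text"]
```

With no SDK configured these calls are no-ops, so the instrumentation can stay in place whether or not an exporter is wired up.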

And finally, our users are using AI. Honeycomb is a tool for developers and SREs, and developers are in the IDE. They have an agent in there, and that agent can talk to Honeycomb for them through our MCP; we support the Model Context Protocol, which is like another user interface with an AI intermediary.

So it's the agent talking to Honeycomb. Now it can run queries, it can list fields that it might wanna see. It can look at SLOs and see how they're doing. It can look at traces. It can do a lot of things in its own little interface called the MCP. So now we have users that are also a completely different sort of thing than when it used to be software hitting our API or people using our GUI.

Reed: I think this is a great overview. I like how it demonstrates the more-code problem, the more probabilistic outcomes that are non-deterministic, and also the increasing interfaces. I really like the way that you described MCP. One question I was curious about there: as we talk to companies that have launched remote MCP servers their end users can interact with, since it is still early days, are you finding any interesting trends, and what are the most popular use cases? I'm just curious what you've noticed so far from the Honeycomb side.

Congrats.

Jessica: Thanks. It's working; I use it. And then I'll go into distributed tracing specifically: can we use distributed tracing to see what AI agents do? Because so much of the activity of using AI is outside of the LLM itself.

We can see a lot. And then we'll look at serving that LLM integration within features. And finally, I wanna show LLM evaluation, which is about evaluating the black box itself. That's a different thing.

Great. So yeah, you asked about the MCP. Let's flip over and look at the board representing our MCP server, where we're making sure requests succeed.

And we can look at the error rate, which currently is zero, which... I have suspicions; that doesn't look right to me. But we're also tracking some other metrics, like whether it responds quickly. This success rate we have alerts on, and these other ones we're just using to track. So I'll dig into one of those in a minute.

We can also see how many different customers are using our MCP. Currently the number is small, it's just in beta, but I can tell you those customers are really excited about it. They're able to debug problems directly from the IDE by asking the agent to go to Honeycomb, look at an error in a trace, and then the agent can find the place in the code and find the problem.

Yeah. People have always wanted a Honeycomb IDE integration, and we were like, nah, we don't have time for that. But now it's there and we don't have to write it. We just do what we do best, which is answer queries really fast. And that integration is right there, because the agent can see the code and it can now see graphs and traces in Honeycomb.

And it's perfect for bringing observability into your IDE. So, MCP: awesome. You can see I'm really excited about it. But for anybody who has an MCP, you can use your observability to track who's using it. We're using OAuth now, so I could count users as well. Here we're just counting teams, and we can see how many different sessions each team is using.

Team 13 has the most; this other team has 31 over the last... I think I'm on the last week. We can see most people are using it through Node. That's probably mcp-remote, a wrapper.

But Claude Code and Cursor are the next biggest ones. Deno, that's like Node; Bun, also like Node. And then, what are they doing in there?

We can see the tool calls that they're making. So they're mostly running queries, and this is the agents. Now the agents are deciding to do this stuff. Agents are mostly running queries and they're listing data sets and searching columns and look, a few of them are looking at traces. I'm surprised that number is not higher.

And then you can look into how well it's working. Speed is really important here, because the AI is in a loop, right? It's cycling: it's thinking about investigating errors, it needs to run a query, and it'll get back maybe a list of errors. Let's see, if I scroll down, there'll be, yeah, here's a list of errors.

And then it might dig into one, so you might filter down to just this value. And that cycle needs to be quick, so we care about latency. And if we look at our service level objectives, here's one of them: tools respond quickly, non-query. We've got different definitions of how fast it should be when it is a query.

That one's looking for three seconds. We're only getting 93% at three seconds; that's still pretty good. This one, I think, is looking at 500 milliseconds. Lemme click into there. Yeah, so it's looking for MCP tool calls that are not query.

And 500 milliseconds duration is the success criterion.

Now, service level objectives: by the way, everything I'm showing today in Honeycomb is available on the free tier except for the SLOs. But the SLOs are particularly good for AI-related events, because every request to an LLM is different and everything we do for them is different.

Service level objectives at Honeycomb are event-based. They count how many times things happened, how many times one of these MCP tool calls came in, and then they count how many of those were good. It's not about error rates; there's no threshold. It's just what percentage of those are good.
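
As a rough illustration of that event-based idea (a sketch, not Honeycomb's implementation; the non-query filter and the 500 ms threshold mirror the example in the demo, and "run_query" is a hypothetical tool name):

```python
# Event-based SLI sketch: count qualifying events, judge each one good or bad,
# and report the percentage that were good. No rates, no rolling thresholds.
from dataclasses import dataclass

@dataclass
class ToolCallEvent:
    tool_name: str
    duration_ms: float
    error: bool

def non_query_latency_sli(events: list[ToolCallEvent], threshold_ms: float = 500.0) -> float:
    qualifying = [e for e in events if e.tool_name != "run_query"]  # hypothetical tool name
    if not qualifying:
        return 100.0
    good = [e for e in qualifying if not e.error and e.duration_ms <= threshold_ms]
    return 100.0 * len(good) / len(qualifying)

# With a 95% target, the SLO is met while this stays at or above 95; the gap
# below the target is the error budget being spent.
```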

And we're aiming for 95% and we've had too many errors. We're below that. It's okay. We're just monitoring this one for now. And then down below we can look at a heat map of how slow they were. We're not getting a whole ton of requests. The numbers are pretty small here. And then which ones is it happening in?

If I look here, the name is different between the failed ones and the successful ones. It's the trace tool; only the trace tools. To be fair, the trace tool is actually doing a query, so 500 milliseconds is way too slow for that. And this is important about observability.

Things are slow. Things don't work. You need to dig into the real story, and that's where distributed tracing is super useful. Now, lots of tools will do distributed tracing; Honeycomb is the best place to try it, with a free account. Whatever, I'm biased. I work here, and I work here because I love it.

Reed: I'll second that from our own engineering team's experience: I see a lot of images of this in our Slack where people are like, ah, finally found it, and it's this trace type of view. I'll let you jump in, but I'm excited seeing this MCP dashboard, because the first question I'm gonna ask when I get out of this call is, have we built a dashboard yet in Honeycomb for our MCP server?

'Cause I realize I'm blind to a lot of those different elements that you're already walking through, so this is fantastic.

Jessica: Oh, nice. Yeah. My favorite part is that it tells you what your users are doing. It tells you which parts of your MCP tool are useful. And also, I'll show you the error messages in a minute; they're super useful. But here I can see what's slow. So even if you don't use Honeycomb, do use distributed tracing. Do use OpenTelemetry to build your distributed traces and get the story: what went on here, what was so slow? So here's our trace tool. What was slow in here? It's a retriever client search, so that's a query. It took a second; honestly, that's not that bad. And it's this retriever client search, which is getting the trace data, that's taking so long, and you can break that down into all the other things if you want to work on speeding it up. Distributed traces are great for what took so long, but more importantly, they get you a story of what happened and where all the different parts are and how your software fits together, which I like.

Going back to that board, there's something really interesting in these error messages, because, and I love this, it tells me what the agents are sending that we are not expecting. "Count operator does not take a column": that's the most common one. We should handle this. Austin, who's the product lead on this, is planning to fix this one. It's very simple: ignore the column. We can accommodate this and improve the success rate of people's queries just by noticing what it has to say. Something about the order. Unknown filter op "does not exist": it's actually "does-not-exist", with dashes; we could handle both, we could do that translation. We can help it out. This tells us that people apparently want a rate-per-second operation, or agents want a rate-per-second operation. We don't currently have that. We could consider adding it. I love this expression of demand.
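
One way to capture that "expression of demand" is to put the validation message on the span whenever an agent sends something the tool doesn't accept. The handler, exception type, and attribute name below are hypothetical, a sketch rather than Honeycomb's code.

```python
# Sketch: record what agents send that we don't expect, so it can be grouped on a board.
from opentelemetry import trace

class QueryValidationError(Exception):
    """Raised when an agent sends a query shape the tool does not accept."""

def validate_query(query: dict) -> dict:
    # Illustrative rule only, mirroring the error message mentioned above.
    if query.get("op") == "COUNT" and "column" in query:
        raise QueryValidationError("COUNT operator does not take a column")
    return query

def handle_run_query(query: dict) -> dict:
    span = trace.get_current_span()
    try:
        return validate_query(query)
    except QueryValidationError as err:
        span.set_attribute("app.validation_error", str(err))  # group by this to see demand
        span.record_exception(err)
        raise
```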

Reed: It's such a great view on product feedback, customer feedback, because one of the things that we often say is that the best feedback from customers is often unsaid.

It's either through the experience of what they go through, because with a lot of the ways we think about friction, and friction logs of the experiences customers might have, it's really only when it gets over a certain threshold that people will reach out to you and say, I want this feature, or I have this problem.

But there's so many of these smaller elements that add up, and obviously this is the agent in this case, but there's also a real end user trying to accomplish something on the other side. So I just love the way that you frame this as being, effectively, introspection into the end user's objectives.

Jessica: Yeah. Yeah. It gives you clues. Yeah. So I love this. And we have some improvements that we can make, and then we'll see these numbers go down and hopefully we'll see these numbers go up.

Reed: That's great. That's great. Now I'm excited to build this exact same version of a dashboard for our own MCP server.

Jessica: Sweet, sweet. Yeah, recommend. Great. So on the other end of this MCP is an agent. And I saw some traces of the MCP, but what's even more interesting is, what the heck is that agent doing? I can't get the traces for their agent, but I can get the traces for my own. I have downloaded and forked and run OpenHands.

Because it's an open source AI agent that lots of people are working on, and I can mess with it. I can add tracing to it. So I'm working on adding OpenTelemetry tracing to this agent, and I can show you what that looks like. Let's see. We want VS Code. Yes, OpenHands. And I'll run just a coding agent.

Let me make this bigger. Yeah. I can run the OpenHands CLI. So I'm just running it in the simplest mode, or close to the simplest mode, which is a command line, like Claude Code. Here we are, it's in my IDE and I'm going to run the CLI. This is the simplest form, much simpler than the user interface for OpenHands.

It's like Claude Code. I am in the code directory for OpenHands at the same time that I'm running OpenHands. It's a little meta, but I can ask it about its own code. What do I want to build? I want to know what happens when I type here. Other than that error that always appears, I don't know.

Okay, so the agent is running: it's noticing directories, it's looking around, it's exploring some stuff. Yeah. Let's find out what that looks like overall. I'm gonna flip over to Honeycomb. I think it worked. This is the OpenHands CLI dataset: what's going on in the last 10 minutes? And let's look at a trace.

This is probably the trace of my session. Yeah, it is. It doesn't find a root span because I have not finished the session yet. So, since stuff is going on, let's see, I'm gonna take these fields out. Chop, chop. And make this one bigger, and we can see: okay, check folder, security agreement, run session, create agent, create runtime, and aha.

Here's some stuff.

All right, let me add my little categorizing field. Okay. What's happening? It's printing some output. With this render, it's making a call to an LLM. And this is always interesting because what is it sending? Here's a prompt. Sending a lot. Okay.

And yeah. What model is it sending to? Claude Sonnet.

And then it gets that back and it starts doing a function call. So in this case it's using the string replace editor tool to do a view. Why the string replace editor tool does view, I don't know, but it does; that's the design. And this is an example of the code teaching me about it, because I wouldn't have guessed that the string replace editor tool is how it reads directories, but there it is.

Yeah. And then it went back to the LLM and then it did another view and it went back to the LLM and another one. And now I wanna know how much this costs. Yeah, there's a cost field on here, cost. And this cost field is just looking at the input and output tokens, which are on the instrumentation.

Instrumentation is the code that emits telemetry. And this tells me how much each request costs me because I hard coded the price.

Reed: Yep.

Jessica: In this calculator field. Yeah. Okay. Now gimme a sum of that. Events. This gives me everything that happened in that trace, but as a query, and I can do aggregations.

Sum the cost and the input tokens. Yeah, gimme the prompt tokens and the completion tokens; that's input and output. What do we have here? Overview. There we go. Okay, this has cost me less than a dollar. Okay, that's still a lot for reading code, but there it is. And most of them are prompt tokens; it's by far mostly prompt tokens.

We are sending a ton of information: directory listings and all kinds of stuff. I'm gonna expand the timeframe a little bit and see if I've done anything else. Yeah, a total of a dollar twenty so far. And it's come back and it wants to do even more. It wants to grep. Sure, do the grep.

Yeah, so whenever I write a toy app that integrates with an LLM I add observability really fast 'cause I wanna know how much it's costing me.
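
For reference, the cost arithmetic described here is just token counts times a hard-coded price. The numbers below are placeholders, not real rates, and whether the calculation lives in a derived column or in the instrumentation itself, the math is the same.

```python
# Placeholder prices; substitute whatever your provider actually charges.
PRICE_PER_1K_PROMPT_TOKENS = 0.003
PRICE_PER_1K_COMPLETION_TOKENS = 0.015

def llm_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    return (prompt_tokens / 1000.0) * PRICE_PER_1K_PROMPT_TOKENS \
         + (completion_tokens / 1000.0) * PRICE_PER_1K_COMPLETION_TOKENS

# Put it on the LLM span so it can be summed per trace, per user, or per team, e.g.:
#   span.set_attribute("llm.cost_usd", llm_cost_usd(prompt_tokens, completion_tokens))
```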

Reed: It makes me realize there's a lot of AI usage internally where I actually don't know where I'd go to find that at the individual-call level, other than OpenAI's dashboard, which has some elements, but not nearly this level of fidelity, obviously, when we're trying to understand how it operates.

Jessica: Right, 'cause I can get that cost by user and team and whatever the heck else, and I can see the entire story of what the heck they're trying to do. Why did it cost so much? You can see what's in those input tokens. So observability is really important for this. Oh, and then I wanna show you where this instrumentation came from.

So let me go back to: the library name contains... like the OpenTelemetry libraries that I brought in to make the instrumentation work. And then I have a type, I think. Yeah, some of this I wrote. Okay, so the stuff that says auto: Jinja is a templating library, and that has automatic instrumentation.

OpenTelemetry knows how to tell you what happened in Jinja, and litellm knows how to tell you what happened in your LLM requests. And then any kind of API call you make can automatically get instrumented. So all I did to make these was add some libraries and then stick an OpenTelemetry call in front of my Python executable.
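
For anyone wanting to try the same thing, here is roughly what that wiring looks like (a sketch, not the exact OpenHands setup). The zero-code route is the `opentelemetry-instrument` wrapper from `opentelemetry-distro`; the manual setup below is the programmatic equivalent, with the endpoint and API key coming from the standard `OTEL_EXPORTER_OTLP_*` environment variables. The service name is an arbitrary choice here.

```python
# Zero-code route (shell, shown here as comments):
#   pip install opentelemetry-distro opentelemetry-exporter-otlp
#   opentelemetry-bootstrap -a install        # pulls auto-instrumentation for installed libraries
#   opentelemetry-instrument python my_agent.py
#
# Programmatic setup, if you prefer to wire the SDK yourself:
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "my-agent"}))  # name is up to you
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))  # endpoint/headers via OTEL_* env vars
trace.set_tracer_provider(provider)
```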

But then this custom stuff, I wrote this. So this is me getting this application to describe itself. So if I go look for this in the code: this is custom, function calling. Shift-F, function calling... where did it go? Here it is. I added this line, and then I wanted more detail, so I added this line.

A beauty of OpenTelemetry and distributed tracing is that you can get all this information in the same place, like this log statement. Actually, if I turn on debug logging, it gets attached to the trace, so then you can see it in context. But without tracing, that log exists all by itself and you have to try to figure out what led to it.

I think there is some special log handling set up to add a session ID to all of these, so that you can at least see what session it was a part of. And maybe it's all running on one machine, so the timestamp means something. But with tracing, from anywhere, as long as this is the current span and you don't change it, you can add stuff to it and it all winds up in the same place and you can see it together.

That's why I can get the cost by user and stuff like that.
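
That "add stuff to the current span" pattern is a one-liner anywhere in the code. The attribute names here are hypothetical, but this is the mechanism that makes cost-by-user style queries possible later.

```python
# Enrich whatever span is currently active; the attributes land on it and show up in the trace.
from opentelemetry import trace

def record_session_context(user_id: str, team_id: str) -> None:
    span = trace.get_current_span()
    span.set_attribute("app.user_id", user_id)
    span.set_attribute("app.team_id", team_id)
```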

Reed: That's great. That's great.

Jessica: Yeah, so it's a little bit of code. I have added, I don't know, probably 30 lines of code to this app to get the attributes and spans that I want, to describe the steps that it's going through: the tool calls, the LLM calls.

That part it just did. Yeah. And this is an essential part of getting software to describe itself: really thinking about how it's telling me the story. This was always essential with deterministic software, especially when you've got microservices and an incoming request that goes through six different apps; they all show up in the trace.

It ties them together. But it's even more essential when AI is involved. Also, it's easier, because I type... I do this, I type "with" and... oh, it's not gonna do it now. Oh, there it goes. Yeah, tracer, start a span. And then if I do this, it'll probably suggest set attributes. Yep, it wants to set attributes for me.

You can use the AI to add the instrumentation, so it's easier than ever to make your life easier.

Reed: That's great.

Jessica: Yeah, and I do recommend that. This shows you how an AI agent operates, and it can teach you that. And then for your features that have LLMs in them, it's even more important. So this is a board that describes Query Assistant, which is: if I go to Query and I wanna use this, can you show me...

I'm in MCP... oh, I better not take that. Let's see: what are the most common errors? Then Honeycomb tries... "false". "False" is the most common error. No, that did not answer my question.

Reed: Always level my grammar. Yeah.

Yeah. That's great. Great.

This is the meta piece, where Query Assistant is your product within Honeycomb that uses the LLM, and then this Query Assistant performance-and-cost board is your ability to use Honeycomb to inspect the different performance and cost tracing that you have available. Okay, great.

Jessica: Yeah. Here I'm in our dogfood environment, UI Dogfood, which is the Honeycomb that we use to observe Honeycomb. It's a little slower than production. But yeah, I can see who's using it, and then somewhere in here that feedback comes through. Common errors are actually context canceled.

Top questions and error rate, 3%, 5%. That's okay; that's not super huge. But that feedback came in somewhere. Oh, did it answer their question? No. Mostly no. A fun part about...

Reed: Hey, one third, 25 out of that... one third is still pretty good for something that you didn't have to put much effort into as an end user.

Jessica: 29%, yeah. I think we could improve it. So Query Assistant is useful; it integrates with an LLM. If you want to, you can say, okay, let's look at a trace. Where did someone not get their question answered? And you can see both what came in and... okay, so here's their feedback answer, there's a page load.

Oh, this one isn't gonna... and a query run. Oh, they did another one. Okay, that came in a different trace; it came in with their session as a whole trace. There is a way to correlate those, but I don't remember what it is. That's okay. Back to the board.

Reed: And out of curiosity, which LLM or LLMs power the query functionality?

Jessica: Oh, let's find out. Let's go look at one of Query Assistant's traces. Any of these will do, but dogfood is a little slower than prod, so these things take longer to load. We deliberately keep it small so that if our app gets slower, we feel it first, right? Oh, there we go. I already have the fields up.

We're using GPT-4o mini, which I think we could do better.

Reed: Really

Jessica: Maybe we should; maybe a better model would be good. Which brings us to LLM evaluation. With Query Assistant, we have the luxury that at least we can say whether a query was definitely bad, because if the LLM returns a query definition and it doesn't run, it was bad. If it runs, it's at least plausible.
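
A sketch of that "definitely bad" check: try to run the generated query and record the outcome as telemetry. `run_query` below is a hypothetical stand-in for whatever actually executes the query, and the attribute name is illustrative.

```python
# Execution-based evaluation sketch: a generated query that fails to run is definitely bad.
from opentelemetry import trace

def run_query(query_definition: dict) -> None:
    # Hypothetical stand-in for executing the query; raises if the definition is invalid.
    if "calculations" not in query_definition:
        raise ValueError("query definition has no calculations")

def evaluate_generated_query(query_definition: dict) -> bool:
    span = trace.get_current_span()
    try:
        run_query(query_definition)
        span.set_attribute("query_assistant.query_ran", True)
        return True
    except Exception as err:
        span.set_attribute("query_assistant.query_ran", False)
        span.record_exception(err)
        return False
```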

Okay, so the most common error is "false", but yes, it's better than nothing. When you have a chat feature or something, what even is good? So here, this is a toy app that DevRel runs, and you can go to the quiz at honeydemo.io and put your Honeycomb API key in after you create a free account.

And then this is the fastest way to get data into a free Honeycomb account. That's why we wrote it. It's not data for your app, but it is data that you had some involvement in creating. Because it's gonna describe my interaction with this quiz and it's gonna ask me some irritating questions. How does your software tell you what is happening?

Telemetry, duh. And then this goes off to ChatGPT. Zero? Are you kidding me? I thought I hard-coded "you get one point if you said telemetry." I need to change that.

Reed: I need a better model underneath the hood.

Jessica: I need to do something with this one. Sad face. Definitely sad face. Next question, and so on. But something like this is very arbitrary.

What is good? What is a good response, when its job is to grade this? 95? Are you serious? "Donuts" got me a 95.

Reed: I'm very curious what the prompt is now for that submission.

Jessica: Okay, let's find out. See current trace. So this clicks in; I've got some weird test-and-present front-end instrumentation here.

But one of these backend things is gonna link to an LLM call. There we go. So here's our LLM call, and here's the full prompt. If I go over here, I can see it a little better. There's the report. Here it is, there's the output: it said 95. Absolutely. Let me add the prompt to that, plus... and here's the whole prompt.

And then a tricky bit. Woo. I have it do a bunch of examples in this prompt. And I tried that same answer earlier and it gave me an 80, which is a little more reasonable. But so how do you know, if you're gonna change the prompt, or you're gonna change the model,

whether it's good? I cannot define whether this is good. We ask the user, but they just aren't gonna approve anything that gives 'em a high score, in my observation. So Honeycomb is great at displaying this data, and we can group by anything. I can group by trace ID if I want to, or user ID, or which question they were asked, or anything like that.

We're really good at cardinality and graphing and holding detailed data, but we are not good at sentiment analysis, or "can you tell whether this answer is BS, or does it actually reflect what was in the prompt?" So I'm gonna flip over to a different board. We have an integration with Deepchecks, and Deepchecks is an actual LLM evaluation company.

And on this board, the quiz LLM evaluation board, I can see some questions, some answers. This is from a recent conference: things people really said in order to get a Lego. Somebody definitely used ChatGPT to talk to ChatGPT here. And you can see what they really said, and also how many answers were evaluated.

The score that the GPT gave it, and then our evaluations here; mostly they thought they were bad. Oh, but now I have input and output sentiment, and you can go see which ones were the most negative. Interesting. The most negative one was... five. "They mentioned using logs and UI services, but their answer is vague and lacks detail."

Now, the good part is this is actually part of a longer chain, and that response did not go directly to a user. And hey, it gave them more points than it gave me. But what that comes from is "grounded in context". Grounded in context is, basically, was it BS? There's good ones and there's bad ones. And we can go look at a trace.

And I can see what that was about. This is one of those more complicated ones. On the first question, scoring the answer has a whole bunch of different LLM interactions, and you can see what happens in the trace. But then when I look at the evaluation results... Deepchecks. No, it all says Deepchecks.

I'm looking for the link. There we go. We can click into one of them and we get to my Deepchecks dashboard, in this case. And it didn't take me to the one I wanted, but it did take me to some. When we send to Deepchecks, we send it what they typed, we send it our full prompt, and then we get back what the LLM came back with, and Deepchecks does some scoring.

Was it complete? This is a custom property, observability-ness. What? Why did that get a four? This custom property is supposed to ask an LLM whether this had anything to do with observability. No, it did not. Failure.

Reed: We're not at AGI yet.

Jessica: Right? Donuts really do form an O.

Reed: It is using 3.5, so maybe that's part of the explanation there.

Jessica: Oh, wow. Yeah. We have an even older model. And in Deepchecks you can upgrade your model, run some tests through, build up an evaluation set, and then it can show you the difference between the results for the same inputs with different models or different prompts. So this is the real thing for the black box, for improving the output of the black box.

Can recommend. And then in Honeycomb you can do things like... oh, I had it up, where was it? You can do things like: grounded in context, what's the distribution of that?

And then I could go in here and be like, okay, what's different about the ones that are really good? And it can go look at other things. Yeah, there's a bunch of things, but I like the combination: Deepchecks comes up with a lot of numbers, and I can graph them in Honeycomb. I could graph them by, for instance...

I want a span... no. Okay, I'd have to look at the trace, get the span that has the LLM input, and then I could group it by that and see, for each dot, what people typed.

Or what their name was or what conference they were at. Yeah.

Reed: That's awesome.

Jessica: Yeah, for LLM evaluation, I recommend integrating it with your observability. But the first thing I would do is instrument with OpenTelemetry and get this data.

Reed: Okay, great. Yeah, I was gonna ask, stepping back, what do you think are the major takeaways? I think you just started to touch on that, but is there anything else you'd leave the audience with in terms of how they should think about this?

Jessica: The big thing is OpenTelemetry, because once you instrument with that, you can choose your observability provider. You can switch between them; you're not stuck. And a lot of the AI libraries are already instrumented with OpenTelemetry because they're not that old. But then use it not just to find out whether your software is throwing errors.

Look at what people are doing. Look at the business value exhibited in your working software, based on what the deterministic parts of your software can say is going on. And then also use it to find out what the heck your LLMs are doing in production, for real.

Reed: Which is a fun exploration for everyone as we've seen in this demo.

Thank you so much, Jessica. I learned a ton, and I'm sure our viewers did as well. How should people find you, or find more information about Honeycomb, if they'd like to learn more coming out of this?

Jessica: Oh, thanks. I'm Jessitron online, J-E-S-S-I-T-R-O-N. So jessitron.com or jess@honeycomb.io

If you wanna send me an email. If you go to honeycomb.io, we have a blog with a bunch of articles about LLMs. A few of them are even mine. Nice. Nice.

Reed: And that's great, and thank you so much, Jessica. For context, we've been doing this whole series on agent readiness: how to make sure agents can use applications.

We've talked with different people about building remote MCP servers and the OAuth flows that go into this. Very happy that we could have you walk through the observability pieces, 'cause it's such a crucial foundational element in order for us to make sure that these agents are behaving the way we want them to, and that our MCP servers, we're iterating on them the way that we should.

And so really appreciate you making that so concrete for everyone. So thank you Jessica.

Jessica: Thank you, Reed. And thank you for making this course, and also for building Stytch.

Reed: Thank you. I appreciate that as well. We'll keep going back on the thank you round robin.

Jessica: This is the part where the subscribe link comes up.

Reed: Yes. Yeah. And...

Jessica: The next video.

Reed: Yeah, definitely. This is great; this has really been fantastic. I really love the demo that you put together, and I'm glad I was fresh eyes seeing it, 'cause I feel like there were things where I was like, oh, I hope we go set that up in Honeycomb coming out of it.

Jessica: Thank you.
