<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:thoughtbot="https://thoughtbot.com/feeds/" xmlns:feedpress="https://feed.press/xmlns" xmlns:media="http://search.yahoo.com/mrss/" xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <feedpress:locale>en</feedpress:locale>
  <link rel="hub" href="https://feedpress.superfeedr.com/"/>
  <title>Giant Robots Smashing Into Other Giant Robots</title>
  <subtitle>Written by thoughtbot, your expert partner for design and development.
</subtitle>
  <id>https://robots.thoughtbot.com/</id>
  <link href="https://thoughtbot.com/blog"/>
  <link href="https://feed.thoughtbot.com/" rel="self"/>
  <updated>2026-06-03T00:00:00+00:00</updated>
  <author>
    <name>thoughtbot</name>
  </author>
  <entry>
    <title>Copy as Markdown: AI-friendly blog posts</title>
    <link rel="alternate" href="https://feed.thoughtbot.com/link/24077/17352975/copy-as-markdown-ai-friendly-blog-posts"/>
    <author>
      <name>Jared Turner</name>
    </author>
    <id>https://thoughtbot.com/blog/copy-as-markdown-ai-friendly-blog-posts</id>
    <published>2026-06-03T00:00:00+00:00</published>
    <updated>2026-06-02T14:48:08Z</updated>
    <content type="html"><![CDATA[<p>Our blog posts now have the option to <code>Copy as Markdown</code> to help our robotic friends more easily consume our content.</p>

<p>(It’s just up there, below the title)</p>

<p>Click the button and you will get this blog post copied to your clipboard in cleanly formatted <a href="https://www.markdownguide.org/getting-started/">Markdown</a>. Then paste it into any prompt to give your AI the context it needs.</p>
<h2 id="speaking-their-language">
  
    Speaking their language
  
</h2>

<p>Markdown is the lingua franca of LLMs, and giving them the ability to read Markdown simplifies their job (and uses fewer tokens) compared to parsing HTML directly.</p>

<p>We’ve done a few things to make their lives easier:</p>

<ul>
<li>The <code>Copy as Markdown</code> button - this is mostly for us humans, to more easily pass context to the AI</li>
<li>The Markdown version of any blog post is available by appending <a href="https://thoughtbot.com/blog/copy-as-markdown-ai-friendly-blog-posts.md">.md</a> to the URL</li>
<li>A hint in the <code>&lt;head&gt;</code> of each post lets requesters know the Markdown alternative is available</li>
</ul>
<div class="highlight"><pre class="highlight html"><code><span class="nt">&lt;link</span> <span class="na">rel=</span><span class="s">"alternate"</span> <span class="na">type=</span><span class="s">"text/markdown"</span> <span class="na">href=</span><span class="s">"https://thoughtbot.com/blog/copy-as-markdown-ai-friendly-blog-posts.md"</span><span class="nt">&gt;</span>
</code></pre></div>
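<p>For scripts that want the Markdown version directly, the URL rule above can be sketched in a few lines of Ruby. This is a minimal sketch, not thoughtbot's implementation: <code>markdown_url</code> is a hypothetical helper, and dropping the query string and fragment is an assumption made to keep the example small.</p>

```ruby
require "uri"

# Hypothetical helper: derive the Markdown URL for a blog post by appending
# ".md" to the path, as the post describes. Dropping the query string and
# fragment is a simplifying assumption.
def markdown_url(post_url)
  uri = URI.parse(post_url)
  path = uri.path.chomp("/") # tolerate a trailing slash
  raise ArgumentError, "not a post URL: #{post_url}" if path.empty?
  return post_url if path.end_with?(".md")

  uri.path = "#{path}.md"
  uri.query = nil
  uri.fragment = nil
  uri.to_s
end

puts markdown_url("https://thoughtbot.com/blog/copy-as-markdown-ai-friendly-blog-posts")
# prints https://thoughtbot.com/blog/copy-as-markdown-ai-friendly-blog-posts.md
```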
<p>That’s it. No setup, no plugin, no incantation. Just click, paste, and happy contexting.</p>

<aside class="related-articles"><h2>If you enjoyed this post, you might also like:</h2>
<ul>
<li><a href="https://thoughtbot.com/blog/introducing-copycopter">Introducing Copycopter: let your clients do the copy writing</a></li>
<li><a href="https://thoughtbot.com/blog/copycopter-wysiwyg">Copycopter: Introducing a Simpler Way to Edit Copy</a></li>
<li><a href="https://thoughtbot.com/blog/human-centered-type">Human-Centered Typography</a></li>
</ul></aside>
<img src="https://feed.thoughtbot.com/link/24077/17352975.gif" height="1" width="1"/>]]></content>
    <summary>Our blog posts can now be copied as Markdown, so you can hand them to your favourite AI without the HTML cruft. Click, paste, and happy contexting.
</summary>
    <thoughtbot:auto_social_share>true</thoughtbot:auto_social_share>
  </entry>
  <entry>
    <title>The Bike Shed Ep 501: What makes for good technical writing?</title>
    <link rel="alternate" href="https://feed.thoughtbot.com/link/24077/17352698/the-bike-shed-ep-501-what-makes-for-good-technical-writing"/>
    <author>
      <name>Joël Quenneville and Sally Hall</name>
    </author>
    <id>https://thoughtbot.com/blog/the-bike-shed-ep-501-what-makes-for-good-technical-writing</id>
    <published>2026-06-02T00:00:00+00:00</published>
    <updated>2026-06-02T14:21:12Z</updated>
    <content type="html"><![CDATA[Sally and Joël get technical as they lay out their thoughts on blog posts.<img src="https://feed.thoughtbot.com/link/24077/17352698.gif" height="1" width="1"/>]]></content>
    <summary>Sally and Joël get technical as they lay out their thoughts on blog posts.</summary>
    <thoughtbot:auto_social_share>false</thoughtbot:auto_social_share>
  </entry>
  <entry>
    <title>The Four Signals of AI Observability</title>
    <link rel="alternate" href="https://feed.thoughtbot.com/link/24077/17351768/the-four-signals-of-ai-observability"/>
    <author>
      <name>Matheus Sales</name>
    </author>
    <id>https://thoughtbot.com/blog/the-four-signals-of-ai-observability</id>
    <published>2026-06-01T00:00:00+00:00</published>
    <updated>2026-05-29T18:05:52Z</updated>
    <content type="html"><![CDATA[<p>A few months ago we shipped a chat experience to production. Users ask a
question, our app routes it to an LLM, the model calls a few internal
tools, and an answer comes back.</p>

<p>It worked. Sort of.</p>

<p>When the model answered well, we had no idea why. When it answered badly, we had
no idea either. The model was a black box attached to our app, and our best
debugging tool was reading logs and guessing.</p>

<p>We realized our app could not answer a very normal operational question:</p>

<blockquote>
<p>Show us every chat where the user said the answer was bad, group them by which
version of the system prompt was loaded, and let us read the whole
conversation, including which tools the model called.</p>
</blockquote>

<p>It’s the AI equivalent of “show me every 500 error on this endpoint after deploy X.”
But our app couldn’t answer it.</p>

<p>That was the trigger to stop looking for a smarter model and start looking to
add an observability layer. We ended up using <a href="https://langfuse.com/">Langfuse</a>, but the specific vendor
matters less than the capabilities. Helicone, Arize Phoenix, LangSmith, and
Braintrust all solve versions of the same problem.</p>

<p>After a couple of months of iteration, we noticed that the things we needed came
in four flavors. I call them the four signals that every AI feature needs to
emit about itself.</p>

<ol>
<li>
<strong>A version on every prompt.</strong> Which exact words did the model see today?</li>
<li>
<strong>A trace shaped like the actual work.</strong> What did it call, in what order,
with what arguments?</li>
<li>
<strong>A score from the user.</strong> Did the human like the result?</li>
<li>
<strong>A score from another model.</strong> When the human is quiet, who is grading?</li>
</ol>

<p>Of course we can build an AI feature without all four. We just can’t improve it on purpose.</p>
<h2 id="a-version-on-every-prompt">
  
    A version on every prompt
  
</h2>

<p>The first thing we did was move every prompt out of the code and into a
versioned store the app fetches at runtime.</p>
<div class="highlight"><pre class="highlight ruby"><code><span class="c1"># The code never references a version. It asks for a label.</span>
<span class="n">template</span> <span class="o">=</span> <span class="no">PromptRepo</span><span class="p">.</span><span class="nf">compile</span><span class="p">(</span><span class="ss">name: </span><span class="s2">"classify_question"</span><span class="p">,</span> <span class="ss">label: </span><span class="s2">"production"</span><span class="p">)</span>

<span class="c1"># A human moves "production" between versions in the Langfuse UI.</span>
<span class="c1"># Promotion is a click. Rollback is a click. No deploy.</span>
</code></pre></div>
<p>The first time we rolled back a bad prompt by clicking a button instead of reverting a PR and waiting for CI, we knew this was the right shape.</p>

<p>Once prompts became content, the people closest to the problem became the people writing the prompts.
The feedback loop got much shorter, and the quality went up.</p>
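<p>To make the label idea concrete, here is a minimal in-memory sketch of a versioned prompt store. It is an illustration only: <code>InMemoryPromptStore</code> and its method names are hypothetical, and a real store such as Langfuse persists versions and labels remotely rather than in a Ruby hash.</p>

```ruby
# Hypothetical in-memory stand-in for a versioned prompt store.
class InMemoryPromptStore
  Prompt = Struct.new(:name, :version, :text)

  def initialize
    @versions = Hash.new { |hash, name| hash[name] = [] }
    @labels = {}
  end

  # Saving appends a new version; earlier versions are never overwritten.
  def save(name, text)
    prompt = Prompt.new(name, @versions[name].size + 1, text)
    @versions[name].push(prompt)
    prompt
  end

  # Promotion and rollback are the same operation: point a label at a version.
  def move_label(name, label, version)
    unless version.between?(1, @versions[name].size)
      raise ArgumentError, "unknown version #{version} for #{name}"
    end

    @labels[[name, label]] = version
  end

  # Callers ask for a label, never a version number.
  def fetch(name, label:)
    version = @labels.fetch([name, label])
    @versions[name][version - 1]
  end
end

store = InMemoryPromptStore.new
store.save("classify_question", "Classify the question: {{question}}")
store.save("classify_question", "Classify the question into one category: {{question}}")

store.move_label("classify_question", "production", 2) # promote v2
store.move_label("classify_question", "production", 1) # roll back to v1
puts store.fetch("classify_question", label: "production").version # prints 1
```

<p>The application code only ever calls <code>fetch</code> with a label, so moving the label changes behavior without a deploy.</p>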
<h2 id="a-trace-shaped-like-the-actual-work">
  
    A trace shaped like the actual work
  
</h2>

<p>A chat is not a single call. It is a small program. Classify the question, load the
right prompt, call a tool or two, then compose an answer.</p>

<p>If your trace is one row, you only know that something happened. If it is a tree of
calls, you know what actually happened, and you have a database of decisions the model made.</p>
<div class="highlight"><pre class="highlight plaintext"><code># Before: one log line, no shape
[INFO] chat_completed user_id=123 duration_ms=4200 tokens=1840

# After: a tree of decisions
trace: "chat"
  span:       load-prompt                  (version=production:v12)
  generation: classify-question            (model=haiku, category="billing")
  generation: compose-answer
    span:       tool-call.lookup_invoice   (200ms)
    span:       tool-call.lookup_customer  (180ms)
  generation: final-response               (model=sonnet, 1.2k tokens)
</code></pre></div>
<p>Each node carries the prompt name and version, the model id, token usage, and a
set of metadata fields we control: the customer it ran for, the category the
question was classified as, which tools ran, and whether the conversation was new.</p>

<p>That metadata is the part that turned out to matter most.</p>

<p>The first time we filtered traces to “every chat in scope X where a
particular tool ran and the user said the answer was bad”, we had a small
realization. The trace list was not a log anymore. It was a queryable database
of decisions the model made.</p>

<p>The rule we would write on a sticky note: <strong>tag your traces with the dimensions
you will want to filter on later</strong>. Tagging is cheap up front and impossible to
add retroactively to traces you have already recorded.</p>
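<p>The difference between a log line and a queryable tree can be sketched with plain Ruby structs. This is an assumption-laden illustration: <code>Node</code>, <code>Trace</code>, and <code>chat_trace</code> are hypothetical names, and the node names simply mirror the example tree above rather than any vendor's schema.</p>

```ruby
# Hypothetical trace model: a tree of nodes plus trace-level metadata tags.
Node = Struct.new(:kind, :name, :children) do
  # This node followed by all of its descendants, depth first.
  def flatten_tree
    [self] + children.flat_map { |child| child.flatten_tree }
  end
end

Trace = Struct.new(:id, :metadata, :root) do
  # Names of every tool call anywhere in the tree.
  def tools
    root.flatten_tree
        .select { |node| node.name.start_with?("tool-call.") }
        .map { |node| node.name.delete_prefix("tool-call.") }
  end
end

# Builds a chat trace shaped like the example tree in this post.
def chat_trace(id, category:, tools:)
  tool_spans = tools.map { |tool| Node.new(:span, "tool-call.#{tool}", []) }
  root = Node.new(:trace, "chat", [
    Node.new(:span, "load-prompt", []),
    Node.new(:generation, "classify-question", []),
    Node.new(:generation, "compose-answer", tool_spans),
    Node.new(:generation, "final-response", [])
  ])
  Trace.new(id, { category: category }, root)
end

traces = [
  chat_trace("t1", category: "billing", tools: ["lookup_invoice", "lookup_customer"]),
  chat_trace("t2", category: "shipping", tools: []),
  chat_trace("t3", category: "billing", tools: ["lookup_customer"])
]

# Filter on a tag we chose up front and on what actually ran inside the tree.
matches = traces
  .select { |trace| trace.metadata[:category] == "billing" }
  .select { |trace| trace.tools.include?("lookup_invoice") }

puts matches.map { |trace| trace.id }.inspect # prints ["t1"]
```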
<h2 id="a-score-from-the-user">
  
    A score from the user
  
</h2>

<p>Every assistant message in the UI has a thumbs up and a thumbs down. When a user
clicks one, we save a row and post it back to the observability tool as a score
on the trace.</p>

<p>A thumbs-down on its own isn’t actionable. A thumbs-down attached to a trace tells
you what the model saw, what it called, which prompt version produced it, and what category the
question fell into. Now you can ask: are downvotes concentrated in one category? On one prompt version? After one specific tool call?</p>

<p>You should review downvoted traces. It takes time, and sometimes they’re noise: the user wanted something we don’t support,
or hit thumbs-down by accident. But maybe one in ten is a real signal, and that’s the one that turns into a prompt change,
a new tool, or a bug fix.</p>

<p>The point of all this plumbing is one new query.</p>

<blockquote>
<p>Show us every trace a user labeled bad.</p>
</blockquote>

<p>Once you can run that query and read the entire conversation that produced it
(prompt version, tool calls, model, latency, everything), you stop the guessing
game.</p>
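<p>As a rough sketch of that query, assume each trace record carries its prompt version and a list of scores. The record shape, the <code>user_feedback</code> score name, and the 1/0 encoding for thumbs up and down are all assumptions for illustration, not the observability tool's actual schema.</p>

```ruby
# Hypothetical records; field names are illustrative, not a vendor schema.
TraceRecord = Struct.new(:id, :prompt_version, :category, :scores)

traces = [
  TraceRecord.new("t1", "v11", "billing", []),
  TraceRecord.new("t2", "v12", "billing", []),
  TraceRecord.new("t3", "v12", "shipping", []),
  TraceRecord.new("t4", "v12", "billing", [])
]

# Called when a user clicks thumbs up (1) or thumbs down (0).
def record_feedback(traces, trace_id, value)
  trace = traces.find { |candidate| candidate.id == trace_id }
  raise KeyError, "unknown trace #{trace_id}" unless trace

  trace.scores.push({ name: "user_feedback", value: value })
end

# "Show us every trace a user labeled bad", grouped by prompt version.
def downvoted_by_version(traces)
  traces
    .select { |trace| trace.scores.include?({ name: "user_feedback", value: 0 }) }
    .group_by { |trace| trace.prompt_version }
    .transform_values { |group| group.map { |trace| trace.id } }
end

record_feedback(traces, "t1", 1)
record_feedback(traces, "t2", 0)
record_feedback(traces, "t4", 0)

puts downvoted_by_version(traces).inspect
```

<p>With the sample data, both downvotes land on <code>v12</code>, which is the kind of concentration the questions above are looking for.</p>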
<h2 id="a-score-from-another-model">
  
    A score from another model
  
</h2>

<p>Human feedback is useful but rare. Most users do not click anything.</p>

<p>So we added a second model to grade the first one. A background job pulls
finished chats, runs them through a separate “judge” prompt (versioned and
labeled in the same store as the production prompts), and writes the result
back as a score on the same trace.</p>

<p>Now the trace carries two streams of judgment. When the user and the judge
agree, our judge is in sync with real users. When they disagree, that is the
most interesting trace in the system. Either way, the judge runs on every chat,
so a regression shows up the same day we ship the prompt that caused it, not a
week later when somebody complains.</p>

<p>Our judge scores things like factuality, instruction-following, completeness,
hallucination, and whether the assistant actually used the right internal context.</p>

<p>We underestimated this one. A judge that catches a regression before it ships
is worth more than a faster or smarter model. It is the only signal that scales
when nobody is clicking thumbs.</p>

<p>The lesson we had to learn: the judge is just a prompt. It can be wrong. It
needs versioning and a Playground and a rollback button, exactly like a
user-facing prompt.</p>
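<p>The judge loop can be sketched as a small background job. This is a minimal sketch under stated assumptions: <code>run_judge</code>, <code>disagreements</code>, and the <code>Chat</code> struct are hypothetical, and the judge is a stub lambda so the example runs without a model. A real judge would call an LLM with a versioned judge prompt.</p>

```ruby
# Hypothetical chat record; scores are keyed by who produced them.
Chat = Struct.new(:id, :question, :answer, :scores)

# Stand-in judge (an assumption): fails answers that admit they are guessing.
STUB_JUDGE = lambda do |chat|
  chat.answer.include?("I think") ? 0 : 1
end

# Background job body: grade every finished chat that has no judge score yet,
# and record which judge version produced the grade.
def run_judge(chats, judge:, judge_version:)
  chats.each do |chat|
    next if chat.scores.key?(:judge)

    chat.scores[:judge] = { value: judge.call(chat), version: judge_version }
  end
end

# Chats where the user voted and the judge disagreed.
def disagreements(chats)
  chats.select do |chat|
    user = chat.scores[:user]
    judge = chat.scores[:judge]
    next false if user.nil? || judge.nil?

    user[:value] != judge[:value]
  end
end

chats = [
  Chat.new("c1", "Where is my invoice?", "It is under Billing.", { user: { value: 1 } }),
  Chat.new("c2", "Can I get a refund?", "I think refunds take 5 days.", { user: { value: 1 } }),
  Chat.new("c3", "How do I log in?", "Use the login page.", {})
]

run_judge(chats, judge: STUB_JUDGE, judge_version: "judge:v3")
puts disagreements(chats).map { |chat| chat.id }.inspect # prints ["c2"]
```

<p>Storing the judge version next to each score is what lets a bad judge prompt be spotted and rolled back like any other prompt.</p>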

<figure>
  <img src="https://images.thoughtbot.com/8exq7pktql71m95hlwd2jd0m457q_diagram.png" alt="A diagram showing the four signals of AI observability: prompt version, trace, user score, and judge score.">
  <figcaption style="text-align:center;">
    Each signal writes back to the same trace. That’s the whole trick.
  </figcaption>
</figure>
<h2 id="four-signals-one-idea">
  
    Four signals, one idea
  
</h2>

<p>The four signals overlap, and that’s on purpose. The prompt version shows up on
the trace. The user score attaches to the trace. The judge score attaches to
the trace too. They are not really four separate things. They are the same idea
viewed from four different angles.</p>

<p><strong>Make the AI feature observable, then you can change it on purpose.</strong></p>

<p>For a while I treated AI features like a different category of software: less debuggable,
less testable, less under our control. That was a mistake. An AI feature is software. It has inputs, makes decisions, produces outputs,
and can be observed like anything else.</p>

<p>What changes once you have the four signals isn’t that the model gets smarter. It’s that you stop hoping. You ship a prompt
change knowing the judge will tell you if it regressed. You read a downvote knowing you can replay the exact conversation
that produced it. You promote a new prompt to production knowing you can roll it back in one click if it breaks.</p>

<p>The model is the engine. The observability layer is the dashboard. You can drive without one. You just can’t drive on purpose.</p>

<aside class="related-articles"><h2>If you enjoyed this post, you might also like:</h2>
<ul>
<li><a href="https://thoughtbot.com/blog/how-to-use-chatgpt-to-find-custom-software-consultants">How to Use ChatGPT to Find Custom Software Consultants</a></li>
<li><a href="https://thoughtbot.com/blog/using-machine-learning-to-answer-questions-from-internal-documentation">Using Machine Learning to Answer Questions from Internal Documentation</a></li>
<li><a href="https://thoughtbot.com/blog/priority-determines-product">Priority Determines Product</a></li>
</ul></aside>
<img src="https://feed.thoughtbot.com/link/24077/17351768.gif" height="1" width="1"/>]]></content>
    <summary>Treat your AI feature like software you can watch, not a model you hope works.</summary>
    <thoughtbot:auto_social_share>true</thoughtbot:auto_social_share>
  </entry>
</feed>
