Maker Forem

CSS Color Contrast: The WCAG Rules Every Developer Should Know

Snappy Tools — Mon, 11 May 2026 10:08:09 +0000

Color contrast is one of the most commonly overlooked accessibility requirements in web development — and one of the easiest to fail accidentally. You pick a beautiful grey text on a white background, it looks fine on your calibrated monitor, and then someone with low vision or a washed-out phone screen can't read it at all.

This post covers how contrast ratios work, what WCAG requires, and how to check your colors before they ship.

What Is a Contrast Ratio?

The contrast ratio between two colors is a number from 1:1 (no contrast — same color) to 21:1 (maximum contrast — black on white). It is calculated from the relative luminance of each color, which is a perceptual measure of how bright a color appears to the human eye.

The formula is:

contrast ratio = (L1 + 0.05) / (L2 + 0.05)

where L1 is the relative luminance of the lighter color and L2 is that of the darker one. Luminance itself is computed from the RGB values after converting each sRGB channel to linear light, that is, undoing the gamma encoding (which is why the math is not as simple as just comparing hex codes).
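As a rough illustration of the formula, here is a minimal Python sketch that follows the WCAG 2.x definition of relative luminance. The function names are my own, and it only handles six-digit hex codes.

```python
def linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel (0-255) to linear light."""
    c = channel / 255
    # WCAG 2.x prints this threshold as 0.03928; the sRGB spec uses 0.04045.
    # No 8-bit value falls between the two, so the result is identical.
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a six-digit hex color such as '#1a2b3c'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)


def contrast_ratio(color_a: str, color_b: str) -> float:
    """Contrast ratio between two colors, in either order."""
    la, lb = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


print(round(contrast_ratio("#000000", "#ffffff"), 2))  # 21.0
print(round(contrast_ratio("#ffffff", "#ffffff"), 2))  # 1.0
```

The green channel carries the largest weight (0.7152) because the eye is most sensitive to green, which is why two colors with similar hex values can still differ noticeably in luminance.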

Fortunately, you do not need to calculate this by hand. SnappyTools has a free Color Contrast Checker that computes the ratio instantly and shows you exactly which WCAG levels you pass.

WCAG 2.1 Requirements

The Web Content Accessibility Guidelines define three conformance levels — A, AA, and AAA — and contrast requirements that differ by text size:

Element	AA minimum	AAA enhanced
Normal text (< 18pt or < 14pt bold)	4.5:1	7:1
Large text (≥ 18pt or ≥ 14pt bold)	3:1	4.5:1
UI components and graphics	3:1	—

Level AA is the level that accessibility laws and procurement standards most commonly point to (for example, EN 301 549 in the EU and the web accessibility rules issued under the ADA in the US both reference WCAG at Level AA). Level AAA is aspirational; aim for it on body text where possible.

Common Failures (And How to Spot Them)

Light grey on white

/* Fails AA: ratio ~4.48:1 */
color: #777777;
background: #ffffff;

The classic mistake. #777777 on white sits just under the 4.5:1 AA threshold and fails it narrowly. One shade darker, #767676 (~4.54:1), is the lightest grey that passes, with no margin to spare. Use #595959 (~7:1) or darker for reliable compliance.

Brand colors with white text

Bright brand colors frequently fail:

#FF6B6B (coral) on white: ~2.8:1, fails AA for both normal and large text
#FFD700 (gold) on white: ~1.4:1, fails everything

If your brand color fails contrast on white, either darken the shade for UI text or use dark text on the brand color instead.

Placeholder text

CSS ::placeholder styling inherits from the input but is typically rendered at reduced opacity. The effective contrast of opacity: 0.5 on a #555 placeholder over white is much lower than #555 measured directly. Check your placeholder styling explicitly.
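To see how much opacity costs, here is a small Python sketch that alpha-blends the placeholder color over the background and then measures the contrast of the result. It assumes a simple per-channel blend of 8-bit sRGB values over an opaque background; the helper names are illustrative.

```python
def linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel (0-255) to linear light."""
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def luminance(rgb) -> float:
    r, g, b = rgb
    return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)


def contrast(rgb_a, rgb_b) -> float:
    la, lb = luminance(rgb_a), luminance(rgb_b)
    return (max(la, lb) + 0.05) / (min(la, lb) + 0.05)


def blend_over(fg, bg, alpha: float):
    """Composite a semi-transparent fg over an opaque bg, channel by channel."""
    return tuple(round(alpha * f + (1 - alpha) * b) for f, b in zip(fg, bg))


white = (255, 255, 255)
placeholder = (0x55, 0x55, 0x55)              # #555
effective = blend_over(placeholder, white, 0.5)

print(effective)                                # (170, 170, 170), i.e. #aaaaaa
print(round(contrast(placeholder, white), 2))   # about 7.46
print(round(contrast(effective, white), 2))     # about 2.32
```

A color that clears AAA when measured directly ends up below the 3:1 line once the opacity is applied, so the number to check is the blended color, not the declared one.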

Link underline color

When text-decoration-color is set separately from color, both need adequate contrast. A common pattern that fails: grey underline color on a white background used to subtly de-emphasise links.

Checking Contrast in Your Workflow

There are three points where it's worth checking:

1. During design — before a pixel is written to CSS. Design tools like Figma have contrast plugins (Stark, Contrast) that flag issues while you're still moving colors around.

2. During development — when you translate the design to code. Use a browser-based tool so you can quickly iterate on hex values. The Color Contrast Checker at SnappyTools lets you enter hex codes or use a color picker and shows the ratio and pass/fail for all WCAG levels at once.

3. During audit — before shipping. Chrome DevTools has a contrast ratio indicator in the color picker (accessible when inspecting an element). axe DevTools and Lighthouse both flag contrast failures automatically.

Quick Reference: Ratios That Always Pass

Ratio	Passes
≥ 7:1	AAA normal text, AAA large text
≥ 4.5:1	AA normal text, AAA large text
≥ 3:1	AA large text, AA UI components
< 3:1	Fails everything
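If you want these thresholds in a script or test suite, the table maps onto a few comparisons. This is only a sketch; the function name and the dictionary keys are my own labels, not an official API.

```python
def wcag_results(ratio: float) -> dict:
    """Map a contrast ratio to pass/fail for each row of the reference table."""
    return {
        "AA normal text": ratio >= 4.5,
        "AA large text": ratio >= 3.0,
        "AA UI components": ratio >= 3.0,
        "AAA normal text": ratio >= 7.0,
        "AAA large text": ratio >= 4.5,
    }


for ratio in (7.2, 4.6, 3.1, 2.5):
    passed = [name for name, ok in wcag_results(ratio).items() if ok]
    print(ratio, passed or "fails everything")
```

Compare against the unrounded ratio: a value like 4.48 rounds to 4.5 when displayed but still fails AA for normal text.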

Text Over Images and Gradients

WCAG contrast rules assume a flat background. When text sits over an image or gradient, you need to ensure the minimum contrast is met at the worst-case point in the image.

Common solutions:

Text shadow — text-shadow: 0 1px 4px rgba(0,0,0,0.8) works but can look harsh
Semi-transparent scrim — background: linear-gradient(transparent, rgba(0,0,0,0.6)) from the bottom up
Solid background strip — old-fashioned but reliable

None of these are checkable by automated tools — you need a human eye (or a per-pixel luminance calculation) to verify.
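The per-pixel calculation can be sketched in a few lines of Python. This assumes you have already sampled the background pixels behind the text as 8-bit RGB tuples (for example from a screenshot); the sample values below are made up for illustration.

```python
def linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel (0-255) to linear light."""
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def luminance(rgb) -> float:
    r, g, b = rgb
    return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)


def contrast(rgb_a, rgb_b) -> float:
    la, lb = luminance(rgb_a), luminance(rgb_b)
    return (max(la, lb) + 0.05) / (min(la, lb) + 0.05)


def worst_case_contrast(text_rgb, background_pixels) -> float:
    """Lowest contrast between the text color and any sampled background pixel."""
    return min(contrast(text_rgb, px) for px in background_pixels)


# Hypothetical samples from behind a white headline: dark, bright, mid-tone.
samples = [(20, 20, 20), (200, 200, 200), (90, 120, 160)]
worst = worst_case_contrast((255, 255, 255), samples)
print(round(worst, 2))  # driven by the brightest sample, well below 3:1
```

The average contrast can look healthy while a single bright patch makes part of the text unreadable, which is why the minimum is the number to report.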

One Last Thing: Colour Is Not the Only Signal

WCAG 1.4.1 says you cannot convey information using colour alone. A red/green "error/success" indicator needs a second signal (icon, label, border shape) to be accessible to colour-blind users. This is separate from the contrast ratio — a fully accessible red error state has high contrast and an error icon.

Color contrast takes about 30 seconds to check per color pair. Making it part of your review workflow is one of the highest-ROI accessibility habits you can build.

Use the free Color Contrast Checker — no signup, runs in your browser, shows WCAG AA and AAA results instantly.

Q-Learning for Games: Teaching an Agent Tic-Tac-Toe Through Self-Play

Berkan Sesen — Mon, 11 May 2026 10:07:25 +0000

Tic-tac-toe is a solved game. Any competent adult can force a draw every time. But can an agent figure that out with zero human knowledge? Give two agents a blank board, a few simple rules about wins and losses, and nothing else. No opening theory, no strategy guides, no human games to study. After 100,000 games of fumbling against each other, they discover forks, blocking, and centre-first openings entirely on their own.

This is Q-learning applied to games. In our previous Q-learning post, the agent navigated a frozen lake alone, learning from its own mistakes. Here, we add an opponent. The agent can't just learn the environment; it must learn to outsmart another learner who's improving at the same time.

By the end of this post, you'll build two Q-learning agents that teach each other tic-tac-toe through self-play, and you'll understand why this simple setup discovers remarkably strong strategy.

The Problem: Tic-Tac-Toe as an RL Environment

Tic-tac-toe is the simplest non-trivial two-player game. The board has 9 cells, two players alternate placing X and O, and the first to complete a row, column, or diagonal wins. If all cells are filled with no winner, it's a draw.

As an RL problem:

State: the current board (which cells have X, O, or are empty)
Actions: place your marker on any empty cell
Reward: +1 for winning, -1 for losing, 0 for a draw or an ongoing game
Transition: deterministic (unlike the slippery FrozenLake), but the opponent's move is stochastic from your perspective

The state space is manageable: there are at most $3^9 = 19{,}683$ possible board configurations (fewer in practice, since many are unreachable). This makes tabular Q-learning a perfect fit, with no need for neural network function approximation.
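To put a number on "fewer in practice", here is a minimal sketch that enumerates every board reachable through legal play. It assumes X always moves first and that play stops as soon as someone wins; the helper names are illustrative and separate from the implementation later in this post.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 1 or -1 if that player has three in a line, else 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


def count_reachable(first_player=1):
    """Count distinct boards reachable by legal play (empty board included)."""
    start = (0,) * 9
    seen = {start}
    frontier = [(start, first_player)]
    while frontier:
        board, player = frontier.pop()
        if winner(board) != 0 or 0 not in board:
            continue  # terminal position: nobody moves from here
        for i in range(9):
            if board[i] == 0:
                nxt = board[:i] + (player,) + board[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, -player))
    return len(seen)


print(count_reachable())  # 5478
print(3 ** 9)             # 19683
```

Fewer than a third of the raw configurations can actually occur, which is comfortably within reach of a plain dictionary.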

Quick Win: Self-Play in Action

Let's see two Q-learning agents teach each other from scratch, with play evolving from random moves (early training) to strategic play (late training).

Here's the complete implementation. We need three pieces: an environment, an agent, and a self-play training loop.

import numpy as np
import random

class TicTacToe:
    """Tic-tac-toe environment. Board is a flat array of 9 cells.
    Values: 0=empty, 1=X, -1=O."""

    def __init__(self):
        self.state = np.zeros(9, dtype=int)

    def reset(self):
        self.state = np.zeros(9, dtype=int)
        return self.state.copy()

    def available_actions(self):
        return np.where(self.state == 0)[0]

    def step(self, action, marker):
        self.state[action] = marker
        if self._is_winner():
            return self.state.copy(), 1, True, 'win'
        elif len(self.available_actions()) == 0:
            return self.state.copy(), 0, True, 'draw'
        return self.state.copy(), 0, False, 'ongoing'

    def _is_winner(self):
        b = self.state.reshape(3, 3)
        for i in range(3):
            if abs(b[i].sum()) == 3: return True
            if abs(b[:, i].sum()) == 3: return True
        if abs(np.diag(b).sum()) == 3: return True
        if abs(np.diag(np.fliplr(b)).sum()) == 3: return True
        return False

The agent is a standard Q-learner with one key adaptation: Q-values for occupied cells are set to NaN so the agent never tries to play in a taken position.

class QLearningAgent:
    def __init__(self, marker, epsilon=1.0, lr=1.0,
                 gamma=0.95, final_epsilon=0.05):
        self.marker = marker       # 1 for X, -1 for O
        self.epsilon = epsilon
        self.lr = lr
        self.gamma = gamma
        self.final_epsilon = final_epsilon
        self.q_table = {}          # {tuple(state): np.array(9)}

    def _get_q(self, state):
        key = tuple(state)
        if key not in self.q_table:
            q = np.full(9, np.nan)
            q[state == 0] = 0.0    # only empty cells get Q-values
            self.q_table[key] = q
        return self.q_table[key]

    def pick_action(self, state):
        available = np.where(state == 0)[0]
        if np.random.rand() < self.epsilon:
            return np.random.choice(available)
        q = self._get_q(state)
        available_q = [(a, q[a]) for a in available]
        max_q = max(v for _, v in available_q)
        best = [a for a, v in available_q if v == max_q]
        return random.choice(best)

    def update(self, state, action, reward, next_state, done):
        q = self._get_q(state)
        if done:
            target = reward
        else:
            next_q = self._get_q(next_state)
            target = reward + self.gamma * np.nanmax(next_q)
        q[action] += self.lr * (target - q[action])

Now the self-play training loop. Both agents learn simultaneously, with the loser receiving a -1 reward when the other wins:

env = TicTacToe()
agent_x = QLearningAgent(marker=1, epsilon=1.0, lr=1.0, gamma=0.95)
agent_o = QLearningAgent(marker=-1, epsilon=1.0, lr=1.0, gamma=0.95)
eps_decay = 2.5e-5

for ep in range(100_000):
    state = env.reset()
    agents = [agent_x, agent_o]
    if random.random() < 0.5:
        agents = [agent_o, agent_x]  # randomise who goes first
    turn = 0
    history = []
    done = False

    while not done:
        agent = agents[turn % 2]
        s = state.copy()
        action = agent.pick_action(s)
        next_state, reward, done, info = env.step(action, agent.marker)
        history.append((agent, s, action, reward, next_state, done))

        if done:
            # winner learns from the final move
            agent.update(s, action, reward, next_state, done)
            # loser learns too: propagate -reward to their last move
            if info == 'win' and len(history) >= 2:
                other = agents[(turn + 1) % 2]
                prev = history[-2]
                other.update(prev[1], prev[2], -reward, next_state, True)
        else:
            agent.update(s, action, reward, next_state, done)

        state = next_state
        turn += 1

    # decay epsilon for both agents
    for a in [agent_x, agent_o]:
        if a.epsilon > a.final_epsilon:
            a.epsilon -= eps_decay

After training, both agents win around 85% of games against a random opponent (85% for X, 84% for O).

You just trained two agents to play tic-tac-toe without teaching them a single strategy. Let's understand how.

What Just Happened?

The Board as State, Cells as Actions

The environment represents the board as a flat array of 9 integers: 1 for X, -1 for O, 0 for empty. This encoding is compact and makes win detection elegant. A row, column, or diagonal sums to +3 (X wins) or -3 (O wins).

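# Check rows and columns of the 3x3 view (diagonals work the same way)
b = state.reshape(3, 3)
won = False
for i in range(3):
    if abs(b[i].sum()) == 3:     # row i is all X or all O
        won = True
    if abs(b[:, i].sum()) == 3:  # column i is all X or all O
        won = True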

The action space is the set of empty cells. Occupied positions hold NaN in the Q-table, so they never carry a value: pick_action only chooses among empty cells, and np.nanmax ignores the NaN entries when computing the update target:

q = np.full(9, np.nan)
q[state == 0] = 0.0  # only legal moves get Q-values

Self-Play: The Opponent is the Curriculum

The key insight of self-play is that both agents improve together. In early training, epsilon (the probability of choosing a random action instead of the greedy one) starts at 1.0, so both play nearly randomly and wins and losses are noisy. As epsilon decays linearly towards 0.05, they exploit what they've learned, and the opponent becomes a tougher challenge.

This creates an arms race that shows up in the training curve.

Three things happen as training progresses:

Draw rate rises from ~10% to ~42%. Both agents get better at defending, so fewer games end in a clear win.
Win rates equalise. X starts with a slight advantage (going first), but by the end, both hover around 30%.
The transition is sharp. Around episode 30,000, epsilon has decayed enough that agents exploit their Q-values more than they explore. The draw rate shoots up.
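The timing of that transition follows from the linear decay. As a rough sketch, the schedule in the training loop (eps_decay = 2.5e-5, final_epsilon = 0.05) can be written in closed form; epsilon_at is a hypothetical helper, not part of the original code, and it ignores the tiny floating-point drift of repeated subtraction.

```python
def epsilon_at(episode, start=1.0, final=0.05, decay=2.5e-5):
    """Approximate epsilon after `episode` episodes of linear decay."""
    return max(final, start - decay * episode)


for ep in (0, 10_000, 30_000, 38_000, 100_000):
    print(ep, round(epsilon_at(ep), 3))
# 0 1.0
# 10000 0.75
# 30000 0.25
# 38000 0.05
# 100000 0.05
```

By episode 30,000 only one move in four is random, and from about episode 38,000 onwards the agents play greedily 95% of the time.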

Reward Propagation in Adversarial Games

In single-agent Q-learning (like FrozenLake), the agent updates after every step. In a two-player game, we need an extra mechanism: when one agent wins, the loser must also learn from its last move.

if info == 'win' and len(history) >= 2:
    other = agents[(turn + 1) % 2]
    prev = history[-2]
    other.update(prev[1], prev[2], -reward, next_state, True)

The winner gets reward +1. The loser's last move gets -1. This is how the agent learns defensive play: "the move I just made let my opponent win on the very next turn, so that was a bad move."

Reading the Q-Values

The Q-table is where the agent's strategy lives. Each entry says: "from this board state, how good is it to play in a given cell?" Let's look at three critical situations the agent learned to handle:

Left panel (Set Up a Fork): X has the centre and top-left corner. The agent assigns Q = +0.85 to the bottom-right corner (position 8), which creates a fork: two ways to win that the opponent can't both block. Every other empty cell gets Q = 0.

Centre panel (Block or Lose): O has positions 0 and 3, threatening to complete the left column. The Q-values here are all negative except position 6 (Q = 0.00), the blocking move. The agent learned that not blocking leads to certain defeat. Notice the agent didn't just learn that position 6 is good; it learned that every other option is bad.

Right panel (Take the Win): X has positions 0 and 1, one move away from completing the top row. Position 2 gets Q = +0.81. The agent learned to finish the game when the opportunity is there, rather than play elsewhere.

Going Deeper

Q-Learning in Games vs Single-Agent Environments

In a single-agent setting like FrozenLake or Value Iteration on a grid world, the environment is stationary. The transition probabilities don't change. In a game with self-play, the "environment" includes the opponent, and the opponent is changing constantly.

This means Q-learning in games violates a core assumption: stationarity. The Markov property still holds (the board state contains all relevant information), but the transition dynamics shift as the opponent improves. In practice, this works because both agents improve gradually, and the learning rate is high enough to track the changing opponent.

The Learning Rate = 1 Choice

You might have noticed lr=1.0, which seems aggressive. With $\alpha = 1$, each Q-update completely replaces the old value:

$$Q(s_t, a_t) \leftarrow r_{t+1} + \gamma \max_a Q(s_{t+1}, a)$$

This works for tic-tac-toe because the game is deterministic: from a given board state, taking a specific action always produces the same next state (your move is deterministic; only the opponent's response varies). With $\alpha = 1$, the agent always uses the most recent outcome, which adapts quickly to the opponent's evolving strategy.

For stochastic environments, $\alpha = 1$ would be catastrophic, as it would forget everything from past experience. But for deterministic transitions in a game, it's ideal.
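A tiny numeric sketch makes the difference concrete. The target sequence below is made up to mimic a noisy outcome; td_update is an illustrative helper that applies the same rule as the agent's update method to a single number.

```python
def td_update(q, target, lr):
    """One Q-learning step towards `target` with learning rate `lr`."""
    return q + lr * (target - q)


# Hypothetical noisy targets for one state-action pair.
targets = [1.0, -1.0, 1.0, 1.0]

q_overwrite = 0.0   # lr = 1: keeps only the latest target
q_average = 0.0     # lr = 1/n: running mean of all targets so far

for n, target in enumerate(targets, start=1):
    q_overwrite = td_update(q_overwrite, target, lr=1.0)
    q_average = td_update(q_average, target, lr=1.0 / n)

print(q_overwrite)           # 1.0, the last target only
print(round(q_average, 3))   # 0.5, the mean of the four targets
```

With lr = 1 the estimate jumps to whatever happened last, which is fine when the same action always leads to the same result and misleading when outcomes are noisy.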

The Self-Play Arms Race

Self-play training has a characteristic signature: the draw rate is a proxy for skill. When two beginners play, most games end in wins (because both make exploitable mistakes). When two experts play, most games end in draws (because neither makes a mistake worth exploiting).

Tic-tac-toe with perfect play from both sides is provably a draw. Our agents' ~42% draw rate suggests they're strong but not perfect: they're still occasionally making mistakes that the opponent can exploit.

Hyperparameter Sensitivity

The original code uses these values, all from the source implementation:

Parameter	Value	Why
`gamma`	0.95	Games are short (5-9 moves), so moderate discounting works. Higher values (0.99) also work.
`lr`	1.0	Deterministic transitions; always use the latest outcome.
`epsilon`	1.0 to 0.05	Start fully random, end mostly greedy.
`eps_decay`	2.5e-5	Linear decay over ~38,000 episodes to reach `final_epsilon`.
`episodes`	100,000	Enough for the Q-table to converge on the ~6,600 reachable states.

The Q-table ends up with roughly 6,600 entries (out of the theoretical 19,683 board configurations). Many configurations are unreachable in valid play (e.g., a board where X has played 5 times but O has played once).

When NOT to Use Tabular Q-Learning for Games

Tabular Q-learning works beautifully for tic-tac-toe because the state space is tiny. It fails for:

Chess ($\sim 10^{44}$ legal positions): the Q-table would be impossibly large
Go ($\sim 10^{170}$): even worse
Games with continuous state spaces: no table can hold them

For these, you need function approximation: deep Q-networks replace the table with a neural network, or policy gradient methods learn a policy directly. The ideas from this post (self-play, reward propagation, exploration) carry forward directly.

Comparison: Self-Play vs Teacher

Our implementation uses self-play: both agents learn simultaneously. An alternative approach (also in the original code) trains against a teacher, a heuristic opponent that plays well but not perfectly. Self-play has the advantage of being curriculum-free: you don't need to design a teacher, and the difficulty automatically scales with the learner's ability. The downside is that training can be unstable early on, as the quality of the training signal depends on having a reasonable opponent.

Where This Comes From

The Roots: Watkins and Temporal Difference Learning

Q-learning was introduced by Chris Watkins in his 1989 PhD thesis, "Learning from Delayed Rewards." The core idea is that an agent can learn the value of actions without knowing the environment's dynamics, purely from the reward signal and the temporal difference between consecutive estimates.

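The update rule we used is exactly Watkins' formulation:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$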

The term in brackets is the TD error: the difference between what we expected ($Q(s_t, a_t)$) and what we actually observed ($r_{t+1} + \gamma \max_a Q(s_{t+1}, a)$). Learning adjusts Q towards the observed value.

Watkins and Dayan (1992) later proved that Q-learning converges to optimal Q-values under certain conditions: every state-action pair must be visited infinitely often, and the learning rate must satisfy the Robbins-Monro conditions ($\sum \alpha = \infty$, $\sum \alpha^2 < \infty$). Our $\alpha = 1$ technically violates these conditions, but the deterministic nature of tic-tac-toe means the algorithm still converges in practice.

Game-Playing AI: A Brief History

Games have been the proving ground for AI since the field's inception. Sutton and Barto open Chapter 1 of Reinforcement Learning: An Introduction with exactly this problem: a temporal-difference learner playing tic-tac-toe. They use it to introduce the core RL concepts before any formal machinery.

The lineage of game-playing RL runs deep:

Samuel (1959): Arthur Samuel's checkers program was one of the first learning programs, using a form of temporal difference learning decades before the name existed. It beat its creator.
Tesauro (1995): Gerald Tesauro's TD-Gammon used temporal difference learning with a neural network to play backgammon at world-champion level. It discovered novel strategies that human experts later adopted.
Silver et al. (2016): AlphaGo combined deep neural networks with Monte Carlo tree search and self-play to defeat the world Go champion. The self-play idea is the same as ours; only the scale is different.

"The game of tic-tac-toe is a simple example, but it illustrates the fundamental principles of reinforcement learning: learning from interaction, temporal difference methods, and the trade-off between exploration and exploitation."
-- Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 1

Connection to Minimax

For a two-player, zero-sum game like tic-tac-toe, optimal play follows the minimax principle: each player assumes the opponent plays optimally and chooses the action that maximises the minimum possible outcome.

Q-learning with self-play implicitly converges towards minimax values. When both agents are learning optimally, the Q-values for X represent $\max$ (X wants to maximise its reward) and the Q-values for O represent $\min$ (O wants to minimise X's reward, which is equivalent to maximising O's own). The self-play training process, where both agents simultaneously improve, pushes the Q-values towards this minimax equilibrium.

This is why our agents discover strong strategy without being told about minimax: the competitive pressure of self-play naturally drives them there.
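For comparison, the minimax values that self-play is being pushed towards can be computed directly. This is a minimal sketch using plain memoised minimax, not the Q-learning code from this post; boards are tuples with 1 for X, -1 for O and 0 for empty, and values are from X's point of view.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 1 or -1 if that player has three in a line, else 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


@lru_cache(maxsize=None)
def minimax_value(board, player):
    """Game value for X (+1 win, 0 draw, -1 loss) with `player` to move."""
    w = winner(board)
    if w != 0:
        return w
    if 0 not in board:
        return 0
    values = [minimax_value(board[:i] + (player,) + board[i + 1:], -player)
              for i in range(9) if board[i] == 0]
    # X picks the best outcome for X, O picks the worst outcome for X.
    return max(values) if player == 1 else min(values)


print(minimax_value((0,) * 9, 1))  # 0: perfect play from both sides is a draw
```

Unlike the Q-learning agents, this needs the full rules up front and searches every line of play, which is only practical because the game is so small.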

Interactive Tools

Q-Learning Visualiser — Watch Q-learning train step-by-step on grid worlds in the browser

Q-Learning from Scratch: Navigating the Frozen Lake (tabular Q-learning fundamentals)
Value Iteration vs Q-Learning: Dynamic Programming Meets RL (comparing model-based and model-free approaches)
Deep Q-Networks: When Tables Aren't Enough (scaling Q-learning with neural networks)
Policy Gradients and REINFORCE from Scratch (an alternative to Q-learning that learns a policy directly)

Frequently Asked Questions

What is Q-learning with self-play?

Q-learning is a reinforcement learning algorithm that learns the value of each state-action pair by interacting with an environment. Self-play means both players are Q-learning agents training against each other. As each agent improves, it forces the other to improve too, driving both towards optimal play without needing a hand-crafted opponent.

Why use self-play instead of training against a fixed opponent?

A fixed opponent (random or rule-based) has a ceiling: once your agent exploits its weaknesses, it stops improving. Self-play creates an ever-improving curriculum because the opponent adapts alongside the learner. This naturally pushes both agents towards minimax-optimal strategies.

How does epsilon affect self-play training?

Epsilon controls how often the agent takes a random action instead of its current best. Too low and the agents settle into a narrow set of positions, missing better strategies. Too high and learning is slow because actions are mostly random. Decaying epsilon over time (high early, low late) gives broad exploration first, then refined exploitation.

Does Q-learning with self-play always converge to optimal play in tic-tac-toe?

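In principle, yes, given enough training episodes and appropriate hyperparameters. Tic-tac-toe has a small enough state space (a few thousand reachable positions) that tabular Q-learning can visit every state-action pair many times, and the Q-values move towards the minimax equilibrium, where both agents play perfectly and every game ends in a draw. In practice, a finite run can stop short of that: the agents trained in this post draw about 42% of their games, so they are strong but not yet perfect.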

Can this approach scale to more complex games like chess or Go?

Not with a Q-table. Chess has roughly $10^{44}$ legal positions, making tabular Q-learning impossible. For complex games, you replace the table with a neural network (Deep Q-Networks) or use policy gradient methods. AlphaGo and AlphaZero used self-play with deep neural networks and Monte Carlo tree search to master Go, chess, and shogi.

What is the difference between Q-learning and minimax for game playing?

Minimax requires a complete model of the game (all possible states and transitions) and searches the full game tree. Q-learning is model-free: it learns from experience without needing the game rules explicitly. For small games like tic-tac-toe both reach the same optimal strategy, but Q-learning generalises to environments where you cannot enumerate the full game tree.

Meme Monday

Ben Halpern — Mon, 11 May 2026 10:06:48 +0000

Meme Monday!

Today's cover image comes from the last thread.

DEV is an inclusive space! Humor in poor taste will be downvoted by mods.

When AI writes the code, what should humans actually read?

Graham Trott — Mon, 11 May 2026 10:04:16 +0000

There is an open secret in the world of vibe coding. The people commissioning the work — the ones with the product idea, the domain expertise, the actual customer in mind — usually cannot read the output. They prompt, the model produces, and the result is a tower of TypeScript or Python they accept on faith because they have no way to verify it. The validation step gets quietly skipped. "It runs" becomes "it's correct."

This is not a moral failing. It's a tooling problem. And I think the way out of it is hiding in plain sight.

The problem with normal code in an AI-first workflow

If your premise is that AI is going to do most of the routine writing of code, then the human's job shifts. We move from authors to reviewers. From "did I express this correctly?" to "did the machine express what I meant?"

Reviewing is a different job from writing, and it has different tooling needs. When you're writing, you want a fast feedback loop — autocomplete, jump-to-definition, a fast test runner. When you're reviewing, you want comprehension support — context next to the code, an explanation of why this section exists, and confidence that what you're reading is actually what's running.

Most editors are still optimised for the writer. The reviewer has to piece things together: read the code, hunt for a docstring above it, hope the docstring still matches, then mentally verify against intent. For an experienced developer writing their own code, this can be fast. For a vibe coder reviewing AI output, it's almost impossible.

Two things to fix

I've been working on the editor side of this for AllSpeak, a multilingual scripting language. AllSpeak allows the same programs to be written in French, German, Italian, or any other language we add. The combination of natural-language source with a review-first editor is starting to look like a real answer to the validation gap.

The first screenshot below shows the editor in normal ("raw") mode, showing a documentation block followed by some code. Because of the color-coding, the eye skips over the documentation quite easily; it's not meant to be read here.

The second screenshot shows the editor in Blocks mode displaying the same piece of code but with its documentation in the right-hand pane, making code review far simpler. On the left is a list of all the blocks, for navigation, or you can use the up and down arrows in the toolbar. This is just a starting point; the editor could have a long way to go.

Maintaining such a structure would be a daunting task without the help of AI. This is an almost free gift we should take full advantage of.

There are two specific changes I'm making.

First, sections of code get a documentation block above them, in a structured comment format. Nothing radical there — literate programming has done variations on this for decades. The new bit is that each block contains two SHA hashes: one for the documentation, one for the code section it describes. If either changes without the other being deliberately re-paired, the editor flags drift.

This is cheap, mechanical, and solves a problem that has plagued every codebase I've ever worked on. Documentation rots silently. Cryptographic pairing makes the rot clearly audible.
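To make the mechanism concrete, here is a minimal Python sketch of hash pairing. It is not the AllSpeak editor's format: the Block structure, the choice of SHA-256, and the function names are all assumptions for illustration, and the "code" is a placeholder string.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class Block:
    """A documentation/code pair plus the hashes recorded at pairing time."""
    doc: str
    code: str
    doc_hash: str
    code_hash: str


def pair(doc: str, code: str) -> Block:
    """Deliberately (re-)pair a doc with its code by recording both hashes."""
    return Block(doc, code, sha256_hex(doc), sha256_hex(code))


def drift(block: Block) -> list:
    """Return which halves changed since the block was last paired."""
    changed = []
    if sha256_hex(block.doc) != block.doc_hash:
        changed.append("doc")
    if sha256_hex(block.code) != block.code_hash:
        changed.append("code")
    return changed


block = pair("Adds up the order lines and applies the discount.",
             "placeholder source for the section")
print(drift(block))                  # []

block.code = "placeholder source, edited later"
print(drift(block))                  # ['code']

block = pair(block.doc, block.code)  # reviewer confirms the doc still fits
print(drift(block))                  # []
```

The hashes say nothing about whether the prose is accurate; they only guarantee that someone, human or agent, looked at both halves together after the last change.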

Second, the editor gains a side-by-side mode that shows one section at a time, code on one side, its documentation on the other. The reviewer sees a small, focused unit and can ask the only question that matters: does the code do what the prose says it does?

That's a comparison task, not a comprehension task. Comparison is much easier than comprehension for non-experts — and that's the entire point.

Of course, all of this only becomes possible when AI is doing the coding, as is increasingly the case. Human coders, however professional, don't like to maintain comprehensive documentation for their code. Documentation gets in the way of coding and is usually regarded as an imposition, so the bare minimum is all that gets written. An agent, on the other hand, has endless patience and is more than willing to take on such a task.

Why this matters more for AllSpeak than for Python

Here's where the language choice does real work. If the code is dense Python with framework conventions a non-developer can't parse, asking "does the code match the prose?" is still a comprehension task in disguise. The reviewer has to understand the code first, then compare. The validation gap stays roughly where it was.

If the code is AllSpeak — close enough to English that a careful reader can follow it line by line — the gap narrows considerably. The reviewer reads two pieces of natural-ish text and checks whether they agree. They don't need to know what a decorator is, or how async resolves, or which way the data flows through a hook. They just need to read.

That's the leverage point. AllSpeak by itself simplifies syntax; the review tooling by itself simplifies workflow; together they change who can credibly validate generated code.

Files become packages, not text

A side effect of all this: a source file is no longer just a sequence of statements. It's a structured package containing code sections, documentation sections, and the cryptographic links between them. The raw form might look a bit ugly opened in vim — comment blocks dominate — but it's not really meant to be read raw any more than a minified JavaScript bundle is.

I want to be careful with this claim, though. There's a temptation to push it further than it deserves. "Humans don't need to read raw code any more" is not quite right. Sometimes the editor is unavailable. Sometimes you're debugging at 2am with grep and a terminal. Sometimes a future tool needs to interoperate with your files and the only sane interface is plain text. The defensible version of the claim is softer: humans should rarely need to read the raw form, but the format should remain legible in extremis. AllSpeak's plain-English nature preserves that floor even with the scaffolding around it.

What I'd encourage other tool-makers to think about

If the future of coding is mostly machine-written, the tooling we should be investing in is the tooling that helps humans check what the machines produced. That's underbuilt right now. The current generation of AI coding tools — the Lovables and v0s and Bolts of the world — focus almost entirely on generation. They produce React, Next.js, the standard opaque stack, and they assume the user will accept whatever comes out. For users who can't read the output, that assumption is shaky at best.

A few things I think are worth borrowing or stealing from what I'm building:

Treat documentation as a first-class artefact paired with code, not a comment that floats nearby
Use cryptographic pairing or some equivalent to make drift visible
Build review modes that show one unit at a time with context attached
Pick a source language whose readability matches the average reviewer's skill level

The last one is the hardest sell to the developer audience because it sounds like a step backwards. But if you accept the premise that AI is going to write most of the code and humans are going to review most of it, then optimising the source language for human reading — even at some cost to expressive density — starts to look like exactly the right trade.

A small invitation

I'm writing this as the AllSpeak editor work progresses. If you're building tools in this space, or if you're a vibe coder who's quietly worried about whether you can really vouch for what you're shipping, I'd be very interested in hearing from you.

The future where AI writes everything and humans rubber-stamp it is the bad version. The future where AI writes everything and humans actually read and approve it is the one worth building toward. The difference is mostly about tooling.

Postscript

The editor described here is written in the JS implementation of AllSpeak to run in a browser. It is served from http://localhost:8080 by a smaller AllSpeak module written in the Python implementation, which has access to all files and system resources.

At the time of writing, the editor (asedit.as) comprises 944 lines of AllSpeak code, 173 lines of comment and 44 blank lines. The block view was added in one day by Claude Code, using continuous prompt/review.

This document was proposed and argued by me, written by Claude and edited by me. I take full responsibility for the content.

Compass v1.1.0 · we shipped a memory plugin that catches its own consumption drift

chunxiaoxx — Mon, 11 May 2026 10:01:07 +0000

Compass v1.1.0 · the recall consumption fix

We shipped nautilus-compass v1.1.0 12 hours after v1.0.0, the public
stable cut. v1.1.0 fixes a class of failure that v1.0.0 surfaces but
does not catch · one we hit in our own usage 5 hours after launch.

The bug we caught in production

A sister Claude Code dialog was supposed to publish a long-form article
to wechat using a 6-step quality pipeline (audit-gate, xhs-cards-embed,
specific account login flow). The pipeline was documented in cross-session
memory · a file called publisher_quality_pipeline_20260430.md.

Compass recall fired correctly · the file appeared in the agent's
UserPromptSubmit hook output:

🟢 [3h old] memory/publisher_quality_pipeline_20260430.md
       audit-gate / xhs-cards-embed / wxid · v6 must pass the critic's 6-dimension scoring before publishing

The agent saw the title. Saw the 80-character description. Acted. It
did not Read the file body. The actual rules — how to walk audit-gate,
which wxid, what xhs-cards-embed structure looks like — those rules
were in the body. None of them entered the agent's working context.

The agent then reproduced exactly the failure mode the file was written
to prevent: ad-hoc _tmp_publish_v8.cjs scripts, no critic round, wrong
login path.

The user's diagnosis was sharp:

Compass recalled it · I didn't consume it · this is persona drift at the agent layer · not a failure of Compass itself

That's half right. Recall surfaced the right file. The agent failed to
consume. But the shape of the recall response made the failure easy —
we returned title + 120-char description. Easy to skim. Easy to assume
you have read it when you have only read the index.

This is structural. Not the agent's fault.

The three-layer fix in v1.1.0

v0 · embed body in top-3 hits

Top-3 recall hits now embed the first 800 characters of post-frontmatter
body in an indented │ block:

🟢 score=0.84 · [3h old] memory/publisher_quality_pipeline_20260430.md
       audit-gate / xhs-cards-embed / wxid · v6 must pass the critic's 6-dimension scoring
       │ # Publisher quality pipeline
       │
       │ Six-step pipeline mandatory before publishing to wechat:
       │ 1. audit-gate · V6 critic checks against 6 dimensions ...
       │ 2. xhs-cards-embed · embed cards into article body via ...
       │ 3. wxid login flow · use wxid `chunxiaox` not openid_of_first_follower
       │ ...
       │ … (+1273 more · Read publisher_quality_pipeline_20260430.md for rest)

The agent now has the rules in its working context. No additional Read
tool call required. Tail hits 4..K stay header-only to keep the response
bounded (~3KB total).
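
As a rough sketch of the rendering rule described above (body embedded for the top 3 hits, header-only for the rest), something like the following would do. This is not the Compass source: the `Hit` structure, `strip_frontmatter` and `render` are hypothetical names, and the frontmatter handling assumes a plain `---` delimited block.

```python
from dataclasses import dataclass

BODY_CHARS = 800   # from the post: first 800 characters of the body
EMBED_TOP_N = 3    # from the post: only the top-3 hits carry body text


@dataclass
class Hit:
    path: str
    score: float
    description: str
    text: str  # full file contents, including any frontmatter


def strip_frontmatter(text: str) -> str:
    """Drop a leading '---' ... '---' frontmatter block if present."""
    if text.startswith("---"):
        end = text.find("\n---", 3)
        if end != -1:
            return text[end + 4:].lstrip("\n")
    return text


def render(hits: list[Hit]) -> str:
    """Render hits by descending score; embed body text only for the top N."""
    lines = []
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    for rank, hit in enumerate(ranked):
        lines.append(f"score={hit.score:.2f} · {hit.path}")
        lines.append(f"       {hit.description}")
        if rank >= EMBED_TOP_N:
            continue  # tail hits stay header-only to bound the response size
        body = strip_frontmatter(hit.text)
        for body_line in body[:BODY_CHARS].splitlines():
            lines.append(f"       │ {body_line}".rstrip())
        remaining = len(body) - BODY_CHARS
        if remaining > 0:
            lines.append(f"       │ … (+{remaining} more · Read {hit.path} for rest)")
    return "\n".join(lines)


hits = [
    Hit(f"memory/f{i}.md", 0.9 - i * 0.1, f"desc {i}",
        "---\ntype: lesson\n---\n# Rules\n" + "x" * 1000)
    for i in range(4)
]
out = render(hits)
print(out)
```

The point of the design is that the cheapest path for the agent (reading the recall output) now includes the rules themselves, not only their label.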

v1 · embed past-mistake body in anti-anchor alerts

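Compass's drift detector matches the current prompt against 35 negative
anchors learned from prior mistakes (translated from Chinese: "I guess
this is probably how it works · the user won't check anyway", "pretend
the plan was agreed last time · the user has probably forgotten", ...).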

Until v1.1.0 the alert just said: "matched anti-anchor X with cos=0.625".
Same problem as v0 — label visible, body invisible, agent shrugs.

v1.1.0 alerts now embed body from the most-relevant past lesson session.
Two-tier match: substring 6-gram against the anchor + lesson-type
frontmatter (Tier 1, precise) · falls back to recent drift!=green
sessions (Tier 2, the agent's own self-reported slip-ups). Every alert
becomes actionable, not decorative.
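
One possible reading of the two-tier match, sketched under stated assumptions: Tier 1 looks for any shared character 6-gram between the anchor and a lesson-type session, and Tier 2 falls back to the most recent session whose drift status is not green. The `Session` fields (`kind`, `drift`, `age_hours`) and the "most recent wins" tie-break are hypothetical; the post does not say how Compass ranks candidates within a tier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    name: str
    kind: str         # hypothetical frontmatter field, e.g. "lesson"
    drift: str        # hypothetical frontmatter field, e.g. "green", "yellow"
    body: str
    age_hours: float


def char_ngrams(text: str, n: int = 6) -> set[str]:
    """All overlapping character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def pick_lesson(anchor: str, sessions: list[Session]) -> Optional[tuple[str, Session]]:
    """Return (tier, session) for the lesson to embed in the alert, or None."""
    grams = char_ngrams(anchor)
    # Tier 1: lesson-type sessions whose body contains any 6-gram of the anchor.
    tier1 = [s for s in sessions
             if s.kind == "lesson" and any(g in s.body for g in grams)]
    if tier1:
        return "tier1", min(tier1, key=lambda s: s.age_hours)
    # Tier 2: fall back to the most recent session that self-reported drift.
    tier2 = [s for s in sessions if s.drift != "green"]
    if tier2:
        return "tier2", min(tier2, key=lambda s: s.age_hours)
    return None


sessions = [
    Session("lesson-a", "lesson", "green",
            "Lesson: never assume the user will not check the output.", 40.0),
    Session("slip-b", "session", "yellow", "Skipped the critic round.", 2.0),
]
precise = pick_lesson("I guess so, the user will not check anyway", sessions)
fallback = pick_lesson("zzzzzzzzzz", sessions)
print(precise[0], precise[1].name, fallback[0], fallback[1].name)
```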

v2 · detect "recall fired but not consumed"

The most direct signal: did the agent actually open any of the files
recall surfaced?

recall_consumption.py (new module) walks back through the live session
jsonl file, finds N most-recent recall blocks, extracts memory file
paths, then checks subsequent assistant turns for matching Read tool
calls. If recall surfaced N paths and 0 got read, that is the failure
signature.

Wired into:

drift_check MCP tool result — runs even when the BGE daemon is unreachable, since the audit is pure file traversal
mid_session_hook every 25 tool calls — only nags when ≥3 unconsumed AND ratio < 0.3 (real signal, not noise)

Tested on a 130MB / 32k-line session: 41 recall hits surfaced, 0 consumed.
Smoking gun for "label != consumption" drift.
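
The audit itself is simple enough to sketch. The version below is an illustration, not recall_consumption.py: it assumes a simplified session format where each JSONL line is an event with a `type` field (`recall` or `tool_use`), which is a hypothetical stand-in for the real session log, and it reads the nag rule as "at least 3 unconsumed paths and a consumed/surfaced ratio below 0.3".

```python
import json
import re

MEMORY_PATH = re.compile(r"memory/[\w./-]+\.md")


def audit(jsonl_lines, last_n_recalls=5):
    """Return (surfaced, consumed) path sets for the N most recent recall blocks."""
    events = [json.loads(line) for line in jsonl_lines if line.strip()]
    recall_idx = [i for i, e in enumerate(events) if e.get("type") == "recall"]
    surfaced, consumed = set(), set()
    for i in recall_idx[-last_n_recalls:]:
        paths = set(MEMORY_PATH.findall(events[i].get("text", "")))
        surfaced |= paths
        # Only Read calls made after the recall block count as consumption.
        for later in events[i + 1:]:
            if later.get("type") == "tool_use" and later.get("name") == "Read":
                target = later.get("input", {}).get("file_path", "")
                consumed |= {p for p in paths if target.endswith(p)}
    return surfaced, consumed


def should_nag(surfaced, consumed):
    """Nag only on a real signal: >= 3 unconsumed paths and consumed ratio < 0.3."""
    unconsumed = len(surfaced - consumed)
    ratio = len(consumed) / len(surfaced) if surfaced else 1.0
    return unconsumed >= 3 and ratio < 0.3


session = [
    json.dumps({"type": "recall",
                "text": "memory/a.md\nmemory/b.md\nmemory/c.md\nmemory/d.md"}),
    json.dumps({"type": "tool_use", "name": "Bash", "input": {"command": "ls"}}),
    json.dumps({"type": "tool_use", "name": "Read",
                "input": {"file_path": "/home/me/memory/a.md"}}),
]
surfaced, consumed = audit(session)
print(len(surfaced), len(consumed), should_nag(surfaced, consumed))
```

Because this is pure file traversal with no embedding calls, it can run even when the embedding daemon is down, which matches the wiring described above.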

V7 v0.2 · the governance plan that scales without templates

v1.0.0 shipped a thin V7 governance layer with three tools:
governance_dispatch (fan-out router), governance_audit (cross-agent
fake-closure scanner), governance_lock_check (L0 hash lock for the
immutable core). 13 MCP tools total.

v0.1 dispatch worked, but it was only a fan-out router: given
channels=[dev.to, x, github] it produced one bounty per channel via a
static dict lookup. A user asked the right question:

Every industry has its own task types; you can never cover them all.

Right. Templates cannot cover the long tail of industries. The platform
side already solved this for publishing — channel adapters + anchor
pack registry — so adding a new channel or vertical = data change, not
code change.

v1.1.0 brings the same idea to decomposition. The new
governance_plan MCP tool reads two file-exported registries:

_platform_registry/agents_capabilities.json — what each executor declares it can do (id, outputs, optional domains, optional anchor packs)
_platform_registry/anchor_packs_phases.json — per-domain DAG of phases, each phase says requires_capability and depends_on

For each phase, V7 ranks executors by capability score (+10 capability
match, +5 domain match, +3 anchor pack match), picks the highest, emits
a queue file with depends_on_phase_ids so platform-side cron mints
bounties in the right order.
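
The scoring rule above is small enough to show in full. This is a sketch under assumptions, not the governance_plan source: the dictionary keys (`outputs`, `domains`, `anchor_packs`, `anchor_pack`) and the sample executors and phases are made up for illustration, with only the +10/+5/+3 weights and the `requires_capability` / `depends_on` fields taken from the post.

```python
CAPABILITY_POINTS, DOMAIN_POINTS, PACK_POINTS = 10, 5, 3  # weights from the post


def score(executor: dict, phase: dict, domain: str) -> int:
    """Capability score of one executor for one phase."""
    points = 0
    if phase["requires_capability"] in executor.get("outputs", []):
        points += CAPABILITY_POINTS
    if domain in executor.get("domains", []):
        points += DOMAIN_POINTS
    if phase.get("anchor_pack") in executor.get("anchor_packs", []):
        points += PACK_POINTS
    return points


def plan(phases: list[dict], executors: list[dict], domain: str) -> list[dict]:
    """Pick the highest-scoring executor per phase and keep the DAG edges."""
    queue = []
    for phase in phases:
        best = max(executors, key=lambda e: score(e, phase, domain))
        queue.append({
            "phase_id": phase["id"],
            "executor": best["id"],
            "depends_on_phase_ids": phase.get("depends_on", []),
        })
    return queue


# Illustrative registries only; the real ones live in the exported JSON files.
executors = [
    {"id": "V5", "outputs": ["write", "publish"], "domains": ["marketing/dev-tools"]},
    {"id": "V6", "outputs": ["numeric-audit"], "domains": ["caishen-finance/audit"]},
]
phases = [
    {"id": "p1", "requires_capability": "numeric-audit", "depends_on": []},
    {"id": "p2", "requires_capability": "write", "depends_on": ["p1"]},
]
queue = plan(phases, executors, "caishen-finance/audit")
print(queue)
```

Note that a capability match (+10) outweighs domain and anchor-pack matches combined (+8), so an executor that declares the capability always wins over one that does not.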

Verified on two domains:

marketing/dev-tools → 4 phases routed V5/V5/V5/Kairos
caishen-finance/audit → 5 phases · V6 wins for numeric-audit (V5 doesn't declare it · V5 takes write+publish)

Adding medical/literature-review next: 1 row in platform_anchor_packs plus 1 row in platform_agents.metadata.capabilities[]. Zero V7 source change. Zero MCP tool surface change.

What stayed unchanged · the eval headlines

Eval numbers are still the v1.0.0 locked numbers from 2026-05-08:

Metric	nautilus-compass	best public baseline
LongMemEval-S (n=500)	56.6%	Zep 55-60% (different judge)
EverMemBench-Dynamic Run 1	44.4% (n=500)	MemOS 42.55
EverMemBench-Dynamic Run 2	47.3% (n=497)	—
Drift detector ROC AUC (held-out)	0.83	—
Reproduction cost	$3.50 end-to-end	$50+ for GPT-4o-judge stacks

v1.1.0 doesn't move the eval numbers. It moves the consumption
numbers — the ratio of recall hits whose body actually lands in the
agent's working context. We do not have a clean benchmark for that yet
(suggestions welcome) but in our own sessions it went from "skim the
title and proceed" to "rules-in-context by default."

Try it

pip install nautilus-compass==1.1.0
# or
npm install nautilus-compass@1.1.0

Two papers on arxiv (drift detection + memory pipeline). 228 pytests
all green. MIT (anchors CC0).

Repo: github.com/chunxiaoxx/nautilus-compass

In-browser drift demo (no install): huggingface.co/spaces/chunxiaox/nautilus-compass

Postscript · what we believe

Recall != consumption · only reading the body counts as consumption · otherwise a hit is worth nothing

Long-running agents drift. They forget rules they read three sessions
ago. They reproduce mistakes someone else already paid for. The fix is
not a smarter model · it is making the rules unmissably present in the
working context, then auditing whether they were actually consumed,
then making the audit cheap enough to run every 25 tool calls.

That is what v1.1.0 ships.

Detect Faces: Boxes, Landmarks, and Counts in One Call

Om Prakash — Mon, 11 May 2026 10:00:53 +0000

If you've ever tried to ship a "crop to face" feature, a privacy blur before user uploads go public, or a simple head-count on event photos, you already know the pain. Most face-detection options out there are either overkill — bundled into a full recognition product you don't need — or so bare that you end up making a second call just to figure out where the eyes are. We built detect-faces to sit exactly in that gap.

What it does

POST /v1/image/detect-faces takes a public image URL and gives you back, for every face in the image:

A bounding box — the rectangle around the face, so you can crop, blur, or mask it.
Key landmarks — coordinates for the eyes, nose, and mouth, so you can centre crops, align portraits, or build downstream alignment logic without a second round trip.
A per-face confidence score, so you can tune precision vs recall for your use case.

The request itself is small. You send three fields:

image_url — a public URL of the image. Required.
min_confidence — a float between 0.0 and 1.0. Detections below this score are dropped. Defaults to 0.5, which is a sensible starting point for general photos.
include_landmarks — boolean. When true (the default), the response includes eye, nose, and mouth coordinates per face. Set it to false if you only need boxes and want a slightly tighter payload.

That's the whole API surface. No model selection, no resolution tier, no "advanced mode" toggle. Send a URL, get faces back. The endpoint is built for the boring, high-volume jobs developers actually do at scale — the kind of jobs where you don't want to think about anything except the result.

It's worth being clear about what this endpoint is not: it isn't a recognition endpoint. It doesn't try to identify who a face belongs to, match across photos, or estimate age or emotion. It's a detection primitive. The whole point is that it's a clean input into whatever pipeline you're building — cropping, blurring, counting, or feeding into our other endpoints for portrait or face-restore work.

Why we built it

We talked to a lot of teams building photo features, and the same shape of problem kept coming up. Someone needs to do something with a face — crop it, hide it, count it — and the only options are heavy SDKs that ship recognition by default, or smaller libraries that return a box and leave you to figure out the rest.

If all you want is a bounding box plus the landmarks needed to align a crop, you're paying for a lot of features you'll never use. And if you choose the cheaper, bare-bones detector, you end up writing your own landmark step or making a second API call — which kills the cost advantage you were chasing in the first place.

Our angle here is narrow on purpose. One endpoint, one job, both deliverables in one response. Bounding boxes for the people who just want to know where the faces are, and landmarks in the same payload for the people who need to align or centre a crop. No flag to enable an extra "premium" output. No second SKU. Same call, same price.

We also wanted this to be the cheapest detection endpoint we ship. Detection is a primitive — you should be able to run it on every image in your pipeline without doing pricing maths in your head. At 4 credits a call, you can.

Quickstart

The endpoint is a standard JSON POST. Here's the curl version — drop in your API key and an image URL and you're done:

curl -X POST https://api.pixelapi.dev/v1/image/detect-faces \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"image_url": "https://example.com/source.jpg", "include_landmarks": true}'

And the Python equivalent using requests. This is what you'd drop into a worker or a Flask/FastAPI handler:

import os
import requests

API_KEY = os.environ["PIXELAPI_KEY"]

def detect_faces(image_url, min_confidence=0.5, include_landmarks=True):
    response = requests.post(
        "https://api.pixelapi.dev/v1/image/detect-faces",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "image_url": image_url,
            "min_confidence": min_confidence,
            "include_landmarks": include_landmarks,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    faces = detect_faces("https://example.com/source.jpg")
    print(f"Detected {len(faces.get('faces', []))} face(s)")
    for i, face in enumerate(faces.get("faces", [])):
        print(f"  Face {i}: confidence={face.get('confidence')}, box={face.get('box')}")

A couple of practical notes if you're integrating this into a real backend:

Pull the API key from an environment variable, not from code. Boring advice, but it's the single most common mistake we see in early integrations.
Treat image_url as a fetch-from-public-internet operation on our side. Make sure the URL is actually reachable from outside your VPC — pre-signed S3 URLs work fine; private CDN paths won't.
Tune min_confidence per use case. For a "count people in this event photo" job, you might want to drop it to 0.3 so distant faces in a crowd aren't missed. For an "auto-crop a portrait" workflow, push it up to 0.7 so you don't centre on a random face-shaped object in the background.
Skip landmarks if you don't need them. Setting include_landmarks to false gives you a lighter response and is a small optimisation if you're calling this in a tight loop.

There's no async or webhook variant for this endpoint. Detection is fast enough that we keep it synchronous — your call blocks until you get the JSON back.

Use cases

We see three patterns come up over and over. They're not the only things you can build with this — but if you're new to the endpoint, these are good starting points.

Auto-crop group photos to centre on the largest face

Most photo apps eventually need a "smart thumbnail" feature. The trouble with naive centre-cropping is that the most important subject is almost never dead-centre in the frame — group shots especially put the main subject off to one side, with friends or background filling the rest. So you run detect-faces, pick the face with the largest bounding box (or the highest confidence, depending on your heuristic), and crop your thumbnail around that box plus some padding. Because the landmarks come back in the same response, you can go further — anchor the crop on the midpoint between the eyes instead of the box centre, which gives a much more natural-looking portrait crop. No second API call, no separate alignment step, just one POST and a bit of arithmetic on the response.
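
Here is a minimal sketch of that "bit of arithmetic". The response shape is an assumption: the code supposes each face has a `box` with `x`, `y`, `width` and `height`, and `landmarks` with `left_eye` and `right_eye` points, so check the field names against the actual response before using it. The `padding` value is also just a starting point.

```python
def largest_face(faces: list[dict]) -> dict:
    """Pick the face whose bounding box covers the largest area."""
    return max(faces, key=lambda f: f["box"]["width"] * f["box"]["height"])


def crop_rect(face: dict, image_w: int, image_h: int, padding: float = 0.4):
    """Square crop (left, top, right, bottom) around a face, clamped to the image.

    Anchors on the midpoint between the eyes when landmarks are present,
    otherwise on the centre of the bounding box.
    """
    box = face["box"]
    landmarks = face.get("landmarks") or {}
    if "left_eye" in landmarks and "right_eye" in landmarks:
        cx = (landmarks["left_eye"]["x"] + landmarks["right_eye"]["x"]) / 2
        cy = (landmarks["left_eye"]["y"] + landmarks["right_eye"]["y"]) / 2
    else:
        cx = box["x"] + box["width"] / 2
        cy = box["y"] + box["height"] / 2
    half = max(box["width"], box["height"]) * (1 + padding) / 2
    left = max(0, round(cx - half))
    top = max(0, round(cy - half))
    right = min(image_w, round(cx + half))
    bottom = min(image_h, round(cy + half))
    return left, top, right, bottom


# Hypothetical response data for an 800x600 image.
faces = [
    {"box": {"x": 100, "y": 80, "width": 200, "height": 200},
     "landmarks": {"left_eye": {"x": 160, "y": 150}, "right_eye": {"x": 240, "y": 150}}},
    {"box": {"x": 600, "y": 300, "width": 60, "height": 60}},
]
rect = crop_rect(largest_face(faces), image_w=800, image_h=600)
print(rect)
```

The returned tuple is in the (left, top, right, bottom) order that most imaging libraries expect for a crop call.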

Privacy-blur faces in user uploads before public display

Anyone running a community feature with user-submitted photos eventually runs into the privacy question. Maybe it's a marketplace where buyers don't want their faces showing up in listings, or a forum where someone uploads a photo and there's a bystander in the background. The workflow is the same: run the upload through detect-faces, walk the array of boxes, and gaussian-blur each region before you save the public version. You can keep the original on your side for moderation, but only the blurred version ever hits your CDN. With landmarks turned on, you can do tighter privacy crops — for example, blurring only the eye region for a milder anonymisation — without separately locating where the eyes are. And because the call is cheap, you can afford to run it on every upload by default, not just on the ones a user flags.

Count people in event photos for analytics

Event organisers, conference platforms, and venue analytics teams all want the same number: how many people are in this photo. It's a surprisingly load-bearing metric — it feeds into engagement reports, sponsor decks, "footfall vs. last year" comparisons. The straightforward implementation is to send every event photo through detect-faces, count the items in the response, and store that count against the photo's metadata. You'll want to drop min_confidence for crowd shots so far-away faces still register, and you'll want to be honest about the fact that face count is a lower bound — people turned away from the camera won't be counted. But for relative comparisons across photos, it's a perfectly good signal, and you can run it across an entire event's photo set in a few minutes without it costing you much at all.
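
One way to handle the threshold question for crowd shots is to request detections with a low min_confidence once and then count client-side at several thresholds, as in the sketch below. It relies on the same `faces` and `confidence` keys used in the quickstart; the sample response and the threshold values are illustrative.

```python
def count_faces(response: dict, min_confidence: float = 0.0) -> int:
    """Number of detected faces at or above the given confidence."""
    return sum(1 for face in response.get("faces", [])
               if face.get("confidence", 0.0) >= min_confidence)


def counts_by_threshold(response: dict, thresholds=(0.3, 0.5, 0.7)) -> dict:
    """Face count at each threshold, to see how sensitive the headcount is."""
    return {t: count_faces(response, t) for t in thresholds}


# Hypothetical response for one crowd photo, requested with a low min_confidence.
sample = {"faces": [{"confidence": 0.95}, {"confidence": 0.62},
                    {"confidence": 0.41}, {"confidence": 0.33}]}
counts = counts_by_threshold(sample)
print(counts)
```

Storing the count at a fixed threshold keeps year-on-year comparisons consistent, even if you later change what you pass as min_confidence.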

Pricing

detect-faces costs 4 credits per call, which works out to:

₹0.0027 per call (INR)
$0.00003 per call (USD)

That's the same price whether you ask for landmarks or not, and it's the cheapest detection endpoint we ship. The reasoning is simple: detection is a primitive, and primitives should be cheap enough that you don't think about them. At this price, putting detect-faces in front of every image in a user-upload pipeline is a rounding error on your infra bill, even at meaningful scale.
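
To put "rounding error" in numbers, a quick calculation at the list prices above (bulk credit purchases would lower the effective rate; the one-million-calls volume is just an example):

```python
from decimal import Decimal

CREDITS_PER_CALL = 4
USD_PER_CALL = Decimal("0.00003")  # list price quoted above
INR_PER_CALL = Decimal("0.0027")   # list price quoted above


def cost(calls: int) -> dict:
    """Credits and list-price cost for a given number of detect-faces calls."""
    return {
        "credits": calls * CREDITS_PER_CALL,
        "usd": USD_PER_CALL * calls,
        "inr": INR_PER_CALL * calls,
    }


million = cost(1_000_000)
print(million["credits"], million["usd"], million["inr"])
```

At these prices, a million detections comes to about $30 (₹2,700).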

What you also get in the same call — and this is the bit that quietly matters — is the landmark output. On a lot of other detection products, "where are the eyes" is either a separate endpoint, a more expensive tier, or a flag that bumps the cost. With us, landmarks are included in the base price. So if your downstream code needs to align a crop or do a tighter privacy blur, you don't pay twice or call twice. One POST, one cost, both outputs.

A quick word on credits: we use a credit system so that the same API key works across all of our endpoints without you having to manage separate billing for each. Buying credits in bulk gets you a better effective rate, and you can monitor usage from the dashboard. If you're prototyping, the free credits on signup are more than enough to wire up an integration end to end and see real responses come back.

Try it

The fastest path is to grab a key from the dashboard, drop the curl command above into your terminal with a real image URL, and watch the JSON come back.

Dashboard and API keys: pixelapi.dev/dashboard
Full docs and the rest of our endpoints: pixelapi.dev/docs

If you build something with it — a smart-cropper, a privacy filter, an event-count dashboard — we'd genuinely like to hear about it. And if you hit something that's missing from the response payload or the request body for your use case, tell us. This endpoint is intentionally narrow, but it's narrow because we listened to what people actually wanted, not because we were trying to stop you doing things. Detection should be cheap, fast, and complete in one call. That's the whole pitch.

How I Self-Hosted a Production-Ready NATS Server on Dokploy in 5 Minutes

Huy Pham — Mon, 11 May 2026 10:00:04 +0000

I wanted a message broker for a side project without paying for managed Kafka or wrestling with RabbitMQ clustering. NATS was the obvious answer—until I tried wiring up JetStream, token auth, WebSocket for the browser, and Traefik routing on my own. So I packaged the whole thing as a Dokploy Compose template.

The Problem

Spinning up NATS sounds easy until you actually need it in production:

nats.conf syntax is fine, but plumbing env vars through Docker Compose takes trial and error
Browser clients need the WebSocket port exposed through a reverse proxy with TLS
The monitoring endpoint on port 8222 is wide open by default
Every tutorial stops at "it runs locally"—nothing covers a real self-hosted deploy

The Solution: dokploy-nats

A single Git repo you point Dokploy at. It gives you NATS 2.10 with JetStream, token auth, WebSocket, and a monitoring endpoint—all driven by environment variables, all routed through Traefik.

services:
  nats:
    image: nats:2.10.24-alpine
    command: ["-c", "/etc/nats/nats.conf"]
    volumes: [./nats.conf:/etc/nats/nats.conf:ro, nats-data:/data]

That's the core. Everything else is environment variables you set in the Dokploy UI.

How It Works

Dokploy clones the repo and runs docker-compose up with your env vars injected
nats.conf interpolates env vars at startup—server name, auth token, JetStream limits, ports
Traefik labels route traffic to nats-monitor.yourdomain (HTTP dashboard) and nats-ws.yourdomain (WebSocket)
A named volume persists JetStream data so streams survive container restarts

No bash scripts, no manual nats-server flags, no hand-rolled Compose files.

Get Started in 5 Minutes

You configure:

NATS_AUTH_TOKEN — generate with openssl rand -base64 32
NATS_JS_MAX_FILE — how much disk JetStream can use (e.g. 10G)
NATS_WS_PORT — WebSocket port behind Traefik (e.g. 8080)
Traefik domains — nats-monitor.example.com and nats-ws.example.com
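
If openssl is not available where you generate secrets, the same kind of token (32 random bytes, base64-encoded) can be produced with the Python standard library. This is an equivalent sketch, not part of the template:

```python
import base64
import secrets


def generate_token(num_bytes: int = 32) -> str:
    """Base64-encode num_bytes of cryptographically secure random data."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


token = generate_token()
print(f"NATS_AUTH_TOKEN={token}")
```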

What You Get

Feature	Port	Purpose
Native TCP	4222	Standard NATS clients (Go, Node, Python)
HTTP monitoring	8222	Health checks, connection stats, JetStream
WebSocket	8080	Browser clients, mobile apps
JetStream storage	—	Persistent streams, KV, object store

Connecting from a Client

You can smoke-test the deployment from the nats CLI. Save a context that points at the WebSocket endpoint and select it, then subscribe and publish:

nats context save dokploy \
  --server wss://nats-ws.yourdomain.com \
  --token "$NATS_AUTH_TOKEN" \
  --select
nats sub demo  # in one terminal
nats pub demo "hello"  # in another

The repo also includes a working Node.js example with Fastify and a worker using NATS request/reply over JetStream. The examples/node/ folder demonstrates the request/reply pattern between an HTTP API and background workers, streaming execution events back to the browser over WebSocket.

Why This Works

Env vars, not hardcoded config — same image, same compose file, different deployments
Traefik labels included — TLS and routing handled by Dokploy's built-in proxy
JetStream out of the box — durable streams, KV store, no extra setup
Healthcheck baked in — wget /healthz so Dokploy knows when to restart it
WebSocket native — browser clients work without an extra bridge

GitHub: github.com/quochuydev/dokploy-nats

What's your go-to message broker for side projects—NATS, Redis Pub/Sub, or something heavier? I'd love to hear what's working for you.

Six days, six seconds: a CI test against semantic drift in an AI agent

Michel Faure — Mon, 11 May 2026 10:00:04 +0000

The morning I turned the screen

In early April, my bot Rembrandt already knew how to navigate the ERP. Eighteen tools wired in, multi-turn up to three rounds: it could find a student by name, list a workshop's unpaid items, open a course record. When asked "count the active enrolments at Maisons-Laffitte", it delivered. When asked "what is the outstanding amount to collect per workshop for the current year", it floundered, recycled name-lookup tools, and ended up pointing to an admin page nobody consulted. The bot could not answer composite analytical questions, and I knew it.

Friday 18 April, ten thirty. Françoise swivels on her chair from her three-screen cockpit, the time-clock Excel on the left, Sage on the right, and calls over the partition: "Michel, for the ones on CCF this year, how much is left to collect by June?" I don't have the tool in the bot. I know it before she finishes her sentence. I open the Supabase SQL Editor tab on my machine and type the query by hand: a join across inscriptions × echeances_inscription × contacts, a filter on payment method, the sum of montant_prevu minus montant_paye over the open instalments. Twenty seconds. I turn the screen. She squints, reads the figure, writes it on her Post-it, and says: "Right, that's it." She swivels back to Sage. I close the tab without a word.

The trigger

On the evening of Sunday 20 April, I come across Databricks' announcement of Genie Agent Mode. I skim it. One sentence is enough: plan iteratively, run multiple SQL queries, learn from each result, deliver comprehensive reports. I close the tab knowing I will code it the following weekend.

It was the right design. A semantic layer that describes the tables to the model, a planner that writes the SQL, a validator that filters it before execution, a commentator that renders the answer to the user in French. Nothing new, except that with Claude Code I could build it properly for my own context in two weeks. I wrote ADR-0020 the following Monday, and we were off.

The build

Phase 1 set up the semantic layer in TypeScript, not YAML. Seven whitelisted tables, one per file, typed against Database['public']['Tables'], with columns described in business language, canonical metrics, and declared joins. TS typing gives two things YAML does not: safe refactoring when the schema moves, and a compile error if the contract drifts from a column name. A single registry is consumed by the pipeline.

// lib/analytics/semantic/tables/echeances_inscription.ts (state before the 26/04 fix)
columns: {
  statut: {
    type: 'text',
    description:
      "Payment status: `encaisse` (cash received), `a_payer`, `en_retard`, `annule`",
    refAdr: ['ADR-0015'],
  },
},
metrics: {
  ca_encaisse: {
    formula: "SUM(montant_paye) FILTER (WHERE statut = 'encaisse')",
    description: 'Revenue actually received in cash (ADR-0015 cash model).',
  },
  reste_a_encaisser: {
    formula:
      "SUM(montant_prevu - COALESCE(montant_paye,0)) " +
      "FILTER (WHERE statut IN ('a_payer','en_retard'))",
    description: 'Open receivables.',
  },
},

Phase 2 locked the database down. A Postgres role agent_readonly with strict SELECT on the seven tables, plus an application-level SQL validator (lib/analytics/sql-validator.ts) built on node-sql-parser on top. Belt and braces. The validator rejects DML and anything outside the whitelist, and requires the tenantFilter via the site_filter claim of the JWT. Twenty out of twenty tests green.

I could have stopped there. I wanted to measure.

Phase 3 routed everything: Sonnet 4.6 for the plan via tool use, Haiku 4.5 for the post-execution commentary. Haiku bills output at a fifth of Sonnet's rate on standard French, and p50 drops from fifteen to twelve seconds.

At this stage, I felt I had done clean work. It was precisely at this stage that I set a trap I did not see for six days.

The silent trap

Smoke test of the ten eval-set questions, 26/04, early afternoon. Question eight: "how much is left to collect per workshop for the 2025-2026 year". Sonnet plans, the validator accepts, the agent_query_run RPC comes back green, Haiku writes the commentary in correct French. No exception, no Sentry warning. Tick, on to question nine.

What I did not look at in the moment, because nothing prompted me to, was the value of result_row_count in agent_runs for that specific run.

-- generated by Sonnet 4.6, validated by node-sql-parser, executed by agent_readonly
SELECT c.atelier,
       SUM(e.montant_prevu - COALESCE(e.montant_paye, 0))
         FILTER (WHERE e.statut IN ('a_payer', 'en_retard')) AS reste_a_encaisser
FROM echeances_inscription e
JOIN contacts c ON c.id = e.contact_id
WHERE c.site = ANY($1::text[])           -- site_filter, JWT claim
  AND c.statut <> 'liste_rouge'
GROUP BY c.atelier
ORDER BY reste_a_encaisser DESC NULLS LAST
LIMIT 1000;

And the Haiku commentary, rendered to the user, which rationalises the absence:

For the 2025-2026 year, the outstanding amount to collect per workshop comes to zero across all sites. This may indicate that this year's direct debits are up to date, or that open instalments are recorded under a different status. For a finer view, see /finance/cash.

The SQL is correct according to the contract. The RPC confirms it. And the contract is wrong.

The hand-typed query

Doubt came in the evening, with a cool head, while rereading the ten runs in /admin/rembrandt/analytics-runs. Three of the ten questions had a result_row_count of zero even though they concerned figures whose order of magnitude I knew. I opened psql and typed the shortest query in the world.

rembrandt=> SELECT statut, COUNT(*) FROM echeances_inscription
            GROUP BY statut ORDER BY 2 DESC;

  statut   | count
-----------+-------
 preleve   |  1630
 planifie  |   158
 annule    |     1
(3 rows)

Three statuses, one thousand seven hundred and eighty-nine rows in total, and only one value (annule) in common with the four I had declared in the semantic layer. No encaisse. No a_payer. No en_retard.

The semantic layer documented encaisse | a_payer | en_retard | annule. The database contained preleve | planifie | annule. The three canonical metrics ca_encaisse, reste_a_encaisser and nb_echeances_en_retard all filtered on values that did not exist. Sonnet did its job, the validator did its job, Postgres did its job, and the answer rendered to the user was rigorously zero, presented in clean French.

The origin of the drift is ridiculous. Phase 1 of the semantic layer had relied on docs/agent-analytique/eval-set-v1.md, a document I had written myself as a set of conceptual intentions. The Postgres migration, put in place weeks earlier following a different line of reasoning (Stripe workflow, direct debit, scheduling), had recorded preleve | planifie | annule. I wrote the semantic layer by looking at the doc instead of querying the database.

The rule

Sculley et al. published a now-canonical paper in 2015, Hidden Technical Debt in Machine Learning Systems. Their notion of configuration debt: a system accumulates debt in the layer that describes it as much as in the code that runs it. The semantic layer of a SQL agent is exactly that layer.

A semantic layer is a second database. It has its schema, its constraints, and like any database it drifts if it is not audited. What the Genie pattern does not eliminate is schema risk. It moves that risk onto the translation layer it introduces, and it makes the error silent because the SQL produced remains valid.

The trap was not in Genie. The trap was in the idea I had formed of my own data.

What you can copy

Seed the enums from the database, not from the doc. With a script that reads the database when the TS module is generated, the contract sticks to the schema without human intervention. The doc remains a writing guide, not a source.

// scripts/sync-semantic-enums.ts: run in pre-commit or in CI
import { admin } from '@/lib/supabase-admin'
import { writeFileSync } from 'node:fs'

const targets = [
  ['echeances_inscription', 'statut'],
  ['inscriptions', 'statut'],
  ['contacts', 'statut'],
] as const

for (const [table, col] of targets) {
  const { data, error } = await admin.from(table).select(col)
  if (error) throw error
  const values = [...new Set(data?.map((r) => r[col]).filter(Boolean))]
  const out = `export const ${table}_${col}_enum = ${JSON.stringify(values)} as const\n`
  writeFileSync(`lib/analytics/semantic/generated/${table}.${col}.ts`, out)
}

Test consistency in CI. The test fails if the layer declares a status the database no longer contains, or vice versa. Six days of drift shrink to six seconds.

// __tests__/semantic-drift.test.ts
import { describe, it, expect } from 'vitest'
import { semanticTables } from '@/lib/analytics/semantic'
import { admin } from '@/lib/supabase-admin'

describe('semantic layer drift', () => {
  for (const table of semanticTables) {
    for (const [col, def] of Object.entries(table.columns)) {
      if (!def.enum) continue
      it(`${table.name}.${col} matches DB`, async () => {
        const { data, error } = await admin.from(table.name).select(col)
        expect(error).toBeNull()
        const real = new Set(data?.map((r) => r[col]).filter(Boolean))
        // Every value present in the database must be declared in the layer...
        for (const v of real) expect(def.enum).toContain(v)
        // ...and every declared value must still exist in the database.
        for (const v of def.enum) expect(real.has(v)).toBe(true)
      })
    }
  }
})

Surface agent_runs.result_row_count = 0 in an admin page with a rolling seven-day filter. The table is already there; it only needs to be read. Plot the daily share of zero-row runs and the drift becomes visible at a glance.

If you maintain a semantic layer in TS on Postgres, the test above plugs in in under an hour and tells you immediately where you are lying to your agent. On Rembrandt, that signal did not exist before that Friday.

Companion code: rembrandt-samples/semantic-layer-drift/ (enum seed script, Vitest drift test, and agent_runs schema with a zero-row canary index), MIT, ready to copy.

PDF API is live on Forgelab

Forgelab Africa — Mon, 11 May 2026 10:00:02 +0000

We just shipped the Forgelab PDF API — a fast, affordable REST API for developers who need to handle PDF files without the hassle.

What it does:

Merge multiple PDFs into one
Split PDFs by page ranges
Compress PDFs to reduce file size
Convert PDFs to images (PNG/JPEG)

Pricing: Starts at $5/month for 100 calls/month. No hidden fees.

Quick start:

curl -X POST https://www.forgelab.africa/api/pdf/merge \
  -H "X-API-Key: your_key" \
  -F "files=@doc1.pdf" -F "files=@doc2.pdf"

Building a self-healing cron system with pg_cron and Supabase edge functions

Domonique Luchin — Mon, 11 May 2026 10:00:02 +0000

I run 6 AI businesses from a single VPS. When your entire operation depends on automated tasks running perfectly, you learn to build systems that fix themselves before you wake up to angry customers.

Here's how I built a cron system that monitors itself and recovers from failures automatically using pg_cron and Supabase Edge Functions.

Why I needed this

My Load Bearing Empire processes thousands of AI agent calls daily. Lead scoring runs every 15 minutes. Data sync happens hourly. Payment processing triggers every 30 minutes.

A single failed cron job costs me real money. I've been burned by silent failures too many times.

Most developers rely on external monitoring services. I prefer owning my infrastructure. This system costs me $0 in additional subscriptions and runs entirely within Supabase.

The architecture

Three components work together:

pg_cron schedules and executes jobs
Edge Functions handle the actual business logic
Health monitoring table tracks job status and triggers recovery

The key insight: every cron job reports its status to a central monitoring table. If a job fails or doesn't report in, the system automatically retries and alerts me.

Setting up the foundation

First, enable pg_cron in your Supabase project:

-- Run this in your SQL editor
CREATE EXTENSION IF NOT EXISTS pg_cron;
-- pg_net provides net.http_post, used below to call Edge Functions
CREATE EXTENSION IF NOT EXISTS pg_net;

-- Create the monitoring table
CREATE TABLE cron_health (
  id SERIAL PRIMARY KEY,
  -- UNIQUE is required for the upserts that use onConflict: 'job_name'
  job_name TEXT NOT NULL UNIQUE,
  last_run TIMESTAMP WITH TIME ZONE,
  last_success TIMESTAMP WITH TIME ZONE,
  -- 'retry_triggered' is written by the health check function
  status TEXT CHECK (status IN ('running', 'success', 'failed', 'retry_triggered')),
  error_message TEXT,
  retry_count INTEGER DEFAULT 0,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Index for fast lookups (job_name is already indexed by its UNIQUE constraint)
CREATE INDEX idx_cron_health_last_run ON cron_health(last_run);

Creating a self-reporting Edge Function

Here's an Edge Function that reports its own health status:

// supabase/functions/process-leads/index.ts
import { serve } from "https://deno.land/std@0.168.0/http/server.ts"
import { createClient } from 'https://esm.sh/@supabase/supabase-js@2'

serve(async (req) => {
  const jobName = 'process-leads'
  const supabase = createClient(
    Deno.env.get('SUPABASE_URL') ?? '',
    Deno.env.get('SUPABASE_SERVICE_ROLE_KEY') ?? ''
  )

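  try {
    // Mark the job as running. retry_count is left untouched here so the
    // failure branch below can still read the previous value.
    // Note: onConflict: 'job_name' needs a UNIQUE constraint on cron_health.job_name.
    await supabase
      .from('cron_health')
      .upsert({
        job_name: jobName,
        last_run: new Date().toISOString(),
        status: 'running'
      }, { onConflict: 'job_name' })

    // Your actual business logic here (processLeads is your own function, defined elsewhere)
    const result = await processLeads()

    // Report success and reset the retry counter
    await supabase
      .from('cron_health')
      .upsert({
        job_name: jobName,
        last_run: new Date().toISOString(),
        last_success: new Date().toISOString(),
        status: 'success',
        error_message: null,
        retry_count: 0
      }, { onConflict: 'job_name' })

    return new Response(JSON.stringify({ success: true, processed: result.count }))

  } catch (error) {
    // `error` is `unknown` in TypeScript, so narrow it before reading .message
    const message = error instanceof Error ? error.message : String(error)

    // Read the current retry count so the failure report can increment it
    const { data: current } = await supabase
      .from('cron_health')
      .select('retry_count')
      .eq('job_name', jobName)
      .maybeSingle()

    // Report failure
    await supabase
      .from('cron_health')
      .upsert({
        job_name: jobName,
        last_run: new Date().toISOString(),
        status: 'failed',
        error_message: message,
        retry_count: (current?.retry_count ?? 0) + 1
      }, { onConflict: 'job_name' })

    return new Response(JSON.stringify({ error: message }), { status: 500 })
  }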
})

The self-healing mechanism

This monitoring function runs every 5 minutes and handles recovery:

-- Create the health check function
CREATE OR REPLACE FUNCTION check_cron_health()
RETURNS void AS $$
DECLARE
  job_record RECORD;
  function_url TEXT;
BEGIN
  -- Find jobs that haven't reported success in their expected interval
  FOR job_record IN 
    SELECT job_name, last_run, last_success, retry_count
    FROM cron_health
    WHERE (
      -- Jobs that should run every 15 minutes but haven't succeeded in 20 minutes
      (job_name LIKE '%leads%' AND last_success < NOW() - INTERVAL '20 minutes') OR
      -- Jobs that should run hourly but haven't succeeded in 75 minutes  
      (job_name LIKE '%sync%' AND last_success < NOW() - INTERVAL '75 minutes')
    )
    AND retry_count < 3
  LOOP
    -- Build the Edge Function URL
    function_url := 'https://your-project.supabase.co/functions/v1/' || job_record.job_name;

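    -- Trigger retry via HTTP request (pg_net expects jsonb for headers and body)
    PERFORM net.http_post(
      url := function_url,
      headers := jsonb_build_object(
        'Authorization', 'Bearer ' || current_setting('app.service_role_key'),
        'Content-Type', 'application/json'
      ),
      body := '{}'::jsonb
    );

    -- Log the retry attempt; the timestamp suffix keeps each retry row distinct.
    -- The status CHECK constraint must allow 'retry_triggered' for this insert to succeed.
    INSERT INTO cron_health (job_name, last_run, status, retry_count)
    VALUES (
      job_record.job_name || '_retry_' || to_char(NOW(), 'YYYYMMDDHH24MISS'),
      NOW(), 'retry_triggered', job_record.retry_count + 1
    );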

  END LOOP;
END;
$$ LANGUAGE plpgsql;
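
The selection rule inside check_cron_health() can also be written as a small pure function, which is easier to unit-test than PL/pgSQL. This is an illustrative sketch, not part of the original system: the HealthRow shape, the maxSilenceMinutes map and the data-sync job name are assumptions, while the 20 and 75 minute thresholds and the cap of three retries come from the SQL above. Unlike the SQL, it also treats a job that has never succeeded as overdue.

```typescript
// Hypothetical row shape, mirroring the cron_health columns used in this post.
interface HealthRow {
  jobName: string;
  lastSuccess: Date | null;
  retryCount: number;
}

// Assumed mapping: how long each job may go without a success before a retry.
const maxSilenceMinutes: Record<string, number> = {
  "process-leads": 20, // scheduled every 15 minutes
  "data-sync": 75, // scheduled hourly (job name is an assumption)
};

const MAX_RETRIES = 3;

// Returns the names of jobs that are overdue and still have retries left.
function jobsNeedingRetry(rows: HealthRow[], now: Date): string[] {
  return rows
    .filter((row) => {
      const limit = maxSilenceMinutes[row.jobName];
      if (limit === undefined) return false; // unknown job: leave it alone
      if (row.retryCount >= MAX_RETRIES) return false; // retries exhausted
      if (row.lastSuccess === null) return true; // never succeeded
      const silentMinutes = (now.getTime() - row.lastSuccess.getTime()) / 60000;
      return silentMinutes > limit;
    })
    .map((row) => row.jobName);
}
```

Passing `now` in as a parameter, rather than calling `new Date()` inside, is what makes the rule deterministic under test.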

Scheduling everything

Now wire it all together with pg_cron:

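-- Schedule your business logic
-- net.http_post takes jsonb headers and body, passed here as named arguments;
-- replace service_role_key with your actual key
SELECT cron.schedule('process-leads', '*/15 * * * *',
  $$SELECT net.http_post(
      url := 'https://your-project.supabase.co/functions/v1/process-leads',
      headers := '{"Authorization": "Bearer service_role_key", "Content-Type": "application/json"}'::jsonb,
      body := '{}'::jsonb
  )$$);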

-- Schedule the health monitor
SELECT cron.schedule('health-check', '*/5 * * * *', 'SELECT check_cron_health()');

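-- Clean up old retry records weekly (main job rows are kept: they are upserted in place,
-- so their created_at never changes and an unfiltered delete would remove live jobs)
SELECT cron.schedule('cleanup-health', '0 2 * * 0', 
  'DELETE FROM cron_health WHERE job_name LIKE ''%retry%'' AND created_at < NOW() - INTERVAL ''30 days''');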

Monitoring dashboard

Query this to see your system health:

-- Current status of all jobs
SELECT 
  job_name,
  status,
  last_success,
  EXTRACT(EPOCH FROM (NOW() - last_success))/60 as minutes_since_success,
  retry_count,
  error_message
FROM cron_health 
WHERE job_name NOT LIKE '%retry%'
ORDER BY last_run DESC;

Real results

Since implementing this system 3 months ago:

Zero silent failures
4 automatic recoveries from network timeouts
99.8% job success rate
2 minutes average recovery time

You get infrastructure that fixes itself. Your cron jobs report their health. Failed jobs retry automatically. You sleep better knowing your systems won't fail silently.

Build systems that work without you watching them.

Six days, six seconds: a CI test against semantic-layer drift on an AI agent

Michel Faure — Mon, 11 May 2026 10:00:02 +0000

The morning I turned the screen

Early April, my Rembrandt bot already knew how to navigate the ERP. Eighteen tools wired in, multi-turn up to three rounds, it could find a student by name, list the unpaid invoices for a workshop, open a course record. When you asked it "count me the active students at Maisons-Laffitte", it delivered. When you asked it "what's the outstanding amount per workshop for the current year", it floundered, recycling name-search tools and ending up redirecting to an admin page nobody opened. The bot couldn't answer compound analytical questions, and I knew it.

Friday April 18th, ten thirty. Françoise pivots on her chair from her three-screen cockpit, the time-clock spreadsheet on her left, Sage on her right, and calls out over the partition: « Michel, sur ceux qui sont en CCF cette année, il en reste combien à encaisser d'ici juin ? » — Michel, the students on a CCF training plan this year, how much is left to collect before June? I don't have the tool in the bot. I know it before she's finished her sentence. I open the Supabase SQL Editor tab on my machine, type the query by hand, join inscriptions × echeances_inscription × contacts, filter on payment mode, sum montant_prevu minus montant_paye on open instalments. Twenty seconds. I turn the screen. She squints, reads the number, jots it on her sticky note, and drops: « Bon allez, c'est ça. » — Right, that's it. She pivots back to Sage. I close the tab without a word.

The trigger

Sunday April 20th in the evening, I stumble on the Databricks announcement for Genie Agent Mode. I skim it. One sentence does it: plan iteratively, run multiple SQL queries, learn from each result, deliver comprehensive reports. I close the tab knowing I'm going to code that the following weekend.

That was the right shape. A semantic layer that describes the tables to the model, a planner that writes the SQL, a validator that filters it before execution, a commenter that renders the answer in French to the user. Nothing original, except that with Claude Code I could lay it down cleanly in fifteen days for my context. I wrote ADR-0020 the next Monday, off we went.

The build

Phase 1 laid down the semantic layer in TypeScript, not YAML. Seven whitelisted tables, one per file, typed against Database['public']['Tables'], columns in business language, canonical metrics, declared joins. TS typing buys two things YAML doesn't: safe refactoring when the schema moves, a compile-time error when the contract drifts off a column name. A single registry consumed by the pipeline.

// lib/analytics/semantic/tables/echeances_inscription.ts — pre-fix state, April 26
columns: {
  statut: {
    type: 'text',
    description:
      "Payment status: `encaisse` (cash received), `a_payer`, `en_retard`, `annule`",
    refAdr: ['ADR-0015'],
  },
},
metrics: {
  ca_encaisse: {
    formula: "SUM(montant_paye) FILTER (WHERE statut = 'encaisse')",
    description: 'Cash revenue actually received (ADR-0015 cash model).',
  },
  reste_a_encaisser: {
    formula:
      "SUM(montant_prevu - COALESCE(montant_paye,0)) " +
      "FILTER (WHERE statut IN ('a_payer','en_retard'))",
    description: 'Open receivables.',
  },
},
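
The compile-time guarantee mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the Database type here is a hand-written, trimmed stand-in for the Supabase-generated one, and SemanticTable, ColumnDef and RowOf are hypothetical names, not the project's actual types.

```typescript
// Hypothetical, trimmed stand-in for the Supabase-generated `Database` type.
type Database = {
  public: {
    Tables: {
      echeances_inscription: {
        Row: { statut: string; montant_prevu: number; montant_paye: number | null };
      };
    };
  };
};

type TableName = keyof Database["public"]["Tables"];
type RowOf<T extends TableName> = Database["public"]["Tables"][T]["Row"];

interface ColumnDef {
  type: "text" | "numeric";
  description: string;
}

// Only keys that exist on the generated row type are accepted as columns.
interface SemanticTable<T extends TableName> {
  name: T;
  columns: { [K in keyof RowOf<T>]?: ColumnDef };
}

const echeances: SemanticTable<"echeances_inscription"> = {
  name: "echeances_inscription",
  columns: {
    statut: { type: "text", description: "Payment status" },
    montant_prevu: { type: "numeric", description: "Scheduled amount" },
    // statu: { type: "text", description: "typo" }, // would not compile: unknown column
  },
};
```

Note what this does and does not cover: the compiler checks that `statut` is a real column, but the row type only says it is a `string`, so nothing here constrains which status values the description or the metric formulas may mention.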

Phase 2 locked the database. A Postgres agent_readonly role with strict SELECT on the seven tables, an application-side SQL validator (lib/analytics/sql-validator.ts) on top of node-sql-parser. Two belts. The validator refuses DML, anything off-whitelist, and requires the tenantFilter via the site_filter JWT claim. Twenty tests out of twenty green.

I could have stopped there. I wanted to measure.

Phase 3 routed the whole thing: Sonnet 4.6 for the plan in tool-use, Haiku 4.5 for the post-execution comment. Haiku bills output five times less than Sonnet on standard French, p50 moves from fifteen to twelve seconds.

At that stage I had the feeling of clean work. That's exactly the stage at which I laid a trap I wouldn't see for six days.

The silent trap

Smoke test of the ten eval-set questions, April 26th early afternoon. Question number eight, "how much is left to collect per workshop for the 2025-2026 year". Sonnet plans, the validator accepts, the agent_query_run RPC comes back green, Haiku writes the comment in correct French. No exception, no Sentry warning. Tick, question nine.

What I didn't look at in the moment, because nothing pushed me to, was the value of result_row_count in agent_runs for that specific run.

-- generated by Sonnet 4.6, validated by node-sql-parser, executed by agent_readonly
SELECT c.atelier,
       SUM(e.montant_prevu - COALESCE(e.montant_paye, 0))
         FILTER (WHERE e.statut IN ('a_payer', 'en_retard')) AS reste_a_encaisser
FROM echeances_inscription e
JOIN contacts c ON c.id = e.contact_id
WHERE c.site = ANY($1::text[])           -- site_filter, JWT claim
  AND c.statut <> 'liste_rouge'
GROUP BY c.atelier
ORDER BY reste_a_encaisser DESC NULLS LAST
LIMIT 1000;

And the Haiku comment, rendered to the user, rationalising the absence:

For the 2025-2026 year, the outstanding amount per workshop comes out at zero across all sites. This may indicate that the year's direct debits are up to date, or that open instalments are recorded under a different status. For a finer view, see /finance/cash.

The SQL is correct against the contract. The RPC confirms it. And the contract is wrong.

The query, by hand

The doubt came in the evening, cold, rereading the ten runs in /admin/rembrandt/analytics-runs. Three out of ten questions had result_row_count at zero, on numbers I knew the order of magnitude of. I opened psql, typed the shortest query in the world.

rembrandt=> SELECT statut, COUNT(*) FROM echeances_inscription
            GROUP BY statut ORDER BY 2 DESC;

  statut   | count
-----------+-------
 preleve   |  1630
 planifie  |   158
 annule    |     1
(3 rows)

Three statuses, one thousand seven hundred and eighty-nine rows total, and only one value, annule, in common with the four I had declared in the semantic layer. No encaisse. No a_payer. No en_retard.

The semantic layer documented encaisse | a_payer | en_retard | annule. The database held preleve | planifie | annule. The three canonical metrics ca_encaisse, reste_a_encaisser, nb_echeances_en_retard were all filtering on values that didn't exist. Sonnet was doing its job, the validator was doing its job, Postgres was doing its job, and the answer rendered to the user was rigorously zero, presented in clean French.
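
The mismatch is a plain set difference in both directions. A minimal sketch, using the declared and actual values quoted above (diffEnum and EnumDrift are hypothetical names introduced for illustration):

```typescript
interface EnumDrift {
  declaredButAbsent: string[]; // in the semantic layer, not in the database
  presentButUndeclared: string[]; // in the database, not in the semantic layer
}

// Compares the values a semantic layer declares with the values the table holds.
function diffEnum(declared: readonly string[], actual: readonly string[]): EnumDrift {
  const declaredSet = new Set(declared);
  const actualSet = new Set(actual);
  return {
    declaredButAbsent: declared.filter((v) => !actualSet.has(v)),
    presentButUndeclared: actual.filter((v) => !declaredSet.has(v)),
  };
}

const drift = diffEnum(
  ["encaisse", "a_payer", "en_retard", "annule"], // semantic layer
  ["preleve", "planifie", "annule"], // SELECT DISTINCT statut
);
// drift.declaredButAbsent    -> ["encaisse", "a_payer", "en_retard"]
// drift.presentButUndeclared -> ["preleve", "planifie"]
```

Every value in declaredButAbsent is a filter that can only ever match zero rows, which is why the three metrics came back empty without raising an error.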

The origin of the drift is ridiculous. Phase 1 of the semantic layer had been built on docs/agent-analytique/eval-set-v1.md, a document I had written myself in conceptual intentions. The Postgres migration, laid weeks earlier on a different reasoning (Stripe workflow, direct debit, scheduling), had recorded preleve | planifie | annule. I wrote the semantic layer looking at the documentation instead of querying the database.

The rule

Sculley et al. published a paper in 2015 that became canonical, Hidden Technical Debt in Machine Learning Systems. Their notion of configuration debt: a system accrues debt in the layer that describes it, just as much as in the code that runs it. The semantic layer of a SQL agent is exactly that layer.

A semantic layer is a second database. It has its schema, its constraints, and like any database it drifts if you don't audit it. What the Genie pattern does not eliminate is schema risk. It just shifts it onto the translation layer it introduces, and it makes the error silent because the SQL produced stays valid.

The trap wasn't in Genie. The trap was in the picture I had built of my own data.

What you can copy

Seed the enums from the database, not from the documentation. A script that reads the database at TS-module generation time, and the contract sticks to the schema with no human in the loop. The documentation stays a writing guide, not a source.

// scripts/sync-semantic-enums.ts — run in pre-commit or in CI
import { admin } from '@/lib/supabase-admin'
import { writeFileSync } from 'node:fs'

const targets = [
  ['echeances_inscription', 'statut'],
  ['inscriptions', 'statut'],
  ['contacts', 'statut'],
] as const

for (const [table, col] of targets) {
  const { data, error } = await admin.from(table).select(col)
  if (error) throw error
  const values = [...new Set(data?.map((r) => r[col]).filter(Boolean))]
  const out = `export const ${table}_${col}_enum = ${JSON.stringify(values)} as const\n`
  writeFileSync(`lib/analytics/semantic/generated/${table}.${col}.ts`, out)
}

Test consistency in CI. The test fails if the layer declares a status the database no longer carries, or vice versa. Six days of drift collapse into six seconds.

// __tests__/semantic-drift.test.ts
import { describe, it, expect } from 'vitest'
import { semanticTables } from '@/lib/analytics/semantic'
import { admin } from '@/lib/supabase-admin'

describe('semantic layer drift', () => {
  for (const table of semanticTables) {
    for (const [col, def] of Object.entries(table.columns)) {
      if (!def.enum) continue
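      it(`${table.name}.${col} matches DB`, async () => {
        // Note: PostgREST caps rows per request (1000 by default on hosted Supabase),
        // so on a large table this samples values unless you paginate or use an RPC.
        const { data, error } = await admin.from(table.name).select(col)
        expect(error).toBeNull()
        const real = [...new Set(data?.map((r) => r[col]).filter(Boolean))]
        // Every value in the database must be declared in the layer...
        for (const v of real) expect(def.enum).toContain(v)
        // ...and every declared value must still exist in the database.
        for (const v of def.enum) expect(real).toContain(v)
      })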
    }
  }
})

Surface agent_runs.result_row_count = 0 in an admin page with a rolling seven-day filter. The table is already there, it just needs to be read. A daily share-of-zero-rows graph, and the drift shows up to the eye.
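
The aggregation behind such a graph is small. Here is a sketch under assumptions: result_row_count comes from the post, while the created_at column, the AgentRun and DailyShare shapes and the function name are hypothetical.

```typescript
// Hypothetical shape: `result_row_count` is from the post, `created_at` is assumed.
interface AgentRun {
  created_at: string; // ISO 8601 timestamp
  result_row_count: number;
}

interface DailyShare {
  day: string; // YYYY-MM-DD, in UTC
  runs: number;
  zeroRowShare: number; // between 0 and 1
}

// Groups runs from the last `windowDays` days by UTC day and computes
// the fraction of runs per day that returned zero rows.
function zeroRowShareByDay(runs: AgentRun[], now: Date, windowDays = 7): DailyShare[] {
  const cutoff = now.getTime() - windowDays * 24 * 60 * 60 * 1000;
  const byDay = new Map<string, { runs: number; zero: number }>();
  for (const run of runs) {
    const ts = new Date(run.created_at).getTime();
    if (ts < cutoff || ts > now.getTime()) continue; // outside the rolling window
    const day = new Date(ts).toISOString().slice(0, 10);
    const bucket = byDay.get(day) ?? { runs: 0, zero: 0 };
    bucket.runs += 1;
    if (run.result_row_count === 0) bucket.zero += 1;
    byDay.set(day, bucket);
  }
  return Array.from(byDay.entries())
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([day, b]) => ({ day, runs: b.runs, zeroRowShare: b.zero / b.runs }));
}
```

A share is more useful than a raw count here, because some questions legitimately return no rows; what signals drift is a sustained rise in the proportion.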

If you maintain a semantic layer in TS on Postgres, the test above wires in in under an hour and tells you immediately where you're lying to your agent. On Rembrandt that signal didn't exist before that Friday.

Companion code: rembrandt-samples/semantic-layer-drift/ — enum sync script, Vitest drift test, and agent_runs schema with the zero-row canary index, MIT, copy-pastable.

Mastering Gemini Nano: Building a High-Performance On-Device AI Chat UI with Jetpack Compose

Programming Central — Mon, 11 May 2026 10:00:00 +0000

The landscape of mobile development is shifting beneath our feet. For years, the "Smart" in smartphone relied almost exclusively on the cloud. We sent a request, waited for a server in a distant data center to process it, and received a response. But with the advent of Gemini Nano and Google’s AICore, the intelligence is moving directly onto the silicon in our pockets.

Building a Chat UI for an on-device Large Language Model (LLM) like Gemini Nano is not just another exercise in creating a list of text bubbles. It is a fundamental departure from the traditional CRUD (Create, Read, Update, Delete) applications we’ve built for a decade. It requires a deep understanding of hardware orchestration, asynchronous data streams, and state management that can handle the heavy lifting of generative AI without freezing the user interface.

In this guide, we will dive deep into the architectural paradigms of on-device AI, explore why AICore is a game-changer for Android developers, and implement a production-grade chat interface using Jetpack Compose and Kotlin Coroutines.
(This article is based on the ebook On-Device GenAI with Android Kotlin)

The Architectural Paradigm of On-Device AI Interfaces

When you build a standard chat app—think WhatsApp or Slack—the data flow is discrete. You send a message, it hits a database, and a notification triggers a fetch on the other end. In the world of Generative AI (GenAI), this model breaks down.

The Challenge of the "Token Stream"

The core theoretical challenge in GenAI is managing what we call the Token Stream. LLMs do not generate sentences; they generate text one token at a time. If you were to wait for Gemini Nano to finish generating a 500-word response before displaying it, the user would be staring at a "Thinking..." spinner for five to ten seconds. In the world of modern UX, that is an eternity.

To solve this, your UI must be designed as a reactive sink. It needs to be capable of receiving a continuous, high-frequency stream of data and updating the display in real-time. This ensures a sense of immediacy, making the AI feel like it is "typing" its thoughts as they occur.

AICore: The System-Level AI Provider

Why can't we just bundle a model file in our APK and call it a day? The answer lies in the constraints of mobile hardware. LLMs are resource monsters. They demand massive amounts of RAM (often several gigabytes) and require direct, low-level access to the Neural Processing Unit (NPU).

If every app on a user’s phone bundled its own version of Gemini Nano, the device’s storage would vanish, and the RAM would be so fragmented that the OS would constantly kill background processes. Google’s solution is AICore.

AICore acts as a system-level service, much like Google Play Services. It provides several critical advantages for the modern Android developer:

Shared Memory Architecture: The model is loaded into system memory once. Whether the user is using your app, a notes app, or a messaging app, they all interface with the same resident model, drastically reducing the total memory footprint.
Seamless Model Updates: Google can refine the model weights, improve safety filters, and optimize performance via Play Store updates to AICore. As a developer, you don't need to push a new APK just because the underlying LLM got smarter.
Hardware Orchestration: This is perhaps the most vital role. AICore manages the handoff between the CPU, GPU, and NPU. It balances "tokens-per-second" against thermal throttling. It knows when to push the NPU to its limit and when to scale back to prevent the user's phone from becoming uncomfortably hot.

The Model Loading Analogy: It’s Not Just a Class

Loading a local LLM is a "heavy lift." To help visualize this, think of the initial loading process as being similar to a Room database migration.

When you perform a complex database migration, you are dealing with disk I/O, schema validation, and data integrity checks. If you do this on the main thread, the app hangs. Loading Gemini Nano involves allocating large contiguous blocks of memory, verifying model checksums, and "warming up" the NPU. If the model is not already resident in memory, the first request will experience a "cold start" latency.

Your UI must explicitly account for this. A professional AI app isn't just Loading or Success. It needs a state machine that handles Initializing, ModelLoading, Ready, and InferenceInProgress.

Connecting Modern Kotlin to AI Workflows

To implement this architecture, we leverage the latest features of Kotlin 2.x. These tools aren't just syntactic sugar; they are the engine that makes high-performance AI possible on mobile.

1. Kotlin Flow for Real-Time Streaming

Since Gemini Nano emits tokens incrementally, Flow is the non-negotiable choice for data transport. Specifically, we use Flow<String> to stream the response. Unlike a static List, a Flow allows the UI to append text to the last message bubble in real-time.

2. Coroutines and Dispatcher Management

AI inference is computationally expensive. While AICore handles the heavy lifting, the coordination of prompts and the processing of the resulting stream must happen on Dispatchers.Default. If you attempt to process these tokens on the Main thread, you will drop frames, and your beautiful Compose animations will stutter.

3. Kotlin Serialization for Prompt Engineering

Modern AI development relies heavily on structured prompts. Using kotlinx.serialization, we can define "Prompt Templates" as data classes. This ensures that the input sent to Gemini Nano is consistent, type-safe, and follows the specific formatting required for the model to understand context.

The State Machine of a Chat UI

Before we look at the code, we must define the state. A GenAI Chat UI is best represented as a Finite State Machine (FSM):

IDLE: The user is typing. The system is waiting.
PROMPTING: The request is sent to AICore. The UI shows a "Thinking..." indicator.
STREAMING: Tokens are arriving. The UI is actively appending text to the latest message.
COMPLETED: The LLM has emitted the end_of_turn token. The UI transitions back to a state where the user can send a follow-up.
ERROR: The model failed (e.g., safety filters triggered or Out-of-Memory). The UI must provide a recovery path.
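
The five states above can be written down as a transition function. The transitions themselves are language-agnostic; this sketch uses TypeScript, and the event names (SEND, TOKEN, END_OF_TURN, FAILURE, RETRY) are assumptions chosen for illustration. A Kotlin sealed interface would carry the same table.

```typescript
type ChatState = "IDLE" | "PROMPTING" | "STREAMING" | "COMPLETED" | "ERROR";
// Hypothetical event names; only the states come from the list above.
type ChatEvent = "SEND" | "TOKEN" | "END_OF_TURN" | "FAILURE" | "RETRY";

// Returns the next state, or the current one if the event does not apply.
function nextState(state: ChatState, event: ChatEvent): ChatState {
  if ((state === "IDLE" || state === "COMPLETED") && event === "SEND") return "PROMPTING";
  if (state === "PROMPTING" && event === "TOKEN") return "STREAMING";
  if (state === "STREAMING" && event === "END_OF_TURN") return "COMPLETED";
  if ((state === "PROMPTING" || state === "STREAMING") && event === "FAILURE") return "ERROR";
  if (state === "ERROR" && event === "RETRY") return "IDLE";
  return state; // any other event leaves the state unchanged
}

// A normal turn: the user sends, tokens stream in, the model ends its turn.
const turn: ChatEvent[] = ["SEND", "TOKEN", "TOKEN", "END_OF_TURN"];
const afterTurn = turn.reduce((s, e) => nextState(s, e), "IDLE" as ChatState);
// afterTurn is "COMPLETED"
```

Keeping the transitions in one pure function means an out-of-order event, such as a second SEND while streaming, is ignored instead of corrupting the message list.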

Implementation: The Technical Stack

Let's look at how to build this. We will use Hilt for Dependency Injection to ensure our AI repository is a singleton, preventing multiple instances from attempting to lock the NPU hardware.

Gradle Dependencies

First, ensure your build.gradle.kts is equipped with the necessary libraries for MediaPipe's LLM Inference API (used in this example to load and run the on-device model) and Jetpack Compose.

dependencies {
    // MediaPipe GenAI for Gemini Nano
    implementation("com.google.mediapipe:tasks-genai:0.10.14")

    // Jetpack Compose
    implementation("androidx.compose.ui:ui:1.7.0")
    implementation("androidx.compose.material3:material3:1.2.0")
    implementation("androidx.lifecycle:lifecycle-viewmodel-compose:2.8.0")
    implementation("androidx.lifecycle:lifecycle-runtime-compose:2.8.0")

    // Hilt for Dependency Injection
    implementation("com.google.dagger:hilt-android:2.51")
    kapt("com.google.dagger:hilt-compiler:2.51")

    // Coroutines & Serialization
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.8.0")
    implementation("org.jetbrains.kotlinx:kotlinx-serialization-json:1.6.3")
}

The Data Layer: Hardware-Aware Repository

The repository is where the "magic" happens. It abstracts the MediaPipe LlmInference engine and provides a clean Flow for the ViewModel to consume.

@Singleton
class OnDeviceChatRepository @Inject constructor(
    @ApplicationContext private val context: Context
) {
    private var llmInference: LlmInference? = null

    suspend fun initializeModel(modelPath: String) = withContext(Dispatchers.Default) {
        val options = LlmInference.LlmInferenceOptions.builder()
            .setModelPath(modelPath)
            .setMaxTokens(1024)
            .setTemperature(0.7f)
            .setTopK(40)
            .build()

        llmInference = LlmInference.createFromOptions(context, options)
    }

    fun generateResponseStream(prompt: String): Flow<String> = callbackFlow {
        val inference = llmInference ?: throw IllegalStateException("Model not initialized")

        // Generate response asynchronously to keep the flow non-blocking
        inference.generateResponseAsync(prompt) { partialResult, done ->
            trySend(partialResult)
            if (done) {
                channel.close()
            }
        }

        awaitClose { /* Cleanup resources if necessary */ }
    }.flowOn(Dispatchers.Default)
}

The ViewModel: Orchestrating State

The ViewModel acts as the bridge. It takes user input, updates the UI to show the user's message, and then manages the stream coming back from the AI.

@HiltViewModel
class ChatViewModel @Inject constructor(
    private val repository: OnDeviceChatRepository
) : ViewModel() {

    private val _uiState = MutableStateFlow(ChatUiState())
    val uiState: StateFlow<ChatUiState> = _uiState.asStateFlow()

    fun sendMessage(userText: String) {
        if (userText.isBlank()) return

        // 1. Add user message to the list
        val userMsg = ChatMessage(userText, isUser = true)
        _uiState.update { it.copy(messages = it.messages + userMsg, isTyping = true) }

        viewModelScope.launch {
            var fullAiResponse = ""

            // 2. Collect the stream from the repository
            repository.generateResponseStream(userText)
                .onStart {
                    // Add an empty placeholder for the AI response
                    _uiState.update { it.copy(messages = it.messages + ChatMessage("", isUser = false)) }
                }
                .collect { token ->
                    fullAiResponse += token

                    // 3. Update the last message in the list with the new token
                    _uiState.update { state ->
                        val updatedMessages = state.messages.toMutableList()
                        val lastIdx = updatedMessages.lastIndex
                        updatedMessages[lastIdx] = updatedMessages[lastIdx].copy(text = fullAiResponse)
                        state.copy(messages = updatedMessages)
                    }
                }

            _uiState.update { it.copy(isTyping = false) }
        }
    }
}

The UI Layer: Jetpack Compose Chat Screen

In Compose, we use LazyColumn to render the messages. A key trick here is using LaunchedEffect to auto-scroll to the bottom as the AI "types."

@Composable
fun ChatScreen(viewModel: ChatViewModel) {
    val uiState by viewModel.uiState.collectAsStateWithLifecycle()
    var inputText by remember { mutableStateOf("") }
    val listState = rememberLazyListState()

    // Auto-scroll logic
    LaunchedEffect(uiState.messages.size, uiState.messages.lastOrNull()?.text) {
        if (uiState.messages.isNotEmpty()) {
            listState.animateScrollToItem(uiState.messages.size - 1)
        }
    }

    Column(modifier = Modifier.fillMaxSize().padding(16.dp)) {
        LazyColumn(
            state = listState,
            modifier = Modifier.weight(1f).fillMaxWidth(),
            verticalArrangement = Arrangement.spacedBy(8.dp)
        ) {
            items(uiState.messages) { message ->
                ChatBubble(message)
            }
        }

        Row(verticalAlignment = Alignment.CenterVertically) {
            TextField(
                value = inputText,
                onValueChange = { inputText = it },
                modifier = Modifier.weight(1f),
                placeholder = { Text("Ask Gemini Nano...") }
            )
            IconButton(onClick = {
                viewModel.sendMessage(inputText)
                inputText = ""
            }) {
                Icon(Icons.Default.Send, contentDescription = "Send")
            }
        }
    }
}

Performance Pitfalls to Avoid

Building for on-device AI requires a higher level of discipline than standard app development. Here are the most common pitfalls:

Main Thread Inference: Never, ever call the AI model on the Main thread. Even a small model will block the UI for hundreds of milliseconds per call, which means dropped frames, and a long generation can stall the thread long enough to trigger an "Application Not Responding" (ANR) error.
Memory Management: Local LLMs are heavy. If you are not using AICore and are instead bundling your own TFLite model, you must manually close the Interpreter or LlmInference instance in the ViewModel's onCleared() method to prevent massive native memory leaks.
Ignoring Lifecycle: Use collectAsStateWithLifecycle(). If the user moves the app to the background, you want the UI collection to pause to save battery, even if the AI continues to process the current prompt in the background.
Over-Recomposition: When streaming tokens, the state updates rapidly. Ensure your ChatBubble composables are optimized and use remember for any expensive UI calculations to keep the frame rate smooth.

Conclusion: The New Frontier

Creating a Chat UI with Jetpack Compose for Gemini Nano is more than just a UI task; it's a lesson in modern systems architecture. By leveraging AICore, we move away from the "Cloud-First" mentality and toward a "Privacy-First, Latency-Zero" future.

The combination of Kotlin's reactive streams and Compose's declarative UI provides the perfect foundation for this new era of mobile computing. As on-device NPUs continue to evolve, the gap between what a phone can do and what a server can do will continue to shrink.

Let's Discuss

Given the memory constraints of mobile devices, do you think AICore's shared model approach is the right move, or should developers have the freedom to bundle custom, fine-tuned models despite the storage cost?
How do you see the role of the "Mobile Developer" changing as prompt engineering and local inference become standard parts of the Android SDK?

The concepts and code demonstrated here are drawn directly from the comprehensive roadmap laid out in the ebook
On-Device GenAI with Android Kotlin: Mastering Gemini Nano, AICore, and local LLM deployment using MediaPipe and Custom TFLite models. You can find it here: Leanpub.com

Also check out the other programming and AI ebooks covering Python, TypeScript, C#, Swift, and Kotlin: Leanpub.com

Android Kotlin & AI Masterclass:
Book 1: On-Device GenAI. Mastering Gemini Nano, AICore, and local LLM deployment using MediaPipe and Custom TFLite models.
Book 2: Edge AI Performance. Optimizing hardware acceleration via NPU (Neural Processing Unit), GPU, and DSP. Advanced quantization and model pruning.
Book 3: Android AI Agents. Building autonomous apps that use Tool Calling, Function Injection, and Screen Awareness to perform tasks for the user.
