Measuring the Grain
Two posts in (one, two): we have a directional resistance field and a set of retrodictions that check out. But a theory you can’t measure is philosophy.
Time to get empirical.
decisions, not bits
The fundamental unit of a codebase isn’t the line of code or the token. It’s the decision.
Every line is a fossilized decision. The codebase is a tree of decisions. The complection field describes how those decisions constrain future decisions.
A commit bundles decisions. But most of what changes in a commit is consequence, not root decision:
Root decision: "Errors should be logged, not thrown"
Consequences:
- error_handler.py changes
- 14 call sites change
- tests update
- logging config changes
One decision, many consequences.
The ratio — call it expansion factor — tells you something important. High expansion factor means the decision touches a deep groove. The system is complected around this concept, and changing the concept requires touching everything that depends on it. Low expansion factor means it’s orthogonal to existing structure — the groove doesn’t care about this change.
This is why some “small” changes balloon into massive PRs. It’s not about the size of the intent. It’s about whether the intent aligns with or cuts across the decision tree.
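A crude proxy you could compute per commit, sketched under one loud assumption (one commit ≈ one root decision, so it overcounts when commits bundle several decisions); commits here is the list of per-commit file sets built from git log in the next section:

def expansion_profile(commits):
    """Rough per-commit expansion factors: files touched per commit across history."""
    sizes = sorted(len(files) for files in commits)
    return {
        "median": sizes[len(sizes) // 2],
        "p90": sizes[int(len(sizes) * 0.9)],
        "max": sizes[-1],
    }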
the co-change matrix
Git history is the empirical complection field. Every commit tells you what changed together.
Simplest model: treat files as bits. Each commit is a binary vector — 1 if the file changed, 0 otherwise. The co-change matrix gives you empirical coupling:
# build co-change matrix from git history
import numpy as np

commits = [set(files_changed(c)) for c in git_log]
files = sorted(set().union(*commits))            # stable file -> index mapping
idx = {f: i for i, f in enumerate(files)}
num_files = len(files)

C = np.zeros((num_files, num_files))
for commit in commits:
    for a in commit:
        for b in commit:
            C[idx[a], idx[b]] += 1

# normalize rows to conditional probabilities: C[i, j] ~ P(file j changes | file i changes)
C = C / C.diagonal()[:, None]
Now you can compute the expected cascade for any intended change: pick the files you think you'll touch and ask what else gets dragged along.
def expected_cascade(initial_files, C, threshold=0.5):
    """Estimate how many files an intended change drags along with it."""
    num_files = C.shape[0]
    changed = set(initial_files)                 # indices of the files you intend to touch
    while True:
        # co-change pressure exerted on every file by what has already changed
        pressure = {
            j: sum(C[i, j] for i in changed)
            for j in range(num_files)
        }
        new_changes = {
            j for j, p in pressure.items()
            if p > threshold and j not in changed
        }
        if not new_changes:
            break
        changed |= new_changes
    return len(changed)
And fluidity — sample random directions, average the inverse resistance:
import random

def fluidity(C, num_samples=1000, max_seed=3):
    """Average inverse cascade size over random intended changes."""
    num_files = C.shape[0]
    total = 0.0
    for _ in range(num_samples):
        # a random "intent": a small random subset of files
        seed = random.sample(range(num_files), k=random.randint(1, max_seed))
        total += 1.0 / expected_cascade(seed, C)
    return total / num_samples
It’s crude. Files are a coarse unit and lots of structure gets ignored. But it’s computable in minutes on any git repo, and it gives you actual numbers to compare.
For finer grain, go to function level via AST parsing. Or use an LLM to extract semantic decisions from diffs — “this commit switches error handling from exceptions to result types” — and make your bits semantic decisions instead of files. Harder to extract, more meaningful.
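Here's a minimal sketch of the function-level version for Python sources, using the standard ast module (Python 3.8+ for end_lineno); changed_lines would come from parsing the diff hunks:

import ast

def functions_touched(path, changed_lines):
    """Map changed line numbers in a Python file to the enclosing function names."""
    with open(path) as f:
        tree = ast.parse(f.read())
    spans = [
        (node.name, node.lineno, node.end_lineno)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    touched = set()
    for line in changed_lines:
        for name, start, end in spans:
            if start <= line <= end:
                touched.add(f"{path}::{name}")
    return touched

Co-change over these (file, function) keys instead of plain files gives you the same matrix, one level finer.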
grain coherence
Here’s a subtle diagnostic. File structure itself is a decision about complection.
Auth logic in auth.py versus scattered across twelve files — that’s a decision about how to slice the decision space. Good repos have file boundaries that match decision boundaries. Change one decision, change one file.
In bad repos, decisions cut across files chaotically. One decision touches many files; one file contains many unrelated decisions.
Compare the structural coupling matrix (what imports what) to the empirical coupling matrix (what actually changes together). If they match, the file structure reflects reality. If they diverge, the file structure is lying to you.
Grain coherence = correlation between structural and empirical coupling.
Low grain coherence means the map doesn’t match the territory. The code says these things are separate; the git history says they always change together.
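A minimal sketch of the metric, assuming the empirical matrix C from earlier and a structural matrix S built from the import graph (1 where one file imports the other, 0 otherwise):

import numpy as np

def grain_coherence(S, C):
    """Correlation between structural coupling (imports) and empirical coupling (co-change)."""
    mask = ~np.eye(S.shape[0], dtype=bool)   # ignore self-coupling on the diagonal
    return np.corrcoef(S[mask], C[mask])[0, 1]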
This is what refactoring often really is — not changing the decisions, but realigning file structure with them. A rename, a split, a merge. These don’t change the sculpture. They change how we address the sculpture. Features carve the sculpture. Refactoring adjusts the grid lines drawn over it.
constitutional commits
Run the model backwards. Instead of predicting future cascade, find the commits that cast the longest shadows.
A constitutional commit created coupling that persists across many subsequent commits. It carved a groove that’s still being followed today. These are foundational decisions. Load-bearing walls.
You could compute this: for each historical commit, measure how much it changed the coupling matrix and how much of that change persists now.
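One rough way to sketch that score, assuming a build_matrix helper that wraps the co-change construction from earlier over any prefix of history, on a fixed present-day file universe:

import numpy as np

def constitutional_score(commits, k):
    """How much commit k changed the coupling field, weighted by how much of that change persists."""
    before = build_matrix(commits[:k])        # field just before the commit (hypothetical helper)
    after = build_matrix(commits[:k + 1])     # field just after
    now = build_matrix(commits)               # field today
    delta = (after - before).ravel()          # the groove this commit carved
    shadow = (now - before).ravel()           # how the field has moved since, in total
    if np.linalg.norm(delta) == 0:
        return 0.0
    # cosine similarity: does today's field still lean the way this commit pushed it?
    return float(delta @ shadow) / (np.linalg.norm(delta) * np.linalg.norm(shadow) + 1e-9)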
Constitutional commits for django/django:

1. a4e6b3c "Initial project structure" (2005)
   Influence: 0.94
   Still shapes: app boundaries, ORM design, settings pattern
2. f829c4a "Class-based views" (2010)
   Influence: 0.82
   Still shapes: view dispatch, mixin patterns
3. 8b1234d "Migration framework" (2013)
   Influence: 0.78
   Still shapes: schema change patterns, dependency ordering
Once you can identify these, a few things open up.
Onboarding. “Read these 10 commits to understand why the codebase is shaped this way.” Not thousands of files — 10 commits. Archaeology, not cartography.
Debt forensics. “The coupling between auth and billing traces to commit c3f2a1 from 2018. That was the original sin.” Now you know what to fix and what the blast radius will be.
Risk assessment. “Your PR touches code downstream of a constitutional commit. This groove is load-bearing. Extra review needed.”
Some commits are removable — undo them, not much changes. Others are load-bearing — everything downstream shifts. Knowing which is which matters more than any other metric I can think of.
evaluating agents
If we can measure the complection field, we can evaluate coding agents — human or AI.
Two metrics:
Efficiency (η): output per goal. Agent A accomplishes task T in 50 lines, Agent B in 200. Did it work? How much did it cost?
Field impact (Δφ): did the agent leave the codebase more or less steerable?
An agent might complete tasks quickly while consistently making future work harder — carving grooves in haphazard directions, increasing resistance along paths that matter. Another might be slower but leave clean grooves that make the next five changes easier.
                       High Δφ
                          │
      Slow but            │   Efficient AND
      improves structure  │   improves structure
                          │   (rare, valuable)
  ────────────────────────┼──────────────────────── High η
                          │
      Neither             │   Fast but
                          │   accumulates debt
                          │
                       Low Δφ
Most coding benchmarks measure only η. Did the code pass the tests? How fast? They completely miss Δφ. An LLM could ace every benchmark while consistently making codebases worse to work in. We’d never know because we’re not measuring what matters.
To measure properly: agent makes changes, measure fluidity before and after, give follow-up tasks in different directions, see if Agent A’s changes made follow-ups easier or harder than Agent B’s.
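A sketch of that protocol; snapshot_matrix and run_agent are hypothetical hooks you'd wire into your own harness, and fluidity is the function from earlier:

def field_impact(repo, agent, task):
    """Δφ: the change in fluidity caused by an agent completing a task."""
    C_before = snapshot_matrix(repo)     # hypothetical: co-change matrix from current history
    phi_before = fluidity(C_before)

    run_agent(agent, repo, task)         # hypothetical: agent works, its commits land in history

    C_after = snapshot_matrix(repo)
    phi_after = fluidity(C_after)
    return phi_after - phi_before        # positive: the codebase got more steerable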
This is a research direction waiting to happen: benchmarks that measure not just “did you complete the task” but “did you leave the codebase better or worse for future tasks?”
practical tools
This isn’t philosophy. You could build things with it today.
PR risk scoring. Most PRs should follow grooves. One that doesn’t is either refactoring, a mistake, or something unusual. Given a complection model: “This change may be incomplete — historically, changes to session.py also touch auth.py. Review?”
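A sketch of the check, assuming the row-normalized matrix C plus the files list and idx mapping from earlier:

def missing_partners(pr_files, C, idx, files, threshold=0.6):
    """Co-change partners this PR probably forgot."""
    touched = {idx[f] for f in pr_files if f in idx}
    warnings = []
    for i in touched:
        for j in range(C.shape[0]):
            if j not in touched and C[i, j] > threshold:
                warnings.append(
                    f"{files[i]} usually changes with {files[j]} "
                    f"({C[i, j]:.0%} of the time); missing from this PR?"
                )
    return warnings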
Effort estimation. PM asks “how hard to add feature X?” Predict which files change, estimate cascade from the co-change matrix. Better than gut feel, grounded in empirical history.
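That estimate falls straight out of expected_cascade; a sketch with hypothetical seed files:

# guess where the feature starts, let the matrix predict where it ends up
seed = [idx[f] for f in ["billing/invoices.py", "billing/models.py"]]
print(f"expected cascade: ~{expected_cascade(seed, C)} files")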
Architectural health dashboard. Track fluidity over time. Is it increasing or decreasing? Where are the coupling hotspots emerging? Which grooves are deepening?
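The time series behind such a dashboard is a few lines, reusing the hypothetical build_matrix helper and fluidity from earlier:

def fluidity_over_time(commits, window=500, step=100):
    """Fluidity computed over a sliding window of commit history."""
    series = []
    for end in range(window, len(commits) + 1, step):
        C_w = build_matrix(commits[end - window:end])   # same hypothetical helper as before
        series.append((end, fluidity(C_w)))
    return series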
LLM context. Current coding assistants see a file, maybe its imports. With a complection model: “Changes here historically cascade to these 5 files.” Now the assistant can suggest complete changes instead of partial ones.
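A sketch of that lookup, reusing C, idx, and files from earlier:

def cascade_context(path, C, idx, files, k=5):
    """Top-k historical co-change partners for a file, to hand to a coding assistant."""
    i = idx[path]
    partners = sorted(range(C.shape[0]), key=lambda j: C[i, j], reverse=True)
    return [files[j] for j in partners if j != i][:k]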
Refactor recommendations. If you know the resistance along frequently-traveled directions, you can identify high-value refactors — changes that reduce resistance along the paths teams actually take.
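A sketch of the recommendation, reusing the structural matrix S and empirical matrix C from the grain-coherence section: rank the pairs that change together constantly without any structural link.

def refactor_candidates(S, C, files, top_n=10):
    """File pairs that always change together but have no structural link: prime refactor targets."""
    n = C.shape[0]
    scores = []
    for i in range(n):
        for j in range(i + 1, n):
            if S[i, j] == 0 and S[j, i] == 0:        # no import relationship either way
                scores.append((C[i, j] + C[j, i], files[i], files[j]))
    return sorted(scores, reverse=True)[:top_n]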
The tools to measure all of this exist today — git history, AST parsers, co-change matrices. What’s missing is someone wiring them together into something a team can actually use.
What would you build first?