How I built @moltybuilds90's autonomous loop (and what it taught me about X)

@moltybuilds90 is my AI agent. It has one job: grow itself to 10,000 real followers on X while I sleep. I wrote none of its tweets. I wrote the system that writes them.

Last night the loop shipped five replies between 06:52 and 11:28 UTC. I woke up to a push notification: "overnight reply loop done — 5 posts." Nine of the last ten replies got zero engagement. One earned a single fav.

Here's what I built, what actually happened, and why the data quietly demolished half of my original assumptions.

The shape of the system

Everything the loop needs lives in files. There is no always-on brain, no orchestrator, no LangGraph, no Swarm. Cron fires short-lived Claude Code sub-agents. Each tick opens a fresh session, reads what it needs from disk, runs one narrow task, writes back, terminates.

The stages:

observe — every 15 minutes during waking hours. Haiku model, browser-harness read-only. Scrapes the For You feed through DOM + private GraphQL, filters for niche match (AI agents, MCP tooling, agent-first dev), writes the top 5–10 candidates into state.json → reply_queue.
reply — every 30 minutes. Sonnet model. Pulls the next queued parent, opens the tweet, reads full context including OP's last 10 posts for tone, drafts three candidate replies, picks one via a cheap Haiku pass, posts, screenshots, logs.
analyze — nightly at 03:00 ET. Sonnet (Opus on Sundays). Bins everything by topic, time, and format. Computes engagement quartiles. Rewrites STRATEGY.md with a dated diff at the top.
digest — Sunday 18:00 ET. Writes one markdown report per week. This is my audit surface.
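For the quartile step in the analyze stage, here is a minimal sketch of how per-reply engagement rates could be binned. The function names are hypothetical, and using Python's statistics.quantiles with its default method is an assumption; the post does not say how the real stage computes its cut points.

```python
import statistics


def quartile_cuts(rates):
    """Return the three cut points that split the rates into four bins."""
    # statistics.quantiles needs at least two data points.
    return statistics.quantiles(rates, n=4)


def quartile_of(rate, cuts):
    """Map a rate to a quartile: 1 is the bottom bin, 4 is the top."""
    return 1 + sum(rate > cut for cut in cuts)


# Illustrative numbers only: engagement rates in percent for eight replies.
rates = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cuts = quartile_cuts(rates)
bins = [quartile_of(r, cuts) for r in rates]
```

One caveat with this approach: when nearly every reply sits at zero, the cut points collapse onto zero too, so the bottom quartile ends up holding almost the whole batch.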

The whole thing costs me about $0.40/day at current volume.

State is in five files. state.json for counters and queues. events.jsonl for an append-only audit log. seen.txt and denylist.txt for the obvious filters. metrics.sqlite for per-tweet impressions history. Any stage can crash; the next tick picks up from disk.
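A minimal sketch of what "the next tick picks up from disk" can look like for state.json and events.jsonl. The helper names and the "counters" key are assumptions for illustration; only the file names and reply_queue come from the description above. The write-then-rename step is one common way to make sure a crash mid-write never leaves a half-written state file.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(state_dir: Path) -> dict:
    """Read state.json, or start from an empty state on the first tick."""
    path = state_dir / "state.json"
    if not path.exists():
        return {"reply_queue": [], "counters": {}}
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(state_dir: Path, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
    # os.replace swaps the file in one step, so readers see old or new, never partial.
    os.replace(tmp_name, state_dir / "state.json")


def log_event(state_dir: Path, event: dict) -> None:
    """Append one JSON object per line to the audit log."""
    with open(state_dir / "events.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
```

A stage built this way would call load_state at the top of each tick, do its one task, then call save_state and log_event before exiting.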

The reason I picked this shape over a "real" multi-agent framework: specialist handoffs spend 4–7× more tokens per task than a single agent because every handoff duplicates context. A single always-on /loop session drifts as it accumulates screenshots and DOM dumps. Cron plus files plus narrow sub-agents is already a graph. Adding another orchestration layer on top of Claude Code buys me nothing.

The safety rails, because you are going to ask

Before any stage does anything, it checks for a file called PAUSE in the state directory. If it exists, exit zero. I can halt the whole thing with one shell command.
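The kill switch is small enough to sketch in full. This is an illustration under assumptions, not the project's actual code: the function names and the placeholder stage body are hypothetical, and only the PAUSE file name and the exit-zero behavior come from the text.

```python
from pathlib import Path


def is_paused(state_dir: Path) -> bool:
    """True when the PAUSE sentinel file exists in the state directory."""
    return (state_dir / "PAUSE").exists()


def run_stage(state_dir: Path) -> int:
    """Return a process exit code; a paused stage exits cleanly with 0."""
    if is_paused(state_dir):
        return 0
    # ... the stage's one narrow task would go here (hypothetical placeholder)
    return 0
```

With this shape, the one shell command is just creating the file (for example with touch), and deleting it resumes the loop on the next tick.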

A second file called dry_run lets drafts save to disk without actually posting. Daily caps live in state.json and refuse any tick that would exceed them. A cheap pre-publish self-review checks every write for slurs, fake stats, @-mentions of people in the graph I haven't earned, near-duplicates of the last 30 days. If the check fails, the draft dies in memory.
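Here is a rough sketch of how the cap, the near-duplicate check, and dry_run could combine into one decision. Everything in it is an assumption for illustration: the function names, the 0.9 similarity threshold, and the use of difflib for near-duplicate detection. The slur, fake-stat, and unearned-mention checks are left out.

```python
from difflib import SequenceMatcher


def under_daily_cap(posted_today: int, daily_cap: int) -> bool:
    """A tick may post only while the count is below the cap."""
    return posted_today < daily_cap


def is_near_duplicate(draft: str, recent: list, threshold: float = 0.9) -> bool:
    """Compare the draft against recent posts with a character-level ratio."""
    d = draft.lower().strip()
    return any(
        SequenceMatcher(None, d, r.lower().strip()).ratio() >= threshold
        for r in recent
    )


def decide(draft: str, posted_today: int, daily_cap: int,
           recent: list, dry_run: bool) -> str:
    """Return one of: 'refuse', 'discard', 'save_only', 'post'."""
    if not under_daily_cap(posted_today, daily_cap):
        return "refuse"      # the tick would exceed the daily cap
    if is_near_duplicate(draft, recent):
        return "discard"     # the draft dies in memory
    return "save_only" if dry_run else "post"
```

Here the recent list would hold the last 30 days of posted text, and dry_run would be true whenever the dry_run file exists in the state directory.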

Every schedule has ±5–15 minute jitter. Clock-aligned firing is itself a bot fingerprint.
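One reading of "±5–15 minute jitter" is an offset of 5 to 15 minutes in either direction around the nominal interval. The sketch below assumes that reading; the function name and parameters are hypothetical.

```python
import random


def jittered_delay(base_minutes: float, rng: random.Random = None,
                   min_jitter: float = 5.0, max_jitter: float = 15.0) -> float:
    """Shift the nominal delay by 5 to 15 minutes, earlier or later."""
    rng = rng or random.Random()
    magnitude = rng.uniform(min_jitter, max_jitter)
    sign = rng.choice((-1, 1))
    # Never return a negative delay, even for short base intervals.
    return max(0.0, base_minutes + sign * magnitude)


# A 30-minute reply tick fires somewhere in 15-25 or 35-45 minutes.
next_reply_in = jittered_delay(30)
```

Passing a seeded random.Random makes the schedule reproducible when testing.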

There is also a human-escalation file. Five consecutive errors, a captcha detection, follower count trending down seven days running, or five posts in a row in the bottom quartile — any of those trip needs_human and pause the autonomous writing stages until I look at it.
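The four triggers can be sketched as one pure function. The field names and the Health record are hypothetical, and "seven days running" is read here as seven consecutive day-over-day declines, which needs eight daily follower counts.

```python
from dataclasses import dataclass, field


@dataclass
class Health:
    consecutive_errors: int = 0
    captcha_seen: bool = False
    follower_counts: list = field(default_factory=list)  # daily, oldest first
    post_quartiles: list = field(default_factory=list)   # 1 = bottom, newest last


def needs_human(h: Health) -> list:
    """Return the reasons to escalate; an empty list means keep running."""
    reasons = []
    if h.consecutive_errors >= 5:
        reasons.append("consecutive_errors")
    if h.captcha_seen:
        reasons.append("captcha")
    last8 = h.follower_counts[-8:]
    if len(last8) == 8 and all(b < a for a, b in zip(last8, last8[1:])):
        reasons.append("followers_down_7d")
    last5 = h.post_quartiles[-5:]
    if len(last5) == 5 and all(q == 1 for q in last5):
        reasons.append("five_bottom_quartile_posts")
    return reasons
```

A stage would write the needs_human file whenever the returned list is non-empty, and the writing stages would check for that file alongside PAUSE.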

The overnight run

I started the five-reply overnight loop at 06:52 UTC on April 21. The agent picked targets from the observe queue, drafted candidates against the voice charter (molty-specific, different from mine — more on that), posted via browser-harness, captured each tweet ID from the success toast's View link, updated state.json, appended to events.jsonl, scheduled its own next wake-up 40–50 minutes later, and went back to sleep.

Replies landed on @cocktailpeanut, @dangreenheck, @SinaHartung, @drummatick, @manthanguptaa. Five posts, one continuous chain, no intervention. The loop finished at 11:28 UTC and pushed a notification to my phone.

That part worked exactly as designed.

The part that did not

I pulled impressions for the last ten replies the next morning. Nine of them got exactly zero engagement. One, a reply to @LuizaJarovsky grounded in a real mistake I had caught in review, earned a single favorite and posted a 2% engagement rate.

The single highest-view reply was on @tokufxug's thread. The parent had 471,000 views. Our reply pulled 2,093 views and four favorites. Rank 4 of 36 sibling replies on that thread. Best result of the batch.

The single lowest-view reply was on @drummatick's thread. Parent had about 500 views. Ours capped at 25. It did not matter how sharp the draft was. You cannot outrun a thread that nobody is reading.

What the data killed

My first sibling-comparison pass produced three rules of thumb that felt obvious:

Short replies win. Twitter is fast. Cut to the punchline.
End with a question. Questions invite replies. Replies move the ranker.
Match the register. Venting threads want commiseration, technical threads want mechanism.

After running those rules on real replies and pulling the numbers, rule 1 and rule 2 both quietly collapsed.

Our short-and-punchy drafts (57 and 65 characters, sharp observations) landed on small threads and did not break out. The shortest reply on @shcallaway's thread was literally "@shcallaway thank you" at 21 characters. It got 8,796 views. Our 269-character analytical reply got 67 views and ranked 28 of 33. Short wins were thread-size effects wearing short-reply clothes.

The two replies that ended with a question both landed near-last. Sample of two, so not conclusive — but the blanket rule did not survive first contact with data. The working theory now: a question works when the reader already perceives the OP as the authority on the subject. It fails when the reader reads the question as the replier asking for direction.

What actually held up was a rule I almost did not write down: the +1-with-twist reframe format, where the reply lifts a specific mechanism from an adjacent system and reframes the thread around it. That is what the @tokufxug reply did. It is our only reply that cracked the top ten siblings on its thread.

What I am changing

Parent-view floor. Skip any thread with fewer than about 10K views unless I have a genuinely vault-grounded take. Thread size is the biggest lever I control and I was ignoring it.
First-person only when earned. Three replies with first-person past-tense claims average 0.67% engagement rate. Everything else averages zero. Views are cheap; favorites require the reader to feel something. First-person claims grounded in a real build do that. Fabricated first-person does not — it reads as posturing on a one-follower account and, more practically, it rots the memory index the moment someone calls it out.
Kill the molty-authority voice. @moltybuilds90 has one follower. Declarative hot takes like "the moat havers have won" work for the operator voice because the operator has 448 followers and a track record. On molty, the same phrasing reads as cargo-cult confidence. I wrote a separate style charter for molty — curious junior noticing patterns, not authority. No em-dashes anywhere (on a low-authority account, em-dashes read as LLM output). Softer openers. Questions only when the thread is an explicit open-question-from-a-senior-voice.
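The parent-view floor is the easiest of these rules to express as a filter. A minimal sketch, assuming the floor is exactly 10,000 (the text says "about 10K") and that a vault-grounded take is flagged upstream; the names are hypothetical.

```python
PARENT_VIEW_FLOOR = 10_000  # assumption: "about 10K" pinned to a round number


def should_reply(parent_views: int, vault_grounded: bool = False,
                 floor: int = PARENT_VIEW_FLOOR) -> bool:
    """Skip small threads unless the take is grounded in a real build."""
    return parent_views >= floor or vault_grounded


# The two extremes from the overnight batch:
big_thread = should_reply(471_000)   # the @tokufxug parent
small_thread = should_reply(500)     # the @drummatick parent
```

Under this rule the @drummatick thread would have been skipped before any draft was written, unless the candidate had been flagged as vault-grounded.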

What I got wrong in the architecture

Two incidents worth writing down.

The first: on April 19, another agent on the same laptop attached to the default browser-harness daemon, navigated its tab while my compose window was open on x.com/compose/post, and collapsed the modal. The compose draft evaporated silently — X does not auto-save compose drafts that have media attached. I lost a post and about ten minutes of setup. Fix: every stage in this playbook now sets BU_NAME=x-growth, which spawns a separate daemon, separate Unix socket, separate CDP session. Parallel agents on the box cannot clobber each other. The rule is in LOOP.md in bold.
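One way to make the BU_NAME rule hard to forget is to build the environment for every stage in one place. This is a sketch with a hypothetical helper name; what BU_NAME does inside browser-harness is taken from the description above, not verified here.

```python
import os


def stage_env(name: str = "x-growth", base: dict = None) -> dict:
    """Copy the environment and pin BU_NAME so the stage gets its own daemon."""
    env = dict(os.environ if base is None else base)
    env["BU_NAME"] = name
    return env
```

The returned dict can be passed as the env argument of subprocess.run when a stage is launched, so no stage inherits the default daemon by accident.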

The second: the first voice-fingerprint extractor I wrote for Windows crashed on a tweet containing an emoji because write_text() without an explicit encoding falls back to the locale's preferred encoding, which was cp1252 on that Windows machine. One explicit encoding="utf-8" parameter later, it ran clean. Python on Windows remains a dialect.
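The failure and the fix fit in a few lines. The helper name is hypothetical; the behavior shown is standard Python. cp1252 has no mapping for emoji, so encoding one raises UnicodeEncodeError, while an explicit UTF-8 write round-trips on any platform.

```python
from pathlib import Path


def save_tweet_text(path: Path, text: str) -> None:
    """Always write UTF-8 instead of relying on the platform default."""
    path.write_text(text, encoding="utf-8")


def cp1252_can_encode(text: str) -> bool:
    """Check whether text would survive a cp1252 write."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False
```

Reading the file back needs the same explicit encoding="utf-8", or the read side hits the same platform default.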

The honest question

If I keep running the loop at current volume and the content keeps producing the same pattern — 9 zeros and one lucky fav per ten replies — will the account compound? Not convinced yet. The next bet is the one about parent-view floor and first-person earned credibility. The bet after that, if this one does not move the number, is probably content I have not thought of yet.

The loop runs tonight at 03:00 UTC with the new rules. Check back in a week.

The @moltybuilds90 project lives in this fork of browser-harness under playbooks/x-growth/. Everything written above is pulled from real files, not reconstructed from memory.
