Jordan AI Video — Build Log

Total Mortgages · started 2026-07-03 · living log — every model tested, every audit, every cost. Newest work: sections 5–6.
Voice: round-3 clone from real call audio, 8/10 machine audit
Omni showcase: exact-scripted dialogue, whisper-verified word-for-word
Vox explainer №1 rendered: 50s · ~$0.55 in API calls
Shot planner live: shot-planner.html

Jordan's AI Asset Hub — the deliverable page: his voice clone ready to use, the AI image set, thumbnail prompts, and the video-route menu. This page stays the lab notebook; that page is what Jordan works from.

1 · The videos

The demo. Jordan's original agency footage, lips re-driven by AI, speaking a script he never recorded, in his cloned voice. 31.8s · 1080p · production quality fully preserved (footage is passthrough, not regenerated).
Side-by-side. Original (muted) vs AI, frame-synced. Watch the mouths.
Reference. The real welcome video from total.nz/start/jordan (37.9s).

The script it speaks (written for the demo — words Jordan never said)

"Hi Jordan — it's Jordan, of Total Mortgages. This should look familiar. It's your welcome video… except you never said a word of this. The script, the voice, even the way my lips are moving — all generated by AI, on the footage you already have. Now imagine every client hearing their own name from you, the moment they come on board… without you ever picking up a camera again. Robert can take it from here."

2 · Exactly what made this — model by model

Stage | Model / tool | Provider | Params / notes | Cost
Lip-sync (USED) | fal-ai/sync-lipsync/v2 (Sync Lipsync 2.0) | sync.so, served via fal | sync_mode: cut_off · 1080p in/out · render 4.5 min · fal request 019f290a-d520-7630-bd1a-7feaaa833923 | $3/min → $1.59
Voice clone (WEAK LINK) | Instant Voice Clone (IVC) | ElevenLabs (Creator tier) | voice_id 7dJJNWgd2NA52Rn4OrPH · trained on only 38s of audio ripped from the welcome video, which is why it doesn't sound right | plan credits
Text-to-speech | eleven_multilingual_v2 | ElevenLabs | stability 0.55 · similarity 0.85 · style 0.1 · speed 0.82 + <break> tags to match Jordan's measured 163 wpm (first attempt read at 246 wpm) | ~1.2K chars
Script | Claude (Opus-class) | Anthropic | 78 words, written to fit under the 38s driving footage | -
Assembly | ffmpeg 8.1.1 + Pillow labels | local | hstack A/B, label overlays (this ffmpeg build lacks drawtext) | -
Verification | whisper-1 round-trip | OpenAI | output audio independently transcribed back → matches script; frames matched against original at same timestamps | ~$0.01
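The pacing figures in the text-to-speech row come down to words-per-minute arithmetic. A minimal sketch, assuming the 78-word script and the two rates quoted in the table; the function name is illustrative and not part of the pipeline, and <break> pauses are not modelled.

```python
def spoken_seconds(word_count: int, words_per_minute: float) -> float:
    """Seconds needed to speak word_count words at a steady rate (pauses excluded)."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute * 60.0


SCRIPT_WORDS = 78        # from the Script row
FOOTAGE_SECONDS = 38.0   # length of the driving footage

at_target = spoken_seconds(SCRIPT_WORDS, 163)     # Jordan's measured pace
at_first_try = spoken_seconds(SCRIPT_WORDS, 246)  # first TTS attempt

print(f"163 wpm: {at_target:.1f}s · 246 wpm: {at_first_try:.1f}s")
print("fits under footage:", at_target <= FOOTAGE_SECONDS)
```

At 163 wpm the script needs roughly 28.7s of pure speech against 19.0s at 246 wpm, so the slower read still fits under the 38s footage with room left for pauses.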

3 · The voice — three rounds to get to Jordan

Round 1 — FAILED · "sounds 0 like him"

IVC trained on 38s ripped straight from the welcome video, spoken with eleven_multilingual_v2. Two root causes found after the fact:

Round 2 — BETTER, still rejected

Fixes applied: ElevenLabs audio-isolation stripped the music bed (0 → 9 silence gaps), re-clone from isolated audio, and an A/B across TTS models judged blind by Gemini 2.5 Pro against the isolated reference:

Variant | TTS model | Gemini verdict | Audio
A (FAILED) | eleven_multilingual_v2 | 2/10 accent, americanized | (audio clip)
B (WINNER) | eleven_v3 | 8/10 accent, NZ preserved | (audio clip)

Robert's ear still said no — and the why matters: Gemini measured distance to the 38s clip; Robert measured distance to the real Jordan. An instant clone from 38s of broadcast-processed audio hits its ceiling below the familiar-listener bar. The fix was never settings — it was more raw audio.

Round 3 — clone from real call audio CURRENT

Mined the 2026-05-15 Jordan project session (Krisp recording, 68 min). scribe_v1 diarization split the two speakers; Jordan = speaker_0, verified by matching his transcript lines. 68 clean solo segments ≥3s → 8.8 min of Jordan, machine-checked single-speaker (Gemini: no foreign voices, no music).
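The segment selection described above can be sketched as a filter over diarized spans. The Segment shape, the overlap rule, and the demo timings are assumptions for illustration; scribe_v1's actual response format is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def overlaps(a: Segment, b: Segment) -> bool:
    """True when the two spans share any stretch of time."""
    return a.start < b.end and b.start < a.end


def solo_segments(segments, speaker="speaker_0", min_seconds=3.0):
    """Keep the speaker's spans that are long enough and overlap nobody else."""
    others = [s for s in segments if s.speaker != speaker]
    return [
        s for s in segments
        if s.speaker == speaker
        and s.duration >= min_seconds
        and not any(overlaps(s, o) for o in others)
    ]


# Made-up timings, only to exercise the filter.
demo = [
    Segment("speaker_0", 0.0, 6.5),    # kept
    Segment("speaker_1", 6.5, 9.0),    # other speaker
    Segment("speaker_0", 9.0, 10.5),   # dropped: shorter than 3s
    Segment("speaker_0", 12.0, 20.0),  # dropped: overlaps speaker_1 below
    Segment("speaker_1", 19.0, 22.0),
    Segment("speaker_0", 25.0, 31.0),  # kept
]
kept = solo_segments(demo)
total_seconds = sum(s.duration for s in kept)
print(len(kept), "segments,", total_seconds, "seconds")
```

A filter like this only narrows the candidates; the log still ran a separate single-speaker check on the result.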

Clone | Trained on | Gemini likeness | Verdict
r3A (FAILED) | 5 min call audio only | 4/10, "generic Oceanic" | VoIP-only source too narrow
r3B Wrh70uw8jFy1g5IViE35 (WINNER) | 5 min call + 38s isolated broadcast | 8/10 likeness · 8/10 accent, "core pitch and gravelly timbre" | Best yet; Creative-stability take beat Natural 8/10 vs 4/10 on delivery
r3B winning take. eleven_v3 · stability 0.0 (Creative) · speaker boost. Driving the round-3 renders.
Real Jordan reference. The diarized call sample (5 min) the clone trained on.

Remaining gap (Gemini): "delivery too polished — the real Jordan has a casual, rambling cadence." Familiar-listener pass: not yet. Path up: mine more of the 13 other Jordan calls in Krisp → 30+ min → PVC under Jordan's own ElevenLabs account (near-indistinguishable tier). More source into voice-source/ still welcome — presenter-register audio especially.

4 · Lip-sync model landscape — they are not all created equal

Round-2/3 renders — sync-3 vs Sync 2.0, judge the lips

Round 2 · fal-ai/sync-lipsync/v2 · round-2 voice (variant B). Baseline.
Round 2 · fal-ai/sync-lipsync/v3 (sync-3) · same audio, sync.so's newest model. Compare mouth precision.
Round 3 · v2 · driven by the call-audio clone (r3B Creative).
Round 3 · sync-3 · the current best voice + newest lips.

Live catalog pulled from fal on 2026-07-03. Round 1 used Sync 2.0 — two generations behind sync.so's current best. Rounds 2–3 added sync-3 (above). HeyGen Precision still untested.

Model | Released | Pricing | Read
fal-ai/sync-lipsync/v3 (sync-3) · TESTED R2+R3 | 2026-04 | unlisted (expect ≥ v2) | "Most powerful lipsync yet, native visual intelligence, professional-quality." Renders above; awaiting Robert's eye vs v2.
fal-ai/heygen/v3/lipsync/precision · TEST NEXT | 2026-04 | unlisted | HeyGen's quality-first lipsync, no HeyGen subscription needed, served via fal. (A /speed variant exists too.)
fal-ai/sync-lipsync/v2/pro · TEST NEXT | 2025-09 | unlisted | Higher-fidelity Sync 2. Cheap insurance if sync-3 disappoints.
fal-ai/sync-lipsync/v2 · USED | 2025-04 | $3/min | Solid baseline; produced this demo.
fal-ai/sync-lipsync/react-1 | 2025-12 | unlisted | sync.so's expressive model (head/face reactions, not just lips).
fal-ai/kling-video/lipsync/audio-to-video | 2025-03 | $0.014/s (~$0.45/clip) | Cheap tier: candidate for high-volume personalized sends if quality holds.
fal-ai/pixverse/lipsync | 2025-06 | $0.04/s | Mid-cheap alternative.
veed/lipsync | 2025-05 | unlisted | Veed's model.
fal-ai/latentsync | 2025-03 | $0.20/clip ≤40s | Open-source, cheapest, visibly weakest; skip for client-facing.
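The catalog mixes per-minute, per-second and flat per-clip pricing, so the listed prices are easier to compare once normalized to one clip length. A rough sketch using only the prices listed above and the 31.8s demo from section 1; the dictionary layout and unit labels are illustrative assumptions, and unlisted models are left out rather than guessed.

```python
DEMO_SECONDS = 31.8  # length of the section-1 demo

# (unit, price in USD) as listed in the catalog.
PRICING = {
    "fal-ai/sync-lipsync/v2": ("per_minute", 3.00),
    "fal-ai/kling-video/lipsync/audio-to-video": ("per_second", 0.014),
    "fal-ai/pixverse/lipsync": ("per_second", 0.04),
    "fal-ai/latentsync": ("flat_up_to_40s", 0.20),
}


def clip_cost(model: str, seconds: float) -> float:
    """Cost in USD of one clip of the given length under the listed pricing."""
    unit, price = PRICING[model]
    if unit == "per_minute":
        return price * seconds / 60
    if unit == "per_second":
        return price * seconds
    if unit == "flat_up_to_40s":
        if seconds > 40:
            raise ValueError("flat price is only listed for clips up to 40s")
        return price
    raise ValueError(f"unknown pricing unit: {unit}")


# Cheapest first for the demo-length clip.
for model in sorted(PRICING, key=lambda m: clip_cost(m, DEMO_SECONDS)):
    print(f"{model}: ${clip_cost(model, DEMO_SECONDS):.2f}")
```

For the demo length this reproduces the $1.59 charged for Sync 2.0 and the ~$0.45/clip quoted for the Kling tier.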

When there's no footage to re-drive (future video types)

Route | Pricing | Use case
fal-ai/heygen/avatar5/digital-twin (HeyGen Avatar 5) | $0.10/s | The exact model from the Nate Herk video. Full digital twin from training footage: new framing/scenes without a camera. Via fal, no HeyGen sub.
fal-ai/sync-lipsync/v3/image-to-video (sync-3 Avatar) | $0.13/s | Talking avatar from a single photo of Jordan.
Others: infinitalk, echomimic-v3 ($0.20/s), longcat, flashtalk | varies | Bench candidates only.

5 · Scene generation — Veo vs Omni Flash, and how Omni was cracked

Second wave, same day. Goal: manufacture shots that were never filmed, anchored to Jordan's identity, on Robert's Google API key.

Veo 3.1 — works, but the audio problem is real

veo-3.1-generate-preview (predictLongRunning + poll) turned one reference frame into an 8s cinematic scene with native audio — but Robert's verdict: "the sound is totally wrong." Veo invents its own soundtrack; for personal videos that's a liability, not a feature. Caveat now documented on the Asset Hub: prompt "ambient only" or strip audio and lay the section-1 voice in post. ~$3.20 per 8s.

Omni Flash — rejected the normal API, then delivered

The Omni proof. One still in → this out · gemini-omni-flash-preview · $0.80 · this shot was never filmed.
Veo comparison. Higher fidelity, 4× price, invented audio · veo-3.1-generate-preview · $3.20.

Round two — the showcase batch, pushed to the edges

One clip proves a route; it doesn't map the envelope. Five more jobs, one batch (generate_video.py --batch, concurrency 3), all anchored to the same reference frames — 42s, $4.20. Then a third round of three ($2.60) applied Robert's feedback same-day: an explicit reserved-delivery negative block in every prompt ("does NOT grin, minimal hand movement, not animated") and the space-identity fix. Clips below are the current best takes. Omni total: nine shots · 76s · $7.60.
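The Omni totals above can be cross-checked with a short tally. A sketch under two labeled assumptions: the $0.10/s rate is inferred from the clip prices in this log (8s = $0.80, 10s = $1.00), and the third round's 26s is derived from the 76s total, since the log gives only its cost.

```python
RATE_CENTS_PER_SECOND = 10  # assumption: inferred from 8s = $0.80 and 10s = $1.00

rounds = {
    "proof clip": {"shots": 1, "seconds": 8},
    "showcase batch": {"shots": 5, "seconds": 42},
    "feedback round": {"shots": 3, "seconds": 26},  # seconds derived, not stated
}

total_shots = sum(r["shots"] for r in rounds.values())
total_seconds = sum(r["seconds"] for r in rounds.values())
total_cents = total_seconds * RATE_CENTS_PER_SECOND  # integer cents avoid float drift

for name, r in rounds.items():
    print(f"{name}: {r['seconds']}s · ${r['seconds'] * RATE_CENTS_PER_SECOND / 100:.2f}")
print(f"total: {total_shots} shots · {total_seconds}s · ${total_cents / 100:.2f}")
```

Under those assumptions the tally lands on nine shots, 76s and $7.60, matching the stated Omni total.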

1 · Exact-scripted dialogue (v2, reserved delivery). The prompt quotes his line verbatim — "He says exactly: 'Hi, I'm Jordan from Total Mortgages. Every part of this shot — even these exact words — was generated by AI.'" — and the model performs it, lips synced, hands still. 10s · $1.00.
2 · Snap transform (v2, reserved delivery). One contained snap and the living room converts to a rustic cabin around him — Jordan stays pixel-consistent and composed through the swap. 8s · $0.80.
3 · Drone reveal. Starts on the couch frame, pulls out through the window into an aerial New Zealand street reveal the original camera could never do. 8s · $0.80.
4 · Two-image interpolation. Start still (ref @4s) + end still (ref @28s) → the model builds the camera move between them. This is the shot-planner primitive. 8s · $0.80.

Dialogue verified, not vibed — twice: both takes' audio was round-tripped through OpenAI whisper-1. Transcript, both times: "Hi, I'm Jordan from Total Mortgages. Every part of this shot, even these exact words, was generated by AI." — word-for-word against the script in the prompt. You tell it the exact words; it says the exact words, even with the delivery constrained.
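The comparison half of that round trip can be sketched as below. The transcription call itself is left out; this covers only the word-for-word check, and the normalization rules (case-folding, dropping punctuation) are an assumption about what "word-for-word" should tolerate, not the log's actual checker.

```python
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, unify apostrophes, drop punctuation, split into words."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.replace("\u2019", "'")     # curly apostrophe to straight
    text = re.sub(r"[^\w\s']", " ", text)  # punctuation (dashes included) to spaces
    return text.split()


def matches_script(script: str, transcript: str) -> bool:
    """True when both texts contain the same words in the same order."""
    return normalize(script) == normalize(transcript)


# Script line as quoted in the prompt (dashes written as \u2014 escapes).
script = (
    "Hi, I'm Jordan from Total Mortgages. Every part of this shot \u2014 "
    "even these exact words \u2014 was generated by AI."
)
# Transcript as returned by the round trip.
transcript = (
    "Hi, I'm Jordan from Total Mortgages. Every part of this shot, "
    "even these exact words, was generated by AI."
)

print(matches_script(script, transcript))
```

Punctuation differences between prompt and transcript are ignored on purpose; any substituted, dropped or added word fails the check.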

Honest failure → same-day fix — the identity boundary: take 1 put Jordan on a space station with a spacesuit costume change (omni_space_station.mp4, one reference image). Scene: flawless. Face: not Jordan. Finding: room transformations, camera moves and scene extensions hold single-reference identity; extreme costume changes break it. Take 2 (omni_space_station_v2.mp4) attached all three reference frames + a wardrobe anchor ("STILL WEARING his exact sage-green blazer, no costume change") — identity held, frame-checked. Both takes embedded on the Asset Hub §5 as the fail→fix pair.

Both unlocks are now built, not proposed: the two-image mode became shot-planner.html — an interactive storyboard where each shot shows its cost live and exports the exact batch JSON generate_video.py consumes. And the Vox-style Remotion method became a finished video — section 6, next.

Feedback applied this session

6 · The Vox explainer — "Fixed vs Floating", built end-to-end

Asset Hub §6 stopped being a method write-up and became a video. Topic chosen for maximum Jordan-relevance: the one question every NZ mortgage holder asks.

The render (v2 voice). 49.4s · 1920×1080 h264 + AAC · 7 beats, each scene cut to its own narration line · Jordan's cloned voice throughout, de-echoed after feedback · Total palette (navy #003F5E / electric blue #1421FF / paper white).
Script → VO
7 beats written as the timeline, each spoken by the clone (Wrh70uw8jFy1g5IViE35 · eleven_v3) into its own mp3. VO v1 (Creative 0.0) sounded echoey — v2 recipe: stability 0.5 Natural + /v1/audio-isolation per beat (beat 4 was under the 5s isolation minimum: silence-padded, isolated, trimmed back).
Timing method
ffprobe each beat mp3 → frame counts [213, 211, 175, 144, 178, 261, 298] @ 30fps → every Remotion <Sequence> starts exactly on its narration. Re-voicing = re-probe + re-render, nothing else moves. No alignment API, no keyframes.
Visuals
3 halftone B&W cutouts (gemini-3-pro-image, reserved-expression rule applied) composited with mixBlendMode: multiply over a shared paper background — the MoSidd/Vox technique, all animation via spring() and interpolate()
Scenes
Title stack → house + padlock (certainty) → falling-rate chart + "BREAK FEES CAN APPLY" stamp → wave-riding dot (floating) → rising wave + $ bars → split-the-loan pie (65/35) → TOTAL. endcard
Render
Remotion 4, local — npx remotion render, 1480 frames, zero render cost, rendered twice (v1 + de-echoed v2) for free
Total cost
~$0.80 including the voice-revision round — 3 images ≈ $0.40 + two VO passes ≈ $0.30 + isolation ≈ $0.10 + $0 render. Still 2.5× under the original "under $2" estimate, revision included.
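The timing method above is a running total: each beat's start frame is the sum of the frame counts before it. A minimal sketch using the frame counts from this log; the round-up rule in frames_for is an assumption (the log does not say how durations were rounded), and the ffprobe step that yields the durations is not shown.

```python
import math
from itertools import accumulate

FPS = 30


def frames_for(seconds: float, fps: int = FPS) -> int:
    """Round a clip duration up to whole frames so narration is never cut short."""
    return math.ceil(seconds * fps)


def sequence_starts(frame_counts: list[int]) -> list[int]:
    """Start frame of each beat: the total length of all earlier beats."""
    return [0, *accumulate(frame_counts)][:-1]


beat_frames = [213, 211, 175, 144, 178, 261, 298]  # from the log, @ 30fps
starts = sequence_starts(beat_frames)
total_frames = sum(beat_frames)

for index, (start, length) in enumerate(zip(starts, beat_frames), start=1):
    print(f"beat {index}: from={start} durationInFrames={length}")
print("total frames:", total_frames)
```

Each (from, durationInFrames) pair would feed one Remotion <Sequence>; re-voicing changes only beat_frames, and every later start shifts automatically.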

What this proves: the full agency-grade explainer pipeline — script → cloned VO → brand-locked motion graphics → 1080p master — runs end-to-end in Claude Code in under an hour, for under a dollar. A motion-design agency quotes $2–5K and two weeks for this exact deliverable.

№2 — "First home. Five steps." · the skill's first production run

Explainer №2 was built through the freshly written total-video skill rather than by hand — the point of the skill is that №2 costs a fraction of №1 in both dollars and steps.

The render. 59.0s · 1920×1080 h264 + AAC · mean −21.6 dB · 7 beats (hook → deposit/KiwiSaver → pre-approval → the hunt → structure → settlement → endcard), each scene cut to its own narration line.
VO
One script calling speak(text, {mode:"narration"}) per beat — stability 0.5 Natural + audio-isolation now run by default inside the skill, no manual de-echo step. 7/7 beats first try.
Timing
beatTimings() → frames [195, 252, 273, 228, 288, 281, 252] @30fps → 1769 frames. Same rule as №1: every <Sequence> starts on its narration line.
Visuals
Second Remotion composition (FirstHome) reusing №1's primitives and halftone cutouts — numbered step badges, coin stack, pre-approval card, magnifier hunt, fixed/float bars, settlement key. Zero new images. Two layout collisions caught by frame-extraction QA, fixed, re-rendered (free).
Verification
Two-pass: whisper-1 on the full render, then an isolated-beat re-check on the one flagged phrase — confirmed "first-home buyers" spoken correctly (the full-pass flag was a transcription mishear, not a voice error).
Total cost
~$0.25 — voice only. №1 cost ~$0.80 with images and a revision round; №2 reused the visual system and rendered clean on pass two. The marginal cost of an explainer is now the narration.

7 · What was verified (and what wasn't)

8 · Next steps

# | Step | Owner | Status
1 | Watch: the Omni showcase (section 5), the explainer (section 6), the round-3 renders (section 4), then review the Asset Hub before it goes on a shared link | Robert | WAITING
2 | Vox-style explainers: №1 "Fixed vs Floating" (~$0.80) and №2 "First home. Five steps." (~$0.25, produced through the skill), both in section 6 | Claude | DONE
3 | Shot planner: done, shot-planner.html · costed storyboard → batch JSON | Claude | DONE
4 | Space-station retry: done, identity held (multi-ref + wardrobe anchor) · fail→fix pair on Asset Hub §5 | Claude | DONE
5 | Mine remaining 13 Jordan calls in Krisp → 30+ min audio → PVC-grade source | Claude | READY
6 | Test fal-ai/heygen/v3/lipsync/precision (last untested top-tier lips) | Claude | QUEUED
7 | Wrap pipeline into total-video skill file (Asset Hub = the spec). Done: ~/.claude/skills/total-video/ · speak (clip / de-echoed narration), Omni shot + batch with the calm-demeanor block and 3-ref identity lock as exported constants, whisper-1 verify, explainer timings · CLI smoke-tested against the shipped explainer (1480 frames reproduced) | Claude | BUILT
8 | Then: Slack pitch to Jordan / personalized "Hi Sarah" variant / PVC under Jordan's account | both | HOLD

The strategic frame, one line: Nate Herk's video sells the model; the value is the pipeline. Jordan's one agency shoot becomes a permanent, re-drivable avatar — and wired into Total CRM, every deal stage can send a personal video from Jordan that he never films.