Jordan AI Video — Build Log

Total Mortgages · started 2026-07-03 · living log — every model tested, every audit, every cost. Newest work: sections 5–6.

Voice: round-3 clone from real call audio, 8/10 machine audit
Omni showcase: exact-scripted dialogue, whisper-verified word-for-word
Vox explainer №1 rendered: 50s · ~$0.55 in API calls
Shot planner live: shot-planner.html

→ Jordan's AI Asset Hub — the deliverable page: his voice clone ready to use, the AI image set, thumbnail prompts, and the video-route menu. This page stays the lab notebook; that page is what Jordan works from.

1 · The videos

The demo. Jordan's original agency footage, lips re-driven by AI, speaking a script he never recorded, in his cloned voice. 31.8s · 1080p · production quality fully preserved (footage is passthrough, not regenerated).

Side-by-side. Original (muted) vs AI, frame-synced. Watch the mouths.

Reference. The real welcome video from total.nz/start/jordan (37.9s).

The script it speaks (written for the demo — words Jordan never said)

"Hi Jordan — it's Jordan, of Total Mortgages. This should look familiar. It's your welcome video… except you never said a word of this. The script, the voice, even the way my lips are moving — all generated by AI, on the footage you already have. Now imagine every client hearing their own name from you, the moment they come on board… without you ever picking up a camera again. Robert can take it from here."

2 · Exactly what made this — model by model

Stage	Model / tool	Provider	Params / notes	Cost
Lip-sync USED	`fal-ai/sync-lipsync/v2` Sync Lipsync 2.0	sync.so, served via fal	`sync_mode: cut_off` · 1080p in/out · render 4.5 min · fal request `019f290a-d520-7630-bd1a-7feaaa833923`	$3/min → $1.59
Voice clone WEAK LINK	Instant Voice Clone (IVC)	ElevenLabs (Creator tier)	voice_id `7dJJNWgd2NA52Rn4OrPH` · trained on only 38s of audio ripped from the welcome video — this is why it doesn't sound right	plan credits
Text-to-speech	`eleven_multilingual_v2`	ElevenLabs	stability 0.55 · similarity 0.85 · style 0.1 · speed 0.82 + `<break>` tags to match Jordan's measured 163 wpm (first attempt read at 246 wpm)	~1.2K chars
Script	Claude (Opus-class)	Anthropic	78 words, written to fit under the 38s driving footage	—
Assembly	ffmpeg 8.1.1 + Pillow labels	local	hstack A/B, label overlays (this ffmpeg build lacks drawtext)	—
Verification	whisper-1 round-trip	OpenAI	output audio independently transcribed back → matches script; frames matched against original at same timestamps	~$0.01
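The pacing numbers in the text-to-speech row are plain arithmetic. A minimal sketch, assuming pace is simply words divided by clip length; the helper names are illustrative and not part of the pipeline:

```python
def words_per_minute(word_count: int, seconds: float) -> float:
    """Speaking rate of a clip: words divided by duration in minutes."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return word_count * 60.0 / seconds


def seconds_at_pace(word_count: int, wpm: float) -> float:
    """How long a script runs when read at a given pace."""
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    return word_count * 60.0 / wpm


# The 78-word script at the two paces quoted in the table.
too_fast = seconds_at_pace(78, 246)   # first attempt
on_pace = seconds_at_pace(78, 163)    # Jordan's measured pace
print(f"{too_fast:.1f}s at 246 wpm, {on_pace:.1f}s at 163 wpm")
```

At either pace the read fits under the 38s of driving footage; pauses from the `<break>` tags add to the spoken time and are not modeled here.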

3 · The voice — three rounds to get to Jordan

Round 1 — FAILED · "sounds 0 like him"

IVC trained on 38s ripped straight from the welcome video, spoken with eleven_multilingual_v2. Two root causes found after the fact:

The source was poisoned. The welcome video has a continuous music bed — ffmpeg silencedetect found zero silence gaps. The clone was learning voice+music as one sound.
Wrong TTS model. eleven_multilingual_v2 flattens the NZ accent toward generic American — measured 2/10 accent in the Gemini audit. Model choice mattered more than any setting.
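The music-bed check can be reproduced with ffmpeg's `silencedetect` filter, which logs each gap to stderr. A sketch only: the noise floor and minimum gap length below are assumptions, since the log does not record the thresholds used.

```python
import re
import subprocess

# Thresholds are assumptions; the log does not record the values used.
NOISE_DB = -35
MIN_SILENCE_S = 0.3

_END = re.compile(r"silence_end:\s*([\d.]+)\s*\|\s*silence_duration:\s*([\d.]+)")


def parse_silences(ffmpeg_stderr: str) -> list[tuple[float, float]]:
    """Return (end_time, duration) for each silence gap ffmpeg reported."""
    return [(float(end), float(dur)) for end, dur in _END.findall(ffmpeg_stderr)]


def detect_silences(path: str) -> list[tuple[float, float]]:
    """Run the silencedetect filter over a file and parse its log."""
    cmd = [
        "ffmpeg", "-hide_banner", "-nostats", "-i", path,
        "-af", f"silencedetect=noise={NOISE_DB}dB:d={MIN_SILENCE_S}",
        "-f", "null", "-",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_silences(result.stderr)  # the filter writes to stderr
```

An empty list from `detect_silences` on a voice recording is the warning sign: speech has natural pauses, so zero gaps suggests something continuous sits underneath it.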

Round 2 — BETTER, still rejected

Fixes applied: ElevenLabs audio-isolation stripped the music bed (0 → 9 silence gaps), the voice was re-cloned from the isolated audio, and the TTS models were A/B-tested, judged blind by Gemini 2.5 Pro against the isolated reference:

Variant	TTS model	Gemini verdict
A FAILED	`eleven_multilingual_v2`	2/10 accent — americanized
B WINNER	`eleven_v3`	8/10 accent — NZ preserved

Robert's ear still said no — and the why matters: Gemini measured distance to the 38s clip; Robert measured distance to the real Jordan. An instant clone from 38s of broadcast-processed audio hits its ceiling below the familiar-listener bar. The fix was never settings — it was more raw audio.

Round 3 — clone from real call audio CURRENT

Mined the 2026-05-15 Jordan project session (Krisp recording, 68 min). scribe_v1 diarization split the two speakers; Jordan = speaker_0, verified by matching his transcript lines. 68 clean solo segments ≥3s → 8.8 min of Jordan, machine-checked single-speaker (Gemini: no foreign voices, no music).
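The segment mining step can be sketched as a filter over word-level diarization output. The `Word` shape and the 0.5s gap tolerance are assumptions for illustration; only the speaker label and the 3s minimum come from the log.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Word:
    speaker: str   # e.g. "speaker_0"
    start: float   # seconds
    end: float     # seconds


def solo_segments(words: list[Word], speaker: str,
                  min_len: float = 3.0, max_gap: float = 0.5) -> list[tuple[float, float]]:
    """Merge a speaker's consecutive words into runs; keep runs >= min_len seconds."""
    runs: list[tuple[float, float]] = []
    current: tuple[float, float] | None = None
    for w in sorted(words, key=lambda x: x.start):
        if w.speaker == speaker:
            if current is not None and w.start - current[1] <= max_gap:
                current = (current[0], w.end)      # extend the open run
            else:
                if current is not None:
                    runs.append(current)
                current = (w.start, w.end)         # start a new run
        elif current is not None:
            runs.append(current)                   # another speaker ends the run
            current = None
    if current is not None:
        runs.append(current)
    return [(s, e) for s, e in runs if e - s >= min_len]


def total_minutes(segments: list[tuple[float, float]]) -> float:
    """Total usable audio across the kept segments, in minutes."""
    return sum(e - s for s, e in segments) / 60.0
```

Cutting a run whenever the other speaker talks is what keeps the training audio single-speaker; the separate Gemini check then catches anything the diarizer mislabeled.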

Clone	Trained on	Gemini likeness	Verdict
r3A FAILED	5 min call audio only	4/10 — "generic Oceanic"	VoIP-only source too narrow
r3B `Wrh70uw8jFy1g5IViE35` WINNER	5 min call + 38s isolated broadcast	8/10 likeness · 8/10 accent — "core pitch and gravelly timbre"	Best yet; Creative-stability take beat Natural 8/10 vs 4/10 on delivery

r3B winning take. eleven_v3 · stability 0.0 (Creative) · speaker boost. Driving the round-3 renders.

Real Jordan reference. The diarized call sample (5 min) the clone trained on.

Remaining gap (Gemini): "delivery too polished — the real Jordan has a casual, rambling cadence." Familiar-listener pass: not yet. Path up: mine more of the 13 other Jordan calls in Krisp → 30+ min → PVC (Professional Voice Clone) under Jordan's own ElevenLabs account (near-indistinguishable tier). More source into voice-source/ is still welcome, presenter-register audio especially.

4 · Lip-sync model landscape — they are not all created equal

Round-2/3 renders — sync-3 vs Sync 2.0, judge the lips

Round 2 · fal-ai/sync-lipsync/v2 · round-2 voice (variant B). Baseline.

Round 2 · fal-ai/sync-lipsync/v3 (sync-3) · same audio, sync.so's newest model. Compare mouth precision.

Round 3 · v2 · driven by the call-audio clone (r3B Creative).

Round 3 · sync-3 · the current best voice + newest lips.

Live catalog pulled from fal on 2026-07-03. Round 1 used Sync 2.0 — two generations behind sync.so's current best. Rounds 2–3 added sync-3 (above). HeyGen Precision still untested.

Model	Released	Pricing	Read
`fal-ai/sync-lipsync/v3` — sync-3 TESTED R2+R3	2026-04	unlisted (expect ≥ v2)	"Most powerful lipsync yet, native visual intelligence, professional-quality." Renders above — awaiting Robert's eye vs v2.
`fal-ai/heygen/v3/lipsync/precision` TEST NEXT	2026-04	unlisted	HeyGen's quality-first lipsync, no HeyGen subscription needed — served via fal. (A `/speed` variant exists too.)
`fal-ai/sync-lipsync/v2/pro` TEST NEXT	2025-09	unlisted	Higher-fidelity Sync 2. Cheap insurance if sync-3 disappoints.
`fal-ai/sync-lipsync/v2` USED	2025-04	$3/min	Solid baseline — produced this demo.
`fal-ai/sync-lipsync/react-1`	2025-12	unlisted	sync.so's expressive model (head/face reactions, not just lips).
`fal-ai/kling-video/lipsync/audio-to-video`	2025-03	$0.014/s (~$0.45/clip)	Cheap tier — candidate for high-volume personalized sends if quality holds.
`fal-ai/pixverse/lipsync`	2025-06	$0.04/s	Mid-cheap alternative.
`veed/lipsync`	2025-05	unlisted	Veed's model.
`fal-ai/latentsync`	2025-03	$0.20/clip ≤40s	Open-source, cheapest, visibly weakest — skip for client-facing.
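The catalog mixes per-minute, per-second and per-clip pricing, so a like-for-like comparison needs one normalizing step. A rough sketch for the 31.8s demo clip, assuming billing is linear in duration with no rounding up (the actual billing granularity is not stated here); models with unlisted pricing are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    unit: str      # "min", "s", or "clip"
    usd: float


# Listed prices from the catalog table; unlisted models are omitted.
PRICES = {
    "fal-ai/sync-lipsync/v2": Price("min", 3.00),
    "fal-ai/kling-video/lipsync/audio-to-video": Price("s", 0.014),
    "fal-ai/pixverse/lipsync": Price("s", 0.04),
    "fal-ai/latentsync": Price("clip", 0.20),   # flat rate, clips up to 40s
}


def clip_cost(model: str, seconds: float) -> float:
    """Estimated USD cost for one clip, assuming linear billing by duration."""
    price = PRICES[model]
    if price.unit == "min":
        return price.usd * seconds / 60.0
    if price.unit == "s":
        return price.usd * seconds
    return price.usd  # per-clip pricing ignores duration


for name in PRICES:
    print(f"{name}: ${clip_cost(name, 31.8):.2f}")
```

For the demo length this reproduces the $1.59 actually paid for Sync 2.0 and the ~$0.45 quoted for Kling.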

When there's no footage to re-drive (future video types)

Route	Pricing	Use case
`fal-ai/heygen/avatar5/digital-twin` — HeyGen Avatar 5	$0.10/s	The exact model from the Nate Herk video. Full digital twin from training footage — new framing/scenes without a camera. Via fal, no HeyGen sub.
`fal-ai/sync-lipsync/v3/image-to-video` — sync-3 Avatar	$0.13/s	Talking avatar from a single photo of Jordan.
Others: `infinitalk`, `echomimic-v3` ($0.20/s), `longcat`, `flashtalk`	varies	Bench candidates only.

5 · Scene generation — Veo vs Omni Flash, and how Omni was cracked

Second wave, same day. Goal: manufacture shots that were never filmed, anchored to Jordan's identity, on Robert's Google API key.

Veo 3.1 — works, but the audio problem is real

veo-3.1-generate-preview (predictLongRunning + poll) turned one reference frame into an 8s cinematic scene with native audio — but Robert's verdict: "the sound is totally wrong." Veo invents its own soundtrack; for personal videos that's a liability, not a feature. Caveat now documented on the Asset Hub: prompt "ambient only" or strip audio and lay the section-1 voice in post. ~$3.20 per 8s.
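The "strip audio and lay the voice in post" route is a single ffmpeg remux. A minimal sketch; the file names are placeholders and the AAC audio codec is an assumption.

```python
import subprocess


def replace_audio_cmd(video: str, voice: str, out: str) -> list[str]:
    """Build an ffmpeg command that keeps the video stream and swaps the audio."""
    return [
        "ffmpeg", "-y",
        "-i", video,          # generated clip with the unwanted soundtrack
        "-i", voice,          # narration to lay in instead
        "-map", "0:v:0",      # video from the first input
        "-map", "1:a:0",      # audio from the second input
        "-c:v", "copy",       # no re-encode, the picture stays untouched
        "-c:a", "aac",
        "-shortest",          # stop at the shorter of the two streams
        out,
    ]


def replace_audio(video: str, voice: str, out: str) -> None:
    """Run the remux; raises if ffmpeg exits non-zero."""
    subprocess.run(replace_audio_cmd(video, voice, out), check=True)
```

Copying the video stream keeps the step fast and lossless; only the audio track is encoded.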

Omni Flash — rejected the normal API, then delivered

First attempt failed usefully: gemini-omni-flash-preview via generateContent → HTTP 400 "This model only supports Interactions API." New API surface entirely (POST /v1beta/interactions, Responses-style objects).
Fix: Google ships an official skill for it — google-gemini/gemini-skills → gemini-omni-flash-api — with working scripts (generate_video.py, prep_video.py for 10s/720p normalization, batch mode, turn-by-turn edits via --previous-interaction-id). Needs google-genai>=2.10.0.
The test: one 1080p frame (jordan_ref_4s.jpg) + one sentence of direction → 8s: Jordan rises from the couch, walks to the window, camera arcs front→profile, room extends beyond anything the original camera saw. Identity held from a single frame. Ambient room tone only, as prompted.
Economics: $0.10/s → $0.80 per 8s shot, ~3 min end-to-end. Veo-class motion at a quarter of Veo's price. Also on fal; Google direct is faster.
Limits found in the docs: audio-reference conditioning (voice sample in → voice out) "coming soon"; real-video-upload edits geo-restricted in EEA/UK/some US states (empty result, no error).

The Omni proof. One still in → this out · gemini-omni-flash-preview · $0.80 · this shot was never filmed.

Veo comparison. Higher fidelity, 4× price, invented audio · veo-3.1-generate-preview · $3.20.

Round two — the showcase batch, pushed to the edges

One clip proves a route; it doesn't map the envelope. Five more jobs, one batch (generate_video.py --batch, concurrency 3), all anchored to the same reference frames — 42s, $4.20. Then a third round of three ($2.60) applied Robert's feedback same-day: an explicit reserved-delivery negative block in every prompt ("does NOT grin, minimal hand movement, not animated") and the space-identity fix. Clips below are the current best takes. Omni total: nine shots · 76s · $7.60.

1 · Exact-scripted dialogue (v2, reserved delivery). The prompt quotes his line verbatim — "He says exactly: 'Hi, I'm Jordan from Total Mortgages. Every part of this shot — even these exact words — was generated by AI.'" — and the model performs it, lips synced, hands still. 10s · $1.00.

2 · Snap transform (v2, reserved delivery). One contained snap and the living room converts to a rustic cabin around him — Jordan stays pixel-consistent and composed through the swap. 8s · $0.80.

3 · Drone reveal. Starts on the couch frame, pulls out through the window into an aerial New Zealand street reveal the original camera could never do. 8s · $0.80.

4 · Two-image interpolation. Start still (ref @4s) + end still (ref @28s) → the model builds the camera move between them. This is the shot-planner primitive. 8s · $0.80.

Dialogue verified, not vibed — twice: both takes' audio was round-tripped through OpenAI whisper-1. Transcript, both times: "Hi, I'm Jordan from Total Mortgages. Every part of this shot, even these exact words, was generated by AI." — word-for-word against the script in the prompt. You tell it the exact words; it says the exact words, even with the delivery constrained.
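The word-for-word check only needs a comparison that ignores case and punctuation, since the transcript above differs from the prompt in punctuation alone. A sketch, assuming the transcript text has already been fetched; the normalization rules are one reasonable choice, not necessarily the ones used in the session.

```python
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and split into words for comparison."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.replace("\u2019", "'")            # curly apostrophe -> straight
    text = re.sub(r"[^a-z0-9'\s]", " ", text)     # dashes, commas, periods -> space
    return text.split()


def matches_script(script: str, transcript: str) -> bool:
    """True when the transcript has the same words in the same order."""
    return normalize(script) == normalize(transcript)


SCRIPT = ("Hi, I'm Jordan from Total Mortgages. Every part of this shot \u2014 "
          "even these exact words \u2014 was generated by AI.")
HEARD = ("Hi, I'm Jordan from Total Mortgages. Every part of this shot, "
         "even these exact words, was generated by AI.")
print(matches_script(SCRIPT, HEARD))
```

A dropped or substituted word changes the word list and fails the check, which is the property that matters for scripted dialogue.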

Honest failure → same-day fix — the identity boundary: take 1 put Jordan on a space station with a spacesuit costume change (omni_space_station.mp4, one reference image). Scene: flawless. Face: not Jordan. Finding: room transformations, camera moves and scene extensions hold single-reference identity; extreme costume changes break it. Take 2 (omni_space_station_v2.mp4) attached all three reference frames + a wardrobe anchor ("STILL WEARING his exact sage-green blazer, no costume change") — identity held, frame-checked. Both takes embedded on the Asset Hub §5 as the fail→fix pair.
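Both fixes are standing text appended to a shot's direction, which suggests keeping them as constants. A sketch of that idea: the wording is adapted from the prompts quoted in this log, while the function name and the way the blocks are joined are assumptions.

```python
# Wording adapted from the prompts quoted in this log; how the real skill
# stores or joins these blocks is an assumption.
RESERVED_DELIVERY = "He does NOT grin, minimal hand movement, not animated."
WARDROBE_ANCHOR = "STILL WEARING his exact sage-green blazer, no costume change."


def build_prompt(direction: str, *, lock_wardrobe: bool = False) -> str:
    """Append the standing constraint blocks to a one-off shot direction."""
    parts = [direction.strip(), RESERVED_DELIVERY]
    if lock_wardrobe:
        parts.append(WARDROBE_ANCHOR)
    return " ".join(parts)


print(build_prompt("Jordan rises from the couch and walks to the window."))
```

Keeping the wardrobe anchor opt-in fits the finding above: ordinary room and camera shots held identity without it, and only the extreme costume change needed it.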

Both unlocks are now built, not proposed: the two-image mode became shot-planner.html — an interactive storyboard where each shot shows its cost live and exports the exact batch JSON generate_video.py consumes. And the Vox-style Remotion method became a finished video — section 6, next.
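The planner's costing and export logic can be approximated in a few lines. This is a sketch, not the contents of shot-planner.html: every JSON field name is hypothetical (the real schema is whatever generate_video.py expects), and jordan_ref_28s.jpg is an assumed file name.

```python
import json

OMNI_USD_PER_SECOND = 0.10   # rate quoted in the economics note


def shot_cost(seconds: float, rate: float = OMNI_USD_PER_SECOND) -> float:
    """Cost of one shot, rounded to cents."""
    return round(seconds * rate, 2)


def export_batch(shots: list[dict]) -> str:
    """Serialize a storyboard to JSON with per-shot and board-level cost.

    Field names are hypothetical, chosen for illustration only.
    """
    jobs = [{**shot, "cost_usd": shot_cost(shot["seconds"])} for shot in shots]
    board = {
        "jobs": jobs,
        "total_seconds": sum(s["seconds"] for s in shots),
        "total_usd": round(sum(j["cost_usd"] for j in jobs), 2),
    }
    return json.dumps(board, indent=2)


board = [
    {"name": "interpolate", "seconds": 8,
     "start_image": "jordan_ref_4s.jpg", "end_image": "jordan_ref_28s.jpg",
     "prompt": "Camera moves from the first framing to the second."},
    {"name": "dialogue", "seconds": 10, "start_image": "jordan_ref_4s.jpg",
     "prompt": "He delivers one scripted line to camera."},
]
print(export_batch(board))
```

Computing cost from duration at export time means the storyboard and the bill cannot drift apart.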

Feedback applied this session

Image set v2: "He's a bit more reserved than that — no pointing, no big toothy smile." Regenerated with the constraint hard-coded into the prompt (closed-mouth or gentle smile, relaxed brows, composed posture). thumb_pointing and the grinning house-keys shot retired from the hub; replaced by thumb_arms_folded, thumb_house_keys_calm, headshot_natural. Expression rule now baked into every thumbnail-factory prompt.
Asset Hub restyled to the total.nz brand — light theme, electric blue #1421FF, deep navy #003F5E, grotesk type — ready to sit on a shared link in front of Jordan.
Reserved rule extended to video prompts: "don't make him smiling all the damn time… there needs to be a negative prompt." Dialogue + snap regenerated with an explicit negative block; the rule that fixed the images fixes the video too. v1 takes retired from the hub.
Explainer VO "sounds echoey": diagnosis — eleven_v3 Creative (0.0) reads roomy on narration, and the clone was trained partly on Zoom-call audio. Fix: stability 0.5 (Natural) + an ElevenLabs /v1/audio-isolation pass per beat (beats under 5s get silence-padded, isolated, trimmed back), retimed and re-rendered. Now the standing VO recipe for narration.
"The assets page shows what you get — it is the result": Asset Hub restructured results-first — a §0 reel (explainer, exact-words dialogue, drone reveal) now leads the page before any ingredient.

6 · The Vox explainer — "Fixed vs Floating", built end-to-end

Asset Hub §6 stopped being a method write-up and became a video. Topic chosen for maximum Jordan-relevance: the one question every NZ mortgage holder asks.

The render (v2 voice). 49.4s · 1920×1080 h264 + AAC · 7 beats, each scene cut to its own narration line · Jordan's cloned voice throughout, de-echoed after feedback · Total palette (navy #003F5E / electric blue #1421FF / paper white).

Script → VO

7 beats written as the timeline, each spoken by the clone (Wrh70uw8jFy1g5IViE35 · eleven_v3) into its own mp3. VO v1 (Creative 0.0) sounded echoey — v2 recipe: stability 0.5 Natural + /v1/audio-isolation per beat (beat 4 was under the 5s isolation minimum: silence-padded, isolated, trimmed back).
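The pad, isolate, trim workaround for short beats reduces to two ffmpeg calls around the isolation request. A sketch of the local half only: the 0.25s safety margin and the file names are assumptions, and the isolation call itself is not shown.

```python
ISOLATION_MIN_S = 5.0   # minimum clip length cited for the isolation pass
MARGIN_S = 0.25         # safety margin; an assumption


def pad_needed(duration_s: float) -> float:
    """Seconds of trailing silence to add so the clip clears the minimum."""
    return round(max(0.0, ISOLATION_MIN_S + MARGIN_S - duration_s), 3)


def pad_cmd(src: str, dst: str, pad_s: float) -> list[str]:
    """ffmpeg command that appends pad_s seconds of silence."""
    return ["ffmpeg", "-y", "-i", src, "-af", f"apad=pad_dur={pad_s}", dst]


def trim_cmd(src: str, dst: str, original_s: float) -> list[str]:
    """ffmpeg command that cuts the isolated clip back to its original length."""
    return ["ffmpeg", "-y", "-i", src, "-t", f"{original_s:.3f}", dst]


beat4 = 144 / 30   # about 4.8s, the one beat under the minimum
print(pad_needed(beat4), pad_cmd("beat4.mp3", "beat4_padded.wav", pad_needed(beat4)))
```

Trimming back to the original duration matters because the frame counts are derived from each beat's length; leftover padding would shift every later scene.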

Timing method

ffprobe each beat mp3 → frame counts [213, 211, 175, 144, 178, 261, 298] @ 30fps → every Remotion <Sequence> starts exactly on its narration. Re-voicing = re-probe + re-render, nothing else moves. No alignment API, no keyframes.
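The timing rule is a running sum. A sketch in Python for consistency with the other examples (the Remotion side would consume the resulting numbers); rounding durations up to whole frames is an assumption about how the counts were derived.

```python
import math
import subprocess
from itertools import accumulate

FPS = 30


def probe_seconds(path: str) -> float:
    """Read a media file's duration with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())


def to_frames(seconds: float, fps: int = FPS) -> int:
    """Round up so a scene never ends before its narration does (assumption)."""
    return math.ceil(seconds * fps)


def sequence_starts(frame_counts: list[int]) -> list[int]:
    """Start frame of each beat: the running total of the beats before it."""
    return [0, *accumulate(frame_counts)][:-1]


beats = [213, 211, 175, 144, 178, 261, 298]   # frame counts from the log
print(sequence_starts(beats), sum(beats))
```

The counts sum to 1480 frames, matching the render length, and each start value is what a `<Sequence>` would take as its `from` frame.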

Visuals

3 halftone B&W cutouts (gemini-3-pro-image, reserved-expression rule applied) composited with mixBlendMode: multiply over a shared paper background — the MoSidd/Vox technique, all animation via spring() and interpolate()

Scenes

Title stack → house + padlock (certainty) → falling-rate chart + "BREAK FEES CAN APPLY" stamp → wave-riding dot (floating) → rising wave + $ bars → split-the-loan pie (65/35) → TOTAL. endcard

Render

Remotion 4, local — npx remotion render, 1480 frames, zero render cost, rendered twice (v1 + de-echoed v2) for free

Total cost

~$0.80 including the voice-revision round — 3 images ≈ $0.40 + two VO passes ≈ $0.30 + isolation ≈ $0.10 + $0 render. Still 2.5× under the original "under $2" estimate, revision included.

What this proves: the full agency-grade explainer pipeline — script → cloned VO → brand-locked motion graphics → 1080p master — runs end-to-end in Claude Code in under an hour, for under a dollar. A motion-design agency quotes $2–5K and two weeks for this exact deliverable.

№2 — "First home. Five steps." · the skill's first production run

Explainer №2 was built through the freshly written total-video skill rather than by hand — the point of the skill is that №2 costs a fraction of №1 in both dollars and steps.

The render. 59.0s · 1920×1080 h264 + AAC · mean −21.6 dB · 7 beats (hook → deposit/KiwiSaver → pre-approval → the hunt → structure → settlement → endcard), each scene cut to its own narration line.

VO

One script calling speak(text, {mode:"narration"}) per beat — stability 0.5 Natural + audio-isolation now run by default inside the skill, no manual de-echo step. 7/7 beats first try.

Timing

beatTimings() → frames [195, 252, 273, 228, 288, 281, 252] @30fps → 1769 frames. Same rule as №1: every <Sequence> starts on its narration line.

Visuals

Second Remotion composition (FirstHome) reusing №1's primitives and halftone cutouts — numbered step badges, coin stack, pre-approval card, magnifier hunt, fixed/float bars, settlement key. Zero new images. Two layout collisions caught by frame-extraction QA, fixed, re-rendered (free).

Verification

Two-pass: whisper-1 on the full render, then an isolated-beat re-check on the one flagged phrase — confirmed "first-home buyers" spoken correctly (the full-pass flag was a transcription mishear, not a voice error).

Total cost

~$0.25 — voice only. №1 cost ~$0.80 with images and a revision round; №2 reused the visual system and rendered clean on pass two. The marginal cost of an explainer is now the narration.

7 · What was verified (and what wasn't)

✓ Output audio independently transcribed (OpenAI whisper-1) → matches the demo script word-for-word
✓ Footage is true passthrough — original frames at 4s/16s/28s match AI output shots at same timestamps (multi-shot agency edit preserved)
✓ Duration 31.8s = audio length (cut_off) · 1920×1080 h264 · plays in QuickTime/browser
✓ Omni dialogue clips (v1 and v2) transcribed with whisper-1 → both match the scripted line word-for-word (section 5)
✓ Explainer №2 narration verified two-pass — whisper-1 on the full 59s render, isolated-beat re-check on the one flagged phrase (section 6)
✓ Space-station v2 frame-checked → identity held with multi-ref + wardrobe anchor (v1 drift documented as the boundary)
✓ Explainer v2 render probed: 49.37s · 1480 frames @ 30fps · h264 1920×1080 + AAC (mean −20.8 dB) · beats retimed to the de-echoed VO, frames spot-checked
✓ Shot planner opened headless (Chrome) — board renders, per-shot + board costs compute, batch JSON generates
✗ Lip-timing in motion not machine-verified — needs your eyes before Jordan ever sees it
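✗ Voice likeness: the demo's round-1 clone is known-bad; the round-3 clone scores 8/10 on machine audit but has not yet passed a familiar listener (see section 3)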
✗ Explainer A/V feel (pacing, VO energy) — machine-verified only; needs a human watch-through
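The passthrough check on the footage can be sketched as a frame comparison at matching timestamps. The downscale size and the difference threshold are assumptions: the mouth region changes by design, so the frames are expected to be close, not identical.

```python
import subprocess


def grab_gray_frame(video: str, at_seconds: float, size: str = "320x180") -> bytes:
    """Decode one frame at a timestamp as raw 8-bit grayscale, downscaled."""
    cmd = ["ffmpeg", "-v", "error", "-ss", str(at_seconds), "-i", video,
           "-frames:v", "1", "-s", size, "-pix_fmt", "gray",
           "-f", "rawvideo", "-"]
    return subprocess.run(cmd, capture_output=True, check=True).stdout


def mean_abs_diff(a: bytes, b: bytes) -> float:
    """Average per-pixel difference on a 0-255 scale."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def same_shot(original: str, output: str, at_seconds: float,
              threshold: float = 8.0) -> bool:
    """True when the two frames are near-identical; threshold is an assumption."""
    return mean_abs_diff(grab_gray_frame(original, at_seconds),
                         grab_gray_frame(output, at_seconds)) < threshold
```

Running `same_shot` at 4s, 16s and 28s mirrors the spot-check listed above. It confirms the edit structure survived, but says nothing about lip timing, which still needs a human watch-through.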

8 · Next steps

#	Step	Owner	Status
1	Watch: the Omni showcase (section 5), the explainer (section 6), the round-3 renders (section 4) — then review the Asset Hub before it goes on a shared link	Robert	WAITING
2	Vox-style explainers — №1 "Fixed vs Floating" (~$0.80) and №2 "First home. Five steps." (~$0.25, produced through the skill), both in section 6	Claude	DONE
3	Shot planner — done, shot-planner.html · costed storyboard → batch JSON	Claude	DONE
4	Space-station retry — done, identity held (multi-ref + wardrobe anchor) · fail→fix pair on Asset Hub §5	Claude	DONE
5	Mine remaining 13 Jordan calls in Krisp → 30+ min audio → PVC-grade source	Claude	READY
6	Test `fal-ai/heygen/v3/lipsync/precision` (last untested top-tier lips)	Claude	QUEUED
7	Wrap pipeline into `total-video` skill file (Asset Hub = the spec) — done: `~/.claude/skills/total-video/` · speak (clip / de-echoed narration), Omni shot + batch with the calm-demeanor block and 3-ref identity lock as exported constants, whisper-1 verify, explainer timings · CLI smoke-tested against the shipped explainer (1480 frames reproduced)	Claude	BUILT
8	Then: Slack pitch to Jordan / personalized "Hi Sarah" variant / PVC under Jordan's account	both	HOLD

The strategic frame, one line: Nate Herk's video sells the model; the value is the pipeline. Jordan's one agency shoot becomes a permanent, re-drivable avatar — and wired into Total CRM, every deal stage can send a personal video from Jordan that he never films.