→ Jordan's AI Asset Hub — the deliverable page: his voice clone ready to use, the AI image set, thumbnail prompts, and the video-route menu. This page stays the lab notebook; that page is what Jordan works from.
| Stage | Model / tool | Provider | Params / notes | Cost |
|---|---|---|---|---|
| Lip-sync USED | fal-ai/sync-lipsync/v2 (Sync Lipsync 2.0) | sync.so, served via fal | sync_mode: cut_off · 1080p in/out · render 4.5 min · fal request 019f290a-d520-7630-bd1a-7feaaa833923 | $3/min → $1.59 |
| Voice clone WEAK LINK | Instant Voice Clone (IVC) | ElevenLabs (Creator tier) | voice_id 7dJJNWgd2NA52Rn4OrPH · trained on only 38s of audio ripped from the welcome video, which is why it doesn't sound right | plan credits |
| Text-to-speech | eleven_multilingual_v2 | ElevenLabs | stability 0.55 · similarity 0.85 · style 0.1 · speed 0.82 + <break> tags to match Jordan's measured 163 wpm (first attempt read at 246 wpm) | ~1.2K chars |
| Script | Claude (Opus-class) | Anthropic | 78 words, written to fit under the 38s driving footage | — |
| Assembly | ffmpeg 8.1.1 + Pillow labels | local | hstack A/B, label overlays (this ffmpeg build lacks drawtext) | — |
| Verification | whisper-1 round-trip | OpenAI | output audio independently transcribed back → matches script; frames matched against original at same timestamps | ~$0.01 |
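
The pacing numbers in the table can be sanity-checked with simple arithmetic. A minimal sketch using only figures from the rows above (78 words, 38s of footage, 163 wpm and 246 wpm); the helper name `speech_seconds` is made up for illustration and the result ignores any time added by <break> tags.

```python
def speech_seconds(word_count: int, words_per_minute: float) -> float:
    """Seconds needed to read word_count words at a steady pace."""
    return word_count / words_per_minute * 60.0


SCRIPT_WORDS = 78      # Script row: 78 words
FOOTAGE_SECONDS = 38   # length of the driving footage
JORDAN_WPM = 163       # Jordan's measured pace
FIRST_TAKE_WPM = 246   # pace of the first TTS attempt

at_jordan_pace = speech_seconds(SCRIPT_WORDS, JORDAN_WPM)
at_first_take = speech_seconds(SCRIPT_WORDS, FIRST_TAKE_WPM)

print(f"at Jordan's pace: {at_jordan_pace:.1f}s")
print(f"at first-take pace: {at_first_take:.1f}s")
print("fits under the footage:", at_jordan_pace <= FOOTAGE_SECONDS)
```

At Jordan's pace the script runs roughly 29s, which leaves headroom under the 38s clip; the first attempt would have finished in about 19s, which is why it read as rushed.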
IVC trained on 38s ripped straight from the welcome video, spoken with eleven_multilingual_v2. Two root causes found after the fact:

- ffmpeg silencedetect found zero silence gaps. The clone was learning voice+music as one sound.
- eleven_multilingual_v2 flattens the NZ accent toward generic American: measured 2/10 accent in the Gemini audit. Model choice mattered more than any setting.

Fixes applied: ElevenLabs audio-isolation stripped the music bed (0 → 9 silence gaps), re-clone from isolated audio, and an A/B across TTS models judged blind by Gemini 2.5 Pro against the isolated reference:

| Variant | TTS model | Gemini verdict |
|---|---|---|
| A FAILED | eleven_multilingual_v2 | 2/10 accent, americanized |
| B WINNER | eleven_v3 | 8/10 accent, NZ preserved |
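
The silence-gap check (0 gaps before isolation, 9 after) can be reproduced roughly as follows. This is a sketch, not the exact command used here: the noise floor (-30 dB) and minimum gap (0.3s) are assumed thresholds, and `SAMPLE_LOG` is an invented log excerpt that only illustrates the line format silencedetect writes to stderr.

```python
import re
import subprocess


def run_silencedetect(path: str, noise_db: int = -30, min_silence: float = 0.3) -> str:
    """Run ffmpeg's silencedetect filter and return its log (ffmpeg logs to stderr)."""
    cmd = [
        "ffmpeg", "-hide_banner", "-nostats", "-i", path,
        "-af", f"silencedetect=noise={noise_db}dB:d={min_silence}",  # assumed thresholds
        "-f", "null", "-",
    ]
    return subprocess.run(cmd, capture_output=True, text=True).stderr


def count_silence_gaps(ffmpeg_log: str) -> int:
    """Count completed silence intervals (one silence_end line per gap)."""
    return len(re.findall(r"silence_end:\s*-?[\d.]+", ffmpeg_log))


# Invented excerpt, for illustration only.
SAMPLE_LOG = """
[silencedetect @ 0x0] silence_start: 4.2
[silencedetect @ 0x0] silence_end: 4.9 | silence_duration: 0.7
[silencedetect @ 0x0] silence_start: 12.1
[silencedetect @ 0x0] silence_end: 12.6 | silence_duration: 0.5
"""

print(count_silence_gaps(SAMPLE_LOG))  # 2
```

Usage on a real file would be `count_silence_gaps(run_silencedetect("clip.wav"))`, where `clip.wav` is a placeholder name. A count of zero on speech that should contain pauses is the signal that a music bed is filling the gaps.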
Robert's ear still said no — and the why matters: Gemini measured distance to the 38s clip; Robert measured distance to the real Jordan. An instant clone from 38s of broadcast-processed audio hits its ceiling below the familiar-listener bar. The fix was never settings — it was more raw audio.
Mined the 2026-05-15 Jordan project session (Krisp recording, 68 min). scribe_v1 diarization split the two speakers; Jordan = speaker_0, verified by matching his transcript lines. 68 clean solo segments ≥3s → 8.8 min of Jordan, machine-checked single-speaker (Gemini: no foreign voices, no music).
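
The segment-selection rule (Jordan only, at least 3s, nobody talking over him) can be sketched like this. The `Segment` shape and the toy data are assumptions for illustration; the real diarization output from scribe_v1 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def clean_solo_segments(segments, speaker="speaker_0", min_len=3.0):
    """Keep `speaker` segments of at least min_len seconds that no other speaker overlaps."""
    others = [s for s in segments if s.speaker != speaker]
    keep = []
    for seg in segments:
        if seg.speaker != speaker or seg.end - seg.start < min_len:
            continue
        overlapped = any(o.start < seg.end and o.end > seg.start for o in others)
        if not overlapped:
            keep.append(seg)
    return keep


def total_minutes(segments) -> float:
    """Total duration of the given segments, in minutes."""
    return sum(s.end - s.start for s in segments) / 60.0


# Toy data, not taken from the real call.
demo = [
    Segment("speaker_0", 0.0, 5.0),    # dropped: speaker_1 overlaps it
    Segment("speaker_1", 4.0, 6.0),
    Segment("speaker_0", 10.0, 12.0),  # dropped: shorter than 3s
    Segment("speaker_0", 20.0, 26.0),  # kept
]
kept = clean_solo_segments(demo)
print(len(kept), round(total_minutes(kept), 2))
```

On the real session this kind of filter is what reduced 68 minutes of call to 68 usable segments totalling 8.8 min.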
| Clone | Trained on | Gemini likeness | Verdict |
|---|---|---|---|
| r3A FAILED | 5 min call audio only | 4/10 — "generic Oceanic" | VoIP-only source too narrow |
| r3B Wrh70uw8jFy1g5IViE35 WINNER | 5 min call + 38s isolated broadcast | 8/10 likeness · 8/10 accent · "core pitch and gravelly timbre" | Best yet; Creative-stability take beat Natural 8/10 vs 4/10 on delivery |

eleven_v3 · stability 0.0 (Creative) · speaker boost. Driving the round-3 renders.

Remaining gap (Gemini): "delivery too polished; the real Jordan has a casual, rambling cadence." Familiar-listener pass: not yet. Path up: mine more of the 13 other Jordan calls in Krisp → 30+ min → PVC under Jordan's own ElevenLabs account (near-indistinguishable tier). More source into voice-source/ still welcome, presenter-register audio especially.
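
For reference, the round-3 voice settings map onto an ElevenLabs text-to-speech request roughly as below. This sketch only builds the request and sends nothing; the endpoint path and the `voice_settings` field names reflect the public ElevenLabs API as I understand it and should be checked against the current reference, and the helper `tts_request` is hypothetical.

```python
VOICE_ID = "Wrh70uw8jFy1g5IViE35"  # r3B clone from the table above
MODEL_ID = "eleven_v3"


def tts_request(text: str, stability: float = 0.0) -> dict:
    """Build (but do not send) a text-to-speech request for the r3B clone."""
    return {
        "method": "POST",
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        "json": {
            "text": text,
            "model_id": MODEL_ID,
            "voice_settings": {
                "stability": stability,      # 0.0 = Creative, 0.5 = Natural
                "use_speaker_boost": True,   # "speaker boost" in the notes above
            },
        },
    }


creative = tts_request("Hi, I'm Jordan.")
natural = tts_request("Hi, I'm Jordan.", stability=0.5)
print(creative["json"]["voice_settings"])
```

Keeping the two stability values behind one parameter makes the Creative vs Natural comparison a one-argument change.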
- fal-ai/sync-lipsync/v2 · round-2 voice (variant B). Baseline.
- fal-ai/sync-lipsync/v3 (sync-3) · same audio, sync.so's newest model. Compare mouth precision.

Live catalog pulled from fal on 2026-07-03. Round 1 used Sync 2.0, two generations behind sync.so's current best. Rounds 2–3 added sync-3 (above). HeyGen Precision still untested.

| Model | Released | Pricing | Read |
|---|---|---|---|
| fal-ai/sync-lipsync/v3 (sync-3) TESTED R2+R3 | 2026-04 | unlisted (expect ≥ v2) | "Most powerful lipsync yet, native visual intelligence, professional-quality." Renders above, awaiting Robert's eye vs v2. |
| fal-ai/heygen/v3/lipsync/precision TEST NEXT | 2026-04 | unlisted | HeyGen's quality-first lipsync, no HeyGen subscription needed; served via fal. (A /speed variant exists too.) |
| fal-ai/sync-lipsync/v2/pro TEST NEXT | 2025-09 | unlisted | Higher-fidelity Sync 2. Cheap insurance if sync-3 disappoints. |
| fal-ai/sync-lipsync/v2 USED | 2025-04 | $3/min | Solid baseline; produced this demo. |
| fal-ai/sync-lipsync/react-1 | 2025-12 | unlisted | sync.so's expressive model (head/face reactions, not just lips). |
| fal-ai/kling-video/lipsync/audio-to-video | 2025-03 | $0.014/s (~$0.45/clip) | Cheap tier: candidate for high-volume personalized sends if quality holds. |
| fal-ai/pixverse/lipsync | 2025-06 | $0.04/s | Mid-cheap alternative. |
| veed/lipsync | 2025-05 | unlisted | Veed's model. |
| fal-ai/latentsync | 2025-03 | $0.20/clip ≤40s | Open-source, cheapest, visibly weakest; skip for client-facing. |
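
To compare the priced rows on equal footing, the per-second rates can be applied to one clip length. A small sketch: the 31.8s clip length is inferred from the $1.59 charge at $3/min, not measured, and only models with a listed per-time price are included.

```python
CLIP_SECONDS = 31.8  # inferred: $1.59 at $3/min is 0.53 min

PER_SECOND_USD = {
    "fal-ai/sync-lipsync/v2": 3.00 / 60,                  # $3/min
    "fal-ai/kling-video/lipsync/audio-to-video": 0.014,   # $0.014/s
    "fal-ai/pixverse/lipsync": 0.04,                      # $0.04/s
}


def clip_cost(model: str, seconds: float = CLIP_SECONDS) -> float:
    """Cost in USD of one clip of the given length, rounded to cents."""
    return round(PER_SECOND_USD[model] * seconds, 2)


for model in PER_SECOND_USD:
    print(f"{model}: ${clip_cost(model):.2f}")
```

The result agrees with the table: about $1.59 for Sync 2.0 and about $0.45 for Kling on the same clip, with Pixverse in between.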
| Route | Pricing | Use case |
|---|---|---|
| fal-ai/heygen/avatar5/digital-twin (HeyGen Avatar 5) | $0.10/s | The exact model from the Nate Herk video. Full digital twin from training footage: new framing/scenes without a camera. Via fal, no HeyGen sub. |
| fal-ai/sync-lipsync/v3/image-to-video (sync-3 Avatar) | $0.13/s | Talking avatar from a single photo of Jordan. |
| Others: infinitalk, echomimic-v3 ($0.20/s), longcat, flashtalk | varies | Bench candidates only. |
Second wave, same day. Goal: manufacture shots that were never filmed, anchored to Jordan's identity, on Robert's Google API key.
veo-3.1-generate-preview (predictLongRunning + poll) turned one reference frame into an 8s cinematic scene with native audio — but Robert's verdict: "the sound is totally wrong." Veo invents its own soundtrack; for personal videos that's a liability, not a feature. Caveat now documented on the Asset Hub: prompt "ambient only" or strip audio and lay the section-1 voice in post. ~$3.20 per 8s.
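
The "predictLongRunning + poll" pattern boils down to a loop over a long-running operation. A generic sketch under stated assumptions: the operation is a dict with `done`, `error` and `response` keys (the usual Google long-running operation shape), and `get_operation` is a caller-supplied function that fetches the current state. No real endpoint is called here, and the interval and timeout values are arbitrary.

```python
import time
from typing import Callable


def wait_for_operation(
    get_operation: Callable[[], dict],
    interval_s: float = 10.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the operation reports done, then return its response payload."""
    waited = 0.0
    while True:
        op = get_operation()
        if op.get("done"):
            if "error" in op:
                raise RuntimeError(f"generation failed: {op['error']}")
            return op.get("response", {})
        if waited >= timeout_s:
            raise TimeoutError(f"operation not done after {timeout_s:.0f}s")
        sleep(interval_s)
        waited += interval_s


# Stand-in for the real status call: two pending polls, then a finished operation.
_states = iter([{"done": False}, {"done": False}, {"done": True, "response": {"video": "ok"}}])
result = wait_for_operation(lambda: next(_states), interval_s=1, sleep=lambda s: None)
print(result)
```

Injecting `get_operation` and `sleep` keeps the loop testable without network access or real waiting.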
- gemini-omni-flash-preview via generateContent → HTTP 400 "This model only supports Interactions API." New API surface entirely (POST /v1beta/interactions, Responses-style objects).
- google-gemini/gemini-skills → gemini-omni-flash-api, with working scripts (generate_video.py, prep_video.py for 10s/720p normalization, batch mode, turn-by-turn edits via --previous-interaction-id). Needs google-genai>=2.10.0.
- One reference frame (jordan_ref_4s.jpg) + one sentence of direction → 8s: Jordan rises from the couch, walks to the window, camera arcs front→profile, room extends beyond anything the original camera saw. Identity held from a single frame. Ambient room tone only, as prompted.

gemini-omni-flash-preview · $0.80 · this shot was never filmed.

veo-3.1-generate-preview · $3.20.

One clip proves a route; it doesn't map the envelope. Five more jobs, one batch (generate_video.py --batch, concurrency 3), all anchored to the same reference frames: 42s, $4.20. Then a third round of three ($2.60) applied Robert's feedback same-day: an explicit reserved-delivery negative block in every prompt ("does NOT grin, minimal hand movement, not animated") and the space-identity fix. Clips below are the current best takes. Omni total: nine shots · 76s · $7.60.
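
The Omni totals reconcile if the three rounds are tallied in cents. A minimal check: shot counts and dollar figures come from the text above, while the 26s for the third round is derived (76s total minus 8s and 42s), not stated directly.

```python
# (label, shots, seconds, cost in cents)
ROUNDS = [
    ("first shot", 1, 8, 80),
    ("batch of five", 5, 42, 420),
    ("feedback round", 3, 26, 260),  # 26s is derived from the 76s total
]

shots = sum(r[1] for r in ROUNDS)
seconds = sum(r[2] for r in ROUNDS)
cents = sum(r[3] for r in ROUNDS)

print(f"{shots} shots, {seconds}s, ${cents / 100:.2f}")
print(f"effective rate: ${cents / seconds / 100:.2f}/s")
```

Every round works out to the same $0.10 per generated second, against $0.40 per second for the Veo clip ($3.20 for 8s).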
Dialogue verified, not vibed — twice: both takes' audio was round-tripped through OpenAI whisper-1. Transcript, both times: "Hi, I'm Jordan from Total Mortgages. Every part of this shot, even these exact words, was generated by AI." — word-for-word against the script in the prompt. You tell it the exact words; it says the exact words, even with the delivery constrained.
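
The comparison step of that round-trip can be sketched as a normalised word-by-word match. The transcription call itself is not shown; `transcript` stands for whatever text the speech-to-text pass returns, and the normalisation rules (lowercase, punctuation stripped) are an assumption about what "word-for-word" should tolerate.

```python
import re
import unicodedata

SCRIPT = (
    "Hi, I'm Jordan from Total Mortgages. Every part of this shot, "
    "even these exact words, was generated by AI."
)


def words(text: str) -> list:
    """Lowercase the text and split it into words, ignoring punctuation."""
    text = unicodedata.normalize("NFKC", text).lower().replace("\u2019", "'")
    return re.findall(r"[a-z0-9']+", text)


def matches_script(transcript: str, script: str = SCRIPT) -> bool:
    """True when the transcript contains exactly the scripted words, in order."""
    return words(transcript) == words(script)


transcript = "hi i'm Jordan from Total Mortgages, every part of this shot even these exact words was generated by AI"
print(matches_script(transcript))  # True
```

Comparing word lists rather than raw strings avoids false failures from punctuation or capitalisation differences in the transcript.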
Honest failure → same-day fix — the identity boundary: take 1 put Jordan on a space station with a spacesuit costume change (omni_space_station.mp4, one reference image). Scene: flawless. Face: not Jordan. Finding: room transformations, camera moves and scene extensions hold single-reference identity; extreme costume changes break it. Take 2 (omni_space_station_v2.mp4) attached all three reference frames + a wardrobe anchor ("STILL WEARING his exact sage-green blazer, no costume change") — identity held, frame-checked. Both takes embedded on the Asset Hub §5 as the fail→fix pair.
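
The fix amounts to composing every prompt from the scene plus fixed guard blocks and always attaching all three references. A sketch only: the dict keys below are hypothetical and are not the batch JSON schema that generate_video.py consumes, and `ref_b.jpg` / `ref_c.jpg` are placeholder file names (only jordan_ref_4s.jpg is named in these notes).

```python
CALM_DEMEANOR = "He does NOT grin, minimal hand movement, not animated."
WARDROBE_ANCHOR = "STILL WEARING his exact sage-green blazer, no costume change."

# Placeholder names except the first; the real three-frame set is not listed here.
REFERENCE_FRAMES = ["jordan_ref_4s.jpg", "ref_b.jpg", "ref_c.jpg"]


def build_shot(scene: str, lock_wardrobe: bool = True) -> dict:
    """Compose one shot job: scene text plus the standing guard blocks."""
    parts = [scene.strip(), CALM_DEMEANOR]
    if lock_wardrobe:
        parts.append(WARDROBE_ANCHOR)
    return {
        "prompt": " ".join(parts),                    # hypothetical key
        "reference_images": list(REFERENCE_FRAMES),   # hypothetical key
    }


job = build_shot("Jordan floats through a space station module, camera tracking alongside.")
print(job["prompt"])
```

Keeping the guard text in constants means a wording change applies to every future shot without editing individual prompts.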
Both unlocks are now built, not proposed: the two-image mode became shot-planner.html — an interactive storyboard where each shot shows its cost live and exports the exact batch JSON generate_video.py consumes. And the Vox-style Remotion method became a finished video — section 6, next.
- thumb_pointing and the grinning house-keys shot retired from the hub; replaced by thumb_arms_folded, thumb_house_keys_calm, headshot_natural. Expression rule now baked into every thumbnail-factory prompt.
- #1421FF, deep navy #003F5E, grotesk type, ready to sit on a shared link in front of Jordan.
- eleven_v3 Creative (0.0) reads roomy on narration, and the clone was trained partly on Zoom-call audio. Fix: stability 0.5 (Natural) + an ElevenLabs /v1/audio-isolation pass per beat (beats under 5s get silence-padded, isolated, trimmed back), retimed and re-rendered. Now the standing VO recipe for narration.

Asset Hub §6 stopped being a method write-up and became a video. Topic chosen for maximum Jordan-relevance: the one question every NZ mortgage holder asks.

- Each beat voiced (Wrh70uw8jFy1g5IViE35 · eleven_v3) into its own mp3. VO v1 (Creative 0.0) sounded echoey; v2 recipe: stability 0.5 Natural + /v1/audio-isolation per beat (beat 4 was under the 5s isolation minimum: silence-padded, isolated, trimmed back).
- ffprobe each beat mp3 → frame counts [213, 211, 175, 144, 178, 261, 298] @ 30fps → every Remotion <Sequence> starts exactly on its narration. Re-voicing = re-probe + re-render, nothing else moves. No alignment API, no keyframes.
- Halftone cutouts (gemini-3-pro-image, reserved-expression rule applied) composited with mixBlendMode: multiply over a shared paper background: the MoSidd/Vox technique, all animation via spring() and interpolate().
- npx remotion render, 1480 frames, zero render cost, rendered twice (v1 + de-echoed v2) for free.

What this proves: the full agency-grade explainer pipeline (script → cloned VO → brand-locked motion graphics → 1080p master) runs end-to-end in Claude Code in under an hour, for under a dollar. A motion-design agency quotes $2–5K and two weeks for this exact deliverable.
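
The timing rule (probe each beat, convert to frames, start each sequence where the previous one ends) can be sketched as follows. The frame counts are the ones listed above; the ffprobe invocation is a standard way to read a file's duration, and rounding up to whole frames is an assumption about how the counts were derived.

```python
import math
import subprocess
from itertools import accumulate

FPS = 30
BEAT_FRAMES = [213, 211, 175, 144, 178, 261, 298]  # from the notes above


def probe_seconds(path: str) -> float:
    """Read a media file's duration in seconds with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())


def frames_for(duration_s: float, fps: int = FPS) -> int:
    """Whole frames needed to cover the audio (rounded up, by assumption)."""
    return math.ceil(duration_s * fps)


def sequence_starts(frame_counts):
    """Start frame of each beat: the running total of all earlier beats."""
    return [0, *accumulate(frame_counts)][:-1]


starts = sequence_starts(BEAT_FRAMES)
total_frames = sum(BEAT_FRAMES)
print(starts)
print(total_frames, f"{total_frames / FPS:.1f}s")
```

The listed counts sum to the 1480 frames that were rendered, about 49s at 30fps, and beat 4 at 144 frames is 4.8s, which is the beat that fell under the 5s isolation minimum.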
Explainer №2 was built through the freshly written total-video skill rather than by hand — the point of the skill is that №2 costs a fraction of №1 in both dollars and steps.
- speak(text, {mode:"narration"}) per beat: stability 0.5 Natural + audio-isolation now run by default inside the skill, no manual de-echo step. 7/7 beats first try.
- beatTimings() → frames [195, 252, 273, 228, 288, 281, 252] @30fps → 1769 frames. Same rule as №1: every <Sequence> starts on its narration line.
- New composition (FirstHome) reusing №1's primitives and halftone cutouts: numbered step badges, coin stack, pre-approval card, magnifier hunt, fixed/float bars, settlement key. Zero new images. Two layout collisions caught by frame-extraction QA, fixed, re-rendered (free).
- whisper-1 → both match the scripted line word-for-word (section 5)

| # | Step | Owner | Status |
|---|---|---|---|
| 1 | Watch: the Omni showcase (section 5), the explainer (section 6), the round-3 renders (section 4) — then review the Asset Hub before it goes on a shared link | Robert | WAITING |
| 2 | Vox-style explainers — №1 "Fixed vs Floating" (~$0.80) and №2 "First home. Five steps." (~$0.25, produced through the skill), both in section 6 | Claude | DONE |
| 3 | Shot planner — done, shot-planner.html · costed storyboard → batch JSON | Claude | DONE |
| 4 | Space-station retry — done, identity held (multi-ref + wardrobe anchor) · fail→fix pair on Asset Hub §5 | Claude | DONE |
| 5 | Mine remaining 13 Jordan calls in Krisp → 30+ min audio → PVC-grade source | Claude | READY |
| 6 | Test fal-ai/heygen/v3/lipsync/precision (last untested top-tier lips) | Claude | QUEUED |
| 7 | Wrap pipeline into total-video skill file (Asset Hub = the spec) — done: ~/.claude/skills/total-video/ · speak (clip / de-echoed narration), Omni shot + batch with the calm-demeanor block and 3-ref identity lock as exported constants, whisper-1 verify, explainer timings · CLI smoke-tested against the shipped explainer (1480 frames reproduced) | Claude | BUILT |
| 8 | Then: Slack pitch to Jordan / personalized "Hi Sarah" variant / PVC under Jordan's account | both | HOLD |
The strategic frame, one line: Nate Herk's video sells the model; the value is the pipeline. Jordan's one agency shoot becomes a permanent, re-drivable avatar — and wired into Total CRM, every deal stage can send a personal video from Jordan that he never films.