Designing the Video Processing Backend: A Conversation With a Senior Engineer
Building SportsSync — Part 11
Why I Needed Help
Ten episodes of building SportsSync with AI tools, and I'd hit a wall I couldn't prompt my way through. The client-side demo works — you can see cycling telemetry overlaid on a YouTube video in real time. But the actual product needs to produce downloadable videos with the overlay baked in. That's server-side video processing, and I have zero backend experience.
I called Tirth Kajer, a full-stack engineer who runs a software development agency in India. We worked together at a previous company as frontend engineers, but he's since moved into full-stack work and now builds MVPs for early-stage startups. If anyone could sketch out a realistic architecture in 30 minutes, it was him.
The Processing Pipeline
The core workflow Tirth outlined is straightforward in concept:
- User submits: YouTube video URL, GPX file, sync point, and desired clip range
- Server receives: Creates a job record in the database, uploads GPX to S3, pushes a job ID to a message queue
- Worker picks up the job: Downloads the YouTube video, converts GPX to time-series data, detects the video's frame rate
- FFmpeg processes: Applies telemetry data as an overlay on each frame, re-encoding the entire video segment
- Output goes to S3: The rendered video with embedded overlays is stored, temporary files are cleaned up
- User downloads: The finished video is available for download or sharing
The key insight: this is an asynchronous operation. Video processing is compute-intensive — a two-minute clip might take several minutes to render. You can't make the user wait. Instead, you submit the job, show a "processing" status, and notify them when it's done.
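To make the pattern concrete, here's a minimal sketch of the submit-then-poll contract, with an in-memory dict standing in for the job table. All names here are illustrative, not SportsSync's actual API:

```python
import uuid

# In-memory stand-in for the jobs table described below.
JOBS: dict[str, dict] = {}

def submit_render_job(youtube_url: str, gpx_key: str, clip_range: tuple[float, float]) -> str:
    """Record the job and return immediately with an ID the client can poll."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {
        "status": "pending",   # pending -> processing -> complete / failed
        "youtube_url": youtube_url,
        "gpx_key": gpx_key,       # S3 key of the uploaded GPX file
        "clip_range": clip_range, # (start_s, end_s) of the desired clip
        "output_key": None,       # S3 key of the rendered video, set on completion
    }
    return job_id

def job_status(job_id: str) -> dict:
    """What the frontend polls while the render runs."""
    job = JOBS[job_id]
    return {"status": job["status"], "output_key": job["output_key"]}
```

The frontend calls `submit_render_job`, gets an ID back immediately, and polls `job_status` until the status flips to complete.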
The Architecture: Start Simple, Scale Later
Tirth's advice was pragmatic: don't build the production architecture on day one.
MVP approach (single EC2 instance):
- One server running everything: API, worker, database
- Docker Compose to manage services
- Jobs tracked in a database table with status flags (pending, processing, complete)
- Worker processes jobs sequentially — first in, first out
- Files stored on the EC2 disk temporarily, final output on S3
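The sequential worker can start as a simple polling loop against that status column. A sketch using SQLite as a stand-in for the real database (schema and function names are my own, not from the call):

```python
import sqlite3
import time

SCHEMA = """CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'pending',  -- pending / processing / complete / failed
    payload TEXT NOT NULL,
    created_at REAL NOT NULL
)"""

def claim_next_job(conn: sqlite3.Connection):
    """Claim the oldest pending job (first in, first out); None if the queue is empty."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE status = 'pending' ORDER BY created_at LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # The status guard in the WHERE clause keeps two workers
    # from claiming the same job.
    cur = conn.execute(
        "UPDATE jobs SET status = 'processing' WHERE id = ? AND status = 'pending'",
        (row[0],),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None

def worker_loop(conn: sqlite3.Connection) -> None:
    while True:
        job_id = claim_next_job(conn)
        if job_id is None:
            time.sleep(5)  # nothing pending; poll the table again shortly
            continue
        # ... run the download/overlay/encode pipeline for job_id, then:
        conn.execute("UPDATE jobs SET status = 'complete' WHERE id = ?", (job_id,))
        conn.commit()
```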
Production approach (when you need to scale):
- Application Load Balancer (ALB) distributing requests
- Auto Scaling Group (ASG) managing multiple EC2 instances
- SQS (Simple Queue Service) for job queuing instead of database polling
- S3 for all file storage (input and output)
- RDS for the database (instead of running it on the EC2)
- Separate worker instances from API instances
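When the queue arrives, the worker swaps its database query for SQS long polling. A hedged sketch using boto3 (the queue URL is a placeholder, and the AWS calls aren't exercised here):

```python
import json

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/render-jobs"  # placeholder

def job_message(job_id: str) -> str:
    """The queue carries only the job ID; all job details stay in the database."""
    return json.dumps({"job_id": job_id})

def enqueue(job_id: str) -> None:
    import boto3  # imported lazily so the pure helper above works without AWS
    sqs = boto3.client("sqs")
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=job_message(job_id))

def worker_loop() -> None:
    import boto3
    sqs = boto3.client("sqs")
    while True:
        # Long polling: blocks up to 20s instead of hammering the queue.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            job_id = json.loads(msg["Body"])["job_id"]
            # ... run the render pipeline for job_id, then acknowledge:
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Deleting the message only after the render finishes means a crashed worker's job reappears on the queue once its visibility timeout expires.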
Dream architecture (post-product-market-fit):
- Kubernetes cluster with containerized workers
- Granular scaling of workers independent of API servers
- GPU instances (G4) for hardware-accelerated FFmpeg encoding
The progression makes sense: start with everything on one box, split services as you grow, containerize when you need fine-grained scaling.
The FFmpeg Challenge
The most technically complex part is the overlay rendering. FFmpeg can re-encode a video with data overlaid on each frame, but the process is specific:
- Detect the video's FPS (typically 30 or 60 for action cameras)
- Sample the GPX time-series data at the same rate — if the video is 30fps, you need 30 telemetry values per second
- Generate overlay frames — transparent images or video with the gauges rendered at each timestamp
- Composite the overlay onto the source video using FFmpeg's filter chain
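Step 2, resampling the GPX series to one value per frame, is plain linear interpolation. A sketch, assuming the GPX has already been parsed into (seconds, value) pairs:

```python
from bisect import bisect_right

def resample(points: list[tuple[float, float]], fps: int, duration: float) -> list[float]:
    """Linearly interpolate (timestamp_s, value) samples to one value per frame."""
    times = [t for t, _ in points]
    out = []
    for frame in range(int(duration * fps)):
        t = frame / fps
        i = bisect_right(times, t)
        if i == 0:
            out.append(points[0][1])      # before first sample: hold first value
        elif i == len(points):
            out.append(points[-1][1])     # after last sample: hold last value
        else:
            (t0, v0), (t1, v1) = points[i - 1], points[i]
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return out
```

GPX points usually arrive at 1Hz, so a 30fps video needs 29 interpolated values between each pair of real samples.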
This is where the CSS gauges from the client-side demo don't directly translate. In the browser, the overlay is HTML/CSS rendered by the browser engine. For server-side rendering, the overlay needs to be generated as image frames or a transparent video that FFmpeg can composite.
The solution Tirth suggested: render the overlay as a separate transparent video, then use FFmpeg to combine them. This keeps the overlay generation separate from the video processing, making it easier to change the gauge design without re-processing the entire video pipeline.
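Assuming the overlay is exported as a video with an alpha channel (ProRes 4444, for instance), the composite step is a single FFmpeg invocation with the `overlay` filter. A sketch that builds the command for `subprocess`:

```python
def composite_cmd(base: str, overlay: str, out: str) -> list[str]:
    """FFmpeg command that alpha-blends a transparent overlay video onto the source.
    Assumes the overlay matches the source's fps and duration."""
    return [
        "ffmpeg", "-y",
        "-i", base,     # input 0: the downloaded YouTube clip
        "-i", overlay,  # input 1: transparent gauge video
        "-filter_complex", "[0:v][1:v]overlay=0:0",  # composite at top-left
        "-c:a", "copy",  # re-encode video only; pass audio through untouched
        out,
    ]
```

The `overlay=0:0` position and codec choices are illustrative; the real pipeline would pick coordinates and an output codec to match the gauge layout and target platform.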
Cost Reality Check
We looked at AWS EC2 pricing during the call. The numbers were sobering:
- C5 instances (compute-optimized): the price varies by size, and even a mid-range c5.xlarge adds up when every render means minutes of full-CPU encoding
- G4 instances (GPU-enabled, for FFmpeg acceleration): significantly more expensive
- Storage on S3 is cheap by comparison
- The real cost driver is compute time during video re-encoding
For an MVP with a handful of users, a single C5 or even a T3 instance handles the load. The cost becomes a concern at scale — if hundreds of users are rendering videos simultaneously, the compute bill grows fast. AWS Savings Plans (30-40% discount for 1-3 year commitments) help, but only after you've validated the product.
This pricing reality shapes the product model: rendered videos should expire after 30 days (no permanent storage), and free tier limits should cap the number of renders per month.
Authentication Strategy
A practical detail that's easy to overlook: the backend API needs authentication. Since SportsSync already uses Supabase for the waitlist, the same authentication system can secure the API.
The flow: the user logs in via Supabase on the frontend, receives a JWT, and includes that token in API requests to the backend. The backend verifies the token (it's signed with the Supabase JWT secret) before processing any request. Same auth system, no additional infrastructure.
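Supabase access tokens are, by default, HS256 JWTs signed with the project's JWT secret, so the backend can check them locally without a round trip to Supabase. A stdlib-only sketch of the signature check (a production service would use a JWT library and also validate `exp` and `aud`):

```python
import base64
import hashlib
import hmac
import json

def _b64url_decode(part: str) -> bytes:
    # JWTs strip base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

def verify_supabase_jwt(token: str, jwt_secret: str) -> dict:
    """Verify the HS256 signature on a JWT and return its claims, or raise."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(jwt_secret.encode(), signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(payload_b64))
```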
Development Plan
Based on the conversation, my development path is:
- Local development first. Write the Python scripts for video download (yt-dlp), GPX parsing, and FFmpeg overlay generation. Test everything on my laptop.
- Dockerize. Package the API server, worker, and database into a Docker Compose setup. Verify it works as a unit.
- Deploy to a single EC2. Push the Docker setup to AWS. Configure S3 for file storage. Test with real YouTube videos and GPX files.
- Add the queue. Replace database polling with SQS for job management. This enables scaling workers later.
- Split services. Separate the API server from the workers. Add the load balancer.
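For step 1's download script, yt-dlp can be driven through `subprocess`; a minimal command builder, using yt-dlp's documented `-f` (format) and `-o` (output path) flags:

```python
def download_cmd(youtube_url: str, out_path: str) -> list[str]:
    """yt-dlp invocation that fetches an mp4 stream to a known local path."""
    return ["yt-dlp", "-f", "mp4", "-o", out_path, youtube_url]
```

Keeping the command builders as pure functions makes the pipeline easy to unit-test locally before any of it touches EC2.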
Steps 1-3 are the MVP. Steps 4-5 come after product-market fit. The architecture supports this progression without requiring a rewrite.
What I Learned About Asking for Help
As a frontend developer for 10+ years, asking a backend engineer to draw me an architecture diagram felt like admitting incompetence. But Tirth's 30 minutes of whiteboarding saved me weeks of trial and error.
The AI tools I've been using are incredible for implementation — give them a clear specification and they produce working code. But they can't replace the judgment of an experienced engineer who knows which AWS services to use, when to scale, and where the cost traps are. Claude can write an FFmpeg command, but it can't tell me whether to use a C5 or G4 instance based on my expected workload.
The lesson: use AI for code, use humans for architecture decisions.