Building an LLM Extraction Pipeline for Vietnamese Sports Listings

2026-05-29

How I built a Go + LLM pipeline that reads raw Vietnamese Facebook posts about badminton and pickleball meetups, extracts structured data, geocodes addresses in Ho Chi Minh City, and feeds it into a Rails app — end to end in under 2 seconds per listing.

SportMatch is a side project I built solo over one month: a map-based platform where badminton and pickleball players in Ho Chi Minh City can find teammates and join games. The manual posting flow was straightforward Rails. The hard part was the other data source — scraping raw text from Vietnamese Facebook groups and turning it into structured, geocoded listings automatically.

A typical post looks like this: "Can 2 nam vang lai san 583 Nguyen Trai, 19h-21h toi nay, trinh trung binh yeu, phi 50k" (roughly: "Need 2 male drop-in players at the court at 583 Nguyen Trai, 7 to 9 pm tonight, lower-intermediate level, fee 50k"). No JSON. No standard format. Vietnamese slang, relative time references, and a short address that Google Maps may or may not resolve correctly.

The Schema Contract

Before writing a single line of Go or prompting the LLM, I defined a strict JSON Schema: listing_extraction.schema.json. This schema describes every field the LLM must produce — sport, title, start_at in ISO 8601 UTC, skill level range, slots needed, price estimate, location name, and contact info — with types, enums, and required constraints.

The schema contract meant the LLM had one job: produce valid JSON matching this shape. Nothing else. No prose, no explanation. Just the object. This discipline paid off immediately: failures were binary. Either the JSON Schema validator on the Go side accepted the output, or it didn't. No ambiguous partial parses.
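
A minimal sketch of that "object or nothing" rule using only the Go standard library. The JSON key names and Go types below are assumptions made for illustration, since the real listing_extraction.schema.json is not shown here. The sketch only checks shape; it does not enforce required keys or enum values, which is what a JSON Schema validator adds.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"time"
)

// Listing mirrors the fields named above. Key names and types are assumed,
// not copied from the real schema.
type Listing struct {
	Sport         string    `json:"sport"`
	Title         string    `json:"title"`
	StartAt       time.Time `json:"start_at"` // RFC 3339 timestamp, expected in UTC
	SkillMin      string    `json:"skill_level_min"`
	SkillMax      string    `json:"skill_level_max"`
	SlotsNeeded   int       `json:"slots_needed"`
	PriceEstimate int       `json:"price_estimate"`
	LocationName  string    `json:"location_name"`
	ContactInfo   string    `json:"contact_info"`
}

// decodeStrict accepts exactly one JSON object with known keys. Unknown keys,
// leading prose, and trailing prose are all errors, never partial parses.
func decodeStrict(raw []byte) (Listing, error) {
	var l Listing
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&l); err != nil {
		return Listing{}, err
	}
	// Anything other than end of input after the object is rejected.
	if _, err := dec.Token(); err != io.EOF {
		return Listing{}, errors.New("unexpected content after the JSON object")
	}
	return l, nil
}

func main() {
	raw := []byte(`{"sport":"badminton","title":"Need 2 players",` +
		`"start_at":"2026-05-29T12:00:00Z","skill_level_min":"trung_binh_yeu",` +
		`"skill_level_max":"trung_binh_yeu","slots_needed":2,"price_estimate":50000,` +
		`"location_name":"583 Nguyen Trai","contact_info":""}`)
	l, err := decodeStrict(raw)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println(l.Sport, l.SlotsNeeded, l.StartAt.UTC().Format(time.RFC3339))
}
```

A response such as `{"sport":"badminton"} Hope this helps!` fails here instead of being half-accepted, which is the binary behaviour described above.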

Prompt Design: The Hard Parts

Two things made prompt engineering genuinely difficult for this domain.

The first was skill level taxonomy. Vietnamese players use a rich, informal spectrum to describe ability: yeu (weak), trung binh yeu (lower-intermediate), trung binh (intermediate), trung binh kha (upper-intermediate), kha (good), ban chuyen (semi-pro), chuyen nghiep (professional), plus several in-between variants. These are not synonyms; they represent distinct rungs on a ladder the local community understands intuitively. I had to enumerate all variants in the system prompt and map them to DB slugs explicitly. Early iterations without this caused the LLM to collapse the entire range into just "trung_binh".
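
One way to keep that mapping in a single place is an ordered table that both renders the prompt section and checks what comes back. This is a sketch under assumptions: only the "trung_binh" slug is confirmed above, the other slug spellings are guesses, and the prompt wording is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// skillLevel pairs a phrase as written in posts with its DB slug.
type skillLevel struct {
	Phrase string
	Slug   string
}

// skillLadder is ordered weakest to strongest. Slugs other than "trung_binh"
// are assumed spellings.
var skillLadder = []skillLevel{
	{"yeu", "yeu"},
	{"trung binh yeu", "trung_binh_yeu"},
	{"trung binh", "trung_binh"},
	{"trung binh kha", "trung_binh_kha"},
	{"kha", "kha"},
	{"ban chuyen", "ban_chuyen"},
	{"chuyen nghiep", "chuyen_nghiep"},
}

// skillPromptSection renders the ladder as explicit phrase-to-slug lines for
// the system prompt, so the model never has to guess a slug.
func skillPromptSection() string {
	var b strings.Builder
	b.WriteString("Map skill phrases to slugs exactly as listed, weakest to strongest:\n")
	for _, lv := range skillLadder {
		fmt.Fprintf(&b, "- %q -> %s\n", lv.Phrase, lv.Slug)
	}
	return b.String()
}

// skillSlug looks up a phrase after lowercasing and collapsing whitespace.
// It returns false for anything that is not on the ladder.
func skillSlug(phrase string) (string, bool) {
	key := strings.Join(strings.Fields(strings.ToLower(phrase)), " ")
	for _, lv := range skillLadder {
		if lv.Phrase == key {
			return lv.Slug, true
		}
	}
	return "", false
}

func main() {
	fmt.Print(skillPromptSection())
	slug, ok := skillSlug("Trung  binh yeu")
	fmt.Println(slug, ok)
}
```

Generating the prompt text and the lookup from the same table means a new variant only has to be added once.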

The second was relative time resolution. Posts say "toi nay" (tonight), "CN tuan sau" (Sunday next week), or just "19h-21h". The LLM needs a reference date anchored to the Asia/Ho_Chi_Minh timezone to resolve these correctly before converting to UTC for storage. I inject the current local date and time into every user message so the model has a stable anchor. Without it, the model had no way to know today's date, so "toi nay" resolved to an arbitrary date and the result was completely wrong.
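
A sketch of the anchoring step in Go. The prefix wording and the function names are hypothetical; the point is that the anchor is formatted in local time, while anything stored is converted to UTC. The fixed example time keeps the output reproducible.

```go
package main

import (
	"fmt"
	"time"
)

// exampleNow is a fixed instant used only to make this example reproducible:
// 03:00 UTC, which is 10:00 in Ho Chi Minh City.
var exampleNow = time.Date(2026, 5, 29, 3, 0, 0, 0, time.UTC)

// hcmLocation loads Asia/Ho_Chi_Minh and falls back to a fixed UTC+7 zone if
// the host has no time zone database. Vietnam does not observe DST.
func hcmLocation() *time.Location {
	loc, err := time.LoadLocation("Asia/Ho_Chi_Minh")
	if err != nil {
		return time.FixedZone("ICT", 7*60*60)
	}
	return loc
}

// buildUserMessage prefixes the raw post with the local weekday, date, and
// time so phrases like "toi nay" have something to resolve against.
func buildUserMessage(post string, now time.Time) string {
	local := now.In(hcmLocation())
	return fmt.Sprintf("Current local time (Asia/Ho_Chi_Minh): %s\nPost:\n%s",
		local.Format("Monday 2006-01-02 15:04"), post)
}

// localHourToUTC shows the conversion expected from the model: a local hour
// on the anchor's local date, expressed in UTC for storage.
func localHourToUTC(now time.Time, hour int) time.Time {
	local := now.In(hcmLocation())
	start := time.Date(local.Year(), local.Month(), local.Day(), hour, 0, 0, 0, local.Location())
	return start.UTC()
}

func main() {
	fmt.Println(buildUserMessage("Can 2 nam vang lai, 19h-21h toi nay", exampleNow))
	fmt.Println(localHourToUTC(exampleNow, 19).Format(time.RFC3339))
}
```

With this anchor, "19h toi nay" corresponds to 12:00 UTC on the same date. The local date matters most just after midnight, when the UTC date is still the previous day.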

Validation in Go

The Go scraper uses a JSON schema validation library to check the LLM's response before doing anything else with it. If validation fails — hallucinated fields, wrong enum value, missing required key — the message goes to a dead-letter channel for logging and limited retry. Nothing partial ever reaches Rails.
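
The routing described above can be sketched with three channels. Everything named here is an assumption: the job type, the limit of three attempts (the text only says "limited"), and validateStub, which stands in for the real JSON Schema validator by checking only that the bytes are valid JSON.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// job is one LLM response waiting for validation.
type job struct {
	SourceURL string
	Raw       []byte
	Attempts  int // failed validations so far
}

// maxAttempts is an assumed retry limit.
const maxAttempts = 3

// validateStub is a placeholder for schema validation.
func validateStub(raw []byte) error {
	if !json.Valid(raw) {
		return errors.New("response is not valid JSON")
	}
	return nil
}

// route sends a job to exactly one channel: valid on success, retry while
// attempts remain, dead once the limit is reached.
func route(j job, validate func([]byte) error, valid, retry, dead chan<- job) error {
	err := validate(j.Raw)
	if err == nil {
		valid <- j
		return nil
	}
	j.Attempts++
	if j.Attempts < maxAttempts {
		retry <- j
	} else {
		dead <- j
	}
	return err
}

func main() {
	valid := make(chan job, 1)
	retry := make(chan job, 1)
	dead := make(chan job, 1)

	bad := job{SourceURL: "https://example.invalid/post/1", Raw: []byte("Sure! Here you go"), Attempts: 2}
	if err := route(bad, validateStub, valid, retry, dead); err != nil {
		fmt.Println("validation failed:", err)
	}
	fmt.Println("valid:", len(valid), "retry:", len(retry), "dead:", len(dead))
}
```

Only jobs on the valid channel would ever be posted to Rails; the dead channel exists for logging and inspection.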

This boundary was one of the best architectural decisions in the whole project. It kept the Rails ingest endpoint clean — it could trust the payload shape completely.

Geocoding: Cache-First with PostGIS

Short Vietnamese addresses are notoriously ambiguous. "583 Nguyen Trai" could be in Quan 1 or Quan 5. Repeated geocoding API calls for the same address are wasteful and expensive.

I built a geocoding_cache table in PostgreSQL keyed on the normalized location query string. Before calling the Google Geocoding API, Rails checks this table. On a miss, it calls Google, stores the result as a geography point, and returns it. Cache hit rate in practice: over 80% after the first week, since many posts reference the same popular courts repeatedly.
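
The real cache lives in Rails and PostgreSQL. The sketch below shows the same cache-first logic in Go only to keep the examples in one language. The in-memory map stands in for the geocoding_cache table, fakeRemote stands in for the Google call, the normalization rule (lowercase, collapsed whitespace) is an assumption, and the coordinates are placeholders rather than a real geocoding result.

```go
package main

import (
	"fmt"
	"strings"
)

// point is a latitude/longitude pair.
type point struct {
	Lat, Lng float64
}

// cachedGeocoder checks the cache before calling the remote geocoder.
// It is not safe for concurrent use; a real version would need locking
// or would rely on the database.
type cachedGeocoder struct {
	cache  map[string]point
	remote func(query string) (point, error)
	calls  int // number of remote calls made
}

// normalizeQuery builds the cache key so that trivial differences in case
// and spacing hit the same entry.
func normalizeQuery(q string) string {
	return strings.Join(strings.Fields(strings.ToLower(q)), " ")
}

// lookup returns the cached point on a hit. On a miss it calls the remote
// geocoder and stores the result. Failed lookups are not cached.
func (g *cachedGeocoder) lookup(q string) (point, error) {
	key := normalizeQuery(q)
	if p, ok := g.cache[key]; ok {
		return p, nil
	}
	g.calls++
	p, err := g.remote(key)
	if err != nil {
		return point{}, err
	}
	g.cache[key] = p
	return p, nil
}

// fakeRemote returns placeholder coordinates for any query.
func fakeRemote(query string) (point, error) {
	return point{Lat: 10.75, Lng: 106.67}, nil
}

func main() {
	g := &cachedGeocoder{cache: map[string]point{}, remote: fakeRemote}
	g.lookup("583 Nguyen Trai")
	g.lookup("  583  NGUYEN trai ")
	fmt.Println("remote calls:", g.calls)
}
```

Both lookups normalize to the same key, so the second one never reaches the remote geocoder.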

Idempotency via source_url

Facebook post URLs are stable. Every scraped listing carries its source_url. A unique constraint on that column means re-running the scraper on the same post is a no-op — Rails returns the existing record. No duplicates accumulate even if the scraper processes the same group multiple times.

The ingest endpoint itself is HMAC-signed using a shared secret between Go and Rails. No authentication token to manage, no OAuth dance — just a signed request header that Rails verifies before processing.
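
A sketch of the signing side in Go. HMAC-SHA256 with hex encoding and the X-Signature header name are assumptions; the text above only says the request is HMAC-signed with a shared secret. The verify function mirrors what the receiving side would need to do, written in Go for illustration.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign returns the hex-encoded HMAC-SHA256 of body under the shared secret.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the MAC over the exact bytes received and compares it
// to the supplied signature in constant time.
func verify(secret, body []byte, signature string) bool {
	got, err := hex.DecodeString(signature)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("example-shared-secret") // placeholder, never hard-code a real secret
	body := []byte(`{"sport":"badminton","source_url":"https://example.invalid/post/1"}`)

	sig := sign(secret, body)
	// Hypothetical header name carrying the signature.
	fmt.Println("X-Signature:", sig)
	fmt.Println("verified:", verify(secret, body, sig))
}
```

The signature has to be computed over the raw request bytes, not over a re-serialized object, since any change in key order or whitespace produces a different MAC.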

Lessons From One Month of Building This

Prompt iteration is the real cost. Infrastructure takes hours to set up. Getting the LLM to reliably output the right skill level slug for every Vietnamese variant took days of examples and edge-case testing. Budget more time here than you think.

Vietnamese address geocoding has a ceiling. Some posts describe locations vaguely enough that even Google returns a district-level result, not a specific point. I accepted this: listings with a low-confidence coordinate are stored with a null point and are hidden from the map feed until a background job retries with a better query.

Go and Rails as a pipeline pair works well. Go handles the concurrency, HTTP fetching, and strict validation cleanly. Rails owns the domain model, geocoding cache, and serving. The HMAC boundary between them is thin and auditable. I would not reach for a message queue for a project at this scale — a direct signed POST is simpler and easier to debug.