We profiled the quest-generation pipeline and found that 60% of latency came from a sequential LLM call chain that could be partially parallelised. We restructured the pipeline so that the concept-selection and difficulty-calibration calls run concurrently where their outputs are independent. The remaining latency is LLM inference time, which we cannot reduce further without changing the model or the prompt design.
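The restructuring amounts to awaiting the independent calls together rather than one after the other. A minimal sketch of the pattern, using `asyncio.gather` — the function names, signatures, and return values here are illustrative stand-ins, not the production API:

```python
import asyncio

# Hypothetical stand-ins for the real LLM calls; the sleep simulates
# inference latency. Names and payloads are assumptions for illustration.
async def select_concepts(player_state: dict) -> list[str]:
    await asyncio.sleep(0.01)
    return ["navigation", "resource-management"]

async def calibrate_difficulty(player_state: dict) -> float:
    await asyncio.sleep(0.01)
    return 0.7

async def generate_quest(player_state: dict) -> dict:
    # The two calls take independent inputs, so they can run
    # concurrently instead of sequentially; total wait is the max
    # of the two latencies rather than their sum.
    concepts, difficulty = await asyncio.gather(
        select_concepts(player_state),
        calibrate_difficulty(player_state),
    )
    return {"concepts": concepts, "difficulty": difficulty}

result = asyncio.run(generate_quest({"level": 12}))
```

Any downstream step that consumes both outputs still runs after the `gather`, preserving the original data dependencies.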
Measured over 200 quests across 3 production days. Median before: 4.2s (p95: 8.1s). Median after: 2.8s (p95: 5.4s). No change in content quality scores — we A/B tested the restructured pipeline against the sequential one for 48 hours before rolling out fully.
Next step: investigate whether prompt compression reduces p95 without affecting quality. Tracking in the research queue.