Performance · February 09, 2026

How Advi makes AI prompts fast

Advi Systems · Platform

Every API call to a large language model has a cost measured in two currencies: time and tokens. Time is wall-clock latency—the seconds between sending a request and receiving a complete response. Tokens are the billing unit, typically priced at $0.01–$0.06 per 1,000 output tokens depending on the model. Advi Systems Prompts is engineered to minimize both by producing the most compact, high-signal prompt possible for any given task.

How LLM latency actually works

To understand why prompt structure affects speed, you need to understand how inference works at the hardware level. LLM inference has two phases:

  • Prefill (prompt processing): The model reads your entire prompt and computes internal representations (KV cache) for every input token. This phase scales roughly linearly with prompt length. A 500-token prompt takes approximately 2× longer to prefill than a 250-token prompt on the same hardware.
  • Decode (token generation): The model generates output tokens one at a time (or in small batches with speculative decoding). Each token requires a forward pass through the entire model. On GPT-4-class models, this runs at roughly 30–80 tokens per second depending on load and provider.

This means latency is the sum of two components: prefill_time(input_tokens) + decode_time(output_tokens). You can reduce total latency by shrinking either side of that equation. Advi targets both.
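That sum can be sketched as a back-of-the-envelope model. The throughput defaults below (`prefill_tps`, `decode_tps`) are illustrative assumptions consistent with the ranges in this post, not measurements from any particular provider:

```python
def estimate_latency(input_tokens, output_tokens,
                     prefill_tps=2000.0, decode_tps=50.0):
    """Rough end-to-end latency in seconds: prefill time + decode time.

    prefill_tps: tokens/second processed during prompt prefill (assumed).
    decode_tps:  tokens/second generated during decoding (assumed).
    """
    return input_tokens / prefill_tps + output_tokens / decode_tps

# A verbose prompt with an unconstrained response...
verbose = estimate_latency(input_tokens=800, output_tokens=650)  # ~13.4 s
# ...versus a compact prompt with a length-constrained response.
compact = estimate_latency(input_tokens=200, output_tokens=200)  # ~4.1 s
```

Note where the savings come from: trimming 600 input tokens recovers a few hundred milliseconds of prefill, but capping the output saves whole seconds of decode, which is why the output side dominates.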

Input-side optimization: compact prompts

The average manually written business prompt contains 40–60% filler: polite preambles, redundant instructions, unnecessary context, and vague qualifiers. Here's a representative finding from a prompt audit we conducted across 200 enterprise users:

A manually written prompt for “summarize a meeting transcript” averaged 847 tokens. The Advi-generated equivalent averaged 203 tokens—a 76% reduction—while producing outputs that scored equal or higher on a blind quality evaluation (measured by factual accuracy, format compliance, and completeness rated by two independent reviewers).

The reason compact prompts work better is not just speed. Shorter, structured prompts reduce what researchers call the “instruction dilution” effect: when a model receives more tokens than necessary, it distributes attention across all of them, weakening the signal from the tokens that actually matter. A 200-token prompt with clear structure concentrates the model's attention budget on instructions that drive output quality.
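To see the effect on a concrete pair of prompts, a rough word-based estimate is enough. The ~1.3 tokens-per-word ratio below is a common rule of thumb, not an exact figure; a real count would use the model's tokenizer (e.g. tiktoken). Both prompt strings are invented examples, not Advi templates:

```python
def approx_tokens(text):
    # Heuristic: ~1.3 tokens per whitespace-delimited word.
    # Only for illustration; real counts need the model's tokenizer.
    return round(len(text.split()) * 1.3)

# Filler-heavy: preamble, politeness, hedging, no structure.
verbose_prompt = (
    "Hello! I was hoping you could please help me out by taking a look "
    "at the meeting transcript below and, if possible, writing up a "
    "nice summary of the key points for me. Thanks so much!"
)
# Compact: task, format, and content constraints only.
compact_prompt = (
    "Summarize the transcript below in 3 bullet points, each under "
    "20 words. Cover decisions, owners, and deadlines."
)
```

Every token of filler competes for the same attention budget as the tokens that specify format and content, which is the dilution effect described above.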

Output-side optimization: constrained generation

The decode phase is where most wall-clock time is spent, because it scales directly with output length. A 500-word response (roughly 650 tokens) takes approximately 8–22 seconds on GPT-4o depending on server load. A 100-word response takes 2–4 seconds.

Advi controls output length through explicit format and length constraints embedded in the prompt. When a prompt specifies “respond in 3–5 bullet points, each under 20 words,” the model stops generating once that structure is filled. Without these constraints, models tend toward verbose completions, often generating 2–3× more tokens than necessary.
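A minimal sketch of how such constraints can be embedded in a request. The parameter names follow the OpenAI Python SDK's chat-completions shape, but treat this as an illustration rather than a verified integration; the prompt text and token cap are assumptions, not Advi's actual output:

```python
PROMPT_TEMPLATE = (
    "Summarize the meeting transcript below.\n"
    "Format: 3-5 bullet points, each under 20 words.\n"
    "Do not add an introduction or conclusion.\n\n"
    "Transcript:\n{transcript}"
)

def build_request(transcript, max_output_tokens=150):
    # Two layers of control: the in-prompt constraint shapes *what* the
    # model writes, while max_tokens is a hard server-side stop that
    # bounds worst-case decode latency even if the model ignores it.
    return {
        "model": "gpt-4o",
        "messages": [{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(transcript=transcript),
        }],
        "max_tokens": max_output_tokens,
    }
```

The belt-and-suspenders pairing matters: prompt-level constraints produce complete responses of the right length, and the token cap guarantees a latency ceiling when they fail.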

We also use format-locking: specifying the exact output structure (JSON schema, markdown headers, numbered list) so the model follows a predictable generation path. Format-locked prompts reduce output token count by an average of 35% compared to open-ended “explain this” style prompts, based on our analysis of 3,400 prompt pairs.
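A sketch of format-locking on the parsing side: the prompt pins the output to one JSON shape, and the consumer rejects anything else instead of attempting a repair call. The schema and field names here are hypothetical examples, not Advi's actual templates:

```python
import json

LOCKED_PROMPT = (
    "Extract the action items from the transcript below.\n"
    "Respond with ONLY a JSON array matching this shape:\n"
    '[{"owner": "<name>", "task": "<short task>", "due": "<date or null>"}]\n'
    "No prose before or after the JSON."
)

def parse_locked_output(raw):
    # With a format-locked prompt, parsing is a single json.loads call;
    # any deviation surfaces immediately as an error rather than
    # requiring a second API call to fix formatting.
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of action items")
    return items
```

Because the generation path is predictable, the model also spends no tokens on framing text ("Sure, here are the action items..."), which is part of where the 35% reduction comes from.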

The five latency contributors in every API call

  • Network round-trip: Typically 50–200ms depending on geographic distance to the API endpoint. Advi sends a single API call per prompt, avoiding multi-turn chat patterns that multiply this overhead.
  • Queue wait time: During peak hours, API providers queue requests. This can add 0.5–5 seconds of latency that is largely outside your control. Compact payloads tend to spend less time waiting with some providers, since smaller requests consume fewer compute resources per scheduling slot.
  • Prefill latency: Scales with input token count. Reducing prompt size from 800 to 200 tokens saves roughly 300–600ms on GPT-4-class models.
  • Decode latency: Scales with output token count at 30–80 tokens/second. The dominant latency factor for any response longer than a few sentences.
  • Post-processing: Parsing the response, validating format compliance, and error handling. With structured outputs, this is near-zero. With unstructured free-text, parsing can require additional code and sometimes a second API call to fix formatting errors.
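Summing the five contributors gives a rough per-call budget. The defaults below are mid-range figures taken from the ranges above and are illustrative only; measure your own deployment before trusting any of them:

```python
def total_latency_ms(network_rtt=120, queue_wait=500,
                     prefill=150, decode=4000, post=10):
    """Illustrative per-call latency budget in milliseconds.

    Defaults assume a ~200-token prompt and a ~200-token response
    decoded at ~50 tokens/second; all values are assumptions.
    """
    return network_rtt + queue_wait + prefill + decode + post
```

Even in this rough budget, decode dwarfs everything else, which is why output-side constraints are the highest-leverage optimization.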

Practical benchmarks

Based on production data from Advi-generated prompts measured against OpenAI and Anthropic APIs:

  • Simple tasks (classification, extraction, short Q&A): 1–3 seconds end-to-end. Prompt size: 100–200 tokens. Output: 20–100 tokens.
  • Medium tasks (summarization, rewriting, structured analysis): 3–8 seconds. Prompt: 200–400 tokens. Output: 100–500 tokens.
  • Complex tasks (long-form generation, multi-section reports): 10–25 seconds. Prompt: 300–500 tokens. Output: 500–2,000 tokens.

These numbers assume standard API tiers without dedicated capacity. With provisioned throughput (available from OpenAI and Anthropic), prefill and decode times drop by 30–50% because your requests bypass the shared queue.

Why this matters for cost

Speed and cost are linked. At $0.03 per 1,000 output tokens (GPT-4o pricing as of early 2026), a prompt that generates 800 tokens instead of 400 costs 2× as much per call. At scale—say 10,000 API calls per day—that difference is $120/day or $3,600/month in pure token cost. Compact, format-constrained prompts don't just run faster; they cost proportionally less to operate.
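The arithmetic above can be checked with a one-line cost model (the price is the one quoted in this post, not a live rate card; input-token cost is omitted for simplicity):

```python
def daily_output_cost(calls_per_day, output_tokens_per_call,
                      usd_per_1k_output=0.03):
    # Output-token spend per day at a flat per-1k-token price.
    return calls_per_day * output_tokens_per_call / 1000 * usd_per_1k_output

# 10,000 calls/day at 800 vs 400 output tokens per call:
savings = daily_output_cost(10_000, 800) - daily_output_cost(10_000, 400)
# 400 fewer tokens x 10,000 calls x $0.03/1k = $120/day, ~$3,600/month
```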

The bottom line: prompt structure is a performance engineering problem. Advi treats it as one, applying the same kind of payload optimization that backend engineers apply to API design. The result is faster responses, lower costs, and more predictable behavior under load.

Ready to engineer better prompts?

See this architecture in action and stop wrestling with chat interfaces.

Launch Dashboard