Query Engine

The Query Engine is the orchestration layer for the full request flow:

classify → retrieve → rerank → budget → generate

It coordinates retrieval strategy selection, result refinement, context budget control, and final LLM generation in one deterministic pipeline.
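As a rough illustration of this flow, the sketch below chains five stage functions in a fixed order. The function name `run_pipeline`, the stage signatures, and the `(text, score)` chunk shape are assumptions made for this example, not the engine's actual API.

```python
from typing import Callable, List, Tuple

# Hypothetical chunk shape for this sketch: (text, relevance score).
Chunk = Tuple[str, float]


def run_pipeline(
    query: str,
    classify: Callable[[str], str],
    retrieve: Callable[[str, str], List[Chunk]],
    rerank: Callable[[List[Chunk]], List[Chunk]],
    budget: Callable[[str, List[Chunk]], List[Chunk]],
    generate: Callable[[str, List[Chunk]], str],
) -> str:
    """Run the five stages in order; each stage consumes the previous output."""
    mode = classify(query)           # 1. choose a retrieval mode
    chunks = retrieve(query, mode)   # 2. fetch candidate chunks
    chunks = rerank(chunks)          # 3. re-score and filter
    chunks = budget(query, chunks)   # 4. trim to fit the context window
    return generate(query, chunks)   # 5. produce the answer


# Stub stages stand in for the real components.
answer = run_pipeline(
    "what changed in v2?",
    classify=lambda q: "semantic",
    retrieve=lambda q, mode: [("v2 adds reranking", 0.9), ("unrelated", 0.1)],
    rerank=lambda cs: [c for c in cs if c[1] >= 0.5],
    budget=lambda q, cs: cs,
    generate=lambda q, cs: f"{len(cs)} chunk(s) used",
)
print(answer)
```

Passing the stages in as callables keeps the ordering logic separate from any one retrieval or reranking implementation, which is one plausible way to make rerankers pluggable.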


Pipeline Stages

  1. Classify
    The query classifier, running in either rule-based or hybrid LLM mode, detects the query's intent and automatically selects a retrieval mode and metadata filters.

  2. Retrieve
    The engine runs the selected retrieval strategy: semantic, hybrid, metadata-only, comparison, or hybrid BM25.

  3. Rerank
    Pluggable rerankers re-score and filter the retrieved results. The built-in ScoreThresholdReranker drops chunks whose score falls below a configured confidence threshold.

  4. Budget
    Token budget management estimates the prompt size, reserves completion tokens, and trims the lowest-relevance chunks to keep the request within the model's context limit.

  5. Generate
    The final trimmed context is sent to the LLM for answer generation.
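The classify and rerank stages above can be sketched as follows. This is a minimal illustration only: the keyword rules, the `year` filter, the `Plan` and `Chunk` types, and the function names are hypothetical, and the real classifier and ScoreThresholdReranker may behave differently. Only the mode names and the drop-below-threshold behavior come from the description above.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Plan:
    mode: str                                   # retrieval mode to run
    filters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Chunk:
    text: str
    score: float                                # relevance / confidence score


def classify_rule_based(query: str) -> Plan:
    """Pick a retrieval mode and metadata filters from simple keyword rules."""
    q = query.lower()
    filters: Dict[str, str] = {}

    # Hypothetical rule: a four-digit year becomes a metadata filter.
    year = re.search(r"\b(19|20)\d{2}\b", q)
    if year:
        filters["year"] = year.group(0)

    # Hypothetical rule: comparison wording selects comparison mode.
    if re.search(r"\b(compare|versus|vs)\b", q):
        return Plan("comparison", filters)
    # Hypothetical rule: listing requests with filters need no semantic search.
    if filters and re.search(r"\b(list|show)\b", q):
        return Plan("metadata-only", filters)
    return Plan("semantic", filters)


def score_threshold_rerank(chunks: List[Chunk], threshold: float) -> List[Chunk]:
    """Drop chunks scoring below the threshold; sort the rest best-first.

    The sort is an assumption of this sketch, not documented behavior.
    """
    kept = [c for c in chunks if c.score >= threshold]
    return sorted(kept, key=lambda c: c.score, reverse=True)
```

For example, `classify_rule_based("Compare the 2023 and 2024 reports")` yields comparison mode with a `year` filter of `2023`, because this sketch keeps only the first year it finds.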


Token Budget Management

Before generation, the engine estimates prompt token usage against N_CTX, reserves completion space, then drops low-relevance chunks until the prompt fits.

The /api/chat response includes a token_budget diagnostic object:

  • limit — context window limit used for this request
  • estimated_prompt — estimated prompt tokens after retrieval/reranking
  • reserved_completion — completion token reserve
  • chunks_dropped — number of chunks removed to stay in budget

Info: Set RAG_LOG_TOKEN_USAGE=true to log per-request budget diagnostics in the runtime logs.
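The trimming behavior and the token_budget fields described in this section can be sketched as below. The four-characters-per-token estimate, the function names, and the `(text, score)` chunk shape are assumptions for illustration; the engine's actual estimator is not specified here. The sketch reports `estimated_prompt` after trimming, which is also an assumption.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, float]  # hypothetical shape: (text, relevance score)


def estimate_tokens(text: str) -> int:
    """Crude estimate of about 4 characters per token (not a real tokenizer)."""
    return max(1, len(text) // 4)


def fit_to_budget(
    question: str,
    chunks: List[Chunk],
    n_ctx: int,
    reserved_completion: int,
) -> Tuple[List[Chunk], Dict[str, int]]:
    """Drop the lowest-scoring chunks until the prompt fits the context window."""
    available = n_ctx - reserved_completion
    # Best chunks first, so the least relevant chunk is always last.
    kept = sorted(chunks, key=lambda c: c[1], reverse=True)
    dropped = 0

    def prompt_tokens() -> int:
        return estimate_tokens(question) + sum(estimate_tokens(t) for t, _ in kept)

    while kept and prompt_tokens() > available:
        kept.pop()          # remove the least relevant remaining chunk
        dropped += 1

    report = {
        "limit": n_ctx,
        "estimated_prompt": prompt_tokens(),
        "reserved_completion": reserved_completion,
        "chunks_dropped": dropped,
    }
    return kept, report


kept, report = fit_to_budget(
    "q" * 40,
    [("a" * 400, 0.9), ("b" * 400, 0.5), ("c" * 400, 0.1)],
    n_ctx=512,
    reserved_completion=256,
)
print(report)
```

With these inputs the question is estimated at 10 tokens and each chunk at 100, so the 310-token prompt exceeds the 256 available tokens and the lowest-scoring chunk is dropped.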


Chat Diagnostics in /api/chat

In addition to the generated answer, chat responses include diagnostic fields that explain how the pipeline behaved:

  • token_budget — context-fit decisions and trimming outcome
  • retrieval_settings — effective retrieval mode and related options used
  • timing — stage-level timing information for profiling and troubleshooting

These fields are useful for validating classifier behavior, tuning rerank thresholds, and understanding why chunks were dropped.
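One way to inspect these fields is to reduce a parsed response to a short summary, as in the sketch below. The three top-level keys and the token_budget keys follow the lists above; the keys inside `retrieval_settings` and `timing` in the sample payload (`mode`, `retrieve_ms`) are placeholders, since their actual contents are not specified here, so the helper treats both objects as opaque.

```python
import json
from typing import Any, Dict


def summarize_diagnostics(response: Dict[str, Any]) -> str:
    """Build a short, log-friendly summary of the diagnostic fields."""
    budget = response.get("token_budget") or {}
    used = budget.get("estimated_prompt", 0) + budget.get("reserved_completion", 0)
    limit = budget.get("limit", 0)
    dropped = budget.get("chunks_dropped", 0)

    lines = [f"budget: {used}/{limit} tokens, {dropped} chunk(s) dropped"]
    # The inner structure of these objects is not assumed; dump them as-is.
    for key in ("retrieval_settings", "timing"):
        lines.append(f"{key}: " + json.dumps(response.get(key) or {}, sort_keys=True))
    return "\n".join(lines)


# Sample payload with placeholder inner keys, for illustration only.
sample = {
    "token_budget": {
        "limit": 4096,
        "estimated_prompt": 2048,
        "reserved_completion": 1024,
        "chunks_dropped": 1,
    },
    "retrieval_settings": {"mode": "hybrid"},
    "timing": {"retrieve_ms": 42},
}
print(summarize_diagnostics(sample))
```

A nonzero `chunks_dropped` alongside a prompt close to the limit suggests that trimming, not the rerank threshold, removed context.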