Query Engine
The Query Engine is the orchestration layer for the full request flow:
classify → retrieve → rerank → budget → generate
It coordinates retrieval strategy selection, result refinement, context budget control, and final LLM generation in one deterministic pipeline.
Pipeline Stages
- Classify: The query classifier (rule-based or hybrid LLM mode) detects intent and selects retrieval mode plus metadata filters automatically.
- Retrieve: The engine runs the selected retrieval strategy: semantic, hybrid, metadata-only, comparison, or hybrid BM25.
- Rerank: Pluggable rerankers re-score and filter results. Built-in: ScoreThresholdReranker, which drops chunks below a configured confidence threshold.
- Budget: Token budget management estimates prompt size, reserves completion tokens, and trims lowest-relevance chunks to keep the request within model context limits.
- Generate: The final trimmed context is sent to the LLM for answer generation.
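The five stages above can be pictured as a chain of plain function calls. The following is a minimal sketch, not the engine's actual code: `Chunk`, `threshold_rerank`, and `run_pipeline` are hypothetical names, the stage callables are stand-ins supplied by the caller, and the default threshold of 0.5 is an arbitrary illustration value. `threshold_rerank` only mimics the behavior described for ScoreThresholdReranker (dropping chunks below a threshold).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    text: str
    score: float  # relevance score assigned by retrieval or reranking


def threshold_rerank(chunks: List[Chunk], min_score: float) -> List[Chunk]:
    """Hypothetical stand-in for a score-threshold reranker:
    drop chunks below min_score and order the rest by descending score."""
    kept = [c for c in chunks if c.score >= min_score]
    return sorted(kept, key=lambda c: c.score, reverse=True)


def run_pipeline(
    query: str,
    classify: Callable[[str], str],
    retrieve: Callable[[str, str], List[Chunk]],
    budget: Callable[[List[Chunk]], List[Chunk]],
    generate: Callable[[str, List[Chunk]], str],
    min_score: float = 0.5,  # assumed threshold, for illustration only
) -> str:
    mode = classify(query)                         # 1. classify: pick a retrieval mode
    chunks = retrieve(query, mode)                 # 2. retrieve: run the selected strategy
    chunks = threshold_rerank(chunks, min_score)   # 3. rerank: filter low-confidence chunks
    chunks = budget(chunks)                        # 4. budget: trim to fit the context window
    return generate(query, chunks)                 # 5. generate: produce the answer
```

Because each stage is passed in as a callable, every step can be swapped or stubbed independently, which is one way to keep the overall flow deterministic and easy to test.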
Token Budget Management
Before generation, the engine estimates prompt token usage against N_CTX, reserves completion space, then drops low-relevance chunks until the prompt fits.
The /api/chat response includes a token_budget diagnostic object:
- limit: context window limit used for this request
- estimated_prompt: estimated prompt tokens after retrieval/reranking
- reserved_completion: completion token reserve
- chunks_dropped: number of chunks removed to stay in budget
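The trimming behavior and the resulting diagnostic object can be sketched as follows. This is an illustration under stated assumptions, not the engine's implementation: `estimate_tokens` and `fit_to_budget` are hypothetical names, the four-characters-per-token heuristic is a rough stand-in for a real tokenizer, and `estimated_prompt` is computed here after trimming, which the source does not confirm. Only the four field names come from the documentation above.

```python
from typing import Dict, List, Tuple

ScoredChunk = Tuple[str, float]  # (chunk text, relevance score)


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def fit_to_budget(
    base_prompt: str,
    chunks: List[ScoredChunk],
    n_ctx: int,
    reserved_completion: int,
) -> Tuple[List[ScoredChunk], Dict[str, int]]:
    """Drop the lowest-relevance chunks until the prompt fits the window."""
    available = n_ctx - reserved_completion
    base = estimate_tokens(base_prompt)
    # Most relevant first, so the least relevant chunk sits at the end.
    kept = sorted(chunks, key=lambda c: c[1], reverse=True)

    def prompt_size(items: List[ScoredChunk]) -> int:
        return base + sum(estimate_tokens(text) for text, _ in items)

    dropped = 0
    while kept and prompt_size(kept) > available:
        kept.pop()  # remove the current lowest-relevance chunk
        dropped += 1

    diagnostics = {
        "limit": n_ctx,
        "estimated_prompt": prompt_size(kept),
        "reserved_completion": reserved_completion,
        "chunks_dropped": dropped,
    }
    return kept, diagnostics
```

If the base prompt alone exceeds the available space, this sketch drops every chunk and still returns an over-budget estimate; a real implementation would need to decide how to handle that case.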
Set RAG_LOG_TOKEN_USAGE=true to log per-request budget diagnostics in runtime logs.
Chat Diagnostics in /api/chat
In addition to the generated answer, chat responses include diagnostics to explain pipeline behavior:
- token_budget: context-fit decisions and trimming outcome
- retrieval_settings: effective retrieval mode and related options used
- timing: stage-level timing information for profiling and troubleshooting
These fields are useful for validating classifier behavior, tuning rerank thresholds, and understanding why chunks were dropped.
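As an illustration of how a client might use these fields, the sketch below turns an already-parsed /api/chat response into a few log lines. `summarize_diagnostics` is a hypothetical helper, not part of the project. It relies only on the documented token_budget field names and makes no assumption about the inner structure of retrieval_settings or timing, which it prints as-is.

```python
from typing import Any, Dict, List


def summarize_diagnostics(response: Dict[str, Any]) -> List[str]:
    """Build human-readable lines from the diagnostic fields of a chat response."""
    lines: List[str] = []

    budget = response.get("token_budget") or {}
    if budget:
        lines.append(
            f"budget: {budget.get('estimated_prompt')} prompt + "
            f"{budget.get('reserved_completion')} reserved of {budget.get('limit')}"
        )
        dropped = budget.get("chunks_dropped") or 0
        if dropped > 0:
            lines.append(
                f"warning: {dropped} chunk(s) dropped to fit the context window"
            )

    # The inner layout of these objects is not documented here,
    # so they are reported verbatim rather than interpreted.
    for key in ("retrieval_settings", "timing"):
        section = response.get(key)
        if section is not None:
            lines.append(f"{key}: {section}")

    return lines
```

A non-zero chunks_dropped count is the signal to look at first when an answer seems to ignore relevant material, since it means context was trimmed before generation.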