Query Engine

The Query Engine is the orchestration layer for the full request flow:

classify → retrieve → rerank → budget → generate

It coordinates retrieval strategy selection, result refinement, context budget control, and final LLM generation in a single pipeline whose stages always run in the same order.
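The stage ordering can be sketched as a plain function composition. This is a minimal illustration, not the engine's actual API. The `Chunk` and `QueryPlan` types, the `run_pipeline` function, and the single keyword rule in `classify` are hypothetical stand-ins. Generation is stubbed with a lambda so the example runs without an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from retrieval / reranking


@dataclass
class QueryPlan:
    mode: str  # e.g. "semantic", "hybrid", "comparison"
    filters: dict = field(default_factory=dict)  # metadata filters, if any


def classify(query: str) -> QueryPlan:
    # Hypothetical rule-based classifier: one keyword rule, for illustration only.
    q = query.lower()
    if "compare" in q or " vs " in q:
        return QueryPlan(mode="comparison")
    return QueryPlan(mode="semantic")


def run_pipeline(
    query: str,
    retrieve: Callable[[str, QueryPlan], list[Chunk]],
    rerank: Callable[[list[Chunk]], list[Chunk]],
    budget: Callable[[list[Chunk]], list[Chunk]],
    generate: Callable[[str, list[Chunk]], str],
) -> str:
    # Stages always run in the same order:
    # classify -> retrieve -> rerank -> budget -> generate.
    plan = classify(query)
    chunks = retrieve(query, plan)
    chunks = rerank(chunks)
    chunks = budget(chunks)
    return generate(query, chunks)


# Usage with stub stages (all hypothetical).
answer = run_pipeline(
    "compare plan A vs plan B",
    retrieve=lambda q, plan: [Chunk("A costs less", 0.9), Chunk("noise", 0.1)],
    rerank=lambda cs: [c for c in cs if c.score >= 0.5],
    budget=lambda cs: cs[:3],
    generate=lambda q, cs: f"{len(cs)} chunk(s) used",
)
print(answer)  # 1 chunk(s) used
```

Passing each stage in as a callable keeps the order fixed while letting any individual stage be swapped out, which mirrors the pluggable design described on this page.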

Pipeline Stages

Classify
The query classifier (in rule-based or hybrid LLM mode) detects the query's intent and automatically selects the retrieval mode and any metadata filters.
Retrieve
The engine runs the selected retrieval strategy: semantic, hybrid, metadata-only, comparison, or hybrid BM25.
Rerank
Pluggable rerankers re-score and filter the retrieved results. The built-in ScoreThresholdReranker drops chunks whose score falls below a configured confidence threshold.
Budget
Token budget management estimates prompt size, reserves completion tokens, and trims lowest-relevance chunks to keep the request within model context limits.
Generate
The final trimmed context is sent to the LLM for answer generation.
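A pluggable reranker can be modeled as any object with a `rerank` method. The sketch below assumes such an interface and reproduces the filtering behavior described for ScoreThresholdReranker. The `Reranker` protocol, the `Chunk` type, the `threshold` parameter, and `apply_rerankers` are all hypothetical. They are not the engine's documented signatures. Chaining several rerankers in sequence is also an assumption.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class Chunk:
    text: str
    score: float  # confidence / relevance score from retrieval


class Reranker(Protocol):
    # Hypothetical plugin interface: take chunks, return re-scored or filtered chunks.
    def rerank(self, query: str, chunks: list[Chunk]) -> list[Chunk]: ...


class ScoreThresholdReranker:
    """Sketch: drops chunks whose score is below a confidence threshold."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def rerank(self, query: str, chunks: list[Chunk]) -> list[Chunk]:
        # Keep the original order; only remove chunks scoring below the threshold.
        return [c for c in chunks if c.score >= self.threshold]


def apply_rerankers(
    query: str, chunks: list[Chunk], rerankers: Sequence[Reranker]
) -> list[Chunk]:
    # Assumption: multiple rerankers are applied one after another.
    for reranker in rerankers:
        chunks = reranker.rerank(query, chunks)
    return chunks


chunks = [Chunk("relevant", 0.82), Chunk("borderline", 0.40), Chunk("off-topic", 0.12)]
kept = apply_rerankers("example query", chunks, [ScoreThresholdReranker(threshold=0.4)])
print([c.text for c in kept])  # ['relevant', 'borderline']
```

Treating the threshold as inclusive (`>=`) matches "drops chunks below" the threshold: a chunk scoring exactly the threshold is kept.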

Token Budget Management

Before generation, the engine estimates the prompt's token count against the context window (N_CTX), reserves space for the completion, and then drops the lowest-relevance chunks until the prompt fits in the remaining budget.

The /api/chat response includes a token_budget diagnostic object:

limit — context window limit used for this request
estimated_prompt — estimated prompt tokens after retrieval/reranking
reserved_completion — completion token reserve
chunks_dropped — number of chunks removed to stay in budget
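The trimming step can be sketched as a greedy loop that removes the lowest-relevance chunk until the estimated prompt fits. This sketch rests on several assumptions. Token counts use a rough 4-characters-per-token heuristic instead of the model's tokenizer. The names `fit_to_budget`, `n_ctx`, and `completion_reserve` are hypothetical. The returned dict only mirrors the documented `token_budget` field names, and reporting `estimated_prompt` after trimming is itself an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score after reranking


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    # A real implementation would use the model's tokenizer.
    return math.ceil(len(text) / 4)


def fit_to_budget(
    system_prompt: str, query: str, chunks: list[Chunk], n_ctx: int, completion_reserve: int
) -> tuple[list[Chunk], dict]:
    """Drop lowest-relevance chunks until prompt + completion reserve fits in n_ctx."""
    available = n_ctx - completion_reserve
    fixed = estimate_tokens(system_prompt) + estimate_tokens(query)
    kept = list(chunks)

    def prompt_tokens(cs: list[Chunk]) -> int:
        return fixed + sum(estimate_tokens(c.text) for c in cs)

    dropped = 0
    while kept and prompt_tokens(kept) > available:
        # Remove the single lowest-scoring chunk, then re-check the estimate.
        worst = min(range(len(kept)), key=lambda i: kept[i].score)
        kept.pop(worst)
        dropped += 1

    diagnostics = {
        "limit": n_ctx,
        "estimated_prompt": prompt_tokens(kept),  # assumption: measured after trimming
        "reserved_completion": completion_reserve,
        "chunks_dropped": dropped,
    }
    return kept, diagnostics


kept, diagnostics = fit_to_budget(
    system_prompt="You are helpful.",
    query="What is X?",
    chunks=[Chunk("a" * 80, 0.9), Chunk("b" * 80, 0.2), Chunk("c" * 80, 0.5)],
    n_ctx=100,
    completion_reserve=40,
)
print([c.text[0] for c in kept], diagnostics)
# ['a', 'c'] {'limit': 100, 'estimated_prompt': 47, 'reserved_completion': 40, 'chunks_dropped': 1}
```

Recomputing the full estimate on every iteration keeps the sketch short. A production version would instead subtract each dropped chunk's token count from a running total.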


Set RAG_LOG_TOKEN_USAGE=true to log per-request budget diagnostics in runtime logs.

Chat Diagnostics in `/api/chat`

In addition to the generated answer, chat responses include diagnostics to explain pipeline behavior:

token_budget — context-fit decisions and trimming outcome
retrieval_settings — effective retrieval mode and related options used
timing — stage-level timing information for profiling and troubleshooting

These fields are useful for validating classifier behavior, tuning rerank thresholds, and understanding why chunks were dropped.
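When tuning thresholds, it can help to print these diagnostics for each request. The helper below is a sketch that assumes the /api/chat response has already been parsed from JSON into a dict and that `token_budget`, `retrieval_settings`, and `timing` are top-level keys. Only the four `token_budget` field names come from this page. `summarize_diagnostics` is a hypothetical name, the example payload is made up for illustration, and the nested keys inside `retrieval_settings` and `timing` are placeholders because their shape is not specified here.

```python
import json


def summarize_diagnostics(response: dict) -> str:
    # Assumption: diagnostics are top-level keys of the parsed /api/chat JSON.
    lines = []
    budget = response.get("token_budget")
    if budget:
        lines.append(
            f"budget: {budget['estimated_prompt']} prompt + "
            f"{budget['reserved_completion']} reserved of {budget['limit']} "
            f"({budget['chunks_dropped']} chunk(s) dropped)"
        )
    for key in ("retrieval_settings", "timing"):
        if key in response:
            # Shape not documented here, so dump it verbatim as sorted JSON.
            lines.append(f"{key}: {json.dumps(response[key], sort_keys=True)}")
    return "\n".join(lines)


# Illustrative payload only; all values and nested keys are hypothetical.
example = {
    "token_budget": {
        "limit": 4096,
        "estimated_prompt": 2900,
        "reserved_completion": 512,
        "chunks_dropped": 3,
    },
    "retrieval_settings": {"mode": "hybrid"},
    "timing": {"retrieve_ms": 41},
}
print(summarize_diagnostics(example))
```

A nonzero `chunks_dropped` together with a large `estimated_prompt` suggests the context window, not the reranker, is what removed chunks.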
