
Does System Prompt Placement Matter for LLM-as-a-Judge? A Cross-Model Study
Every team building an LLM-as-a-judge pipeline faces the same question: does it matter where you put your instructions? We tested it across models to get a real answer.

AI Research Engineer

Key Takeaways
No statistically significant placement effects. Across 34 model-task comparisons (Fisher's exact test, a standard test for comparing categorical outcomes), only 3 reached p < 0.05, and 2 of those were driven by parsing issues, not genuine placement effects. Observed deltas of 1-7 percentage points (pp) are consistent with sampling noise at n=100.
Model selection is the dominant factor. Accuracy gaps between models (up to 20pp on the same task) dwarf any observed placement effect.
A minimal system prompt is a safe, practical default. Configuration B (generic role framing in system) never caused meaningful regressions and performed at or near the top across tasks. It also carries less risk than putting all instructions in the system message, which produced the largest observed drops for smaller models (under ~70B parameters, e.g., Llama 3.1 8B (llama-3.1-8b-instant), Claude 3.5 Haiku (claude-3-5-haiku-20241022), GPT-4o mini (gpt-4o-mini)).
Why This Experiment
Should instructions go in the system message or user message? Does a role definition help? These questions come up often, but there's limited cross-model data to answer them.
Common guidance recommends separating "stable rules" (system) from "example-specific context" (user). We ran a prompt-placement experiment across multiple models and judge tasks, keeping the evaluation content constant while varying only where the instructions live.
Research questions:
Do different models react differently to having a system prompt?
What is the effect of a generic system prompt with extensive user instructions?
What happens when the whole eval prompt goes into system or user?
How do models react to best-practice instruction splits?
The goal: quantify whether prompt placement meaningfully improves judge reliability.
Our Approach
We evaluated multiple model families (Claude, GPT, Llama, and Gemini) across four common LLM-as-a-judge task types. In pointwise scoring, the judge assigns a score (0-4) to a single response using a rubric; we report exact-match rate against human labels. In pairwise ranking, the judge chooses the better of two responses for the same input. In safety classification, the judge labels content as safe or unsafe. In feedback evaluation, the judge both selects a winner and generates feedback, which we compare against reference judgments.
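For concreteness, the two agreement metrics used for pointwise scoring (exact match and the within-±1 variant mentioned in the methodology notes) can be sketched as follows; the scores here are illustrative, not experiment data.

```python
# Sketch of the pointwise agreement metrics. Judge and human scores are
# integers on the 0-4 rubric scale.

def exact_match_rate(judge_scores, human_scores):
    """Fraction of examples where the judge's score equals the human label."""
    pairs = list(zip(judge_scores, human_scores))
    return sum(j == h for j, h in pairs) / len(pairs)

def within_one_rate(judge_scores, human_scores):
    """Fraction of examples where the judge is within +/-1 of the human label."""
    pairs = list(zip(judge_scores, human_scores))
    return sum(abs(j - h) <= 1 for j, h in pairs) / len(pairs)

judge = [3, 2, 4, 1, 3]
human = [3, 3, 4, 0, 1]
print(exact_match_rate(judge, human))  # 0.4
print(within_one_rate(judge, human))   # 0.8
```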
We used publicly available evaluation datasets: a 200-example HelpSteer3-derived multi-dimensional set for pointwise scoring (general-assistant responses rated on helpfulness, correctness, coherence, complexity, and verbosity), XSTest for safety classification (n=450; safe/unsafe prompts designed to probe exaggerated safety behavior), and LMSYS MT-Bench Human Judgments for feedback evaluation (n=500; multi-domain pairwise conversations with human preference judgments). For pairwise ranking, we used a combined 620-example set spanning LiveCodeBench, LiveBench, and MMLU-Pro.
For each model-task combination, we ran 100-500 examples per configuration and compared performance across five prompt-placement configurations. We intentionally chose ~100 examples per configuration as a cost/time trade-off to cover more models and tasks, accepting lower statistical power (discussed in Limitations).
| Configuration | Name | System Message | User Message |
|---|---|---|---|
| A | All User | Empty | Full instructions + rubric + context |
| B | Minimal System | Generic role framing | Full rubric + context |
| C | Role System | Role definition + task framing | Full rubric + context |
| D | Instruction System | Most instructions + rubric | Context only |
| E | Full System | Nearly all instructions + rubric | Minimal task input |
Some providers require at least one user message, so all configurations included one. When most instructions moved to system, the user message was reduced to the minimum required task input.
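The five configurations can be sketched as different splits of the same prompt content into chat messages. The strings and the `build_messages` helper below are placeholders for illustration, not the actual prompts used in the experiment.

```python
# Hypothetical sketch: the same judge prompt content split five ways
# between the system and user messages, per the configuration table above.

ROLE = "You are an impartial judge evaluating AI responses."
INSTRUCTIONS = "Score the response from 0-4 using the rubric below."
RUBRIC = "<rubric text>"
CONTEXT = "<response to evaluate>"

def build_messages(config: str) -> list:
    splits = {
        # config: (system content, user content)
        "A": ("", f"{ROLE}\n{INSTRUCTIONS}\n{RUBRIC}\n{CONTEXT}"),
        "B": (ROLE, f"{INSTRUCTIONS}\n{RUBRIC}\n{CONTEXT}"),
        "C": (f"{ROLE}\n{INSTRUCTIONS}", f"{RUBRIC}\n{CONTEXT}"),
        "D": (f"{ROLE}\n{INSTRUCTIONS}\n{RUBRIC}", CONTEXT),
        "E": (f"{ROLE}\n{INSTRUCTIONS}\n{RUBRIC}", f"Evaluate:\n{CONTEXT}"),
    }
    system, user = splits[config]
    messages = []
    if system:  # Configuration A sends no system message at all
        messages.append({"role": "system", "content": system})
    # Providers require at least one user message, so every config includes one.
    messages.append({"role": "user", "content": user})
    return messages
```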
Methodology notes: All runs used temperature 0. This produces a single deterministic output per example, so we're reporting point estimates rather than averages over repeated stochastic runs; in principle, re-running each configuration multiple times at t>0 and averaging would yield tighter estimates (though often impractical at scale). We validated this choice by re-running a subset of pointwise evaluations at temperatures 0.5 and 1.0 on four models (GPT-4o mini (gpt-4o-mini), Claude 3.5 Haiku (claude-3-5-haiku-20241022), Llama 3.3 70B (llama-3.3-70b-versatile), and Gemini 2.5 Flash (gemini-2.5-flash)). Results were consistent across all models: exact-match variance stayed within a few percentage points, and within-±1 accuracy remained stable or improved. Experiments ran February 2026 on then-current model versions.
Model naming: We use provider names alongside the exact API model ids used in the experiment (in parentheses).
Model coverage: We tested 10 models total. Not all models appear in every task due to API availability and sample size requirements. Tables below show representative subsets; full results for all models are available in the experiment repository.
Results
We use Configuration A (all instructions in user, no system prompt) as the baseline for all comparisons. Tables show each model's baseline accuracy, the best-performing alternative configuration, and the delta between them. We highlight the two tasks with the most variation (pairwise ranking and safety classification) and summarize the others briefly.
Overall: Which Configuration Wins?

Distribution of best-performing configurations across 30 model-task combinations (13,897 successful parses across included configurations). This count is not unique examples: most comparisons use ~100 examples per configuration, and statistical power is driven by that per-config n. Only configurations with ≥20 successful samples are included. In 10 of 30 pairs, multiple configurations tied for best accuracy; these are assigned to the lower-letter configuration.
Across model-task pairs, Configuration B was consistently competitive (often tied with the top accuracy) and never showed meaningful regressions. Given the high tie rate (33% of pairs) and the absence of statistically significant differences (see below), any apparent ordering among configurations should be treated as descriptive rather than evidence of a true placement effect.
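The tie-breaking rule from the figure caption (ties credited to the lower-letter configuration) is simple enough to state in code; the accuracy values below are made up for illustration.

```python
# Sketch of the tie-breaking rule used when counting "best" configurations:
# if several configurations share the top accuracy, the win is assigned to
# the lowest-letter configuration.
def best_config(accuracy_by_config: dict) -> str:
    best = max(accuracy_by_config.values())
    winners = [c for c, a in sorted(accuracy_by_config.items()) if a == best]
    return winners[0]  # ties go to the lower letter

print(best_config({"A": 0.63, "B": 0.61, "C": 0.63, "D": 0.60, "E": 0.58}))  # A
```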
Pairwise Ranking
| Model | Configuration A | Best Configuration | Delta |
|---|---|---|---|
| GPT-5 mini (gpt-5-mini) | 80.4% | C: 87.1% | +6.7 |
| Gemini 3 Flash (preview) (gemini-3-flash-preview) | 73.8% | C: 75.3% | +1.5 |
| Claude Haiku 4.5 (claude-haiku-4-5-20251001) | 63.0% | A/C: 63.0% | 0.0 (E: -5.0) |
| Llama 3.3 70B (llama-3.3-70b-versatile) | 54.0% | A: 54.0% | 0.0 |
| Llama 3.1 8B (llama-3.1-8b-instant) | 47.0% | B: 52.0% | +5.0 |
| GPT-4o mini (gpt-4o-mini) | 47.0% | E: 53.0% | +6.0 |
| Claude 3.5 Haiku (claude-3-5-haiku-20241022) | 45.0% | B: 52.0% | +7.0 |
7 of 10 models shown. GPT-5 (gpt-5) and Gemini 3 Pro (preview) (gemini-3-pro-preview) were not evaluated on pairwise.
GPT-5 mini (gpt-5-mini) showed the largest observed delta (+6.7pp with Configuration C), and Claude 3.5 Haiku (claude-3-5-haiku-20241022) moved from 45% to 52% with a minimal system prompt. However, neither difference reached statistical significance (Fisher's exact test, p=0.25 and p=0.40 respectively). At n=100 per configuration, deltas of this magnitude are consistent with sampling variation. Gemini 3 Flash (preview) (gemini-3-flash-preview) was relatively stable across configurations (~74%).
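The significance check for the Claude 3.5 Haiku comparison (45/100 correct under Configuration A vs 52/100 under B) can be reproduced with a Fisher's exact test; this assumes SciPy is available.

```python
# Fisher's exact test on the largest pairwise delta in the table above:
# 52/100 correct (Config B) vs 45/100 correct (Config A).
from scipy.stats import fisher_exact

table = [[52, 100 - 52],   # Config B: correct, incorrect
         [45, 100 - 45]]   # Config A: correct, incorrect
_, p_value = fisher_exact(table, alternative="two-sided")
print(p_value)  # reported in the text as p=0.40 -- well above 0.05
```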
Safety Classification
| Model | Configuration A | Best Configuration | Delta |
|---|---|---|---|
| Llama 3.3 70B (llama-3.3-70b-versatile) | 97.0% | All: 97.0% | 0.0 |
| Claude 3.5 Haiku (claude-3-5-haiku-20241022) | 93.0% | A/C: 93.0% | 0.0 (E: -4.0) |
| GPT-4o mini (gpt-4o-mini) | 91.0% | C/E: 93.0% | +2.0 |
| GPT-5 mini (gpt-5-mini) | 90.0% | B: 92.0% | +2.0 |
| Claude Haiku 4.5 (claude-haiku-4-5-20251001) | 88.0% | A/B: 88.0% | 0.0 (E: -3.0) |
| Llama 3.1 8B (llama-3.1-8b-instant) | 84.0% | B/C/D: 85.0% | +1.0 (E: -5.0) |
6 of 10 models shown. Gemini 2.5 Flash (gemini-2.5-flash) excluded due to insufficient samples; GPT-5 (gpt-5) and Gemini 3 Pro (preview) (gemini-3-pro-preview) were not evaluated on safety.
Safety classification was generally stable for stronger models. Llama 3.3 70B (llama-3.3-70b-versatile) held steady at 97% across all configurations. The pattern changed for smaller models: moving nearly all instructions into the system prompt (Configuration E) produced the largest observed drops. Llama 3.1 8B (llama-3.1-8b-instant) fell from 84% to 79% and Claude 3.5 Haiku (claude-3-5-haiku-20241022) dropped from 93% to 89%. While these regressions are directionally concerning, neither reached statistical significance (p=0.28 and p=0.46 respectively), so we cannot rule out sampling noise at these sample sizes.
Pointwise Scoring and Feedback Evaluation
Pointwise scoring (exact-match against human labels) and feedback evaluation (preference accuracy against reference judgments) both showed minimal placement sensitivity. The largest pointwise delta was GPT-5 mini (gpt-5-mini) gaining +6pp with Configuration C (p=0.43, not significant). All other pointwise and feedback deltas were 1-4pp, well within sampling noise at n=100. No model-configuration pair in either task produced a result distinguishable from the baseline at p < 0.05.
Discussion
The implication for practitioners is that prompt placement is a second-order lever: optimize model choice, rubric quality, and output validation first. As a default, a minimal system message plus detailed user instructions (Configuration B) is a low-risk starting point; if you move substantial instructions into system for caching or governance reasons (Configurations D/E), validate explicitly, especially on safety tasks and smaller models. If you need tighter estimates than these point samples provide, increase n and/or repeat each configuration at t>0 and average.
Practical Recommendations
Quick Reference
| Model Size | Recommended Default | Avoid |
|---|---|---|
| Large (70B+) | Any config works | - |
| Medium (8-70B) | B (minimal system) | E (full-system) |
| Small (<8B) | A or B | E (4-6pp regression risk) |
Details
Use Configuration B as your default starting point: it never caused meaningful regressions and performed at or near the top across tasks in our tests. But since the spread across configurations is modest, test a few options for your specific model and task. Keep the system prompt concise (role definition and output schema) and put scenario context, examples, and task-specific evidence in user templates.
Treat Configuration E as a high-variance option. Validate it explicitly before adopting full-system prompts, especially for smaller models or safety-critical workloads. In production, enforce output-format checks so regressions are detected quickly.
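An output-format check can be as simple as a parser that refuses to guess. The verdict schema below (`WINNER: A` / `WINNER: B`) is a hypothetical example, not the format used in the experiment; the point is that unparseable judge outputs are counted and surfaced rather than silently scored.

```python
# Minimal sketch of an output-format check for a pairwise judge. Tracking the
# parse rate in production makes placement or model regressions visible fast.
import re

VERDICT_PATTERN = re.compile(r"^WINNER:\s*(A|B)\s*$", re.MULTILINE)

def parse_verdict(raw_output: str):
    """Return 'A' or 'B', or None if the judge output doesn't match the schema."""
    match = VERDICT_PATTERN.search(raw_output)
    return match.group(1) if match else None

def parse_rate(outputs: list) -> float:
    """Fraction of judge outputs that parsed; alert if this dips over time."""
    return sum(parse_verdict(o) is not None for o in outputs) / len(outputs)

print(parse_verdict("Reasoning first...\nWINNER: B"))  # B
print(parse_rate(["WINNER: A", "I cannot decide."]))   # 0.5
```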
Limitations
Statistical power: With 100 samples per configuration, we are underpowered to detect effects smaller than ~15pp at p < 0.05. Power analysis suggests ~2,000 samples per configuration would be needed to reliably detect 3pp effects. This means our null results do not prove placement has zero effect, only that any effect is too small to detect at this scale, which itself is informative for practitioners deciding where to invest optimization effort.
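A back-of-envelope version of this power analysis, using the standard normal approximation for comparing two proportions (pure stdlib). The 85% baseline is an assumption for illustration; the exact n depends on the baseline accuracy.

```python
# Two-proportion sample-size calculation (normal approximation), two-sided
# test. Reproduces the rough "~2,000 per configuration to detect 3pp" figure
# at an assumed 85% baseline accuracy.
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(p1: float, p2: float, alpha: float = 0.05,
                power: float = 0.80) -> int:
    """Samples per configuration needed to detect p1 vs p2."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

print(n_per_group(0.85, 0.88))  # on the order of 2,000 for a 3pp effect
print(n_per_group(0.85, 0.70))  # a ~15pp effect is detectable near n=100
```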
Token/cost implications: Full-system configurations (D, E) place more tokens in the cached system prompt portion, which can reduce per-request costs with providers that offer prompt caching. We did not measure this effect, but it's worth considering for high-volume deployments.
Task dependency: Safety, pairwise, and pointwise tasks do not respond identically to placement changes, so findings from one task type may not transfer. These results also reflect current model versions and may drift with future updates.
Conclusion and Future Work
Across 34 model-task comparisons, no prompt placement configuration produced a statistically significant improvement over the all-user baseline (Fisher's exact test, p < 0.05). The largest observed deltas (6-7pp) had p-values of 0.25-0.43, well within the range of sampling noise at n=100. This absence of significant effects is itself the finding: if placement effects exist, they are small enough to be practically irrelevant compared to model selection, which produces accuracy gaps of up to 20pp on the same task.
What would change these findings?
Longer system prompts. Our configurations ranged from empty to ~90% system. Very long system prompts (10k+ tokens) might interact differently with context caching and attention patterns.
Multi-turn judge scenarios. We tested single-turn judgments. Multi-turn evaluation conversations, where the judge asks clarifying questions, might benefit more from system-level behavioral grounding.
Newer model versions. As providers update instruction hierarchy training, sensitivity to prompt placement may shift. Re-validating periodically is worthwhile.
Higher-stakes tasks. Safety classification showed the clearest model-size effects. Critical applications (content moderation, compliance) may warrant more defensive prompt structures regardless of accuracy deltas.
For teams building LLM-as-a-judge pipelines today: prioritize model selection and prompt content. Configuration B remains a sensible default; in our tests it never caused meaningful regressions and sat at or near the top across tasks.
If you're building an LLM judge pipeline in production, Orq offers tooling for running evals and monitoring judge quality over time; get started here.
Pointwise Scoring and Feedback Evaluation
Pointwise scoring (exact-match against human labels) and feedback evaluation (preference accuracy against reference judgments) both showed minimal placement sensitivity. The largest pointwise delta was GPT-5 mini (gpt-5-mini) gaining +6pp with Configuration C (p=0.43, not significant). All other pointwise and feedback deltas were 1-4pp, well within sampling noise at n=100. No model-configuration pair in either task produced a result distinguishable from the baseline at p < 0.05.
Discussion
The implication for practitioners is that prompt placement is a second-order lever: optimize model choice, rubric quality, and output validation first. As a default, a minimal system message plus detailed user instructions (Configuration B) is a low-risk starting point; if you move substantial instructions into system for caching or governance reasons (Configurations D/E), validate explicitly, especially on safety tasks and smaller models. If you need tighter estimates than these point samples provide, increase n and/or repeat each configuration at t>0 and average.
Practical Recommendations
Quick Reference
Model Size | Recommended Default | Avoid |
|---|---|---|
Large (70B+) | Any config works | - |
Medium (8-70B) | B (minimal system) | E (full-system) |
Small (<8B) | A or B | E (-4-6pp risk) |
Details
Use Configuration B as your default starting point: it never caused meaningful regressions and performed at or near the top across tasks in our tests. But since the spread across configurations is modest, test a few options for your specific model and task. Keep the system prompt concise (role definition and output schema) and put scenario context, examples, and task-specific evidence in user templates.
Treat Configuration E as a high-variance option. Validate it explicitly before adopting full-system prompts, especially for smaller models or safety-critical workloads. In production, enforce output-format checks so regressions are detected quickly.
Limitations
Statistical power: With 100 samples per configuration, we are underpowered to detect effects smaller than ~15pp at p < 0.05. Power analysis suggests ~2,000 samples per configuration would be needed to reliably detect 3pp effects. This means our null results do not prove placement has zero effect, only that any effect is too small to detect at this scale, which itself is informative for practitioners deciding where to invest optimization effort.
Token/cost implications: Full-system configurations (D, E) place more tokens in the cached system prompt portion, which can reduce per-request costs with providers that offer prompt caching. We did not measure this effect, but it's worth considering for high-volume deployments.
Task dependency: Safety, pairwise, and pointwise tasks do not respond identically to placement changes, so findings from one task type may not transfer. These results also reflect current model versions and may drift with future updates.
Conclusion and Future Work
Across 34 model-task comparisons, no prompt placement configuration produced a statistically significant improvement over the all-user baseline (Fisher's exact test, p < 0.05). The largest observed deltas (6-7pp) had p-values of 0.25-0.43, well within the range of sampling noise at n=100. This absence of significant effects is itself the finding: if placement effects exist, they are small enough to be practically irrelevant compared to model selection, which produces accuracy gaps of up to 20pp on the same task.
What would change these findings?
Longer system prompts. Our configurations ranged from empty to ~90% system. Very long system prompts (10k+ tokens) might interact differently with context caching and attention patterns.
Multi-turn judge scenarios. We tested single-turn judgments. Multi-turn evaluation conversations, where the judge asks clarifying questions, might benefit more from system-level behavioral grounding.
Newer model versions. As providers update instruction hierarchy training, sensitivity to prompt placement may shift. Re-validating periodically is worthwhile.
Higher-stakes tasks. Safety classification showed the clearest model-size effects. Critical applications (content moderation, compliance) may warrant more defensive prompt structures regardless of accuracy deltas.
For teams building LLM-as-a-judge pipelines today: prioritize model selection and prompt content. Use Configuration B as your default starting point: it never caused meaningful regressions and performed at or near the top across tasks in our tests.
If you're building an LLM judge pipeline in production, Orq offers tooling for running evals and monitoring judge quality over time; get started here.
The goal: quantify whether prompt placement meaningfully improves judge reliability.
Our Approach
We evaluated multiple model families (Claude, GPT, Llama, and Gemini) across four common LLM-as-a-judge task types. In pointwise scoring, the judge assigns a score (0-4) to a single response using a rubric; we report exact-match rate against human labels. In pairwise ranking, the judge chooses the better of two responses for the same input. In safety classification, the judge labels content as safe or unsafe. In feedback evaluation, the judge both selects a winner and generates feedback, which we compare against reference judgments.
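For concreteness, the two pointwise metrics referenced in the methodology (exact match and within-±1 accuracy) can be sketched as follows. These are hypothetical helpers illustrating the metric definitions, not our actual harness code; they assume integer 0-4 scores.

```python
def exact_match(preds: list[int], labels: list[int]) -> float:
    """Fraction of judge scores exactly equal to the human label."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def within_one(preds: list[int], labels: list[int]) -> float:
    """Fraction of judge scores within +/-1 of the human label,
    a looser metric that tolerates adjacent-score disagreement."""
    return sum(abs(p - l) <= 1 for p, l in zip(preds, labels)) / len(labels)

# Judge scored [3, 2, 4]; human labels were [3, 2, 2]
exact_match([3, 2, 4], [3, 2, 2])   # 2 of 3 exact
within_one([3, 2, 4], [3, 2, 2])    # |4 - 2| = 2 still misses
```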
We used publicly available evaluation datasets: a 200-example HelpSteer3-derived multi-dimensional set for pointwise scoring (general-assistant responses rated on helpfulness, correctness, coherence, complexity, and verbosity), XSTest for safety classification (n=450; safe/unsafe prompts designed to probe exaggerated safety behavior), and LMSYS MT-Bench Human Judgments for feedback evaluation (n=500; multi-domain pairwise conversations with human preference judgments). For pairwise ranking, we used a combined 620-example set spanning LiveCodeBench, LiveBench, and MMLU-Pro.
For each model-task combination, we ran 100-500 examples per configuration and compared performance across five prompt-placement configurations. We intentionally chose ~100 examples per configuration as a cost/time trade-off to cover more models and tasks, accepting lower statistical power (discussed in Limitations).
| Configuration | Name | System Message | User Message |
|---|---|---|---|
| A | All User | Empty | Full instructions + rubric + context |
| B | Minimal System | Generic role framing | Full rubric + context |
| C | Role System | Role definition + task framing | Full rubric + context |
| D | Instruction System | Most instructions + rubric | Context only |
| E | Full System | Nearly all instructions + rubric | Minimal task input |
Some providers require at least one user message, so all configurations included one. When most instructions moved to system, the user message was reduced to the minimum required task input.
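In implementation terms, the configurations differ only in how the same strings are distributed across chat messages. A minimal sketch of what Configurations A, B, and E look like as OpenAI-style message lists (the helper and prompt strings are hypothetical placeholders, not our actual harness):

```python
# Hypothetical prompt fragments; real rubrics and context are much longer.
ROLE = "You are an impartial judge evaluating AI responses."
RUBRIC = "Score the response 0-4 using the following rubric: ..."
CONTEXT = "Input: {input}\nResponse: {response}"

def build_messages(config: str) -> list[dict]:
    """Map one prompt-placement configuration onto a chat message list."""
    if config == "A":  # All User: no system prompt at all
        return [{"role": "user",
                 "content": f"{ROLE}\n\n{RUBRIC}\n\n{CONTEXT}"}]
    if config == "B":  # Minimal System: generic role framing only
        return [{"role": "system", "content": ROLE},
                {"role": "user", "content": f"{RUBRIC}\n\n{CONTEXT}"}]
    if config == "E":  # Full System: user carries the bare task input
        return [{"role": "system", "content": f"{ROLE}\n\n{RUBRIC}"},
                {"role": "user", "content": CONTEXT}]
    raise ValueError(f"unknown config: {config}")
```

Note that every configuration, including E, keeps a non-empty user message, matching the provider constraint described above.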
Methodology notes: All runs used temperature 0. This produces a single deterministic output per example, so we're reporting point estimates rather than averages over repeated stochastic runs; in principle, re-running each configuration multiple times at t>0 and averaging would yield tighter estimates (though often impractical at scale). We validated this choice by re-running a subset of pointwise evaluations at temperatures 0.5 and 1.0 on four models (GPT-4o mini (gpt-4o-mini), Claude 3.5 Haiku (claude-3-5-haiku-20241022), Llama 3.3 70B (llama-3.3-70b-versatile), and Gemini 2.5 Flash (gemini-2.5-flash)). Results were consistent across all models: exact-match variance stayed within a few percentage points, and within-±1 accuracy remained stable or improved. Experiments ran February 2026 on then-current model versions.
Model naming: We use provider names alongside the exact API model ids used in the experiment (in parentheses).
Model coverage: We tested 10 models total. Not all models appear in every task due to API availability and sample size requirements. Tables below show representative subsets; full results for all models are available in the experiment repository.
Results
We use Configuration A (all instructions in user, no system prompt) as the baseline for all comparisons. Tables show each model's baseline accuracy, the best-performing alternative configuration, and the delta between them. We highlight the two tasks with the most variation (pairwise ranking and safety classification) and summarize the others briefly.
Overall: Which Configuration Wins?

Distribution of best-performing configurations across 30 model-task combinations (13,897 successful parses across included configurations). This count is not unique examples: most comparisons use ~100 examples per configuration, and statistical power is driven by that per-config n. Only configurations with ≥20 successful samples are included. In 10 of 30 pairs, multiple configurations tied for best accuracy; these are assigned to the lower-letter configuration.
Across model-task pairs, Configuration B was consistently competitive (often tied with the top accuracy) and never showed meaningful regressions. Given the high tie rate (33% of pairs) and the absence of statistically significant differences (see below), any apparent ordering among configurations should be treated as descriptive rather than evidence of a true placement effect.
Pairwise Ranking
| Model | Configuration A | Best Configuration | Delta |
|---|---|---|---|
| GPT-5 mini (gpt-5-mini) | 80.4% | C: 87.1% | +6.7 |
| Gemini 3 Flash (preview) (gemini-3-flash-preview) | 73.8% | C: 75.3% | +1.5 |
| Claude Haiku 4.5 (claude-haiku-4-5-20251001) | 63.0% | A/C: 63.0% | 0.0 (E: -5.0) |
| Llama 3.3 70B (llama-3.3-70b-versatile) | 54.0% | A: 54.0% | 0.0 |
| Llama 3.1 8B (llama-3.1-8b-instant) | 47.0% | B: 52.0% | +5.0 |
| GPT-4o mini (gpt-4o-mini) | 47.0% | E: 53.0% | +6.0 |
| Claude 3.5 Haiku (claude-3-5-haiku-20241022) | 45.0% | B: 52.0% | +7.0 |
7 of 10 models shown. GPT-5 (gpt-5) and Gemini 3 Pro (preview) (gemini-3-pro-preview) were not evaluated on pairwise.
GPT-5 mini (gpt-5-mini) showed the largest observed delta (+6.7pp with Configuration C), and Claude 3.5 Haiku (claude-3-5-haiku-20241022) moved from 45% to 52% with a minimal system prompt. However, neither difference reached statistical significance (Fisher's exact test, p=0.25 and p=0.40 respectively). At n=100 per configuration, deltas of this magnitude are consistent with sampling variation. Gemini 3 Flash (preview) (gemini-3-flash-preview) was relatively stable across configurations (~74%).
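To see why deltas of this size are indistinguishable from noise at n=100, the two-sided Fisher's exact test on the 2x2 correct/incorrect table can be computed directly from the hypergeometric distribution. This is a textbook sketch of the standard formula, not our actual analysis code:

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]],
    summing the probabilities of all tables (with the same margins)
    that are no more likely than the observed one."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def p_table(x: int) -> float:
        # Hypergeometric probability of x successes in row 1
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = p_table(a)
    lo = max(0, col1 - row2)
    hi = min(col1, row1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# Two configurations scoring 87 vs 80 correct out of 100 each:
p = fisher_exact_two_sided(87, 13, 80, 20)  # well above 0.05
```

A 7pp gap at this sample size simply does not clear significance, which is why we report the tabled deltas as descriptive.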
Safety Classification
| Model | Configuration A | Best Configuration | Delta |
|---|---|---|---|
| Llama 3.3 70B (llama-3.3-70b-versatile) | 97.0% | All: 97.0% | 0.0 |
| Claude 3.5 Haiku (claude-3-5-haiku-20241022) | 93.0% | A/C: 93.0% | 0.0 (E: -4.0) |
| GPT-4o mini (gpt-4o-mini) | 91.0% | C/E: 93.0% | +2.0 |
| GPT-5 mini (gpt-5-mini) | 90.0% | B: 92.0% | +2.0 |
| Claude Haiku 4.5 (claude-haiku-4-5-20251001) | 88.0% | A/B: 88.0% | 0.0 (E: -3.0) |
| Llama 3.1 8B (llama-3.1-8b-instant) | 84.0% | B/C/D: 85.0% | +1.0 (E: -5.0) |
6 of 10 models shown. Gemini 2.5 Flash (gemini-2.5-flash) excluded due to insufficient samples; GPT-5 (gpt-5) and Gemini 3 Pro (preview) (gemini-3-pro-preview) were not evaluated on safety.
Safety classification was generally stable for stronger models. Llama 3.3 70B (llama-3.3-70b-versatile) held steady at 97% across all configurations. The pattern changed for smaller models: moving nearly all instructions into the system prompt (Configuration E) produced the largest observed drops. Llama 3.1 8B (llama-3.1-8b-instant) fell from 84% to 79%, and Claude 3.5 Haiku (claude-3-5-haiku-20241022) dropped from 93% to 89%. While these regressions are directionally concerning, neither reached statistical significance (p=0.28 and p=0.46 respectively), so we cannot rule out sampling noise at these sample sizes.
Pointwise Scoring and Feedback Evaluation
Pointwise scoring (exact-match against human labels) and feedback evaluation (preference accuracy against reference judgments) both showed minimal placement sensitivity. The largest pointwise delta was GPT-5 mini (gpt-5-mini) gaining +6pp with Configuration C (p=0.43, not significant). All other pointwise and feedback deltas were 1-4pp, well within sampling noise at n=100. No model-configuration pair in either task produced a result distinguishable from the baseline at p < 0.05.
Discussion
The implication for practitioners is that prompt placement is a second-order lever: optimize model choice, rubric quality, and output validation first. As a default, a minimal system message plus detailed user instructions (Configuration B) is a low-risk starting point; if you move substantial instructions into system for caching or governance reasons (Configurations D/E), validate explicitly, especially on safety tasks and smaller models. If you need tighter estimates than these point samples provide, increase n and/or repeat each configuration at t>0 and average.
Practical Recommendations
Quick Reference
| Model Size | Recommended Default | Avoid |
|---|---|---|
| Large (70B+) | Any config works | - |
| Medium (8-70B) | B (minimal system) | E (full-system) |
| Small (<8B) | A or B | E (4-6pp drop risk) |
Details
Use Configuration B as your default starting point: it never caused meaningful regressions and performed at or near the top across tasks in our tests. But since the spread across configurations is modest, test a few options for your specific model and task. Keep the system prompt concise (role definition and output schema) and put scenario context, examples, and task-specific evidence in user templates.
Treat Configuration E as a high-variance option. Validate it explicitly before adopting full-system prompts, especially for smaller models or safety-critical workloads. In production, enforce output-format checks so regressions are detected quickly.
Limitations
Statistical power: With 100 samples per configuration, we are underpowered to detect effects smaller than ~15pp at p < 0.05; a power analysis suggests ~2,000 samples per configuration would be needed to reliably detect 3pp effects. Our null results therefore do not prove placement has zero effect, only that any effect is too small to detect at this scale. That bound is itself informative for practitioners deciding where to invest optimization effort.
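Both sample-size figures can be reproduced with the standard two-proportion z-test formula. The sketch below assumes a mid-80s baseline accuracy, two-sided α = 0.05, and 80% power; those specific assumptions are ours for illustration and are not pinned down in the analysis above.

```python
from math import sqrt
from statistics import NormalDist

def n_per_group(p1: float, p2: float,
                alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate per-group sample size for a two-sided two-proportion
    z-test to detect accuracies p1 vs p2."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, ~1.96
    z_b = NormalDist().inv_cdf(power)           # power term, ~0.84
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p1 - p2) ** 2

n_per_group(0.85, 0.88)  # 3pp effect: ~2,000 per configuration
n_per_group(0.75, 0.90)  # 15pp effect: ~100 per configuration
```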
Token/cost implications: Full-system configurations (D, E) place more tokens in the cached system prompt portion, which can reduce per-request costs with providers that offer prompt caching. We did not measure this effect, but it's worth considering for high-volume deployments.
Task dependency: Safety, pairwise, and pointwise tasks do not respond identically to placement changes, so findings from one task type may not transfer. These results also reflect current model versions and may drift with future updates.
Conclusion and Future Work
Across 34 model-task comparisons, no prompt placement configuration produced a statistically significant improvement over the all-user baseline (Fisher's exact test, p < 0.05). The largest observed deltas (6-7pp) had p-values of 0.25-0.43, well within the range of sampling noise at n=100. This absence of significant effects is itself the finding: if placement effects exist, they are small enough to be practically irrelevant compared to model selection, which produces accuracy gaps of up to 20pp on the same task.
What would change these findings?
Longer system prompts. Our configurations ranged from empty to ~90% system. Very long system prompts (10k+ tokens) might interact differently with context caching and attention patterns.
Multi-turn judge scenarios. We tested single-turn judgments. Multi-turn evaluation conversations, where the judge asks clarifying questions, might benefit more from system-level behavioral grounding.
Newer model versions. As providers update instruction hierarchy training, sensitivity to prompt placement may shift. Re-validating periodically is worthwhile.
Higher-stakes tasks. Safety classification showed the clearest model-size effects. Critical applications (content moderation, compliance) may warrant more defensive prompt structures regardless of accuracy deltas.
For teams building LLM-as-a-judge pipelines today: prioritize model selection and prompt content, and start from Configuration B, which never caused meaningful regressions and performed at or near the top across tasks in our tests.
If you're building an LLM judge pipeline in production, Orq offers tooling for running evals and monitoring judge quality over time; get started here.

