Research

From experiments to insights: Orq.ai research hub

Three red-teaming frameworks, one judge panel, 5,624 attacks: what we learned

What 5,624 adversarial attacks across Evaluatorq, PromptFoo, and DeepTeam tell us about attack quality, judge calibration, and where framework choice actually matters.

Weak judges, strong panel: an ensemble approach to LLM eval

Your eval pipeline has a judge. That judge has biases. Adding two more judges fixes that, but only if they disagree on the right things. If they agree on everything, you bought one verdict three times. Here's how to pick a panel that earns its cost.

Copy-Pasting Your Prompt Twice Makes LLMs Smarter. We Tested 13 Models to Confirm It.

The cheapest accuracy upgrade in LLM engineering still works on GPT-5.4, Claude Sonnet 4.6, and Gemini 3 Pro. Here is what 334 runs taught us.

System Prompt Placement for LLM-as-a-Judge

Every team building an LLM-as-a-judge pipeline faces the same question: does it matter where you put your instructions? We tested it across models to get a real answer.

Prompt Optimization: How to Make Smaller Models Punch Above Their Weight

We explored using prompt optimization to make cheaper, smaller language models perform as well as expensive, more capable ones. We achieved up to 4x performance improvements for trace classification, while learning important lessons about overfitting along the way.

Can a 14B Model Match a 100B+ Model? We Fine-Tuned 8+ Models to Find Out

Key takeaways from fine-tuning 8+ language models on a text classification task, from tiny 0.6B models to 14B behemoths.

Create an account and start building today.
