Eval Frameworks

Build systematic evaluations, not guesswork

Define custom evaluation metrics tailored to your LLM application.

Run evals on production data, catch regressions early, and iterate confidently with structured testing instead of ship-and-see.

Try LangSmith free. No credit card required.

[Image: LangSmith evaluation framework interface with test metrics]

How LangSmith evaluation frameworks work

1. Define your evaluators

Build custom evals with Python, LLM-as-judge, or human feedback. Target the metrics your product actually cares about—accuracy, latency, cost, safety, or domain-specific requirements.
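As a minimal sketch of the code path, assuming the langsmith Python SDK: an evaluator is just a function that takes a run and its reference example and returns a named score (the "answer" keys below are assumptions about your app's schema):

```python
from langsmith.schemas import Example, Run

def exact_match(run: Run, example: Example) -> dict:
    # Compare the model's output to the dataset's reference answer.
    # The "answer" key is an assumption about your application's output schema.
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return {"key": "exact_match", "score": int(predicted.strip() == expected.strip())}
```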

2. Run evals on your data

Test offline on datasets or online against production traffic. Automatically surface regressions and compare performance across prompt versions and model changes.
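A hedged sketch of an offline run using the SDK's evaluate entry point; the dataset name, target stub, and evaluator below are placeholders:

```python
from langsmith.evaluation import evaluate

def correctness(run, example) -> dict:
    # Exact-match evaluator; the "answer" keys are schema assumptions.
    out = (run.outputs or {}).get("answer", "")
    ref = (example.outputs or {}).get("answer", "")
    return {"key": "correctness", "score": int(out == ref)}

def target(inputs: dict) -> dict:
    # Stand-in for your application; replace with your real chain or model call.
    return {"answer": inputs["question"].strip()}

results = evaluate(
    target,
    data="prod-regression-set",     # hypothetical dataset name
    evaluators=[correctness],
    experiment_prefix="prompt-v2",  # groups runs so versions can be compared
)
```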

3. Ship with confidence

Gate deployments on evaluation thresholds. Close the feedback loop from production data to training datasets, turning real usage into continuous improvement.
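Continuing the evaluate(...) sketch above, one way to gate a deploy: aggregate the experiment's scores and fail CI below a bar. This assumes evaluate's return value can be iterated for per-run feedback; the 90% threshold is illustrative.

```python
import sys

# `results` comes from the evaluate(...) call above; iterating yields one
# entry per example, each carrying the evaluator feedback for that run.
scores = [
    res.score
    for row in results
    for res in row["evaluation_results"]["results"]
    if res.key == "correctness"
]
mean = sum(scores) / max(len(scores), 1)
print(f"correctness: {mean:.1%} across {len(scores)} examples")
if mean < 0.90:  # illustrative threshold: block the deploy below 90%
    sys.exit(1)
```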

LangSmith powers top engineering teams, from AI startups to global enterprises

Zip
Writer
Harvey
Vanta
Abridge
Clay
Rippling
Mercor
Listen Labs
dbt Labs
Klarna
Headspace
Lyft
Coinbase
Rakuten
LinkedIn
Elastic
Workday
Monday.com

Evaluation at Scale

Leading AI teams trust LangSmith for evaluation and quality assurance

50M+ LLM calls traced
1B+ events ingested per day
100K+ monthly active orgs in LangSmith SaaS

LangSmith Evaluation Framework Platform

Define, execute, and scale evaluations for continuous quality improvement

LangSmith captures complete execution traces from your LLM applications. These traces become the source of truth for your evaluation datasets, so you're always testing against real production patterns.
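A sketch of that loop, assuming the SDK's Client and @traceable decorator; the project and dataset names are hypothetical:

```python
from langsmith import Client, traceable

@traceable  # records inputs, outputs, and timing of each call as a trace
def answer(question: str) -> str:
    return "stub"  # replace with your real application logic

# Later: promote real production runs into an evaluation dataset.
client = Client()  # reads LANGSMITH_API_KEY from the environment
dataset = client.create_dataset("prod-regression-set")
for run in client.list_runs(project_name="my-prod-app", is_root=True, limit=50):
    client.create_example(inputs=run.inputs, outputs=run.outputs, dataset_id=dataset.id)
```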

Connect with our team to see how
[Image: LangSmith trace interface showing LLM call details]

Built for Enterprise

Security and compliance at scale

LangSmith meets the demanding security, performance, and collaboration requirements of large organizations building AI applications at scale.


Granular permissions

Role-based access control with org-level permissions and project isolation to meet your security and compliance requirements.


SOC 2 Type II

Independent third-party audit and certification of our security controls.

Trust center

Self-hosted deployment

Self-hosting options to maintain full control over your AI data and meet strict compliance requirements.

Why top AI teams choose LangSmith for evaluation

Define what matters to you

Create custom evaluators with code, LLM-as-judge, or human feedback. Measure accuracy, latency, cost, safety—whatever your product requires.
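For the LLM-as-judge flavor, a hedged sketch in which a second model grades each output; the grader model, prompt, and key names are all illustrative:

```python
import json
from openai import OpenAI
from langsmith.schemas import Example, Run

judge = OpenAI()

def helpfulness(run: Run, example: Example) -> dict:
    # Ask a grader model for a 0.0-1.0 score; key names are assumptions.
    prompt = (
        "Rate how well the answer addresses the question on a 0.0-1.0 scale.\n"
        f"Question: {example.inputs.get('question')}\n"
        f"Answer: {(run.outputs or {}).get('answer')}\n"
        'Respond with JSON, e.g. {"score": 0.8}.'
    )
    resp = judge.chat.completions.create(
        model="gpt-4o-mini",  # illustrative grader model
        messages=[{"role": "user", "content": prompt}],
    )
    return {"key": "helpfulness", "score": json.loads(resp.choices[0].message.content)["score"]}
```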

Catch regressions before production

Run evals offline on test datasets or online against live traffic. Surface issues automatically so you ship with confidence.
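One regression-testing pattern, sketched with placeholders: run the same dataset and evaluators once per prompt version, so the resulting experiments line up for side-by-side comparison.

```python
from langsmith.evaluation import evaluate

PROMPTS = {"prompt-v1": "Answer briefly: ", "prompt-v2": "Answer step by step: "}

def call_model(prompt: str) -> str:
    return "stub"  # replace with a real model call

def correctness(run, example) -> dict:
    out = (run.outputs or {}).get("answer", "")
    ref = (example.outputs or {}).get("answer", "")
    return {"key": "correctness", "score": int(out == ref)}

for name, prompt in PROMPTS.items():
    evaluate(
        lambda inputs, p=prompt: {"answer": call_model(p + inputs["question"])},
        data="prod-regression-set",  # hypothetical dataset name
        evaluators=[correctness],
        experiment_prefix=name,      # experiments pair up for comparison
    )
```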

Works with any framework

LangSmith evaluation works with OpenAI, Anthropic, open-source models, or custom implementations. Bring any LLM stack.
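A sketch of provider-agnostic tracing, assuming the SDK's OpenAI wrapper; any other client or custom function can be instrumented the same way with @traceable:

```python
from openai import OpenAI
from langsmith.wrappers import wrap_openai

client = wrap_openai(OpenAI())  # calls through this client are traced to LangSmith

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # any model behind an OpenAI-compatible endpoint works
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```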

Customers

Elastic

"Working with LangSmith on the Elastic AI Assistant had a significant positive impact on the overall pace and quality of our development and shipping experience. We couldn't have delivered the product experience our customers now have without LangSmith—and we couldn't have done it at the same pace without it."

James Spiteri, Director of Security Product Management at Elastic

Read case study
Rakuten

"What we really needed was a more structured way to test new approaches, something better than just shipping and seeing what happened. LangSmith gave us a scientific, structured way to understand what was actually working. We could run pairwise evaluations and understand why accuracy jumped from 70% to 80%. Our engineers love the intuitive debugging experience."

Yusuke Kaji, General Manager of AI for Business Development at Rakuten

Read case study

Get a Demo of LangSmith Evals

Learn how to build evaluation frameworks that catch issues early and keep your LLM applications performing at their best.