Observability & Evaluation

12 open-source agents tracked · Ranked by trust score · Updated 2026-05-30 20:03 UTC

HVTracker independently evaluates 12 open-source observability & evaluation agents using daily signals from GitHub, package registries, and security databases. Each agent is scored on activity, adoption, transparency, safety, and identity. The top-ranked agent is Weights & Biases Weave, with a trust score of 84.9/100 (Grade B). Other leading projects include Langfuse and Promptfoo.
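As a rough illustration of how five dimension scores could be folded into a single 0-100 trust score, here is a minimal sketch. The equal weights, the 0-100 scale for each dimension, and the names `DIMENSIONS`, `WEIGHTS`, and `trust_score` are all assumptions made for this example; the page does not state how HVTracker actually weights or combines its signals.

```python
# Hypothetical sketch: a weighted average over the five scored dimensions.
# The weights and the 0-100 per-dimension scale are assumptions, not
# HVTracker's published method.
DIMENSIONS = ("activity", "adoption", "transparency", "safety", "identity")

# Assumed equal weighting (0.2 each); swap in other weights that sum to 1.
WEIGHTS = {name: 1 / len(DIMENSIONS) for name in DIMENSIONS}


def trust_score(signals: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-100) into one 0-100 score."""
    missing = set(DIMENSIONS) - signals.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        if not 0 <= signals[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return sum(WEIGHTS[name] * signals[name] for name in DIMENSIONS)


# Made-up inputs for demonstration only; not data from any tracked agent.
example = {
    "activity": 90.0,
    "adoption": 70.0,
    "transparency": 80.0,
    "safety": 85.0,
    "identity": 75.0,
}
print(round(trust_score(example), 1))
```

With equal weights the result is simply the mean of the five inputs, so a weak score on one dimension pulls the total down by a fifth of the shortfall.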

Agents: 12 · Total Stars: 118.4k

#	Agent	Grade	Description	Trust	Stars	Language
1	Weights & Biases Weave	B	Weave is a toolkit for developing AI-powered applications, built by Weights & Biases.	84.9	1.1k	Python
2	Langfuse	B	🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Inte	74.7	28.2k	TypeScript
3	Promptfoo	B	Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, C	73.8	21.7k	TypeScript
4	Arize Phoenix	B	AI Observability & Evaluation	73.2	9.9k	Python
5	Giskard	B	🐢 Open-Source Evaluation & Testing library for LLM Agents	70.0	5.4k	Python
6	LangWatch	B	The platform for LLM evaluations and AI agent testing	68.0	3.3k	TypeScript
7	DeepEval	B	The LLM Evaluation Framework	61.9	15.8k	Python
8	Helicone	C	🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓	60.3	5.8k	TypeScript
9	Ragas	B	Supercharge Your LLM Application Evaluations 🚀	53.4	14.1k	Python
10	TruLens	C	Evaluation and Tracking for LLM Experiments and AI Agents	39.5	3.4k	Python
11	Agenta	C	The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one	38.6	4.2k	TypeScript
12	AgentOps	C	Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frame	32.2	5.6k	Python
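
The summary figures for this category can be recomputed from the table rows. The sketch below copies the grade, trust, and star values from the table and derives the agent count, mean trust score, total stars, and number of Grade A agents. The names `ROWS` and `summarize` are illustrative, and because the star counts are rounded to the nearest 0.1k per row, the recomputed total can differ slightly from the 118.4k headline figure.

```python
from collections import Counter
from statistics import mean

# (agent, grade, trust score, stars in thousands), copied from the table above.
ROWS = [
    ("Weights & Biases Weave", "B", 84.9, 1.1),
    ("Langfuse", "B", 74.7, 28.2),
    ("Promptfoo", "B", 73.8, 21.7),
    ("Arize Phoenix", "B", 73.2, 9.9),
    ("Giskard", "B", 70.0, 5.4),
    ("LangWatch", "B", 68.0, 3.3),
    ("DeepEval", "B", 61.9, 15.8),
    ("Helicone", "C", 60.3, 5.8),
    ("Ragas", "B", 53.4, 14.1),
    ("TruLens", "C", 39.5, 3.4),
    ("Agenta", "C", 38.6, 4.2),
    ("AgentOps", "C", 32.2, 5.6),
]


def summarize(rows):
    """Aggregate the per-agent rows into category-level summary figures."""
    grades = Counter(grade for _, grade, _, _ in rows)
    return {
        "agents": len(rows),
        "avg_trust": mean(trust for _, _, trust, _ in rows),
        "total_stars_k": sum(stars for _, _, _, stars in rows),
        "grade_counts": dict(grades),
        "grade_a": grades["A"],  # Counter returns 0 for grades not present
    }


summary = summarize(ROWS)
for key, value in summary.items():
    print(f"{key}: {value}")
```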