Open-source metrics that are paper-backed, reproducible, and language-agnostic. Evaluate, protect, and improve AI behaviour with open, auditable tools.
“An evaluation metric is only as useful as the evidence behind it.”
The metric's definition, assumptions, and validation basis are directly connected to published research.
Every score can be independently verified, reproduced, and confidently cited in your own work.
See the methodological strength behind every number instead of trusting a polished dashboard.
Many AI teams work with dashboards full of scores (faithfulness 0.87, toxicity 0.03, bias 0.12) without enough context to understand what those numbers truly represent or how much confidence they deserve.
Gaussia was built to close that gap. By requiring every metric to be grounded in verifiable scientific evidence, it turns evaluation outputs into claims that are clearer, more reproducible, and more trustworthy.
The current evaluation ecosystem has four problems no single tool solves on its own, and all four point back to the same two pillars: paper-first and community-first.
Quality, safety, and ethics live in separate libraries with incompatible conventions.
A single, extensible library that houses all three concerns under one scientific contract.
Scores are just numbers; the originating paper, methodology, and validation data are rarely exposed.
Every metric ships with the paper title, authors, year, DOI/arXiv link, and a ready-to-cite BibTeX entry.
Metric logic is tied to a specific runtime or stack, making it hard to apply consistently across environments.
The metric definition lives in the scientific source; implementations follow the same spec across languages.
Evaluation assumes the target is an “AI system” rather than the observable behaviour.
Gaussia's modules evaluate any behaviour — model output, human response, or hybrid interaction — by the same criteria.
Every metric in Gaussia is provably linked to a peer-reviewed source, and the link is immutable and visible to every user.
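One way to picture a score that carries its source is a small immutable record. The sketch below is illustrative only: `PaperReference`, `Score`, their field names, and the `bibtex` helper are hypothetical and are not taken from Gaussia's SDKs, and the paper data is a placeholder rather than a real citation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PaperReference:
    """Hypothetical provenance record; field names are assumptions, not Gaussia's schema."""
    title: str
    authors: Tuple[str, ...]
    year: int
    link: str  # DOI or arXiv URL

    def bibtex(self, key: str) -> str:
        """Render a minimal @article entry from the stored fields."""
        fields = {
            "title": self.title,
            "author": " and ".join(self.authors),
            "year": str(self.year),
            "url": self.link,
        }
        body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
        return f"@article{{{key},\n{body}\n}}"


@dataclass(frozen=True)
class Score:
    """A metric value that always travels with its source reference."""
    metric: str
    value: float
    source: PaperReference


# Placeholder data only: this is not a real paper.
ref = PaperReference(
    title="An Example Metric Paper",
    authors=("A. Author", "B. Author"),
    year=2020,
    link="https://example.org/placeholder",
)
score = Score(metric="example_metric", value=0.87, source=ref)
print(score.source.bibtex("example2020"))
```

Freezing both dataclasses means the reference cannot be reassigned after the score is created, which is one simple way to approximate an immutable link in application code.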
Anyone can open a Discussion in the Proposals category, cite the peer-reviewed work, and explain the problem the metric solves.
Reviewers examine the proposal publicly — the debate is visible, dissent is recorded, decisions are traceable.
The metric can be implemented in any language directly from the paper. The code must include a metadata block mapping the implementation to the methodology.
After two reviewer approvals the PR is merged. The paper joins the official framework and authors are credited in the code and citation list.
All steps are public, version-controlled, and require no vendor approval. Self-hosted usage runs with zero outbound telemetry.
The methodology always precedes the code. A paper becomes a conversation, a conversation becomes a formal RFC, and only then does it become an implementation — every step in the open, every step traceable.
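The ordering described above can be sketched as a small state machine. This is a rough model, not Gaussia's tooling: the `Stage` names and the `advance` function are hypothetical, and placing the two-approval check at the step from RFC to implementation is an assumption about where the gate sits.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical lifecycle stages, in the order the text describes."""
    PAPER = 1
    DISCUSSION = 2
    RFC = 3
    IMPLEMENTATION = 4


REQUIRED_APPROVALS = 2  # from the text: merged after two reviewer approvals


def advance(current: Stage, approvals: int = 0) -> Stage:
    """Move one stage forward; stages cannot be skipped."""
    if current is Stage.IMPLEMENTATION:
        raise ValueError("already at the final stage")
    # Assumption: the approval gate applies when an RFC becomes an implementation.
    if current is Stage.RFC and approvals < REQUIRED_APPROVALS:
        raise PermissionError(
            f"need {REQUIRED_APPROVALS} approvals, got {approvals}"
        )
    return Stage(current + 1)


stage = Stage.PAPER
stage = advance(stage)               # paper becomes a conversation
stage = advance(stage)               # conversation becomes a formal RFC
stage = advance(stage, approvals=2)  # RFC becomes an implementation
print(stage.name)
```

Because `advance` only ever moves one step, code can never appear before the RFC exists, which mirrors the methodology-first rule.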
Every metric page lists the paper reference, the validation dataset, and a one-line SDK call. Pick your module; the lineage ships with every score.
How good is the output?
Is it safe and fair?
How can I make it better?
All SDKs are MIT-licensed, install locally, and run without any outbound telemetry.
All contributions stay under the MIT licence. Reviewers receive permanent citation credit. The debate is public, and the record is immutable.
Open a Discussion in Proposals. Include the paper citation, problem statement, and target SDKs.
Fork the repo, copy template/ to papers/YYYY-MM-your-title/, write in LaTeX, add figures and references.bib.
Target main, fill the PR template, link the original discussion, and ensure the paper compiles.
At least two reviewers approve on novelty, soundness, clarity, and feasibility.
An implementation issue opens in the relevant SDK repo; authors are credited in code and citation list.
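The scaffolding step (copying `template/` to `papers/YYYY-MM-your-title/`) could be scripted roughly as follows. The helper names `paper_dir_name` and `scaffold_paper` are hypothetical, and the lowercase, hyphen-separated slug is an assumption about how "your-title" should be written; the repository may expect a different convention.

```python
import re
import shutil
from datetime import date
from pathlib import Path


def paper_dir_name(title: str, when: date) -> str:
    """Build a 'YYYY-MM-your-title' directory name (slug style is an assumption)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{when:%Y-%m}-{slug}"


def scaffold_paper(repo_root: Path, title: str, when: date) -> Path:
    """Copy template/ to papers/<name>/ inside a local checkout of the fork."""
    target = repo_root / "papers" / paper_dir_name(title, when)
    # copytree raises FileExistsError if the target already exists,
    # so an existing paper directory is never overwritten.
    shutil.copytree(repo_root / "template", target)
    return target


print(paper_dir_name("My New Metric", date(2025, 3, 1)))
```

Calling `scaffold_paper(Path("."), "My New Metric", date.today())` from the root of the fork would create the directory, ready for the LaTeX source, figures, and `references.bib`.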