Hey HN! Creator here. I recently found myself writing evaluations for actual production LLM projects, and kept facing the same dilemma: either reinvent the wheel or use a heavyweight commercial system with tons of features I don't need right now.
Then it hit me - evaluations are just (kind of) tests, so why not write them as such using pytest?
That's why I created pytest-evals - a lightweight pytest plugin for building evaluations. It's intentionally not a sophisticated system with dashboards (and not suitable as a "robust" solution). It's minimalistic, focused, and definitely not trying to be a startup.
# Predict the LLM output for each case
@pytest.mark.eval(name="my_classifier")
@pytest.mark.parametrize("case", TEST_DATA)
def test_classifier(case: dict, eval_bag, classifier):
    # Run predictions and store results
    eval_bag.prediction = classifier(case["Input Text"])
    eval_bag.expected = case["Expected Classification"]
    eval_bag.accuracy = eval_bag.prediction == eval_bag.expected

# Now let's see how our app is performing across all cases...
@pytest.mark.eval_analysis(name="my_classifier")
def test_analysis(eval_results):
    accuracy = sum([result.accuracy for result in eval_results]) / len(eval_results)
    print(f"Accuracy: {accuracy:.2%}")
    assert accuracy >= 0.7  # Ensure our performance is not degrading
Would love to hear your thoughts, and if you find this useful, a GitHub star would be appreciated.
The pytest-evals README mentions that it's built on pytest-harvest, which works with pytest-xdist and pytest-asyncio.
pytest-harvest: https://smarie.github.io/python-pytest-harvest/ :
> Store data created during your pytest tests execution, and retrieve it at the end of the session, e.g. for applicative benchmarking purposes
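For anyone unfamiliar, a minimal sketch of that pattern (results_bag and module_results_df are pytest-harvest's documented fixtures; the tests themselves are made up for illustration, and module_results_df needs pandas installed):

# results_bag comes from pytest-harvest: anything you set on it is stored per test
def test_scoring(results_bag):
    results_bag.score = 0.9

# module_results_df (also from pytest-harvest) gathers the stored values into a
# pandas DataFrame; declare it last so it runs after the tests above
def test_synthesis(module_results_df):
    print(module_results_df)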
Yeah, pytest-harvest is a pretty cool plugin.
Originally I had a (very large and unfriendly) conftest file, but it was hard to collaborate on with other team members and quite repetitive. So I wrapped it up as a plugin, added some more functionality, and that's it.
This plugin wraps the boilerplate in a way that's easy to use, especially for the eval use case. It's minimalistic by design. Nothing big or fancy.
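For context, here's a rough, hypothetical sketch of the kind of hand-rolled conftest I mean: collecting per-test results into a shared store with plain pytest fixtures and reading them back in an analysis test. It's only an illustration of the pattern, not how pytest-evals is implemented internally (the real eval_bag supports attribute access; a plain dict keeps the sketch short).

# conftest.py (hypothetical hand-rolled version, for illustration only)
import pytest

_RESULTS = []

@pytest.fixture
def eval_bag():
    # one mutable bag per test; the test fills it with whatever it wants to record
    bag = {}
    _RESULTS.append(bag)
    return bag

@pytest.fixture
def eval_results():
    # all bags collected so far; only meaningful in a test that runs after the others
    return _RESULTS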