Caliper is designed to auto-instrument LLM calls in Python. It monkey-patches the OpenAI and Anthropic SDKs (with plans to add LiteLLM so you can use any provider you want), so it's almost completely invisible to you as the developer, and for basic metrics it slots in as a single init() call at startup.
It can also gather custom metadata about a call, both pre- and post-request; this can be any key-value pairs you want.
```python
import caliper
import anthropic
caliper.init(target="s3")  # All that's required for basic observability; no changes to LLM calls needed for basic metrics

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=64,
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    caliper_metadata={"campaign": "q4"},  # Pre-request metadata
)
print(response.content[0].text)

caliper.annotate(sentiment="positive")  # Post-request metadata
```
You can use this to track the effectiveness of model changes across different user tiers. Maybe your free-tier users don't notice if you use a cheaper model, but your paying users do? How do you know whether a recent system prompt change was effective? Track the prompt version in metadata and compare post-request rating annotations between prompt versions.
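As a sketch of that comparison, here's how you might aggregate exported records by prompt version. The field names (`prompt_version`, `rating`) are hypothetical examples of metadata/annotation keys you'd choose yourself, not part of Caliper's schema:

```python
from collections import defaultdict

# Hypothetical exported records: each call carries its pre-request
# metadata (prompt_version) and a post-request rating annotation.
records = [
    {"prompt_version": "v1", "rating": 3},
    {"prompt_version": "v1", "rating": 4},
    {"prompt_version": "v2", "rating": 5},
]

totals = defaultdict(lambda: [0, 0])  # version -> [rating sum, count]
for r in records:
    bucket = totals[r["prompt_version"]]
    bucket[0] += r["rating"]
    bucket[1] += 1

# Average rating per prompt version
averages = {v: s / n for v, (s, n) in totals.items()}
print(averages)
```

In practice you'd run this kind of aggregation over the exported batches rather than in-process, but the idea is the same: tag each call with the version, then compare the annotations downstream.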
It has a dev mode which logs locally, and it can also send files to S3. The SDK has a background queue and worker that flushes in batches, configurable by both batch size and time between flushes. It exports to S3 as batched JSON files, ready to integrate into most data engineering pipelines, or you can query them directly with a tool like DuckDB.
PyPi: https://pypi.org/project/caliper-sdk/