AI Strategy · Jun 1, 2026 · 13 min read · Updated Jul 6, 2026

LLM Cost Tracking With Langfuse and Finout: AI Gross Margin

You can watch an LLM feature work and still have no idea what it costs you per customer. LLM Ops is two jobs, not one: Langfuse tells you what every model call did and what it cost, and Finout drops that cost into the same bill as your cloud, allocated per team, product, and customer. Wire them together and an AI feature stops being a mystery line on the OpenAI invoice and starts being a P&L you can defend.

Oshri Cohen

Chief Product & Technology Officer


You shipped an AI feature, and it works. You can watch it answer questions, call tools, draft the email, summarize the ticket. What you cannot do, what almost no team can do on the day they ship, is answer the one question your CFO will eventually ask: what does this thing cost us per customer, and are we still making money on it? Your AWS bill breaks down by service, by team, by environment. Your LLM spend arrives as a single, undifferentiated OpenAI or Anthropic invoice with a big number on it and no way to tell whose feature, whose customer, or whose runaway prompt produced it.

That blind spot is not a billing problem. It's an observability problem wearing a finance costume. You can't allocate a cost you never measured, and you can't measure an LLM call you never traced. So LLM Ops done properly splits into two jobs. The first is visibility: what did every model call do, was it any good, and what did it cost? The second is accountability: whose budget does that cost belong to, which product, which customer, what margin? That's two tools, with Langfuse owning the first and Finout the second. The entire game is the bridge between them, and the metadata discipline that makes the bridge worth building.

What Langfuse actually gives you

Langfuse is an open-source LLM engineering platform: tracing, prompt management, evaluations, and analytics in one place. It is the system of record for everything your AI feature does at runtime. Before you can think about cost allocation, you need to understand the full surface area it captures, because every one of these features is either a measurement you'll bill against or a hook you'll allocate by.

Tracing and observability, the unit of truth

The core primitive is the trace: a hierarchical record of a single request through your application. Inside it sit spans (your own steps, retrieval, business logic) and generations (the actual model calls). Each generation captures the model name, the full input and output, latency, and token usage broken into input and output. This is the difference between "the feature feels slow and expensive" and "this one retrieval step fans out into nine model calls, and call number six is 80% of the latency." You can't optimize, and you certainly can't cost-account, what you can't see at the level of the individual call.

Token and cost tracking, the number that matters

Langfuse attaches a cost in USD to every generation, derived from token counts and a built-in pricing table for the common models, and you can define custom prices for fine-tuned, self-hosted, or newly released models it doesn't know yet. This is the raw material for everything downstream. It means cost stops being a monthly surprise on a provider invoice and becomes a property of each individual call, rollable up by anything you've tagged that call with.
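To make the arithmetic concrete, here is a minimal sketch of how a per-call cost falls out of token counts and a price table. The model names and per-million-token prices are made-up placeholders, not Langfuse's pricing table or any provider's real rates, and generation_cost_usd is a hypothetical helper, not a Langfuse function.

```python
from decimal import Decimal

# Hypothetical prices in USD per 1M tokens; placeholders, not real rates.
PRICES_PER_MILLION = {
    "example-large": {"input": Decimal("5.00"), "output": Decimal("15.00")},
    "example-small": {"input": Decimal("0.50"), "output": Decimal("1.50")},
}


def generation_cost_usd(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one model call: tokens times price, summed over both directions."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / Decimal(1_000_000)


# One call with 2,000 input tokens and 500 output tokens.
cost = generation_cost_usd("example-large", input_tokens=2_000, output_tokens=500)
print(cost)
```

Decimal is used here only to keep the sums exact when many small per-call costs are added up; a custom price for a fine-tuned or self-hosted model would simply be another entry in the table.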

Sessions and users, cost with a name on it

Set a user_id and a session_id on your traces and Langfuse will group cost and behavior by end user and by conversation. Now "what does the AI cost" becomes "what does this customer cost," and "this twelve-turn conversation cost forty cents while the median is two cents" becomes a thing you can actually find. For any product where one heavy user can quietly eat the margin of fifty light ones, this is not optional.
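As an illustration of what that grouping buys you, the sketch below rolls hypothetical trace records up by user and by session and flags sessions that cost far more than the median. The record shape and the 5x outlier threshold are assumptions chosen for the example, not something Langfuse prescribes.

```python
from collections import defaultdict
from statistics import median

# Hypothetical trace records: (user_id, session_id, cost in USD).
traces = [
    ("u1", "s1", 0.02), ("u1", "s1", 0.03),
    ("u2", "s2", 0.01),
    ("u3", "s3", 0.20), ("u3", "s3", 0.20),
]


def rollup(records, key_index):
    """Sum cost (third field) grouped by the field at key_index."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)


cost_by_user = rollup(traces, 0)
cost_by_session = rollup(traces, 1)

# Flag sessions costing more than 5x the median session (arbitrary threshold).
typical = median(cost_by_session.values())
outliers = [s for s, c in cost_by_session.items() if c > 5 * typical]
```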

Metadata and tags, the allocation hooks

This is the feature that turns Langfuse from a debugging tool into a financial instrument, and the one teams under-use the most. Every trace can carry arbitrary tags and a metadata object: which feature produced it, which customer or account, which team owns it, which environment it ran in, which pricing tier the user is on. These fields are the seams along which Finout will later cut the bill. If a cost isn't tagged at the moment it's incurred, no downstream system can honestly allocate it. The breakdown is decided here, at instrumentation time, or it isn't decided at all.

Prompt management, because a prompt change is a cost change

Langfuse manages, versions, and serves your prompts, with labels that let you deploy a new version to production without redeploying your application, and aggressive caching so you pay no latency for the privilege. The reason this belongs in a cost conversation: a single "helpful" prompt edit that adds three few-shot examples can double your input tokens on every call. When prompts are versioned and attached to traces, a cost spike has a culprit: you can point at the exact prompt version that changed the economics instead of guessing.
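A quick worked example shows how fast that adds up. All numbers below (a 600-token prompt, three 200-token examples, 100,000 calls a day, 2 USD per million input tokens) are invented for illustration.

```python
def daily_input_cost_usd(prompt_tokens, calls_per_day, usd_per_million_input):
    """Daily spend on input tokens alone for a fixed-size prompt."""
    return prompt_tokens * calls_per_day * usd_per_million_input / 1_000_000


# Prompt version 1: 600 input tokens per call.
before = daily_input_cost_usd(600, calls_per_day=100_000, usd_per_million_input=2.0)

# Prompt version 2: the same prompt plus three few-shot examples of ~200 tokens each.
after = daily_input_cost_usd(600 + 3 * 200, calls_per_day=100_000, usd_per_million_input=2.0)

print(before, after, after / before)
```

Under these assumptions the edit takes input spend from 120 to 240 USD a day, which is exactly the kind of jump a prompt version on the trace lets you attribute.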

Evaluations, the quality side of the ledger

Spending less only counts if the answers stay good, so Langfuse covers the other half: LLM-as-a-judge evaluators, code-based evaluators, user-feedback scores, human annotation and manual labeling, and Datasets you run Experiments against to test changes systematically before they ship. It's the same eval discipline I argue for in testing non-deterministic agents, pointed at the invoice instead of the incident. This matters financially because the optimization question is rarely "how do I spend less". It's "can I move this feature to a cheaper model and hold quality?" You can only answer that when cost and an eval score sit on the same trace.
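Here is one hedged way to phrase the "cheaper model, same quality" decision once cost and eval score are aggregated per model. The Variant type, the 0-to-1 score scale, and the 0.02 allowed quality drop are all assumptions for this sketch; your own evaluators define what "hold quality" means.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    model: str
    mean_cost_usd: float    # average cost per trace
    mean_eval_score: float  # average eval score per trace, assumed 0..1


def can_switch(current: Variant, candidate: Variant, max_quality_drop: float = 0.02) -> bool:
    """True when the candidate is cheaper and its score stays within the allowed drop."""
    cheaper = candidate.mean_cost_usd < current.mean_cost_usd
    holds_quality = candidate.mean_eval_score >= current.mean_eval_score - max_quality_drop
    return cheaper and holds_quality


current = Variant("example-large", mean_cost_usd=0.010, mean_eval_score=0.91)
good_candidate = Variant("example-small", mean_cost_usd=0.002, mean_eval_score=0.90)
weak_candidate = Variant("example-tiny", mean_cost_usd=0.001, mean_eval_score=0.80)
```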

Dashboards, the Metrics API, and the plumbing

On top of all this, Langfuse exposes dashboards and, critically for the integration, a programmatic API. The Daily Metrics API returns per-day aggregates: date, trace and observation counts, total cost, and usage broken down by model, filterable by trace name, user, and tags. The newer Metrics API v2 lets you define the view, metrics, dimensions, and time window of an arbitrary query. There are first-class Python and JS/TS SDKs, native OpenTelemetry support, and, as of June 2026, drop-in integrations for LangChain, the OpenAI SDK, and LiteLLM. And because it's open source and self-hostable, your prompts and traces, often sensitive, can live inside your own perimeter, which your security and finance people will both thank you for.

Traces, spans, generations: the full hierarchical record of every request, with model, input/output, latency, and input/output tokens per call.
Cost in USD per generation: built-in model pricing plus custom price definitions for the models it doesn't know.
Sessions and users: cost and behavior grouped by conversation and by end user.
Tags and metadata: arbitrary dimensions (feature, customer, team, environment, tier) that become your allocation keys.
Prompt management: versioned, label-deployed prompts so a cost change has a named cause.
Evaluations: LLM-as-judge, code evals, human annotation, datasets and experiments, so quality sits on the same trace as cost.
Metrics API and SDKs: Daily Metrics and Metrics API v2, OpenTelemetry-native, with LangChain / OpenAI SDK / LiteLLM integrations.
Open source and self-hostable: keep traces and prompts inside your own perimeter.

Langfuse turns one opaque model invoice into a per-call, per-feature, per-customer ledger. That ledger is invisible to your CFO until it lands in the same place as the rest of the bill.

Instrument for allocation, not just for debugging

Here is the mistake that quietly kills the whole project. Most teams instrument Langfuse to debug, to chase a latency spike or a bad answer, and they tag just enough to do that. Then finance asks for cost per product line and the traces don't carry product line, so the answer is a six-week backfill that never happens. Instrument as if finance is reading every trace, because eventually they are.

Concretely, that means treating a short, boring set of fields as mandatory on every trace your application emits. Decide the vocabulary once, agree with finance on the dimensions they actually allocate by, and enforce it in a thin wrapper around your LLM client so no engineer can forget. The cost of getting this right is a few lines of plumbing. The cost of getting it wrong is that none of the rest works.

feature: the product surface that made the call ("ticket-summary", "sales-copilot"). Your most important allocation key.
customer_id / account_id: who the cost belongs to, for per-customer margin and chargeback.
team: the internal owner, for showback and budget accountability.
environment: prod vs. staging vs. eval, so you never bill a customer for your own test runs.
tier: the pricing plan, so you can see whether free users are eating the token budget of paying ones.
user_id and session_id: set as first-class fields, not buried in metadata, so per-user and per-conversation rollups work natively.
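The "thin wrapper" can be as small as a validation step that refuses to build a trace context without the agreed fields. The sketch below is one possible shape: build_trace_context and MissingAllocationKey are hypothetical names, and how the returned context is attached to a trace depends on the Langfuse SDK version you use, which is deliberately left out.

```python
REQUIRED_METADATA = ("feature", "customer_id", "team", "environment", "tier")


class MissingAllocationKey(ValueError):
    """Raised when a trace would be emitted without a mandatory allocation field."""


def build_trace_context(*, user_id, session_id, **metadata):
    """Validate the mandatory fields and return the context to attach to a trace."""
    missing = [key for key in REQUIRED_METADATA if not metadata.get(key)]
    if not user_id:
        missing.append("user_id")
    if not session_id:
        missing.append("session_id")
    if missing:
        raise MissingAllocationKey("missing allocation fields: " + ", ".join(missing))
    # user_id and session_id stay first-class; everything else travels as metadata.
    return {"user_id": user_id, "session_id": session_id, "metadata": dict(metadata)}


context = build_trace_context(
    user_id="user-42",
    session_id="sess-7",
    feature="ticket-summary",
    customer_id="acme",
    team="support",
    environment="prod",
    tier="pro",
)
```

Failing loudly at call time is a design choice: a rejected request in staging is cheaper than an unallocatable cost in production.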

What Finout does that Langfuse doesn't

Langfuse is exhaustive about your AI and silent about everything else. It has no idea what your EC2, RDS, Kubernetes, or Datadog spend is, and it shouldn't. Finout is the opposite: it's a FinOps platform whose entire job is to pull every cost your company incurs into one unified bill (they call it the MegaBill) and then re-slice it however the business needs. It already ingests cloud and Kubernetes spend, and as of June 2026 it ingests AI provider costs from OpenAI, Anthropic, Bedrock, Vertex, and Azure OpenAI alongside them.

Two Finout capabilities make it the right home for LLM cost. The first is Virtual Tags: a patented allocation layer that lets you assign any cost to a team, product, environment, or customer using whatever metadata is available, retroactively, with no code change. The second is Custom Cost Input: a dead-simple way to push arbitrary cost into the MegaBill via CSV or API, no heavy integration required. The CSV speaks a tiny, fixed schema (a source, a usage date, a service name, a cost in USD, and a free-form metadata object), and that schema is the exact shape of the bridge we're about to build. (CostGuard, Finout's waste-detection layer, then hunts that unified bill for idle and oversized spend, AI included.)

Langfuse measures the AI. Finout makes it a line in the same P&L as your AWS, so "total cost of ownership per feature" stops being a slide and becomes a number.

The bridge: from Langfuse trace to Finout line item

The integration is a small scheduled job (a cron task or a daily Lambda) that does four things. It is the least glamorous and most valuable two hundred lines of code in the whole stack.

1. Pull yesterday's cost from Langfuse, already grouped

Each morning, the job queries Langfuse's Metrics API v2 (or the Daily Metrics API) for the previous day's spend, grouped by the dimensions you committed to at instrumentation time: feature, customer, environment, model. Langfuse hands back, per group, the total cost in USD and the token usage. Because you tagged properly, you get back a tidy breakdown of exactly the cuts finance cares about rather than a single lump sum.
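A sketch of the first step: compute yesterday's UTC window and assemble a query with the four parts a metrics query needs (view, metrics, dimensions, time window). The key names and dimension fields below are assumptions for illustration, not the confirmed request schema; check the Langfuse API reference before sending anything like this. No network call is made here.

```python
from datetime import date, datetime, time, timedelta, timezone


def previous_day_window(today: date):
    """Return the half-open UTC window [start, end) covering the day before today."""
    day = today - timedelta(days=1)
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    return start, start + timedelta(days=1)


def build_daily_cost_query(today: date) -> dict:
    """Assemble an illustrative query; every key name here is an assumption."""
    start, end = previous_day_window(today)
    return {
        "view": "observations",
        "metrics": [{"measure": "totalCost", "aggregation": "sum"}],
        "dimensions": [
            {"field": name}
            for name in ("feature", "customer_id", "environment", "model")
        ],
        "fromTimestamp": start.isoformat(),
        "toTimestamp": end.isoformat(),
    }


query = build_daily_cost_query(date(2026, 7, 1))
```

Pinning the window to UTC midnight keeps the job's notion of "yesterday" stable no matter where or when the scheduler fires.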

2. Shape it into Finout's custom-cost schema

Map each row from the Langfuse response onto Finout's Custom Cost columns. The transform is mechanical: source becomes a constant like "langfuse" so the spend is identifiable in the MegaBill; usage_date is the day you queried; service_name is the feature (or the model, if you want a model-level view); cost is the USD total; and the metadata object carries every allocation key you pulled: feature, customer_id, team, environment, tier, model. That metadata is the whole point: it's what Virtual Tags will read.

source → a fixed label (e.g. "langfuse") so AI spend is traceable to its origin in the bill.
usage_date → the date of the metrics window, so it lands on the right day of the MegaBill.
service_name → the feature or model, your primary grouping in Finout's views.
cost → the Langfuse total cost in USD for that group.
metadata → the key-value allocation object: feature, customer_id, team, environment, tier, model.
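The mapping above can be sketched as a single pure function. The output keys follow the five columns just listed; the input shape (a dict with total_cost_usd plus the allocation keys) is a hypothetical stand-in for whatever your Langfuse query actually returns.

```python
ALLOCATION_KEYS = ("feature", "customer_id", "team", "environment", "tier", "model")


def to_custom_cost_row(group: dict, usage_date: str) -> dict:
    """Map one grouped cost record onto the five custom-cost columns."""
    return {
        "source": "langfuse",                # fixed label identifying the origin
        "usage_date": usage_date,            # the day of the metrics window
        "service_name": group["feature"],    # primary grouping; could be the model instead
        "cost": round(group["total_cost_usd"], 6),
        "metadata": {key: group[key] for key in ALLOCATION_KEYS if key in group},
    }


# Hypothetical grouped record for one feature/customer pair.
group = {
    "feature": "ticket-summary",
    "customer_id": "acme",
    "team": "support",
    "environment": "prod",
    "tier": "pro",
    "model": "example-large",
    "total_cost_usd": 12.3456789,
}
row = to_custom_cost_row(group, "2026-06-30")
```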

3. Push it into the MegaBill

Send the rows to Finout through Custom Cost Input, a CSV upload or the API. Make the job idempotent per date: re-running it for a given day should replace, not duplicate, that day's rows, so a retried Lambda never inflates the bill. That's it for the moving parts. From here on, the LLM cost lives in Finout exactly like any cloud cost, with a date, a service, a dollar amount, and a metadata payload waiting to be allocated.
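To show what "idempotent per date" means in practice, the sketch below uses an in-memory stand-in for the destination that replaces a whole day on every write, plus a CSV serializer for the same rows. InMemoryCostStore is hypothetical; how Finout's API or CSV upload handles replacement is not shown, and encoding the metadata column as JSON is an assumption.

```python
import csv
import io
import json

COLUMNS = ["source", "usage_date", "service_name", "cost", "metadata"]


class InMemoryCostStore:
    """Stand-in destination: writing a day replaces that day's rows entirely."""

    def __init__(self):
        self._rows_by_date = {}

    def replace_day(self, usage_date, rows):
        self._rows_by_date[usage_date] = list(rows)

    def total(self):
        return sum(r["cost"] for rows in self._rows_by_date.values() for r in rows)


def rows_to_csv(rows) -> str:
    """Serialize rows to CSV text, with metadata encoded as a JSON string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "metadata": json.dumps(row["metadata"], sort_keys=True)})
    return buffer.getvalue()


rows = [
    {"source": "langfuse", "usage_date": "2026-06-30", "service_name": "ticket-summary",
     "cost": 1.5, "metadata": {"team": "support"}},
    {"source": "langfuse", "usage_date": "2026-06-30", "service_name": "sales-copilot",
     "cost": 2.5, "metadata": {"team": "sales"}},
]

store = InMemoryCostStore()
store.replace_day("2026-06-30", rows)
store.replace_day("2026-06-30", rows)  # a retried run must not double the day
csv_text = rows_to_csv(rows)
```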

4. Allocate it with Virtual Tags

In Finout, build Virtual Tags that read the metadata you shipped: a "Product" tag keyed off feature, a "Customer" tag off customer_id, a "Team" tag off team. Now your LLM spend slices by team, product, and customer right next to your EC2, RDS, and Kubernetes costs, in the same views, under the same allocation rules. Because Virtual Tags apply retroactively and codelessly, you can re-cut history the day you change your mind about how to group things, without re-instrumenting your code or replaying old traces.
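Virtual Tags are configured inside Finout, not in code, so the following is only a conceptual model of the idea: the same stored rows are re-cut by whichever metadata key a tag reads, with cloud and LLM rows side by side. slice_by and the row shapes are hypothetical and do not represent Finout's implementation or configuration syntax.

```python
def slice_by(rows, metadata_key, default="unallocated"):
    """Total cost per value of one metadata key; rows lacking the key go to default."""
    totals = {}
    for row in rows:
        value = row["metadata"].get(metadata_key, default)
        totals[value] = totals.get(value, 0.0) + row["cost"]
    return totals


# Hypothetical unified-bill rows: one cloud cost and two LLM costs.
bill = [
    {"source": "aws", "cost": 300.0,
     "metadata": {"team": "support", "feature": "ticket-summary"}},
    {"source": "langfuse", "cost": 120.0,
     "metadata": {"team": "support", "feature": "ticket-summary", "customer_id": "acme"}},
    {"source": "langfuse", "cost": 80.0,
     "metadata": {"team": "sales", "feature": "sales-copilot", "customer_id": "globex"}},
]

by_team = slice_by(bill, "team")
by_customer = slice_by(bill, "customer_id")
```

The rows never change between the two cuts; only the grouping rule does, which is why a new grouping can be applied to history without touching instrumentation.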

One refinement worth the effort: let Finout also ingest the raw OpenAI and Anthropic provider invoices directly, as it already can. Use those provider totals as the source of truth for how much, and your Langfuse-derived custom costs as the source of truth for the breakdown. Reconcile the two: they should land within a small percentage of each other, and any gap is itself a signal of untraced calls, a missing tag, or a rogue script hitting the API outside your instrumented paths.
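The reconciliation itself is a few lines of arithmetic. In this sketch the 5% tolerance is an arbitrary assumption standing in for "a small percentage"; pick a threshold that matches how completely your calls are traced.

```python
def reconcile(provider_total_usd: float, traced_total_usd: float, tolerance: float = 0.05) -> dict:
    """Compare the provider invoice total with the sum of traced costs."""
    gap = provider_total_usd - traced_total_usd
    gap_ratio = gap / provider_total_usd if provider_total_usd else 0.0
    return {
        "gap_usd": gap,
        "gap_ratio": gap_ratio,
        "within_tolerance": abs(gap_ratio) <= tolerance,
    }


ok = reconcile(provider_total_usd=1000.0, traced_total_usd=970.0)
suspicious = reconcile(provider_total_usd=1000.0, traced_total_usd=800.0)
```

A positive gap means the provider billed for calls your traces never saw; a negative one usually points at a pricing table that overstates what you actually pay.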

The provider invoice is the truth for the total. Langfuse is the truth for the breakdown. Finout is where the two finally meet, and where a gap between them becomes a question worth asking.

What the CFO finally sees

Once the bridge is running, the conversation with finance changes from defensive hand-waving to a shared dashboard. The questions that used to end in a shrug now have answers that update every morning.

Gross margin per AI feature: revenue attributed to a feature minus the model cost it actually incurred, not a guess.
Cost per customer, and per tier: including the uncomfortable truth about whether your free tier is subsidized by a handful of heavy users.
Chargeback and showback by team: every team sees its own AI spend in the same place it sees its cloud spend, so budgets mean something.
Unit economics that hold as you scale: cost per thousand requests for a feature, tracked over time, so growth doesn't quietly become a loss.
Attributable cost regressions: when spend spikes, you can trace it to the feature, the customer, and, because prompts are versioned, the exact change that caused it.
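Two of those numbers are simple enough to write down. The sketch below computes gross margin per feature as attributed revenue minus model cost, and cost per thousand requests; the revenue, cost, and request figures are invented, and how revenue gets attributed to a feature is a business decision this code does not make for you.

```python
def gross_margin(revenue_usd: float, model_cost_usd: float):
    """Return (margin in USD, margin as a fraction of revenue or None if no revenue)."""
    margin = revenue_usd - model_cost_usd
    ratio = margin / revenue_usd if revenue_usd else None
    return margin, ratio


def cost_per_thousand_requests(model_cost_usd: float, requests: int) -> float:
    """Unit cost normalized to 1,000 requests."""
    return model_cost_usd * 1000 / requests


# Hypothetical month for one feature.
margin_usd, margin_ratio = gross_margin(revenue_usd=5000.0, model_cost_usd=1250.0)
unit_cost = cost_per_thousand_requests(model_cost_usd=1250.0, requests=500_000)
```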

The discipline that makes it work

None of this is hard engineering. It's discipline applied in the right five places. Instrument every trace with the allocation metadata finance agreed to, enforced in a wrapper so it can't be skipped. Version your prompts so a cost change always has a named cause. Automate the daily pull-shape-push and keep it idempotent. Build the Virtual Tags once and let them re-cut history forever. Reconcile the provider bill against the Langfuse breakdown so you trust both numbers.

What I keep coming back to is this: an AI feature is a product with unit economics, and the most expensive, most variable line on your cloud bill deserves at least the financial rigor you give the cheapest, most predictable one. Probably more, because it's the one that moves. It's the same principle that keeps token spend from becoming a vanity metric: attribute cost to outcomes, never to raw usage. So treat LLM Ops as the P&L for the part of your product you understand the least and spend the most on, not as a dashboard for engineers to admire. Langfuse gives you the visibility, Finout gives you the accountability, and the bridge between them is engineering in direct service of the business. Build it, and the next time your CFO asks what the AI costs, you don't change the subject. You send a link.

Wiring up this kind of cost accountability is a standard part of the AI strategy work I do with companies. If your model invoice is still one big number, let's talk →