Companies spent the last two years trying to get AI into production. Now a different conversation is starting to happen inside engineering and finance teams.

How much does it actually cost to run AI at scale?

That question gets complicated very quickly.

Training large models still gets most of the attention. But for many enterprises, the bigger operational challenge is ongoing inference, experimentation, GPU utilization and unpredictable consumption patterns. AI workloads behave very differently from traditional cloud workloads. And many FinOps practices were never designed for this kind of infrastructure demand.

This matters because AI usage is growing fast. Goldman Sachs estimates that global AI infrastructure spending could reach between $4 trillion and $8 trillion over the next several years as companies invest in data centers, chips, networking and power infrastructure.

That level of investment changes how enterprises think about cloud economics.

Token costs add up faster than most teams expect

For years, cloud optimization focused heavily on areas like compute sizing, storage efficiency and reserved instance planning. AI introduces a different kind of operational pressure. Token usage can fluctuate heavily. GPU resources are expensive and often underutilized. AI teams experiment constantly. And newer AI systems increasingly rely on continuous inference and orchestration instead of occasional workloads.

The result is a cloud consumption model that becomes difficult to forecast once AI adoption starts spreading across teams.

One area where this becomes obvious is AI token pricing.

Many enterprises still underestimate how dramatically token costs can vary across models. Small differences may look manageable during pilot projects. At production scale, those differences compound quickly. The FinOps Foundation recently published a detailed breakdown of how token pricing actually works across AI systems, including how costs vary based on input tokens, output tokens, context windows and usage patterns. 
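
To make the compounding effect concrete, here is a minimal sketch of the arithmetic. The per-token prices, request volumes and token counts are illustrative assumptions, not figures from the FinOps Foundation or any vendor price list, and the names `ModelPrice` and `monthly_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical price card, in USD per one million tokens."""
    input_per_million: float
    output_per_million: float


def monthly_cost(price: ModelPrice, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Estimate monthly spend for a workload with a fixed request shape."""
    per_request = (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000
    return per_request * requests_per_day * days


# Assumed prices for illustration only; real price lists differ and change often.
premium = ModelPrice(input_per_million=10.0, output_per_million=30.0)
small = ModelPrice(input_per_million=0.5, output_per_million=1.5)

# Same request shape (2,000 input tokens, 500 output tokens) at two volumes.
for label, volume in [("pilot", 1_000), ("production", 1_000_000)]:
    gap = (monthly_cost(premium, volume, 2_000, 500)
           - monthly_cost(small, volume, 2_000, 500))
    print(f"{label}: monthly gap between models is about ${gap:,.2f}")
```

Under these assumed numbers, the gap between the two models is roughly a thousand dollars a month during the pilot and close to a million dollars a month at production volume, even though the per-request difference is only a few cents.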

This becomes even more important as organizations move beyond simple chatbot deployments.

More AI activity means more infrastructure pressure

AI systems are becoming more operationally complex. Enterprises are now managing retrieval systems, orchestration layers, vector databases, autonomous workflows and multi-model environments. McKinsey recently noted that AI infrastructure is becoming a critical business capability that extends far beyond software alone.

Agentic AI-powered solutions are adding another layer of pressure. These systems perform tasks continuously instead of responding to isolated prompts. That means more inference activity, more API calls and more persistent compute consumption. McKinsey also highlighted how agentic AI systems are increasing orchestration complexity and making infrastructure management more dynamic.

This creates a challenge for traditional FinOps models.

Many organizations still approach AI infrastructure with cloud optimization strategies built for predictable workloads. AI workloads are rarely predictable. Usage spikes can happen suddenly. Experimentation expands rapidly across teams. Model selection decisions may be driven more by hype than by operational efficiency.

And in many environments, visibility remains limited.

Bigger models are not always the smartest choice

GPU utilization is becoming a major concern. AI infrastructure is expensive enough that idle or poorly utilized resources create significant operational waste. Some enterprises are now reconsidering where AI workloads should run altogether. Recent studies also pointed to growing interest in private AI infrastructure because organizations want better control over governance, cost predictability and resource allocation.

Another interesting trend is happening around model size.

For a while, enterprise AI conversations focused heavily on using the largest available models. That thinking is starting to evolve. Smaller language models are becoming increasingly practical for targeted enterprise use cases. In many scenarios, companies are finding that lightweight models provide acceptable performance with significantly lower infrastructure costs and lower latency.

That changes the economics considerably.

Instead of relying on a single large model for every workload, organizations are beginning to think more carefully about workload-aware AI model selection. Some tasks may justify premium reasoning models. Others may work perfectly well with smaller and cheaper alternatives.
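
One way to picture workload-aware selection is a small routing rule that sends each request to a model tier. This is only a sketch: the task categories, the `Tier` names and the `select_tier` function are hypothetical, and a real router would typically also weigh measured quality, latency targets and budget.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"      # lightweight, cheaper model
    PREMIUM = "premium"  # larger reasoning model


# Assumed set of task types that a smaller model handles acceptably.
LIGHTWEIGHT_TASKS = {"classification", "extraction", "summarization"}


def select_tier(task_type: str, needs_multistep_reasoning: bool = False) -> Tier:
    """Pick a model tier; unknown task types fall back to the premium tier."""
    if needs_multistep_reasoning or task_type not in LIGHTWEIGHT_TASKS:
        return Tier.PREMIUM
    return Tier.SMALL


print(select_tier("classification").value)
print(select_tier("contract_analysis").value)
```

Defaulting unknown task types to the premium tier is a conservative choice made for this sketch. A cost-first team might invert that default and escalate only when the smaller model's output fails a quality check.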

This is where AI cost optimization becomes more strategic than tactical.

The conversation is no longer limited to reducing cloud bills after deployment. Enterprises are starting to evaluate how AI architecture decisions affect long-term operational efficiency. Model routing, inference optimization, caching and workload allocation are becoming important business decisions because infrastructure costs scale very quickly once AI usage expands.

AI spending is finally getting boardroom attention 

Many organizations approved AI experimentation budgets over the last two years without fully understanding what operational scaling would look like. That is beginning to change. Leadership teams now want visibility into AI ROI, infrastructure efficiency and ongoing operating costs.

And they should.

AI infrastructure demand is growing faster than many organizations expected. AI-optimized data centers can now cost between $15 million and $20 million per megawatt because of GPU density, cooling requirements and infrastructure complexity. Those economics eventually affect enterprise decision-making.

This does not mean organizations should slow down AI adoption. But it does mean AI deployment strategies need more operational discipline than many companies currently have. AI projects that look manageable during experimentation can become very expensive once usage scales across products, employees and customers.

FinOps teams are now being asked to solve problems that barely existed a few years ago. They need visibility into token consumption, inference efficiency, GPU allocation and workload behavior across increasingly distributed AI environments.

That requires a broader view of cloud and AI optimization. The organizations that handle this well will probably be the ones that understand how to balance performance, cost efficiency and operational scale before complexity becomes difficult to control.

This article was originally published in Forbes Technology Council.
