In the first blog of this series, we reviewed the complexity of developing foundational large language models (LLMs) and the vast resources required, which place that development out of reach for most enterprises. Now, we’ll review the options available to enterprises for leveraging these foundational LLMs in their own private data centers.
Even a small AI data center of 128 GPUs can cost millions of dollars to deploy, so investing in efficiency is critical to cost containment. The investment approach is driven by what Juniper calls the ABCs of AI data centers: Applications, Build vs. Buy, and Cost. In this blog, we look specifically at how application needs influence the AI consumption model for enterprise AI investment.
Application complexity
When planning your AI investment, it is important to first understand the goals, objectives, and outcomes expected from your AI application. Is your use case more generic, like an AI-powered support assistant, document analyzer, or technical documentation assistant to enhance the customer experience? Or is it a more customized and differentiating AI application that is industry- or organization-specific? Increasing levels of customization add complexity to the already complicated development process, which influences the consumption model for the underlying LLM or application.
McKinsey identifies three approaches for deploying an AI application: Makers, Takers, and Shapers.
Makers are the big fish in the pond
Makers are the few companies in the world with the financial means and expertise to develop their own foundational LLMs trained on internet-scale data. These are companies such as Google (Gemma, Gemini), Meta (Llama), OpenAI (GPT), Anthropic (Claude), Mistral (Large, Nemo), and Amazon (Titan). Most enterprises, for which LLM development is not a core competency, will follow the path of a Taker or a Shaper.
Takers leverage existing AI applications out of the box
Enterprises deploying less complex services, like a generic customer chatbot or natural language processing (NLP) connected to an existing database, need not customize an off-the-shelf LLM. Instead, they can “take” an existing AI application based on a pre-trained LLM, whether licensed or open source, and deploy that model for inferencing.
Today, these applications are table stakes for most enterprises. That means there may be little competitive differentiation, but the reduced complexity streamlines deployment and avoids unnecessary spend if the application achieves the desired outcomes. From Hugging Face, a repository of open AI models, applications, and datasets, enterprises have access to over 400,000 pre-trained models, 150,000 AI applications, and 100,000 datasets, so there are many options for deploying AI with speed and efficiency.
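To make the Taker approach concrete, the sketch below pulls an off-the-shelf, pre-trained model from the Hugging Face Hub and serves a single inference request with no customization. It is a minimal illustration only; the model name and prompt are placeholders, and some models on the Hub require accepting a license before download.

```python
# Minimal "Taker" sketch: run an off-the-shelf pre-trained LLM as-is.
# The model name and prompt are illustrative placeholders.
from transformers import pipeline

# Download (or load from local cache) a publicly available instruction-tuned model.
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example pre-trained LLM
    device_map="auto",                           # place weights on the available GPU(s)
)

# Serve a generic support-assistant style request directly with the stock model.
response = generator(
    "Summarize the steps to reset a forgotten account password.",
    max_new_tokens=200,
    do_sample=False,
)
print(response[0]["generated_text"])
```

Because nothing is retrained or fine-tuned, the only infrastructure decision is where this inferencing workload runs and how it scales.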
Shapers make the LLMs their own
For enterprises needing greater competitive differentiation or customized workflow applications, off-the-shelf LLMs or applications may not suffice. A Shaper takes a pre-trained LLM and “shapes” it by fine-tuning it on their own proprietary datasets. The result is an LLM that provides specific, accurate responses to prompts in that domain (a simplified fine-tuning sketch follows the list below). Applications that would benefit from this model include, but aren’t limited to:
- Workflow automation, where an LLM is refined to perform a specific job function, making the work easier or less mundane
- AI assistance to compare internal policy documents, regulatory rules, or legal amendments to identify differences that need special considerations
- Copilots trained on specific operating systems, CLIs, and documentation to simplify code development or document searches
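As a rough illustration of the Shaper approach, the sketch below fine-tunes a pre-trained model on a proprietary text dataset using the Hugging Face Trainer API. The base model name, dataset file, and hyperparameters are placeholders rather than recommendations, and some models on the Hub are gated behind a license agreement.

```python
# Minimal "Shaper" sketch: fine-tune a pre-trained LLM on proprietary data.
# Model name, data file, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Llama-2-7b-hf"          # example pre-trained LLM
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token        # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Proprietary, domain-specific text (e.g., internal policy documents) in JSON Lines form.
dataset = load_dataset("json", data_files="internal_policies.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="shaped-llm",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()                      # produces a domain-adapted checkpoint
trainer.save_model("shaped-llm")     # this is the model deployed for inferencing
```

In practice, Shapers often layer parameter-efficient techniques such as LoRA on top of this basic flow to keep GPU memory and training time manageable.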
Leveraging RAG inferencing
As covered in the first blog of this series, inferencing systems deliver trained AI applications to end users and devices. Depending on the size of the model, inferencing can be deployed on a single GPU or server or as a multi-node deployment where the application is distributed across multiple servers for increased scale and performance.
A relatively new innovation called Retrieval-Augmented Generation (RAG) offers enterprises an interesting technique for customizing AI model development and/or deployment. RAG augments a pre-trained LLM with supplemental data obtained from an external data source. With RAG, a vector embedding is computed from the user query, and that embedding is used to find the closest-matching chunks of text or data in the external data source. The most relevant chunks are then provided to the LLM, along with the original prompt, for inference. Because this locally sourced data is relevant to the original prompt, the LLM can answer using that additional context along with its own general understanding. Without any LLM retraining, RAG delivers a specific and accurate response to the customer or device query.
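The flow described above maps to a short sketch: embed the query, retrieve the closest chunks, then hand those chunks plus the original prompt to the LLM. The embedding model, the in-memory “vector store,” and the generator below are illustrative choices, not a prescribed stack.

```python
# Minimal RAG sketch: embed the query, retrieve the closest chunks from a local
# corpus, and pass those chunks plus the original prompt to the LLM.
# Models and the in-memory "vector store" are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")                # example embedding model
generator = pipeline("text-generation",
                     model="mistralai/Mistral-7B-Instruct-v0.2",  # example pre-trained LLM
                     device_map="auto")

# Enterprise documents chunked ahead of time; their embeddings act as the vector index.
chunks = [
    "VPN access requires multi-factor authentication for all contractors.",
    "Password resets are handled through the self-service identity portal.",
    "Quarterly firewall audits are owned by the network security team.",
]
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

def rag_answer(query: str, top_k: int = 2) -> str:
    # 1. Obtain a vector embedding from the user query.
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    # 2. Retrieve the chunks whose embeddings are closest to the query embedding.
    scores = chunk_vectors @ query_vector
    best = [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]
    # 3. Provide the retrieved chunks alongside the original prompt to the LLM.
    prompt = "Context:\n" + "\n".join(best) + f"\n\nQuestion: {query}\nAnswer:"
    result = generator(prompt, max_new_tokens=150, do_sample=False)
    return result[0]["generated_text"]

print(rag_answer("How do contractors connect to the VPN?"))
```

In a production deployment, the in-memory list above would typically be replaced by a dedicated vector database, which is exactly the additional data source whose network path matters below.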
RAG nestles nicely between the Taker and Shaper models, giving enterprises a way to use off-the-shelf LLMs without fine-tuning. However, because an additional data source (often a vector database) must now be accessed alongside the original prompt query, the network connections between the front end, the external data source, and the LLM must deliver high performance with very low end-to-end latency.
With the LLM consumption models defined, enterprises must next choose how to deploy their training and inference workloads. In the next blog in this series, we will review ‘Build vs. Buy’ options, along with the associated cost considerations for each.