I went to Ray Summit a couple of weeks ago to learn more about Ray and the interesting use cases it enables and supports. Adding to the fervor in the generative AI space, the theme was “The LLM and Generative AI Conference for Developers” (though the conference remained focused on Ray).
Years ago, back when the Nvidia V100 was the fastest GPU, I worked on AI training systems + distributed training at Determined AI, a deep learning training platform that was later acquired by HPE. Currently, at Tecton, I work on the Feature Engineering team, where we help our users build the real-time context they feed into their models, whether those are classical machine learning models, deep learning models, or new-school generative AI models.
In this post, I put together a list of interesting announcements, learnings, and my insights from the conference.
Anyscale Endpoints makes open source LLMs more accessible
Anyscale is leaning more directly into the LLM space by providing “Anyscale Endpoints” for serving open source LLMs. As we saw with OpenAI, putting powerful models behind an API makes this technology significantly more accessible. I’m excited about this because having that same pattern for open source models will make the choice to build with them much more practical.
My bet: Open source models + fine tuning + inference behind an API will be a strong alternative to closed-source models, particularly for those interested in data governance and security or those who want more control/ownership over their models.
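The appeal of this pattern is that swapping from a closed-source provider to an open source model served behind an API is mostly a matter of changing a base URL and a model name. The sketch below assembles (but does not send) two OpenAI-style chat-completions requests to illustrate this; the endpoint URLs and model names are illustrative assumptions, not Anyscale’s actual values.

```python
# Sketch of the "models behind an API" pattern: most open source LLM
# endpoints mimic the OpenAI chat-completions request shape, so the
# same client code can target either provider. URLs/model names below
# are placeholders for illustration only.

def build_chat_request(base_url: str, model: str, user_message: str) -> dict:
    """Assemble an OpenAI-style chat-completions request (not sent here)."""
    return {
        "url": f"{base_url}/chat/completions",
        "payload": {
            "model": model,
            "messages": [{"role": "user", "content": user_message}],
        },
    }

# Same client code, two providers -- only the endpoint and model differ.
closed = build_chat_request("https://api.openai.com/v1", "gpt-4", "Hi")
open_src = build_chat_request("https://example-endpoint/v1",
                              "llama-2-70b-chat", "Hi")
```

Because the request shape stays identical, the switching cost is low, which is exactly what makes open source models a practical alternative for teams that care about data governance or model ownership.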
Ray for serving models is gaining steam
Although using Ray for AI training was nearly universal among Ray Summit attendees, Ray Serve usage wasn’t nearly as common. Part of the reason is that serving live production traffic is super challenging, so introducing new serving tooling is much riskier than introducing new tools for training or batch inference. Still, there were some really interesting talks on how companies are leveraging Ray Serve for their next generation of serving platforms.
For example, LinkedIn built a cool tool on top of Ray Serve to power a new system called Proxima Inference Graph. DoorDash also built its new model serving platform on top of Ray Serve to enable more flexible model serving, and that platform powers new LLM-based models.
My bet: The flexibility of Ray Serve will enable it to continue gaining popularity for providing the cutting edge of model serving.
Has generative AI changed predictive ML?
Generative AI + LLMs have not really changed how companies like DoorDash and Uber approach their predictive ML use cases, like fraud detection, ETA prediction, and recommendations, which are established use cases core to their business. However, they are focusing on leveraging LLMs for a new class of generative AI use cases, such as support chatbots.
My bet: The tried-and-true infrastructure patterns that enable predictive ML will continue powering those use cases, even as ML model patterns evolve.
Generative AI is expensive
Cost optimization is important to many enterprises. And while it’s quick to get started with generative AI, meaningful generative AI use cases are not easy to implement in a cost-effective way. The gap in expertise between building a use case with RAG + an LLM API and building a domain-specific model is vast. You also run into infrastructure problems specific to this moment: a GPU shortage + tooling that is still in its infancy.
My bet: Over time, we will see startups and enterprises turn to cheaper models as some of the novelty wears off and the bills start coming in. No, you don’t need GPT-4 to answer “What is the meaning of life?” in your docs chat bot!
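A quick back-of-envelope calculation shows why the bills push teams toward cheaper models. The sketch below compares monthly token spend for a docs chatbot at two price points; the prices, request volume, and token counts are hypothetical placeholders I chose for illustration, not published rates.

```python
# Back-of-envelope cost comparison for an LLM-backed docs chatbot.
# All numbers below are hypothetical placeholders, not real rates.

def monthly_cost(requests_per_month: int, tokens_per_request: int,
                 dollars_per_million_tokens: float) -> float:
    """Total monthly token spend, in dollars."""
    total_tokens = requests_per_month * tokens_per_request
    return total_tokens / 1_000_000 * dollars_per_million_tokens

REQUESTS = 1_000_000   # chat requests per month (illustrative)
TOKENS = 1_500         # prompt + completion tokens per request (illustrative)

big_model = monthly_cost(REQUESTS, TOKENS, 60)   # hypothetical frontier-model rate
small_model = monthly_cost(REQUESTS, TOKENS, 2)  # hypothetical small open model rate
# At these illustrative rates, the large model costs 30x more per month.
```

The exact numbers matter less than the shape of the math: token spend scales linearly with traffic, so a 30x per-token price gap becomes a 30x monthly bill gap as usage grows.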
Retrieval augmented generation (RAG) vs. fine tuning
As some say, “RAG for facts, fine-tuning for form.” However, the “right” balance of prompting vs. fine-tuning vs. RAG is going to be use-case specific (and a bit of an open question). When fine-tuning, dataset quality and consistent formatting between training and serving are both crucial to get right. (From the session, “Lessons From Fine-Tuning Llama-2.”)
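One simple way to keep training and serving formats consistent is to route both through a single prompt template, so the model never sees a format at inference time that it wasn’t trained on. The template below is an illustrative instruction-style format I made up for this sketch, not the actual format from the Llama-2 fine-tuning session.

```python
# Sketch: one prompt template shared by fine-tuning data prep and
# serving. The template itself is illustrative, not a real spec.

PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"

def to_training_example(instruction: str, response: str) -> dict:
    """One fine-tuning record: templated prompt + target completion."""
    return {"prompt": PROMPT_TEMPLATE.format(instruction=instruction),
            "completion": response}

def to_inference_prompt(instruction: str) -> str:
    """Exactly the same templating path at serving time."""
    return PROMPT_TEMPLATE.format(instruction=instruction)

example = to_training_example("Summarize Ray Serve.", "Ray Serve is ...")
# Train/serve consistency: the serving prompt matches the training prompt.
assert to_inference_prompt("Summarize Ray Serve.") == example["prompt"]
```

Centralizing the template means a format change is made once and propagates to both the dataset builder and the serving path, instead of silently drifting apart.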
My bet: The best implementations are going to use a combination of all three: fine-tuning for form/domain specificity, prompting for tone/focus, and RAG for facts/real-time data. Tools and infrastructure for evaluating LLM applications + managing the data/context that serve as inputs to these models will become increasingly important.
Task-specific LLMs vs. general LLMs?
Your scale is going to determine whether smaller task-specific LLMs are more cost-effective than larger, more general ones. You’ll have to trade off training vs. serving costs based on your use case.
Your latency and deployment requirements are going to determine whether you need a smaller task-specific LLM. You can’t put a 70B-parameter model on a phone, and no one wants to sit around for seconds waiting for a generation to complete.
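The phone claim falls out of simple arithmetic on the weights alone. The sketch below estimates weight memory for a couple of model sizes, assuming fp16 (2 bytes per parameter) and 4-bit quantization (0.5 bytes per parameter), and ignoring activations and KV cache, which only add to the total.

```python
# Rough memory footprint of model weights alone, ignoring activations
# and KV cache. Bytes-per-parameter values assume fp16 and int4.

def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes here)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

fp16_70b = weight_gb(70, 2.0)   # 140 GB -- far beyond any phone
int4_7b = weight_gb(7, 0.5)     # 3.5 GB -- plausibly on-device
```

Even aggressive quantization doesn’t bring a 70B model anywhere near phone-sized memory, which is why on-device and low-latency deployments point toward smaller task-specific models.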
My bet: Just like we have numerous examples today of how to implement RAG with different vector stores and libraries, I would expect the patterns for modeling your data and fine-tuning a model to emerge so that it becomes just as easy. Smaller, task-specific LLMs will become a common practice.
Interested in reading more? Check out this post to learn how you can use both generative AI and LLMs to increase customer satisfaction, or this post to learn more about the concerns surrounding generative AI and how the industry is addressing them.