Exploring Self-Hosted LLM Solutions for Solopreneurs and Small Tech Businesses
By Ivan Foong (@vonze21)
Introduction
In today's dynamic AI landscape, Large Language Models (LLMs) are critical tools for innovation, especially for solopreneurs and small tech enterprises. While cloud-based LLMs such as GPT-4, the model behind OpenAI's ChatGPT, have been the norm, the shift towards self-hosting open-source models is gaining momentum. Self-hosting not only gives businesses greater control over data privacy and security but also significantly reduces the risk of vendor lock-in, empowering them with the flexibility to adapt and evolve in the fast-paced tech world.
Key Takeaways
- Understanding the importance and benefits of self-hosted LLMs.
- An introduction to popular self-hosted LLM solutions: SkyPilot, Text-Generation-WebUI, LiteLLM, Huggingface TGI, and OpenLLM.
- Additional considerations for choosing and implementing a self-hosted LLM solution.
Understanding Self-Hosted LLMs: An Overview
LLMs, integral to applications from chatbots to content creation, traditionally relied on cloud-based platforms. Self-hosting offers businesses complete control over their data, ensuring enhanced privacy and security, a necessity in sectors handling sensitive information.
Advantages of Self-Hosting LLMs
Self-hosted LLMs offer a multitude of benefits, paramount among them being:
- Security and Privacy: Full control over sensitive data, mitigating risks associated with cloud-based services.
- Customization and Scalability: Tailored solutions that can be scaled as required.
- No Vendor Lock-in: Eliminating vendor lock-in keeps businesses flexible and independent rather than reliant on a single provider's ecosystem, significantly reducing risk and giving them the agility to adopt new technologies and vendors as they evolve.
Exploring Popular Self-Hosted LLM Solutions
Several solutions stand out in the realm of self-hosted LLMs:
1. SkyPilot
SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution.
SkyPilot abstracts away cloud infra burdens:
- Launch jobs & clusters on any cloud
- Easy scale-out: queue and run many jobs, automatically managed
- Easy access to object stores (S3, GCS, R2)
SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.
SkyPilot Deployment
- Create a directory from anywhere on your machine:
$ mkdir hello-sky
$ cd hello-sky
- Copy the following YAML into a hello_sky.yaml file:
resources:
  # Optional; if left out, automatically pick the cheapest cloud.
  cloud: aws
  # 1x NVIDIA V100 GPU
  accelerators: V100:1

# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: .

# Typical use: pip install -r requirements.txt
# Invoked under the workdir (i.e., can use its files).
setup: |
  echo "Running setup."

# Typical use: make use of resources, such as running training.
# Invoked under the workdir (i.e., can use its files).
run: |
  echo "Hello, SkyPilot!"
  conda env list
- To launch a cluster and run a task, use sky launch:
$ sky launch -c mycluster hello_sky.yaml
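If you would rather drive SkyPilot from Python, the same launch can be sketched with its programmatic API. The snippet below is a rough equivalent of hello_sky.yaml, assuming the sky package's Task, Resources, and launch interface; treat it as a sketch to adapt, not the canonical quickstart.
# Rough Python-API equivalent of the hello_sky.yaml launch above (a sketch;
# assumes `pip install skypilot` and configured AWS credentials).
import sky

task = sky.Task(
    setup='echo "Running setup."',
    run='echo "Hello, SkyPilot!" && conda env list',
    workdir='.',
)
# Mirror the YAML resources: 1x NVIDIA V100 on AWS.
task.set_resources(sky.Resources(cloud=sky.AWS(), accelerators='V100:1'))

# Launch a cluster named "mycluster" and run the task on it.
sky.launch(task, cluster_name='mycluster')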
You can find out more at the SkyPilot GitHub repository.
2. Text-Generation-WebUI
A Gradio web UI for Large Language Models. It supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), and Llama models.
Features:
- 3 interface modes: default (two columns), notebook, and chat
- Multiple model backends: transformers, llama.cpp (through llama-cpp-python), ExLlama, ExLlamaV2, AutoGPTQ, GPTQ-for-LLaMa, CTransformers, AutoAWQ
- Dropdown menu for quickly switching between different models
- LoRA: load and unload LoRAs on the fly, train a new LoRA using QLoRA
- Precise instruction templates for chat mode, including Llama-2-chat, Alpaca, Vicuna, WizardLM, StableLM, and many others
- 4-bit, 8-bit, and CPU inference through the transformers library
- Use llama.cpp models with transformers samplers (llamacpp_HF loader)
- Multimodal pipelines, including LLaVA and MiniGPT-4
- Extensions framework
- Custom chat characters
- Markdown output with LaTeX rendering, to use for instance with GALACTICA
- OpenAI-compatible API server with Chat and Completions endpoints
Text-Generation-WebUI Deployment
- Clone or download the repository.
- Run the start_linux.sh, start_windows.bat, start_macos.sh, or start_wsl.bat script depending on your OS.
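Once the UI is running with its API enabled, the OpenAI-compatible server mentioned above can be called from any HTTP client. Here is a minimal sketch in Python; the --api flag and the localhost:5000 address are assumptions for this example, so adjust them to your own setup.
# Minimal sketch of calling Text-Generation-WebUI's OpenAI-compatible API.
# Assumes the server was started with the API enabled (e.g., the --api flag)
# and is reachable at localhost:5000 -- adjust the URL to your install.
import requests

response = requests.post(
    "http://localhost:5000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Why might a small team self-host an LLM?"}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])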
You can find out more at the Text-Generation-WebUI GitHub repository.
3. LiteLLM
LiteLLM simplifies LLM API calls, offers an easy installation process, and supports major LLM providers.
LiteLLM manages:
- Translating inputs to the provider's completion and embedding endpoints
- Guarantees consistent output: text responses will always be available at ['choices'][0]['message']['content'] (see the sketch after this list)
- Exception mapping - common exceptions across providers are mapped to the OpenAI exception types.
- Load-balance across multiple deployments (e.g. Azure/OpenAI)
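Here is the sketch referenced above: a minimal example of LiteLLM's unified completion() call in Python. The model choice and the Hugging Face credential setup are illustrative assumptions; swap in whichever provider you actually use.
# Minimal sketch of LiteLLM's unified completion() interface.
# Assumes `pip install litellm` and provider credentials in the environment
# (e.g., HUGGINGFACE_API_KEY for Hugging Face-hosted models).
from litellm import completion

response = completion(
    model="huggingface/bigcode/starcoder",
    messages=[{"role": "user", "content": "Write a haiku about self-hosting."}],
)

# Regardless of the provider, the text is always at the same path:
print(response["choices"][0]["message"]["content"])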
LiteLLM Deployment
- Start LiteLLM in one line with its litellm CLI:
$ litellm --model huggingface/bigcode/starcoder
You can find out more at the LiteLLM GitHub repository.
4. Huggingface TGI
Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:
- Simple launcher to serve most popular LLMs
- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
- Quantization with bitsandbytes and GPT-Q
- Safetensors weight loading
- Watermarking with A Watermark for Large Language Models
- Logits warper (temperature scaling, top-p, top-k, repetition penalty; see transformers.LogitsProcessor for details)
- Stop sequences
- Log probabilities
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
TGI provides an official Docker container for quick setup and supports private or gated models.
Huggingface TGI Deployment
$ model=HuggingFaceH4/zephyr-7b-beta
$ volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
$ docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.1.1 --model-id $model
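Once the container is up, you can query it over HTTP. The sketch below posts to TGI's generate endpoint on the port mapped above; the prompt and generation parameters are only illustrative.
# Minimal sketch of querying the running TGI container via its REST API.
# Assumes the `docker run` above mapped the server to localhost:8080.
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain what self-hosting an LLM means in one sentence.",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=120,
)
print(response.json()["generated_text"])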
You can find out more at the Huggingface TGI GitHub repository.
5. OpenLLM
OpenLLM is an open-source platform designed to facilitate the deployment and operation of large language models (LLMs) in real-world applications. With OpenLLM, you can run inference on any open-source LLM, deploy them on the cloud or on-premises, and build powerful AI applications.
Key features include:
- 🚂 State-of-the-art LLMs: Integrated support for a wide range of open-source LLMs and model runtimes, including but not limited to Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder.
- 🔥 Flexible APIs: Serve LLMs over a RESTful API or gRPC with a single command. You can interact with the model using a Web UI, CLI, Python/JavaScript clients, or any HTTP client of your choice.
- ⛓️ Freedom to build: First-class support for LangChain, BentoML, OpenAI endpoints, and Hugging Face, allowing you to easily create your own AI applications by composing LLMs with other models and services.
- 🎯 Streamline deployment: Automatically generate your LLM server Docker images or deploy as serverless endpoints via ☁️ BentoCloud, which effortlessly manages GPU resources, scales according to traffic, and ensures cost-effectiveness.
- 🤖️ Bring your own LLM: Fine-tune any LLM to suit your needs. You can load LoRA layers to fine-tune models for higher accuracy and performance on specific tasks. A unified fine-tuning API for models (LLM.tuning()) is coming soon.
- ⚡ Quantization: Run inference with lower computational and memory costs using quantization techniques such as LLM.int8, SpQR (int4), AWQ, GPTQ, and SqueezeLLM.
- 📡 Streaming: Support token streaming through server-sent events (SSE). You can use the /v1/generate_stream endpoint to stream responses from LLMs.
- 🔄 Continuous batching: Support continuous batching via vLLM for increased total throughput.
OpenLLM is designed for AI application developers working to build production-ready applications based on LLMs. It delivers a comprehensive suite of tools and features for fine-tuning, serving, deploying, and monitoring these models, simplifying the end-to-end deployment workflow for LLMs.
OpenLLM Deployment
$ docker run --rm --gpus all -p 3000:3000 -it ghcr.io/bentoml/openllm start HuggingFaceH4/zephyr-7b-beta --backend vllm
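With the server running, you can call it over HTTP. The sketch below is a rough illustration only: the /v1/generate path and the request body shape are assumptions based on the /v1/generate_stream endpoint mentioned above, so verify the exact schema against your OpenLLM version's API docs before relying on it.
# Rough sketch of calling an OpenLLM server over HTTP.
# The endpoint path and JSON body are assumptions -- check your server's
# own API documentation (e.g., its OpenAPI page) for the exact schema.
import requests

response = requests.post(
    "http://localhost:3000/v1/generate",
    json={"prompt": "Suggest a tagline for a one-person dev tools studio."},
    timeout=120,
)
print(response.json())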
You can find out more at the OpenLLM GitHub repository.
Additional Considerations for Choosing the Right Solution
Selecting the ideal self-hosted LLM stack means balancing your specific requirements, such as the models and hardware you need to support, against scalability and flexibility. Operational monitoring and telemetry are also vital, especially for businesses scaling their applications.
Conclusion
Self-hosted LLM solutions offer enhanced data security and customized scalability for solopreneurs and small tech businesses. Exploring these solutions could be a game-changer for business operations and strategic goals.
I'll be creating a series of articles diving deeper into each of these solutions. Let's continue to explore together how we can reduce third-party risk in our ventures!