Exploring Self-Hosted LLM Solutions for Solopreneurs and Small Tech Businesses
By Ivan Foong (@vonze21)
Introduction
In today's dynamic AI landscape, Large Language Models (LLMs) are critical tools for innovation, especially for solopreneurs and small tech enterprises. While cloud-based LLMs such as GPT-4, the model behind OpenAI's ChatGPT, have been the norm, the shift towards self-hosting open-source models is gaining momentum. Self-hosting not only gives businesses greater control over data privacy and security but also significantly reduces the risk of vendor lock-in, empowering them with the flexibility to adapt and evolve in the fast-paced tech world.
Key Takeaways
- Understanding the importance and benefits of self-hosted LLMs.
- An introduction to popular self-hosted LLM solutions: SkyPilot, Text-Generation-WebUI, LiteLLM, Huggingface TGI, and OpenLLM.
- Additional considerations for choosing and implementing a self-hosted LLM solution.
Understanding Self-Hosted LLMs: An Overview
LLMs, integral to applications from chatbots to content creation, traditionally relied on cloud-based platforms. Self-hosting offers businesses complete control over their data, ensuring enhanced privacy and security, a necessity in sectors handling sensitive information.
Advantages of Self-Hosting LLMs
Self-hosted LLMs offer a multitude of benefits, paramount among them being:
- Security and Privacy: Full control over sensitive data, mitigating risks associated with cloud-based services.
- Customization and Scalability: Tailored solutions that can be scaled as required.
- No Vendor Lock-in: Eliminating vendor lock-in keeps businesses flexible and independent rather than reliant on a single provider's ecosystem, significantly reducing risk and giving them the agility to adopt new technologies and vendors as they evolve.
Exploring Popular Self-Hosted LLM Solutions
Several solutions stand out in the realm of self-hosted LLMs:
1. SkyPilot
SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution.
SkyPilot abstracts away cloud infra burdens:
- Launch jobs & clusters on any cloud
- Easy scale-out: queue and run many jobs, automatically managed
- Easy access to object stores (S3, GCS, R2)
SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.
SkyPilot Deployment
- Create a directory from anywhere on your machine:
$ mkdir hello-sky
$ cd hello-sky
- Copy the following YAML into a hello_sky.yaml file:
resources:
  # Optional; if left out, automatically pick the cheapest cloud.
  cloud: aws
  # 1x NVIDIA V100 GPU
  accelerators: V100:1

# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: .

# Typical use: pip install -r requirements.txt
# Invoked under the workdir (i.e., can use its files).
setup: |
  echo "Running setup."

# Typical use: make use of resources, such as running training.
# Invoked under the workdir (i.e., can use its files).
run: |
  echo "Hello, SkyPilot!"
  conda env list
- To launch a cluster and run a task, use sky launch:
$ sky launch -c mycluster hello_sky.yaml
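If you would rather drive SkyPilot from Python, the same launch can be sketched with its programmatic API. The snippet below is a rough equivalent of hello_sky.yaml, assuming the sky package's Task, Resources, and launch interface; treat it as a sketch to adapt, not the canonical quickstart.
# Rough Python-API equivalent of the hello_sky.yaml launch above (a sketch;
# assumes `pip install skypilot` and configured AWS credentials).
import sky

task = sky.Task(
    setup='echo "Running setup."',
    run='echo "Hello, SkyPilot!" && conda env list',
    workdir='.',
)
# Mirror the YAML resources: 1x NVIDIA V100 on AWS.
task.set_resources(sky.Resources(cloud=sky.AWS(), accelerators='V100:1'))

# Launch a cluster named "mycluster" and run the task on it.
sky.launch(task, cluster_name='mycluster')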
You can find out more at the SkyPilot GitHub repository.
2. Text-Generation-WebUI
A Gradio web UI for Large Language Models. It supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), and Llama models.
Features:
- 3 interface modes: default (two columns), notebook, and chat
- Multiple model backends: transformers, llama.cpp (through llama-cpp-python), ExLlama, ExLlamaV2, AutoGPTQ, GPTQ-for-LLaMa, CTransformers, AutoAWQ
- Dropdown menu for quickly switching between different models
- LoRA: load and unload LoRAs on the fly, train a new LoRA using QLoRA
- Precise instruction templates for chat mode, including Llama-2-chat, Alpaca, Vicuna, WizardLM, StableLM, and many others
- 4-bit, 8-bit, and CPU inference through the transformers library
- Use llama.cpp models with transformers samplers (llamacpp_HF loader)
- Multimodal pipelines, including LLaVA and MiniGPT-4
- Extensions framework
- Custom chat characters
- Markdown output with LaTeX rendering, to use for instance with GALACTICA
- OpenAI-compatible API server with Chat and Completions endpoints
Text-Generation-WebUI Deployment
- Clone or download the repository.
- Run the start_linux.sh, start_windows.bat, start_macos.sh, or start_wsl.bat script depending on your OS.
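Once the UI is running with its API enabled, the OpenAI-compatible server mentioned above can be called from any HTTP client. Here is a minimal sketch in Python; the --api flag and the localhost:5000 address are assumptions for this example, so adjust them to your own setup.
# Minimal sketch of calling Text-Generation-WebUI's OpenAI-compatible API.
# Assumes the server was started with the API enabled (e.g., the --api flag)
# and is reachable at localhost:5000 -- adjust the URL to your install.
import requests

response = requests.post(
    "http://localhost:5000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Why might a small team self-host an LLM?"}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])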
You can find out more at the Text-Generation-WebUI GitHub repository.
3. LiteLLM
LiteLLM simplifies LLM API calls, offers an easy installation process, and supports major LLM providers.
LiteLLM manages:
- Translating inputs to the provider's completion and embedding endpoints
- Guarantees consistent output: text responses will always be available at ['choices'][0]['message']['content'] (see the sketch after this list)
- Exception mapping - common exceptions across providers are mapped to the OpenAI exception types.
- Load-balance across multiple deployments (e.g. Azure/OpenAI)
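Here is the sketch referenced above: a minimal example of LiteLLM's unified completion() call in Python. The model choice and the Hugging Face credential setup are illustrative assumptions; swap in whichever provider you actually use.
# Minimal sketch of LiteLLM's unified completion() interface.
# Assumes `pip install litellm` and provider credentials in the environment
# (e.g., HUGGINGFACE_API_KEY for Hugging Face-hosted models).
from litellm import completion

response = completion(
    model="huggingface/bigcode/starcoder",
    messages=[{"role": "user", "content": "Write a haiku about self-hosting."}],
)

# Regardless of the provider, the text is always at the same path:
print(response["choices"][0]["message"]["content"])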
LiteLLM Deployment
- Start LiteLLM in one line with its litellm CLI:
$ litellm --model huggingface/bigcode/starcoder
You can find out more at the LiteLLM GitHub repository.
4. Huggingface TGI
Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:
- Simple launcher to serve most popular LLMs
- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
- Quantization with bitsandbytes and GPT-Q
- Safetensors weight loading
- Watermarking with A Watermark for Large Language Models
- Logits warper (temperature scaling, top-p, top-k, repetition penalty; see transformers.LogitsProcessor for details)
- Stop sequences
- Log probabilities
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
TGI provides an official Docker container for quick setup and supports private or gated models.
Huggingface TGI Deployment
$ model=HuggingFaceH4/zephyr-7b-beta
$ volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
$ docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.1.1 --model-id $model
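Once the container is up, you can query it over HTTP. The sketch below posts to TGI's generate endpoint on the port mapped above; the prompt and generation parameters are only illustrative.
# Minimal sketch of querying the running TGI container via its REST API.
# Assumes the `docker run` above mapped the server to localhost:8080.
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain what self-hosting an LLM means in one sentence.",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=120,
)
print(response.json()["generated_text"])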
You can find out more at the Huggingface TGI GitHub repository.
5. OpenLLM
OpenLLM is an open-source platform designed to facilitate the deployment and operation of large language models (LLMs) in real-world applications. With OpenLLM, you can run inference on any open-source LLM, deploy them on the cloud or on-premises, and build powerful AI applications.
Key features include:
- 🚂 State-of-the-art LLMs: Integrated support for a wide range of open-source LLMs and model runtimes, including but not limited to Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder.
- 🔥 Flexible APIs: Serve LLMs over a RESTful API or gRPC with a single command. You can interact with the model using a Web UI, CLI, Python/JavaScript clients, or any HTTP client of your choice.
- ⛓️ Freedom to build: First-class support for LangChain, BentoML, OpenAI endpoints, and Hugging Face, allowing you to easily create your own AI applications by composing LLMs with other models and services.
- 🎯 Streamline deployment: Automatically generate your LLM server Docker images or deploy as serverless endpoints via ☁️ BentoCloud, which effortlessly manages GPU resources, scales according to traffic, and ensures cost-effectiveness.
- 🤖️ Bring your own LLM: Fine-tune any LLM to suit your needs. You can load LoRA layers to fine-tune models for higher accuracy and performance on specific tasks. A unified fine-tuning API for models (LLM.tuning()) is coming soon.
- ⚡ Quantization: Run inference with lower computational and memory costs using quantization techniques such as LLM.int8, SpQR (int4), AWQ, GPTQ, and SqueezeLLM.
- 📡 Streaming: Support token streaming through server-sent events (SSE). You can use the /v1/generate_stream endpoint to stream responses from LLMs.
- 🔄 Continuous batching: Support continuous batching via vLLM for increased total throughput.
OpenLLM is designed for AI application developers working to build production-ready applications based on LLMs. It delivers a comprehensive suite of tools and features for fine-tuning, serving, deploying, and monitoring these models, simplifying the end-to-end deployment workflow for LLMs.
OpenLLM Deployment
$ docker run --rm --gpus all -p 3000:3000 -it ghcr.io/bentoml/openllm start HuggingFaceH4/zephyr-7b-beta --backend vllm
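With the server running, you can call it over HTTP. The sketch below is a rough illustration only: the /v1/generate path and the request body shape are assumptions based on the /v1/generate_stream endpoint mentioned above, so verify the exact schema against your OpenLLM version's API docs before relying on it.
# Rough sketch of calling an OpenLLM server over HTTP.
# The endpoint path and JSON body are assumptions -- check your server's
# own API documentation (e.g., its OpenAPI page) for the exact schema.
import requests

response = requests.post(
    "http://localhost:3000/v1/generate",
    json={"prompt": "Suggest a tagline for a one-person dev tools studio."},
    timeout=120,
)
print(response.json())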
You can find out more at the OpenLLM GitHub repository.
Additional Considerations for Choosing the Right Solution
Selecting the ideal self-hosted LLM stack means balancing your specific requirements, such as the models and hardware you need to support, against scalability and flexibility. Operational monitoring and telemetry are also vital, especially for businesses scaling their applications.
Conclusion
Self-hosted LLM solutions offer enhanced data security and customized scalability for solopreneurs and small tech businesses. Exploring these solutions could be a game-changer for business operations and strategic goals.
I'll be creating a series of articles diving deeper into each of these solutions. Let's continue to explore together how we can reduce third-party risk in our ventures!