
How to use open source models to reduce costs

Quick answer
Run open-weight models such as Llama 3 or Stable Diffusion locally, or on cost-effective hardware, instead of paying per-call API fees. Combine them with lightweight quantization and efficient runtimes like llama.cpp to cut compute costs while maintaining acceptable quality.

PREREQUISITES

  • Python 3.8+
  • pip install llama-cpp-python diffusers torch
  • Basic knowledge of Python and AI model inference

Set up an open source environment

Install the Python packages needed to run open source models locally: llama-cpp-python for LLMs such as Llama 3, and diffusers for image generation models such as Stable Diffusion. Make sure you have a compatible GPU or CPU setup.

bash
pip install llama-cpp-python diffusers torch
output
Collecting llama-cpp-python...
Collecting diffusers...
Successfully installed llama-cpp-python diffusers torch
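Before loading any models, it can help to confirm the runtime packages actually import; note that the pip package llama-cpp-python installs as the module llama_cpp. A minimal sanity check:

```python
import importlib.util

def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# pip's llama-cpp-python package installs as the `llama_cpp` module
missing = missing_packages(["llama_cpp", "diffusers", "torch"])
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All inference packages are importable.")
```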

Step by step usage example

Run a local LLM inference with llama-cpp-python using a quantized GGUF model to reduce memory and compute costs. This example loads a 4-bit quantized Llama 3 model and generates text from a prompt.

python
from llama_cpp import Llama
import os

# Path to a 4-bit (Q4_K_M) quantized GGUF model downloaded beforehand
model_path = os.path.expanduser('~/.models/llama-3.1-8b.Q4_K_M.gguf')

# Offload 10 transformer layers to the GPU; set n_gpu_layers=0 for CPU-only
llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=10)

prompt = "Explain how open source models reduce AI costs."
output = llm.create_chat_completion(messages=[{"role": "user", "content": prompt}], max_tokens=128)
print(output['choices'][0]['message']['content'])
output
Open source models reduce AI costs by enabling local inference without recurring API fees, leveraging efficient quantization to lower hardware requirements, and allowing customization to optimize performance for specific tasks.
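A back-of-the-envelope comparison shows where local inference pays off. The prices and volumes below are illustrative assumptions, not real vendor rates:

```python
def monthly_api_cost(tokens, price_per_million_usd):
    """Metered API cost for a monthly token volume."""
    return tokens / 1_000_000 * price_per_million_usd

# Assumed figures for illustration only
tokens_per_month = 500_000_000                         # 500M tokens/month
api_cost = monthly_api_cost(tokens_per_month, 0.60)    # $0.60 per 1M tokens (assumed)
local_cost = 120.0                                     # assumed GPU amortization + power/month

print(f"API: ${api_cost:,.2f}/mo  vs  local: ${local_cost:,.2f}/mo")
# At this volume local inference is cheaper; at low volume a metered API usually wins.
```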

Common variations

  • Use vLLM or Ollama for scalable local serving with streaming support.
  • Run image generation models like Stable Diffusion locally with diffusers to avoid cloud costs.
  • Use 4-bit or 8-bit quantization to reduce VRAM and speed up inference.
  • Combine open source models with cloud APIs for hybrid cost optimization.
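The VRAM savings from quantization mentioned above follow from simple arithmetic: weight memory scales with bits per parameter. The 20% overhead factor below is a rough assumption covering activations and KV cache:

```python
def model_memory_gb(params_billion, bits_per_weight, overhead=1.2):
    """Approximate memory for model weights at a given precision, plus overhead."""
    return params_billion * bits_per_weight / 8 * overhead

for bits in (16, 8, 4):
    print(f"8B model @ {bits}-bit: ~{model_memory_gb(8, bits):.1f} GB")
```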
To cut cloud image-generation costs the same way, run Stable Diffusion locally with diffusers. Loading float16 weights roughly halves VRAM usage versus float32.

python
from diffusers import StableDiffusionPipeline
import torch

# Load the pipeline in half precision to reduce VRAM usage
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe("A futuristic cityscape at sunset").images[0]
image.save("output.png")
print("Image saved as output.png")
output
Image saved as output.png

Troubleshooting tips

  • If you get out-of-memory errors, reduce n_gpu_layers or use smaller quantized models.
  • Ensure your GPU drivers and CUDA toolkit are up to date for best performance.
  • For CPU-only setups, expect slower inference; consider smaller models or cloud bursts.
  • Verify model files are correctly downloaded and compatible with your inference library.

Key takeaways

  • Run open source models locally to eliminate API usage fees and reduce costs.
  • Use quantized models (4-bit/8-bit) to lower hardware requirements and speed up inference.
  • Combine local inference with cloud APIs for flexible, cost-effective AI solutions.
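The hybrid approach in the last point can start as a simple routing function; local_fn and cloud_fn below are hypothetical callables standing in for your local model and a paid API:

```python
def route(prompt, local_fn, cloud_fn, max_local_chars=2000):
    """Send short prompts to the free local model, long ones to the paid API."""
    if len(prompt) <= max_local_chars:
        return local_fn(prompt)
    return cloud_fn(prompt)

# Hypothetical stand-ins for real backends
reply = route("Summarize this note.", lambda p: "local: " + p, lambda p: "cloud: " + p)
print(reply)  # short prompt, so it is handled locally
```

In practice the routing rule can also consider task complexity or latency requirements, not just prompt length.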
Verified 2026-04 · llama-3.1-8b, runwayml/stable-diffusion-v1-5