Code Advanced medium · 8 min

Spaces: demo deployment

What you will learn

Deploy a Hugging Face model to a public Spaces instance with a web interface in about 60 lines of code.

Why this matters

Turning a local model into a shareable, production-ready demo is how you get feedback from users, validate model quality on real data, and build credibility without maintaining infrastructure. Spaces handles deployment, scaling, and hosting: you focus on the model logic.

Skip if: you need real-time inference at scale (>1000 concurrent users), custom GPU allocation, or complete control over infrastructure; use AWS SageMaker, GCP Vertex AI, or Kubernetes instead. Spaces is for demos, internal tools, and proofs of concept, not production APIs.

Explanation

Hugging Face Spaces is a free hosting platform for machine learning demos that automatically pulls your code from a Git repo (or lets you write inline) and deploys it with a web interface. Mechanically: you create a requirements.txt and app.py file in a Git repo, push it to huggingface.co/spaces, and Spaces automatically installs dependencies and runs your code with Gradio or Streamlit as the UI layer. The platform handles Docker containerization, GPU allocation (if needed), and HTTPS: you never touch infrastructure. When to use: use Spaces to share models with stakeholders, gather qualitative feedback, demo a new capability, or build internal tools. It's not a replacement for production APIs because it has rate limits and no SLA, but it's perfect for getting models in front of people in minutes.
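As a rough sketch of the workflow described above, the two files can also be assembled and pushed from Python rather than the Git command line. This assumes the `huggingface_hub` package and an already configured access token. The helper names `build_space_files` and `push_to_space` and the repo id `your-username/gpt2-demo` are hypothetical, and the unpinned dependency list simply mirrors the packages named in this lesson.

```python
def build_space_files(app_source: str) -> dict[str, str]:
    """Return the minimal file set for a Gradio Space, keyed by path in the repo."""
    requirements = "\n".join(["gradio", "transformers", "torch", "accelerate"]) + "\n"
    return {"app.py": app_source, "requirements.txt": requirements}


def push_to_space(repo_id: str, files: dict[str, str]) -> None:
    """Create the Space if needed and upload each file (needs network access and a token)."""
    # Imported lazily so the file-building step runs without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="gradio", exist_ok=True)
    for path_in_repo, content in files.items():
        api.upload_file(
            path_or_fileobj=content.encode("utf-8"),
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            repo_type="space",
        )


files = build_space_files(app_source="import gradio as gr\n")
print(sorted(files))
# To deploy (hypothetical repo id): push_to_space("your-username/gpt2-demo", files)
```

Keeping the upload in a separate function means the file contents can be checked locally before anything is sent to the Hub.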

Analogy

Spaces is like Heroku for machine learning: you push code to a Git repo, declare dependencies, and the platform figures out the rest. The difference is that Spaces is free for public demos and comes with a UI builder built in.

Code

python

import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

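model_name = "gpt2"

# float16 halves memory on a GPU; CPU inference is safer in float32
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # needs the `accelerate` package installed
    torch_dtype=dtype
)

# device_map already placed the model, so do not pass `device` here:
# a model dispatched by accelerate cannot be moved to another device.
text_generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)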

def generate_text(prompt, max_length):
    # max_length counts the prompt tokens plus the newly generated tokens
    try:
        result = text_generator(
            prompt,
            max_length=int(max_length),
            num_return_sequences=1,
            do_sample=True,
            temperature=0.7,
            top_p=0.95
        )
        return result[0]["generated_text"]
    except Exception as e:
        return f"Error: {str(e)}"

with gr.Blocks(title="GPT-2 Text Generator") as demo:
    gr.Markdown("# GPT-2 Text Generator")
    gr.Markdown("Enter a prompt and watch GPT-2 generate text.")
    
    with gr.Row():
        with gr.Column():
            prompt_input = gr.Textbox(
                label="Prompt",
                placeholder="Once upon a time",
                lines=3
            )
            max_length_input = gr.Slider(
                label="Max Length",
                minimum=20,
                maximum=200,
                step=10,
                value=100
            )
            submit_btn = gr.Button("Generate")
        
        with gr.Column():
            output = gr.Textbox(label="Generated Text", lines=6)
    
    submit_btn.click(
        fn=generate_text,
        inputs=[prompt_input, max_length_input],
        outputs=output
    )

if __name__ == "__main__":
    demo.launch(share=False)

Output

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

(Gradio interface launches in browser with text input, slider, and button. Clicking "Generate" with prompt "Once upon a time" produces GPT-2 output like: "Once upon a time, there was a young boy who lived in a small village...")

What just happened?

The code loaded the GPT-2 model and tokenizer into memory, wrapped them in a text-generation pipeline, created a Gradio interface with input fields (prompt text box, max_length slider, submit button) and an output field, and launched a local web server. When you click Generate, the prompt goes through the pipeline, and the generated text appears in the output box. The `.launch()` call starts the server; it also generates a public shareable link, but only if you pass `share=True`.

Common gotcha

Developers often forget that Spaces has memory and timeout limits: if your model is too large (>10GB for free tier) or inference takes >60 seconds, Spaces will crash the container. Profile inference time locally before deploying, and on GPU hardware quantize your model with `BitsAndBytesConfig`. Also, `device_map='auto'` uses whatever hardware it finds: your local machine may have a GPU while a free Space is CPU-only, so test in a CPU-only environment first.
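One way to profile inference time locally is a small wall-clock timer around the generation call. This is a minimal sketch: `time_inference` and `fake_generate` are hypothetical names, `fake_generate` stands in for the real pipeline call so the snippet runs without a model, and the 60-second figure is the limit quoted in this lesson.

```python
import time
from statistics import mean


def time_inference(fn, *args, repeats=3, **kwargs):
    """Call fn several times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations


def fake_generate(prompt, max_length):
    # Stand-in for the real pipeline call so the sketch runs anywhere.
    time.sleep(0.01)
    return prompt


TIMEOUT_SECONDS = 60  # limit quoted in this lesson
durations = time_inference(fake_generate, "Once upon a time", max_length=100)
worst = max(durations)
print(f"mean {mean(durations):.3f}s, worst {worst:.3f}s, within limit: {worst < TIMEOUT_SECONDS}")
```

Swapping `fake_generate` for the lesson's `generate_text` gives timings for the real model; the worst case matters more than the mean, because a single slow call is enough to hit the limit.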

Error recovery

OutOfMemoryError

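Your model is too large for the Space's hardware. The free tier is CPU-only (CPU basic: 2 vCPU, 16 GB RAM); GPUs such as the T4 (16 GB VRAM) are paid upgrades. On a GPU Space, use transformers' quantization: `BitsAndBytesConfig(load_in_8bit=True)`. On any tier, switching to a smaller model like `distilgpt2` also helps.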
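A back-of-the-envelope estimate can show whether quantization is likely to help before you deploy. The sketch below counts weight memory only (activations and the generation cache come on top) and assumes roughly 124 million parameters for the smallest GPT-2 checkpoint; `weight_memory_gb` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


GPT2_PARAMS = 124_000_000  # approximate size of the smallest GPT-2 checkpoint

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(GPT2_PARAMS, dtype):.3f} GB")
```

For a model this small the weights fit comfortably at any precision; the same arithmetic applied to a multi-billion-parameter model is what makes 8-bit loading worthwhile.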

RuntimeError: CUDA out of memory

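Several inference calls may be running in parallel. Call `demo.queue(default_concurrency_limit=1)` before `launch()`, or pass `concurrency_limit=1` to the `.click()` listener, so requests run one at a time. Reducing `max_length` in the pipeline call also lowers peak memory.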
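Independent of Gradio's own queue settings, the idea of serializing requests can be illustrated with a plain lock from the standard library. This is a swapped-in technique rather than a Gradio API: `serialized` is a hypothetical decorator, and `fake_generate` stands in for the model call.

```python
import threading
import time

_inference_lock = threading.Lock()


def serialized(fn):
    """Wrap fn so that only one call executes at a time."""
    def wrapper(*args, **kwargs):
        with _inference_lock:
            return fn(*args, **kwargs)
    return wrapper


# Bookkeeping to observe how many calls overlap.
_counter_lock = threading.Lock()
active = 0
peak = 0


@serialized
def fake_generate(prompt):
    """Stand-in for the model call; records the peak number of concurrent calls."""
    global active, peak
    with _counter_lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)
    with _counter_lock:
        active -= 1
    return prompt.upper()


threads = [threading.Thread(target=fake_generate, args=(f"p{i}",)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak)  # never more than one call inside the locked section
```

A lock keeps GPU memory use to one request's worth at the cost of making other callers wait, which is usually the right trade for a demo.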

FileNotFoundError: requirements.txt

Spaces cannot find your dependencies. Make sure `requirements.txt` is in the root of the repo and contains `gradio`, `transformers`, `torch`, and `accelerate`. Push the file to Git before deploying.

HTTPError 401 when pushing to Spaces

You need to authenticate with Hugging Face. Run `huggingface-cli login`, paste your token from huggingface.co/settings/tokens, then push to the Spaces repo.

Experienced dev note

The biggest mistake is treating Spaces like a production API. It's not. Rate limits are aggressive (100 requests/hour on free tier), inference has a 60-second timeout, and GPU allocation is not guaranteed. Use Spaces to validate your model idea and gather user feedback, then move to a proper inference service (Replicate, Together.ai, or your own k8s cluster) for production load. Also: Gradio's `share=True` links are temporary (72 hours in older Gradio releases, one week in recent ones), so don't rely on them for permanent URLs: always use the huggingface.co/spaces URL instead.

Check your understanding

You deploy a model to Spaces and it works fine locally, but on the public Spaces instance, inference suddenly times out for prompts longer than 50 tokens. What's the most likely cause, and how would you fix it?

Show answer hint

The free Spaces tier has a 60-second timeout per inference call. Longer sequences take more time. The fix is either reduce `max_length`, use a quantized model (faster inference), or switch to a GPU-tier Space (paid). Don't just add a longer timeout: that won't help on Spaces.
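To pick a safe `max_length`, you can work backwards from measured throughput. The sketch below is plain arithmetic under stated assumptions: the 2.5 tokens/second figure is a hypothetical measurement, the 60-second budget is the limit quoted above, and `max_tokens_within_budget` with its `safety_factor` is an illustrative helper, not a library function.

```python
import math


def max_tokens_within_budget(tokens_per_second: float,
                             budget_seconds: float = 60.0,
                             safety_factor: float = 0.8) -> int:
    """Largest number of generated tokens expected to finish inside the time budget."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return math.floor(tokens_per_second * budget_seconds * safety_factor)


# Hypothetical measurement: 2.5 tokens/second on a CPU-only machine.
limit = max_tokens_within_budget(2.5)
print(limit)  # 120
```

The result can serve as the slider's `maximum`, so users cannot request a generation that is likely to time out.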

VERSION `device_map='auto'` depends on the `accelerate` package: it must be installed (and listed in `requirements.txt`), and it wasn't reliable in transformers < 4.30. In transformers 5.5.x (April 2026), `device_map='auto'` is the standard way to place a model. Gradio 4.x+ (required for modern Spaces) changed the queueing API: the global `concurrency_count` gave way to a per-event `concurrency_limit`. The `.click(fn=..., inputs=..., outputs=...)` syntax shown here is unchanged, and `.then()` chaining is still available for multi-step events.

Next, learn how to optimize inference latency on Spaces by quantizing your model with `BitsAndBytesConfig` and profiling with `torch.profiler` to handle cold starts and keep response times under 5 seconds.
