Spaces: demo deployment
Why this matters
Turning a local model into a shareable demo is how you get feedback from users, validate model quality on real data, and build credibility without maintaining infrastructure. Spaces handles deployment, scaling, and hosting, so you can focus on the model logic.
Explanation
Hugging Face Spaces is a hosting platform for machine learning demos, with a free tier. Each Space is a Git repository: you push code to it (or edit the files in the browser), and Spaces builds and deploys the app with a web interface.

Mechanically: you create an app.py and a requirements.txt in the Space's repo under huggingface.co/spaces, and on each push Spaces installs the dependencies and runs your code with Gradio or Streamlit as the UI layer. The platform handles Docker containerization, GPU allocation (if needed), and HTTPS, so you never touch infrastructure.

When to use: use Spaces to share models with stakeholders, gather qualitative feedback, demo a new capability, or build internal tools. It's not a replacement for production APIs because it has rate limits and no SLA, but it's well suited to getting models in front of people in minutes.
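The push step can also be scripted instead of done with Git by hand. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and you are authenticated with a write-access token. The helper names (`write_space_files`, `push_to_space`), the unpinned `REQUIREMENTS` list, and any repo id you pass in are illustrative assumptions, not part of the original lesson.

```python
from pathlib import Path

# Assumed dependency list for the demo in this lesson; pin versions as needed.
REQUIREMENTS = ["gradio", "transformers", "torch", "accelerate"]


def write_space_files(folder: Path, app_source: str) -> list[str]:
    """Write app.py and requirements.txt into folder; return the file names."""
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "app.py").write_text(app_source, encoding="utf-8")
    (folder / "requirements.txt").write_text(
        "\n".join(REQUIREMENTS) + "\n", encoding="utf-8"
    )
    return sorted(p.name for p in folder.iterdir())


def push_to_space(folder: Path, repo_id: str) -> None:
    """Create the Space if it does not exist, then upload the folder to it."""
    # Imported lazily so write_space_files works without huggingface_hub.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(
        repo_id=repo_id, repo_type="space", space_sdk="gradio", exist_ok=True
    )
    api.upload_folder(folder_path=str(folder), repo_id=repo_id, repo_type="space")
```

A call such as `push_to_space(Path("my_space"), "your-username/gpt2-demo")` (a hypothetical repo id) would upload the two files, after which Spaces builds and starts the app.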
Analogy
Spaces is like Heroku for machine learning: you push code to a Git repo, declare dependencies, and the platform figures out the rest. The difference is that Spaces is free for public demos and comes with a UI builder built in.
Code
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "gpt2"

# float16 is only reliable on a GPU; fall back to float32 on CPU.
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # needs the accelerate package; uses a GPU if present
    torch_dtype=dtype,
)

# device_map has already placed the model, so no device argument is passed
# here: combining the two conflicts.
text_generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)


def generate_text(prompt, max_length):
    """Generate a continuation; max_length counts prompt plus new tokens."""
    try:
        result = text_generator(
            prompt,
            max_length=int(max_length),  # the slider may deliver a float
            num_return_sequences=1,
            do_sample=True,
            temperature=0.7,
            top_p=0.95,
        )
        return result[0]["generated_text"]
    except Exception as e:
        # Show the error in the output box instead of crashing the app.
        return f"Error: {e}"


with gr.Blocks(title="GPT-2 Text Generator") as demo:
    gr.Markdown("# GPT-2 Text Generator")
    gr.Markdown("Enter a prompt and watch GPT-2 generate text.")
    with gr.Row():
        with gr.Column():
            prompt_input = gr.Textbox(
                label="Prompt",
                placeholder="Once upon a time",
                lines=3,
            )
            max_length_input = gr.Slider(
                label="Max Length",
                minimum=20,
                maximum=200,
                step=10,
                value=100,
            )
            submit_btn = gr.Button("Generate")
        with gr.Column():
            output = gr.Textbox(label="Generated Text", lines=6)
    # Event wiring must happen inside the Blocks context.
    submit_btn.click(
        fn=generate_text,
        inputs=[prompt_input, max_length_input],
        outputs=output,
    )

if __name__ == "__main__":
    demo.launch(share=False)

Output
Running on local URL: http://127.0.0.1:7860
(With share=True, Gradio also prints a line of the form "Running on public URL: https://xxxxx.gradio.live".)
(The Gradio interface launches in the browser with a text input, slider, and button. Clicking "Generate" with the prompt "Once upon a time" produces GPT-2 output like: "Once upon a time, there was a young boy who lived in a small village...")
What just happened?
The code loaded the GPT-2 model and tokenizer into memory, wrapped them in a text-generation pipeline, created a Gradio interface with input fields (a prompt text box, a max_length slider, and a submit button) and an output field, and launched a local web server. When you click Generate, the prompt goes through the pipeline, and the generated text appears in the output box. The `.launch()` call starts the server and, if you pass `share=True`, also generates a public shareable link.
Common gotcha
Developers often forget that Spaces has memory and timeout limits: if your model is too large for the hardware (more than roughly 10 GB on the free tier) or inference takes longer than 60 seconds, the container can crash or the request can time out. Quantize your model with `BitsAndBytesConfig` (which generally needs a CUDA GPU) or pick a smaller checkpoint, and profile inference time locally before deploying. Also, `device_map='auto'` works locally, but a Space may not have the same hardware as your machine (the free tier is CPU-only), so test in a CPU-only environment first.
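Profiling before deploying can be as simple as timing a few calls and comparing the result to the limit. The sketch below uses only the standard library. The names `time_inference` and `fits_budget` are hypothetical, the 60-second budget is the figure quoted above, and the `safety_factor` of 2 is an assumed margin for slower shared hardware rather than a documented value.

```python
import time
from statistics import median


def time_inference(fn, prompt, runs=3):
    """Return the median wall-clock seconds of fn(prompt) over several runs."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(prompt)
        durations.append(time.perf_counter() - start)
    return median(durations)


def fits_budget(seconds, budget_s=60.0, safety_factor=2.0):
    """True if the measured time, scaled by a safety margin, stays in budget."""
    return seconds * safety_factor <= budget_s
```

With the app above, something like `fits_budget(time_inference(lambda p: generate_text(p, 200), "Once upon a time"))` gives a rough go/no-go signal. Run it on a CPU-only machine if the Space will run on CPU.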
Error recovery
OutOfMemoryError
RuntimeError: CUDA out of memory
FileNotFoundError: requirements.txt
HTTPError 401 when pushing to Spaces
Experienced dev note
The biggest mistake is treating Spaces like a production API. It's not. Rate limits are aggressive (100 requests/hour on free tier), inference has a 60-second timeout, and GPU allocation is not guaranteed. Use Spaces to validate your model idea and gather user feedback, then move to a proper inference service (Replicate, Together.ai, or your own Kubernetes cluster) for production load. Also, a Gradio `share=True` link is temporary (it expires after 72 hours or one week, depending on the Gradio version), so don't rely on it as a permanent URL. Always use the huggingface.co/spaces URL instead.
Check your understanding
You deploy a model to Spaces and it works fine locally, but on the public Spaces instance, inference suddenly times out for prompts longer than 50 tokens. What's the most likely cause, and how would you fix it?
Answer hint
The free Spaces tier has a 60-second timeout per inference call, and longer sequences take more time to generate. The fix is to reduce `max_length`, use a quantized model (faster inference), or switch to a GPU-tier Space (paid). Don't just add a longer timeout: that won't help on Spaces.
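One way to apply the first fix is to enforce a server-side cap on `max_length`, so that no request can ask for more tokens than the hardware can produce in time. This is a minimal sketch: the helper name `clamp_max_length` and the default cap of 120 are illustrative assumptions, and the right cap depends on timings measured on the target hardware.

```python
def clamp_max_length(requested, hard_cap=120, floor=20):
    """Clamp a requested max_length into the range [floor, hard_cap]."""
    return max(floor, min(int(requested), hard_cap))
```

Inside `generate_text`, passing `max_length=clamp_max_length(max_length)` to the pipeline would apply the cap regardless of where the slider is set.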