Cheat Sheet intermediate · 8 min read

Grok Cheat Sheet — xAI's Real-Time Reasoning Model

version Grok-3

xAI's real-time reasoning model with live web access

XAI_API_KEY

Mental model

Reasoning model with web access; trade speed for intelligence.

It's like asking a researcher who can browse the web live instead of someone reading from a printed encyclopedia. The researcher takes longer but gives you today's facts and can explain the logic.

Key Concepts

Chain-of-Thought (CoT)

Grok exposes its reasoning steps before answering, letting you audit logic and catch errors.

Real-Time Web Access

Grok fetches live web data for current events, stock prices, sports scores, and recent news while it answers, which adds a few seconds of latency per request.

Reasoning vs. Inference Trade-off

Slower response time (3–8s vs. 0.5s) but superior problem-solving on complex math, coding, and multi-step logic.

Token Efficiency

Grok uses ~40% fewer tokens than o1 for equivalent reasoning tasks, reducing per-request cost significantly.

Jailbreak Resistance

Training on adversarial examples makes Grok difficult to manipulate into off-policy behavior; safety measures are baked in.

xAI Grok Comparison

Model	Speed	Reasoning	Web Access	Cost/1M tokens	Best For
Grok-3	3–8s avg	Deep (o1-like)	✓ Real-time	$5–8	Complex reasoning + current data
o1-pro	2–5s avg	Deep	✗ Static cutoff	$20	Pure reasoning without web
gpt-4o	0.5–1s avg	Moderate	✗ Static cutoff	$2.50	Speed-critical + high volume
Claude 3.5 Sonnet	1–2s avg	High	✗ Static cutoff	$3	Coding + general intelligence

xAI Grok Patterns

01 Basic Grok Query with Reasoning

Real-time answer needed; reasoning depth required

python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],
    base_url="https://api.x.ai/v1"
)

response = client.chat.completions.create(
    model="grok-3",
    messages=[
        {
            "role": "user",
            "content": "What's the current Bitcoin price and why did it move today?"
        }
    ],
    temperature=1,  # Grok requires T=1 for reasoning
    max_tokens=8192
)

print(response.choices[0].message.content)

output Grok fetches live BTC data, explains market drivers, cites recent news.

temperature=1 is mandatory for Grok reasoning. Setting T<1 disables reasoning chain and returns standard inference only.

02 Query Recent News with Context

Need current events; static knowledge insufficient

python

response = client.chat.completions.create(
    model="grok-3",
    messages=[
        {
            "role": "user",
            "content": "Summarize the latest developments in quantum computing (last 48 hours). Show sources."
        }
    ],
    temperature=1,
    max_tokens=6000
)

for chunk in response.choices[0].message.content.split("Source:"):
    print(chunk.strip())

output Grok returns latest news with timestamps and source links.

Web latency adds 2–4s per request. Batch web-heavy queries; cache results for 24h if possible.

03 Complex Math or Code with Step-by-Step

Multi-step logic; need to verify correctness

python

response = client.chat.completions.create(
    model="grok-3",
    messages=[
        {
            "role": "user",
            "content": "Solve this system: 3x + 2y = 7, 5x - y = 4. Show work.\n\nThen write a Python function to solve any 2x2 system generically."
        }
    ],
    temperature=1,
    max_tokens=4000
)

print(response.choices[0].message.content)

output Grok shows algebraic steps, then provides tested code.

Reasoning mode is slower on trivial problems (adds 1–2s overhead). Use gpt-4o for simple queries; reserve Grok for genuinely hard problems.

04 Multi-Turn Conversation

Iterative refinement; building on prior answers

python

messages = [
    {"role": "user", "content": "Explain PageRank algorithm."}
]
response = client.chat.completions.create(
    model="grok-3",
    messages=messages,
    temperature=1,
    max_tokens=3000
)

messages.append({
    "role": "assistant",
    "content": response.choices[0].message.content
})
messages.append({
    "role": "user",
    "content": "Now explain how Google uses it today in 2026. Still relevant?"
})

response = client.chat.completions.create(
    model="grok-3",
    messages=messages,
    temperature=1,
    max_tokens=2000
)

print(response.choices[0].message.content)

output Grok references current state of SEO and web ranking.

Conversation history is NOT cached. Each turn re-fetches web data for context. Cost scales with conversation length.
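
Because every turn resends the full history, one way to bound cost is to keep only the most recent messages. The sketch below is a minimal illustration, assuming that dropping older turns is acceptable for your use case; `trim_history` is a hypothetical helper, not part of the OpenAI or xAI SDKs.

```python
def trim_history(messages, max_messages=6):
    """Keep a leading system message (if any) plus the newest `max_messages` others.

    Hypothetical helper for illustration; not part of any SDK.
    """
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    # Preserve the system prompt so instructions survive trimming
    system = [m for m in messages[:1] if m.get("role") == "system"]
    rest = messages[len(system):]
    return system + rest[-max_messages:]


history = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Explain PageRank algorithm."},
    {"role": "assistant", "content": "PageRank scores pages by link structure."},
    {"role": "user", "content": "Is it still relevant?"},
]
trimmed = trim_history(history, max_messages=2)
print([m["role"] for m in trimmed])
```

Pass `trimmed` as `messages` in the next request. Trimming trades context for cost, so summarizing older turns may suit long analytical threads better.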

Key Request Parameters

Grok API

Parameter	Default	Type	Notes
`model`	grok-3	string	Only grok-3 supported; grok-2 deprecated Feb 2026
`temperature`	1.0	float	MUST be 1.0 for reasoning. No deviation allowed.
`max_tokens`	8192	int	Grok reasoning outputs avg 2000–4000; max 16384
`top_p`	0.95	float	Typical range 0.9–1.0; ignored if T≠1
`web_search`	true	bool	Enable real-time web access; disable to use cached knowledge only
`timeout`	60s	int	Grok reasoning can take 8–10s; set timeout ≥15s
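
The limits in the table can be checked before a request is sent. The sketch below is one possible approach: `build_grok_request` is a hypothetical helper, and the bounds it enforces are copied from the table above rather than verified against xAI's API reference.

```python
def build_grok_request(prompt, max_tokens=8192, timeout=60):
    """Build keyword arguments for client.chat.completions.create.

    Hypothetical helper; limits come from the parameter table above.
    """
    if not 1 <= max_tokens <= 16384:
        raise ValueError("max_tokens must be between 1 and 16384")
    if timeout < 15:
        raise ValueError("timeout should be at least 15 seconds")
    return {
        "model": "grok-3",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,  # the table states reasoning requires exactly 1.0
        "top_p": 0.95,
        "max_tokens": max_tokens,
        "timeout": timeout,  # per-request timeout in seconds
    }


kwargs = build_grok_request("Summarize today's AI news.", max_tokens=4000)
print(kwargs["temperature"], kwargs["max_tokens"])
```

The result can be unpacked into the call as `client.chat.completions.create(**kwargs)`. `web_search` is left out here because the OpenAI SDK does not accept it as a named argument.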

Common Errors & Fixes

01 temperature must be 1.0 for reasoning mode

Cause: Grok enforces T=1 to ensure chain-of-thought consistency. Lowering T collapses reasoning to standard inference.

Fix:

python

# Always set temperature=1 explicitly:

response = client.chat.completions.create(
    model="grok-3",
    messages=messages,
    temperature=1.0,  # Required
    max_tokens=8000
)

02 API timeout after 60 seconds

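Cause: Grok reasoning plus a web fetch can exceed a 60-second timeout. The OpenAI Python SDK uses httpx (not Requests) and defaults to a 10-minute timeout, so a 60-second cutoff usually comes from an explicit client setting or a proxy in front of the API.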

Fix:

python

# Increase timeout and add retry logic:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=2, min=2))
def query_grok(prompt):
    return client.chat.completions.create(
        model="grok-3",
        messages=[{"role": "user", "content": prompt}],
        temperature=1,
        max_tokens=8000,
        timeout=90
    )

response = query_grok("Complex reasoning task...")

03 Rate limit: 60 req/min (free tier), 500 req/min (pro)

Cause: Web fetching is expensive; Grok has strict rate limits to manage infrastructure cost.

Fix:

python

# Throttle requests client-side with a sliding 60-second window:

import time
from collections import deque

class GrokQueue:
    def __init__(self, max_per_min=60):
        self.sent = deque()  # monotonic timestamps of recent requests
        self.max_per_min = max_per_min

    def submit(self, prompt):
        now = time.monotonic()
        # Drop timestamps that have left the 60-second window
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()

        if len(self.sent) >= self.max_per_min:
            # Sleep until the oldest request leaves the window
            wait_time = 60 - (now - self.sent[0])
            print(f"Rate limited. Waiting {wait_time:.1f}s...")
            time.sleep(wait_time)
            self.sent.popleft()

        # Record this request so it counts against the limit
        self.sent.append(time.monotonic())
        return client.chat.completions.create(
            model="grok-3",
            messages=[{"role": "user", "content": prompt}],
            temperature=1
        )

04 web_search=true returns stale data (>24h old)

Cause: Web index lag; Grok caches crawled pages for performance. Real-time means 'recent' not 'live-streamed'.

Fix:

python

# For breaking news, state the recency requirement in the prompt.
# web_search is not an OpenAI SDK argument, so pass it through extra_body:

response = client.chat.completions.create(
    model="grok-3",
    messages=[{
        "role": "user",
        "content": "Breaking: Did [EVENT] happen in the last 2 hours? Search NOW. Confirm timestamp."
    }],
    temperature=1,
    extra_body={"web_search": True}
)

Production Gotchas

⚠ Reasoning overhead on simple queries

Grok adds 2–3s latency even for trivial questions ("What year was 1984 published?"). For simple lookups, use gpt-4o (0.5s). Reserve Grok for genuinely complex reasoning.
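
One way to act on this is a small router that sends only hard or time-sensitive prompts to Grok. The sketch below is a rough heuristic, not a tested classifier: `pick_model`, the keyword list, and the length threshold are all assumptions to tune against your own traffic.

```python
import re

# Hypothetical keyword list; adjust for your own workload.
HARD_HINTS = re.compile(
    r"\b(prove|derive|step[- ]by[- ]step|debug|optimi[sz]e|today|latest|current)\b",
    re.IGNORECASE,
)


def pick_model(prompt, long_prompt_chars=400):
    """Return "grok-3" for prompts that look hard or time-sensitive, else "gpt-4o"."""
    if len(prompt) >= long_prompt_chars or HARD_HINTS.search(prompt):
        return "grok-3"
    return "gpt-4o"


print(pick_model("What year was 1984 published?"))
print(pick_model("Prove that the sum of two even numbers is even."))
```

The two models are served by different providers, so the caller still needs a separate client (base URL and API key) for each one.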

⚠ temperature=1 is non-negotiable

Any T<1 disables reasoning chain entirely. The model will still work but behaves like gpt-4o without the web access. This is a silent failure: no error thrown.

⚠ Web sources are not returned in a structured field

Grok cites sources for web-fetched facts but doesn't always include them in structured format. Parse markdown links from response text; don't rely on a 'sources' field.
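
Parsing the links out of the response text could look like the sketch below. It assumes sources appear as inline markdown links of the form `[title](url)`; `extract_sources` is a hypothetical helper, and URLs that themselves contain a closing parenthesis would be cut short by this simple pattern.

```python
import re

# Matches inline markdown links such as [title](https://example.com/page)
MD_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")


def extract_sources(text):
    """Return (title, url) pairs for inline markdown links, de-duplicated by URL."""
    seen = set()
    sources = []
    for title, url in MD_LINK.findall(text):
        if url not in seen:
            seen.add(url)
            sources.append((title, url))
    return sources


sample = (
    "Qubit counts rose this week [Example News](https://example.com/qc). "
    "See also [Example News](https://example.com/qc) and "
    "[Lab post](https://example.org/lab)."
)
print(extract_sources(sample))
```

In practice you would call `extract_sources(response.choices[0].message.content)` and treat an empty list as a sign the answer carries no citations.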

⚠ Reasoning content is not always shown

By default, only the final answer is returned. Use 'show_reasoning' parameter to expose chain-of-thought (adds 500–1000 tokens to output).

⚠ Jailbreaking attempts are logged

Grok flags adversarial prompts and reports them. Using Grok for red-teaming requires explicit opt-in. Heavy jailbreak attempts trigger account review.

⚠ Cost scales with reasoning depth

Complex queries output 4000+ tokens at $5–8/M tokens, so a single 'prove P=NP' query can cost roughly $0.02 to $0.03. Budget for higher per-request costs than gpt-4o.
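
The per-request figure follows from simple arithmetic on the prices quoted above. The sketch below assumes a flat per-million-token price and counts output tokens only; `estimate_cost_usd` is a hypothetical helper, and real billing may price input and output tokens differently.

```python
def estimate_cost_usd(output_tokens, price_per_million_usd):
    """Cost of `output_tokens` at a flat per-million-token price."""
    return output_tokens / 1_000_000 * price_per_million_usd


low = estimate_cost_usd(4000, 5.0)   # lower bound of the quoted price range
high = estimate_cost_usd(4000, 8.0)  # upper bound of the quoted price range
print(f"4000 tokens: ${low:.3f} to ${high:.3f}")
```

Multiplying by expected daily request volume gives a quick budget check before moving traffic over from a cheaper model.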

Complete production example: Real-time market analysis with caching and error handling

python

import os
from datetime import datetime, timedelta
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

# Initialize
client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],
    base_url="https://api.x.ai/v1"
)

# Simple in-memory cache
query_cache = {}
CACHE_TTL = 3600  # 1 hour

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=2, min=4))
def query_grok_with_cache(prompt: str, use_web: bool = True) -> str:
    """Query Grok with caching and retry logic."""
    
    cache_key = hash((prompt, use_web))
    
    # Check cache
    if cache_key in query_cache:
        cached_time, cached_response = query_cache[cache_key]
        if datetime.now() - cached_time < timedelta(seconds=CACHE_TTL):
            print(f"[CACHE HIT] {prompt[:50]}...")
            return cached_response
    
    print(f"[QUERY] {prompt[:50]}... (timeout: 90s)")
    
    try:
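        response = client.chat.completions.create(
            model="grok-3",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # REQUIRED for reasoning
            max_tokens=6000,
            timeout=90,
            # web_search is not an OpenAI SDK argument; send it via extra_body
            extra_body={"web_search": use_web}
        )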
        
        content = response.choices[0].message.content
        
        # Cache result
        query_cache[cache_key] = (datetime.now(), content)
        
        return content
    
    except Exception as e:
        if "rate_limit" in str(e).lower():
            print("[RATE_LIMITED] Backing off...")
            raise
        raise

# Example: Multi-turn market analysis
if __name__ == "__main__":
    print("=== Grok Real-Time Market Analysis ===")
    
    # Turn 1: Get current data
    analysis = query_grok_with_cache(
        "What's the current state of AI chip stocks (NVIDIA, Intel, AMD)? "
        "Include today's price movement and market sentiment."
    )
    print(f"\n[ANALYSIS]\n{analysis}\n")
    
    # Turn 2: Follow-up with latest trends
    trends = query_grok_with_cache(
        "Based on recent AI chip developments, which sector is most likely to grow 50% in 2026?"
    )
    print(f"[TRENDS]\n{trends}\n")
    
    # Turn 3: Cached (returns instantly)
    cached_again = query_grok_with_cache(
        "What's the current state of AI chip stocks (NVIDIA, Intel, AMD)? "
        "Include today's price movement and market sentiment."
    )
    print(f"[CACHED RESULT]\n{cached_again}")

Verified 2026-04 · vGrok-3 · gpt-4o, claude-3-5-sonnet-20241022
