What Is the Max Tokens Parameter?
When you're building with LLM APIs, one question comes up constantly: how do you get responses that are long enough to be useful but short enough to be affordable?
Max tokens and stop sequences are your two main tools for this job. Max tokens sets a numerical limit on output length. Stop sequences tell the model to quit generating when it hits certain text patterns. Together, they give you fine-grained control over response length on every API call.
The max tokens parameter sets the maximum number of tokens the model can generate in a single response. Once the model produces that many tokens, it stops, even if the answer isn't complete.
A few key points to understand:
This only counts output tokens. Your input prompt has its own token count, and the two are separate. The model's context window defines how much total content (input plus output) the model can handle at once. Max tokens specifically caps the output portion.
Models have their own maximum output limits. GPT-5.2 maxes out at 128,000 output tokens. Claude Sonnet 4.5 supports up to 64,000 output tokens (128K with beta header). Gemini 2.5 Pro caps at 64,000. These are hard ceilings: setting max_tokens higher than the model allows won't buy you longer output, and depending on the provider the request may be rejected with a validation error rather than silently clamped.
The model might stop before hitting your limit. If the model reaches what it considers a natural ending point, it will stop generating even if you've set max_tokens to 10,000. The parameter is a cap, not a target.
Here's what a basic API call looks like with max_tokens set:
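import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,  # hard cap on output tokens for this call
    messages=[
        {"role": "user", "content": "Explain how neural networks learn."}
    ],
)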
In this case, Claude will generate up to 1,024 tokens. If the explanation needs more, it gets cut off.
How Max Tokens Affects Your Costs
Every token in an API response costs money. OpenAI, Anthropic, and Google all charge per token for both input and output, with output tokens typically costing more.
For Claude Sonnet 4.5, output tokens cost $15 per million tokens. For GPT-5.2, it's $14 per million. These numbers add up fast when you're making thousands of API calls.
If you know your use case only needs short answers (like a customer support bot giving quick responses), setting a lower max token value directly reduces your bill. Conversely, setting it too low can truncate AI responses and frustrate users.
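To make the arithmetic concrete, here is a minimal sketch of a worst-case output cost estimate. The $15 per million figure is the Claude Sonnet 4.5 output price quoted above; the function name and the traffic numbers are illustrative assumptions, and a real bill also includes input tokens.

```python
def worst_case_output_cost(calls: int, max_tokens: int, usd_per_million: float) -> float:
    """Upper bound on output spend if every call generates exactly max_tokens."""
    return calls * max_tokens * usd_per_million / 1_000_000


# Hypothetical traffic: 10,000 calls per day at $15 per million output tokens.
short_cap = worst_case_output_cost(10_000, 300, 15.0)
long_cap = worst_case_output_cost(10_000, 4_000, 15.0)

print(f"max_tokens=300:  at most ${short_cap:.2f} per day")
print(f"max_tokens=4000: at most ${long_cap:.2f} per day")
```

These are ceilings, not forecasts: responses that end early cost less, so the cap mainly bounds how bad a day can get.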
Batch processing can also help with cost efficiency. Asynchronous batch APIs, where you submit many requests at once and collect the results later, are commonly discounted, often by around 50% compared to individual real-time requests.
What Are Stop Sequences?
Stop sequences are strings that tell the model to immediately stop generating text when encountered. Unlike max tokens, which is a blunt numerical limit, stop sequences give you pattern-based control.
Most APIs let you specify multiple stop sequences (OpenAI allows up to 4). When the model generates any of these strings, it halts output right there.
Here's an example:
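from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        # Seed the dialogue so the model's next turn is the agent's reply
        {"role": "user", "content": "Continue this conversation with the agent's next reply.\n\nCustomer: My order hasn't arrived.\nAgent:"},
    ],
    # Stop before the model starts writing the customer's next turn
    stop=["Customer:", "\n\n\n"],
)
In this case, the model writes the support agent's reply and stops as soon as it is about to write "Customer:" or hits three consecutive newlines, which keeps it from continuing the conversation beyond one turn. The prompt matters here: if you simply asked for a whole conversation, the output would likely open with "Customer:" and the stop sequence would fire immediately, leaving you with an empty response.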
The stop sequence itself typically doesn't appear in the output. The model cuts off right before generating it. If you need the sequence in your final output, you'll have to append it manually.
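The cut-off behavior is easy to see in a local simulation. The sketch below is not any provider's implementation; `apply_stop_sequences` is a hypothetical helper that mimics the behavior described here (truncate before the earliest match, leave the marker out), and the sample text is made up.

```python
from typing import Optional


def apply_stop_sequences(text: str, stop: list[str]) -> tuple[str, Optional[str]]:
    """Cut `text` just before the earliest stop sequence.

    Returns the truncated text and the stop sequence that fired (or None).
    """
    cut, matched = len(text), None
    for seq in stop:
        pos = text.find(seq)
        if pos != -1 and pos < cut:
            cut, matched = pos, seq
    return text[:cut], matched


# What the model might have written with no stop sequences (made-up text).
raw = "Agent: Sorry about that! A replacement ships today.\nCustomer: Thanks!"
output, matched = apply_stop_sequences(raw, ["Customer:", "\n\n\n"])

print(repr(output))  # the agent's turn only; the marker is not included
print(matched)       # which stop sequence fired

# If downstream code needs the marker, append it back manually.
restored = (output + matched) if matched is not None else output
```

Returning the matched sequence alongside the text is a design choice for the sketch: it makes the manual re-append step a one-liner.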
Common Use Cases for Stop Sequences
Structured outputs. When you're generating JSON or XML, you can use closing tags as stop sequences. For example, setting the closing tag of your wrapper element as a stop sequence ensures the model doesn't add anything after your structured data ends. For more predictable outputs, look into getting structured JSON outputs through response format parameters.
Conversational interfaces. In chatbot applications, you might want the model to respond as one character and stop before generating dialogue for another. Setting the other character's name as a stop sequence handles this cleanly.
List generation. If you're generating numbered lists, you can use the next number as a stop sequence. Want a list of 10 items? Set "11." as a stop sequence.
Code generation. When generating functions or code blocks, using delimiters like triple backticks or specific end markers prevents the model from adding unwanted explanations after the code. Many AI coding helpers and assistants use this technique internally.
Understanding finish_reason: Did Your Response Get Truncated?
Every API response includes a field (finish_reason in OpenAI's API, stop_reason in Anthropic's) that tells you why the model stopped generating.
Common values include:
- stop (OpenAI) or end_turn (Anthropic): The model finished naturally. OpenAI also reports stop when the model hit a stop sequence you specified; Anthropic reports that case separately as stop_sequence
- length (OpenAI) or max_tokens (Anthropic): The model hit your max_tokens limit and was forced to stop
- tool_calls (OpenAI) or tool_use (Anthropic): The model wants to call a function or tool
- content_filter (OpenAI): Safety filters blocked further output
The "length" value (max_tokens on Anthropic) is your red flag. It means the response was truncated because it ran out of tokens, not because the model finished its thought.
Here's how to check for this in your code:
if response.choices[0].finish_reason == "length":
    print("Warning: Response was truncated due to token limit")
When you get a "length" finish_reason, you have a few options: increase max_tokens for future requests, implement continuation logic to get the rest of the response, or redesign your prompt to encourage shorter answers.
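Because the truncation signal is spelled differently across providers, it can help to normalize it in one place. A minimal sketch, assuming you have already pulled the raw reason string out of the response: OpenAI's chat completions report "length", Anthropic's stop_reason reports "max_tokens", and Gemini's finish reason reports "MAX_TOKENS". The helper name is hypothetical.

```python
# Reason strings that mean "the output hit the token cap", lower-cased.
TRUNCATION_REASONS = {"length", "max_tokens"}


def is_truncated(reason: str) -> bool:
    """True if a finish/stop reason indicates the output token limit was hit."""
    return reason.lower() in TRUNCATION_REASONS


for reason in ["stop", "length", "end_turn", "max_tokens", "MAX_TOKENS"]:
    print(f"{reason:>10} -> truncated: {is_truncated(reason)}")
```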
Setting Max Tokens: Best Practices
Match the parameter to your use case. For quick Q&A responses, 150 to 300 tokens often suffices. For detailed explanations, 1,000 to 2,000 tokens works well. For long-form content like articles or code generation, you might need 4,000 or more.
Leave buffer room. If you think a response needs about 800 tokens, set max_tokens to 1,000 or 1,200. This prevents truncation while still controlling costs.
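One way to pick the buffer is to start from output lengths you have actually observed. The sketch below is an assumption about how you might do that, not a provider feature: `suggest_max_tokens` is a hypothetical helper, and the sample lengths are invented.

```python
import math


def suggest_max_tokens(observed_lengths: list[int], headroom: float = 0.25,
                       round_to: int = 50) -> int:
    """Largest observed output length plus headroom, rounded up to a tidy value."""
    if not observed_lengths:
        raise ValueError("need at least one observed output length")
    target = max(observed_lengths) * (1 + headroom)
    return math.ceil(target / round_to) * round_to


# Invented output token counts from earlier responses to the same prompt type.
samples = [610, 742, 698, 800, 655]

print(suggest_max_tokens(samples))                # 25% headroom
print(suggest_max_tokens(samples, headroom=0.5))  # 50% headroom
```

With a longest observed response of 800 tokens, the two calls land on the 1,000 and 1,200 figures mentioned above.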
Calculate against your context window. Remember that input tokens plus output tokens cannot exceed the model's context window. If you're working with a 400K context model and your input is 300K tokens, you only have 100K left for output. Knowing the size of your model's context window helps you plan accordingly.
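The budget calculation is simple enough to sketch directly. The function name is hypothetical; the 400K window and 128K output ceiling are the GPT-5.2 figures used elsewhere in this article, and the input token counts are assumed to come from your tokenizer or the provider's usage data.

```python
def available_output_tokens(context_window: int, input_tokens: int,
                            model_max_output: int) -> int:
    """Largest max_tokens value that fits both the context window and the model's output ceiling."""
    remaining = context_window - input_tokens
    if remaining <= 0:
        raise ValueError("input already fills the context window")
    return min(remaining, model_max_output)


# 300K of input leaves 100K, which is below the 128K output ceiling.
print(available_output_tokens(400_000, 300_000, 128_000))
# 50K of input leaves 350K, so the 128K output ceiling is the binding limit.
print(available_output_tokens(400_000, 50_000, 128_000))
```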
Default behavior varies by provider. OpenAI models don't have a fixed default for max_tokens in chat completions, and will often generate until a natural stopping point. Anthropic's API requires you to specify max_tokens explicitly. Always check the documentation.
Setting Stop Sequences: Best Practices
Choose unique patterns. If you set "the" as a stop sequence, the model will stop almost immediately. Pick sequences that genuinely indicate the end of useful output, like closing tags, double newlines, or specific marker phrases.
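Test for edge cases. A stop sequence might accidentally appear within legitimate content. Matching is typically exact and case-sensitive, so if you're generating a list about weekends and your stop sequence is "END", an all-caps heading containing "WEEKEND" would trigger an early stop, and a lowercase "end" would fire on "weekend" itself.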
Combine with max_tokens for safety. Stop sequences work great for structured outputs, but always pair them with a reasonable max_tokens value as a fallback. If your stop sequence never appears (maybe due to model unpredictability), max_tokens prevents runaway generation.
Consider tokenization effects. Depending on the provider or inference library, stop sequences may be checked against generated tokens rather than the decoded text. In those cases, a complex stop sequence might not match if the tokenizer splits it differently than you expect. Simpler stop sequences (short strings, single characters like newlines) are more reliable.
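A cheap way to test for accidental matches is to scan outputs you already consider good for any occurrence of your stop sequences. This is a sketch under the assumption that matching is a plain, case-sensitive substring check; `find_accidental_stops` is a hypothetical helper and the samples are invented.

```python
def find_accidental_stops(samples: list[str], stop: list[str]) -> list[tuple[int, str, int]]:
    """Return (sample_index, stop_sequence, char_position) for each stop found in a sample."""
    hits = []
    for i, text in enumerate(samples):
        for seq in stop:
            pos = text.find(seq)
            if pos != -1:
                hits.append((i, seq, pos))
    return hits


# Invented examples of complete, desirable outputs.
samples = [
    "1. Hiking\n2. Farmers market\n3. Movie night",
    "TOP WEEKEND PLANS\n1. Hiking\n2. Brunch",
]

# "END" hides inside "WEEKEND" in the second sample.
print(find_accidental_stops(samples, ["END"]))
# A marker that is unlikely to occur in ordinary prose produces no hits.
print(find_accidental_stops(samples, ["###END###"]))
```

Any hit means that sample would have been cut short at the reported position, so the stop sequence needs to be made more distinctive.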
How These Parameters Interact with Other Settings
Response length isn't just about max tokens and stop sequences. Other API parameters influence output length too.
Temperature affects output indirectly. Higher temperature settings, typically used for more creative output, can lead to more verbose, exploratory responses. Lower temperatures produce more focused, often shorter outputs.
Top-p sampling can similarly affect verbosity. With top-p, you're controlling how many token candidates the model considers at each step, which can influence the style and length of responses.
System prompts can instruct the model to be concise or verbose. Adding "Keep your response under 200 words" to your system message often works well, but it is a soft instruction: unlike max_tokens, the model can overshoot it.
For a full breakdown, see our comprehensive LLM API parameters guide.
Handling Truncated Responses
When responses get cut off due to hitting the output token limit, you have several strategies.
Increase max_tokens. The simplest fix. If you're consistently hitting limits, bump up the value. Just watch your costs.
Implement continuation logic. Detect the "length" finish_reason, then send a follow-up request that includes the truncated output and asks the model to continue. This works well for long-form content generation.
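from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_complete_response(messages, max_tokens=2000, max_rounds=5):
    """Request continuations until the model stops for a reason other than length."""
    messages = list(messages)  # copy so the caller's list is not mutated
    full_response = ""
    for _ in range(max_rounds):  # cap the loop so it cannot run (and bill) forever
        response = client.chat.completions.create(
            model="gpt-5.2",
            messages=messages,
            # Newer OpenAI models expect max_completion_tokens rather than max_tokens
            max_completion_tokens=max_tokens,
        )
        choice = response.choices[0]
        chunk = choice.message.content or ""
        full_response += chunk
        if choice.finish_reason != "length":
            break
        # Feed the partial answer back and ask for the rest
        messages.append({"role": "assistant", "content": chunk})
        messages.append({"role": "user", "content": "Continue from where you left off."})
    return full_response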
Chunk your requests. For document processing or analysis, break input into smaller pieces and process them separately. Then combine results.
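As a rough illustration of chunking, the sketch below splits text into pieces of at most a fixed number of words. Word count is only a stand-in for token count here (an assumption made to keep the example dependency-free); in practice you would size chunks with your provider's tokenizer. `chunk_by_words` is a hypothetical helper.

```python
def chunk_by_words(text: str, max_words: int) -> list[str]:
    """Split text into consecutive chunks of at most max_words words each."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Stand-in document of 2,500 words.
document = "word " * 2500
chunks = chunk_by_words(document, 1000)

print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk can then be sent as its own request, with the per-chunk results combined afterwards. Splitting on word boundaries ignores paragraph and sentence structure, which a production splitter would usually try to respect.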
Summarize first. If you're working with very long inputs, summarize the content first to reduce the output length needed.
For applications sending many requests, handling rate limits (API throttling) becomes important so that continuation loops don't fail with rate limit errors.
Token Limit API Differences Across Providers
Providers name and handle the output token limit parameter differently. Here's what to know.
OpenAI uses max_tokens for standard chat completions. For newer reasoning models, they've introduced max_completion_tokens, which caps visible output and reasoning tokens together. The naming changed because reasoning models spend internal tokens you don't see in the output, so the cap no longer equals the length of the visible answer.
Anthropic requires max_tokens on every API call. There's no default. They also support extended thinking modes where budget_tokens controls how much internal reasoning the model can do. That budget counts toward max_tokens, so max_tokens has to be set larger than budget_tokens to leave room for the visible answer.
Google's Gemini uses maxOutputTokens (camelCase instead of snake_case) and has different limits depending on which tier you're using.
If you're using abstractions like LiteLLM or LangChain, they typically translate max_tokens to whatever the underlying provider expects. But it's worth checking when you encounter unexpected behavior.
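If you are not using an abstraction layer, the translation is small enough to write yourself. The sketch below only maps the parameter names mentioned in this section; `output_limit_param` and its provider labels are hypothetical, and maxOutputTokens is the REST field name (an SDK may spell it differently).

```python
def output_limit_param(provider: str, limit: int, reasoning_model: bool = False) -> dict[str, int]:
    """Return the provider-specific keyword for capping output tokens."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if provider == "openai":
        name = "max_completion_tokens" if reasoning_model else "max_tokens"
    elif provider == "anthropic":
        name = "max_tokens"
    elif provider == "gemini":
        name = "maxOutputTokens"
    else:
        raise ValueError(f"unknown provider: {provider}")
    return {name: limit}


print(output_limit_param("openai", 1024))
print(output_limit_param("openai", 1024, reasoning_model=True))
print(output_limit_param("anthropic", 1024))
print(output_limit_param("gemini", 1024))
```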
When working with any provider, consider streaming responses in real time for long outputs. Streaming lets users see responses as they generate, reducing perceived latency even when outputs are large.
Real-World Token Limits by Model (2026)
Here's what you're working with for current flagship models:
- GPT-5.2: 400K context window, 128K max output tokens
- Claude Sonnet 4.5: 200K context window (1M with beta header), 64K max output tokens
- Gemini 2.5 Pro: 1M to 2M context window, 64K max output tokens
These numbers change with model updates, so check official documentation for your specific use case. Understanding how tokens and tokenization work helps you estimate whether your requests will fit.
When to Use Max Tokens vs. Stop Sequences
Use max tokens when:
- You need a hard numerical limit on output length
- You're controlling costs for high-volume applications
- You want simple, predictable behavior
- You're working with open-ended prompts where stop sequences wouldn't make sense
Use stop sequences when:
- You're generating structured data (JSON, XML, code blocks)
- You're building conversational interfaces with clear turn boundaries
- You need the model to stop at logical content boundaries, not arbitrary lengths
- You're creating lists or sequences with natural delimiters
Use both when:
- You want structured output but need a fallback limit
- You're unsure if the stop sequence will always appear
- You're optimizing for both cost control and output quality
Common Mistakes to Avoid
Setting max_tokens too low for the task. If you ask for a detailed explanation but cap tokens at 100, you'll get a truncated mess. Match your limits to your expectations.
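Forgetting that max_tokens is your spending ceiling. You're billed for the tokens the model actually generates, not for the cap itself, but setting max_tokens to 100,000 "just in case" means a runaway response can cost you up to 100,000 tokens when you only needed 500. Start conservative and increase based on actual needs.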
Using common words as stop sequences. Stop sequences like periods, common words, or short strings trigger far too often. Use unique markers instead.
Ignoring finish_reason. Always check why the model stopped. A "length" finish_reason means your output was cut short, and you should handle it appropriately.
Not testing with realistic prompts. A prompt that works fine with max_tokens=500 in testing might need 2,000 tokens with real user inputs. Test with varied examples.
Conclusion
Max tokens and stop sequences are foundational tools for building reliable LLM applications. Max tokens gives you numerical control over output length, directly impacting both response completeness and cost. Stop sequences provide pattern-based stopping, perfect for structured outputs and conversational interfaces.
Use them together for best results. Set stop sequences for logical content boundaries, then add max_tokens as a safety net. Always check finish_reason to catch truncation issues before they frustrate users.
Getting these parameters right isn't glamorous work, but it's what separates polished AI products from frustrating ones. Master response length control, and you'll build applications that generate exactly what users need without burning through your API budget.



