Complete API documentation for Z.ai (Zhipu AI) Open Platform.
- Getting Started
- Authentication
- API Endpoints
- Models
- Chat Completions API
- Special Features
- Error Handling
- SDKs
- Rate Limits
Z.ai Platform's general API endpoint:
https://api.z.ai/api/paas/v4
For GLM Coding Plan subscribers, use the dedicated coding endpoint:
https://api.z.ai/api/coding/paas/v4
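As a minimal sketch, the two base URLs above can be kept in one place and joined with an endpoint path. The names `BASE_URLS` and `endpoint_url`, and the `"general"`/`"coding"` labels, are illustrative and not part of any Z.ai SDK.

```python
# Base URLs copied from the text above; the dictionary keys are hypothetical labels.
BASE_URLS = {
    "general": "https://api.z.ai/api/paas/v4",
    "coding": "https://api.z.ai/api/coding/paas/v4",
}


def endpoint_url(path: str, plan: str = "general") -> str:
    """Join a base URL and an endpoint path with exactly one slash between them."""
    base = BASE_URLS[plan].rstrip("/")
    return f"{base}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(endpoint_url("/chat/completions"))
    print(endpoint_url("chat/completions", plan="coding"))
```

Stripping slashes on both sides means callers can pass the path with or without a leading slash.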
- Get API Key: Visit Z.AI Open Platform to register and create an API key
- Choose Model: Select from available models based on your needs
- Make API Call: Use HTTP API or SDK to make requests
Z.ai API uses HTTP Bearer Token authentication.
Authorization: Bearer YOUR_API_KEY
For higher-security scenarios, generate JWT tokens:
import time

import jwt  # PyJWT

def generate_token(apikey: str, exp_seconds: int) -> str:
    # The code assumes the API key has the form "<id>.<secret>".
    key_id, secret = apikey.split(".")
    now_ms = int(round(time.time() * 1000))
    payload = {
        "api_key": key_id,
        "exp": now_ms + exp_seconds * 1000,  # expiry time in milliseconds
        "timestamp": now_ms,
    }
    return jwt.encode(
        payload,
        secret,
        algorithm="HS256",
        headers={"alg": "HS256", "sign_type": "SIGN"},
    )
Request headers:
Content-Type: application/json
Accept-Language: en-US,en
Authorization: Bearer YOUR_API_KEY

| Endpoint | Method | Description |
|---|---|---|
| `/chat/completions` | POST | Generate chat completions |
| `/embeddings` | POST | Generate text embeddings |
| `/images/generations` | POST | Generate images |
| `/videos/generations` | POST | Generate videos |
| `/files` | POST | Upload files |
| `/files/{file_id}` | GET | Retrieve file |
| `/files/{file_id}` | DELETE | Delete file |
| `/audio/transcriptions` | POST | Audio-to-text transcription |
| `/moderations` | POST | Content moderation |
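The request headers and endpoint paths above can be combined with the standard library alone. The sketch below only builds the request object and does not send it; `build_request` is a hypothetical helper name, and the general base URL is assumed.

```python
import json
import urllib.request

BASE_URL = "https://api.z.ai/api/paas/v4"  # general endpoint from this document


def build_request(api_key: str, path: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the headers listed above."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept-Language": "en-US,en",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_request(
        "YOUR_API_KEY",
        "/chat/completions",
        {"model": "glm-4.7", "messages": [{"role": "user", "content": "Hello!"}]},
    )
    print(req.get_method(), req.full_url)
    # Sending is left to the caller, e.g. urllib.request.urlopen(req).
```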
Language models:

| Model | Context Length | Description |
|---|---|---|
| `glm-4.7` | 128K | Latest flagship model with enhanced programming and multi-step reasoning |
| `glm-4.6` | 200K | Superior coding, long-context, and reasoning capabilities |
| `glm-4.5` | 128K | Balanced performance across domains |
| `glm-4-32b-0414-128k` | 128K | Large parameter model with extended context |

Vision models:

| Model | Context Length | Description |
|---|---|---|
| `glm-4.6v` | 128K | Flagship multimodal model with native tool use |
| `glm-4.6v-flashx` | 128K | Lightweight, high-speed vision processing |
| `glm-4.6v-flash` | 128K | Lightweight, free model for basic vision tasks |
| `glm-ocr` | - | Specialized OCR for text extraction |

Image generation models:

| Model | Description |
|---|---|
| `glm-image` | General image generation |
| `cogview-4` | High-quality visual generation |

Video generation models:

| Model | Description |
|---|---|
| `cogvideox-3` | Latest video generation |
| `vidu-q1` | Fast, efficient video creation |
| `vidu-2` | Enhanced quality and features |

Audio models:

| Model | Description |
|---|---|
| `glm-asr-2512` | Speech recognition / Audio-to-text |
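One way to use the context lengths in the tables above is to filter models by how much input a request needs. This is a sketch only: it reads "128K" and "200K" as 128,000 and 200,000 tokens, which is an assumption about the exact limits, and `models_that_fit` is a hypothetical helper.

```python
from typing import List

# Context lengths from the tables above. The exact token counts behind
# "128K" and "200K" are assumed, not confirmed by the documentation.
CONTEXT_TOKENS = {
    "glm-4.7": 128_000,
    "glm-4.6": 200_000,
    "glm-4.5": 128_000,
    "glm-4-32b-0414-128k": 128_000,
    "glm-4.6v": 128_000,
    "glm-4.6v-flashx": 128_000,
    "glm-4.6v-flash": 128_000,
}


def models_that_fit(prompt_tokens: int, reserve_for_output: int = 0) -> List[str]:
    """Return model IDs whose context length covers the prompt plus reserved output."""
    needed = prompt_tokens + reserve_for_output
    return sorted(m for m, limit in CONTEXT_TOKENS.items() if limit >= needed)


if __name__ == "__main__":
    print(models_that_fit(150_000))
```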
POST /chat/completions

| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Model ID to use (e.g., `glm-4.7`) |
| `messages` | array | Yes | Array of message objects |
| `temperature` | number | No | Sampling temperature (0.0-2.0), default: 1.0 |
| `max_tokens` | integer | No | Maximum tokens to generate (1-32768) |
| `top_p` | number | No | Nucleus sampling threshold (0.0-1.0) |
| `stream` | boolean | No | Enable streaming responses, default: false |
| `thinking` | object | No | Configure thinking mode |
| `tools` | array | No | Array of tool definitions |
| `tool_choice` | string | No | Tool use strategy: `auto`, `none`, `required` |
| `response_format` | object | No | Structured output format |
| `stop` | string/array | No | Stop sequences |
| `presence_penalty` | number | No | Presence penalty (-2.0 to 2.0) |
| `frequency_penalty` | number | No | Frequency penalty (-2.0 to 2.0) |
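The ranges in the parameter table lend themselves to a client-side check before a request is sent. The sketch below covers only the numeric ranges, the two required fields, and `tool_choice`; `validate_chat_request` is a hypothetical helper, and the server remains the authority on what is valid.

```python
# Allowed ranges taken from the parameter table above: (min, max), inclusive.
RANGES = {
    "temperature": (0.0, 2.0),
    "max_tokens": (1, 32768),
    "top_p": (0.0, 1.0),
    "presence_penalty": (-2.0, 2.0),
    "frequency_penalty": (-2.0, 2.0),
}
TOOL_CHOICES = {"auto", "none", "required"}


def validate_chat_request(request: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for required in ("model", "messages"):
        if required not in request:
            problems.append(f"missing required parameter: {required}")
    for name, (low, high) in RANGES.items():
        if name in request and not (low <= request[name] <= high):
            problems.append(f"{name} must be between {low} and {high}")
    if "tool_choice" in request and request["tool_choice"] not in TOOL_CHOICES:
        problems.append("tool_choice must be one of auto, none, required")
    return problems


if __name__ == "__main__":
    print(validate_chat_request({"model": "glm-4.7", "messages": [], "top_p": 1.5}))
```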
Message format:
{
"role": "system|user|assistant",
"content": "string or array"
}
For vision models, content can be an array:
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image"
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.jpg"
}
}
]
}
Tool definition format:
{
"type": "function",
"function": {
"name": "function_name",
"description": "Function description",
"parameters": {
"type": "object",
"properties": {
"param_name": {
"type": "string",
"description": "Parameter description"
}
},
"required": ["param_name"]
}
}
}
Thinking mode configuration:
{
"thinking": {
"type": "enabled"
}
}
Response format:
{
"id": "chatcmpl-xxx",
"object": "chat.completion",
"created": 1629900000,
"model": "glm-4.7",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Response text"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 200,
"total_tokens": 300,
"prompt_cache_hit_tokens": 0,
"prompt_cache_miss_tokens": 100
}
}
Streaming response chunk:
{
"id": "chatcmpl-xxx",
"object": "chat.completion.chunk",
"created": 1629900000,
"model": "glm-4.7",
"choices": [
{
"index": 0,
"delta": {
"content": "Partial response"
},
"finish_reason": null
}
]
}
Possible `finish_reason` values:

| Value | Description |
|---|---|
| `stop` | Model completed normally |
| `length` | Max tokens reached |
| `tool_calls` | Model requested tool call |
| `content_filter` | Content was filtered |
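The streaming chunk shape and the `finish_reason` values above suggest how a client might reassemble a streamed reply. The sketch below works on already-parsed chunk dictionaries and uses only fields shown in this document; `collect_stream` is a hypothetical helper, and it follows only the first choice.

```python
from typing import Iterable, Optional, Tuple

FINISH_REASONS = {
    "stop": "Model completed normally",
    "length": "Max tokens reached",
    "tool_calls": "Model requested tool call",
    "content_filter": "Content was filtered",
}


def collect_stream(chunks: Iterable[dict]) -> Tuple[str, Optional[str]]:
    """Concatenate delta.content across chunks and return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            if choice.get("index", 0) != 0:
                continue  # this sketch only follows the first choice
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason


if __name__ == "__main__":
    demo = [
        {"choices": [{"index": 0, "delta": {"content": "Hel"}, "finish_reason": None}]},
        {"choices": [{"index": 0, "delta": {"content": "lo"}, "finish_reason": None}]},
        {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
    ]
    text, reason = collect_stream(demo)
    print(text, "|", FINISH_REASONS.get(reason, "unknown"))
```

A `finish_reason` of `length` is a useful signal that the reply was cut off by `max_tokens` and may need a follow-up request.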
Enable thinking mode to have the model return its reasoning process:
{
"model": "glm-4.7",
"messages": [...],
"thinking": {
"type": "enabled"
}
}
The response includes a `reasoning_content` field:
{
"choices": [{
"message": {
"content": "Final answer",
"reasoning_content": "Step-by-step thinking process"
}
}]
}
Define tools for the model to call:
{
"model": "glm-4.7",
"messages": [...],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto"
}
Streaming with tool calls:
{
"model": "glm-4.7",
"messages": [...],
"tools": [...],
"stream": true,
"stream_options": {
"include_usage": true
}
}
Request JSON-formatted responses:
{
"model": "glm-4.7",
"messages": [...],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "analysis",
"schema": {
"type": "object",
"properties": {
"sentiment": {
"type": "string",
"enum": ["positive", "negative", "neutral"]
},
"score": {
"type": "number",
"minimum": 0,
"maximum": 1
}
},
"required": ["sentiment", "score"],
"additionalProperties": false
}
}
}
}
Reduce token usage for repeated content:
{
"model": "glm-4.7",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant.",
"cache_control": {"type": "enabled"}
},
...
]
}
Error response format:
{
"error": {
"message": "Invalid authentication",
"type": "invalid_request_error",
"param": null,
"code": "invalid_api_key"
}
}
HTTP status codes:

| Code | Description |
|---|---|
| 200 | Success |
| 400 | Bad Request - Invalid parameters |
| 401 | Unauthorized - Invalid API key |
| 403 | Forbidden - Insufficient permissions |
| 404 | Not Found |
| 429 | Rate Limit Exceeded |
| 500 | Internal Server Error |
| 503 | Service Unavailable |

Error types:

| Type | Description |
|---|---|
| `invalid_request_error` | Invalid request parameters |
| `invalid_api_key` | Invalid or expired API key |
| `rate_limit_error` | Rate limit exceeded |
| `insufficient_quota` | Insufficient account quota |
| `content_filter` | Content was filtered |
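The error body and status codes above can be folded into one small structure for logging and retry decisions. In this sketch, which statuses count as retryable (429, 500, 503) is a judgment call rather than something this documentation specifies, and `describe_error` is a hypothetical helper.

```python
import json

# Treating these statuses as retryable is an assumption of this sketch.
RETRYABLE_STATUS = {429, 500, 503}


def describe_error(status: int, body: str) -> dict:
    """Extract the error fields shown above and flag whether a retry may help."""
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None  # body was not valid JSON
    error = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(error, dict):
        error = {}  # body did not have the documented shape
    return {
        "status": status,
        "type": error.get("type"),
        "code": error.get("code"),
        "message": error.get("message", ""),
        "retryable": status in RETRYABLE_STATUS,
    }


if __name__ == "__main__":
    sample = '{"error": {"message": "Invalid authentication", "type": "invalid_request_error", "param": null, "code": "invalid_api_key"}}'
    print(describe_error(401, sample))
```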
Python SDK:
from zai import ZaiClient

client = ZaiClient(api_key="your-api-key")

# Basic chat
response = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)

# Streaming
for chunk in client.chat.completions.create(
    model="glm-4.7",
    messages=[...],
    stream=True,
):
    # delta.content may be empty on some chunks, so fall back to "".
    print(chunk.choices[0].delta.content or "", end="")
Python (OpenAI SDK):
from openai import OpenAI

client = OpenAI(
    api_key="your-zai-api-key",
    base_url="https://api.z.ai/api/paas/v4/",
)
completion = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "Hello!"}],
)
Java SDK:
import ai.z.openapi.ZaiClient;
import ai.z.openapi.service.model.*;
ZaiClient client = ZaiClient.builder()
.ofZAI()
.apiKey("your-api-key")
.build();
ChatCompletionCreateParams request = ChatCompletionCreateParams.builder()
.model("glm-4.7")
.messages(Arrays.asList(
ChatMessage.builder()
.role(ChatMessageRole.USER.value())
.content("Hello!")
.build()
))
.build();
ChatCompletionResponse response = client.chat().createChatCompletion(request);
Node.js (OpenAI SDK):
import OpenAI from 'openai';
const client = new OpenAI({
apiKey: 'your-zai-api-key',
baseURL: 'https://api.z.ai/api/paas/v4/'
});
const completion = await client.chat.completions.create({
model: 'glm-4.7',
messages: [{ role: 'user', content: 'Hello!' }]
});
cURL:
# Basic request
curl -X POST "https://api.z.ai/api/paas/v4/chat/completions" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "glm-4.7",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
# Streaming
curl -X POST "https://api.z.ai/api/paas/v4/chat/completions" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "glm-4.7",
"messages": [...],
"stream": true
}'
# With vision
curl -X POST "https://api.z.ai/api/paas/v4/chat/completions" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "glm-4.6v",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
}
]
}'
Rate limits are based on your account tier and subscription.
| Plan | Requests per Minute | Concurrent Requests |
|---|---|---|
| Free | 60 | 3 |
| GLM Coding | Higher limits | Higher limits |
| Enterprise | Custom | Custom |
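A client can stay under a requests-per-minute limit such as the Free plan's 60 by pacing itself. The sliding-window limiter below is a sketch with hypothetical names; it does not cover the concurrent-request limit, and the server's own accounting remains authoritative.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls in any `window` seconds (client-side only)."""

    def __init__(self, limit: int = 60, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._calls = deque()  # timestamps of recent calls, oldest first

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        while self._calls and now - self._calls[0] >= self.window:
            self._calls.popleft()  # drop calls that have left the window
        if len(self._calls) < self.limit:
            return 0.0
        return self.window - (now - self._calls[0])

    def record(self) -> None:
        """Register that a call is being made now."""
        self._calls.append(self.clock())


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(limit=60, window=60.0)
    delay = limiter.wait_time()
    if delay > 0:
        time.sleep(delay)
    limiter.record()  # then send the request
```

The injectable `clock` argument exists so the limiter can be exercised in tests without real waiting.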
Rate limit headers are included in responses:
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1629900000
Implement exponential backoff for 429 errors:
import time

# Assumes the OpenAI-compatible client shown earlier; substitute your SDK's
# rate-limit exception if you use a different client.
from openai import RateLimitError

def make_request_with_retry(client, request, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**request)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...
Best practices

Cost optimization:
- Monitor token usage with the `usage` field in responses
- Use `max_tokens` to control response length
- Implement context caching for repeated content
- Choose an appropriate model for your use case
Error handling:
- Implement retry logic for 429 errors with exponential backoff
- Validate API keys before making requests
- Handle streaming interruptions gracefully
- Log errors for debugging
Performance:
- Use streaming for real-time applications
- Use `temperature=0` for more deterministic responses
- Implement request batching where possible
- Use an appropriate model for the task (don't over-provision)
Security:
- Store API keys securely (environment variables)
- Use HTTPS for all API calls
- Implement proper access controls
- Rotate API keys regularly
- Never commit API keys to version control
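The retry advice above can be combined with the rate limit headers shown earlier: when a 429 response carries `X-RateLimit-Reset`, waiting until that time may be more accurate than a fixed backoff schedule. The sketch below assumes the header holds a Unix timestamp in seconds (as the example value suggests) and falls back to `2 ** attempt` otherwise; `retry_delay` is a hypothetical helper.

```python
import time
from typing import Mapping, Optional


def retry_delay(headers: Mapping[str, str], attempt: int,
                now: Optional[float] = None, cap: float = 60.0) -> float:
    """Seconds to sleep before retrying a 429 response.

    Prefers X-RateLimit-Reset when present, otherwise uses 2 ** attempt.
    Assumes the reset value is a Unix timestamp in seconds. The lookup is
    case-sensitive, so pass headers with the casing shown in this document.
    """
    now = time.time() if now is None else now
    reset = headers.get("X-RateLimit-Reset")
    if reset is not None:
        try:
            return min(max(float(reset) - now, 0.0), cap)
        except ValueError:
            pass  # unparseable header: fall back to exponential backoff
    return min(float(2 ** attempt), cap)


if __name__ == "__main__":
    print(retry_delay({}, attempt=2))
```

The `cap` keeps a far-future or malformed reset value from stalling the client indefinitely.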
Resources:
- Official Documentation: https://docs.z.ai/
- Developer Platform: https://open.bigmodel.cn/
- GitHub: https://github.com/zai-org
- Python SDK: https://github.com/zai-org/z-ai-sdk-python
- Java SDK: https://github.com/zai-org/z-ai-sdk-java
Recent updates:
- GLM-4.7-Flash released - Open source, free 30B parameter model
- GLM-4.7 released - Latest flagship with enhanced coding
- GLM-4.6 released - 200K context, agent-oriented
- GLM-4.6V released - Flagship multimodal
- CogVideoX-3 released - Latest video generation
- GLM Coding Plan launched - Specialized coding endpoint
Document generated based on Z.ai API documentation as of February 2026.