NLP: Deploying Qwen3 Locally with SGLang

References:
- sglang-zh.llamafactory.cn

Overview

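SGLang stands for Structured Generation Language and is an open-source project started by LMSYS Org.
SGLang co-designs the backend runtime and the frontend language to make interaction with models faster and more controllable.
It uses RadixAttention, which manages the key-value cache (KV cache) in a radix tree so that requests sharing a prefix, such as successive turns of one conversation, can reuse cached entries. In multi-turn workloads this can raise the cache hit rate by 3-5x and significantly reduce latency.
The SGLang frontend follows a compiler-style design and uses a domain-specific language (DSL) to simplify programming of complex tasks, while the backend runtime optimizes scheduling and resource allocation. Constrained decoding via regular expressions and finite-state machines (FSMs) lets the model emit structured data such as JSON directly.
SGLang is best suited to complex workloads such as multi-turn dialogue, planning, and tool calling (for example calling an API or a database), and to tasks that need structured output such as JSON or XML, for example intelligent customer service and data analysis.
- On a Llama-7B multi-turn dialogue task, throughput is 5x higher than vLLM and latency is 30%-50% lower
Note: SGLang's interfaces change often, so code written against one version frequently cannot be reused with another, which is a real nuisance!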
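The prefix reuse behind RadixAttention can be illustrated with a toy token-level trie. This is only a sketch of the idea, not SGLang's implementation: `Node` and `PrefixCache` are hypothetical names, the real structure is a radix tree that stores KV tensors and handles eviction, and here only token ids are tracked.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps the next token id to the child node; one node per cached token.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy trie over token ids; a real radix tree merges runs of tokens into one edge."""

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Record a token sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())


cache = PrefixCache()
turn1 = [1, 2, 3, 4]        # e.g. system prompt plus the first user message
cache.insert(turn1)
turn2 = turn1 + [5, 6]      # the second turn extends the same conversation
reused = cache.match_prefix(turn2)
print(f"reused {reused} of {len(turn2)} tokens")  # reused 4 of 6 tokens
```

Only the two new tokens of the second turn would need fresh computation in this toy model, which is why multi-turn conversations benefit most from this kind of cache.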

Installing SGLang

Install via pip

pip install --upgrade pip
pip install "sglang[all]"

# Install FlashInfer CUDA kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

Deploying the service

Start from the command line

python -m sglang.launch_server --model-path ~/llm/model/Qwen3-0.6B --port 30000
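Loading the model takes a while, so a script that sends requests right after launching may need to wait for the server first. A minimal polling sketch follows. It assumes the server answers `GET /health` with HTTP 200 once it is up; given how often SGLang's interfaces change, check that against your installed version. `http_probe` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(url, probe=http_probe, attempts=60, delay=1.0):
    """Poll `url` until `probe` succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False


# Demo with a fake probe that succeeds on the third call, so no server is needed.
calls = iter([False, False, True])
ready = wait_until_ready(
    "http://127.0.0.1:30000/health",   # assumed health endpoint
    probe=lambda _url: next(calls),
    delay=0,
)
print(ready)  # True
```

Against a real server, calling `wait_until_ready("http://127.0.0.1:30000/health")` with the default probe performs the actual HTTP checks.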

Sending requests

Note: the commands below do not yet take the Qwen model's chat template into account

Access from the command line

curl http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{
    "text": "白日依山尽,",
    "sampling_params": {
        "max_new_tokens": 16,
        "temperature": 0
    }
}'
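The same request can be sent from Python, and the prompt can be wrapped in a chat template before it goes to `/generate`. The sketch below assumes the ChatML layout (`<|im_start|>role ... <|im_end|>`) used by the Qwen family. It is a simplified stand-in: the authoritative template, including Qwen3's thinking-mode handling, ships with the model's tokenizer, so `tokenizer.apply_chat_template` from `transformers` is the more reliable route. `to_chatml`, `build_generate_request`, and `generate` are hypothetical helper names.

```python
import json
import urllib.request


def to_chatml(messages):
    """Format messages in ChatML and open an assistant turn for the model to fill."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    return "".join(parts) + "<|im_start|>assistant\n"


def build_generate_request(prompt, max_new_tokens=16, temperature=0):
    """Build the same JSON body as the curl call above."""
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def generate(base_url, payload, timeout=60):
    """POST the payload to /generate and return the decoded JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


prompt = to_chatml([{"role": "user", "content": "白日依山尽,"}])
payload = build_generate_request(prompt)
print(json.dumps(payload, ensure_ascii=False, indent=2))
# With the server running: print(generate("http://localhost:30000", payload))
```

Keeping payload construction separate from the HTTP call makes it easy to inspect exactly what is sent, which helps when an SGLang upgrade changes the accepted fields.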

Access via the OpenAI-compatible API

import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Text completion
response = client.completions.create(
    model="default",
    prompt="The capital of France is",
    temperature=0,
    max_tokens=32,
)
print(response)

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "列出三个国家和他们的首都"},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)

# Text embedding: the server must be started with the --is-embedding flag for the endpoint below to work
response = client.embeddings.create(
    model="default",
    input="How are you today",
)
print(response)
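Printing the whole response object is noisy; usually only the generated text or the vector is needed. The helpers below are a small sketch that relies on the attribute layout of the `openai` Python client's response objects (`choices[0].text`, `choices[0].message.content`, `data[0].embedding`). The function names are hypothetical, and the demo uses a stand-in object so it runs without a server.

```python
from types import SimpleNamespace


def completion_text(response):
    """Text of the first choice from client.completions.create(...)."""
    return response.choices[0].text


def chat_text(response):
    """Assistant message of the first choice from client.chat.completions.create(...)."""
    return response.choices[0].message.content


def embedding_vector(response):
    """Embedding of the first input from client.embeddings.create(...)."""
    return response.data[0].embedding


# Stand-in object with the same attribute layout, so the helper runs offline.
fake_chat = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="France: Paris"))]
)
print(chat_text(fake_chat))  # France: Paris
```

With a live server, pass the `response` returned by each call above to the matching helper.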