
Introduction

Quick Start

Start the vLLM server

Create a file called docker-compose.yml.

# Usage with CUDA, requires Nvidia GPU.
services:
  sv:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    image: vllm/vllm-openai:v0.4.0
    command: "--model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --gpu-memory-utilization 0.8 --max-model-len 4096 --quantization awq"
    ports:
      - 8000:8000
    network_mode: host
    ipc: "host"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface

Start it with docker-compose up
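Optionally, confirm the server is reachable before moving on. vLLM's OpenAI-compatible server exposes GET /v1/models; a minimal TypeScript check might look like this (a sketch: check-server.ts is a hypothetical file name, and it assumes a runtime with a built-in fetch, such as Node 18+):

// Query the OpenAI-compatible /v1/models endpoint of the vLLM server.
const res = await fetch("http://localhost:8000/v1/models");
console.log(res.status); // 200 means the server is up
console.log(await res.json()); // lists the served model(s)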

Note: Check the vLLM docs for more information or for help starting the server. vLLM 0.4.0 or newer is required.

Install @lmscript/client

npm install @lmscript/client
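If you want to confirm the package is installed and resolves in your project, a one-line import check is enough (check-install.ts is a hypothetical scratch file, not part of the quick start):

// If this import succeeds and prints "function", the package resolves correctly.
import { LmScript } from "@lmscript/client";
console.log(typeof LmScript);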

Create a script

Create a file called main.ts.

import { LmScript } from "@lmscript/client";
import { VllmBackend } from "@lmscript/client/backends/vllm";
import { BULLET_LIST_REGEX } from "@lmscript/client/regex";

// Point the client at the vLLM server configured above.
const backend = new VllmBackend({
  url: `http://localhost:8000`,
  template: "mistral",
  model: "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
});

const model = new LmScript(backend, { temperature: 0.0 });

// Ask for a markdown list and constrain the generation with a regex.
// The generated text is captured under the name "mdList".
const {
  captured: { mdList },
} = await model
  .user(
    "Tell me a list of 5 jokes. Answer with a markdown list, where each item of the list has a joke, in a single line.",
  )
  .assistant((m) =>
    m.gen("mdList", {
      maxTokens: 128,
      stop: "\n\n",
      regex: BULLET_LIST_REGEX,
    }),
  )
  .run();

console.log(mdList);
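The captured mdList holds the generated markdown list as text. If you would rather work with the jokes as an array, plain string handling is enough (a sketch, assuming the model follows the requested one-joke-per-line bullet format):

// Split the captured markdown list into individual jokes,
// stripping the leading "-" or "*" bullet from each line.
const jokes = mdList
  .split("\n")
  .map((line) => line.replace(/^[-*]\s*/, "").trim())
  .filter((line) => line.length > 0);

console.log(jokes);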

Execute the file

Use your preferred runtime. @lmscript/client has no dependencies and runs everywhere.

For instance, to run it with Node.js using tsx: npx tsx main.ts

UI Prompt Editor

TODO