Setting up Ephemeral GPU workspaces in Modal for SGLang Development
Introduction
I’ve recently started doing development on SGLang, an LLM inference engine, and I need GPU access to actively test my changes. I do have GPUs available through SLURM, but jobs can sit in the queue for a long time (sometimes days), the GPU mostly idles while I’m editing code, and the cluster’s selection of GPUs is limited. So I wanted an ephemeral environment where I could quickly test on different GPUs without paying for idle time during development. I first encountered Modal at a GPU kernel programming contest, where I used FlashInfer to deploy and benchmark kernels. I really liked its ephemeral GPU containers: they are easy and fast to deploy, and you only pay for the time they run.
The code is available at: github.com/Dogacel/model-workspaces-vscode-sglang
Modal Architecture
Modal is a serverless runtime platform for AI inference with low startup times. You define your entire infrastructure dynamically in Python, without any separate configuration files, and Modal runs your code in isolated containers. Deployments are called “Apps”, and each App bundles one or more “Functions”. Functions are serverless endpoints, meaning no container runs redundantly when no requests are coming in. Modal also provides “Sandboxes”, which let you run containers with arbitrary dependencies and scripts; we will use a Sandbox to create our development environment.
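As a minimal sketch of these concepts (the app name, function body, and model of use here are illustrative, not part of this project), an App with a single GPU Function looks like:

```python
import modal
import subprocess

# Illustrative example, not this project's code.
app = modal.App("hello-gpu")

@app.function(gpu="T4")  # a GPU is attached only while this Function runs
def check_gpu() -> str:
    # nvidia-smi is available inside Modal's GPU containers
    return subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True
    ).stdout

@app.local_entrypoint()
def main():
    # Runs check_gpu remotely in a container; nothing keeps running afterwards.
    print(check_gpu.remote())
```

Running `modal run` on this file spins up a container, executes the Function, and tears everything down, which is exactly the pay-per-use behavior we want.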
Setting up a Docker Image
Modal lets you define your image declaratively using a Python DSL. For my environment, I needed Python 3.11 with CUDA 13.
modal.Image.from_registry(
"nvidia/cuda:13.0.0-devel-ubuntu24.04", add_python="3.11"
)
Next, I installed some tools that help me work in the CLI for short tasks, such as tmux and vim, along with the build tools SGLang needs and an SSH server to allow connecting directly. Note that I figured out what to install by first starting a relatively bare container and then installing packages manually as I hit errors; rebuilding the image on every change would be time-consuming.
.apt_install(
"git", "build-essential", "cmake", "ninja-build",
"vim", "tmux", "htop", "wget", "curl",
"openssh-client", "openssh-server",
"libnuma1",
)
.env({"CUDA_HOME": "/usr/local/cuda"})
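The fragments above chain into a single image definition, roughly like this (the variable name `dev_image` matches what the Sandbox uses later):

```python
import modal

# Base CUDA devel image with Python 3.11, plus CLI and build tooling.
dev_image = (
    modal.Image.from_registry(
        "nvidia/cuda:13.0.0-devel-ubuntu24.04", add_python="3.11"
    )
    .apt_install(
        "git", "build-essential", "cmake", "ninja-build",
        "vim", "tmux", "htop", "wget", "curl",
        "openssh-client", "openssh-server",
        "libnuma1",
    )
    .env({"CUDA_HOME": "/usr/local/cuda"})
)
```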
Project Setup
We will clone SGLang into our image and install its dependencies. I created a secret on Modal’s dashboard for my GitHub token to be able to access my private fork. Note that every time we add a new dependency, we have to rebuild the image.
.run_commands(
f"git clone --branch {SGLANG_BRANCH} https://[email protected]/{GITHUB_USER}/sglang.git /opt/sglang",
"cd /opt/sglang/python && pip install -e '.[all]'",
secrets=[github_secret],
)
I also had to extend the dynamic linker search path for SGLang; otherwise it couldn’t find the required .so files, since they were installed via pip inside Python packages rather than in a standard system location.
python -c "
import nvidia, os, glob;
paths = glob.glob(os.path.join(os.path.dirname(nvidia.__file__), '*/lib'));
open('/etc/ld.so.conf.d/nvidia-python.conf','w').write('\n'.join(paths))
" && ldconfig
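In the image definition, this fix can be attached as one more build step. A sketch (the helper name `_LD_FIX` and its placement in the chain are my choices, not the repository’s exact code):

```python
# Register the NVIDIA libraries shipped inside pip packages with the
# system dynamic linker, so SGLang can load them at runtime.
_LD_FIX = (
    'python -c "'
    "import nvidia, os, glob; "
    "paths = glob.glob(os.path.join(os.path.dirname(nvidia.__file__), '*/lib')); "
    "open('/etc/ld.so.conf.d/nvidia-python.conf','w').write('\\n'.join(paths))"
    '" && ldconfig'
)

dev_image = dev_image.run_commands(_LD_FIX)
```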
Persisting the development environment
I created volumes to persist my workspace and Hugging Face cache, so I don’t have to pull or push my work every time. We will mount these volumes into our App or Sandbox later on.
hf_cache_vol = modal.Volume.from_name("hf-cache", create_if_missing=True)
workspace_vol = modal.Volume.from_name("sglang-workspace", create_if_missing=True)
HF_CACHE_PATH = "/root/.cache/huggingface"
WORKSPACE_PATH = "/workspace"
VOLUMES = {
HF_CACHE_PATH: hf_cache_vol,
WORKSPACE_PATH: workspace_vol,
}
.env({
"HF_HUB_CACHE": HF_CACHE_PATH,
"SGLANG_CACHE_DIR": f"{WORKSPACE_PATH}/.sglang_cache",
})
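To check that the cache actually persists, a throwaway Function can mount the same volumes and pre-download a model. This is a hedged sketch: it assumes `huggingface_hub` is present in the image (SGLang’s dependencies pull it in), and the model name is just an example.

```python
@app.function(image=dev_image, volumes=VOLUMES)
def warm_cache():
    from huggingface_hub import snapshot_download

    # Downloads land in HF_HUB_CACHE, i.e. the hf-cache volume, so later
    # sandboxes reuse them instead of re-downloading on every run.
    snapshot_download("Qwen/Qwen2.5-0.5B-Instruct")  # example model
```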
SSH access
To get SSH access without any password prompt, we can bake our public key directly into the image.
_SSH_KEY_NAMES = ["id_ed25519.pub", "id_rsa.pub", "id_ecdsa.pub"]
SSH_PUB_KEY = next(
(Path.home() / ".ssh" / name for name in _SSH_KEY_NAMES if (Path.home() / ".ssh" / name).exists()),
None,
)
...
.run_commands("mkdir -p /run/sshd /root/.ssh && chmod 700 /root/.ssh")
.add_local_file(str(SSH_PUB_KEY), "/root/.ssh/authorized_keys", copy=True)
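With the key in place, one way to actually reach sshd is through Modal’s Sandbox port forwarding. A sketch, assuming the `dev_image` built above; it relies on the `unencrypted_ports` option and the `tunnels()` API as I understand them from Modal’s docs (SSH traffic is encrypted by SSH itself, so a raw TCP tunnel is fine):

```python
# Run the SSH daemon in the foreground inside a Sandbox and expose port 22.
sb = modal.Sandbox.create(
    "/usr/sbin/sshd", "-D",
    image=dev_image,
    unencrypted_ports=[22],
    app=app,
)
# tunnels() maps container ports to publicly reachable (host, port) pairs.
host, port = sb.tunnels()[22].tcp_socket
print(f"ssh -p {port} root@{host}")
```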
Setting up VSCode & Claude Code
I prefer working in VSCode through the “Remote - Tunnels” extension, so I installed the VSCode CLI directly into the image to persist my workspace settings and tunnel connection.
.run_commands(
"curl -fsSL 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' -o /tmp/vscode-cli.tar.gz"
" && tar -xzf /tmp/vscode-cli.tar.gz -C /usr/local/bin"
" && rm /tmp/vscode-cli.tar.gz",
)
Also make sure it persists your plugins, preferences, and auth after your first login by pointing the CLI’s data directory at the volume:
VSCODE_EXTENSIONS = [
"ms-python.python",
"ms-python.pylint",
"ms-python.debugpy",
"ms-toolsai.jupyter",
"Anthropic.claude-code",
]
def start_vscode_tunnel(name: str):
"""Start a VS Code tunnel, persisting auth and extensions on the volume."""
vscode_data = f"{WORKSPACE_PATH}/.vscode-cli"
os.makedirs(vscode_data, exist_ok=True)
env = {**os.environ, "VSCODE_CLI_DATA_DIR": vscode_data}
cmd = ["code", "tunnel", "--accept-server-license-terms", "--name", name]
for ext in VSCODE_EXTENSIONS:
cmd.extend(["--install-extension", ext])
print(f"Starting VS Code tunnel '{name}'...")
subprocess.run(cmd, env=env, check=False)
Note that I want to persist my Claude Code session too, so I stored my Claude config on the mounted volume as well.
.env({
"CLAUDE_CONFIG_DIR": f"{WORKSPACE_PATH}/.claude",
"IS_SANDBOX": "1", # Lets --dangerously-skip-permissions work
})
Usage
Now that our Modal environment is ready, we can create a local entrypoint that launches VSCode directly, with an option to attach a GPU when we need to debug or run something on one.
@app.local_entrypoint()
def vscode(**kwargs):
...
sb = modal.Sandbox.create(
"python", "-c",
f"from tunnel import start_vscode_tunnel; start_vscode_tunnel('{name}')",
image=dev_image,
gpu=use_gpu,
cpu=use_cpu,
memory=use_memory,
volumes=VOLUMES,
secrets=SECRETS,
timeout=6 * 3600,
app=app,
)
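For reference, a fleshed-out version of the entrypoint might look like this. It is a sketch: the parameter names and defaults are my assumptions, not the repository’s exact code, and it assumes `dev_image`, `VOLUMES`, and `SECRETS` are defined as above.

```python
@app.local_entrypoint()
def vscode(gpu: str = "", cpu: int = 2, memory: int = 8192, name: str = "sglang-dev"):
    # Modal turns typed parameters into CLI flags: --gpu, --cpu, --memory, --name.
    sb = modal.Sandbox.create(
        "python", "-c",
        f"from tunnel import start_vscode_tunnel; start_vscode_tunnel('{name}')",
        image=dev_image,
        gpu=gpu or None,   # omit the GPU entirely when none is requested
        cpu=cpu,
        memory=memory,
        volumes=VOLUMES,
        secrets=SECRETS,
        timeout=6 * 3600,  # auto-terminate after 6 hours, capping cost
        app=app,
    )
    print(f"Sandbox {sb.object_id} started; tunnel '{name}' is coming up.")
```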
Usage is simple: request a development environment with whatever spec you need.
modal run dev.py # no GPU, default resources
modal run dev.py --gpu H100 --cpu 16 --memory 131072 # H100 + 16 CPU + 128 GiB
This automatically creates a VSCode tunnel that you can access from your VSCode desktop, as long as it is signed in to your account.
Final Words
Having low-overhead GPU access is crucial for GPU development and cost efficiency. Modal lets us achieve this by spinning up instances quickly with any desired spec while our workspace persists in volumes. In this example, I only set up a VSCode tunnel for live editing and terminal access, but one could also define custom Functions rather than executing code manually every time; since Functions are serverless, the container runs only while the work does, so you pay only for that time.