Documentation

Everything you need to launch GPU instances, train models, and deploy inference endpoints on FuturaNexus.

Quick Start

1. Create an Account

Sign up at futuranexus.io/login with GitHub or Google. Add a payment method and you're ready to launch. No minimum spend required.

2. Add SSH Key

Add your public key under Settings → SSH Keys → Add SSH Key. If you don't have a key yet, generate one first:

ssh-keygen -t ed25519 -C "your@email.com" -f ~/.ssh/futuranexus
cat ~/.ssh/futuranexus.pub  # Copy this to Settings → SSH Keys

Ed25519 is recommended; RSA (4096-bit) and ECDSA are also supported. Keys are injected automatically into every new instance.

3. Launch an Instance

Dashboard → Launch Instance → Select instance type (GPU + RAM + vCPUs) → Choose environment (PyTorch, JAX, etc.) → Enable access methods (SSH, Jupyter, VS Code) → Set budget cap → Launch. Running in under 30 seconds.

Example launch screen (dashboard.futuranexus.app/instances/new). Instances are billed per second.

GPU          VRAM     Rate
H100 SXM     80 GB    $3.00/hr
A100 80GB    80 GB    $2.00/hr
RTX 4090     24 GB    $0.50/hr
A6000        48 GB    $0.80/hr

Environment options shown: PyTorch 2.4, vLLM, JAX, Custom. Example summary: H100 SXM with PyTorch 2.4 at $3.00/hr ($0.000833 per second).

4. Connect

Use the in-browser terminal, Jupyter Lab link, VS Code Server link, or connect via native SSH:

ssh -i ~/.ssh/futuranexus root@<instance-ip>
# Or click "Terminal" in the dashboard — zero-drop, session persists across disconnects

5. Start Training (Optional)

Dashboard → Training → New Job → Select a base model (Qwen3, Llama, Gemma, etc.) → Upload dataset → Configure method (Closed-Form Newton for single-pass optimization, Reasoning Alignment, SFT, DPO, GRPO, LoRA, QLoRA) → Enable multi-stage pipeline or scaled training → Start. GPU is auto-selected based on model size.

Instance Types

Available GPUs

Each instance type is a complete machine: GPU(s) + system RAM + vCPUs + local SSD. Multi-GPU configs (×2, ×4, ×8) are provisioned as a single node with NVLink/NVSwitch high-speed interconnect — not separate machines.

GPU             VRAM     RAM      vCPUs   Interconnect   Memory BW   FP16 TFLOPS
RTX 4090        24 GB    64 GB    16      -              1.0 TB/s    330
L40S            48 GB    128 GB   16      -              864 GB/s    362
A100 40GB       40 GB    128 GB   12      NVLink 3       1.5 TB/s    312
A100 80GB       80 GB    256 GB   16      NVLink 3       2.0 TB/s    312
H100 SXM        80 GB    256 GB   26      NVLink 4       3.35 TB/s   989
H200 SXM        141 GB   480 GB   32      NVLink 4       4.8 TB/s    989
B200 (coming)   192 GB   512 GB   48      NVLink 5       8.0 TB/s    2,250

A100, H100, and H200 instances are available in ×1, ×2, ×4, and ×8 GPU configurations. Full 8-GPU nodes use NVSwitch for maximum interconnect bandwidth.

On-Demand vs Spot

On-demand: Pay per second, stop anytime. No interruptions. Best for inference servers, interactive development, and long-running jobs that can't tolerate interruption.

Spot: 40-60% off on-demand rates. Uses spare GPU capacity. Your instance can be reclaimed with 60 seconds notice if demand spikes. Best for training with checkpointing — if preempted, resume from the last checkpoint.

Tip: When using spot instances, save model checkpoints every N steps to a persistent volume. (This is separate from the "Gradient checkpointing" training option, which only reduces VRAM usage.) If preempted, launch a new spot instance, re-attach your persistent volume, and resume.
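The checkpoint-and-resume pattern for spot instances can be sketched as follows. This is a minimal illustration using only the Python standard library, not the platform's own mechanism: the checkpoint file, the `train_step` callback, and the `stop_at` parameter (which simulates a preemption) are all hypothetical, and a real job would save model and optimizer state rather than a small JSON record.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a preemption mid-write cannot corrupt it."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, save_every, train_step, stop_at=None):
    """Resume from the last checkpoint; stop_at simulates a preemption."""
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if stop_at is not None and step > stop_at:
            break  # work done since the last save is lost
        state = {"step": step, "loss": train_step(step)}
        if step % save_every == 0:
            save_checkpoint(path, state)
    return state
```

With `save_every=10`, a run interrupted after step 37 restarts from step 30 on the next launch, so a smaller N trades extra write time for less repeated work.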

Instance Lifecycle

Requested → Provisioning → Booting → Running → Stopping → Terminated. Failed instances show error details in the dashboard. Reaching the budget cap triggers a 5-minute warning before auto-termination.

Auto-shutdown on idle: Configurable idle timeout (15 min–2 hours). Triggers when no SSH sessions are active and GPU utilization is below 5%. Prevents forgotten instances from burning credits.
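The idle rule above (no active SSH sessions and GPU utilization below 5% for the whole timeout) can be illustrated with a small sketch. How the platform actually samples sessions and utilization is not documented here, so the `Sample` record and the sampling interval are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float   # seconds, monotonically increasing
    ssh_sessions: int  # active SSH sessions at sample time
    gpu_util: float    # GPU utilization in percent


def idle_seconds(samples, util_threshold=5.0):
    """Length of the trailing idle streak; samples must be in time order."""
    streak_start = None
    for s in samples:
        is_idle = s.ssh_sessions == 0 and s.gpu_util < util_threshold
        if is_idle:
            if streak_start is None:
                streak_start = s.timestamp
        else:
            streak_start = None  # any activity resets the countdown
    if streak_start is None:
        return 0.0
    return samples[-1].timestamp - streak_start


def should_shut_down(samples, timeout_seconds):
    """True once the instance has been continuously idle for the timeout."""
    return idle_seconds(samples) >= timeout_seconds
```

Both conditions must hold at once: a long-running training job with no SSH session attached keeps utilization high and therefore never counts as idle.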

Environments

Pre-Configured Environments

Every environment includes CUDA toolkit, cuDNN, and NCCL. Higher-level environments add framework-specific packages. All environments include nvidia-smi, htop, tmux, git, and wget.

🔥 PyTorch 2.5 + CUDA 12.4: torch, torchvision, transformers, accelerate, peft, bitsandbytes, datasets, wandb, tensorboard, jupyter
🧪 JAX + CUDA 12.4: jax, flax, optax, orbax, jupyter
📊 TensorFlow 2.17: tensorflow, keras 3, tensorboard, jupyter
⚡ vLLM Inference: vLLM server with OpenAI-compatible API
🎨 ComfyUI: Stable Diffusion with ComfyUI node-based workflow
🦙 Ollama: Run and serve open-weight LLMs via CLI
🔧 CUDA 12.4 Minimal: CUDA toolkit, cuDNN, NCCL only
🐧 Ubuntu 22.04 (Bare): Clean OS, full root access

Startup Scripts

Optional bash scripts that run automatically after the environment boots. Use them to clone repos, install extra packages, download datasets, or set environment variables.

#!/bin/bash
# Clone your repo
git clone https://github.com/your-org/your-repo.git /workspace/repo
cd /workspace/repo && pip install -r requirements.txt

# Download a HuggingFace dataset
python -c "from datasets import load_dataset; load_dataset('tatsu-lab/alpaca').save_to_disk('/data/alpaca')"

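# Set env vars (append to ~/.bashrc so later shell sessions see them;
# a plain `export` here would only last for the duration of this script)
echo 'export WANDB_API_KEY=your_key_here' >> ~/.bashrc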

Startup script logs are visible in the terminal after boot.

SSH & Access Methods

Every running instance exposes four access methods from the instance card: an in-browser zero-drop terminal, native SSH, Jupyter Lab, and VS Code Server. If you close the browser tab, the session keeps running; reconnect to get full scrollback.

Example instance card (dashboard.futuranexus.app/instances/llama-finetune): llama-finetune, Running, H100 SXM, PyTorch 2.5, with Web Terminal (in-browser), Native SSH (your client and key), Jupyter Lab (port 8888, token), and VS Code Server (port 8080).

Zero-Drop SSH

Our control plane maintains a persistent SSH connection to your instance. When you connect via the web terminal or native SSH, you're proxied through this persistent link. If your browser disconnects, the SSH session continues. Reconnect and pick up exactly where you left off — full scrollback preserved.

Native SSH

# Basic connection
ssh -i ~/.ssh/futuranexus root@<instance-ip>

# With SSH config (~/.ssh/config)
Host fn-*
  User root
  IdentityFile ~/.ssh/futuranexus
  ServerAliveInterval 30
  ServerAliveCountMax 3

# Then: ssh fn-<instance-name>

Jupyter Lab

Auto-starts on port 8888 when enabled. Accessible directly via a dashboard link — no port forwarding needed. Token is auto-generated and shown in the instance detail page.

VS Code Server

Full VS Code IDE in your browser via code-server on port 8080. Extensions, themes, keybindings, settings sync — everything works. Accessible via dashboard link.

Port Forwarding & HTTP Ports

# Forward Jupyter manually (if not using dashboard link)
ssh -i ~/.ssh/futuranexus -L 8888:localhost:8888 root@<instance-ip>

# Forward TensorBoard + Gradio
ssh -i ~/.ssh/futuranexus -L 6006:localhost:6006 -L 7860:localhost:7860 root@<instance-ip>

# Or enable "Expose HTTP Ports" at launch — auto-generates
# public URLs for ports 7860-9000 (Gradio, Streamlit, FastAPI, etc)

File Transfer

# Upload dataset
scp -i ~/.ssh/futuranexus data.jsonl root@<ip>:/workspace/data/

# Download trained model
scp -i ~/.ssh/futuranexus -r root@<ip>:/workspace/output/ ./local-output/

# Rsync for large transfers (-P keeps partial files so an interrupted transfer can resume)
rsync -avzP -e "ssh -i ~/.ssh/futuranexus" root@<ip>:/workspace/checkpoints/ ./checkpoints/

Training

Go from a base model to a fine-tuned checkpoint without writing a launch script. Pick a model, point at your data, and choose a method; the platform sizes the GPU, streams live loss curves, and registers the result.

Example job setup (dashboard.futuranexus.app/training/new), with estimated VRAM per model:

Model               Params   Est. VRAM
Llama 3.3 70B       70B      ~40 GB (QLoRA)
Qwen3 32B           32B      ~22 GB (QLoRA)
Gemma 3 12B         12B      ~10 GB (LoRA)
Mistral Small 24B   24B      ~16 GB (QLoRA)

Example summary: Llama 3.3 70B with QLoRA (4-bit) on a JSONL (instruct) dataset of 52,002 rows, 3 epochs, on one A100 80GB, estimated time ~2h 10m.

Supported Methods

SFT (Supervised Fine-Tuning)

Standard fine-tuning on instruction/response pairs. Best for teaching a model new tasks or domains.

DPO (Direct Preference Optimization)

Align model with human preferences using chosen/rejected pairs. No reward model needed.

GRPO (Group Relative Policy Optimization)

RL-based alignment with verifiable rewards. Best for math, code, and factual accuracy.

LoRA (Low-Rank Adaptation)

Parameter-efficient fine-tuning — trains small adapter weights, not the full model. 10-100x less VRAM.

QLoRA (Quantized LoRA)

LoRA on a 4-bit quantized base model. Train 70B models on a single A100 80GB.

Model Selection

Select any public model from the HuggingFace Hub or enter a custom model ID. Popular models (Llama 4, Llama 3.3, Gemma 3, Mistral Small, Phi-4, DeepSeek R1) are pre-listed with VRAM estimates. GPU is auto-selected based on model size — override manually if needed.

Dataset Formats

JSONL (Chat): {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
JSONL (Instruct): {"instruction": "...", "output": "..."}
JSONL (ShareGPT): {"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
CSV: Columns: instruction, input (optional), output
Parquet: HuggingFace Datasets format, columns auto-detected

Upload directly or provide a HuggingFace dataset ID. Format is auto-detected and validated before training starts. Max 10 GB per upload.
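To check a JSONL file locally before uploading, the three JSONL shapes listed above can be told apart by their top-level keys. The sketch below is an illustration of that idea only; the function names are hypothetical and the platform's own validator may apply stricter or different rules.

```python
import json


def detect_record_format(record):
    """Return 'chat', 'sharegpt', or 'instruct' for one parsed record, else None."""
    if not isinstance(record, dict):
        return None
    messages = record.get("messages")
    if isinstance(messages, list) and messages and all(
        isinstance(m, dict) and "role" in m and "content" in m for m in messages
    ):
        return "chat"
    turns = record.get("conversations")
    if isinstance(turns, list) and turns and all(
        isinstance(t, dict) and "from" in t and "value" in t for t in turns
    ):
        return "sharegpt"
    if "instruction" in record and "output" in record:
        return "instruct"
    return None


def detect_jsonl_format(lines):
    """Validate every non-empty line and return the single format they share."""
    formats = set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON") from exc
        fmt = detect_record_format(record)
        if fmt is None:
            raise ValueError(f"line {lineno}: unrecognized record shape")
        formats.add(fmt)
    if len(formats) != 1:
        raise ValueError(f"expected exactly one format, found {sorted(formats)}")
    return formats.pop()
```

Typical use is `detect_jsonl_format(open("data.jsonl"))`, which raises `ValueError` with the offending line number if a record is malformed or the file mixes formats.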

Training Options

Gradient checkpointing: Reduces VRAM usage at the cost of ~20% slower training. Enabled by default for models >7B.

Completions-only: Mask prompt tokens during training — the model only learns from responses, not instructions. Reduces noise.

Early stopping: Automatically stops training when validation loss stops improving (patience: 3 eval steps).

Push to HuggingFace Hub: Automatically push the trained LoRA adapter to your HuggingFace account on completion.
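The early-stopping rule described above (stop when validation loss has not improved for 3 consecutive evals) can be sketched in a few lines. This is a generic illustration of patience-based early stopping, not the platform's implementation; the `min_delta` parameter is an assumption added for the example.

```python
class EarlyStopping:
    """Patience-based early stopping on validation loss."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # evals without improvement before stopping
        self.min_delta = min_delta  # smallest decrease that counts as improvement
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, val_loss):
        """Record one eval result; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

Call `update()` once per evaluation step and break out of the training loop when it returns `True`; `best` then holds the lowest validation loss seen.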

Multi-Stage Pipeline

Chain multiple training stages in sequence. Example: Closed-Form Newton (fast adapter initialization in minutes) → Reasoning Alignment (align reasoning with constitutional principles) → SFT refinement (polish on curated examples). Each stage uses the output of the previous stage as its starting point.

Scaled Training

Multi-GPU and multi-node distributed training with configurable parallelism strategy. Supports DDP, FSDP (Fully Sharded), DeepSpeed ZeRO-2/3, and Pipeline Parallelism. Configure nodes, GPUs per node, micro-batch size, and communication backend (NCCL, Gloo, MPI).

Native Control Plane

For maximum performance, dispatch training to the native binary. It offers zero garbage collection pauses, direct Metal GPU compute on Apple Silicon unified memory, and zero-copy CPU↔GPU transfer. The Closed-Form Newton engine runs natively with no Python overhead: optimal adapter weights are computed analytically via singular value decomposition (SVD).

Data Pipeline

Cloud Connectors

Connect external storage backends to pull and push data to your instances and the platform cache. Supported connectors:

AWS S3: Standard S3 buckets. Provide access key + secret or use IAM role.
Google Cloud Storage: GCS buckets via service account JSON key.
Azure Blob Storage: Azure containers via connection string or SAS token.
Cloudflare R2: S3-compatible. Uses R2 API tokens. Zero egress fees.
HuggingFace Hub: Pull models and datasets directly from HF. Uses your access token for gated content.
HTTP / URL: Direct download from any public or authenticated HTTPS URL.

Credentials are encrypted at rest. Configure in Settings → Storage Providers or inline when creating a connector.

Transfer Manager

Transfers are created automatically when you cache a model or import a dataset, or you can initiate them manually from a connector. Features:

Parallel streams: Large files are split into concurrent streams for maximum throughput.

Progress tracking: Real-time progress percentage, transfer speed, and ETA displayed in the dashboard.

Pause/Resume: Pause active transfers and resume later without re-downloading.

Auto-retry: Transient failures are automatically retried with exponential backoff.
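Retry with exponential backoff, as mentioned for transient transfer failures, generally looks like the sketch below. The attempt count, delays, and jitter range are illustrative assumptions; the Transfer Manager's actual settings are not documented here, and `TransientError` is a hypothetical exception type.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, resets)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep, rng=random.random):
    """Call operation(); on TransientError wait 1s, 2s, 4s, ... and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter scales the delay into [0.5, 1.0) of its nominal value
            # so that many clients do not retry in lockstep.
            sleep(delay * (0.5 + rng() / 2))
```

The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting; production callers would leave them at their defaults.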

Model Cache

The platform maintains a shared NVMe cache for frequently-used models. When you cache a model (e.g., from HuggingFace Hub), it's stored on high-speed platform storage and can be instant-mounted to any instance — no re-download required.

Use the quick import bar on the Data Pipeline → Model Cache tab: enter a HuggingFace repo ID (e.g., meta-llama/Llama-3.1-8B-Instruct) and click Cache Model.

Cached models can be evicted when no longer needed. Models currently mounted to running instances cannot be evicted.

Dataset Catalog

Import datasets from HuggingFace Datasets, S3 buckets, or direct URLs. Imported datasets are indexed and can be used directly in training jobs.

Source prefixes: huggingface:// for HF Datasets, s3:// for S3 objects, or a direct HTTPS URL.

Streamable datasets: Large datasets can be streamed directly to training without requiring a full local copy.
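The three source forms above can be distinguished by URL scheme. The following sketch shows one way to parse them with the standard library; the exact layout after `huggingface://` (assumed here to be `org/name`) and the returned dictionary shape are assumptions for illustration, not a documented platform format.

```python
from urllib.parse import urlparse


def parse_source(source):
    """Classify a dataset source string by its scheme."""
    parsed = urlparse(source)
    if parsed.scheme == "huggingface":
        # huggingface://org/name parses as netloc="org", path="/name"
        dataset_id = (parsed.netloc + parsed.path).strip("/")
        return {"kind": "huggingface", "dataset_id": dataset_id}
    if parsed.scheme == "s3":
        return {"kind": "s3", "bucket": parsed.netloc,
                "key": parsed.path.lstrip("/")}
    if parsed.scheme == "https" and parsed.netloc:
        return {"kind": "url", "url": source}
    raise ValueError(f"unsupported dataset source: {source!r}")
```

Plain `http://` URLs and bare paths are rejected here because only HTTPS is listed as a direct URL source.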

Models & Inference

Any model in your registry deploys as an OpenAI-compatible endpoint in a single click: the platform provisions vLLM, loads the weights, and warms the URL. Point your client's base_url at the endpoint and you're live.

Example deployment (dashboard.futuranexus.app/models/deploy):

Registry models: alpaca-llama-70b (safetensors, 39 GB), qwen3-coder-sft (safetensors, 16 GB), gemma3-support (GGUF Q4, 7 GB)
GPU options: A100 80GB ($2.00/hr), H100 SXM ($3.00/hr), L40S ($1.10/hr)
Summary: vLLM 0.6 backend on A100 80GB, 32k-token context, ~45 s cold start, $2.00/hr

base_url = "https://api.futuranexus.io/v1"
model    = "alpaca-llama-70b"
# POST /chat/completions  → streaming ready

Model Registry

The model registry stores all your models — imported from HuggingFace, uploaded directly, trained on the platform, or pulled via URL. Each model tracks its format (safetensors, GGUF, ONNX, PyTorch), size, and deployment status.

Importing Models

Four import methods are available:

HuggingFace Hub: Enter a repo ID (e.g., meta-llama/Llama-3.1-8B-Instruct). Supports gated models with your HF access token.
URL: Direct download from any HTTPS URL. Format auto-detected from file extension.
File Upload: Upload model files directly via the browser. Supports multi-part uploads up to 50 GB.
From Training: Models trained on the platform are automatically added to the registry.

Deploying as Inference Endpoint

Any model in the registry can be deployed as an OpenAI-compatible inference endpoint with one click. The platform provisions a vLLM inference instance, loads the model weights, and generates an API endpoint URL.

# Example: call your deployed model
curl https://api.futuranexus.io/v1/chat/completions \
  -H "Authorization: Bearer fn_prod_sk_..." \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model-name", "messages": [{"role": "user", "content": "Hello!"}]}'

Endpoints support the full OpenAI Chat Completions API including streaming, temperature, top_p, max_tokens, and stop sequences.
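The same call as the curl example can be made from Python with only the standard library. This is a sketch under the assumption that the endpoint accepts the standard OpenAI Chat Completions request body; the helper names are hypothetical, and the API key and model name are placeholders you supply.

```python
import json
import urllib.request


def build_chat_request(api_key, model, messages,
                       base_url="https://api.futuranexus.io/v1", **options):
    """Build a POST request for /chat/completions (nothing is sent yet)."""
    # options may carry temperature, top_p, max_tokens, stop, and so on
    payload = {"model": model, "messages": messages, **options}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `send(build_chat_request(key, "your-model-name", [{"role": "user", "content": "Hello!"}], temperature=0.2))` returns the parsed response; streaming responses would need incremental reading instead of `json.load`.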

Storage

Persistent volumes outlive any instance — create one, attach it at a mount path, and re-attach it to the next box in the same region. Or back it with your own S3 / R2 / Wasabi bucket and keep full ownership of the bytes.

Example volume setup (dashboard.futuranexus.app/storage/new):

Size presets: 100 GB, 500 GB, 1 TB, Custom
Backing options: Managed SSD ($0.05/GB·mo), AWS S3 (your bucket), Cloudflare R2 (zero egress)
Summary: checkpoints-q3, 500 GB, Managed SSD, us-east, $25.00/mo, attached to llama-finetune at /workspace/data, anti-orphan TTL 7 days

Storage Types

Ephemeral (free): Local NVMe SSD on the instance. Fast but lost on termination. Good for temp files, cache, and scratch space.

Persistent ($0.05/GB/mo managed): Network-attached SSD that survives instance termination. Can be re-attached to any new instance in the same region. Use for datasets, checkpoints, and trained models.

Object storage: S3-compatible storage for large datasets and artifacts. Not mounted as a filesystem — accessed via API or tools like aws s3 cp.
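As a quick worked example of the managed persistent rate above, monthly cost is simply size times $0.05 per GB. The sketch uses `Decimal` to avoid floating-point cents; whether partial months are prorated is not stated in this document, so the function covers a full month only.

```python
from decimal import Decimal

MANAGED_SSD_RATE = Decimal("0.05")  # USD per GB per month (managed persistent)


def monthly_volume_cost(size_gb, rate_per_gb_month=MANAGED_SSD_RATE):
    """Full-month cost in USD for a volume of size_gb gigabytes."""
    return (Decimal(size_gb) * rate_per_gb_month).quantize(Decimal("0.01"))
```

A 500 GB managed volume therefore costs $25.00 per month, and a 100 GB volume $5.00.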

Anti-Orphan Protection

Detached persistent volumes have a configurable TTL (time-to-live). When a volume is detached from a terminated instance, the TTL countdown begins. If not re-attached within the TTL period, the volume is automatically deleted. This prevents forgotten volumes from silently accumulating storage charges. You'll receive a notification before deletion.
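The TTL countdown amounts to simple date arithmetic: deletion time is the detach time plus the TTL. The sketch below uses a 7-day TTL purely as an example value, since the TTL is configurable; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def deletion_time(detached_at, ttl_days=7):
    """When a detached volume would be deleted if never re-attached."""
    return detached_at + timedelta(days=ttl_days)


def time_remaining(detached_at, now, ttl_days=7):
    """Time left before deletion, clamped at zero once the TTL has passed."""
    remaining = deletion_time(detached_at, ttl_days) - now
    return max(remaining, timedelta(0))


detached = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(deletion_time(detached))  # 2025-01-08 12:00:00+00:00
```

Using timezone-aware datetimes, as here, avoids off-by-hours mistakes when the detach event and the check happen in different time zones.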

BYO Storage Providers

Connect your own cloud storage in Settings → Storage Providers. Supported backends:

AWS S3 · Cloudflare R2 · Wasabi · Google Cloud Storage · Backblaze B2

When you create a persistent volume with a BYO provider, data is stored in your bucket — you retain full ownership. Credentials are encrypted at rest.

Compute Providers

FuturaNexus Managed (Default)

We provision GPUs from our provider network. Best availability, fastest launch times (under 30 seconds), automatic failover, and no setup required. Prices are as listed on the pricing page.

BYO (Bring Your Own) Provider

Use your own API key from a third-party provider. We handle orchestration, monitoring, zero-drop SSH, and the full dashboard experience — you just bring the compute. Supported providers:

Vast.ai · Lambda · RunPod · Hyperbolic · CoreWeave

Add your API key in Settings → Providers. A flat 5% orchestration fee applies for dashboard, monitoring, and SSH proxy services. You pay the provider directly for GPU compute.

Billing

Spend is visible to the second and fenced by guardrails you set: a hard budget cap with a 5-minute grace warning, and auto-shutdown when an instance goes idle, so forgotten instances don't keep burning credits.

Example billing view (dashboard.futuranexus.app/billing): an H100 at $3.00/hr ($0.000833/s) metered for 1h 34m 12s shows $4.71 spent against a $20.00 budget cap. Cap presets are $10, $20, $50, or Custom; idle timeout presets are 15m, 30m, 1h, and 2h.

Per-Second Billing

All GPU charges are metered per second, with no hourly minimums and no rounding up. If you use an H100 for 47 seconds, you pay for exactly 47 seconds.
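Per-second billing is the hourly rate divided by 3,600 and multiplied by the seconds used. The sketch below works this through with `Decimal`; the $3.00/hr H100 rate is taken from the examples in this document, and the function names are hypothetical.

```python
from decimal import Decimal


def per_second_rate(hourly_rate):
    """Hourly USD rate converted to a per-second rate."""
    return Decimal(str(hourly_rate)) / Decimal(3600)


def usage_cost(hourly_rate, seconds):
    """Exact cost in USD for the given number of seconds of use."""
    # Multiply before dividing so round totals stay exact.
    return Decimal(str(hourly_rate)) * Decimal(seconds) / Decimal(3600)
```

At $3.00/hr, 47 seconds costs about $0.039, and a session of 1h 34m 12s (5,652 seconds) costs $4.71.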

Budget Caps

Set a maximum spend per instance. When the cap is reached, the instance auto-terminates with a 5-minute warning — enough time to save checkpoints and data. You'll receive notifications at 80%, 95%, and 100% of the cap.
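To estimate when the 80%, 95%, and 100% notifications would arrive, divide the remaining budget at each threshold by the hourly rate. This is a back-of-the-envelope sketch that assumes a single instance running at a constant rate; the function names are hypothetical.

```python
def seconds_until(spend_target, hourly_rate, spent=0.0):
    """Seconds of runtime before total spend reaches spend_target."""
    remaining = max(spend_target - spent, 0.0)
    return remaining * 3600 / hourly_rate


def notification_schedule(cap, hourly_rate, spent=0.0, percents=(80, 95, 100)):
    """Map each notification percentage to seconds from now."""
    return {
        pct: seconds_until(cap * pct / 100, hourly_rate, spent)
        for pct in percents
    }
```

For a $20 cap on a $3.00/hr instance starting from zero spend, the 80% notice arrives after 19,200 seconds (5h 20m) and the cap itself is reached after 24,000 seconds (6h 40m), at which point the 5-minute warning begins.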

Auto-Shutdown on Idle

Configurable idle timeout (15 min to 2 hours). Triggers when no SSH sessions are active and GPU utilization is below 5%. Prevents forgotten instances from burning credits. Recommended for all interactive development sessions.

Credits & Invoices

Pre-paid credits are applied first to all charges. Monthly invoices are generated for any usage beyond your credit balance. Download PDF invoices from the Billing page. All prices are in USD.

API Reference

Authentication

Create API keys in Settings → API Keys. Include the key in the Authorization header:

curl -H "Authorization: Bearer fn_prod_sk_..." \
  https://api.futuranexus.io/v1/instances

API keys can be scoped to specific resources (instances, training, storage, billing). Rate limit: 100 req/min per key.
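A client that stays under the 100 requests per minute limit can space its calls at least 0.6 seconds apart. The sketch below builds an authenticated request and paces calls on the client side; how the server actually windows the limit is not documented here, so even spacing is a conservative assumption, and `authed_get` and `Pacer` are hypothetical names.

```python
import time
import urllib.request

API_BASE = "https://api.futuranexus.io/v1"


def authed_get(path, api_key):
    """Build a GET request carrying the Bearer token (nothing is sent yet)."""
    return urllib.request.Request(
        f"{API_BASE}{path}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


class Pacer:
    """Client-side spacing to stay under a requests-per-minute limit."""

    def __init__(self, max_per_minute=100, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous request, if any

    def wait(self):
        """Block just long enough to keep min_interval between requests."""
        now = self.clock()
        if self.last is not None:
            delay = self.min_interval - (now - self.last)
            if delay > 0:
                self.sleep(delay)
                now += delay
        self.last = now
```

Call `pacer.wait()` before each `urllib.request.urlopen(authed_get("/instances", key))`; the `clock` and `sleep` parameters exist so the pacing can be tested without real delays.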

Endpoints

Method   Path                      Description
GET      /v1/instances             List all instances
POST     /v1/instances             Launch new instance
GET      /v1/instances/:id         Get instance details + metrics
POST     /v1/instances/:id/stop    Stop instance
DELETE   /v1/instances/:id         Terminate & delete
GET      /v1/training              List training jobs
POST     /v1/training              Start training job
GET      /v1/training/:id          Get job details + metrics
GET      /v1/models                List trained models
POST     /v1/models/:id/deploy     Deploy as endpoint
GET      /v1/storage/volumes       List volumes
POST     /v1/storage/volumes       Create volume
GET      /v1/billing/usage         Current period usage
GET      /v1/billing/credits       Credit balance
GET      /v1/billing/invoices      Invoice history

WebSocket Events

Real-time instance metrics and state changes via WebSocket at wss://api.futuranexus.io/v1/ws. Events: instance.state_change, instance.metrics, training.progress, training.complete, billing.budget_warning.
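Once messages arrive over the WebSocket, a small dispatcher can route them to handlers by event name. The message envelope is not specified in this document, so the sketch assumes a hypothetical JSON shape of `{"event": "<name>", "data": {...}}`; only the five event names come from the list above, and the transport itself (any WebSocket client) is left out.

```python
import json

KNOWN_EVENTS = {
    "instance.state_change",
    "instance.metrics",
    "training.progress",
    "training.complete",
    "billing.budget_warning",
}


class EventDispatcher:
    """Route decoded WebSocket messages to handlers by event name."""

    def __init__(self):
        self.handlers = {}

    def on(self, event_name, handler):
        """Register handler(data) for one of the documented event names."""
        if event_name not in KNOWN_EVENTS:
            raise ValueError(f"unknown event: {event_name}")
        self.handlers.setdefault(event_name, []).append(handler)

    def dispatch(self, raw_message):
        """Decode one message (assumed envelope) and call matching handlers."""
        message = json.loads(raw_message)
        name = message.get("event")
        for handler in self.handlers.get(name, []):
            handler(message.get("data", {}))
        return name
```

Register handlers with `dispatcher.on("training.progress", callback)` and pass each received text frame to `dispatcher.dispatch(frame)`; events without a registered handler are ignored.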

CLI

Installation

# Install via npm
npm install -g @futuranexus/cli

# Or via Homebrew (macOS/Linux)
brew install futuranexus/tap/fn

# Authenticate
fn auth login

Common Commands

# List instances
fn instances list

# Launch instance
fn instances launch --gpu h100_sxm --env pytorch-full --name my-run

# SSH into instance
fn ssh my-run

# Start training
fn training start --model meta-llama/Llama-3.1-8B-Instruct --dataset ./data.jsonl --method sft

# Check training status
fn training status tj_001

# List volumes
fn storage list