Documentation
Everything you need to launch GPU instances, train models, and deploy inference endpoints on FuturaNexus.
Quick Start
1. Create an Account
Sign up at futuranexus.io/login with GitHub or Google. Add a payment method and you're ready to launch. No minimum spend required.
2. Add SSH Key
Settings → SSH Keys → Add SSH Key. Or generate one:
ssh-keygen -t ed25519 -C "your@email.com" -f ~/.ssh/futuranexus
cat ~/.ssh/futuranexus.pub  # Copy this to Settings → SSH Keys
Ed25519 recommended. RSA 4096-bit and ECDSA also supported. Keys are auto-injected into every new instance.
3. Launch an Instance
Dashboard → Launch Instance → Select instance type (GPU + RAM + vCPUs) → Choose environment (PyTorch, JAX, etc.) → Enable access methods (SSH, Jupyter, VS Code) → Set budget cap → Launch. Running in under 30 seconds.
4. Connect
Use the in-browser terminal, Jupyter Lab link, VS Code Server link, or connect via native SSH:
ssh -i ~/.ssh/futuranexus root@<instance-ip>
# Or click "Terminal" in the dashboard: zero-drop, session persists across disconnects
5. Start Training (Optional)
Dashboard → Training → New Job → Select a base model (Qwen3, Llama, Gemma, etc.) → Upload dataset → Configure method (Closed-Form Newton for single-pass optimization, Reasoning Alignment, SFT, DPO, GRPO, LoRA, QLoRA) → Enable multi-stage pipeline or scaled training → Start. GPU is auto-selected based on model size.
Instance Types
Available GPUs
Each instance type is a complete machine: GPU(s) + system RAM + vCPUs + local SSD. Multi-GPU configs (×2, ×4, ×8) are provisioned as a single node with NVLink/NVSwitch high-speed interconnect — not separate machines.
| GPU | VRAM | System RAM | vCPUs | Interconnect | Memory BW | FP16 TFLOPS |
|---|---|---|---|---|---|---|
| RTX 4090 | 24 GB | 64 GB | 16 | — | 1.0 TB/s | 330 |
| L40S | 48 GB | 128 GB | 16 | — | 864 GB/s | 362 |
| A100 40GB | 40 GB | 128 GB | 12 | NVLink 3 | 1.5 TB/s | 312 |
| A100 80GB | 80 GB | 256 GB | 16 | NVLink 3 | 2.0 TB/s | 312 |
| H100 SXM | 80 GB | 256 GB | 26 | NVLink 4 | 3.35 TB/s | 989 |
| H200 SXM | 141 GB | 480 GB | 32 | NVLink 4 | 4.8 TB/s | 989 |
| B200 (coming) | 192 GB | 512 GB | 48 | NVLink 5 | 8.0 TB/s | 2,250 |
A100, H100, and H200 are available in ×1, ×2, ×4, and ×8 configurations. Full 8-GPU nodes use NVSwitch for maximum interconnect bandwidth.
On-Demand vs Spot
On-demand: Pay per second, stop anytime. No interruptions. Best for inference servers, interactive development, and long-running jobs that can't tolerate interruption.
Spot: 40-60% off on-demand rates. Uses spare GPU capacity. Your instance can be reclaimed with 60 seconds notice if demand spikes. Best for training with checkpointing — if preempted, resume from the last checkpoint.
Tip: On spot instances, save checkpoints to a persistent volume every N steps. If preempted, launch a new spot instance, re-attach your persistent volume, and resume. (The "Gradient checkpointing" training option reduces VRAM usage; it does not save training state.)
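The save-and-resume pattern for spot instances can be sketched as follows. This is a minimal, framework-agnostic illustration rather than platform code: the function names, the `step_XXXXXXXX.json` naming scheme, and the toy `train` loop are all assumptions, and a real job would save model and optimizer state instead of a small JSON dict.

```python
import json
import os
import tempfile


def save_checkpoint(ckpt_dir, step, state):
    """Atomically write a checkpoint for `step` into ckpt_dir."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"step_{step:08d}.json")
    # Write to a temp file first so a preemption mid-write never leaves
    # a truncated checkpoint under the final name.
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir):
    """Return (step, state) from the newest checkpoint, or (0, None)."""
    if not os.path.isdir(ckpt_dir):
        return 0, None
    names = sorted(n for n in os.listdir(ckpt_dir)
                   if n.startswith("step_") and n.endswith(".json"))
    if not names:
        return 0, None
    with open(os.path.join(ckpt_dir, names[-1])) as f:
        payload = json.load(f)
    return payload["step"], payload["state"]


def train(ckpt_dir, total_steps, save_every, stop_at=None):
    """Toy loop: resume from the latest checkpoint, save every N steps.

    `stop_at` simulates a spot preemption at that step (hypothetical).
    """
    step, state = load_latest_checkpoint(ckpt_dir)
    state = state or {"loss_sum": 0.0}
    while step < total_steps:
        step += 1
        state["loss_sum"] += 1.0 / step  # stand-in for a real training step
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, state)
        if stop_at is not None and step == stop_at:
            return step  # preempted: work since the last save is lost
    return step
```

Pointing `ckpt_dir` at the persistent volume's mount path is what lets the replacement instance pick up the latest checkpoint; only the steps since the last save are repeated.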
Instance Lifecycle
Requested → Provisioning → Booting → Running → Stopping → Terminated. Failed instances show error details in the dashboard. Budget cap reached triggers a 5-minute warning before auto-termination.
Auto-shutdown on idle: Configurable idle timeout (15 min–2 hours). Triggers when no SSH sessions are active and GPU utilization is below 5%. Prevents forgotten instances from burning credits.
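The idle condition described above can be expressed as a small predicate. This is only a sketch of the stated rule (no active SSH sessions and GPU utilization below 5% for the whole timeout); the sample format and the function name are assumptions, not the platform's monitoring API.

```python
def should_shut_down(samples, idle_timeout_s, util_threshold=5.0):
    """Decide whether an instance has been idle for the full timeout.

    samples: list of (timestamp_s, active_ssh_sessions, gpu_util_percent),
             oldest first (hypothetical monitoring format).
    """
    idle_since = None
    for ts, sessions, util in samples:
        if sessions == 0 and util < util_threshold:
            if idle_since is None:
                idle_since = ts  # start of the current idle stretch
        else:
            idle_since = None    # any activity resets the countdown
    if idle_since is None:
        return False
    return samples[-1][0] - idle_since >= idle_timeout_s
```

A single busy sample (an SSH login or a utilization spike) resets the countdown, so a job that uses the GPU in bursts does not accumulate idle time across those bursts.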
Environments
Pre-Configured Environments
Every environment includes CUDA toolkit, cuDNN, and NCCL. Higher-level environments add framework-specific packages. All environments include nvidia-smi, htop, tmux, git, and wget.
Startup Scripts
Optional bash scripts that run automatically after the environment boots. Use them to clone repos, install extra packages, download datasets, or set environment variables.
#!/bin/bash
# Clone your repo
git clone https://github.com/your-org/your-repo.git /workspace/repo
cd /workspace/repo && pip install -r requirements.txt
# Download a HuggingFace dataset
python -c "from datasets import load_dataset; load_dataset('tatsu-lab/alpaca').save_to_disk('/data/alpaca')"
# Set env vars
export WANDB_API_KEY=your_key_here
Startup script logs are visible in the terminal after boot.
SSH & Access Methods
Every running instance exposes four ways in — an in-browser zero-drop terminal, native SSH, Jupyter Lab, and VS Code Server — all from the instance card. Close the tab and the session keeps running; reconnect with full scrollback. See it:
$ nvidia-smi --query-gpu=name,memory.used --format=csv,noheader
H100 SXM, 41216 MiB
$ tail -f train.log
step 1420/4000 loss 0.612 lr 1.8e-4 12.4 tok/s/gpu
# ── browser tab closed, reopened 3 min later ──
$ # …still here. scrollback intact.
Zero-Drop SSH
Our control plane maintains a persistent SSH connection to your instance. When you connect via the web terminal or native SSH, you're proxied through this persistent link. If your browser disconnects, the SSH session continues. Reconnect and pick up exactly where you left off — full scrollback preserved.
Native SSH
# Basic connection
ssh -i ~/.ssh/futuranexus root@<instance-ip>
# With SSH config (~/.ssh/config)
Host fn-*
    User root
    IdentityFile ~/.ssh/futuranexus
    ServerAliveInterval 30
    ServerAliveCountMax 3
# Then:
ssh fn-<instance-name>
Jupyter Lab
Auto-starts on port 8888 when enabled. Accessible directly via a dashboard link — no port forwarding needed. Token is auto-generated and shown in the instance detail page.
VS Code Server
Full VS Code IDE in your browser via code-server on port 8080. Extensions, themes, keybindings, settings sync — everything works. Accessible via dashboard link.
Port Forwarding & HTTP Ports
# Forward Jupyter manually (if not using dashboard link)
ssh -L 8888:localhost:8888 root@<instance-ip>
# Forward TensorBoard + Gradio
ssh -L 6006:localhost:6006 -L 7860:localhost:7860 root@<instance-ip>
# Or enable "Expose HTTP Ports" at launch to auto-generate
# public URLs for ports 7860-9000 (Gradio, Streamlit, FastAPI, etc.)
File Transfer
# Upload dataset
scp -i ~/.ssh/futuranexus data.jsonl root@<ip>:/workspace/data/
# Download trained model
scp -i ~/.ssh/futuranexus -r root@<ip>:/workspace/output/ ./local-output/
# Rsync for large transfers (--partial keeps partly transferred files so an interrupted run can resume)
rsync -avz --partial -e "ssh -i ~/.ssh/futuranexus" root@<ip>:/workspace/checkpoints/ ./checkpoints/
Training
Go from a base model to a fine-tuned checkpoint without touching a launch script. Pick a model, point at your data, and choose a method; the platform sizes the GPU, streams live loss curves, and registers the result.
Supported Methods
SFT: Standard fine-tuning on instruction/response pairs. Best for teaching a model new tasks or domains.
DPO: Align the model with human preferences using chosen/rejected pairs. No reward model needed.
GRPO: RL-based alignment with verifiable rewards. Best for math, code, and factual accuracy.
LoRA: Parameter-efficient fine-tuning that trains small adapter weights, not the full model. 10-100x less VRAM.
QLoRA: LoRA on a 4-bit quantized base model. Train 70B models on a single A100 80GB.
Model Selection
Select any public model from the HuggingFace Hub or enter a custom model ID. Popular models (Llama 4, Llama 3.3, Gemma 3, Mistral Small, Phi-4, DeepSeek R1) are pre-listed with VRAM estimates. GPU is auto-selected based on model size — override manually if needed.
Dataset Formats
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"instruction": "...", "output": "..."}
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
Columns: instruction, input (optional), output
HuggingFace Datasets format (columns auto-detected)
Upload directly or provide a HuggingFace dataset ID. Format is auto-detected and validated before training starts. Max 10 GB per upload.
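To check a JSONL file locally before uploading, something like the following could be used. It is a rough sketch of format detection for the three JSON record shapes shown above, not the platform's validator; the function names and the format labels it returns are assumptions.

```python
import json


def detect_format(record):
    """Classify one parsed JSONL record by its top-level keys."""
    if isinstance(record.get("messages"), list):
        return "messages"
    if isinstance(record.get("conversations"), list):
        return "conversations"
    if "instruction" in record and "output" in record:
        return "instruction"
    return None


def validate_jsonl(lines):
    """Return (format, row_count); raise ValueError on bad or mixed rows."""
    detected = None
    count = 0
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})")
        if not isinstance(record, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        fmt = detect_format(record)
        if fmt is None:
            raise ValueError(f"line {lineno}: unrecognized record shape")
        if detected is None:
            detected = fmt
        elif fmt != detected:
            raise ValueError(
                f"line {lineno}: mixed formats ({detected} vs {fmt})")
        count += 1
    if detected is None:
        raise ValueError("no records found")
    return detected, count
```

Typical use would be `validate_jsonl(open("data.jsonl", encoding="utf-8"))`, which reports the first offending line number instead of failing partway through an upload.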
Training Options
Gradient checkpointing: Reduces VRAM usage at the cost of ~20% slower training. Enabled by default for models >7B.
Completions-only: Mask prompt tokens during training — the model only learns from responses, not instructions. Reduces noise.
Early stopping: Automatically stops training when validation loss stops improving (patience: 3 eval steps).
Push to HuggingFace Hub: Automatically push the trained LoRA adapter to your HuggingFace account on completion.
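The early-stopping rule listed above (patience: 3 eval steps) can be illustrated with a small tracker. This is a generic sketch of patience-based early stopping, not the platform's implementation; the class name and the `min_delta` parameter are assumptions.

```python
class EarlyStopping:
    """Stop when validation loss fails to improve `patience` evals in a row."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # hypothetical: minimum change that counts
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, val_loss):
        """Record one eval result; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0      # improvement resets the counter
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

With losses 1.0, 0.8, 0.81, 0.79, 0.80, 0.80, 0.79, the tracker signals a stop on the last value: 0.79 was the best so far, and the three evaluations after it did not beat it.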
Multi-Stage Pipeline
Chain multiple training stages in sequence. Example: Closed-Form Newton (fast adapter initialization in minutes) → Reasoning Alignment (align reasoning with constitutional principles) → SFT refinement (polish on curated examples). Each stage uses the output of the previous stage as its starting point.
Scaled Training
Multi-GPU and multi-node distributed training with configurable parallelism strategy. Supports DDP, FSDP (Fully Sharded), DeepSpeed ZeRO-2/3, and Pipeline Parallelism. Configure nodes, GPUs per node, micro-batch size, and communication backend (NCCL, Gloo, MPI).
Native Control Plane
For maximum performance, dispatch training to the native binary. Zero garbage collection pauses, direct Metal GPU compute on Apple Silicon unified memory, zero-copy CPU↔GPU transfer. The Closed-Form Newton engine runs natively with no Python overhead; optimal adapter weights are computed analytically via SVD.
Data Pipeline
Cloud Connectors
Connect external storage backends to pull and push data to your instances and the platform cache. Supported connectors:
Credentials are encrypted at rest. Configure in Settings → Storage Providers or inline when creating a connector.
Transfer Manager
Transfers are created automatically when you cache a model or import a dataset, or you can initiate them manually from a connector. Features:
Parallel streams: Large files are split into concurrent streams for maximum throughput.
Progress tracking: Real-time progress percentage, transfer speed, and ETA displayed in the dashboard.
Pause/Resume: Pause active transfers and resume later without re-downloading.
Auto-retry: Transient failures are automatically retried with exponential backoff.
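For scripts that move data themselves, the auto-retry behaviour can be approximated client-side. The sketch below shows retry with exponential backoff in general terms; the base delay, cap, jitter strategy, and the set of errors treated as transient are assumptions, since the Transfer Manager's actual parameters are not documented here.

```python
import random
import time


def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Delay before retry k is min(cap, base * 2**k) seconds."""
    return [min(cap, base * (2 ** k)) for k in range(max_retries)]


def retry(operation, max_retries=5, base=1.0, cap=60.0,
          transient=(ConnectionError, TimeoutError),
          sleep=time.sleep, jitter=random.random):
    """Call operation(); on a transient error, wait and try again."""
    delays = backoff_delays(max_retries, base, cap)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except transient:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Sleep a random fraction of the delay so that many clients
            # retrying at once do not all hit the server together.
            sleep(delays[attempt] * jitter())
```

Errors outside the `transient` tuple propagate immediately, which keeps permanent failures such as bad credentials from being retried pointlessly.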
Model Cache
The platform maintains a shared NVMe cache for frequently-used models. When you cache a model (e.g., from HuggingFace Hub), it's stored on high-speed platform storage and can be instant-mounted to any instance — no re-download required.
Use the quick import bar on the Data Pipeline → Model Cache tab: enter a HuggingFace repo ID (e.g., meta-llama/Llama-3.1-8B-Instruct) and click Cache Model.
Cached models can be evicted when no longer needed. Models currently mounted to running instances cannot be evicted.
Dataset Catalog
Import datasets from HuggingFace Datasets, S3 buckets, or direct URLs. Imported datasets are indexed and can be used directly in training jobs.
Source prefixes: huggingface:// for HF Datasets, s3:// for S3 objects, or a direct HTTPS URL.
Streamable datasets: Large datasets can be streamed directly to training without requiring a full local copy.
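The source prefixes above map naturally onto URL schemes. As an illustration only, a script could classify a source string like this; the helper name, the returned shapes, and the assumption that `huggingface://` is followed by a dataset repo ID are not taken from the platform.

```python
from urllib.parse import urlparse


def parse_source(source):
    """Split a dataset source string into (kind, location)."""
    parsed = urlparse(source)
    if parsed.scheme == "huggingface":
        # huggingface://org/name -> "org/name"
        return "huggingface", (parsed.netloc + parsed.path).strip("/")
    if parsed.scheme == "s3":
        # s3://bucket/path/to/object -> bucket + object key
        return "s3", {"bucket": parsed.netloc,
                      "key": parsed.path.lstrip("/")}
    if parsed.scheme == "https":
        return "url", source
    raise ValueError(f"unsupported dataset source: {source!r}")
```

For instance, `parse_source("s3://my-bucket/data/train.jsonl")` yields the bucket name and object key separately, which is the form most S3 clients expect.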
Models & Inference
Any model in your registry deploys as an OpenAI-compatible endpoint in a single click: the platform provisions vLLM, loads the weights, and warms the URL. Swap your base_url and you're live.
base_url = "https://api.futuranexus.io/v1"
model = "alpaca-llama-70b"
# POST /chat/completions → streaming ready
Model Registry
The model registry stores all your models — imported from HuggingFace, uploaded directly, trained on the platform, or pulled via URL. Each model tracks its format (safetensors, GGUF, ONNX, PyTorch), size, and deployment status.
Importing Models
Four import methods are available:
Deploying as Inference Endpoint
Any model in the registry can be deployed as an OpenAI-compatible inference endpoint with one click. The platform provisions a vLLM inference instance, loads the model weights, and generates an API endpoint URL.
# Example: call your deployed model
curl https://api.futuranexus.io/v1/chat/completions \
-H "Authorization: Bearer fn_prod_sk_..." \
-H "Content-Type: application/json" \
-d '{"model": "your-model-name", "messages": [{"role": "user", "content": "Hello!"}]}'
Endpoints support the full OpenAI Chat Completions API including streaming, temperature, top_p, max_tokens, and stop sequences.
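The same call can be made from Python using only the standard library. This is a minimal sketch that mirrors the curl example; the helper names are hypothetical, and the response parsing assumes the standard OpenAI Chat Completions response shape (`choices[0].message.content`), which an OpenAI-compatible endpoint is expected to return.

```python
import json
import urllib.request

BASE_URL = "https://api.futuranexus.io/v1"


def build_chat_request(api_key, model, user_message, **params):
    """Build (but do not send) a POST /chat/completions request."""
    body = {"model": model,
            "messages": [{"role": "user", "content": user_message}]}
    body.update(params)  # e.g. temperature, top_p, max_tokens, stop
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def chat(api_key, model, user_message, **params):
    """Send the request and return the first choice's message text."""
    req = build_chat_request(api_key, model, user_message, **params)
    with urllib.request.urlopen(req, timeout=60) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without network access. Existing OpenAI client libraries should also work once their base URL is pointed at the endpoint.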
Storage
Persistent volumes outlive any instance — create one, attach it at a mount path, and re-attach it to the next box in the same region. Or back it with your own S3 / R2 / Wasabi bucket and keep full ownership of the bytes.
Storage Types
Ephemeral (free): Local NVMe SSD on the instance. Fast but lost on termination. Good for temp files, cache, and scratch space.
Persistent ($0.05/GB/mo managed): Network-attached SSD that survives instance termination. Can be re-attached to any new instance in the same region. Use for datasets, checkpoints, and trained models.
Object storage: S3-compatible storage for large datasets and artifacts. Not mounted as a filesystem — accessed via API or tools like aws s3 cp.
Anti-Orphan Protection
Detached persistent volumes have a configurable TTL (time-to-live). When a volume is detached from a terminated instance, the TTL countdown begins. If not re-attached within the TTL period, the volume is automatically deleted. This prevents forgotten volumes from silently accumulating storage charges. You'll receive a notification before deletion.
BYO Storage Providers
Connect your own cloud storage in Settings → Storage Providers. Supported backends:
AWS S3 · Cloudflare R2 · Wasabi · Google Cloud Storage · Backblaze B2
When you create a persistent volume with a BYO provider, data is stored in your bucket — you retain full ownership. Credentials are encrypted at rest.
Compute Providers
FuturaNexus Managed (Default)
We provision GPUs from our provider network. Best availability, fastest launch times (under 30 seconds), automatic failover, and no setup required. Prices as listed in the pricing page.
BYO (Bring Your Own) Provider
Use your own API key from a third-party provider. We handle orchestration, monitoring, zero-drop SSH, and the full dashboard experience — you just bring the compute. Supported providers:
Vast.ai · Lambda · RunPod · Hyperbolic · CoreWeave
Add your API key in Settings → Providers. A flat 5% orchestration fee applies for dashboard, monitoring, and SSH proxy services. You pay the provider directly for GPU compute.
Billing
Spend is visible to the second and fenced by guardrails you set: a hard budget cap with a 5-minute grace warning, and auto-shutdown when an instance goes idle. Nothing burns credits while you sleep.
Per-Second Billing
All GPU charges are per-second with no rounding — ever. If you use an H100 for 47 seconds, you pay for exactly 47 seconds. No hourly minimums, no rounding up.
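As a worked example of the arithmetic, per-second billing amounts to dividing the hourly rate by 3600 and multiplying by the seconds used. The hourly rate below is a made-up figure for illustration, not a listed price, and the function name is an assumption.

```python
from decimal import Decimal


def gpu_cost(hourly_rate_usd, seconds):
    """Cost of `seconds` of use at an hourly rate, billed per second."""
    per_second = Decimal(str(hourly_rate_usd)) / Decimal(3600)
    return per_second * Decimal(seconds)


# Hypothetical rate of $3.60/hour, i.e. $0.001 per second.
cost_47s = gpu_cost("3.60", 47)
```

At that hypothetical rate, 47 seconds costs $0.047, whereas a provider with a one-hour minimum would charge the full $3.60 for the same run.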
Budget Caps
Set a maximum spend per instance. When the cap is reached, the instance auto-terminates with a 5-minute warning — enough time to save checkpoints and data. You'll receive notifications at 80%, 95%, and 100% of the cap.
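The notification schedule can be sketched as a check between two consecutive spend readings. This is an illustrative model of the behaviour described above, assuming the 5-minute warning starts when the cap is reached; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

NOTIFY_AT = (0.80, 0.95, 1.00)   # from the text: 80%, 95%, 100% of the cap
GRACE = timedelta(minutes=5)     # warning period before auto-termination


def crossed_thresholds(previous_spend, current_spend, cap):
    """Thresholds passed between two spend readings (fractions of cap)."""
    return [t for t in NOTIFY_AT
            if previous_spend < t * cap <= current_spend]


def termination_time(cap_reached_at):
    """When the instance would auto-terminate after hitting the cap."""
    return cap_reached_at + GRACE
```

Comparing against the previous reading means each notification fires once, when its threshold is crossed, rather than on every poll after the spend has passed it.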
Auto-Shutdown on Idle
Configurable idle timeout (15 min to 2 hours). Triggers when no SSH sessions are active and GPU utilization is below 5%. Prevents forgotten instances from burning credits. Recommended for all interactive development sessions.
Credits & Invoices
Pre-paid credits are applied first to all charges. Monthly invoices are generated for any usage beyond your credit balance. Download PDF invoices from the Billing page. All prices are in USD.
API Reference
Authentication
Create API keys in Settings → API Keys. Include the key in the Authorization header:
curl -H "Authorization: Bearer fn_prod_sk_..." \
  https://api.futuranexus.io/v1/instances
API keys can be scoped to specific resources (instances, training, storage, billing). Rate limit: 100 req/min per key.
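Scripts that poll the API may want to stay under the rate limit on the client side. The sketch below is one way to do that with a sliding window; how the server actually counts requests (fixed or sliding window, burst allowance) is not specified here, so treat the window model and the class name as assumptions.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(self, max_calls=100, period=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait_time(self):
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        return self.period - (now - self.calls[0])

    def record(self):
        """Note that a call is being made now."""
        self.calls.append(self.clock())
```

A caller would sleep for `wait_time()` seconds, call `record()`, and then send the request.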
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /v1/instances | List all instances |
| POST | /v1/instances | Launch new instance |
| GET | /v1/instances/:id | Get instance details + metrics |
| POST | /v1/instances/:id/stop | Stop instance |
| DELETE | /v1/instances/:id | Terminate & delete |
| GET | /v1/training | List training jobs |
| POST | /v1/training | Start training job |
| GET | /v1/training/:id | Get job details + metrics |
| GET | /v1/models | List trained models |
| POST | /v1/models/:id/deploy | Deploy as endpoint |
| GET | /v1/storage/volumes | List volumes |
| POST | /v1/storage/volumes | Create volume |
| GET | /v1/billing/usage | Current period usage |
| GET | /v1/billing/credits | Credit balance |
| GET | /v1/billing/invoices | Invoice history |
WebSocket Events
Real-time instance metrics and state changes via WebSocket at wss://api.futuranexus.io/v1/ws. Events: instance.state_change, instance.metrics, training.progress, training.complete, billing.budget_warning.
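On the client side, incoming events are usually routed to handlers by name. The following sketch shows that dispatch step only (no socket handling). The message envelope `{"event": ..., "data": ...}` and the `step` field are assumptions made for illustration; the actual payload schema is not documented in this section.

```python
import json

KNOWN_EVENTS = {
    "instance.state_change", "instance.metrics",
    "training.progress", "training.complete",
    "billing.budget_warning",
}


class EventRouter:
    """Route decoded WebSocket messages to handlers by event name."""

    def __init__(self):
        self.handlers = {}

    def on(self, event_name):
        """Decorator that registers a handler for one event name."""
        if event_name not in KNOWN_EVENTS:
            raise ValueError(f"unknown event: {event_name}")

        def register(fn):
            self.handlers.setdefault(event_name, []).append(fn)
            return fn
        return register

    def dispatch(self, raw_message):
        """Decode one message, run its handlers, return how many ran."""
        msg = json.loads(raw_message)
        handlers = self.handlers.get(msg.get("event"), [])
        for handler in handlers:
            handler(msg.get("data", {}))
        return len(handlers)
```

A WebSocket client library of your choice would call `router.dispatch(message)` for each text frame received from wss://api.futuranexus.io/v1/ws.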
CLI
Installation
# Install via npm
npm install -g @futuranexus/cli
# Or via Homebrew (macOS/Linux)
brew install futuranexus/tap/fn
# Authenticate
fn auth login
Common Commands
# List instances
fn instances list
# Launch instance
fn instances launch --gpu h100_sxm --env pytorch-full --name my-run
# SSH into instance
fn ssh my-run
# Start training
fn training start --model meta-llama/Llama-3.1-8B-Instruct --dataset ./data.jsonl --method sft
# Check training status
fn training status tj_001
# List volumes
fn storage list