Fine-Tuning with Kubeflow Trainer v2
Run supervised fine-tuning with LlamaFactory on Kubernetes using Kubeflow Trainer v2.
Trainer v2 splits the job into a reusable TrainingRuntime (image + pipeline steps + LlamaFactory config) and per-experiment TrainJob runs that override only what changes (model, dataset, hyperparameters, GPU resources).
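The split between a shared runtime and per-experiment overrides can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the runtime name, the environment variable names (MODEL_NAME, DATASET_NAME, LEARNING_RATE), and the helper make_train_job are all hypothetical, and the manifest assumes the trainer.kubeflow.org/v1alpha1 TrainJob schema.

```python
import json


def make_train_job(name, runtime_name, model, dataset, learning_rate, gpus):
    """Build a TrainJob manifest that overrides only per-experiment values.

    Everything else (image, pipeline steps, LlamaFactory config) is assumed
    to live in the referenced TrainingRuntime.
    """
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",  # assumed API version
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            "runtimeRef": {
                "apiGroup": "trainer.kubeflow.org",
                "kind": "TrainingRuntime",
                "name": runtime_name,
            },
            "trainer": {
                # Hypothetical variable names; the real runtime defines its own.
                "env": [
                    {"name": "MODEL_NAME", "value": model},
                    {"name": "DATASET_NAME", "value": dataset},
                    {"name": "LEARNING_RATE", "value": str(learning_rate)},
                ],
                "resourcesPerNode": {"limits": {"nvidia.com/gpu": gpus}},
            },
        },
    }


# Two experiments that share one runtime and differ only in their overrides.
exp_a = make_train_job("sft-exp-a", "llamafactory-sft", "model-a", "dataset-a", 1e-4, 1)
exp_b = make_train_job("sft-exp-b", "llamafactory-sft", "model-b", "dataset-a", 5e-5, 2)
print(json.dumps(exp_a, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be piped to kubectl apply -f - once the placeholder names are replaced with real ones.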
Prerequisites
If you hit RBAC errors, ask a cluster admin to grant your workbench ServiceAccount read/write on trainjobs and trainingruntimes in the target namespace (example role: apiGroups: ["trainer.kubeflow.org"], resources: ["trainjobs","trainingruntimes"]).
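As a rough sketch of what such a grant could look like, the snippet below builds a namespaced Role and RoleBinding as plain dictionaries. The role name, namespace, and ServiceAccount name are placeholders, and the verb list is one reasonable reading of "read/write", not a required set.

```python
import json


def make_trainer_role(namespace, role_name="trainer-editor"):
    """Role granting read/write on TrainJobs and TrainingRuntimes."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": ["trainer.kubeflow.org"],
                "resources": ["trainjobs", "trainingruntimes"],
                # Assumed verb set covering both reads and writes.
                "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
            }
        ],
    }


def make_role_binding(namespace, role_name, service_account):
    """Bind the Role to the workbench ServiceAccount in the same namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
    }


# Placeholder namespace and ServiceAccount; substitute your own.
role = make_trainer_role("my-namespace")
binding = make_role_binding("my-namespace", role["metadata"]["name"], "my-workbench-sa")
print(json.dumps([role, binding], indent=2))
```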
Build or use a prebuilt image
Use alaudadockerhub/fine_tune_with_llamafactory:v0.1.11, or build your own from the Containerfile under assets/build-train-image/.
Run the example notebook
Download fine-tune-with-trainer-v2.ipynb into your workbench and follow the cells. The notebook creates a TrainingRuntime, then submits a TrainJob that mounts the shared PVC and uses the aml-image-builder-secret.
For Huawei Ascend NPUs, use fine-tune-with-trainer-v2-mindspeed-npu.ipynb instead. It runs the MindSpeed-LLM SFT pipeline (HF → MCore checkpoint conversion, preprocessing, training) on huawei.com/Ascend910B4 resources with runtimeClassName: ascend.
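To make the NPU scheduling settings concrete, here is a small sketch of the pod-spec fragment they imply. It is not taken from the notebook: the container name and image are placeholders, and where this fragment sits inside the TrainingRuntime is left to the notebook. The one Kubernetes rule it relies on is that extended resources such as huawei.com/Ascend910B4 must have equal requests and limits.

```python
def npu_pod_spec(container_name, image, npus=1):
    """Pod-spec fragment requesting Ascend 910B4 devices.

    container_name and image are hypothetical placeholders.
    """
    resources = {"huawei.com/Ascend910B4": npus}
    return {
        "runtimeClassName": "ascend",
        "containers": [
            {
                "name": container_name,
                "image": image,
                # Extended resources cannot be overcommitted, so requests
                # and limits carry the same value.
                "resources": {"limits": dict(resources), "requests": dict(resources)},
            }
        ],
    }


spec = npu_pod_spec("trainer", "example.invalid/mindspeed-llm:placeholder", npus=8)
```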
Scheduling with Kueue
When Kueue is installed, TrainJobs stay suspended until Kueue admits them against the configured ClusterQueue quota. Ready-to-apply YAMLs live in assets/kueue/.
The three files, in the order they are applied:
- cluster-queue.yaml: a single ResourceFlavor plus a ClusterQueue covering cpu / memory / nvidia.com/gpu. Cluster admin applies it once per quota pool.
- local-queue.yaml: a namespaced LocalQueue that references cluster-queue. Namespace admin applies it once per namespace.
- trainjob-kueue-example.yaml: a TrainJob labelled kueue.x-k8s.io/queue-name: local-queue. The TrainJob stays Suspended until Kueue admits it; once admitted, JobSet brings the trainer pods up.
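The shape of the three objects described above can be sketched as follows. This is an illustration under assumptions, not the contents of assets/kueue/: the flavor name and all quota numbers are placeholders, the kueue.x-k8s.io/v1beta1 API version is assumed, and the helper names are hypothetical. The queue names cluster-queue and local-queue follow the descriptions above.

```python
import json

KUEUE_API = "kueue.x-k8s.io/v1beta1"  # assumed API version


def make_kueue_objects(namespace, flavor="default-flavor", gpu_quota=8):
    """Return (ResourceFlavor, ClusterQueue, LocalQueue) manifests."""
    resource_flavor = {
        "apiVersion": KUEUE_API,
        "kind": "ResourceFlavor",
        "metadata": {"name": flavor},
    }
    cluster_queue = {
        "apiVersion": KUEUE_API,
        "kind": "ClusterQueue",
        "metadata": {"name": "cluster-queue"},
        "spec": {
            "namespaceSelector": {},  # empty selector matches all namespaces
            "resourceGroups": [
                {
                    "coveredResources": ["cpu", "memory", "nvidia.com/gpu"],
                    "flavors": [
                        {
                            "name": flavor,
                            "resources": [
                                # Placeholder quotas; size them to your pool.
                                {"name": "cpu", "nominalQuota": "64"},
                                {"name": "memory", "nominalQuota": "256Gi"},
                                {"name": "nvidia.com/gpu", "nominalQuota": str(gpu_quota)},
                            ],
                        }
                    ],
                }
            ],
        },
    }
    local_queue = {
        "apiVersion": KUEUE_API,
        "kind": "LocalQueue",
        "metadata": {"name": "local-queue", "namespace": namespace},
        "spec": {"clusterQueue": "cluster-queue"},
    }
    return resource_flavor, cluster_queue, local_queue


def add_queue_label(train_job, queue="local-queue"):
    """Label a TrainJob manifest (a dict) so Kueue manages its admission."""
    labels = train_job.setdefault("metadata", {}).setdefault("labels", {})
    labels["kueue.x-k8s.io/queue-name"] = queue
    return train_job


flavor_obj, cq, lq = make_kueue_objects("my-namespace")
print(json.dumps([flavor_obj, cq, lq], indent=2))
```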
See the Kueue documentation for the full setup.
When the Kueue PodsReady timeout is shorter than the time needed to pull a large training image, Kueue may evict the first attempt before its pods become ready. Resubmitting usually succeeds because the image is by then cached on the node.
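One way to resubmit is to create a fresh copy of the evicted TrainJob under a new name. The helper below is a hypothetical sketch that works on a manifest dictionary (for example, one parsed from kubectl get trainjob <name> -o json): it drops the status and the server-populated metadata, then appends a retry suffix. Deleting the old TrainJob and reapplying the original manifest is an equally valid alternative.

```python
import copy

# Metadata fields populated by the API server that must not be sent on create.
SERVER_FIELDS = ("resourceVersion", "uid", "creationTimestamp", "generation", "managedFields")


def resubmit_copy(train_job, attempt):
    """Return a clean copy of a TrainJob manifest, renamed for a retry.

    The input dictionary is left unmodified.
    """
    job = copy.deepcopy(train_job)
    job.pop("status", None)
    meta = job["metadata"]
    for field in SERVER_FIELDS:
        meta.pop(field, None)
    meta["name"] = f"{meta['name']}-retry{attempt}"
    return job
```

Labels, including kueue.x-k8s.io/queue-name, are preserved, so the copy goes through Kueue admission like the original.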