feat: add optional Unsloth training backend (Feat/unsloth integration #2)

Merged: cuolm merged 9 commits into master from feat/unsloth-integration on May 6, 2026
cuolm (Owner) commented May 6, 2026

Summary

This PR adds optional Unsloth training support and improves data safety for the C‑based FIM pipeline.

Key Changes

  • Backend: Introduced optional Unsloth training backend with lazy import, while keeping standard HF Trainer as default.
  • Memory: Aligned Unsloth max_seq_length exactly to max_token_sequence_length for better VRAM efficiency.
  • Data Safety: Fixed empty‑subblock crash and removed unsafe UTF‑8‑cutting truncation; replaced with two‑pass filtering and token‑length checks.
  • Infrastructure: Added unsloth‑related tests, optional dependency declaration, compiled‑cache ignore, and a token‑length analysis script.

cuolm added 9 commits May 1, 2026 16:48
- Two-step filtering approach:
  1. First pass uses the estimated bytes_per_token_ratio with
  max_token_sequence_length to calculate max_bytes_per_subblock =
  int(config.max_token_sequence_length * bytes_per_token_ratio) and discards code
  blocks exceeding that estimate.
  2. Second pass applies _filter_overlong_tokenized_examples()
  with actual token counts as a hard limit to guarantee that no tokenized example exceeds
  max_token_sequence_length.
- Eliminates Unsloth max_seq_length overflow risk
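The two passes described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function names, the whitespace-based `toy_tokenize` stand-in, and the sample values are all assumptions, and the real second pass runs on tokenized FIM examples rather than on raw code blocks.

```python
from typing import Callable, List


def filter_by_byte_estimate(
    blocks: List[str], max_tokens: int, bytes_per_token_ratio: float
) -> List[str]:
    """First pass: cheap pre-filter based on an estimated byte budget."""
    max_bytes_per_subblock = int(max_tokens * bytes_per_token_ratio)
    return [b for b in blocks if len(b.encode("utf-8")) <= max_bytes_per_subblock]


def filter_by_token_count(
    examples: List[str], max_tokens: int, tokenize: Callable[[str], List[str]]
) -> List[str]:
    """Second pass: hard limit using the actual token count."""
    return [e for e in examples if len(tokenize(e)) <= max_tokens]


def toy_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for a real tokenizer: splits on whitespace.
    return text.split()


blocks = [
    "int a;",
    "int main(void) { return 0; }",
    "a b c d e f g h i j",  # short in bytes, but 10 toy tokens
    "x" * 200,              # far over the byte estimate
]

# Assumed values: 8 tokens at an estimated 4 bytes per token gives 32 bytes.
kept = filter_by_byte_estimate(blocks, max_tokens=8, bytes_per_token_ratio=4.0)
final = filter_by_token_count(kept, max_tokens=8, tokenize=toy_tokenize)
print(len(kept), len(final))
```

The byte estimate only removes blocks that are obviously too large. The token-count pass is what actually enforces the limit, which is why a block that passes the first filter can still be dropped by the second.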
- Branch model loading into _load_hf_model (existing) and _load_unsloth_model (lazy import)
- Unsloth uses use_gradient_checkpointing="unsloth" internally, so HF Trainer gradient_checkpointing is disabled when the Unsloth backend is active
- Unsloth max_seq_length set to exact max_token_sequence_length for precise memory allocation
- Full compatibility with existing merge_lora_and_save and GGUF conversion pipeline
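A lazy-import branch of this kind might look like the sketch below. The names `select_backend`, `trainer_gradient_checkpointing`, and the `use_unsloth` flag are hypothetical and are not taken from the PR; the sketch shows only the backend selection and the gradient-checkpointing switch, not model loading itself.

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple


def select_backend(use_unsloth: bool) -> Tuple[str, Optional[ModuleType]]:
    """Return the backend name and, for Unsloth, the lazily imported module."""
    if not use_unsloth:
        # Default path: nothing Unsloth-related is imported.
        return "hf", None
    try:
        # Imported only on request, so the dependency stays optional.
        module = importlib.import_module("unsloth")
    except ImportError as exc:
        raise RuntimeError(
            "use_unsloth=True but the optional 'unsloth' package is not installed"
        ) from exc
    return "unsloth", module


def trainer_gradient_checkpointing(use_unsloth: bool, requested: bool) -> bool:
    """Turn the HF Trainer flag off when Unsloth handles checkpointing itself."""
    return requested and not use_unsloth


backend, _ = select_backend(use_unsloth=False)
print(backend, trainer_gradient_checkpointing(use_unsloth=False, requested=True))
```

Keeping the import inside the branch means that users of the default HF Trainer path never need the optional package installed.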
Empty subblocks caused a ValueError crash in np.random.choice(0, size=1). Added an early
return [] when len(subblock_ranges) == 0. This is safe because empty blocks cannot
generate FIM examples.
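The guard can be illustrated with a small sketch. The function name `sample_subblock_ranges` and its signature are assumptions made for this example; only the `np.random.choice` call and the `len(subblock_ranges) == 0` check come from the commit message.

```python
import numpy as np


def sample_subblock_ranges(subblock_ranges, num_samples=1):
    """Pick random subblock ranges, returning [] when there is nothing to pick."""
    if len(subblock_ranges) == 0:
        # Without this guard, np.random.choice(0, size=...) raises ValueError.
        return []
    indices = np.random.choice(len(subblock_ranges), size=num_samples)
    return [subblock_ranges[int(i)] for i in indices]


print(sample_subblock_ranges([]))
print(sample_subblock_ranges([(0, 10)]))
```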

The code block truncation code_block_utf8[:max_bytes_per_subblock] cut Tree-sitter
nodes in the middle of multi-byte UTF-8 characters (for example, the box character
█ = e2 96 88 was cut to e2 96), causing a UnicodeDecodeError at the 512-token limit.
Removed the truncation. This is safe because the tokenizer-based second pass
filters out overlong examples afterwards.
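The failure mode is easy to reproduce in isolation. In this sketch the sample string and the cut positions are chosen purely for illustration, and `decodes_cleanly` is a hypothetical helper.

```python
def decodes_cleanly(data: bytes) -> bool:
    """Return True if the byte string is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "█" (U+2588) is three bytes in UTF-8: e2 96 88.
code_block_utf8 = "x = '█'".encode("utf-8")

# Five ASCII bytes precede the box character, so a cut at byte 7 keeps only
# e2 96, while a cut at byte 8 lands on a character boundary.
cut_mid_char = code_block_utf8[:7]
cut_on_boundary = code_block_utf8[:8]

print(cut_mid_char.hex(" "), decodes_cleanly(cut_mid_char))
print(cut_on_boundary.hex(" "), decodes_cleanly(cut_on_boundary))
```

A byte-offset slice has no knowledge of character boundaries, so any fixed byte limit can split a multi-byte sequence. Dropping whole examples by token count avoids the problem entirely.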
cuolm merged commit fc6161b into master on May 6, 2026
4 checks passed