feat: add optional Unsloth training backend · Feat/unsloth integration by cuolm · Pull Request #2 · cuolm/codefinetuner

cuolm · 2026-05-06T07:03:29Z

Summary

This PR adds optional Unsloth training support and improves data safety for the C‑based FIM pipeline.

Key Changes

Backend: Introduced optional Unsloth training backend with lazy import, while keeping standard HF Trainer as default.
Memory: Aligned Unsloth max_seq_length exactly to max_token_sequence_length for better VRAM efficiency.
Data Safety: Fixed empty‑subblock crash and removed unsafe UTF‑8‑cutting truncation; replaced with two‑pass filtering and token‑length checks.
Infrastructure: Added unsloth‑related tests, optional dependency declaration, compiled‑cache ignore, and a token‑length analysis script.

- Two-step filtering approach:
  1. First pass uses the estimated bytes_per_token_ratio with max_token_sequence_length to calculate max_bytes_per_subblock = int(config.max_token_sequence_length * bytes_per_token_ratio) and discards code blocks exceeding that estimate.
  2. Second pass applies _filter_overlong_tokenized_examples() with actual token counts as a hard limit, guaranteeing that no tokenized example exceeds max_token_sequence_length.
- Eliminates Unsloth max_seq_length overflow risk
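
The two passes described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual code: the function names, the whitespace-based `toy_tokenize` stand-in, and the sample values are hypothetical; only `max_token_sequence_length`, `bytes_per_token_ratio`, and the `max_bytes_per_subblock` formula come from the description.

```python
from typing import Callable, List


def filter_blocks_by_estimated_bytes(
    code_blocks: List[str],
    max_token_sequence_length: int,
    bytes_per_token_ratio: float,
) -> List[str]:
    """Pass 1: drop whole blocks whose UTF-8 size exceeds the byte estimate.

    Blocks are discarded, never truncated, so no multi-byte character is cut.
    """
    max_bytes_per_subblock = int(max_token_sequence_length * bytes_per_token_ratio)
    return [
        block
        for block in code_blocks
        if len(block.encode("utf-8")) <= max_bytes_per_subblock
    ]


def filter_overlong_tokenized_examples(
    examples: List[str],
    tokenize: Callable[[str], List[str]],
    max_token_sequence_length: int,
) -> List[str]:
    """Pass 2: keep only examples whose actual token count fits the hard limit."""
    return [
        example
        for example in examples
        if len(tokenize(example)) <= max_token_sequence_length
    ]


def toy_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for a real tokenizer: one token per whitespace-separated word.
    return text.split()


if __name__ == "__main__":
    blocks = ["a b c", "x" * 100, "a b c d e f g h"]
    survivors = filter_blocks_by_estimated_bytes(blocks, 4, 4.0)
    final = filter_overlong_tokenized_examples(survivors, toy_tokenize, 4)
    print(final)
```

The byte estimate is only a cheap pre-filter; in this sketch the third block passes it (15 bytes against a 16-byte budget) but is rejected by the token-count check, which is why the second pass is the one that enforces the limit.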

- Branch model loading into _load_hf_model (existing) and _load_unsloth_model (lazy import)
- Unsloth uses use_gradient_checkpointing="unsloth" internally; HF Trainer gradient_checkpointing is disabled when Unsloth is active
- Unsloth max_seq_length set to exact max_token_sequence_length for precise memory allocation
- Full compatibility with existing merge_lora_and_save and GGUF conversion pipeline
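
A minimal sketch of the lazy-import and settings-branching idea follows. It does not reproduce _load_hf_model or _load_unsloth_model; `lazy_import_backend`, `build_backend_settings`, and the dictionary keys are hypothetical names chosen for illustration, and no real Unsloth or Transformers call is made.

```python
import importlib
from types import ModuleType
from typing import Any, Dict


def lazy_import_backend(module_name: str = "unsloth") -> ModuleType:
    """Import an optional backend only when it is requested.

    Keeping the import inside a function means the default path never
    requires the optional dependency to be installed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Backend '{module_name}' was requested but is not installed."
        ) from exc


def build_backend_settings(
    use_unsloth: bool,
    max_token_sequence_length: int,
    gradient_checkpointing: bool,
) -> Dict[str, Any]:
    """Return illustrative settings for the chosen backend (hypothetical keys)."""
    if use_unsloth:
        return {
            "backend": "unsloth",
            # Exact match with the data pipeline's hard token limit.
            "max_seq_length": max_token_sequence_length,
            # Unsloth handles checkpointing itself ...
            "use_gradient_checkpointing": "unsloth",
            # ... so the trainer-level flag is switched off.
            "trainer_gradient_checkpointing": False,
        }
    return {
        "backend": "hf",
        "trainer_gradient_checkpointing": gradient_checkpointing,
    }


if __name__ == "__main__":
    print(build_backend_settings(True, 512, True))
    print(build_backend_settings(False, 512, True))
```

Deferring the import this way is what lets the standard HF Trainer path stay the default on platforms where the optional dependency is unavailable.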

Empty subblocks caused a ValueError crash in np.random.choice(0, size=1). Added an early return [] when len(subblock_ranges) == 0. This is safe because empty blocks cannot generate FIM examples.

The code block truncation code_block_utf8[:max_bytes_per_subblock] cut Tree-sitter nodes in the middle of multi-byte UTF-8 characters (for example, the box character █ = e2 96 88 became e2 96), causing a UnicodeDecodeError at the 512-token limit. Removed the truncation. This is safe because overlong examples are filtered out after tokenization.
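
Both failure modes can be reproduced in a few lines. The sketch below is illustrative only: `decodes_cleanly` and `pick_subblock_indices` are hypothetical helpers, and the standard-library `random.Random` is used here as a stand-in for the np.random.choice call mentioned above.

```python
import random
from typing import List, Sequence, Tuple


def decodes_cleanly(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def pick_subblock_indices(
    subblock_ranges: Sequence[Tuple[int, int]],
    num_examples: int,
    rng: random.Random,
) -> List[int]:
    """Sample subblock indices, returning [] early when there is nothing to sample."""
    if len(subblock_ranges) == 0:
        # An empty block cannot produce FIM examples, so skip sampling entirely.
        return []
    return [rng.randrange(len(subblock_ranges)) for _ in range(num_examples)]


if __name__ == "__main__":
    full_block = "█".encode("utf-8")  # three bytes: e2 96 88
    cut_block = full_block[:2]        # byte-level slice leaves e2 96
    print(full_block.hex(" "), decodes_cleanly(full_block))
    print(cut_block.hex(" "), decodes_cleanly(cut_block))
    print(pick_subblock_indices([], 1, random.Random(0)))
```

Slicing bytes at an arbitrary offset is only safe for pure ASCII input, which is why discarding whole blocks is the simpler fix than trying to truncate on a character boundary.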

cuolm added 9 commits May 1, 2026 16:48

tests: add unsloth model loading tests

a970cb6

tests: add required max_token_sequence_length to textwrap config tests

5f437ef

build: add unsloth optional dependency with linux/win32 platform markers

7f19a81

chore: add script for token length analysis of dataset

005c75d

chore: add unsloth compiled cache folder to gitignore

edf46dd

chore: update codefinetuner configuration parameters

5bfe460

cuolm merged commit fc6161b into master May 6, 2026
4 checks passed

cuolm merged 9 commits into master from feat/unsloth-integration
