Reproduction Code of "Task Assignment meets Annotator Modeling: Human-LLM Collaborative Annotation with Constraints"
Caution
On Apple Silicon, linear programming (PuLP/CBC) may be unstable depending on environment settings. If needed, run under Rosetta.
uv sync

Download NLTK resources if required:
uv run python -m nltk.downloader all

docker compose up -d
docker compose exec Human_LLM_collaborative_annotation bash

Run commands in the container:
python src/main.py ...

Or from host:
docker compose exec Human_LLM_collaborative_annotation python src/main.py ...

- Hugging Face:
huggingface-cli login
- Weights & Biases:
wandb login
At minimum, the following data files must be present:
- data/tweet_eval/tweet_eval_annotated_with_llm.csv
- data/word-sets.json (used by tweet_eval_vocab preprocessing)
Missing split files (for example tweet_eval_10_train.csv) are generated automatically during preprocessing.
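
A quick preflight check can confirm that the required files are present before a run is launched. The sketch below is only an illustration: `missing_required_files` is a hypothetical helper that is not part of this repository, and it assumes the paths are relative to the repository root.

```python
from pathlib import Path

# Files the README lists as required; split files such as
# tweet_eval_10_train.csv are generated during preprocessing and are not checked.
REQUIRED_FILES = (
    "data/tweet_eval/tweet_eval_annotated_with_llm.csv",
    "data/word-sets.json",
)


def missing_required_files(root="."):
    """Return the required files that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_required_files()
    if missing:
        print("Missing required files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All required files are present.")
```

Run it from the repository root; an empty result means both files were found.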
This project uses Hydra. Any setting can be overridden from the command line with dot notation:
uv run python src/main.py wandb_enabled=false trainer.seed=20

3.1 Base Config
Base settings live in config/config.yaml.
| Setting | Default | Description |
|---|---|---|
| `defaults` | `dataset: spiral`, `trainer: spiral`, `model: common_confusion` | Hydra config groups loaded by default. |
| `method` | `train` | Execution method. Use `train`, `confusion`, or `icrowd` depending on the experiment config. |
| `name` | `${trainer.dataset_name} experiment` | Run name passed to W&B when logging is enabled. |
| `debug` | `false` | Enables short debug behavior in training and changes `commit_hash` to `debug_mode`. |
| `abci` | `false` | Appends `_abci` to the run name in the common training path. |
| `mode` | `train` | Main execution mode. The current entry point supports `train`. |
| `annotator_num` | `6` | Number of annotators or systems used by model and assignment configs. |
| `commit_hash` | `${commit_hash: ${debug}}` | Output namespace generated by the Hydra resolver. |
| `wandb_enabled` | `true` | Enables W&B initialization and W&B logger usage. Set `false` to run without W&B. |
| `wandb_entity` | `kei-moriyama-the-university-of-tokyo` | W&B entity. |
| `wandb_project` | `task assignment` | W&B project name. |
| `logger._target_` | `lightning.pytorch.loggers.WandbLogger` | Lightning logger class used when `wandb_enabled=true`. |
| `train.epoch` | `150` | Number of training epochs unless overridden by trainer or debug behavior. |
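
To illustrate what a dot-notation override does to these settings, the sketch below applies `key=value` strings to a nested dictionary. It only mimics the idea and is not Hydra's implementation: `apply_overrides` is a hypothetical helper, the seed in `base` is a placeholder rather than the project's default, and unlike Hydra the helper silently creates keys that do not exist yet.

```python
import copy


def _parse(value):
    """Convert an override value to bool, int, or float when possible."""
    if value in ("true", "false"):
        return value == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def apply_overrides(config, overrides):
    """Return a copy of `config` with `a.b.c=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk down the nested dicts, creating levels that are missing.
            node = node.setdefault(name, {})
        node[leaf] = _parse(raw_value)
    return result


# Illustrative subset of the config; trainer.seed here is a placeholder value.
base = {"wandb_enabled": True, "train": {"epoch": 150}, "trainer": {"seed": 0}}
updated = apply_overrides(base, ["wandb_enabled=false", "trainer.seed=20"])
print(updated)
```

Each dot in the key selects one nesting level, so `trainer.seed=20` changes only the `seed` entry inside `trainer` and leaves `train.epoch` untouched.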
Disable W&B logging with:
uv run python src/main.py wandb_enabled=false

3.2 Experiment Configs
Experiment configs are selected with +experiment=<name>.
| Config | Main overrides |
|---|---|
| `tweet_eval_confusion` | Uses `trainer: sentiment`, `dataset: tweet_eval`, `model: confusion`, `LossConfusion`, and `ConfusionModel`. |
| `tweet_eval_confusion_cost_const` | Adds `trainer.random_assignment=false` and `model.CostConstraint` with `cost_per_annotator` and `total_cost_per_annotator`. |
| `tweet_eval_learning_to_defer_assignment` | Uses `dataset: tweet_eval_vocab`, `model: linear`, `LossLearningToDefer`, `MatchingBatchModel`, and `assign_interval`. |
| `tweet_eval_icrowd_assignment` | Sets `method=icrowd` and uses `train.icrowd.NLPICrowdTaskAssignment`. |
| `spiral_different_test_num_confusion` | Uses `dataset: spiral_test_num`, `ConfusionModel`, and `MaximumNumberConstraint`. |
| `spiral_different_test_num_confusion_cost` | Uses `dataset: spiral_test_num`, `ConfusionModel`, and `CostConstraint`. |
All commands below assume uv.
For Docker, replace uv run with docker compose exec Human_LLM_collaborative_annotation.
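
The sweep commands in this section use Hydra's multirun flag `-m`, where each comma-separated value list is combined with every other list, so the number of jobs grows multiplicatively. The sketch below enumerates those combinations with `itertools.product` for planning purposes. `expand_sweep` is a hypothetical helper, not part of Hydra or this repository, and the order it produces is not guaranteed to match Hydra's job numbering.

```python
from itertools import product


def expand_sweep(overrides):
    """Expand `key=a,b,c` overrides into one override list per job."""
    keys, choices = [], []
    for item in overrides:
        key, _, values = item.partition("=")
        keys.append(key)
        choices.append(values.split(","))
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*choices)
    ]


jobs = expand_sweep([
    "debug=false",
    "+experiment=tweet_eval_confusion",
    "dataset.sampling_rate=0.5,0.6,0.7,0.8,0.9,1.0",
    "trainer.seed=10,20,30,40,50",
])
print(len(jobs))  # 6 sampling rates x 5 seeds = 30 jobs
print(jobs[0])
```

Counting jobs this way before launching a sweep helps estimate the total runtime of a sweep.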
uv run python src/main.py -m \
debug=false \
+experiment=tweet_eval_confusion \
trainer.seed=10,20,30,40,50

uv run python src/main.py -m \
debug=false \
+experiment=tweet_eval_confusion_cost_const \
trainer.seed=10,20,30,40,50

uv run python src/main.py -m \
debug=false \
+experiment=tweet_eval_learning_to_defer_assignment \
trainer.seed=10,20,30,40,50

uv run python src/main.py -m \
debug=false \
+experiment=tweet_eval_icrowd_assignment \
trainer.seed=10,20,30,40,50

uv run python src/main.py -m \
debug=false \
+experiment=tweet_eval_confusion \
dataset.sampling_rate=0.5,0.6,0.7,0.8,0.9,1.0 \
trainer.seed=10,20,30,40,50

uv run python src/main.py -m \
debug=false \
+experiment=tweet_eval_confusion \
dataset.filter_ratio=0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0 \
trainer.seed=10,20,30,40,50

uv run python src/main.py -m \
debug=false \
+experiment=spiral_different_test_num_confusion \
dataset.test_data_num=10000,30000,50000,100000 \
trainer.seed=10,20,30,40,50

uv run python src/main.py -m \
debug=false \
+experiment=spiral_different_test_num_confusion_cost \
dataset.test_data_num=10000,30000,50000,100000 \
trainer.seed=10,20,30,40,50

Main outputs are written to:
outputs/<commit_hash>/<dataset>/<loss_or_method>/<seed>/
Typical artifacts:
- score_test.json
- W&B tables/metrics
With debug=true, commit_hash becomes debug_mode.
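
To compare runs across seeds, the score files can be gathered by walking the output layout described above. The sketch below assumes each seed directory contains a `score_test.json` holding valid JSON. The schema of that file is not documented here, so the contents are returned as loaded, without interpretation. `collect_scores` is a hypothetical helper, and the dataset and loss names in the usage lines are placeholders.

```python
import json
from pathlib import Path


def collect_scores(outputs_dir, commit_hash, dataset, loss_or_method):
    """Map each seed directory name to the parsed contents of its score_test.json."""
    run_root = Path(outputs_dir) / commit_hash / dataset / loss_or_method
    scores = {}
    for score_file in sorted(run_root.glob("*/score_test.json")):
        seed = score_file.parent.name  # the <seed> path component
        with score_file.open(encoding="utf-8") as handle:
            scores[seed] = json.load(handle)
    return scores


if __name__ == "__main__":
    # Placeholder arguments: substitute the names found under your outputs/ directory.
    found = collect_scores("outputs", "debug_mode", "tweet_eval", "LossConfusion")
    for seed, score in found.items():
        print(seed, score)
```

An empty dictionary means no `score_test.json` was found under the given combination of names.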
{
WIP
}