babalablab/task-assignment

Reproduction Code for "Task Assignment meets Annotator Modeling: Human-LLM Collaborative Annotation with Constraints"

Caution

On Apple Silicon, the linear programming solver (PuLP with CBC) may be unstable depending on the environment. If you hit solver problems, run the project under Rosetta.
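
To find out whether this caution applies to your interpreter, a small check like the one below may help. It is only a sketch: the helper name is hypothetical and not part of this repository, and it relies on the usual behavior that a Python process running under Rosetta reports x86_64 rather than arm64.

```python
import platform


def is_native_apple_silicon(system=None, machine=None):
    """Return True when Python runs natively on Apple Silicon (macOS, arm64).

    Hypothetical helper for illustration. The arguments default to the
    values reported by the running interpreter.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"


if is_native_apple_silicon():
    print("Native Apple Silicon: PuLP/CBC may be unstable, consider Rosetta.")
else:
    print("Not running natively on Apple Silicon.")
```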

1. Environment Setup

1.1 Using uv (recommended)

uv sync

Download NLTK resources if required:

uv run python -m nltk.downloader all

1.2 Using Docker

docker compose up -d
docker compose exec Human_LLM_collaborative_annotation bash

Run commands in the container:

python src/main.py ...

Or from the host:

docker compose exec Human_LLM_collaborative_annotation python src/main.py ...

2. Authentication and Required Data

2.1 Login

  • Hugging Face:
    huggingface-cli login
  • Weights & Biases:
    wandb login

2.2 Required local files

At minimum, the following files must exist locally:

  • data/tweet_eval/tweet_eval_annotated_with_llm.csv
  • data/word-sets.json (used by tweet_eval_vocab preprocessing)

Missing split files (for example tweet_eval_10_train.csv) are generated automatically during preprocessing.
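
Before launching a long run, it can be useful to confirm that the two required files are in place. The following is a minimal sketch, not part of the repository: the function name is hypothetical, and it assumes the script is run from the repository root.

```python
from pathlib import Path

# Paths taken from the list of required local files above.
REQUIRED_FILES = (
    "data/tweet_eval/tweet_eval_annotated_with_llm.csv",
    "data/word-sets.json",
)


def missing_required_files(repo_root="."):
    """Return the required files that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_required_files()
    if missing:
        print("Missing required files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All required files are present.")
```

Split files are deliberately left out of the check, since they are generated during preprocessing.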

3. Configuration

This project uses Hydra. Any setting can be overridden from the command line with dot notation:

uv run python src/main.py wandb_enabled=false trainer.seed=20
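
To illustrate what dot notation means, the sketch below applies key=value overrides to a nested dictionary. It is a simplified pure-Python model and not Hydra's actual implementation (Hydra's override grammar is richer). The function names and the starting config values are assumptions chosen for the example.

```python
from typing import Any


def parse_value(raw: str) -> Any:
    """Convert an override value to bool, int, or float when possible."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply key=value overrides; dotted keys address nested dicts."""
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return config


# Hypothetical starting values, for illustration only.
config = {"wandb_enabled": True, "trainer": {"seed": 10}}
apply_overrides(config, ["wandb_enabled=false", "trainer.seed=20"])
print(config)  # {'wandb_enabled': False, 'trainer': {'seed': 20}}
```

Here trainer.seed=20 reaches the seed key inside the trainer group, while wandb_enabled=false changes a top-level key.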

3.1 Base Config

Base settings live in config/config.yaml.

| Setting | Default | Description |
| --- | --- | --- |
| defaults | dataset: spiral, trainer: spiral, model: common_confusion | Hydra config groups loaded by default. |
| method | train | Execution method. Use train, confusion, or icrowd depending on the experiment config. |
| name | ${trainer.dataset_name} experiment | Run name passed to W&B when logging is enabled. |
| debug | false | Enables short debug behavior in training and changes commit_hash to debug_mode. |
| abci | false | Appends _abci to the run name in the common training path. |
| mode | train | Main execution mode. The current entry point supports train. |
| annotator_num | 6 | Number of annotators or systems used by model and assignment configs. |
| commit_hash | ${commit_hash: ${debug}} | Output namespace generated by the Hydra resolver. |
| wandb_enabled | true | Enables W&B initialization and W&B logger usage. Set to false to run without W&B. |
| wandb_entity | kei-moriyama-the-university-of-tokyo | W&B entity. |
| wandb_project | task assignment | W&B project name. |
| logger._target_ | lightning.pytorch.loggers.WandbLogger | Lightning logger class used when wandb_enabled=true. |
| train.epoch | 150 | Number of training epochs unless overridden by trainer or debug behavior. |

Disable W&B logging with:

uv run python src/main.py wandb_enabled=false

3.2 Experiment Configs

Experiment configs are selected with +experiment=<name>.

| Config | Main overrides |
| --- | --- |
| tweet_eval_confusion | Uses trainer: sentiment, dataset: tweet_eval, model: confusion, LossConfusion, and ConfusionModel. |
| tweet_eval_confusion_cost_const | Adds trainer.random_assignment=false and model.CostConstraint with cost_per_annotator and total_cost_per_annotator. |
| tweet_eval_learning_to_defer_assignment | Uses dataset: tweet_eval_vocab, model: linear, LossLearningToDefer, MatchingBatchModel, and assign_interval. |
| tweet_eval_icrowd_assignment | Sets method=icrowd and uses train.icrowd.NLPICrowdTaskAssignment. |
| spiral_different_test_num_confusion | Uses dataset: spiral_test_num, ConfusionModel, and MaximumNumberConstraint. |
| spiral_different_test_num_confusion_cost | Uses dataset: spiral_test_num, ConfusionModel, and CostConstraint. |

4. Reproduction Commands

All commands below assume uv. For Docker, replace uv run with docker compose exec Human_LLM_collaborative_annotation.

4.1 Case 1: Full Annotation on Large Dataset

Ours (maximum-assignment constraint)

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_confusion \
  trainer.seed=10,20,30,40,50

Ours (cost constraint)

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_confusion_cost_const \
  trainer.seed=10,20,30,40,50

Baseline: L2D + assignment

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_learning_to_defer_assignment \
  trainer.seed=10,20,30,40,50

Baseline: iCrowd + assignment

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_icrowd_assignment \
  trainer.seed=10,20,30,40,50

4.2 Case 2: Full Annotation on Small Dataset (sampling-rate sweep)

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_confusion \
  dataset.sampling_rate=0.5,0.6,0.7,0.8,0.9,1.0 \
  trainer.seed=10,20,30,40,50
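
With -m (multirun), Hydra launches one job per combination of the comma-separated values, so the sweep above amounts to 6 sampling rates times 5 seeds. The sketch below only illustrates that expansion with itertools.product. It does not call Hydra, the helper name is hypothetical, and the order in which Hydra actually schedules jobs is not something this example asserts.

```python
from itertools import product

# Sweep values copied from the Case 2 command.
sweep = {
    "dataset.sampling_rate": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "trainer.seed": [10, 20, 30, 40, 50],
}


def expand_sweep(sweep: dict[str, list]) -> list[list[str]]:
    """Return one list of key=value overrides per combination of values."""
    keys = list(sweep)
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*(sweep[key] for key in keys))
    ]


jobs = expand_sweep(sweep)
print(len(jobs))  # 30
print(jobs[0])    # ['dataset.sampling_rate=0.5', 'trainer.seed=10']
```

The same arithmetic gives a rough budget for the other sweeps: any command that varies only trainer.seed over five values runs five jobs.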

4.3 Case 3: Partial Annotation (annotation/filter-rate sweep)

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_confusion \
  dataset.filter_ratio=0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0 \
  trainer.seed=10,20,30,40,50

4.4 Linear Programming Runtime Evaluation (Spiral)

Maximum-assignment constraint

uv run python src/main.py -m \
  debug=false \
  +experiment=spiral_different_test_num_confusion \
  dataset.test_data_num=10000,30000,50000,100000 \
  trainer.seed=10,20,30,40,50

Cost constraint

uv run python src/main.py -m \
  debug=false \
  +experiment=spiral_different_test_num_confusion_cost \
  dataset.test_data_num=10000,30000,50000,100000 \
  trainer.seed=10,20,30,40,50

5. Output Artifacts

Main outputs are written to:

  • outputs/<commit_hash>/<dataset>/<loss_or_method>/<seed>/

Typical artifacts:

  • score_test.json
  • W&B tables/metrics

With debug=true, commit_hash becomes debug_mode.
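
After a multi-seed run, the per-seed score_test.json files can be gathered with a short script such as the sketch below. It follows the directory layout described above, but the function names are hypothetical, and it assumes each score_test.json holds a JSON object whose top-level numeric values are metrics. The actual file schema is not documented here, so adjust accordingly.

```python
import json
from pathlib import Path
from statistics import mean


def collect_scores(run_dir):
    """Load <seed>/score_test.json for every seed directory under run_dir.

    run_dir is expected to be outputs/<commit_hash>/<dataset>/<loss_or_method>.
    """
    scores = {}
    for path in sorted(Path(run_dir).glob("*/score_test.json")):
        with path.open() as f:
            scores[path.parent.name] = json.load(f)
    return scores


def mean_numeric_metrics(scores):
    """Average every top-level numeric value across seeds.

    Assumes each loaded file is a JSON object (an assumption, see above).
    Non-numeric and boolean values are ignored.
    """
    values = {}
    for metrics in scores.values():
        for name, value in metrics.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                values.setdefault(name, []).append(value)
    return {name: mean(vals) for name, vals in values.items()}
```

For example, mean_numeric_metrics(collect_scores(run_dir)) returns one averaged value per metric name, where run_dir points at a concrete outputs/<commit_hash>/<dataset>/<loss_or_method> directory.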

6. Citation

{
    WIP
}
