K-BrowseComp

Introduction

K-BrowseComp is a Korean web-browsing agent benchmark inspired by BrowseComp. It evaluates whether agents can retrieve hard-to-find public information from Korean websites, keep track of multi-hop or parallel evidence, and return a single stable short answer grounded in Korean contexts.

📄 Paper: K-BrowseComp

🤗 Dataset: K-BrowseComp

K-BrowseComp contains 400 problems:

K-BrowseComp-Verified: 300 problems manually written and validated by native Korean speakers.
K-BrowseComp-Synthetic: 100 machine-generated diagnostic problems created with hard few-shot exemplars and failure-mode-targeted generation. (See automated_data_gen/ for the generation pipeline.)

Each verified item is designed to require either multi-hop reasoning or parallel-branching constraint satisfaction over public Korean web evidence. The dataset release includes the problem, gold answer, expected trajectory, source URLs, and checklist values for key intermediate evidence, enabling trajectory diagnostics in addition to final-answer scoring.

In our experiments, even the strongest evaluated model remains below 50% on K-BrowseComp-Verified and reaches only 26% on the Synthetic split.

The benchmark therefore tests more than whether a model can search in Korean: it tests whether the model can preserve the exact evidence chain needed to answer Korean users' browsing questions reliably.

K-BrowseComp Search Evals

Evaluation and automated problem-generation code for K-BrowseComp.

This repository is adapted from perplexityai/search_evals.

Runtime evals download K-BrowseComp from prometheus-eval/k-browsecomp by default. The repo also keeps fallback local JSONL copies:

search_evals/datasets/ko_browsecomp.jsonl
search_evals/datasets/ko_browsecomp_synthetic.jsonl

Generated run outputs, model trajectories, seed-bank artifacts, local API keys, virtual environments, and logs are ignored by git.

License and Attribution

Unless otherwise noted, this repository is distributed under the MIT License; see LICENSE.

The evaluation framework is adapted from perplexityai/search_evals, which is also MIT-licensed. Files adapted from that repository include file-level source comments at the top of the file.

Setup

uv sync

Python dependencies are managed with uv. Most commands below should be run from the repository root.

Colab Quickstart

Run the Colab notebook to evaluate gpt-5-nano on a dry-run K-BrowseComp sample with only notebook cells. The notebook downloads the dataset from prometheus-eval/k-browsecomp:

The first cell is the only cell you need to edit. It asks for OPENAI_API_KEY and one search API key for the selected search engine.

Credentials

search_evals/run_eval.py reads credentials from environment variables.

export OPENAI_API_KEY="..."
export PERPLEXITY_API_KEY="..."

Use the matching search key for the search engine you select:

Search engine	Required env var
`perplexity` / `perplexity-long`	`PERPLEXITY_API_KEY`
`brave`	`BRAVE_API_KEY`
`exa`	`EXA_API_KEY`
`tavily`	`TAVILY_API_KEY`

Model keys depend on the model prefix:

Model prefix	Required env var
`gpt...`	`OPENAI_API_KEY`
`gemini...`	`GEMINI_API_KEY` or `GOOGLE_API_KEY`
`claude...`	`ANTHROPIC_API_KEY`
`openrouter:...`	`OPENROUTER_API_KEY`
`litellm_proxy/...`	`LITELLM_PROXY_BASE_URL` and `LITELLM_PROXY_API_KEY`

For automated_data_gen/problem_generation/, gitignored files under api_keys/ are also auto-loaded by _load_env.py; see the autogen section below.

Run Evaluations

By default, suite=ko_browsecomp loads the verified config and test split from prometheus-eval/k-browsecomp.

Run a small live sample:

uv run python search_evals/run_eval.py \
  search_engine=perplexity \
  model=gpt-5.4-mini \
  suite=ko_browsecomp \
  dry_run=true \
  run_suffix=smoke

For K-BrowseComp, dry_run=true runs the first 10 samples and still calls the model, search engine, and grader APIs. Use run_suffix to make the output folder easier to distinguish.

Run the full default dataset:

uv run python search_evals/run_eval.py \
  search_engine=perplexity \
  model=gpt-5.4-mini \
  suite=ko_browsecomp

Run the synthetic dataset:

uv run python search_evals/run_eval.py \
  search_engine=perplexity \
  model=gpt-5.4-mini \
  suite=ko_browsecomp \
  hf_dataset_config=synthetic \
  dry_run=true \
  run_suffix=synthetic-smoke

You can still evaluate a local JSONL file by passing dataset_path=...; when dataset_path is set, the Hugging Face dataset settings are ignored.

Outputs are written under:

per-question task files: runs/<search>-<model>_<suite>[-suffix]/
aggregate score: runs/results/<search>-<model>_<suite>[-suffix].json

By default, existing per-question task files are reused. To delete cached files for a run and start over:

uv run python search_evals/run_eval.py \
  search_engine=perplexity \
  model=gpt-5.4-mini \
  suite=ko_browsecomp \
  dry_run=true \
  run_suffix=smoke \
  rerun=true

Automated Problem Generation

Autogen code lives under automated_data_gen/.

The main Stage 3 problem-generation entry point is:

cd automated_data_gen/problem_generation
python3 run.py --litellm --max-seeds 1

Default adversarial target models are:

azure_ai/gpt-5.4-mini when --litellm is enabled; otherwise gpt-5.4-mini
gemini/gemini-3-flash-preview

Acceptance requires every target model to fail for a target failure mode. You can override the list:

python3 run.py --litellm \
  --target-models "azure_ai/gpt-5.4-mini,gemini/gemini-3-flash-preview"

Autogen credentials can be exported as environment variables or placed in gitignored files:

File	Env var(s) populated
`api_keys/api_key_perplexity.txt`	`PERPLEXITY_API_KEY`
`api_keys/litellm.txt`	`LITELLM_PROXY_API_KEY`, `OPENAI_API_KEY`
`api_keys/base_url.txt`	`LITELLM_PROXY_BASE_URL`, `OPENAI_BASE_URL`
`api_keys/gemini.txt`	`GEMINI_API_KEY`, `GOOGLE_API_KEY`

Autogen --dry-run has a different meaning from eval dry_run=true: it only renders per-seed task_spec.md files and does not launch the proposer loop.

cd automated_data_gen/problem_generation
python3 run.py --dry-run --seed-json /path/to/local_seed.json --max-seeds 1

Local/generated seed material is not part of the public repo. If you have a private seed file, pass it with --seed-json; otherwise Stage 3 looks for the gitignored Stage 2 aggregate files under automated_data_gen/seed_expansion/.

Tests

uv run --extra dev pytest

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.github/workflows		.github/workflows
automated_data_gen		automated_data_gen
docs		docs
examples		examples
misc		misc
search_evals		search_evals
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

K-BrowseComp

Introduction

K-BrowseComp Search Evals

License and Attribution

Setup

Colab Quickstart

Credentials

Run Evaluations

Automated Problem Generation

Tests

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

K-BrowseComp

Introduction

K-BrowseComp Search Evals

License and Attribution

Setup

Colab Quickstart

Credentials

Run Evaluations

Automated Problem Generation

Tests

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages