GML-2088: Add regression testing framework with evaluation metrics by prinskumar-tigergraph · Pull Request #43 · tigergraph/graphrag

prinskumar-tigergraph · 2026-06-15T17:43:37Z

PR Type

Tests, Enhancement

Description

Add GraphRAG regression evaluation framework
Score hallucination confidence with reasons
Automate graph setup and evaluation runs
Mount regression datasets into container

Diagram Walkthrough

flowchart LR
  dataset["Test dataset"] --> setup["Graph setup"]
  setup -- "creates and ingests" --> graph["GraphRAG graph"]
  graph -- "queries" --> evaluator["Regression evaluator"]
  evaluator -- "scores" --> metrics["Correctness and hallucination"]
  metrics -- "writes" --> results["CSV summary"]

File Walkthrough

Relevant files

Enhancement

1 file

agent_hallucination_check.py `Return hallucination confidence with reasoning`	+94/-22

Tests

5 files

evaluator.py `Add GraphRAG regression evaluation CLI`	+748/-0
setup_graph.py `Add regression graph setup automation`	+349/-0
run_eval.sh `Add containerized evaluation runner script`	+45/-0
run_load_eval.sh `Add exported graph evaluation runner`	+85/-0
run_setup.sh `Add containerized graph setup runner`	+30/-0

Dependencies

1 file

requirements.txt `Bump pyTigerGraph minimum dependency version`	+1/-1

Configuration changes

1 file

docker-compose.yml `Mount regression test directories into container`	+2/-0

tg-pr-agent · 2026-06-15T17:44:20Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Breaking Change

`HallucinationCheckResponse` changed from a `score` field to `confidence` and `reason`. Existing callers that still read `score` or expect the previous yes/no schema may fail unless all usages are updated or backward compatibility is provided.

class HallucinationCheckResponse(BaseModel):
    confidence: float = Field(
        description=(
            "Confidence score between 0.0 and 1.0 that the answer is hallucinated. "
            "0.0 = fully grounded in the context, 1.0 = completely hallucinated. "
            "Use the full range — e.g. 0.2 for mostly grounded with minor gaps, "
            "0.8 for mostly unsupported claims."
        )
    )
    reason: str = Field(
        description=(
            "A concise one-to-two sentence explanation of why you assigned this "
            "confidence score, citing specific claims from the answer and context."
        )
    )

Incomplete Feature

`--load-exported` is exposed by the CLI and wrapper script, but `load_exported()` is still a stub that always exits with failure. This makes the advertised load-and-evaluate workflow unusable.

def load_exported(graphname: str, exported_dir: str) -> None:
    """TODO: Restore graph from ExportedGraph/ folder.

    Implement using TigerGraph's export/import mechanism once confirmed.
    Options:
      - GSQL: DROP GRAPH + CREATE GRAPH + load GBAR
      - GraphRAG restore API endpoint (if available)
      - pyTigerGraph backup restore
    """
    if not os.path.isdir(exported_dir):
        sys.exit(
            f"ERROR: ExportedGraph directory not found: {exported_dir}\n"
            f"Run setup first or add exported graph files to this directory."
        )
    print(f"\n --load-exported is not yet implemented.")
    print(f" Edit load_exported() in setup_graph.py with the TigerGraph")
    print(f" import command for your environment, then rerun.\n")
    sys.exit(1)

Edge Case

`run_eval()` can create `ThreadPoolExecutor(max_workers=0)` when the dataset contains zero questions, which raises a runtime error instead of producing a clear validation message.

    total = len(questions)
    workers = min(8, total)
    results: List[Optional[EvalResult]] = [None] * total
    printer = _print_detailed if detailed else _print_compact
    with ThreadPoolExecutor(max_workers=workers, thread_name_prefix="eval") as pool:
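
One possible way to handle the empty-dataset edge case is to validate the question count before the pool is created. The sketch below is not the PR's code: `run_eval_guarded` and its `evaluate` callback are hypothetical stand-ins for `run_eval()` and its per-question work, and raising `ValueError` is an assumed choice of error type.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def run_eval_guarded(
    questions: List[str],
    evaluate: Callable[[str], str],  # hypothetical per-question evaluation callback
) -> List[Optional[str]]:
    """Evaluate questions in parallel, rejecting an empty dataset up front."""
    total = len(questions)
    if total == 0:
        # ThreadPoolExecutor(max_workers=0) raises ValueError with a generic
        # message, so fail earlier with one that names the real problem.
        raise ValueError("dataset contains no questions; nothing to evaluate")

    workers = min(8, total)
    results: List[Optional[str]] = [None] * total
    with ThreadPoolExecutor(max_workers=workers, thread_name_prefix="eval") as pool:
        # Map each future back to its question index so results keep input order.
        futures = {pool.submit(evaluate, q): i for i, q in enumerate(questions)}
        for future, index in futures.items():
            results[index] = future.result()
    return results
```

A CLI entry point could catch this error and print it before exiting non-zero, so an empty dataset is reported as a validation failure and not as a traceback.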

tg-pr-agent · 2026-06-15T17:45:37Z

+    sys.stdout.flush()
+    sys.stderr.flush()
+    os._exit(1 if any(r.error for r in results) else 0)


Suggestion: Make the evaluator fail when a question is unanswered or any metric fails, not only when the HTTP query itself errors. Otherwise a regression run can exit successfully even though hallucination/correctness checks were skipped or failed for every row. [possible issue, importance: 7]

Suggested change

-    sys.stdout.flush()
-    sys.stderr.flush()
-    os._exit(1 if any(r.error for r in results) else 0)
+    sys.stdout.flush()
+    sys.stderr.flush()
+    exit_code = 1 if any(
+        r.error or not r.answered_question or r.metric_errors
+        for r in results
+    ) else 0
+    os._exit(exit_code)
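
The suggested exit-code rule can be isolated into a small function, which makes it testable without running a full evaluation. This is a sketch under assumptions: the `EvalResult` dataclass here is a minimal stand-in that carries only the three fields named in the suggestion (their real types in `evaluator.py` are not shown in this thread), and treating a `None` slot as a failure is an added assumption based on the `List[Optional[EvalResult]]` annotation quoted in the reviewer guide.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class EvalResult:
    """Minimal stand-in; field types are assumed, not taken from the PR."""
    error: Optional[str] = None
    answered_question: bool = True
    metric_errors: List[str] = field(default_factory=list)


def compute_exit_code(results: Iterable[Optional[EvalResult]]) -> int:
    """Return 1 if any row errored, went unanswered, or had a metric failure."""
    failed = any(
        # A slot that was never filled in (None) is counted as a failure (assumption).
        r is None or r.error or not r.answered_question or r.metric_errors
        for r in results
    )
    return 1 if failed else 0
```

With this shape, a run in which every query succeeds over HTTP but a metric check fails for one row still returns a non-zero exit code.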

tg-pr-agent · 2026-06-15T17:45:37Z

+    ingest_info = client.post(f"/ui/{graphname}/create_ingest", json_body={
+        "data_source":        "server",
+        "data_source_config": {"data_path": folder_path},
+        "loader_config":      {},
+        "file_format":        "json",
+    })


Suggestion: Do not hard-code file_format to json when data/ may contain other document types. Derive and validate the extension before creating the ingest job so non-JSON datasets do not get ingested with the wrong loader. [possible issue, importance: 7]

Suggested change

-    ingest_info = client.post(f"/ui/{graphname}/create_ingest", json_body={
-        "data_source":        "server",
-        "data_source_config": {"data_path": folder_path},
-        "loader_config":      {},
-        "file_format":        "json",
-    })
+    extensions = {
+        os.path.splitext(fp)[1].lstrip(".").lower()
+        for fp in doc_files
+        if os.path.splitext(fp)[1]
+    }
+    if len(extensions) != 1:
+        raise SetupError(
+            f"data/ must contain one file type for ingest; found: {sorted(extensions)}"
+        )
+    file_format = extensions.pop()
+    ingest_info = client.post(f"/ui/{graphname}/create_ingest", json_body={
+        "data_source": "server",
+        "data_source_config": {"data_path": folder_path},
+        "loader_config": {},
+        "file_format": file_format,
+    })
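
The extension check in the suggestion can be tried in isolation. The sketch below assumes a hypothetical helper name, `detect_file_format`, and raises `ValueError` in place of `SetupError`, whose definition is not shown in this thread. It does not check whether the detected extension is a format the ingest endpoint actually accepts; that list is not given here.

```python
import os
from typing import Iterable


def detect_file_format(doc_files: Iterable[str]) -> str:
    """Return the single lowercase extension shared by all files, or raise."""
    extensions = {
        os.path.splitext(path)[1].lstrip(".").lower()
        for path in doc_files
        if os.path.splitext(path)[1]  # skip files with no extension
    }
    if len(extensions) != 1:
        # Covers both mixed file types and a folder with no usable files.
        raise ValueError(
            f"data/ must contain one file type for ingest; found: {sorted(extensions)}"
        )
    return next(iter(extensions))
```

Note that files without an extension are silently ignored by this check, which mirrors the suggested change; a stricter variant could reject them.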

tg-pr-agent · 2026-06-15T17:45:37Z

+LOAD_OUTPUT=$(docker exec -e PYTHONUNBUFFERED=1 "${CONTAINER}" \
+    python /code/tests/regression/setup_graph.py --load-exported "${SETUP_ARGS[@]}" | tee /dev/tty)


Suggestion: Avoid writing to /dev/tty while capturing LOAD_OUTPUT; many CI and non-interactive shells do not have a controlling TTY, causing tee to fail under set -euo pipefail. Tee to stderr instead so output is still visible while stdout remains capturable. [possible issue, importance: 8]

Suggested change

-LOAD_OUTPUT=$(docker exec -e PYTHONUNBUFFERED=1 "${CONTAINER}" \
-    python /code/tests/regression/setup_graph.py --load-exported "${SETUP_ARGS[@]}" | tee /dev/tty)
+LOAD_OUTPUT=$(docker exec -e PYTHONUNBUFFERED=1 "${CONTAINER}" \
+    python /code/tests/regression/setup_graph.py --load-exported "${SETUP_ARGS[@]}" | tee /dev/stderr)
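
The same idea, echoing a child's output to stderr while keeping stdout capturable, can be illustrated outside the shell. The Python sketch below is only an analogy to the `tee /dev/stderr` pattern and is not part of the PR; `run_and_capture` is a hypothetical helper, and the command passed to it in the test is a stand-in for the `docker exec` call.

```python
import subprocess
import sys
from typing import List


def run_and_capture(cmd: List[str]) -> str:
    """Run cmd, mirror its stdout to our stderr line by line, and return it."""
    captured: List[str] = []
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        assert proc.stdout is not None  # guaranteed by stdout=PIPE
        for line in proc.stdout:
            sys.stderr.write(line)  # stays visible without needing a TTY
            captured.append(line)
    # Leaving the with-block waits for the process, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return "".join(captured)
```

Because nothing here opens `/dev/tty`, the same code path works in an interactive terminal and in a CI job with no controlling terminal.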

GML-2088: Add regression testing framework with evaluation metrics

7335f76

prinskumar-tigergraph wants to merge 1 commit into main from GML-2088-Evaluation-Benchmarking
