
GML-2088: Add regression testing framework with evaluation metrics#43

Open
prinskumar-tigergraph wants to merge 1 commit into main from GML-2088-Evaluation-Benchmarking

prinskumar-tigergraph commented Jun 15, 2026

Contributor

PR Type

Tests, Enhancement


Description

  • Add GraphRAG regression evaluation framework

  • Score hallucination confidence with reasons

  • Automate graph setup and evaluation runs

  • Mount regression datasets into container


Diagram Walkthrough

flowchart LR
  dataset["Test dataset"] --> setup["Graph setup"]
  setup -- "creates and ingests" --> graph["GraphRAG graph"]
  graph -- "queries" --> evaluator["Regression evaluator"]
  evaluator -- "scores" --> metrics["Correctness and hallucination"]
  metrics -- "writes" --> results["CSV summary"]

File Walkthrough

Relevant files

Enhancement (1 file)
  • agent_hallucination_check.py: Return hallucination confidence with reasoning (+94/-22)

Tests (5 files)
  • evaluator.py: Add GraphRAG regression evaluation CLI (+748/-0)
  • setup_graph.py: Add regression graph setup automation (+349/-0)
  • run_eval.sh: Add containerized evaluation runner script (+45/-0)
  • run_load_eval.sh: Add exported graph evaluation runner (+85/-0)
  • run_setup.sh: Add containerized graph setup runner (+30/-0)

Dependencies (1 file)
  • requirements.txt: Bump pyTigerGraph minimum dependency version (+1/-1)

Configuration changes (1 file)
  • docker-compose.yml: Mount regression test directories into container (+2/-0)

tg-pr-agent Bot commented Jun 15, 2026


PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Breaking Change

HallucinationCheckResponse changed from a score field to confidence and reason. Existing callers that still read score or expect the previous yes/no schema may fail unless all usages are updated or backward compatibility is provided.

class HallucinationCheckResponse(BaseModel):
    confidence: float = Field(
        description=(
            "Confidence score between 0.0 and 1.0 that the answer is hallucinated. "
            "0.0 = fully grounded in the context, 1.0 = completely hallucinated. "
            "Use the full range — e.g. 0.2 for mostly grounded with minor gaps, "
            "0.8 for mostly unsupported claims."
        )
    )
    reason: str = Field(
        description=(
            "A concise one-to-two sentence explanation of why you assigned this "
            "confidence score, citing specific claims from the answer and context."
        )
    )
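
One way to soften the breaking change described above is a small compatibility shim that derives the old yes/no value from the new `confidence` field. The sketch below is only an illustration: it uses a plain dataclass instead of the PR's Pydantic model, and the `legacy_score` helper, the 0.5 threshold, and the assumption that the old `score` used "yes" to mean grounded are all hypothetical rather than taken from the PR.

```python
from dataclasses import dataclass


@dataclass
class HallucinationCheck:
    """Stand-in for the PR's HallucinationCheckResponse (illustrative only)."""
    confidence: float  # 0.0 = fully grounded, 1.0 = completely hallucinated
    reason: str


def legacy_score(check: HallucinationCheck, threshold: float = 0.5) -> str:
    """Map the new confidence value onto a yes/no string for old callers.

    Assumption: the old schema used "yes" for a grounded answer. The
    threshold is a hypothetical default, not a value from the PR.
    """
    if not 0.0 <= check.confidence <= 1.0:
        raise ValueError(f"confidence out of range: {check.confidence}")
    return "yes" if check.confidence < threshold else "no"
```

Callers that still expect the old field could call `legacy_score(response)` during a transition period, although updating every usage to read `confidence` and `reason` directly avoids choosing a threshold at all.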
Incomplete Feature

--load-exported is exposed by the CLI and wrapper script, but load_exported() is still a stub that always exits with failure. This makes the advertised load-and-evaluate workflow unusable.

def load_exported(graphname: str, exported_dir: str) -> None:
    """TODO: Restore graph from ExportedGraph/ folder.

    Implement using TigerGraph's export/import mechanism once confirmed.
    Options:
      - GSQL: DROP GRAPH + CREATE GRAPH + load GBAR
      - GraphRAG restore API endpoint (if available)
      - pyTigerGraph backup restore
    """
    if not os.path.isdir(exported_dir):
        sys.exit(
            f"ERROR: ExportedGraph directory not found: {exported_dir}\n"
            f"Run setup first or add exported graph files to this directory."
        )
    print(f"\n  --load-exported is not yet implemented.")
    print(f"  Edit load_exported() in setup_graph.py with the TigerGraph")
    print(f"  import command for your environment, then rerun.\n")
    sys.exit(1)
Edge Case

run_eval() can create ThreadPoolExecutor(max_workers=0) when the dataset contains zero questions, which raises a runtime error instead of producing a clear validation message.

total    = len(questions)
workers  = min(8, total)
results: List[Optional[EvalResult]] = [None] * total
printer  = _print_detailed if detailed else _print_compact

with ThreadPoolExecutor(max_workers=workers, thread_name_prefix="eval") as pool:
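
The empty-dataset case could be handled by validating the question count before the pool is created, since `ThreadPoolExecutor` raises `ValueError` when `max_workers` is 0. The following is a minimal sketch of that guard; `run_in_pool` and its `work` callback are hypothetical names used for illustration, not functions from `evaluator.py`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_in_pool(questions: Sequence[str], work: Callable[[str], str],
                max_workers: int = 8) -> List[str]:
    """Run `work` over every question, rejecting an empty dataset up front."""
    total = len(questions)
    if total == 0:
        # ThreadPoolExecutor(max_workers=0) raises ValueError, so fail early
        # with a message that names the actual problem.
        raise ValueError("dataset contains no questions; nothing to evaluate")
    workers = min(max_workers, total)
    with ThreadPoolExecutor(max_workers=workers, thread_name_prefix="eval") as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(work, questions))
```

In a CLI, the same check would more likely call `sys.exit` with a readable message; raising an exception here simply keeps the sketch easy to test.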

Comment on lines +746 to +748
sys.stdout.flush()
sys.stderr.flush()
os._exit(1 if any(r.error for r in results) else 0)

Suggestion: Make the evaluator fail when a question is unanswered or any metric fails, not only when the HTTP query itself errors. Otherwise a regression run can exit successfully even though hallucination/correctness checks were skipped or failed for every row. [possible issue, importance: 7]

Suggested change
- sys.stdout.flush()
- sys.stderr.flush()
- os._exit(1 if any(r.error for r in results) else 0)
+ sys.stdout.flush()
+ sys.stderr.flush()
+ exit_code = 1 if any(
+     r.error or not r.answered_question or r.metric_errors
+     for r in results
+ ) else 0
+ os._exit(exit_code)
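
The exit-code rule in this suggestion can be isolated into a small function so it is testable without running a full evaluation. The sketch below assumes an `EvalResult` with the three attributes named in the suggestion; the field types and defaults, and the `exit_code_for` helper, are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvalResult:
    """Illustrative stand-in; field names follow the suggestion above."""
    error: Optional[str] = None
    answered_question: bool = True
    metric_errors: List[str] = field(default_factory=list)


def exit_code_for(results: List[EvalResult]) -> int:
    """Return 1 if any row errored, went unanswered, or had a metric failure."""
    failed = any(
        r.error or not r.answered_question or r.metric_errors
        for r in results
    )
    return 1 if failed else 0
```

Note that an empty `results` list yields 0 under this rule, so it would pair naturally with a separate check that the dataset is not empty.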

Comment on lines +145 to +150
ingest_info = client.post(f"/ui/{graphname}/create_ingest", json_body={
    "data_source": "server",
    "data_source_config": {"data_path": folder_path},
    "loader_config": {},
    "file_format": "json",
})

Suggestion: Do not hard-code file_format to json when data/ may contain other document types. Derive and validate the extension before creating the ingest job so non-JSON datasets do not get ingested with the wrong loader. [possible issue, importance: 7]

Suggested change
- ingest_info = client.post(f"/ui/{graphname}/create_ingest", json_body={
-     "data_source": "server",
-     "data_source_config": {"data_path": folder_path},
-     "loader_config": {},
-     "file_format": "json",
- })
+ extensions = {
+     os.path.splitext(fp)[1].lstrip(".").lower()
+     for fp in doc_files
+     if os.path.splitext(fp)[1]
+ }
+ if len(extensions) != 1:
+     raise SetupError(
+         f"data/ must contain one file type for ingest; found: {sorted(extensions)}"
+     )
+ file_format = extensions.pop()
+ ingest_info = client.post(f"/ui/{graphname}/create_ingest", json_body={
+     "data_source": "server",
+     "data_source_config": {"data_path": folder_path},
+     "loader_config": {},
+     "file_format": file_format,
+ })
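
The extension check in this suggestion can also be exercised on its own, without a running TigerGraph client. The sketch below wraps the same logic in a standalone function; `infer_file_format` is a hypothetical helper name, and `SetupError` is redefined here only because the PR's own definition is not shown.

```python
import os
from typing import Iterable


class SetupError(Exception):
    """Hypothetical error type; the PR's own definition is not shown."""


def infer_file_format(doc_files: Iterable[str]) -> str:
    """Return the single lowercase extension shared by all files."""
    extensions = {
        os.path.splitext(path)[1].lstrip(".").lower()
        for path in doc_files
        if os.path.splitext(path)[1]  # files with no extension are skipped
    }
    if len(extensions) != 1:
        raise SetupError(
            f"data/ must contain one file type for ingest; found: {sorted(extensions)}"
        )
    return extensions.pop()
```

As in the suggestion, files without an extension are silently ignored, and an empty directory raises because no extension is found. Whether the ingest endpoint accepts every derived extension as a `file_format` value is not established here, so an allow-list of supported formats may still be needed.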

Comment on lines +61 to +62
LOAD_OUTPUT=$(docker exec -e PYTHONUNBUFFERED=1 "${CONTAINER}" \
python /code/tests/regression/setup_graph.py --load-exported "${SETUP_ARGS[@]}" | tee /dev/tty)

Suggestion: Avoid writing to /dev/tty while capturing LOAD_OUTPUT; many CI and non-interactive shells do not have a controlling TTY, causing tee to fail under set -euo pipefail. Tee to stderr instead so output is still visible while stdout remains capturable. [possible issue, importance: 8]

Suggested change
- LOAD_OUTPUT=$(docker exec -e PYTHONUNBUFFERED=1 "${CONTAINER}" \
-     python /code/tests/regression/setup_graph.py --load-exported "${SETUP_ARGS[@]}" | tee /dev/tty)
+ LOAD_OUTPUT=$(docker exec -e PYTHONUNBUFFERED=1 "${CONTAINER}" \
+     python /code/tests/regression/setup_graph.py --load-exported "${SETUP_ARGS[@]}" | tee /dev/stderr)


2 participants