# Model Auto-Discovery Tests

## Overview

- **What:** A pytest-based runner that auto-discovers Torch models from `tt-forge-models` and generates tests for inference and training across parallelism modes.
- **Why:** Standardize model testing, reduce bespoke tests in repos, and scale coverage as models are added or updated.
- **Scope:** Discovers `loader.py` under `<model>/pytorch/` in `third_party/tt_forge_models`, queries variants, and runs each combination of:
  - Run mode: `inference`, `training`
  - Parallelism: `single_device`, `data_parallel`, `tensor_parallel`

> **Note:** Discovery currently targets PyTorch models only. JAX model auto-discovery is planned.
## Prerequisites

- A working TT-XLA development environment, built and ready to run tests, with `pytest` installed.
- The `third_party/tt_forge_models` git submodule initialized and up to date:

  ```bash
  git submodule update --init --recursive third_party/tt_forge_models
  ```

- Device availability matching your chosen parallelism mode (e.g., multiple devices for data/tensor parallel).
- Optional internet access for per-model pip installs during test execution.
- The env var `IRD_LF_CACHE` set to point to the large-file cache / webserver that mirrors the S3 bucket. Reach out to the team for details.
## Quick start / commonly used commands

> **Warning:** Since the number of supported models and variants is high (1000+), run with `--collect-only` first to see what will be discovered/collected before running non-targeted pytest commands locally. Running the full matrix can collect thousands of tests and may install per-model Python packages during execution. Prefer targeted runs locally using `-m`, `-k`, or an exact node ID.

> **Tip:** Use `-q --collect-only` to list tests with their full paths; remove `--collect-only` and use `-vv` when actually running them.
List all tests without running:

```bash
pytest --collect-only -q tests/runner/test_models.py |& tee collect.log
```

List only tensor-parallel expected-passing tests on `n300-llmbox` (remove `--collect-only` to run):

```bash
pytest --collect-only -q tests/runner/test_models.py -m "tensor_parallel and expected_passing and n300_llmbox" --arch n300-llmbox |& tee tests.log
```

Run a specific collected test node ID exactly:

```bash
pytest -vv tests/runner/test_models.py::test_all_models[llama/sequence_classification/pytorch-llama_3_2_1b-single_device-inference] |& tee test.log
```

Validate `test_config` files for typos and model name changes (useful when making updates):

```bash
pytest -svv --validate-test-config tests/runner/test_models.py |& tee validate.log
```
List all expected-passing llama inference tests for n150 (using substring `-k` and markers with `-m`):

```bash
pytest -q --collect-only -k "llama" tests/runner/test_models.py -m "n150 and expected_passing and inference" |& tee tests.log
```

```
tests/runner/test_models.py::test_all_models[deepcogito/pytorch-v1_preview_llama_3b-single_device-inference]
tests/runner/test_models.py::test_all_models[huggyllama/pytorch-llama_7b-single_device-inference]
tests/runner/test_models.py::test_all_models[llama/sequence_classification/pytorch-llama_3_8b_instruct-single_device-inference]
tests/runner/test_models.py::test_all_models[llama/sequence_classification/pytorch-llama_3_1_8b-single_device-inference]
tests/runner/test_models.py::test_all_models[llama/causal_lm/pytorch-llama_3_8b-single_device-inference]
tests/runner/test_models.py::test_all_models[llama/causal_lm/pytorch-llama_3_8b_instruct-single_device-inference]
tests/runner/test_models.py::test_all_models[llama/causal_lm/pytorch-llama_3_1_8b-single_device-inference]
<snip>
21/3048 tests collected (3027 deselected) in 3.53s
```
## How discovery and parametrization work

The runner scans `third_party/tt_forge_models/**/pytorch/loader.py` (the git submodule) and imports each `ModelLoader` to call `query_available_variants()`. For every discovered variant, the runner generates tests across run modes and parallelism.

Implementation highlights:

- Discovery and IDs: `tests/runner/test_utils.py` (`setup_test_discovery`, `discover_loader_paths`, `create_test_entries`, `create_test_id_generator`)
- Main test: `tests/runner/test_models.py`
- Config loading/validation: `tests/runner/test_config/config_loader.py` (merges YAML into Python with validation)
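For intuition, here is a minimal sketch of the discovery flow described above. It assumes only the submodule layout; the function names, the import mechanics, and whether `query_available_variants()` is a class or instance method are illustrative, not the runner's actual implementation.

```python
import importlib.util
from pathlib import Path

def discover_loader_paths(root: str = "third_party/tt_forge_models") -> list[Path]:
    """Find every <model>/pytorch/loader.py under the submodule."""
    return sorted(Path(root).glob("**/pytorch/loader.py"))

def query_variants(loader_path: Path):
    """Import a loader module and ask its ModelLoader for available variants."""
    spec = importlib.util.spec_from_file_location("model_loader", str(loader_path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # may raise if per-model requirements are missing
    return module.ModelLoader.query_available_variants()

if __name__ == "__main__":
    for path in discover_loader_paths():
        for variant in query_variants(path):
            # One test is generated per (variant, run mode, parallelism) combination.
            print(path.parent.parent.name, variant)
```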
## Test IDs and filtering

Test ID format: `<relative_model_path>-<variant_name>-<parallelism>-<run_mode>`

Examples:

- `squeezebert/pytorch-squeezebert-mnli-single_device-inference`
- `...-data_parallel-training`
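The format amounts to joining the four components with dashes; a purely illustrative sketch (not the runner's actual ID generator, and the split of the example ID into components is an assumption):

```python
def make_test_id(model_path: str, variant: str, parallelism: str, run_mode: str) -> str:
    """Assemble a test ID in the documented <path>-<variant>-<parallelism>-<run_mode> form."""
    return f"{model_path}-{variant}-{parallelism}-{run_mode}"

# -> "squeezebert/pytorch-squeezebert-mnli-single_device-inference"
print(make_test_id("squeezebert/pytorch", "squeezebert-mnli", "single_device", "inference"))
```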
Filter by substring with `-k` or by markers with `-m`:

```bash
pytest -q -k "qwen_2_5_vl/pytorch-3b_instruct" tests/runner/test_models.py
pytest -q -m "training and tensor_parallel" tests/runner/test_models.py
```

See `model-test-passing.json` and the related `.json` files in `.github/workflows/test-matrix-presets` to understand how this filtering is used for CI jobs.
## Parallelism modes

- `single_device`: Standard execution on one device.
- `data_parallel`: Inputs are automatically batched to `xr.global_runtime_device_count()`; the shard spec is inferred on batch dim 0.
- `tensor_parallel`: The mesh is derived from `loader.get_mesh_config(num_devices)`; execution is sharded along model dimensions.
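As a rough illustration of the data-parallel idea (batch to the device count, then shard dim 0), here is a sketch using torch_xla's SPMD API. The helper names are hypothetical, and details such as SPMD setup and device placement are simplified; this is not the runner's actual code.

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

xr.use_spmd()  # enable SPMD so mark_sharding annotations take effect

def replicate_batch(x: torch.Tensor) -> torch.Tensor:
    """Grow the batch so dim 0 matches the number of available devices."""
    n = xr.global_runtime_device_count()
    return x.repeat(n, *([1] * (x.dim() - 1)))

def shard_batch_dim(x: torch.Tensor) -> torch.Tensor:
    """Shard dim 0 across a 1-D device mesh; the other dims stay replicated."""
    n = xr.global_runtime_device_count()
    mesh = xs.Mesh(list(range(n)), (n,), ("batch",))
    xs.mark_sharding(x, mesh, ("batch",) + (None,) * (x.dim() - 1))
    return x

inputs = replicate_batch(torch.randn(1, 3, 224, 224)).to(xm.xla_device())
inputs = shard_batch_dim(inputs)
```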
## Per-model requirements

If a model provides a `requirements.txt` next to its `loader.py`, the runner will (see the sketch below):

1. Freeze the current environment
2. Install those requirements (and the optional `requirements.nodeps.txt` with `--no-deps`)
3. Run the tests
4. Uninstall newly added packages and restore version changes
Environment toggles:

- `TT_XLA_DISABLE_MODEL_REQS=1` to disable install/uninstall management
- `TT_XLA_REQS_DEBUG=1` to print pip operations for debugging
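A rough sketch of that install/restore behavior as a context manager, assuming plain pip subprocess calls; the real logic lives in `tests/runner/requirements.py` and handles more cases (e.g., restoring changed versions).

```python
import subprocess
import sys
from contextlib import contextmanager

def _pip(*args: str) -> str:
    """Run `python -m pip ...` and return its stdout."""
    result = subprocess.run([sys.executable, "-m", "pip", *args],
                            check=True, capture_output=True, text=True)
    return result.stdout

@contextmanager
def model_requirements(requirements_txt: str):
    """Install a model's requirements for the duration of its tests, then clean up."""
    before = set(_pip("freeze").splitlines())   # snapshot of the current environment
    _pip("install", "-r", requirements_txt)     # install per-model dependencies
    try:
        yield
    finally:
        added = set(_pip("freeze").splitlines()) - before
        new_pkgs = [line.split("==")[0] for line in added if "==" in line]
        if new_pkgs:
            _pip("uninstall", "-y", *new_pkgs)  # remove packages that were newly added
        # Restoring upgraded/downgraded pins is omitted in this sketch.
```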
## Test configuration and statuses

Central configuration is authored as YAML in `tests/runner/test_config/*` and loaded/validated by `tests/runner/test_config/config_loader.py` (merged into Python at runtime). For example, `tests/runner/test_config/test_config_inference_single_device.yaml` tags all single-device inference tests, and `tests/runner/test_config/test_config_inference_data_parallel.yaml` tags data-parallel inference tests.

Each entry is keyed by the collected test ID and can specify:

- Status: `EXPECTED_PASSING`, `KNOWN_FAILURE_XFAIL`, `NOT_SUPPORTED_SKIP`, `UNSPECIFIED`, `EXCLUDE_MODEL`
- Comparators: `required_pcc`, `assert_pcc`, `assert_allclose`, `allclose_rtol`, `allclose_atol`
- Metadata: `bringup_status`, `reason`, custom `markers` (e.g., `push`, `nightly`)
- Architecture scoping: `supported_archs`, used for filtering by CI job, and optional `arch_overrides`, used when test_config entries need to be modified per arch
## YAML to Python loading and validation

The YAML files in `tests/runner/test_config/*` are the single source of truth. At runtime, `tests/runner/test_config/config_loader.py`:

- Loads and merges all YAML fragments into a single Python dictionary keyed by collected test IDs
- Normalizes enum-like values (accepts both names like `EXPECTED_PASSING` and values like `expected_passing`)
- Applies `--arch <archname>`-specific `arch_overrides` when provided
- Validates field names/types and raises helpful errors on typos or invalid values
- Uses `ruamel.yaml` for parsing, which flags duplicate mapping keys and detects duplicate test entries both within a single YAML file and across multiple YAML files; duplicates cause validation errors with clear messages
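Conceptually, the merge with cross-file duplicate detection looks something like the sketch below (assuming `ruamel.yaml`, as noted above; the actual loader also normalizes enums, applies `arch_overrides`, and validates field names/types).

```python
from pathlib import Path
from ruamel.yaml import YAML

def load_test_config(config_dir: str = "tests/runner/test_config") -> dict:
    """Merge all YAML fragments into one dict keyed by collected test ID."""
    yaml = YAML(typ="safe")  # ruamel rejects duplicate keys within a single file
    merged: dict = {}
    for path in sorted(Path(config_dir).glob("*.yaml")):
        fragment = yaml.load(path) or {}
        overlap = merged.keys() & fragment.keys()
        if overlap:  # duplicates across files are also an error
            raise ValueError(f"Duplicate test entries in {path.name}: {sorted(overlap)}")
        merged.update(fragment)
    return merged
```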
## Model status and bringup_status guidance

Use `tests/runner/test_config/*` to declare intent for each collected test ID. Typical fields:

- `status` (from `ModelTestStatus`) controls filtering of tests in CI:
  - `EXPECTED_PASSING`: Test is green and should run in Nightly CI. Optionally set thresholds.
  - `KNOWN_FAILURE_XFAIL`: Known failure that should xfail; include `reason` and `bringup_status` to set them statically, otherwise the runner attempts to set them dynamically at runtime.
  - `NOT_SUPPORTED_SKIP`: Skip on this architecture or generally unsupported; provide `reason` and (optionally) `bringup_status`.
  - `UNSPECIFIED`: Default for new tests; runs in Experimental Nightly until triaged.
  - `EXCLUDE_MODEL`: Deselect from auto-run entirely (rare; use for temporary exclusions).
- `bringup_status` (from `BringupStatus`) summarizes current health for Superset dashboard reporting: `PASSED` (set automatically on pass), `INCORRECT_RESULT` (e.g., PCC mismatch), `FAILED_FE_COMPILATION` (frontend compile error), `FAILED_TTMLIR_COMPILATION` (tt-mlir compile error), `FAILED_RUNTIME` (runtime crash), `NOT_STARTED`, `UNKNOWN`.
- `reason`: Short human-readable context, ideally with a link to a tracking issue.
- Comparator controls: prefer `required_pcc`; use `assert_pcc=False` sparingly as a temporary measure.
### Examples

Passing with a tuned PCC threshold, when the decrease is reasonable and understood:

```python
"resnet/pytorch-resnet_50_hf-single_device-inference": {
    "status": ModelTestStatus.EXPECTED_PASSING,
    "required_pcc": 0.98,
}
```
Known compile failure (xfail) with an issue link:

```python
"clip/pytorch-openai/clip-vit-base-patch32-single_device-inference": {
    "status": ModelTestStatus.KNOWN_FAILURE_XFAIL,
    "bringup_status": BringupStatus.FAILED_TTMLIR_COMPILATION,
    "reason": "Error Message - Github issue link",
}
```
For a minor unexpected PCC mismatch, open a ticket, decrease the threshold, and set `bringup_status`/`reason`:

```python
"wide_resnet/pytorch-wide_resnet101_2-single_device-inference": {
    "status": ModelTestStatus.EXPECTED_PASSING,
    "required_pcc": 0.96,
    "bringup_status": BringupStatus.INCORRECT_RESULT,
    "reason": "PCC regression after consteval changes - Github Issue Link",
}
```
For a severe unexpected PCC mismatch, open a ticket, disable the PCC check, and set `bringup_status`/`reason`:

```python
"gpt_neo/causal_lm/pytorch-gpt_neo_2_7B-single_device-inference": {
    "status": ModelTestStatus.EXPECTED_PASSING,
    "assert_pcc": False,
    "bringup_status": BringupStatus.INCORRECT_RESULT,
    "reason": "AssertionError: PCC comparison failed. Calculated: pcc=-1.0000001192092896. Required: pcc=0.99 - Github Issue Link",
}
```
Architecture-specific overrides (e.g., PCC thresholds, status):

```python
"qwen_3/embedding/pytorch-embedding_8b-single_device-inference": {
    "status": ModelTestStatus.EXPECTED_PASSING,
    "arch_overrides": {
        "n150": {
            "status": ModelTestStatus.NOT_SUPPORTED_SKIP,
            "reason": "Too large for single chip",
            "bringup_status": BringupStatus.FAILED_RUNTIME,
        },
    },
},
```
## Targeting architectures

Use `--arch {n150,p150,n300,n300-llmbox}` on the pytest command line to enable `arch_overrides` resolution in the config, in case there are per-arch overrides (such as PCC requirements, enablement, or tagging). Tests are also marked with supported-arch markers (or defaults), so you can select subsets using `-m`, for example:

```bash
pytest -q -m n300 --arch n300 tests/runner/test_models.py
pytest -q -m n300_llmbox --arch n300-llmbox tests/runner/test_models.py
```
## Placeholder models (report-only)

- Placeholder models are declared in YAML at `tests/runner/test_config/test_config_placeholders.yaml` and list important customer `ModelGroup.RED` models that are not yet merged, typically marked with `BringupStatus.NOT_STARTED`. These entries are loaded by the same config loader as the other YAML files.
- `tests/runner/test_models.py::test_placeholder_models` emits report entries with the `placeholder` marker; these are used for reporting on the Superset dashboard and run in tt-xla Nightly CI (typically via `model-test-xfail.json`).
- Be sure to remove the placeholder at the same time the real model is added, to avoid duplicate reports.
## CI setup

- Push/PR: A small, fast subset runs on each pull request (e.g., tests marked `push`). This provides quick signal without large queues.
- Nightly: The broad model matrix (inference/training across supported parallelism) runs nightly and reports to the Superset dashboard. Tests are selected via markers and `tests/runner/test_config/*` statuses/arch tags such as `ModelTestStatus.EXPECTED_PASSING`.
- Experimental Nightly: New or experimental models not yet promoted/tagged in `tests/runner/test_config/*` (typically `UNSPECIFIED`) run separately. These do not report to Superset until promoted with proper status/markers.
## Adding a new model to run in Nightly CI

The process is not difficult, but it potentially involves two projects (tt-xla and tt-forge-models). If the model has already been added to tt-forge-models and uplifted into tt-xla, skip steps 1-4.
1. In `tt-forge-models/<model>/pytorch/loader.py`, implement a `ModelLoader` if one doesn't already exist (a hypothetical skeleton is sketched after these steps), exposing:
   - `query_available_variants()` and `get_model_info(variant=...)`
   - `load_model(...)` and `load_inputs(...)`
   - `load_shard_spec(...)` (if needed) and `get_mesh_config(num_devices)` (for tensor parallel)
2. Optionally add `requirements.txt` (and `requirements.nodeps.txt`) next to `loader.py` for per-model dependencies.
3. Contribute the model upstream: open a PR in the `tt-forge-models` repository and land it (see the `tt-forge-models` repo: https://github.com/tenstorrent/tt-forge-models).
4. Uplift the `third_party/tt_forge_models` submodule in `tt-xla` to the merged commit so the loader is discoverable. Update the submodule and commit the pointer:

   ```bash
   git submodule update --remote third_party/tt_forge_models
   git add third_party/tt_forge_models
   git commit -m "Uplift tt-forge-models submodule to <version> to include <model>"
   ```

5. Verify the test appears via `--collect-only` and run the desired flavor locally if needed.
6. Add or update the corresponding entry in `tests/runner/test_config/*` to set status/thresholds/markers/arch support so that the model test runs in tt-xla Nightly CI. Look at existing tests for reference.
7. Remove any corresponding placeholder entry from `PLACEHOLDER_MODELS` in `test_config_placeholders.yaml` if it exists.
8. Locally run `pytest -q --validate-test-config tests/runner/test_models.py` to validate the `tests/runner/test_config/*` updates (on-PR jobs run it too).
9. Open a PR in `tt-xla` for the changes, consider running the full set of expected-passing models on CI to qualify the `tt_forge_models` uplift (if it is risky), and land the PR in `tt-xla` main when confident in the changes.
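As referenced in step 1, here is a hypothetical skeleton of the loader interface. Only the method names come from the list above; the class body, return values, and `SampleModel` are placeholders, not the real tt-forge-models base classes.

```python
import torch

class SampleModel(torch.nn.Module):
    """Placeholder model used only to make the skeleton self-contained."""
    def forward(self, x):
        return x

class ModelLoader:
    @classmethod
    def query_available_variants(cls):
        # Map of variant name -> variant config; contents are illustrative.
        return {"base": None}

    @classmethod
    def get_model_info(cls, variant=None):
        return {"name": "sample_model", "variant": variant}

    def load_model(self, **kwargs):
        return SampleModel()

    def load_inputs(self, **kwargs):
        return torch.randn(1, 3, 224, 224)

    def load_shard_spec(self, inputs):
        return None  # only needed for sharded execution

    def get_mesh_config(self, num_devices):
        return (1, num_devices)  # e.g., (batch, model) mesh for tensor parallel
```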
## Troubleshooting

- Discovery/import errors show as `Cannot import path: <loader.py>: <error>`; add per-model requirements or set `TT_XLA_DISABLE_MODEL_REQS=1` to isolate issues.
- Runtime/compilation failures are recorded with a bringup status and reason in the test properties; check the test report's `tags` and `error_message`.
- Some models may be temporarily excluded from discovery; see the logs printed during collection.
- Use `-vv` and `--collect-only` for detailed collection/ID debugging.
## Future enhancements

- Expand auto-discovery beyond PyTorch to include JAX models
- Automate updates of `tests/runner/test_config/*`, potentially based on Nightly CI results, including automatic promotion of tests from Experimental Nightly to stable Nightly
- Broader usability improvements and workflow polish, tracked in issue #1307
## Reference

- `tests/runner/test_models.py`: main parametrized pytest runner
- `tests/runner/test_utils.py`: discovery, IDs, `DynamicTorchModelTester`
- `tests/runner/requirements.py`: per-model requirements context manager
- `tests/runner/conftest.py`: config attachment, markers, `--arch`, config validation
- `tests/runner/test_config/*.yaml`: YAML test config files (source of truth)
- `tests/runner/test_config/config_loader.py`: loads/merges/validates YAML into Python at runtime
- `third_party/tt_forge_models/config.py`: `Parallelism` and model metadata