# Model Auto-Discovery Tests

## Overview

- **What:** A pytest-based runner that auto-discovers Torch models from `tt-forge-models` and generates tests for inference and training across parallelism modes.
- **Why:** Standardize model testing, reduce bespoke tests in repos, and scale coverage as models are added or updated.
- **Scope:** Discovers `loader.py` under `<model>/pytorch/` in `third_party/tt_forge_models`, queries variants, and runs each combination of:
  - Run mode: `inference`, `training`
  - Parallelism: `single_device`, `data_parallel`, `tensor_parallel`

> **Note:** Discovery currently targets PyTorch models only. JAX model auto-discovery is planned.
## Prerequisites

- A working TT-XLA development environment, built and ready to run tests, with `pytest` installed.
- The `third_party/tt_forge_models` git submodule initialized and up to date:

  ```bash
  git submodule update --init --recursive third_party/tt_forge_models
  ```

- Device availability matching your chosen parallelism mode (e.g., multiple devices for data/tensor parallel).
- Optional internet access for per-model pip installs during test execution.
- The env var `IRD_LF_CACHE` set to point to the large-file cache / webserver that mirrors the S3 bucket. Reach out to the team for details.
## Quick start / commonly used commands

> **Warning:** Since the number of supported models and variants is high (1000+), run with `--collect-only` first to see what will be discovered/collected before running non-targeted pytest commands locally. Running the full matrix can collect thousands of tests and may install per-model Python packages during execution. Prefer targeted runs locally using `-m`, `-k`, or an exact node ID.

> **Tip:** Use `-q --collect-only` to list tests with their full paths; remove `--collect-only` and use `-vv` when actually running them.
List all tests without running:

```bash
pytest --collect-only -q tests/runner/test_models.py |& tee collect.log
```

List only tensor-parallel expected-passing tests on `n300-llmbox` (remove `--collect-only` to run):

```bash
pytest --collect-only -q tests/runner/test_models.py -m "tensor_parallel and expected_passing and n300_llmbox" --arch n300-llmbox |& tee tests.log
```

Run a specific collected test node ID exactly:

```bash
pytest -vv tests/runner/test_models.py::test_all_models[llama/sequence_classification/pytorch-llama_3_2_1b-single_device-inference] |& tee test.log
```

Validate `test_config` files for typos and model name changes (useful when making updates):

```bash
pytest -svv --validate-test-config tests/runner/test_models.py |& tee validate.log
```
List all expected-passing llama inference tests for n150 (using substring `-k` and markers with `-m`):

```bash
pytest -q --collect-only -k "llama" tests/runner/test_models.py -m "n150 and expected_passing and inference" |& tee tests.log
```

```
tests/runner/test_models.py::test_all_models[deepcogito/pytorch-v1_preview_llama_3b-single_device-inference]
tests/runner/test_models.py::test_all_models[huggyllama/pytorch-llama_7b-single_device-inference]
tests/runner/test_models.py::test_all_models[llama/sequence_classification/pytorch-llama_3_8b_instruct-single_device-inference]
tests/runner/test_models.py::test_all_models[llama/sequence_classification/pytorch-llama_3_1_8b-single_device-inference]
tests/runner/test_models.py::test_all_models[llama/causal_lm/pytorch-llama_3_8b-single_device-inference]
tests/runner/test_models.py::test_all_models[llama/causal_lm/pytorch-llama_3_8b_instruct-single_device-inference]
tests/runner/test_models.py::test_all_models[llama/causal_lm/pytorch-llama_3_1_8b-single_device-inference]
<snip>
21/3048 tests collected (3027 deselected) in 3.53s
```
## How discovery and parametrization work

The runner scans `third_party/tt_forge_models/**/pytorch/loader.py` (the git submodule) and imports each `ModelLoader` to call `query_available_variants()`. For every discovered variant, the runner generates tests across run modes and parallelism.

Implementation highlights:

- Discovery and IDs: `tests/runner/test_utils.py` (`setup_test_discovery`, `discover_loader_paths`, `create_test_entries`, `create_test_id_generator`)
- Main test: `tests/runner/test_models.py`
- Config loading/validation: `tests/runner/test_config/config_loader.py` (merges YAML into Python with validation)
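For intuition, here is a minimal sketch of the discovery flow described above. It assumes only the submodule layout; the function names, the import mechanics, and whether `query_available_variants()` is a class or instance method are illustrative, not the runner's actual implementation.

```python
import importlib.util
from pathlib import Path

def discover_loader_paths(root: str = "third_party/tt_forge_models") -> list[Path]:
    """Find every <model>/pytorch/loader.py under the submodule."""
    return sorted(Path(root).glob("**/pytorch/loader.py"))

def query_variants(loader_path: Path):
    """Import a loader module and ask its ModelLoader for available variants."""
    spec = importlib.util.spec_from_file_location("model_loader", str(loader_path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # may raise if per-model requirements are missing
    return module.ModelLoader.query_available_variants()

if __name__ == "__main__":
    for path in discover_loader_paths():
        for variant in query_variants(path):
            # One test is generated per (variant, run mode, parallelism) combination.
            print(path.parent.parent.name, variant)
```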
## Test IDs and filtering

Test ID format: `<relative_model_path>-<variant_name>-<parallelism>-<run_mode>`

Examples:

- `squeezebert/pytorch-squeezebert-mnli-single_device-inference`
- `...-data_parallel-training`
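The format amounts to joining the four components with dashes; a purely illustrative sketch (not the runner's actual ID generator, and the split of the example ID into components is an assumption):

```python
def make_test_id(model_path: str, variant: str, parallelism: str, run_mode: str) -> str:
    """Assemble a test ID in the documented <path>-<variant>-<parallelism>-<run_mode> form."""
    return f"{model_path}-{variant}-{parallelism}-{run_mode}"

# -> "squeezebert/pytorch-squeezebert-mnli-single_device-inference"
print(make_test_id("squeezebert/pytorch", "squeezebert-mnli", "single_device", "inference"))
```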
Filter by substring with `-k` or by markers with `-m`:

```bash
pytest -q -k "qwen_2_5_vl/pytorch-3b_instruct" tests/runner/test_models.py
pytest -q -m "training and tensor_parallel" tests/runner/test_models.py
```

See `model-test-passing.json` and the related `.json` files in `.github/workflows/test-matrix-presets` to understand how this filtering is used for CI jobs.
## Parallelism modes

- `single_device`: Standard execution on one device.
- `data_parallel`: Inputs are automatically batched to `xr.global_runtime_device_count()`; the shard spec is inferred on batch dim 0.
- `tensor_parallel`: The mesh is derived from `loader.get_mesh_config(num_devices)`; execution is sharded along model dimensions.
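As a rough illustration of the data-parallel idea (batch to the device count, then shard dim 0), here is a sketch using torch_xla's SPMD API. The helper names are hypothetical, and details such as SPMD setup and device placement are simplified; this is not the runner's actual code.

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

xr.use_spmd()  # enable SPMD so mark_sharding annotations take effect

def replicate_batch(x: torch.Tensor) -> torch.Tensor:
    """Grow the batch so dim 0 matches the number of available devices."""
    n = xr.global_runtime_device_count()
    return x.repeat(n, *([1] * (x.dim() - 1)))

def shard_batch_dim(x: torch.Tensor) -> torch.Tensor:
    """Shard dim 0 across a 1-D device mesh; the other dims stay replicated."""
    n = xr.global_runtime_device_count()
    mesh = xs.Mesh(list(range(n)), (n,), ("batch",))
    xs.mark_sharding(x, mesh, ("batch",) + (None,) * (x.dim() - 1))
    return x

inputs = replicate_batch(torch.randn(1, 3, 224, 224)).to(xm.xla_device())
inputs = shard_batch_dim(inputs)
```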
## Per-model requirements

If a model provides a `requirements.txt` next to its `loader.py`, the runner will (see the sketch below):

1. Freeze the current environment
2. Install those requirements (and the optional `requirements.nodeps.txt` with `--no-deps`)
3. Run the tests
4. Uninstall newly added packages and restore version changes
Environment toggles:

- `TT_XLA_DISABLE_MODEL_REQS=1` to disable install/uninstall management
- `TT_XLA_REQS_DEBUG=1` to print pip operations for debugging
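A rough sketch of that install/restore behavior as a context manager, assuming plain pip subprocess calls; the real logic lives in `tests/runner/requirements.py` and handles more cases (e.g., restoring changed versions).

```python
import subprocess
import sys
from contextlib import contextmanager

def _pip(*args: str) -> str:
    """Run `python -m pip ...` and return its stdout."""
    result = subprocess.run([sys.executable, "-m", "pip", *args],
                            check=True, capture_output=True, text=True)
    return result.stdout

@contextmanager
def model_requirements(requirements_txt: str):
    """Install a model's requirements for the duration of its tests, then clean up."""
    before = set(_pip("freeze").splitlines())   # snapshot of the current environment
    _pip("install", "-r", requirements_txt)     # install per-model dependencies
    try:
        yield
    finally:
        added = set(_pip("freeze").splitlines()) - before
        new_pkgs = [line.split("==")[0] for line in added if "==" in line]
        if new_pkgs:
            _pip("uninstall", "-y", *new_pkgs)  # remove packages that were newly added
        # Restoring upgraded/downgraded pins is omitted in this sketch.
```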
## Test configuration and statuses

Central configuration is authored as YAML in `tests/runner/test_config/*` and loaded/validated by `tests/runner/test_config/config_loader.py` (merged into Python at runtime). For example, `tests/runner/test_config/test_config_inference_single_device.yaml` tags all single-device inference tests, and `tests/runner/test_config/test_config_inference_data_parallel.yaml` tags data-parallel inference tests.

Each entry is keyed by the collected test ID and can specify:

- Status: `EXPECTED_PASSING`, `KNOWN_FAILURE_XFAIL`, `NOT_SUPPORTED_SKIP`, `UNSPECIFIED`, `EXCLUDE_MODEL`
- Comparators: `required_pcc`, `assert_pcc`, `assert_allclose`, `allclose_rtol`, `allclose_atol`
- Metadata: `bringup_status`, `reason`, custom `markers` (e.g., `push`, `nightly`)
- Architecture scoping: `supported_archs`, used for filtering by CI job, and optional `arch_overrides`, used when test_config entries need to be modified per arch
## YAML to Python loading and validation

The YAML files in `tests/runner/test_config/*` are the single source of truth. At runtime, `tests/runner/test_config/config_loader.py`:

- Loads and merges all YAML fragments into a single Python dictionary keyed by collected test IDs
- Normalizes enum-like values (accepts both names like `EXPECTED_PASSING` and values like `expected_passing`)
- Applies `--arch <archname>`-specific `arch_overrides` when provided
- Validates field names/types and raises helpful errors on typos or invalid values
- Uses `ruamel.yaml` for parsing, which flags duplicate mapping keys and detects duplicate test entries both within a single YAML file and across multiple YAML files; duplicates cause validation errors with clear messages
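Conceptually, the merge with cross-file duplicate detection looks something like the sketch below (assuming `ruamel.yaml`, as noted above; the actual loader also normalizes enums, applies `arch_overrides`, and validates field names/types).

```python
from pathlib import Path
from ruamel.yaml import YAML

def load_test_config(config_dir: str = "tests/runner/test_config") -> dict:
    """Merge all YAML fragments into one dict keyed by collected test ID."""
    yaml = YAML(typ="safe")  # ruamel rejects duplicate keys within a single file
    merged: dict = {}
    for path in sorted(Path(config_dir).glob("*.yaml")):
        fragment = yaml.load(path) or {}
        overlap = merged.keys() & fragment.keys()
        if overlap:  # duplicates across files are also an error
            raise ValueError(f"Duplicate test entries in {path.name}: {sorted(overlap)}")
        merged.update(fragment)
    return merged
```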
## Model status and bringup_status guidance

Use `tests/runner/test_config/*` to declare intent for each collected test ID. Typical fields:

- `status` (from `ModelTestStatus`) controls filtering of tests in CI:
  - `EXPECTED_PASSING`: Test is green and should run in Nightly CI. Optionally set thresholds.
  - `KNOWN_FAILURE_XFAIL`: Known failure that should xfail; include `reason` and `bringup_status` to set them statically, otherwise the runner attempts to set them dynamically at runtime.
  - `NOT_SUPPORTED_SKIP`: Skip on this architecture or generally unsupported; provide `reason` and (optionally) `bringup_status`.
  - `UNSPECIFIED`: Default for new tests; runs in Experimental Nightly until triaged.
  - `EXCLUDE_MODEL`: Deselect from auto-run entirely (rare; use for temporary exclusions).
- `bringup_status` (from `BringupStatus`) summarizes current health for Superset dashboard reporting: `PASSED` (set automatically on pass), `INCORRECT_RESULT` (e.g., PCC mismatch), `FAILED_FE_COMPILATION` (frontend compile error), `FAILED_TTMLIR_COMPILATION` (tt-mlir compile error), `FAILED_RUNTIME` (runtime crash), `NOT_STARTED`, `UNKNOWN`.
- `reason`: Short human-readable context, ideally with a link to a tracking issue.
- Comparator controls: prefer `required_pcc`; use `assert_pcc=False` sparingly as a temporary measure.
### Examples

Passing with a tuned PCC threshold, when the decrease is reasonable and understood:

```python
"resnet/pytorch-resnet_50_hf-single_device-inference": {
    "status": ModelTestStatus.EXPECTED_PASSING,
    "required_pcc": 0.98,
}
```
Known compile failure (xfail) with an issue link:

```python
"clip/pytorch-openai/clip-vit-base-patch32-single_device-inference": {
    "status": ModelTestStatus.KNOWN_FAILURE_XFAIL,
    "bringup_status": BringupStatus.FAILED_TTMLIR_COMPILATION,
    "reason": "Error Message - Github issue link",
}
```
For a minor unexpected PCC mismatch, open a ticket, decrease the threshold, and set `bringup_status`/`reason`:

```python
"wide_resnet/pytorch-wide_resnet101_2-single_device-inference": {
    "status": ModelTestStatus.EXPECTED_PASSING,
    "required_pcc": 0.96,
    "bringup_status": BringupStatus.INCORRECT_RESULT,
    "reason": "PCC regression after consteval changes - Github Issue Link",
}
```
For a severe unexpected PCC mismatch, open a ticket, disable the PCC check, and set `bringup_status`/`reason`:

```python
"gpt_neo/causal_lm/pytorch-gpt_neo_2_7B-single_device-inference": {
    "status": ModelTestStatus.EXPECTED_PASSING,
    "assert_pcc": False,
    "bringup_status": BringupStatus.INCORRECT_RESULT,
    "reason": "AssertionError: PCC comparison failed. Calculated: pcc=-1.0000001192092896. Required: pcc=0.99 - Github Issue Link",
}
```
Architecture-specific overrides (e.g., PCC thresholds, status):

```python
"qwen_3/embedding/pytorch-embedding_8b-single_device-inference": {
    "status": ModelTestStatus.EXPECTED_PASSING,
    "arch_overrides": {
        "n150": {
            "status": ModelTestStatus.NOT_SUPPORTED_SKIP,
            "reason": "Too large for single chip",
            "bringup_status": BringupStatus.FAILED_RUNTIME,
        },
    },
},
```
## Targeting architectures

Use `--arch {n150,p150,n300,n300-llmbox}` on the pytest command line to enable `arch_overrides` resolution in the config, in case there are per-arch overrides (such as PCC requirements, enablement, or tagging). Tests are also marked with supported-arch markers (or defaults), so you can select subsets using `-m`, for example:

```bash
pytest -q -m n300 --arch n300 tests/runner/test_models.py
pytest -q -m n300_llmbox --arch n300-llmbox tests/runner/test_models.py
```
## Placeholder models (report-only)

- Placeholder models are declared in YAML at `tests/runner/test_config/test_config_placeholders.yaml` and list important customer `ModelGroup.RED` models that are not yet merged, typically marked with `BringupStatus.NOT_STARTED`. These entries are loaded by the same config loader as the other YAML files.
- `tests/runner/test_models.py::test_placeholder_models` emits report entries with the `placeholder` marker; these are used for reporting on the Superset dashboard and run in tt-xla Nightly CI (typically via `model-test-xfail.json`).
- Be sure to remove the placeholder at the same time the real model is added, to avoid duplicate reports.
## CI setup

- Push/PR: A small, fast subset runs on each pull request (e.g., tests marked `push`). This provides quick signal without large queues.
- Nightly: The broad model matrix (inference/training across supported parallelism) runs nightly and reports to the Superset dashboard. Tests are selected via markers and `tests/runner/test_config/*` statuses/arch tags such as `ModelTestStatus.EXPECTED_PASSING`.
- Experimental Nightly: New or experimental models not yet promoted/tagged in `tests/runner/test_config/*` (typically `UNSPECIFIED`) run separately. These do not report to Superset until promoted with proper status/markers.
## Adding a new model to run in Nightly CI

The process is not difficult, but it potentially involves two projects (tt-xla and tt-forge-models). If the model has already been added to tt-forge-models and uplifted into tt-xla, skip steps 1-4.
1. In `tt-forge-models/<model>/pytorch/loader.py`, implement a `ModelLoader` if one doesn't already exist (a hypothetical skeleton is sketched after these steps), exposing:
   - `query_available_variants()` and `get_model_info(variant=...)`
   - `load_model(...)` and `load_inputs(...)`
   - `load_shard_spec(...)` (if needed) and `get_mesh_config(num_devices)` (for tensor parallel)
2. Optionally add `requirements.txt` (and `requirements.nodeps.txt`) next to `loader.py` for per-model dependencies.
3. Contribute the model upstream: open a PR in the `tt-forge-models` repository and land it (see the `tt-forge-models` repo: https://github.com/tenstorrent/tt-forge-models).
4. Uplift the `third_party/tt_forge_models` submodule in `tt-xla` to the merged commit so the loader is discoverable. Update the submodule and commit the pointer:

   ```bash
   git submodule update --remote third_party/tt_forge_models
   git add third_party/tt_forge_models
   git commit -m "Uplift tt-forge-models submodule to <version> to include <model>"
   ```

5. Verify the test appears via `--collect-only` and run the desired flavor locally if needed.
6. Add or update the corresponding entry in `tests/runner/test_config/*` to set status/thresholds/markers/arch support so that the model test runs in tt-xla Nightly CI. Look at existing tests for reference.
7. Remove any corresponding placeholder entry from `PLACEHOLDER_MODELS` in `test_config_placeholders.yaml` if it exists.
8. Locally run `pytest -q --validate-test-config tests/runner/test_models.py` to validate the `tests/runner/test_config/*` updates (on-PR jobs run it too).
9. Open a PR in `tt-xla` for the changes, consider running the full set of expected-passing models on CI to qualify the `tt_forge_models` uplift (if it is risky), and land the PR in `tt-xla` main when confident in the changes.
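As referenced in step 1, here is a hypothetical skeleton of the loader interface. Only the method names come from the list above; the class body, return values, and `SampleModel` are placeholders, not the real tt-forge-models base classes.

```python
import torch

class SampleModel(torch.nn.Module):
    """Placeholder model used only to make the skeleton self-contained."""
    def forward(self, x):
        return x

class ModelLoader:
    @classmethod
    def query_available_variants(cls):
        # Map of variant name -> variant config; contents are illustrative.
        return {"base": None}

    @classmethod
    def get_model_info(cls, variant=None):
        return {"name": "sample_model", "variant": variant}

    def load_model(self, **kwargs):
        return SampleModel()

    def load_inputs(self, **kwargs):
        return torch.randn(1, 3, 224, 224)

    def load_shard_spec(self, inputs):
        return None  # only needed for sharded execution

    def get_mesh_config(self, num_devices):
        return (1, num_devices)  # e.g., (batch, model) mesh for tensor parallel
```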
## Troubleshooting

- Discovery/import errors show as `Cannot import path: <loader.py>: <error>`; add per-model requirements or set `TT_XLA_DISABLE_MODEL_REQS=1` to isolate issues.
- Runtime/compilation failures are recorded with a bringup status and reason in the test properties; check the test report's `tags` and `error_message`.
- Some models may be temporarily excluded from discovery; see the logs printed during collection.
- Use `-vv` and `--collect-only` for detailed collection/ID debugging.
## Future enhancements

- Expand auto-discovery beyond PyTorch to include JAX models
- Automate updates of `tests/runner/test_config/*`, potentially based on Nightly CI results, including automatic promotion of tests from Experimental Nightly to stable Nightly
- Broader usability improvements and workflow polish, tracked in issue #1307
## Reference

- `tests/runner/test_models.py`: main parametrized pytest runner
- `tests/runner/test_utils.py`: discovery, IDs, `DynamicTorchModelTester`
- `tests/runner/requirements.py`: per-model requirements context manager
- `tests/runner/conftest.py`: config attachment, markers, `--arch`, config validation
- `tests/runner/test_config/*.yaml`: YAML test config files (source of truth)
- `tests/runner/test_config/config_loader.py`: loads/merges/validates YAML into Python at runtime
- `third_party/tt_forge_models/config.py`: `Parallelism` and model metadata