Thesis results

This page reports the headline benchmark results for F1 StratLab — the accuracy and latency of its seven ML models and six sub-agents, as referenced in chapter 5 of the TFG thesis. Every figure is regenerated automatically from the notebooks under notebooks/agents/, so it always tracks the latest model artefacts.

Threshold sweeps

Each classifier sub-agent exposes a precision–recall trade-off, and the operating point on it is a deliberate choice by the strategist. The sweeps below scan the full threshold range and mark the production operating point.
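As a rough illustration of what such a sweep computes, the sketch below evaluates precision, recall and F1 over a grid of thresholds. It is not the code from N33: the `SweepPoint` and `sweep_thresholds` names, the toy labels and scores, and the 0.05-step grid are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SweepPoint:
    threshold: float
    precision: float
    recall: float
    f1: float


def sweep_thresholds(y_true, scores, thresholds):
    """Return precision, recall and F1 at each candidate threshold."""
    points = []
    for t in thresholds:
        # A sample is predicted positive when its score reaches the threshold.
        preds = [s >= t for s in scores]
        tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p)
        fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p)
        fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and not p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        points.append(SweepPoint(t, precision, recall, f1))
    return points


# Toy labels and scores, made up for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.55, 0.62, 0.80, 0.81, 0.95]
grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

points = sweep_thresholds(y_true, scores, grid)
best = max(points, key=lambda p: p.f1)
print(f"best F1 {best.f1:.3f} at threshold {best.threshold:.2f}")
```

Plotting `f1` against `threshold` from `points` gives the kind of curve on which a production operating point can be marked.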

Overtake (N12)

The production threshold of 0.7976 was tuned in N12 step 5 on the raw LightGBM scores. The sweep shows that this operating point is robust: F1 stays within a few hundredths across the neighbouring grid points.

Safety Car (N14)

The Safety Car model is a soft contextual prior, not an exact predictor: its AUC-PR is 0.0723, roughly 1.7 times the 0.0432 baseline. The 0.234 production threshold is F2-optimal (recall-weighted) because false alarms cost little, whereas missing an imminent SC is expensive.
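To make the effect of a recall-weighted criterion concrete, the sketch below picks the threshold that maximises F-beta on a small toy dataset, once with beta = 1 and once with beta = 2. The `fbeta` and `best_threshold` helpers and the data are hypothetical and are not taken from N14; the point is only that F2 tends to settle on a lower threshold than F1 when positives are rare.

```python
def fbeta(precision, recall, beta):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(y_true, scores, thresholds, beta):
    """Pick the threshold that maximises F-beta on the given labels and scores."""
    best_t, best_score = None, -1.0
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < t)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = fbeta(precision, recall, beta)
        if score > best_score:  # ties keep the lowest threshold
            best_t, best_score = t, score
    return best_t, best_score


# Toy data with few positives, made up for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.50, 0.55, 0.90]
grid = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

t_f1, score_f1 = best_threshold(y_true, scores, grid, beta=1.0)
t_f2, score_f2 = best_threshold(y_true, scores, grid, beta=2.0)
print(f"F1-optimal threshold {t_f1:.1f}, F2-optimal threshold {t_f2:.1f}")
```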

Undercut (N16)

The undercut classifier sees the highest positive prevalence of the three (>30 % on the holdout) because the labelling step kept only pairs with a true undercut opportunity. The 0.522 threshold falls in the flat F1 region.

MC Dropout coverage

The TCN tire-degradation model (N09 global + N10 per-compound fine-tunes) uses 50-pass MC Dropout to produce P10 / P50 / P90 percentile bands. Both the raw [P10, P90] coverage (epistemic only) and the calibrated coverage that adds the empirical residual sigma (aleatoric included) are reported.

Raw coverage stays around 0.20 across all compounds because active dropout captures only the model-weight (epistemic) uncertainty, not the lap-to-lap aleatoric noise. The calibrated coverage lands at or near the 0.80 nominal target by construction (0.840 for C2).
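The gap between raw and calibrated coverage can be reproduced qualitatively on synthetic data. The sketch below is not the N33 code and uses no real model: the `simulate` stand-in, the spreads of 0.05 (MC passes) and 0.25 (lap-to-lap noise), and the rule of combining the MC spread with the residual sigma in quadrature are all assumptions chosen for illustration. The actual calibration in the notebook may combine the two terms differently.

```python
import math
import random
import statistics


def percentile(values, q):
    """Linear-interpolation percentile of a list, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def coverage(targets, lowers, uppers):
    """Fraction of targets that fall inside their [lower, upper] band."""
    hits = sum(1 for y, lo, hi in zip(targets, lowers, uppers) if lo <= y <= hi)
    return hits / len(targets)


def simulate(n_laps, rng, n_passes=50):
    """Synthetic stand-in for the TCN: narrow MC spread, wider lap-to-lap noise."""
    rows = []
    for _ in range(n_laps):
        mean = rng.uniform(0.0, 2.0)
        passes = [rng.gauss(mean, 0.05) for _ in range(n_passes)]  # epistemic spread
        target = rng.gauss(mean, 0.25)  # observed value with aleatoric noise
        rows.append((target, passes))
    return rows


rng = random.Random(0)
calibration, evaluation = simulate(1000, rng), simulate(1000, rng)

# Empirical residual sigma, estimated on the calibration split only.
sigma_resid = statistics.pstdev([t - percentile(p, 50) for t, p in calibration])
z80 = statistics.NormalDist().inv_cdf(0.90)  # half-width of a central 80 % normal band

targets = [t for t, _ in evaluation]
raw_lo = [percentile(p, 10) for _, p in evaluation]
raw_hi = [percentile(p, 90) for _, p in evaluation]

# Assumed calibration rule: combine MC spread and residual sigma in quadrature.
half = [z80 * math.hypot(statistics.pstdev(p), sigma_resid) for _, p in evaluation]
mid = [percentile(p, 50) for _, p in evaluation]
cal_lo = [m - h for m, h in zip(mid, half)]
cal_hi = [m + h for m, h in zip(mid, half)]

raw_coverage = coverage(targets, raw_lo, raw_hi)
calibrated_coverage = coverage(targets, cal_lo, cal_hi)
print(f"raw {raw_coverage:.3f}, calibrated {calibrated_coverage:.3f}")
```

Because the residual sigma is estimated on a separate calibration split, the calibrated band is not tuned on the laps it is scored against.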

How to regenerate

# Threshold sweeps + MC Dropout figures (one notebook, ~5 min on GPU)
uv run jupyter nbconvert --execute --inplace notebooks/agents/N33_thresholds_and_calibration.ipynb

# Quantitative RAG benchmark (10-15 min, builds 2 additional Qdrant collections)
uv run jupyter nbconvert --execute --inplace notebooks/agents/N30B_rag_benchmark.ipynb

Both notebooks emit CSV and Markdown tables alongside their PNGs:

Sweeps: data/eval/threshold_sweep_{overtake,sc,undercut}.{csv,md}
MC Dropout: data/eval/mc_dropout_coverage.{csv,md}
RAG benchmark: data/rag_eval/results_v1.md
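After a regeneration run it can be useful to confirm that every listed artefact was actually written. The sketch below expands the brace patterns above into concrete paths and reports which ones exist; `expected_artefacts` is a hypothetical helper, and it assumes the script is run from the repository root.

```python
from itertools import product
from pathlib import Path


def expected_artefacts(root="."):
    """Expand the brace patterns listed above into concrete paths."""
    base = Path(root)
    eval_dir = base / "data" / "eval"
    paths = [
        eval_dir / f"threshold_sweep_{name}.{ext}"
        for name, ext in product(("overtake", "sc", "undercut"), ("csv", "md"))
    ]
    paths += [eval_dir / f"mc_dropout_coverage.{ext}" for ext in ("csv", "md")]
    paths.append(base / "data" / "rag_eval" / "results_v1.md")
    return paths


if __name__ == "__main__":
    for path in expected_artefacts():
        status = "ok" if path.exists() else "MISSING"
        print(f"{status:8}{path}")
```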

Numeric headline metrics

Component	Metric	Value	Source
Pace model (N06 XGBoost)	MAE on 2025 holdout	0.410 s	`data/eval/pace_baselines.{csv,md}`
Whisper turbo (CUDA)	mean per-clip latency	233.9 ms (P95 325.8 ms)	`data/eval/whisper_results.{csv,md}`
NLP pipeline (GPU)	mean `run_pipeline` latency	42.1 ms	`data/eval/nlp_pipeline_cpu.{csv,md}`
Sub-agent latency	fastest / slowest mean	487 ms (pace) / 4.4 s (rag w/ LLM)	`data/eval/subagent_latency.{csv,md}`
RAG agent	Content P@5	0.80	`data/rag_eval/results_v1.md`
MC Dropout (C2)	calibrated 80 % coverage	0.840	`data/eval/mc_dropout_coverage.{csv,md}`

All numbers are reproducible with the commands above.
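The latency rows in the table report a mean alongside a P95. As a minimal sketch of how such a summary can be derived from raw timings, the helper below uses the nearest-rank percentile definition. The `latency_summary` name and the sample timings are assumptions, and the evaluation scripts may use a different percentile convention.

```python
import statistics


def latency_summary(latencies_ms):
    """Mean and nearest-rank 95th percentile of a list of latencies in ms."""
    ordered = sorted(latencies_ms)
    # Nearest-rank definition: ceil(0.95 * n), computed in integers, 1-indexed.
    rank = (95 * len(ordered) + 99) // 100
    return statistics.fmean(ordered), ordered[rank - 1]


# Made-up timings for illustration only: 1 ms, 2 ms, ..., 100 ms.
sample = list(range(1, 101))
mean_ms, p95_ms = latency_summary(sample)
print(f"mean {mean_ms:.1f} ms (P95 {p95_ms:.1f} ms)")
```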