Thesis results

This page reports the headline benchmark results for F1 StratLab: the accuracy and latency of its seven ML models and six sub-agents, as referenced in chapter 5 of the TFG thesis. Every figure is regenerated from the notebooks under notebooks/agents/, so it tracks the model artefacts as of the last notebook run.

Threshold sweeps

Each classifier sub-agent exposes a precision–recall trade-off that the strategist picks deliberately. The sweeps below scan the full threshold space and mark the production operating point.
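As a rough illustration of what such a sweep computes, the sketch below scans a grid of thresholds over a set of scores and reports precision, recall and F1 at each one. It is a minimal sketch, not the notebook code: the function name `sweep_thresholds` and the toy labels and scores are assumptions made up for this example.

```python
import numpy as np

def sweep_thresholds(y_true, scores, thresholds):
    """Return (threshold, precision, recall, f1) for every threshold in the grid."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    rows = []
    for t in thresholds:
        pred = scores >= t                      # positive when the score clears the threshold
        tp = int(np.sum(pred & y_true))
        fp = int(np.sum(pred & ~y_true))
        fn = int(np.sum(~pred & y_true))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((float(t), precision, recall, f1))
    return rows

# Toy data, invented for illustration only (not thesis results).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.85, 0.95]

rows = sweep_thresholds(y_true, scores, np.linspace(0.0, 1.0, 21))
best = max(rows, key=lambda r: r[3])            # grid point with the highest F1
print(f"best threshold={best[0]:.2f}  P={best[1]:.2f}  R={best[2]:.2f}  F1={best[3]:.2f}")
```

Plotting the rows against the threshold gives the kind of curve on which a production operating point can be marked.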

Overtake (N12)

The production threshold of 0.7976 was tuned in N12 step 5 on the raw LightGBM scores. The sweep shows the trade-off is robust around that point: F1 stays within a few hundredths across the neighbouring grid points.

Safety Car (N14)

The Safety Car (SC) model is a soft contextual prior, not an exact predictor. AUC-PR is 0.0723 versus a 0.0432 baseline, roughly 1.7 times the baseline. The 0.234 production threshold is F2-optimal (recall-weighted) because false alarms cost little and missing an imminent SC is expensive.
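To see why a recall-weighted criterion tends to pick a lower threshold, the sketch below evaluates the standard F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), on three candidate operating points. The candidate precision and recall values are invented for illustration and are not the N14 results.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical (threshold, precision, recall) triples, NOT thesis numbers.
candidates = [
    (0.20, 0.15, 0.95),   # low threshold: many alarms, few misses
    (0.50, 0.30, 0.50),
    (0.80, 0.60, 0.20),   # high threshold: few alarms, many misses
]

best_f1 = max(candidates, key=lambda c: f_beta(c[1], c[2], beta=1))
best_f2 = max(candidates, key=lambda c: f_beta(c[1], c[2], beta=2))

print("F1-optimal threshold:", best_f1[0])
print("F2-optimal threshold:", best_f2[0])
```

On these made-up points the F1 criterion prefers the middle threshold while F2 prefers the lowest one, which is the behaviour wanted when a missed event costs more than a false alarm.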

Undercut (N16)

The undercut classifier sees the highest positive prevalence of the three (>30 % on the holdout) because the labelling step kept only pairs with a true undercut opportunity. The 0.522 threshold falls in the flat F1 region.

MC Dropout coverage

The TCN tire-degradation model (N09 global + N10 per-compound fine-tunes) uses 50-pass MC Dropout to produce P10 / P50 / P90 percentile bands. Both the raw [P10, P90] coverage (epistemic only) and the calibrated coverage that adds the empirical residual sigma (aleatoric included) are reported.

Raw coverage stays around 0.20 across all compounds: active dropout captures only the model-weight (epistemic) uncertainty, not the lap-to-lap aleatoric noise. The calibrated coverage lands close to the 0.80 nominal target by construction (0.840 for C2 in the headline table).
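The gap between raw and calibrated coverage can be reproduced on synthetic data. The sketch below is an assumption-laden illustration, not the N09/N10 code: it fakes the 50 stochastic passes with NumPy instead of a TCN, and it assumes the calibrated band is the median plus or minus a Gaussian 80 % half-width whose sigma combines the pass-to-pass spread and the residual sigma in quadrature. The thesis may combine them differently.

```python
import numpy as np

Z80 = 1.2816  # standard-normal quantile for a central 80 % interval

def raw_coverage(samples, y_true):
    """Fraction of targets inside the [P10, P90] band of the stochastic passes."""
    p10, p90 = np.percentile(samples, [10, 90], axis=0)
    return float(np.mean((y_true >= p10) & (y_true <= p90)))

def calibrated_coverage(samples, y_true, residual_sigma):
    """Coverage after widening the band with an empirical residual sigma (assumed recipe)."""
    p50 = np.percentile(samples, 50, axis=0)
    epistemic_sigma = samples.std(axis=0)
    total_sigma = np.sqrt(epistemic_sigma**2 + residual_sigma**2)
    lo, hi = p50 - Z80 * total_sigma, p50 + Z80 * total_sigma
    return float(np.mean((y_true >= lo) & (y_true <= hi)))

# Synthetic stand-in for model output: all values below are invented.
rng = np.random.default_rng(0)
n_passes, n_points = 50, 2000
signal = np.linspace(0.0, 2.0, n_points)
y_true = signal + rng.normal(0.0, 0.30, n_points)                # lap-to-lap (aleatoric) noise
samples = signal + rng.normal(0.0, 0.05, (n_passes, n_points))   # small pass-to-pass spread

# Residual sigma estimated on the same data for brevity; a held-out split would be cleaner.
residual_sigma = float(np.std(y_true - np.percentile(samples, 50, axis=0)))

raw = raw_coverage(samples, y_true)
cal = calibrated_coverage(samples, y_true, residual_sigma)
print(f"raw coverage={raw:.3f}  calibrated coverage={cal:.3f}")
```

Because the pass-to-pass spread is much narrower than the observation noise, the raw band covers only a small fraction of targets, while the widened band sits near the nominal 0.80.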

How to regenerate

# Threshold sweeps + MC Dropout figures (one notebook, ~5 min on GPU)
uv run jupyter nbconvert --execute --inplace notebooks/agents/N33_thresholds_and_calibration.ipynb

# Quantitative RAG benchmark (10-15 min, builds 2 additional Qdrant collections)
uv run jupyter nbconvert --execute --inplace notebooks/agents/N30B_rag_benchmark.ipynb

Both notebooks emit CSV and Markdown tables alongside their PNGs.

Numeric headline metrics

| Component | Metric | Value | Source |
| --- | --- | --- | --- |
| Pace model (N06 XGBoost) | MAE on 2025 holdout | 0.410 s | data/eval/pace_baselines.{csv,md} |
| Whisper turbo (CUDA) | mean per-clip latency | 233.9 ms (P95 325.8 ms) | data/eval/whisper_results.{csv,md} |
| NLP pipeline (GPU) | mean run_pipeline | 42.1 ms | data/eval/nlp_pipeline_cpu.{csv,md} |
| Sub-agent latency | min / max mean | 487 ms (pace) / 4.4 s (rag w/ LLM) | data/eval/subagent_latency.{csv,md} |
| RAG agent | Content P@5 | 0.80 | data/rag_eval/results_v1.md |
| MC Dropout (C2) | calibrated 80 % coverage | 0.840 | data/eval/mc_dropout_coverage.{csv,md} |

All numbers are reproducible with the commands above.