December 15, 2025

How to Evaluate an AI Development Company: 5 Red and Green Flags for Decision Makers

Written by

Ignas Vaitukaitis

AI Agent Engineer - LLMs · Diffusion Models · Fine-Tuning · RAG · Agentic Software · Prompt Engineering

Selecting an AI development partner feels like navigating a minefield. Every vendor promises cutting-edge solutions, impressive demos, and seamless integration. But demos hide reality. Leading AI models still hallucinate at rates between 26% and 90%, depending on the task. The European Broadcasting Union found that AI assistants misrepresented news content in 45% of their responses.

The gap between a polished presentation and production-ready AI is enormous. Decision-makers need a framework that cuts through marketing claims and reveals which vendors can actually deliver reliable, compliant, and cost-controlled AI systems at scale.

This guide presents five critical evaluation flags—each with clear green indicators (what good looks like) and red warnings (risk signals to avoid). These flags form a mutually reinforcing system. The absence of any single flag represents meaningful risk to your project.

Quick Answer: The strongest predictor of a reliable AI vendor is an embedded evaluation culture supported by rigorous data lineage, standardized observability, cross-regime governance alignment, and an operationalized prompt/RAG lifecycle with CI gates. Vendors exhibiting all five green flags will outperform peers on reliability, cost control, and compliance.

How We Selected These Evaluation Criteria

This framework synthesizes evidence from multiple authoritative domains:

Clinical trial methodology informs how vendors should design and report randomized experiments. The Studies Within A Trial (SWAT) concept from Trial Forge Guidance provides a rigorous model for embedded experimentation that translates directly to AI workflows.

Healthcare data standards from HL7, including FHIR, establish patterns for data provenance, interoperability, and audit trails that apply across regulated industries.

LLMOps production disciplines documented by OpenTelemetry’s GenAI semantic conventions and evaluation frameworks like DeepEval define what production-grade AI operations look like.

Regulatory frameworks including NIST AI RMF crosswalks and EU AI Act requirements establish governance expectations for high-risk systems.

We prioritized recent sources (2024–2025) from authoritative institutions and treated non-authoritative sources as illustrative only. Each flag is grounded in verifiable practices with direct ties to these sources.

Comparison Table: Red vs. Green Flags at a Glance

Flag | Green Indicators | Red Warnings | Evidence to Require
Evidence Culture | Protocolized A/B studies; ethical harmonization; CONSORT-like reporting | Ad hoc “prompt tinkering”; no ethics analysis; no interim evidence | Anonymized A/B protocol; SWAT result examples
Data Provenance | Dataset due diligence with lineage; FHIR/HL7-aware pipelines; validation strategies | “Trust us” pipelines; no lineage documentation; ignores data quality | Dataset dossier; provenance logs; FHIR interfaces
Observability & SLOs | OpenTelemetry gen_ai telemetry; SLIs for hallucination/groundedness; incident playbooks | Opaque logs; no hallucination metrics; single-provider dependency | 2-week pilot dashboard; incident drill results
Governance Alignment | NIST→ISO 42001 crosswalked program; high-risk system plans; PMM drills | ISO badges as compliance claims; no HITL design; no regulatory roadmap | Crosswalk mappings; sample technical file
LLMOps Maturity | Prompt externalization; RAG CI evaluation; cost controls and routing | Hard-coded prompts; no CI gates; no cost governance | Prompt registry; CI config; cost control evidence

1. Evidence Culture with Embedded Experiments – The SWAT Mindset for AI

A vendor’s ability to generate evidence about process choices in real-time is crucial when deploying probabilistic systems. The SWAT concept—randomized, embedded studies within a host trial—translates to AI as controlled A/B experiments embedded within production-like workflows.

Why This Flag Matters Most

LLMs behave unpredictably. Prompts that work in testing may fail in production. Retrieval configurations that seem optimal may degrade over time. Without systematic experimentation, vendors rely on intuition and ad hoc “prompt tinkering”—a recipe for unreliable systems.

Top vendors treat prompts, retrieval settings, and guardrails as “trial processes” subject to rigorous evaluation. They design experiments with clear objectives, randomization, and analysis plans. Trial Forge Guidance 1 formalizes this embedded experimentation approach, which is the single strongest predictor of reliable delivery under uncertainty.
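To make this concrete, here is a minimal sketch of how an embedded A/B experiment might be wired into an AI workflow: requests are deterministically randomized to a prompt variant and a pre-registered outcome measure is logged for later CONSORT-style analysis. The protocol fields, variant names, outcome metric, and log file are illustrative assumptions, not a prescribed design.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical protocol: two prompt variants compared on one pre-registered
# outcome measure (e.g. a groundedness score from an offline judge).
PROTOCOL = {
    "experiment_id": "swat-prompt-variant-001",
    "variants": ["prompt_v1", "prompt_v2"],
    "primary_outcome": "groundedness_score",
}

def assign_variant(unit_id: str) -> str:
    """Deterministically randomize a unit (user or session) to a variant."""
    digest = hashlib.sha256(f"{PROTOCOL['experiment_id']}:{unit_id}".encode()).hexdigest()
    return PROTOCOL["variants"][int(digest, 16) % len(PROTOCOL["variants"])]

def record_result(unit_id: str, variant: str, outcome: float) -> None:
    """Append one observation to an audit log for later reporting and analysis."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "experiment_id": PROTOCOL["experiment_id"],
        "unit_id": unit_id,
        "variant": variant,
        PROTOCOL["primary_outcome"]: outcome,
    }
    with open("experiment_log.jsonl", "a") as fh:
        fh.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    variant = assign_variant("session-1234")
    # ...run the workflow with the assigned prompt variant, then score the output...
    record_result("session-1234", variant, outcome=0.87)
```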

Green Flags (What Good Looks Like)

Registered, protocolized experiments: The vendor writes short protocols for embedded experiments adapted from Trial Forge templates. They maintain an internal “SWAT registry” for reuse across clients and can discuss CONSORT-style reporting adapted to AI process studies.
Ethical heterogeneity handled upfront: For projects touching human subjects or sensitive data, the vendor demonstrates harmonization across jurisdictions. They explain when consent is required versus waived for methodological variations that don’t change material risk, following Trial Forge Guidance 5.
SWAT results drive host decisions: The vendor shows examples where interim experiment results informed system process choices—such as selecting a retention messaging variant based on measured performance.
Healthcare familiarity: For clinical projects, vendors understand registry-based randomized trials and endpoints derived from routinely collected health data (RCHD), including adjudication approaches to ensure outcomes are robust.

Red Flags (Risk Indicators)

No experimentation framework; reliance on intuition or ad hoc changes
Ethics treated as an afterthought; no protocol or consent analysis for user-facing experiments
No willingness to publish internal methodological findings or adapt reporting templates
Cannot explain how experiment results inform production decisions

Questions to Ask

“Provide one anonymized example of an embedded, randomized A/B study protocol you ran in the last 12 months, including objective, outcome measures, and governance approvals where applicable.”

“How would you adapt Trial Forge’s SWAT practices to our environment, including ethics for cross-border testing?”

Best For

Organizations deploying AI in high-stakes environments where system behavior must be validated continuously. Essential for healthcare, financial services, and any domain where decisions affect people’s lives or livelihoods.

2. Data Provenance and Interoperability – The Foundation of Trustworthy AI

Regulatory-grade outcomes require documented data lineage, integrity standards, and interoperability. Vendors who cannot produce provenance evidence for your domains are high risk—both scientifically and legally.

Why This Flag Matters

AI systems are only as good as their data. In regulated industries, you must evidence where data came from, how it was curated, and whether it’s suitable for your use case. According to research on routinely collected health data, trial sponsors using RCHD must document custodianship, curation, automation, and quality controls across the entire lifecycle.

The HL7 FHIR ecosystem defines the patterns, and hosts the communities, that vendors should follow for secure exchange and traceability.
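As a rough illustration of what documented lineage can look like in code rather than prose, the sketch below records a per-dataset due-diligence dossier as structured, auditable data. The field names and example values are assumptions made for this sketch, not a standard schema.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class DatasetDossier:
    """Illustrative due-diligence record for one external data source."""
    dataset_id: str
    custodian: str                      # who controls the source system
    provenance: str                     # where the data originates
    curation_steps: list[str] = field(default_factory=list)
    known_lag_days: int | None = None   # reporting/refresh lag, if any
    known_limitations: list[str] = field(default_factory=list)
    fitness_for_use: str = ""           # assessment against the intended endpoint

registry_dossier = DatasetDossier(
    dataset_id="regional-claims-registry",
    custodian="Example Registry Office",
    provenance="Routinely collected claims records, monthly extracts",
    curation_steps=["deduplication", "code mapping", "manual adjudication sample"],
    known_lag_days=45,
    known_limitations=["incomplete capture of acute events", "coding drift across sites"],
    fitness_for_use="Suitable for chronic-outcome endpoints; not for acute-phase endpoints",
)

# Serialize for the Trial Master File or an audit evidence package.
print(json.dumps(asdict(registry_dossier), indent=2))
```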

Green Flags (What Good Looks Like)

Dataset due diligence files: For each external dataset, vendors produce a dossier documenting provenance, curation workflows, known lags, and fitness for use. Robust curation and controls can support use as “transcribed source equivalents” suitable for regulatory submissions.
Validation and endpoint comparability: Evidence from linking RCHD to trial data shows strong specificity (97.9–99.9%) and negative predictive value (99.0–99.7%) for disease-related outcomes. Vendors understand lag issues and propose appropriate mitigation strategies.
Interoperability-first architecture: The vendor designs around FHIR resources and can show familiarity with “FHIR Testing,” “SMART on FHIR,” and “FAST Security” patterns for secure, auditable integration.
Protocol and TMF integration: They propose recording dataset relevance and validity assessments in protocols and Trial Master Files.

Red Flags (Risk Indicators)

“Trust us” about data pipelines; no documentation of lineage, curation, or controls
Claiming RCHD suffices for acute-phase endpoints without addressing lag or completeness risks
No exposure to FHIR/HL7 practices; no secure exchange design pattern
Cannot explain data quality limitations or mitigation strategies

Questions to Ask

“Provide one due-diligence sample for a registry or RCHD-like source that includes curator SOPs, audit logs, known weaknesses, and mitigations.”

“How would you instrument provenance and audit trails across extraction→transform→load→model inference using standardized telemetry?”

Best For

Healthcare organizations, clinical research sponsors, and any enterprise operating in regulated industries where data integrity is subject to audit. Critical for organizations using external datasets or registries as inputs to AI systems.

3. Observability and SLOs Using OpenTelemetry – Seeing Inside the Black Box

Without standardized traces, metrics, and logs, you cannot debug failure modes, quantify hallucinations, or meet SLAs. Production-grade vendors emit comparable, vendor-neutral telemetry that captures the entire prompt→retrieval→tool→inference flow.

Why This Flag Matters

The incorporation of OpenLLMetry semantics into OpenTelemetry’s GenAI conventions means vendors can now produce standardized telemetry with token/cost accounting and latency breakdowns. This isn’t optional—it’s the foundation for reliability engineering.

Given that hallucination rates remain significant across top models (26–90%+ depending on model and task), vendors must propose domain-specific SLIs beyond generic accuracy metrics.
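The sketch below shows, in rough form, what emitting gen_ai-style spans can look like with the OpenTelemetry Python SDK: a parent span over the prompt→retrieval→inference flow with token counts attached as attributes. The gen_ai.* attribute keys reflect the published GenAI semantic conventions as we understand them and are still evolving, so verify the exact names against the current specification; the retrieval attribute is an illustrative placeholder.

```python
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Console exporter for illustration; production would ship spans to an OTLP backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("demo.genai")

def answer_question(question: str) -> str:
    # Parent span covering the whole prompt -> retrieval -> inference flow.
    with tracer.start_as_current_span("rag.answer") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", "example-model")

        with tracer.start_as_current_span("rag.retrieve") as retrieval:
            retrieval.set_attribute("retrieval.top_k", 5)  # illustrative, non-standard key
            context = ["...retrieved chunk..."]

        with tracer.start_as_current_span("gen_ai.chat") as call:
            # In a real system these values come from the provider response.
            call.set_attribute("gen_ai.usage.input_tokens", 412)
            call.set_attribute("gen_ai.usage.output_tokens", 128)
            answer = f"...model output grounded in {len(context)} retrieved chunk(s)..."

        return answer

if __name__ == "__main__":
    print(answer_question("What is our refund policy?"))
```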

Green Flags (What Good Looks Like)

Emits gen_ai semantic telemetry: The vendor’s services emit spans and metrics for prompt components (with token counts), retrieval spans (embedding model, top-k, filters), model call metadata, tool calls, cost accounting, and latency breakdowns. This matches OpenTelemetry GenAI specifications.
SLIs/SLOs reflect user experience and risk: Vendors propose SLOs that include accuracy proxies, groundedness thresholds, guardrail violation rates, p95 latency, availability, and cost ceilings. They provide incident response, failover strategies, and canary rollouts.
Hallucination and safety monitoring: Vendors propose domain-specific SLIs beyond generic accuracy and monitor over-refusal versus unsafe leakage rates longitudinally.
Multi-provider resilience: Router-based policies enable failover between providers; no single-provider dependency.
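A minimal sketch of the router-based failover pattern mentioned above: providers are tried in preference order and the call fails over when an adapter raises. The provider adapters, error type, and routing order are placeholders; a production router would add health checks, circuit breakers, and cost-tier logic.

```python
from typing import Callable

class ProviderError(Exception):
    """Raised by a provider adapter when a call fails or times out."""

def call_provider_a(prompt: str) -> str:
    raise ProviderError("simulated outage")        # placeholder adapter

def call_provider_b(prompt: str) -> str:
    return f"[provider-b] response to: {prompt}"   # placeholder adapter

# Ordered by preference; a real router might order by cost tier or observed latency.
ROUTES: list[Callable[[str], str]] = [call_provider_a, call_provider_b]

def route(prompt: str) -> str:
    """Try providers in order, failing over on errors."""
    last_error: Exception | None = None
    for provider in ROUTES:
        try:
            return provider(prompt)
        except ProviderError as exc:
            last_error = exc          # log and continue to the next provider
    raise RuntimeError("all providers failed") from last_error

if __name__ == "__main__":
    print(route("Summarize the incident report."))
```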

Red Flags (Risk Indicators)

Opaque logging with no standard schema; difficulty attributing errors to prompt, retrieval, tool, or model
No explicit hallucination or groundedness metrics; platform cannot compute or alert on these
Single-provider dependency with no failover or routing logic
Cannot demonstrate incident response procedures

Questions to Ask

“Instrument a sample endpoint in our test environment with OpenTelemetry gen_ai semantics and produce a 2-week pilot dashboard showing p95 latency, token/cost per task, groundedness errors by prompt version, and retrieval hit rates.”

“Show us your incident runbooks, circuit breakers, and rollback procedures. Conduct a live drill before broader rollout.”

Best For

Any organization deploying AI in production where uptime, cost control, and quality matter. Essential for customer-facing applications, high-volume systems, and any deployment where failures have business consequences.

4. Cross-Regime Governance Alignment – Regulatory Readiness That Scales

Regulation is converging globally. NIST AI RMF-based programs can be mapped to ISO/IEC 42001, ISO/IEC 23894, and further aligned to OECD recommendations and regimes like the EU AI Act. Vendors who cannot show this mapping are risky partners.

Why This Flag Matters

NIST AI RMF crosswalks demonstrate that one governance program can satisfy multiple jurisdictions and auditors. For high-risk use cases like hiring, the EU AI Act classifies CV screening and ranking systems under Annex III, requiring conformity assessment, EU database registration, and post-market monitoring (PMM).

Vendors who claim that “ISO badges” prove compliance are misleading you. ISO certifications help organize engineering and governance, but they do not themselves prove EU AI Act compliance until harmonized standards are referenced in the Official Journal of the European Union (OJEU).
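To make the crosswalk idea tangible, the sketch below represents control-catalog entries as structured mappings from NIST AI RMF functions to ISO/IEC 42001 clauses and the evidence artifacts intended to satisfy both, plus a simple gap check. The clause references and artifact names are illustrative placeholders, not an authoritative mapping.

```python
# Illustrative crosswalk entries; clause references and artifacts are placeholders.
CROSSWALK = [
    {
        "nist_ai_rmf": "GOVERN (policies and accountability)",
        "iso_42001": "Clause 5 (Leadership)",               # placeholder mapping
        "evidence": ["AI policy document", "roles and responsibilities matrix"],
    },
    {
        "nist_ai_rmf": "MEASURE (tracking and evaluating AI risk)",
        "iso_42001": "Clause 9 (Performance evaluation)",   # placeholder mapping
        "evidence": ["SLO dashboard export", "quarterly model evaluation report"],
    },
]

def evidence_gaps(crosswalk: list[dict], available_artifacts: set[str]) -> list[str]:
    """List required evidence artifacts that are not yet on file."""
    missing = []
    for entry in crosswalk:
        for artifact in entry["evidence"]:
            if artifact not in available_artifacts:
                missing.append(f"{entry['nist_ai_rmf']}: {artifact}")
    return missing

print(evidence_gaps(CROSSWALK, {"AI policy document"}))
```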

Green Flags (What Good Looks Like)

Crosswalked governance program: The vendor maintains a control catalog aligned to NIST AI RMF and demonstrates how controls instantiate ISO/IEC 42001 management system requirements with evidence packages and audit trails.
High-risk system readiness: For CV screening, ranking, or performance prediction, the vendor acknowledges EU AI Act Annex III classification, has documented risk management, data governance procedures, human oversight design, and plans for conformity assessment.
Realistic timeline discipline: According to provider playbook guidance, vendors can speak to Q3/Q4 milestones: internal audits against Articles 9–15 and 17; corrective actions; QMS and technical file to Notified Body; EU database entries; PMM drills.
Post-market monitoring as operational loop: PMM looks like a production SRE loop with telemetry feeds, user feedback, periodic re-evaluations, and reporting obligations.

Red Flags (Risk Indicators)

“We have ISO 27001 and 27701, so we’re compliant” (these certifications don’t imply AI Act compliance)
No documented human oversight design or post-market monitoring plan
No plan for EU database registration or CE marking
No awareness of Notified Body interactions for high-risk systems

Questions to Ask

“Show us your NIST→ISO 42001 crosswalk mappings and a sample evidence package including model card, data provenance logs, risk assessments, and HITL documentation.”

“For hiring/HR systems: provide a draft technical file and risk management file, plus a roadmap to conformity assessment and registration.”

Best For

Organizations deploying AI in regulated industries or high-risk use cases. Critical for hiring/HR technology, healthcare applications, financial services, and any system that affects fundamental rights or safety.

5. LLMOps Maturity with Prompt/RAG Lifecycle – Production-Grade Operations

LLMOps is not “MLOps++”—it manages probabilistic behavior, versioned prompts, retrieval configurations, multi-model routing, and token economics. Mature vendors externalize prompts, run A/B tests, evaluate RAG components in CI, and tie deployments to evaluation gates.

Why This Flag Matters

According to Microsoft’s LLM app lifecycle guidance, production systems must enable reverting from production to experimentation when metrics degrade. This requires treating prompts as first-class, versioned artifacts.
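A minimal sketch of prompt externalization, assuming a simple JSON registry file and hypothetical helper names: prompts live outside the codebase, are addressed by name and version, and the version used is available for logging and A/B comparison. Dedicated tooling provides richer versions of the same idea.

```python
import json
from pathlib import Path

REGISTRY_PATH = Path("prompts/registry.json")

def load_prompt(name: str, version: str) -> str:
    """Fetch a prompt template by name and version from the external registry."""
    registry = json.loads(REGISTRY_PATH.read_text())
    return registry[name][version]

def render(name: str, version: str, **variables: str) -> str:
    """Fill the template; log the name and version alongside telemetry for attribution."""
    return load_prompt(name, version).format(**variables)

if __name__ == "__main__":
    # Seed a tiny registry for the demo; in practice this file is managed
    # through change control or a prompt-management tool.
    REGISTRY_PATH.parent.mkdir(parents=True, exist_ok=True)
    if not REGISTRY_PATH.exists():
        REGISTRY_PATH.write_text(json.dumps({
            "support_answer": {
                "v1": "Answer using only this context:\n{context}\n\nQ: {question}",
                "v2": "You are a support agent. Cite the context passage you used.\n{context}\n\nQ: {question}",
            }
        }, indent=2))
    print(render("support_answer", "v2",
                 context="Refunds are processed within 14 days.",
                 question="How long do refunds take?"))
```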

RAG evaluation frameworks show that retriever and generator components must be evaluated separately with CI unit tests. Without this discipline, you cannot attribute failures or improve systematically.
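As an illustration, a CI-style unit test using DeepEval's documented metric classes might look like the sketch below; check the exact class names and signatures against the version you install, and note that these metrics rely on an LLM judge, so the CI runner needs model access. The test query and expected behavior are invented for the example.

```python
# pip install deepeval   (the metrics below use an LLM judge, so an API key is required)
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

def test_refund_policy_answer():
    # In CI, actual_output and retrieval_context would come from running the
    # pipeline against a fixed test query; they are hard-coded here for illustration.
    test_case = LLMTestCase(
        input="How long do refunds take?",
        actual_output="Refunds are processed within 14 days of the return being received.",
        retrieval_context=["Refunds are processed within 14 days."],
    )
    # Generator quality: is the answer relevant and grounded in the retrieved context?
    metrics = [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8),
    ]
    assert_test(test_case, metrics)  # fails the CI job if any threshold is missed
```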

Green Flags (What Good Looks Like)

Prompt externalization and versioning: Prompts are managed as first-class artifacts decoupled from code, enabling discovery, side-by-side comparisons, multi-model testing, and governance. Tooling options include Langfuse, PromptHub, LangSmith, Promptfoo, and MLflow.
RAG evaluation as CI: Vendors evaluate retriever and generator separately, run CI unit tests, and assert on metrics such as Answer Relevancy, Faithfulness, Contextual Precision/Recall, plus domain rubrics. They log hyperparameters (chunk size, top-K, embedding model) for regression analysis.
Synthetic data for retriever tuning: According to NVIDIA’s synthetic data pipeline research, vendors use synthetic QA generation with answerability filters and hard-negative mining to improve embedding model performance.
Cost governance: Rate limits, per-request cost ceilings, routing tiering (small→large model escalation), caching, and batch optimization prevent cost overruns.
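A simplified sketch of the cost controls listed in the item above: a per-request cost ceiling plus small-to-large model escalation gated on a confidence heuristic. The model names, prices, thresholds, and confidence signal are placeholder assumptions.

```python
# Illustrative per-1K-token prices; real figures come from your provider contract.
PRICES_PER_1K_TOKENS = {"small-model": 0.0005, "large-model": 0.01}
MAX_COST_PER_REQUEST = 0.02  # hard ceiling in USD (illustrative)

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    return PRICES_PER_1K_TOKENS[model] * (input_tokens + output_tokens) / 1000

def call_model(model: str, prompt: str) -> tuple[str, float]:
    """Placeholder adapter; a real one returns text plus some confidence signal."""
    return f"[{model}] answer", 0.55 if model == "small-model" else 0.92

def answer_with_budget(prompt: str, input_tokens: int = 500) -> str:
    # Tier 1: try the small model first.
    text, confidence = call_model("small-model", prompt)
    spent = estimate_cost("small-model", input_tokens, 200)
    if confidence >= 0.7:
        return text
    # Tier 2: escalate only if the remaining budget allows it.
    escalation_cost = estimate_cost("large-model", input_tokens, 200)
    if spent + escalation_cost > MAX_COST_PER_REQUEST:
        return text  # serve the cheaper answer rather than breach the ceiling
    text, _ = call_model("large-model", prompt)
    return text

if __name__ == "__main__":
    print(answer_with_budget("Summarize the contract clause on termination."))
```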

Red Flags (Risk Indicators)

Hard-coded prompts; no prompt registry or A/B testing framework
No component-level evaluation; only end-to-end demos with cherry-picked inputs
No CI gates or synthetic/adversarial test suites; rollout decisions are subjective
No cost monitoring and controls at inference

Questions to Ask

“Show us your prompt registry with change control and prior A/B results including statistical analyses and model-agnostic comparisons.”

“Provide your RAG evaluation suite and CI config with metrics, thresholds, and regression tracking. Show one example of a production rollback triggered by failing tests.”

“Demonstrate cost governance designs and evidence these defenses have prevented cost overruns.”

Best For

Any organization moving from AI pilots to production deployment. Essential for teams managing multiple prompts, retrieval configurations, or model versions. Critical for cost-conscious deployments at scale.

How to Choose the Right AI Development Company

Key Factors to Consider

Your risk profile determines flag priority. High-risk deployments (healthcare, hiring, financial decisions) require all five flags at full strength. Lower-risk internal tools may tolerate gaps in governance alignment while prioritizing observability and LLMOps maturity.

Evidence culture is non-negotiable. Without systematic experimentation, vendors cannot learn or improve. This flag predicts long-term partnership success better than any technical capability.

Observability enables everything else. You cannot manage what you cannot measure. Standardized telemetry is the foundation for SLAs, cost control, and continuous improvement.

Questions to Ask Yourself

1. What happens if this AI system fails? (Determines risk classification)

2. Do we operate in regulated industries or jurisdictions? (Determines governance requirements)

3. How will we know if the system is working? (Determines observability needs)

4. How often will we need to update prompts or retrieval? (Determines LLMOps requirements)

5. What external data sources will we use? (Determines provenance requirements)

Common Mistakes to Avoid

Trusting demos over evidence. Demos are curated; production is chaotic. Require pilot dashboards and incident drills.
Accepting ISO badges as compliance proof. ISO certifications organize governance but don’t prove AI Act compliance.
Ignoring cost governance. Token costs compound quickly. Require cost controls from day one.
Skipping the bake-off. A 90-day pilot with clear milestones reveals vendor capabilities better than any RFP response.

Recommended Weighting

Based on the assembled evidence, weight these flags as follows:

Flag | Recommended Weight
Evidence Culture (SWAT mindset) | 25%
Observability + SLOs | 25%
LLMOps Maturity | 20%
Data Provenance | 20%
Governance Alignment | 10%

Rationale: Without experimentation and observability, you cannot learn or control in production. LLMOps maturity translates learning into managed deployments. Provenance ensures outcomes are grounded in admissible data. Governance is essential but depends on the other four being real and evidenced.
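A quick, hypothetical scoring sketch shows how the weighting can be applied: score each vendor 0–5 per flag during the bake-off and compute the weighted total. The example vendor scores below are invented.

```python
# Weights from the table above, expressed as fractions of 1.0.
WEIGHTS = {
    "evidence_culture": 0.25,
    "observability_slos": 0.25,
    "llmops_maturity": 0.20,
    "data_provenance": 0.20,
    "governance_alignment": 0.10,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-flag scores (0-5 scale) into a single weighted total."""
    return sum(WEIGHTS[flag] * scores[flag] for flag in WEIGHTS)

# Invented example scores for one vendor, gathered during the 90-day pilot.
vendor_a = {
    "evidence_culture": 4,
    "observability_slos": 5,
    "llmops_maturity": 3,
    "data_provenance": 4,
    "governance_alignment": 2,
}
print(f"Vendor A weighted score: {weighted_score(vendor_a):.2f} / 5")
```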

A 90-Day Vendor Bake-Off Plan

Days 0–14: Discovery and Risk Mapping

Inventory candidate use cases; classify by impact and data sensitivity
Define domain-specific SLIs and success criteria for a SWAT-style experiment
Deliverables: Experiment protocol draft; data provenance plan; dataset due-diligence dossier

Days 15–45: Build Core LLMOps Scaffold

Deploy prompt registry; decouple prompts from code; set up A/B harness
Instrument OpenTelemetry gen_ai; emit spans for prompt→retrieval→tool→inference
Stand up RAG CI tests with domain rubrics
Deliverables: Working telemetry dashboards; initial CI pipelines; synthetic QA pipeline

Days 46–75: Integrate Monitoring and Safety Gates

Define SLOs; configure alerts and cost ceilings; implement model routing
Run canary rollout; conduct incident drill
If high-risk, draft QMS/technical file and run PMM rehearsal
Deliverables: Canary results; drill report; governance crosswalk for use case

Days 76–90: Pilot, Evaluate, and Decide

Run the SWAT experiment; collect data; produce CONSORT-like report
Present outcome versus criteria; scale up or remediate
Decision triggers: If SLIs/SLOs met and evidence package complete, proceed. If not, require corrective action plan.

Frequently Asked Questions

What is the most important flag when evaluating an AI development company?

Evidence culture with embedded experiments is the strongest predictor of reliable delivery. Vendors who run systematic A/B tests on prompts, retrieval, and guardrails can learn and improve continuously. Without this capability, vendors rely on intuition—which fails when probabilistic systems behave unexpectedly in production.

How do I know if a vendor’s governance claims are legitimate?

Ask for NIST→ISO 42001 crosswalk mappings with actual evidence packages. Legitimate vendors can show control catalogs, risk assessments, model cards, and audit trails. Be skeptical of vendors who cite ISO 27001 or 27701 as proof of AI compliance—these certifications help but don’t address AI-specific requirements under frameworks like the EU AI Act.

What observability metrics should I require from an AI vendor?

At minimum, require SLIs for groundedness/faithfulness, hallucination rate, guardrail violation rate, p95 latency, and cost per interaction. Vendors should emit OpenTelemetry gen_ai semantic telemetry covering prompt components, retrieval spans, model calls, and tool invocations. Dashboards should enable attribution of failures to specific components.

How can I evaluate a vendor’s LLMOps maturity quickly?

Ask to see their prompt registry and one example of a production rollback triggered by failing CI tests. Mature vendors externalize prompts from code, version them, run A/B tests, and gate deployments on evaluation thresholds. If prompts are hard-coded or rollout decisions are subjective, the vendor lacks production-grade discipline.

What should I include in contracts with AI development companies?

Include clauses requiring: (1) OpenTelemetry GenAI semantic conventions with dashboard access, (2) CI evaluation gates with specific metric thresholds and rollback on breaches, (3) NIST AI RMF→ISO 42001 crosswalk mappings with evidence artifacts, and (4) at least one protocolized A/B experiment during pilot with ethical approvals and outcome reporting.

Conclusion

Evaluating an AI development company requires looking beyond demos and marketing claims. The five flags outlined—evidence culture, data provenance, observability, governance alignment, and LLMOps maturity—form a reinforcing system that predicts production success.

For high-stakes deployments: Prioritize vendors who demonstrate all five flags with concrete evidence. The 90-day bake-off plan provides a structured approach to validation.

For lower-risk internal tools: Focus on observability and LLMOps maturity first. These capabilities enable learning and improvement even when other flags are developing.

For regulated industries: Governance alignment and data provenance become critical. Require crosswalk mappings, technical files, and PMM rehearsals before broader rollout.

The absence of any single flag represents meaningful risk. Vendors who exhibit all five capabilities—evidenced through pilots and artifacts—will outperform peers on reliability, cost control, and compliance.

Your next step: Use the comparison table and questions in this guide to structure your vendor evaluation. Require a 90-day pilot with clear milestones before committing to broader deployment. The evidence you gather during this period will reveal more about vendor capabilities than any RFP response or demo.