Reading the AI evaluation report — what the scores actually mean
Six dimensions, weighted by role, summed into one recommendation. Here is how to read each score, when to trust it, and when to override it with your own judgement.
Key takeaways
- The six dimensions are calibrated within a role, not across roles. An 80 for one engineering candidate is comparable to an 80 for another engineering candidate; an 80 for an engineer and an 80 for a salesperson are not.
- The overall recommendation is a weighted sum of dimensions, with weights chosen per role archetype.
- The Strengths/Concerns text is more actionable than the scores. Use it for hiring debriefs.
- Override the AI when your context (team, history, the candidate's portfolio) outweighs the interview-only signal it sees.
Before you start
The AI evaluation is the headline output of every completed RecruitMe interview. It consists of six dimension scores, an overall recommendation, and a written summary of strengths and concerns. This article explains what each dimension actually measures, how the weighted recommendation comes together, and the right way to use the report.

The overall recommendation
At the top of the evaluation panel, you see one of five labels:
- Strongly Recommend — the candidate excelled across dimensions and the weighted overall score is high (typically 85+).
- Recommend — solid performance with no major red flags (typically 70–85).
- Hold — mixed signal; worth a second look but not a clear yes (typically 55–70).
- Not Recommended — clear weaknesses that aren't easily ignored (typically 40–55).
- Strongly Not Recommended — significant gaps; almost no chance of being a fit (typically <40).
The recommendation is computed from the six dimension scores using role-aware weights. It is a summary — always look at the underlying dimensions before treating it as final.
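As a rough illustration of how an overall score maps to a label, the sketch below uses the "typical" ranges listed above. The function name is hypothetical, and treating each boundary value (85, 70, 55, 40) as belonging to the higher band is an assumption, since the ranges are only described as typical.

```python
# Illustrative sketch only: cut-offs are the "typical" ranges from this article.
# Assumption: a score exactly on a boundary falls into the higher band.
def recommendation_label(overall_score: float) -> str:
    if overall_score >= 85:
        return "Strongly Recommend"
    if overall_score >= 70:
        return "Recommend"
    if overall_score >= 55:
        return "Hold"
    if overall_score >= 40:
        return "Not Recommended"
    return "Strongly Not Recommended"


label = recommendation_label(73.85)
```

A score close to a cut-off is exactly the case where the underlying dimensions deserve a closer look before the label is treated as final.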
The six dimensions
Technical Competence
What it measures: depth on role-relevant skills. For an engineer, this covers questions such as: did they correctly explain how X works, did they reason about trade-offs, and did they show familiarity with relevant tools? For a salesperson, it covers product knowledge and methodology.
When to trust it: when the candidate had a real chance to demonstrate technical skills in the questions. If the interview didn't include technical questions, this dimension is less informative.
When to question it: when the role has unusual tech stacks the AI may not know well. A staff engineer using a niche framework may score lower than they deserve simply because the AI lacks domain knowledge.
Communication
What it measures: clarity, structure, and depth of explanation. Did the candidate organise their thoughts? Did they explain technical concepts so a non-expert could follow? Did they answer the question that was asked?
When to trust it: for any role where communication matters — i.e. almost all of them.
When to question it: strong candidates who speak English as a second language sometimes score lower despite being technically excellent. The AI tries to control for accent and grammar but isn't perfect. Listen to the recording before downgrading a candidate with high Technical Competence and low Communication.
Experience Relevance
What it measures: alignment between the candidate's described background and the JD requirements. If the JD asks for 5 years of B2B SaaS and the candidate spoke mostly about consumer apps, this score reflects that gap.
When to trust it: when the candidate accurately described their actual experience.
When to question it: when you have additional context (their resume shows experience they didn't bring up; you spoke to a reference who confirmed something the candidate downplayed).
Problem Solving
What it measures: reasoning quality on open-ended or hypothetical questions. Did they ask clarifying questions before answering? Did they consider multiple approaches? Did they reason about edge cases?
When to trust it: this is the best signal for senior roles. Junior candidates often have a smaller toolkit but can still show strong reasoning; senior candidates who score poorly on problem solving are unlikely to grow into the role.
Cultural Fit
What it measures: how the candidate's described work style, communication preferences, and values align with the company values you provided when setting up the job.
When to trust it: when you have set up your culture values intentionally. The default values (collaborative, growth-minded, customer-focused) are generic — your actual culture is probably more specific.
TIP
Customise the culture values in the job setup. Be specific. *"Comfortable disagreeing in writing with senior leadership"* gives the AI a much sharper target than *"collaborative."*
Interview Performance
What it measures: pace, structure, and depth of answers overall. It is a meta score that picks up things like: did they ramble, were their answers proportionate to the question difficulty, and did they engage genuinely or deflect?
When to trust it: as a sanity-check on the other dimensions. A candidate with high content scores but low Interview Performance probably had strong substance but presented it poorly — that may be coachable.
How weights work
Each dimension has a weight (shown as a percentage on the score card). Weights are tied to the role archetype the job was set up for. Typical examples:
- Engineering: Technical Competence 30%, Problem Solving 25%, Communication 20%, Experience Relevance 15%, Cultural Fit 10%.
- Sales: Communication 30%, Experience Relevance 25%, Cultural Fit 20%, Problem Solving 15%, Technical Competence 10%.
- Operations: balanced across dimensions, typically 20% each for the top four.
You can override weights per job in the job's Evaluation settings. The overall recommendation re-computes immediately.
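To make the weighted sum concrete, here is a minimal sketch using the Engineering example weights above. The candidate scores are invented for illustration, and the function name is hypothetical. The sketch also assumes that any dimension without a listed weight (Interview Performance here) contributes nothing to the total, which the article does not confirm.

```python
# Engineering example weights from this article, as whole percentages.
ENGINEERING_WEIGHTS = {
    "Technical Competence": 30,
    "Problem Solving": 25,
    "Communication": 20,
    "Experience Relevance": 15,
    "Cultural Fit": 10,
}


def overall_score(scores: dict, weights: dict) -> float:
    """Weighted sum of dimension scores; weights are percentages totalling 100."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must total 100%")
    # Assumption: dimensions absent from `weights` are ignored.
    return sum(scores[dim] * pct for dim, pct in weights.items()) / 100


# Made-up scores for one candidate (0-100 per dimension).
example_scores = {
    "Technical Competence": 82,
    "Problem Solving": 76,
    "Communication": 68,
    "Experience Relevance": 71,
    "Cultural Fit": 60,
    "Interview Performance": 74,
}

result = overall_score(example_scores, ENGINEERING_WEIGHTS)  # 73.85
```

With these numbers the total is 24.6 + 19.0 + 13.6 + 10.65 + 6.0 = 73.85, which sits in the typical Recommend range. Shifting weight from Technical Competence to Communication would lower this candidate's total, which is why the same scores can lead to different recommendations under different role archetypes.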
How to use Strengths & Areas of Concern
The bulleted Strengths and Areas of concern lists are usually the most directly actionable output of the evaluation. They reference specific moments from the conversation. Use them to:
- Build follow-up questions for the human round. "You mentioned reducing logistics costs by 73% — walk me through how you measured that."
- Decide whether a Hold should be a Schedule next round or a Reject. Often the deciding factor is whether the concerns are coachable (skill gap) or foundational (judgement, values).
- Spot AI errors. If the Strengths list contains something you know to be false, treat that as a sign to trust both the transcript and the score less.
When to override the AI
The AI sees only this interview. You see more. Override when:
- You have context the AI doesn't — a strong reference, a relevant portfolio, prior interactions with the candidate.
- The interview undersold them — short answers, technical issues, an off day.
- The role has a special factor — you need a contrarian, or someone with a non-traditional background, or someone whose strength is in domains the interview didn't cover.
Trust the AI when it directly contradicts your gut: if the recommendation is Strongly Not Recommended and you wanted to advance the candidate because they had a great LinkedIn profile, listen carefully. The interview is a high-signal input on actual capability in a way that a LinkedIn profile often is not.
IMPORTANT
Track your overrides over time. If you override 50% of *Recommend* candidates to *Reject*, your weights are probably miscalibrated. The same applies in reverse if you override 50% of *Not Recommended* candidates to *Hire*. Either talk to your account manager about recalibration, or fix the weights yourself in job settings.
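If you keep your own log of AI labels and final decisions, the override rate per label can be computed with a few lines. This is a sketch under stated assumptions: the log format, the function names, and the "Hire" / "Reject" decision values are hypothetical and not a RecruitMe export format.

```python
from collections import Counter

POSITIVE_LABELS = {"Strongly Recommend", "Recommend"}
NEGATIVE_LABELS = {"Strongly Not Recommended", "Not Recommended"}


def is_override(ai_label: str, human_decision: str) -> bool:
    """True when the human decision goes against the direction of the AI label."""
    if ai_label in POSITIVE_LABELS:
        return human_decision == "Reject"
    if ai_label in NEGATIVE_LABELS:
        return human_decision == "Hire"
    return False  # assumption: a "Hold" has no direction to override


def override_rates(decisions) -> dict:
    """decisions: iterable of (ai_label, human_decision) pairs."""
    totals, overridden = Counter(), Counter()
    for ai_label, human_decision in decisions:
        totals[ai_label] += 1
        if is_override(ai_label, human_decision):
            overridden[ai_label] += 1
    return {label: overridden[label] / totals[label] for label in totals}


# Hypothetical log of past decisions.
log = [
    ("Recommend", "Reject"),
    ("Recommend", "Hire"),
    ("Recommend", "Hire"),
    ("Recommend", "Hire"),
    ("Not Recommended", "Hire"),
    ("Not Recommended", "Reject"),
    ("Hold", "Hire"),
]

rates = override_rates(log)
# Labels at or above the 50% level mentioned above.
flagged = [label for label, rate in rates.items() if rate >= 0.5]
```

In this made-up log, *Not Recommended* is overridden half the time and would be flagged, while *Recommend* is overridden a quarter of the time. With only a handful of decisions per label the rates are noisy, so it is worth waiting for a reasonable sample before changing weights.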
Next: how to read the transcript.
Frequently asked questions
Why does the AI sometimes give a low Cultural Fit score to a clearly good candidate?
Cultural Fit is scored against the company values you described when setting up the job — or default values if you didn't customise them. If your defaults don't match your actual culture, the score will feel off. Customise the values in the job setup to get more aligned scoring.
How do the dimension weights work?
Each role archetype (Engineering, Sales, Operations, Support, etc.) has a default weighting profile. For example, Engineering weights Technical Competence at 30% and Communication at 20%, while Sales weights Communication at 30% and Technical Competence at 10%. You can customise weights per job in the job settings. The weights show on each dimension card on the report.
Can the AI be biased?
As with any ML system, there is a risk of bias. We work to mitigate it, for example by training on diverse interview data and by not using signals like name or accent for scoring. The strongest protection is your own override discipline: if a score seems out of line with the transcript content, trust the transcript. We also surface a fairness audit per job in [Analytics](/knowledge-hub/analytics).
Why are some dimensions missing for some interviews?
If the interview was too short to gather enough signal on a dimension, the AI shows a dash instead of forcing a number. For example, a 15-minute interview rarely has enough material to score *Cultural Fit* reliably.