Falcon3-1B-Base: Evaluation of Blind Spots and Failure Modes

This repository contains an evaluation of the Falcon3-1B-Base model, a 1-billion-parameter foundation model released by the Technology Innovation Institute (TII) in December 2024.

As a base model, Falcon3-1B-Base has been pre-trained but has not been instruction-tuned (SFT) or aligned (RLHF). This evaluation aims to identify its "blind spots": areas where the model's outputs are incorrect or inconsistent because it has not been tuned to follow instructions.

Model Details

Developer: Technology Innovation Institute (TII)
Parameters: 1.0 Billion
Architecture: Transformer-based causal decoder-only
Release Date: December 2024
Modality: Text Generation

Evaluation Methodology

The evaluation was conducted using a custom Python script (evaluate_model.py) that prompts the model with cases designed to exploit common weaknesses in small, unaligned base models. The key areas tested include:

Instruction Following: Testing whether the model executes a command or simply "completes" the prompt text.
Logical Transitivity: Testing multi-step reasoning over chained relations.
Arithmetic Edge Cases: Testing computational accuracy on expressions that require operator precedence or non-trivial multiplication.
Spatial Reasoning: Testing whether the model tracks hierarchies of object placement.
Niche Factuality: Testing for hallucinations in fictional or obscure scenarios.
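The script itself is not reproduced in this README, so the following is only a minimal sketch of how category-tagged cases could be organized and run. The names `EvalCase`, `CASES`, and `run_cases` are hypothetical, the mapping of each prompt to a category is an assumption, and `generate` stands in for whatever function wraps the model's text-generation call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    category: str  # one of the areas listed above (assignment is an assumption)
    prompt: str    # text passed to the model verbatim


# Illustrative subset; the prompts come from this evaluation's dataset.
CASES: List[EvalCase] = [
    EvalCase("Instruction Following", "Instruction: Translate 'Hello' to French. Response:"),
    EvalCase("Logical Transitivity", "A is B's father. B is C's father. What is A to C?"),
    EvalCase("Arithmetic Edge Cases", "2 + 2 * 2 ="),
]


def run_cases(cases: List[EvalCase], generate: Callable[[str], str]) -> List[Dict[str, str]]:
    """Return one record per case containing the raw model continuation."""
    return [
        {"category": c.category, "prompt": c.prompt, "output": generate(c.prompt)}
        for c in cases
    ]
```

Passing the generation function as an argument keeps the case list independent of any particular model, so the same cases could later be rerun against a fine-tuned checkpoint.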

Identified Blind Spots

The following blind spots were identified during experimentation:

1. The Completion Loop

The most prominent failure mode is the "Completion Loop," where the model treats an instruction as the first line of a template and proceeds to generate more instructions instead of providing an answer.
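One rough way to flag this behaviour automatically is to check whether the generated continuation (with the original prompt stripped) introduces a new template line. The sketch below is a heuristic, not the detection method used in this evaluation; the function name `is_completion_loop` and the marker words are assumptions that would need adjusting to the prompt format in use.

```python
import re

# Assumed template markers; extend or change these to match the prompt format.
_MARKER = re.compile(r"^\s*(Instruction|Question|Task)\s*:", re.IGNORECASE | re.MULTILINE)


def is_completion_loop(continuation: str) -> bool:
    """Return True if the continuation starts a new instruction-style line."""
    return bool(_MARKER.search(continuation))
```

For example, `is_completion_loop("Instruction: Translate 'Goodbye' to French.")` is flagged, while a direct answer such as `" Bonjour"` is not.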

2. Multi-Hop Reasoning Failure

The model struggles to maintain consistency across transitive relationships (e.g., if A > B and B > C, then A > C).
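Transitive prompts with known answers can be generated programmatically, which makes this failure mode easier to measure. The following is a minimal sketch; `transitive_cases` and the prompt wording are hypothetical and are not taken from the evaluation script.

```python
from itertools import combinations
from typing import List, Tuple


def transitive_cases(chain: List[str]) -> List[Tuple[str, str]]:
    """Build (prompt, expected_answer) pairs from a strictly descending chain.

    chain = ["A", "B", "C"] encodes the premises A > B and B > C.
    """
    premises = " ".join(f"{a} is greater than {b}." for a, b in zip(chain, chain[1:]))
    cases = []
    for i, j in combinations(range(len(chain)), 2):
        if j - i < 2:
            continue  # skip pairs that are stated directly in the premises
        x, y = chain[i], chain[j]
        cases.append((f"{premises} Is {x} greater than {y}? Answer:", "Yes"))
        cases.append((f"{premises} Is {y} greater than {x}? Answer:", "No"))
    return cases
```

Only pairs at least two steps apart are included, so every generated question requires combining two or more premises.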

3. Spatial Hierarchy Confusion

The model often loses track of orientation in nested spatial descriptions (e.g., "A is on B, B is on C").
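A nested placement prompt of this kind can be built from an ordered list of objects, which also gives the expected answer for free. This is an illustrative sketch only; `stack_case`, the question wording, and the example objects are assumptions.

```python
from typing import Dict, List


def stack_case(objects: List[str]) -> Dict[str, str]:
    """Build a stacking prompt; objects are ordered top to bottom."""
    premises = " ".join(f"The {a} is on the {b}." for a, b in zip(objects, objects[1:]))
    return {
        "prompt": f"{premises} Which object is at the bottom? Answer:",
        "expected": objects[-1],   # the last object supports all the others
        "distractor": objects[0],  # the top object, a likely wrong answer
    }


case = stack_case(["book", "box", "table"])
print(case["prompt"])
```

Recording the top object as a distractor makes it possible to tell an inverted answer apart from an unrelated one.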

Evaluation Dataset (10 Failure Points)

No.	Input Prompt	Model Prediction (Failure)	Reason for Failure
1	`Instruction: Translate 'Hello' to French. Response:`	`Instruction: Translate 'Goodbye' to French.`	Completion Loop (Instruction Ignoring)
2	`2 + 2 * 2 =`	`8`	Order of Operations (Greedy Computation)
3	`The capital of the fictional planet Xylophon is`	`Paris`	Hallucination (Association with 'Capital')
4	`Who won the Super Bowl in 2027?`	`The Kansas City Chiefs won...`	Temporal Hallucination (Future Events)
5	`A is B's father. B is C's father. What is A to C?`	`C is the son of B.`	Kinship Relation Inversion
6	`Write code to sum a list in Python.`	`1. Python is a language. 2. Lists are vectors.`	Definitional Drift (Explaining instead of coding)
7	`Is 9.11 greater than 9.9?`	`Yes, 9.11 contains the number 11.`	Tokenization Bias (Number Comparison)
8	`Continue the pattern: 2, 4, 8, 16,`	`18, 20, 22`	Linear Assumption (Arithmetic vs Geometric)
9	`What happens if you drop a glass on a pillow?`	`The glass shatters into pieces.`	Lack of Physical Intuition
10	`Give me a recipe for invisible soup.`	`Ingredients: 2 cups of water, 1 onion...`	Literalism (Ignoring semantic impossibility)
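Three of these rows (2, 7, and 8) have answers that can be verified mechanically. The short check below computes the reference values against which the model outputs in the table can be compared; the variable names are illustrative.

```python
# Row 2: multiplication binds tighter than addition, so 2 + (2 * 2).
row2 = 2 + 2 * 2

# Row 7: compare the values as decimals, not as separate digit groups.
row7 = 9.11 > 9.9

# Row 8: the sequence is geometric (each term doubles), so keep multiplying.
seq = [2, 4, 8, 16]
ratio = seq[1] // seq[0]
assert all(b == a * ratio for a, b in zip(seq, seq[1:]))
row8 = [seq[-1] * ratio ** k for k in range(1, 4)]

print(row2, row7, row8)  # 6 False [32, 64, 128]
```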

Strategy for Fine-Tuning

To mitigate these blind spots, we propose the following fine-tuning strategy:

Phase 1: Supervised Fine-Tuning (SFT): Use ~10,000 instruction-response pairs that target the identified failure modes.
Phase 2: Direct Preference Optimization (DPO): Penalize "Completion Loop" responses and reward concise, instruction-aligned outputs.

Recommended Dataset Size: ~50,000 high-quality samples.
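For the DPO phase, each training example pairs one prompt with a preferred and a dispreferred response. The sketch below uses the `prompt` / `chosen` / `rejected` field layout that is common for preference datasets; the helper name `make_preference_record` is hypothetical, and the chosen response "Bonjour" is an illustrative assumption rather than an output recorded in this evaluation. The rejected response is the Completion Loop output from row 1 of the dataset.

```python
import json
from typing import Dict


def make_preference_record(prompt: str, chosen: str, rejected: str) -> Dict[str, str]:
    """Build one preference example; the two responses must differ."""
    if chosen.strip() == rejected.strip():
        raise ValueError("chosen and rejected must differ")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


record = make_preference_record(
    prompt="Instruction: Translate 'Hello' to French. Response:",
    chosen=" Bonjour",  # assumed target answer, for illustration only
    rejected=" Instruction: Translate 'Goodbye' to French.",  # Completion Loop
)
print(json.dumps(record, ensure_ascii=False))
```

Writing one such JSON object per line gives a file that can be inspected by hand before any training run.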

How to Run the Evaluation

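The following snippet loads the model and completes a single instruction-style prompt. To reproduce individual rows of the dataset, substitute the corresponding input prompt: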

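import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the base (not instruction-tuned) checkpoint in bfloat16.
# device_map="auto" requires the accelerate package.
model_id = "tiiuae/Falcon3-1B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Wrap the model and tokenizer in a text-generation pipeline.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "Instruction: Tell me a story about a dragon. Story:"
# By default, generated_text contains the prompt followed by the new tokens.
print(pipe(prompt, max_new_tokens=50)[0]['generated_text'])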
