Norwegian Legal Q&A Dataset
7,603 expert-curated question-answer pairs covering Norwegian legislation, case law, and regulatory compliance. Used to fine-tune dobetter-norge-v2.
- Type: Dataset
- Format: JSONL
- Size: 23 MB
- Version: v2
Dataset Overview
This dataset contains 7,603 question-answer pairs specifically designed for training language models on Norwegian legal reasoning. Each pair was derived from authoritative legal sources and validated against established legal interpretations.
Dataset Statistics
| Metric | Value |
|---|---|
| Total Q&A Pairs | 7,603 |
| Source Documents | 2,847 |
| Training Examples (augmented) | 31,842 |
| Average Question Length | 47 tokens |
| Average Answer Length | 186 tokens |
| Legal Domains Covered | 12 |
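A few derived ratios make the table easier to read. The sketch below uses only the figures from the table; the token total is a rough estimate that assumes the two averages are means over the 7,603 base pairs, not over the augmented examples.

```python
# Figures copied from the Dataset Statistics table.
qa_pairs = 7_603
source_docs = 2_847
augmented_examples = 31_842
avg_question_tokens = 47
avg_answer_tokens = 186

# How many Q&A pairs each source document yields on average.
pairs_per_doc = qa_pairs / source_docs

# How many training examples each pair becomes after augmentation.
augmentation_factor = augmented_examples / qa_pairs

# Rough token count of the base pairs (assumption: the averages are
# means over the 7,603 base pairs, not the augmented set).
approx_tokens = qa_pairs * (avg_question_tokens + avg_answer_tokens)

print(f"Q&A pairs per source document: {pairs_per_doc:.2f}")        # 2.67
print(f"Augmented examples per pair:   {augmentation_factor:.2f}")  # 4.19
print(f"Approx. tokens in base pairs:  {approx_tokens:,}")          # 1,771,499
```

In other words, each source document contributes between two and three pairs on average, and augmentation expands each pair into roughly four training examples.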
Domain Coverage
- Contract Law (Avtaleloven)
- Employment Law (Arbeidsmiljøloven)
- Company Law (Aksjeloven)
- Tax Law (Skatteloven)
- Data Protection (Personopplysningsloven / GDPR)
- Consumer Protection (Forbrukerkjøpsloven)
- Criminal Law (Straffeloven)
- Administrative Law (Forvaltningsloven)
- Environmental Law (Forurensningsloven)
- Planning and Building (Plan- og bygningsloven)
- Immigration Law (Utlendingsloven)
- Intellectual Property (Åndsverkloven)
Data Format
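The dataset is stored in JSON Lines format, with one JSON object per line. Each record has the following structure (pretty-printed here for readability):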
{
  "id": "qa-0001",
  "question": "Hva er vilkårene for...",
  "answer": "I henhold til § 36...",
  "source_doc": "LOV-2023-06-16-40",
  "domain": "contract_law",
  "difficulty": "intermediate"
}
BNL Perspective
Building this dataset was the hardest part of the entire dobetter-norge project. Scraping legal text is straightforward; turning it into high-quality Q&A pairs that actually teach a model to reason about law took months of iteration. We ended up with a semi-automated pipeline in three stages: extract key provisions, generate candidate questions, then have domain experts validate and refine each pair. The 7,603 pairs represent roughly 2,400 hours of combined expert review time.
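A cheap structural check can run before any expert time is spent on a record. The following is a minimal sketch, not the pipeline's actual code: it reads JSON Lines input and verifies that every record carries the six fields shown under Data Format. The function names `parse_records` and `load_dataset` are hypothetical, and the check covers field presence only, not the legal content.

```python
from __future__ import annotations

import json
from typing import Iterable, Iterator

# Field names taken from the example record under Data Format.
REQUIRED_FIELDS = ("id", "question", "answer", "source_doc", "domain", "difficulty")


def parse_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, checking required fields."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        yield record


def load_dataset(path: str) -> list[dict]:
    """Read a whole JSONL file; the path is supplied by the caller."""
    with open(path, encoding="utf-8") as fh:
        return list(parse_records(fh))


# Usage with the example record from Data Format, written on a single line
# as JSON Lines requires.
sample = [
    '{"id": "qa-0001", "question": "Hva er vilkårene for...", '
    '"answer": "I henhold til § 36...", "source_doc": "LOV-2023-06-16-40", '
    '"domain": "contract_law", "difficulty": "intermediate"}',
]
records = list(parse_records(sample))
print(records[0]["id"], records[0]["domain"])
```

Reading with `encoding="utf-8"` matters here because the questions and answers contain Norwegian characters such as å and ø as well as the section sign §.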