Cse Course Rag

Name: Cse Course Rag
Creator: hatakekksheeshh
License: MIT

Dataset Specification

license: mit
task_categories:
  - question-answering
  - information-retrieval
  - text-generation
language:
  - en
tags:
  - rag
  - retrieval-augmented-generation
  - education
  - course-materials
  - faiss
  - embeddings
  - cse
  - computer-science
size_categories:
  - 1K<n<10K

CSE Course RAG Dataset

A comprehensive dataset for Retrieval-Augmented Generation (RAG) systems containing processed Computer Science and Engineering (CSE) course materials from Ho Chi Minh City University of Technology (HCMUT). This dataset includes pre-built FAISS indices, processed course documents, raw PDFs, and converted images, ready for use in educational RAG applications.

Dataset Description

This dataset is a pipeline-ready resource for building RAG systems on educational course materials. It includes:

Pre-built FAISS indices for fast semantic search
Processed course data in structured JSON format
Raw PDF documents (original course materials)
Converted images (OCR-ready page images)
Metadata and embeddings for retrieval and generation tasks

The dataset is designed to support research and development in educational AI systems, particularly for question-answering and information retrieval applications.

Dataset Structure

CSE_course_RAG/
├── indices/          # Pre-built FAISS indices for semantic search
├── processed/        # Processed course data (JSON format)
├── raw/             # Raw PDF documents
├── converted/       # Converted page images (OCR-ready)
├── data/            # Additional processed data
└── scratch/         # Temporary processing files
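
After downloading, it can help to confirm that the expected top-level folders are present. The following is a minimal sketch: the folder names come from the tree above, while the helper name `missing_subdirs` is hypothetical and not part of the dataset's tooling.

```python
from pathlib import Path

# Top-level folders listed in the dataset structure above.
EXPECTED_SUBDIRS = ("indices", "processed", "raw", "converted", "data", "scratch")


def missing_subdirs(root):
    """Return the expected folder names that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

Calling `missing_subdirs("./data")` on a complete download should return an empty list.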

Supported Tasks

Question Answering: Answer questions about course content using retrieved context
Information Retrieval: Semantic search over course materials
Text Generation: Generate answers based on retrieved course content

Dataset Details

Dataset Size

Total Courses: Multiple CSE courses
Documents: Syllabus and material documents per course
Chunks: Pre-processed text chunks with embeddings
Indices: FAISS indices for fast retrieval

Data Processing

The dataset has been processed through the following pipeline:

Conversion: PDFs/Office docs → page images
OCR: PaddleOCR text extraction
Parsing: Structured JSON extraction (syllabus and material parsers)
Chunking: Text chunking with overlap
Embedding: Sentence-transformer embeddings
Indexing: FAISS index construction
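
The chunking step in the pipeline above can be illustrated with a small sketch. The dataset card does not state the chunk size, the overlap, or whether chunks are measured in characters or tokens, so the character-based windows and default values below are assumptions for illustration only, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split `text` into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.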

Data Fields

Processed Data (JSON):

course: Course name
course_id: Course code
schema_version: Data schema version
slides: Array of slide objects with:
- page_index: Page number
- chapter_num: Chapter number
- source_file: Source file path
- metadata: Processing metadata
- raw_text: Extracted OCR text
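
Given the fields above, the OCR text of a course can be collected page by page. This is a sketch that assumes each slide object is a dictionary carrying the listed keys; the function name `iter_slide_text` and the sample record are hypothetical and do not come from the dataset.

```python
def iter_slide_text(course_data):
    """Yield (page_index, raw_text) for every slide with non-empty OCR text."""
    for slide in course_data.get("slides", []):
        text = (slide.get("raw_text") or "").strip()
        if text:
            yield slide.get("page_index"), text


# Hypothetical record shaped like the fields described above.
sample = {
    "course": "Example Course",
    "course_id": "EX0000",
    "schema_version": "1",
    "slides": [
        {"page_index": 1, "chapter_num": 1, "source_file": "example.pdf",
         "metadata": {}, "raw_text": "Course overview"},
        {"page_index": 2, "chapter_num": 1, "source_file": "example.pdf",
         "metadata": {}, "raw_text": "   "},
    ],
}

pages = list(iter_slide_text(sample))
```

Slides whose OCR text is empty or whitespace-only are skipped, which is useful because OCR quality depends on the source documents.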

FAISS Indices:

Vector embeddings for semantic search
Metadata mappings for chunk retrieval
Course-specific indices

Usage

Download the Dataset

from huggingface_hub import snapshot_download

# Download the entire dataset
dataset_path = snapshot_download(
    repo_id="hatakekksheeshh/CSE_course_RAG",
    repo_type="dataset",
    local_dir="./data"
)

Or use the provided download script:

python dataset.py

Using with RAG Systems

The dataset is designed to work with the CSE Course RAG system:

from rag.query_pipeline import QueryPipeline

# Initialize pipeline with pre-built indices
pipeline = QueryPipeline(
    index_dir="./data/indices",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2"
)

# Query the system
result = pipeline.answer(
    query="What is the grading policy?",
    course="Introduction_to_Computing"
)

Loading FAISS Indices

import faiss
import pickle

# Load FAISS index
index = faiss.read_index("./data/indices/course_name.index")

# Load metadata
with open("./data/indices/course_name_metadata.pkl", "rb") as f:
    metadata = pickle.load(f)
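
Once the index and metadata are loaded, a query vector can be matched against them. The sketch below assumes the pickled metadata is a list in which position i describes the vector with id i, which the dataset card does not confirm; `top_k_chunks` is a hypothetical helper. It relies only on the `search(x, k)` method of a FAISS index, which returns a distance array and an id array with one row per query and uses -1 for slots it could not fill.

```python
def top_k_chunks(index, metadata, query_vectors, k=5):
    """Return [(score, metadata_entry), ...] for the first query vector.

    With a real FAISS index, `query_vectors` must be a 2-D float32 array
    of shape (1, dim) matching the index dimension.
    """
    distances, ids = index.search(query_vectors, k)
    results = []
    for score, idx in zip(distances[0], ids[0]):
        idx = int(idx)
        if idx < 0:
            continue  # FAISS pads with -1 when fewer than k vectors exist
        results.append((float(score), metadata[idx]))
    return results
```

The query vector should come from the same embedding model that was used to build the index; otherwise the scores are not meaningful.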

Processing Raw Data

If you need to inspect or reprocess the data, start by loading a course's processed JSON file:

# Load processed course data
import json

with open("./data/processed/course_name/course_name.json", "r") as f:
    course_data = json.load(f)

Dataset Statistics

The dataset includes:

Multiple CSE courses covering various computer science topics
Structured syllabus data with course information, grading policies, prerequisites
Course materials including lecture slides and chapter content
Pre-computed embeddings using sentence-transformers models
FAISS indices optimized for fast similarity search

Evaluation

The dataset has been evaluated with the following metrics:

Answer Faithfulness: +21.1% improvement with query rewriting
Top Chunk Score: +80.9% improvement in reranker confidence
Query-Answer Similarity: Semantic alignment between queries and answers
Retrieval Performance: Query-Chunk similarity and reranker scores

Limitations

The dataset contains course materials from HCMUT and may be specific to that institution's curriculum
OCR quality depends on source document quality
Some courses may have incomplete or missing materials
The dataset is primarily in English

Citation

If you use this dataset in your research, please cite:

@dataset{cse_course_rag_2025,
  title={CSE Course RAG Dataset},
  author={Nguyen Quoc Hieu},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/hatakekksheeshh/CSE_course_RAG}
}

License

This dataset is released under the MIT License. See the LICENSE file for details.

Acknowledgments

Ho Chi Minh City University of Technology (HCMUT) for providing course materials
HuggingFace for hosting the dataset
PaddleOCR for OCR capabilities
sentence-transformers for embedding models
FAISS for efficient similarity search

Note: This dataset is intended for research and educational purposes. Please respect the original course materials' copyright and use appropriately.
