CSE Course RAG
| Entity Passport | |
| Registry ID | hf-dataset--hatakekksheeshh--cse_course_rag |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__hatakekksheeshh__cse_course_rag,
author = {hatakekksheeshh},
title = {Cse Course Rag Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/hatakekksheeshh/CSE_course_RAG}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Technical Deep Dive
Nexus Index V2.0
Index Insight
FNI V2.0 for Cse Course Rag: Semantic (S:50), Authority (A:0), Popularity (P:0), Recency (R:0), Quality (Q:0).
Verification Authority
Dataset Specification
license: mit
task_categories:
- question-answering
- information-retrieval
- text-generation
language:
- en
tags:
- rag
- retrieval-augmented-generation
- education
- course-materials
- faiss
- embeddings
- cse
- computer-science
size_categories:
- 1K<n<10K
CSE Course RAG Dataset
A comprehensive dataset for Retrieval-Augmented Generation (RAG) systems containing processed Computer Science and Engineering (CSE) course materials from Ho Chi Minh City University of Technology (HCMUT). This dataset includes pre-built FAISS indices, processed course documents, raw PDFs, and converted images, ready for use in educational RAG applications.
Dataset Description
This is a complete, pipeline-ready dataset for building RAG systems on educational course materials. It includes:
- Pre-built FAISS indices for fast semantic search
- Processed course data in structured JSON format
- Raw PDF documents (original course materials)
- Converted images (OCR-ready page images)
- Metadata and embeddings for retrieval and generation tasks
The dataset is designed to support research and development in educational AI systems, particularly for question-answering and information retrieval applications.
Dataset Structure
CSE_course_RAG/
├── indices/      # Pre-built FAISS indices for semantic search
├── processed/    # Processed course data (JSON format)
├── raw/          # Raw PDF documents
├── converted/    # Converted page images (OCR-ready)
├── data/         # Additional processed data
└── scratch/      # Temporary processing files
Supported Tasks
- Question Answering: Answer questions about course content using retrieved context
- Information Retrieval: Semantic search over course materials
- Text Generation: Generate answers based on retrieved course content
Dataset Details
Dataset Size
- Total Courses: Multiple CSE courses
- Documents: Syllabus and material documents per course
- Chunks: Pre-processed text chunks with embeddings
- Indices: FAISS indices for fast retrieval
Data Processing
The dataset has been processed through the following pipeline:
- Conversion: PDFs/Office docs → page images
- OCR: PaddleOCR text extraction
- Parsing: Structured JSON extraction (syllabus and material parsers)
- Chunking: Text chunking with overlap
- Embedding: Sentence-transformer embeddings
- Indexing: FAISS index construction
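The chunking step above can be sketched as follows. This is a minimal illustration, not the dataset's actual chunker: the word-based splitting, the function name `chunk_text`, and the default `chunk_size` and `overlap` values are all assumptions, since the card does not state them.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    Hypothetical helper: the real pipeline's unit (words, tokens, characters)
    and sizes are not documented in the card.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some words twice.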
Data Fields
Processed Data (JSON):
- course: Course name
- course_id: Course code
- schema_version: Data schema version
- slides: Array of slide objects with:
  - page_index: Page number
  - chapter_num: Chapter number
  - source_file: Source file path
  - metadata: Processing metadata
  - raw_text: Extracted OCR text
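A quick way to sanity-check a processed file against the fields listed above is sketched below. It assumes every listed field is required on every record, which the card does not confirm; the helper name `missing_fields` and the example record are hypothetical.

```python
# Field names come from the card; treating them all as required is an assumption.
REQUIRED_COURSE_KEYS = {"course", "course_id", "schema_version", "slides"}
REQUIRED_SLIDE_KEYS = {"page_index", "chapter_num", "source_file", "metadata", "raw_text"}


def missing_fields(course_data: dict) -> list[str]:
    """Return a sorted list of missing field paths (empty if none are missing)."""
    problems = [key for key in REQUIRED_COURSE_KEYS if key not in course_data]
    for i, slide in enumerate(course_data.get("slides", [])):
        problems += [f"slides[{i}].{key}" for key in REQUIRED_SLIDE_KEYS if key not in slide]
    return sorted(problems)


# Hypothetical record for illustration only; values are not taken from the dataset.
example = {
    "course": "Example Course",
    "course_id": "XX0000",
    "schema_version": "1",
    "slides": [
        {
            "page_index": 1,
            "chapter_num": 1,
            "source_file": "raw/example.pdf",
            "metadata": {},
            "raw_text": "Example slide text",
        }
    ],
}
print(missing_fields(example))  # []
```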
FAISS Indices:
- Vector embeddings for semantic search
- Metadata mappings for chunk retrieval
- Course-specific indices
Usage
Download the Dataset
from huggingface_hub import snapshot_download

# Download the entire dataset
dataset_path = snapshot_download(
    repo_id="hatakekksheeshh/CSE_course_RAG",
    repo_type="dataset",
    local_dir="./data",
)
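Because the repository also ships raw PDFs and page images, it may be useful to fetch only some folders. The sketch below relies on the `allow_patterns` argument of `snapshot_download`; the helper names `build_download_kwargs` and `download_subset` are hypothetical, and the folder names are taken from the directory layout in this card.

```python
def build_download_kwargs(subdirs, local_dir="./data"):
    """Build snapshot_download arguments restricted to the given top-level folders."""
    return {
        "repo_id": "hatakekksheeshh/CSE_course_RAG",
        "repo_type": "dataset",
        "local_dir": local_dir,
        # One glob pattern per folder, e.g. "indices/*"
        "allow_patterns": [f"{name.strip('/')}/*" for name in subdirs],
    }


def download_subset(subdirs, local_dir="./data"):
    """Download only the selected folders and return the local path."""
    # Imported lazily so build_download_kwargs works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_kwargs(subdirs, local_dir))


# Example (requires network access):
# path = download_subset(["indices", "processed"])
```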
Or use the provided download script:
python dataset.py
Using with RAG Systems
The dataset is designed to work with the CSE Course RAG system:
from rag.query_pipeline import QueryPipeline

# Initialize pipeline with pre-built indices
pipeline = QueryPipeline(
    index_dir="./data/indices",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)

# Query the system
result = pipeline.answer(
    query="What is the grading policy?",
    course="Introduction_to_Computing",
)
Loading FAISS Indices
import faiss
import pickle

# Load FAISS index
index = faiss.read_index("./data/indices/course_name.index")

# Load metadata
with open("./data/indices/course_name_metadata.pkl", "rb") as f:
    metadata = pickle.load(f)
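Once the index and metadata are loaded, search results need to be joined back to their chunks. The sketch below assumes the pickled metadata can be indexed by the vector's position in the index (a list, or a dict keyed by integer id); the card does not document the pickle's layout, and `attach_metadata` is a hypothetical helper.

```python
def attach_metadata(scores, ids, metadata):
    """Pair each hit from one query row with its metadata entry.

    `scores` and `ids` are one row each of the two arrays returned by
    index.search(). Whether a score is a distance or a similarity depends
    on how the index was built, which the card does not state.
    """
    hits = []
    for score, idx in zip(scores, ids):
        idx = int(idx)
        if idx < 0:
            continue  # FAISS pads with -1 when fewer than k neighbours are found
        hits.append({"id": idx, "score": float(score), "meta": metadata[idx]})
    return hits


# With a loaded index and a (1, d) float32 query matrix, usage would look like:
# scores, ids = index.search(query_matrix, 5)
# hits = attach_metadata(scores[0], ids[0], metadata)
```

The query vector has to come from the same embedding model that produced the index, otherwise the scores are meaningless.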
Processing Raw Data
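To inspect or reprocess the data, start from the processed course JSON (replace course_name with an actual course folder):
import json

# Load processed course data
with open("./data/processed/course_name/course_name.json", "r", encoding="utf-8") as f:
    course_data = json.load(f)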
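As one example of working with the loaded structure, the sketch below groups the OCR text by chapter. It assumes the field names listed under Data Fields and that `page_index` gives a sensible reading order within a chapter; `text_by_chapter` is a hypothetical helper.

```python
from collections import defaultdict


def text_by_chapter(course_data: dict) -> dict:
    """Concatenate raw_text of all slides per chapter_num, ordered by page_index."""
    grouped = defaultdict(list)
    slides = sorted(course_data.get("slides", []), key=lambda s: s["page_index"])
    for slide in slides:
        grouped[slide["chapter_num"]].append(slide["raw_text"])
    # Join the pages of each chapter with newlines
    return {chapter: "\n".join(pages) for chapter, pages in grouped.items()}


# Usage with the course_data loaded from the processed JSON:
# chapters = text_by_chapter(course_data)
```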
Dataset Statistics
The dataset includes:
- Multiple CSE courses covering various computer science topics
- Structured syllabus data with course information, grading policies, prerequisites
- Course materials including lecture slides and chapter content
- Pre-computed embeddings using sentence-transformers models
- FAISS indices optimized for fast similarity search
Evaluation
The dataset has been evaluated with the following metrics:
- Answer Faithfulness: +21.1% improvement with query rewriting
- Top Chunk Score: +80.9% improvement in reranker confidence
- Query-Answer Similarity: Semantic alignment between queries and answers
- Retrieval Performance: Query-Chunk similarity and reranker scores
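The card does not say how the query-answer and query-chunk similarities are computed. A common choice for sentence-transformer embeddings is cosine similarity, so the sketch below assumes that measure; the function name and the pure-Python implementation are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Usage: embed the query and the answer with the same model, then compare.
# score = cosine_similarity(query_embedding, answer_embedding)
```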
Limitations
- The dataset contains course materials from HCMUT and may be specific to that institution's curriculum
- OCR quality depends on source document quality
- Some courses may have incomplete or missing materials
- The dataset is primarily in English
Citation
If you use this dataset in your research, please cite:
@dataset{cse_course_rag_2025,
title={CSE Course RAG Dataset},
author={Nguyen Quoc Hieu},
year={2025},
publisher={HuggingFace},
url={https://huggingface.co/datasets/hatakekksheeshh/CSE_course_RAG}
}
License
This dataset is released under the MIT License. See the LICENSE file for details.
Copyright: © 2025 Nguyen Quoc Hieu, Ho Chi Minh City University of Technology
Acknowledgments
- Ho Chi Minh City University of Technology (HCMUT) for providing course materials
- HuggingFace for hosting the dataset
- PaddleOCR for OCR capabilities
- sentence-transformers for embedding models
- FAISS for efficient similarity search
Note: This dataset is intended for research and educational purposes. Please respect the original course materials' copyright and use appropriately.
Dataset Transparency Report
Verified data manifest for traceability and transparency.
Identity & Source
- id: hf-dataset--hatakekksheeshh--cse_course_rag
- source: huggingface
- author: hatakekksheeshh
- tags: license:mit, region:us
Technical Specs
- architecture: null
- params billions: null
- context length: null
Engagement & Metrics
- likes: 0
- downloads: 15,580