# DISTINT_open_data

**Repository Path**: wingter/DISTINT_open_data

## Basic Information

- **Project Name**: DISTINT_open_data
- **Description**: The collection of open-source datasets created by the DISTINT group, JNU
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-29
- **Last Updated**: 2026-04-29

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# DISTINT Open Data Repository (DOR)

The collection of open-source datasets created by the DISTINT group, JNU.

## Dataset I: Human vs. AI-translated Multilingual Corpora (HAM)

**Description**: High-quality multilingual parallel corpora (mostly public) paired with LLM-generated two-way translations. This dataset can be used for purposes such as machine translation detection and translation fidelity evaluation.

**Portal**: https://github.com/wingter562/DISTINT_open_datasets-HAM

## Dataset II: Homologous Question-Answering Dataset (Homologous-QA)

A derived question-answering dataset. Each question has multiple homologous questions that share the same entity but differ in logic. This dataset can be used to evaluate language models, retrieval algorithms, and RAG systems in terms of output consistency and knowledge multiplexing.

**Portal**: https://github.com/wingter562/homologous-QA-dataset

## Dataset III: AI Services Container Runtime Profiling Dataset (AC-Prof)

Reproducible measurements of the operational characteristics of inference-serving containers, including cold starts and runtime behavior, under various resource specifications and input scales. It provides a dataset of **latency** (and, partially, power) measurements for popular AI service containers (with deep models at the core) and **scripts** for systematically profiling containerized ML workloads.

**Portal**: https://github.com/wingter562/AI-container-runtime-profiles-dataset

## Dataset IV: Imprecise Label Learning datasets of Medical Imaging (ILLMed)

This repo contains medical imaging datasets annotated with imprecise labels by multiple experts, suitable for benchmarking and research in partial label learning, multi-label learning, semi-supervised learning, etc.

**Portal**: https://github.com/wingter562/imprecise-label-learning-datasets

## Dataset V: Edge-based Federated Learning Simulator (EdgeFLSim)

A simulator of an edge-based Federated Learning system with a simple UI and built-in edge-device performance models.

**Portal**: https://github.com/wingter562/EdgeFLSim

## Dataset VI: Knowledge Graph extracted from short sentences in medical domain (ShortMedKG)

A text-to-graph pipeline for automatic Knowledge Graph (KG) construction from Chinese medical corpora, along with an out-of-the-box KG built for daily medical QA.

**Portal**: https://github.com/wingter562/ShortMedKG