# DataCLUE

**Repository Path**: pyyx/DataCLUE

## Basic Information

- **Project Name**: DataCLUE
- **Description**: Data-centric AI is a new direction of AI research. Its core question is how to systematically improve your data (whether inputs or labels) to raise the final performance, usually built on a relatively fixed dataset. In today's AI landscape, whether in natural language processing (e.g., BERT) or computer vision (e.g., ResNet), many mature and effective models already exist, and they can easily be obtained from open-source websites.
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-02-09
- **Last Updated**: 2025-02-09

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# DataCLUE: A Benchmark Suite for Data-centric NLP

You can get the Chinese version at [README.md](./README.md)

## Table of Contents

| Section | Description |
| --------------------------------------------------------- | ----------------------------------------------------------------- |
| [Introduction](#Introduction) | Background on DataCLUE |
| [Task Description](#Task_Description) | Task description |
| [Experiment Results](#Experiment_Results) | Experiment results |
| [Result Analysis](#Result_Analysis) | Analysis of human performance and baseline performance |
| [Methods in Data-Centric AI](#Methods_in_Data-Centric_AI) | Methods in data-centric AI |
| [DataCLUE Characteristics](#DataCLUE_Characteristics) | DataCLUE characteristics |
| [Baseline and How to Run](#Baseline) | The baselines that we provide |
| [DataCLUE Evaluation](#DataCLUE_Evaluation) | How we evaluate the final result |
| [Dataset Description](#Dataset_Description) | Dataset description and samples |
| [Toolkit Description](#Toolkit_Description) | The tool set paired with DataCLUE |
| [Related Works and Reference](#Reference) | Related works and references (including articles, slides, solutions) |
| [Contact Information and Citation](#Contact_Information) | How to submit and contribute to this project and competition |

DataCLUE has the following characteristics:

1. Academic-friendly: a more academically friendly public test set, `test_public.json`, has been added, so participants can run tests and experiments independently.
2. Highly challenging: both the training set (train) and the validation set (dev) contain a high proportion of incorrectly labeled data, while the labeling accuracy of the public test set is high (over 95%).
3. Resource-friendly: fast experiment cycles and low computation cost. A single experiment can be completed in about 3 minutes in a GPU environment, and training also finishes in a short time in a CPU environment.

## Introduction

Existing work in AI is mostly model-centric: it tries to improve performance by creating new models. After years of development, we have many models that perform well on different tasks, such as BERT in natural language processing and ResNet in computer vision, and these models can easily be obtained from open-source websites such as GitHub. The latest data show that more than 90% of papers are model-centric, improving results through model innovation or better learning methods, even though many of these improvements are not particularly significant. Although new models are proposed every day, many of them boost performance only marginally or are rarely used. In industry pipelines, by contrast, 80% of the time may be spent cleaning data, building high-quality datasets, or obtaining more data during the iterative process, thereby improving the model's performance.

Data-centric AI is a new trend in AI that tries to systematically transform data (whether inputs or labels) to optimize dataset quality and boost performance. Data-centric AI is about dataset optimization, rather than the model or architecture optimization of model-centric AI. DataCLUE is a data-centric AI benchmark for Chinese NLP, based on the CLUE benchmark.
DataCLUE aims to enrich and creatively develop data-centric AI in the field of NLP by incorporating the specific characteristics of text data. In addition to the original dataset, it provides extra high-value data as well as data and model analysis reports (value-added services). This makes the human-in-the-loop AI pipeline more efficient and can greatly improve the final performance.

## Task_Description

Competitors are expected to optimize the dataset to higher quality. Methods include (but are not limited to) modifying the training data and labels, re-splitting the training and validation sets, label correction, and data augmentation (crawling is not allowed). The modification can be automatic/algorithmic, and human-in-the-loop approaches are allowed, but purely manual modification is discouraged. The submission should be the modified training and validation datasets.

| Corpus | Train | Dev | Test | Label Definition | Test_public (High-Quality Data) | Data & Model Analysis Report |
| :----: | :---: | :---: | :----: | :-------------: | :----------------------------: | :--------------------------: |
| CIC | 10000 | 2000 | >=3000 | Stage 1 | Stage 2 | Stage 1 & 2 |
| TNEWS | 53360 | 10000 | >=3000 | Stage 1 | Stage 2 | |
| IFLYTEK | 10134 | 2600 | >=3000 | Stage 1 | Stage 2 | |
| AFQMC | 10000 | 2000 | >=3000 | Stage 1 | Stage 2 | |
| QBQTC | 10000 | 2000 | >=3000 | Stage 1 | Stage 2 | |
| TRICLUE | 10000 | 2000 | >=3000 | Stage 1 | Stage 2 | |

Train/Dev: noisy data. Test_public: high-quality data (labeling accuracy above 95%); it may only be used for model testing and must not be used for hyperparameter tuning or model training.
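Since algorithmic data augmentation is one of the allowed methods, here is a minimal sketch of one simple text-augmentation strategy (random character swap, since Chinese text is not space-tokenized). The `sentence`/`label` field names are hypothetical for illustration; this is not the actual DataCLUE record format or the official baseline code.

```python
import random

def random_swap(text, n_swaps=1, seed=0):
    """Return a copy of `text` with `n_swaps` random pairs of characters swapped."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(chars)), 2)
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

def augment(examples, n_copies=1):
    """Augment a list of {'sentence': ..., 'label': ...} dicts (hypothetical
    field names), keeping the original label for each perturbed copy."""
    out = list(examples)
    for k in range(n_copies):
        for ex in examples:
            out.append({"sentence": random_swap(ex["sentence"], seed=k),
                        "label": ex["label"]})
    return out
```

Perturbed copies keep the original label, so augmentation only adds input variety; any label noise in the source example is copied along with it, which is why augmentation is usually combined with label correction.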
## Experiment_Results

| | CIC (macro F1) | IFLYTEK (macro F1) | TNEWS (macro F1) | AFQMC (macro F1) |
| :----: | :----: | :----: | :----: | :----: |
| Human (in accuracy) | 87.40 | 66.00 | 71.00 | 81.00 |
| Baseline (RoBERTa_3L_wwm) | 72.78 | 30.97 | 46.83 | 59.04 |
| Data-centric | 74.62 | 40.15 | 49.34 | ? |

For the experimental results, see the [Single Strategy Results Summary](./baselines/single/README.md) and the [Multi Strategy Results Summary](./baselines/multi/README.md).

For more detailed experimental results, see the papers and introduction articles.

## Result_Analysis

We already have baseline results in the [arXiv paper](https://arxiv.org/abs/2111.08647).

## Methods_in_Data-Centric_AI

Pipeline: 1. Task Definition --> 2. Data Collection --> 3. Model Training --> 4. Model Deployment

Systematic, iteration-based dataset optimization:

1. Model training.
2. Error analysis: find out on which types of data the model performs poorly (for example: the data is too short and the semantics are not fully expressed; some concepts between categories are easy to confuse; the labels may be incorrect).
3. Improve the data: 1) more data: data augmentation, data generation, or collection of more data ---> get more input data; 2) more consistent label definitions: when some categories are easily confused, improve the label definitions ---> correct the labels of some data based on the clarified definitions.
4. Repeat steps 1-3.

## DataCLUE_Characteristics

1. The first data-centric AI evaluation in Chinese NLP.
2. A practice of Chinese NLP tasks under the data-centric AI paradigm.
3. Richer information: in addition to the regular training, validation, and test sets, it provides label definitions and high-quality, further-labeled data from the training set. Combining this additional information makes the human-in-the-loop AI pipeline more efficient and leaves enough room for performance improvement.
4. Performance analysis report: we also provide additional analysis reports during model training and prediction, making the iterative process of data-centric AI more systematic.

## Baseline

### Baseline with code

1. Clone the repository:

   ```
   git clone https://github.com/CLUEbenchmark/DataCLUE.git
   cd DataCLUE
   ```

2. `cd ./baselines/models_pytorch/classifier_pytorch`
3. `bash run_classifier_xxx.sh`
   1) Training: `bash run_classifier_cic.sh`; results will be available at `./output_dir/bert/checkpoint_eval_results.txt`
   2) Prediction: `bash run_classifier_cic.sh predict`; predictions will be available at `./output_dir/bert/test_prediction.json`
   3) Evaluation: use the `compute_f1.py` script to get the performance on `test_public.json`.

[Colab Version online environment](https://colab.research.google.com/drive/1NSoVeuiggRTfLP37Np6mFdbo8kjYWapZ?usp=sharing)

### Baseline Method

1. Train models with cross-validation and flag highly suspicious mislabeled data based on an entropy measure over the predictions.
2. Data augmentation for text.
3. Label the augmented data.

```
ENV: python 3.x
cd root: cd baselines/simple_baseline
requirement: pip install -r requirements.txt
run: python main.py
```

## DataCLUE_Evaluation

### 1. Submission

Please upload your submission to the CLUE benchmark. The zip file should follow the format: zip dataclue**\_
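The first step of the baseline method above (flagging likely mislabeled examples via an entropy measure over cross-validation predictions) can be sketched as follows. This is a minimal illustration, not the repository's actual implementation; it assumes you have already obtained an out-of-fold class-probability vector for each training example.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def rank_suspicious(probs_per_example, top_k):
    """Return the indices of the `top_k` examples whose out-of-fold
    predictions have the highest entropy, i.e. where the cross-validated
    models are most uncertain -- candidates for label inspection and
    correction."""
    scored = sorted(range(len(probs_per_example)),
                    key=lambda i: entropy(probs_per_example[i]),
                    reverse=True)
    return scored[:top_k]
```

A confident, correct label tends to produce a peaked (low-entropy) distribution, while a confusing or mislabeled example tends to produce a flat (high-entropy) one, so ranking by entropy concentrates human review effort on the examples most worth re-checking.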