CLanguage_Mode_for_2

Project 2025-12-06 (modified 2025-12-08)

<h1 align="center">CLanguage_Mode_for_2</h1>
<p align="center">CLRE-20</p>
<p align="center"><a href="/clre20/project/CLanguage_Mode_for_2/1">Chinese</a> | <a href="/clre20/project/CLanguage_Mode_for_2/2">English</a></p>

---

# LSTM-Based Chinese Text Generation Model

This repository demonstrates a **Chinese text generation model** implemented in **PyTorch** with an **LSTM (Long Short-Term Memory)** network. The pipeline covers **Data Collection → Cleaning → Segmentation → Vocabulary → Training → Text Generation**, which makes it a basic reference for building Chinese language models.

## Features

* **Data Cleaning:** Extracts text from HTML `<p>` tags or from a specified PDF page range.
* **Chinese Segmentation:** Uses `jieba` for tokenization, with support for custom tokens (e.g., `[END_OF_NOVEL]`).
* **Vocabulary Creation:** Maps words to numerical IDs and reserves `<unk>` and `<pad>` tokens.
* **Model Training:** Trains `SimpleLSTM` with cross-entropy loss and the Adam optimizer, and plots loss/accuracy curves.
* **Text Prediction:** Loads the trained model and generates text from a prompt.
* **Configuration:** All hyperparameters are centralized in `config.json`.
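
The vocabulary step listed under Features can be illustrated with a minimal, pure-Python sketch. It is not the code in `4-Vocabulary.py`: the function names `build_vocab`, `encode`, and `pad_to`, the `min_freq` parameter, and the choice to place `<pad>` at ID 0 and `<unk>` at ID 1 are all assumptions made for illustration. The input is assumed to be a list of tokens that has already been segmented (for example by `jieba`).

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # special tokens named in the README


def build_vocab(tokens, min_freq=1):
    """Build word<->ID mappings (hypothetical helper, not the repo's code).

    Assumption: <pad> gets ID 0 and <unk> gets ID 1; remaining words are
    ordered by descending frequency, ties broken by string order.
    """
    counts = Counter(tokens)
    itos = [PAD, UNK]
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and word not in (PAD, UNK):
            itos.append(word)
    stoi = {word: idx for idx, word in enumerate(itos)}
    return stoi, itos


def encode(tokens, stoi):
    """Convert tokens to IDs, mapping unknown words to <unk>."""
    unk_id = stoi[UNK]
    return [stoi.get(tok, unk_id) for tok in tokens]


def pad_to(ids, length, pad_id):
    """Truncate or right-pad a list of IDs to a fixed length."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))
```

With tokens such as `["我", "喜欢", "读书", "我", "[END_OF_NOVEL]"]`, the custom token `[END_OF_NOVEL]` is treated like any other word and receives its own ID, while any word not seen when the vocabulary was built falls back to `<unk>`.
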
## Project Structure

| File / Folder                | Description                                 |
| ---------------------------- | ------------------------------------------- |
| `1-DLweb.py`                 | Download a single web page                  |
| `2-Data_Cleaning.py`         | Extract `<p>` text from a single HTML file  |
| `2-pro-Data_Cleaning.py`     | Process multiple HTML files in a folder     |
| `2-PDF.py`                   | Extract text from a PDF within a page range |
| `3-Word_Segmentation.py`     | Basic Chinese word segmentation             |
| `3-pro-Word_Segmentation.py` | Enhanced version with custom dictionary     |
| `4-Vocabulary.py`            | Build vocabulary and numerical sequences    |
| `5-Model_Train.py`           | Train the LSTM and plot training curves     |
| `6-Model_Predict.py`         | Predict text using the trained model        |
| `config.json`                | Hyperparameter configuration                |
| `Data/`                      | Raw HTML, PDF, or text data                 |
| `*.txt / *.png / *.pth`      | Generated outputs                           |

## Requirements

Install dependencies:

```bash
pip install torch numpy matplotlib beautifulsoup4 pypdf requests jieba
```

## Usage

1. **Data Preparation**

   * Single HTML → `1-DLweb.py` → `2-Data_Cleaning.py`
   * Multiple HTML files → `2-pro-Data_Cleaning.py`
   * PDF → edit parameters in `2-PDF.py` and run

2. **Word Segmentation**

   * Basic: `3-Word_Segmentation.py`
   * Advanced: `3-pro-Word_Segmentation.py`

3. **Vocabulary & Conversion**

   ```bash
   python 4-Vocabulary.py
   ```

4. **Model Training**

   ```bash
   python 5-Model_Train.py
   ```

   Outputs:

   * `simple_lstm_model.pth` → model weights
   * `training_loss_plot_*.png` → training visualization

5. **Text Prediction**

   Edit `prompt` in `6-Model_Predict.py` and run:

   ```bash
   python 6-Model_Predict.py
   ```
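
The prediction step can be pictured as an autoregressive loop: encode the prompt, ask the model for scores over the vocabulary, append the chosen token, and repeat. The sketch below is a framework-free illustration under stated assumptions, not the code in `6-Model_Predict.py`. The function `generate`, its parameters, and the `next_token_scores` callable are hypothetical. The README does not say whether the script decodes greedily or by sampling; greedy decoding is assumed here, as is stopping on the ID of an end token such as `[END_OF_NOVEL]`.

```python
def generate(prompt_ids, next_token_scores, end_id, max_new_tokens=50):
    """Greedy autoregressive decoding (illustrative sketch).

    prompt_ids        : list of token IDs for the prompt
    next_token_scores : hypothetical callable; takes the IDs so far and
                        returns one score per vocabulary entry
    end_id            : ID of the end token; generation stops after it
    max_new_tokens    : upper bound on the number of generated tokens
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        # Greedy choice: take the ID with the highest score.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == end_id:
            break
    return ids
```

In a PyTorch setting, `next_token_scores` would typically wrap a forward pass of the trained model and return the logits for the last position. Replacing the `max` step with sampling from a softmax over the scores would give more varied output.
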
