Thursday, August 22, 2024

Install MinerU Locally to Create LLM Dataset from PDF Files

This video shows how to install MinerU, an LLM-powered tool that converts PDFs into machine-readable formats (e.g., Markdown, JSON) that can then be extracted into any format to create datasets.


Code:

git clone https://github.com/opendatalab/MinerU.git && cd MinerU

conda create -n MinerU python=3.10 && conda activate MinerU

pip install "magic-pdf[full]==0.7.0b1" --extra-index-url https://wheels.myhloli.com

magic-pdf --version

git lfs install

mkdir model
cd model
git clone https://huggingface.co/wanderkid/PDF-Extract-Kit

Edit magic-pdf.json: set models-dir to the path of the downloaded models and switch the device setting to cuda.
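
The config change can also be scripted. The following is a minimal sketch, not part of MinerU: set_magic_pdf_config is a hypothetical helper, and it assumes the config file stores both settings as string values under the keys "models-dir" and "device-mode". Check your own magic-pdf.json before relying on it.

```shell
# Hypothetical helper (not shipped with MinerU): point an existing
# magic-pdf.json at the downloaded models and switch it to CUDA.
# Assumes the keys are "models-dir" and "device-mode" with string values.
# The models path must not contain '|' or '&', which are special to sed here.
set_magic_pdf_config() {
  config="$1"        # path to magic-pdf.json
  models_dir="$2"    # absolute path to the downloaded models

  # Keep a backup, then write the updated copy over the original.
  cp "$config" "$config.bak" || return 1
  sed \
    -e "s|\"models-dir\": *\"[^\"]*\"|\"models-dir\": \"$models_dir\"|" \
    -e "s|\"device-mode\": *\"[^\"]*\"|\"device-mode\": \"cuda\"|" \
    "$config.bak" > "$config"
}
```

For example, run from inside the model directory: set_magic_pdf_config ~/magic-pdf.json "$PWD/PDF-Extract-Kit/models". Both the config location and the models subfolder are assumptions here, so adjust them to match your setup. The original file is kept as magic-pdf.json.bak.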

wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf

magic-pdf -p small_ocr.pdf
