https://store-images.s-microsoft.com/image/apps.10812.33c8de2e-3e97-4413-9296-4df872fa8fd8.80fe74ef-a205-48a3-9521-a7a64502e53a.079734a9-ab3c-458c-ab1d-f487b0f6112b

ReaderLM v2

Jina AI

ReaderLM v2

Jina AI

1.5B parameter language model that converts raw HTML into beautifully formatted markdown or JSON

ReaderLM v2 is a 1.5B parameter language model that converts raw HTML into beautifully formatted markdown or JSON with superior accuracy and improved longer context handling. It supports up to 512K tokens in combined input and output and offers multilingual support for 29 languages, including English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more.

Thanks to its new training paradigm and higher-quality data, ReaderLM v2 is a significant leap from its predecessor, especially in handling long-form content and markdown generation. Unlike the first generation, which treated HTML-to-markdown conversion as a “selective-copy” task, v2 handles it as a translation process, enabling complex elements like code fences, nested lists, tables, and LaTeX equations to be generated accurately.

Highlights:
  • High-Accuracy HTML-to-Markdown Conversion with Improved Stability: Transforms raw HTML into structured markdown, preserving complex elements like nested lists, tables, and LaTeX equations, while addressing degeneration issues like repetition and looping in long sequences.
  • Direct HTML-to-JSON Extraction: Allows direct conversion of HTML to JSON using customizable schemas, eliminating the need for intermediate markdown conversion.
  • Longer Context and Multilingual Support: Handles up to 512K tokens in input and output length, and supports 29 languages, making it ideal for large-scale web data processing.