Four Methods to Extract Specific Information from WordPress HTML Content


Extracting specific information from HTML content is a common task in web development. Platforms like WordPress deliver post content as HTML strings, and an application often needs only part of that markup. This article covers four ways to pull the desired information out of HTML-formatted content returned by the WordPress API using JavaScript: DOMParser, regular expressions, Cheerio, and DOM access inside React.
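Before parsing, it helps to see where the HTML string comes from. The following is a minimal sketch, assuming a site that exposes the standard WordPress REST API posts route; the function names `fetchRenderedPosts` and `toHtmlRecord`, the sample object, and any base URL you pass in are illustrative assumptions, not part of WordPress itself.

```javascript
// Sketch: the REST API returns each post's HTML under content.rendered.
// fetchRenderedPosts and toHtmlRecord are hypothetical helper names.
async function fetchRenderedPosts(baseUrl, perPage = 5) {
  // _fields limits the response to the properties we actually use
  const url = `${baseUrl}/wp-json/wp/v2/posts?per_page=${perPage}&_fields=id,title,content`;
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const posts = await response.json();
  return posts.map(toHtmlRecord);
}

// Reduce one post object to the fields the parsing functions need.
function toHtmlRecord(post) {
  return {
    id: post.id,
    title: post.title?.rendered ?? '',
    html: post.content?.rendered ?? '',
  };
}

// A hand-written object shaped like one item of the response, for offline use.
const samplePost = {
  id: 42,
  title: { rendered: 'Hello world' },
  content: { rendered: '<p>This is the first sentence.</p>', protected: false },
};

console.log(toHtmlRecord(samplePost).html); // "<p>This is the first sentence.</p>"
```

The `html` field of each record is the string that the parsing functions in the following sections take as input.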

1. Parsing HTML using DOMParser

What is DOMParser?

DOMParser is a browser API that converts an HTML string into a DOM (Document Object Model) document. This lets us query the markup as if it were a web page and extract the necessary information. For example, if you want to select only the text wrapped in `p` tags from WordPress content, DOMParser is useful.

DOMParser usage example

function parseContent(htmlString) {
  const parser = new DOMParser();
  const doc = parser.parseFromString(htmlString, 'text/html');
  
  // Extracting text inside 'p' tags
  const paragraphs = doc.querySelectorAll('p');
  const parsedText = Array.from(paragraphs).map(p => p.textContent).join('\n');
  
  return parsedText;
}

const rawHtml = '<div><p>This is the first sentence.</p><p>This is the second sentence.</p></div>';
const result = parseContent(rawHtml);
console.log(result);  // "This is the first sentence.\nThis is the second sentence."

In the above code, after converting the HTML string with `DOMParser`, only the text inside `p` tags is extracted. Because DOMParser uses the browser's own HTML parser, it also copes with nested or imperfect markup, but it is available only in browser environments.
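The same approach extends from text to attributes. The sketch below is one possible way to collect link targets and image sources from post HTML; `extractLinksAndImages` and the sample markup are hypothetical, and the code assumes a browser, since `DOMParser` is not defined in plain Node.js.

```javascript
// Sketch: collect link URLs and image sources with DOMParser (browser-only).
// extractLinksAndImages is a hypothetical helper name.
function extractLinksAndImages(htmlString) {
  const doc = new DOMParser().parseFromString(htmlString, 'text/html');

  // getAttribute returns the attribute exactly as written in the markup
  const links = Array.from(doc.querySelectorAll('a[href]'), a => a.getAttribute('href'));
  const images = Array.from(doc.querySelectorAll('img[src]'), img => ({
    src: img.getAttribute('src'),
    alt: img.getAttribute('alt') ?? '',
  }));

  return { links, images };
}

const sampleHtml =
  '<p>Read <a href="https://example.com/a">this</a>.</p>' +
  '<img src="/uploads/cat.jpg" alt="A cat">';

// Only run the demo where DOMParser exists (i.e. in a browser).
if (typeof DOMParser !== 'undefined') {
  console.log(extractLinksAndImages(sampleHtml));
}
```

Using `getAttribute` rather than the `href` or `src` properties keeps relative URLs as they appear in the post, which is usually what you want when the parsed document has no meaningful base URL.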

2. Simple parsing using regular expressions

Extracting only desired tags with regular expressions

Regular expressions (Regex) are useful tools for extracting or transforming text that matches specific patterns. In particular, regular expressions are effective for extracting only specific tags or content from simple HTML strings. However, be cautious as regular expressions may return incorrect results with overly complex HTML structures.

Tag extraction example using regular expressions

function stripTags(htmlString, tag) {
  // \b keeps 'p' from also matching <pre>; [\s\S] lets the content span line breaks
  const regex = new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)</${tag}>`, 'gi');
  const matches = [];
  let match;
  
  while ((match = regex.exec(htmlString)) !== null) {
    matches.push(match[1]);
  }
  
  return matches.join('\n');
}

const rawHtml = '<div><p>This is the first sentence.</p><p>This is the second sentence.</p></div>';
const result = stripTags(rawHtml, 'p');
console.log(result);  // "This is the first sentence.\nThis is the second sentence."

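In the above code, we used a regular expression to extract the content between `p` tags. While useful for selecting specific tags in simple structures, regular expressions are not suitable for complex HTML parsing: the match returns the inner HTML as-is, so nested tags and entities stay in the result, and nested elements with the same tag name can be matched incorrectly.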
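Because the captured group is raw inner HTML, a fragment such as `<strong>` or `&amp;` passes straight through. The following is a minimal sketch of a cleanup step for such fragments; `cleanFragment` and `NAMED_ENTITIES` are hypothetical names, and the entity table is deliberately incomplete (a few named entities plus decimal numeric ones).

```javascript
// Sketch: clean up a fragment captured by a tag-matching regex.
// NAMED_ENTITIES and cleanFragment are hypothetical names; the table is not exhaustive.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', nbsp: '\u00a0' };

function cleanFragment(fragment) {
  return fragment
    // 1. drop nested tags such as <strong> or <a href="...">
    .replace(/<[^>]+>/g, '')
    // 2. decode decimal numeric entities and a few named ones in a single pass
    .replace(/&(?:#(\d+)|(amp|lt|gt|quot|nbsp));/g, (match, code, name) =>
      code ? String.fromCodePoint(Number(code)) : NAMED_ENTITIES[name])
    // 3. collapse runs of whitespace left behind by removed markup
    .replace(/\s+/g, ' ')
    .trim();
}

const fragment = 'Tom &amp; Jerry: <strong>the   first</strong> <a href="#">episode</a>';
console.log(cleanFragment(fragment)); // "Tom & Jerry: the first episode"
```

Tags are stripped before entities are decoded so that an escaped `&lt;b&gt;` ends up as the literal text `<b>` instead of being removed as markup. For anything beyond simple fragments, a real parser remains the safer choice.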

3. Server-side parsing with Cheerio library

What is Cheerio?

Cheerio is a library for parsing HTML in Node.js, with an API similar to jQuery. It does not need a browser, which makes it a good fit when large amounts of HTML have to be processed on the server instead of on the client.

Parsing HTML with Cheerio

// Install the dependency first: npm install cheerio
const cheerio = require('cheerio');

function parseWithCheerio(htmlString) {
  const $ = cheerio.load(htmlString);
  
  // Extracting text inside 'p' tags
  const parsedText = $('p').map((i, el) => $(el).text()).get().join('\n');
  
  return parsedText;
}

const rawHtml = '<div><p>This is the first sentence.</p><p>This is the second sentence.</p></div>';
const result = parseWithCheerio(rawHtml);
console.log(result);  // "This is the first sentence.\nThis is the second sentence."

Cheerio is a powerful tool for processing large amounts of HTML data on the server. If you want fast and flexible server-side HTML parsing, Cheerio is a great choice.

4. Parsing HTML in React

Parsing HTML with React

In a React environment, you can render an entire HTML string as-is with `dangerouslySetInnerHTML`. However, if you need to select only specific elements and process their data, parse the string first and use DOM access methods.

Example of parsing elements with a specific class

import React from 'react';

function parseSpecificContent(htmlString) {
  const parser = new DOMParser();
  const doc = parser.parseFromString(htmlString, 'text/html');
  
  // Extracting content with a specific class
  const specificContent = doc.querySelector('.specific-class')?.textContent || '';
  
  return specificContent;
}

const rawHtml = '<div><p class="specific-class">Content to extract</p><p>Other content</p></div>';

export default function PostContent() {
  const content = parseSpecificContent(rawHtml);

  return (
    <div>
      <h2>Parsed Content:</h2>
      <p>{content}</p>
    </div>
  );
}

This code demonstrates how to extract the text of an element containing a specific class using DOMParser in a React environment. You can use this method when parsing specific HTML content in React.
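One caveat with this approach is that `DOMParser` exists only in browsers, so a component that calls it while rendering on the server would throw. The sketch below shows one possible guard; `parseSpecificContentSafe` is a hypothetical wrapper name, and returning an empty string as the fallback is an assumption about what suits your UI.

```javascript
// Sketch: guard for environments without DOMParser (e.g. server-side rendering).
// parseSpecificContentSafe is a hypothetical wrapper name.
function parseSpecificContentSafe(htmlString, selector = '.specific-class') {
  if (typeof DOMParser === 'undefined') {
    // No DOM available: return a fallback instead of throwing
    return '';
  }
  const doc = new DOMParser().parseFromString(htmlString, 'text/html');
  return doc.querySelector(selector)?.textContent ?? '';
}

const html = '<div><p class="specific-class">Content to extract</p><p>Other content</p></div>';
console.log(parseSpecificContentSafe(html));
```

In a browser this logs the text of the matching element; where `DOMParser` is missing it logs an empty string. Inside a component, wrapping the call in React's `useMemo` keyed on the HTML string is one way to avoid re-parsing on every render.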

Conclusion: Choose the most appropriate parsing tool

To extract only the necessary information from HTML content, it is important to select the right tool for the project. Use DOMParser or regular expressions for simple structures, Cheerio for complex server-side tasks, and DOM access methods in React. Understand the pros and cons of each tool and apply them to process HTML data efficiently.
