Linux/Unix: Remove HTML Tags from File | HTML to Text

In this tutorial, I give an example of a sed command that removes HTML tags from a file on Linux/Unix systems. In other words, it converts an HTML file to a plain text file.

Sometimes, when we download text from a website, we also get the HTML tags, which makes the data harder to read.

A standard HTML page contains many types of HTML tags. Below is a sample of an HTML file:

htmlpage.html

<html>

<head>
    <title>Web Page Title</title>
</head>

<body>
    <p>
        This line contains a bold element <b>Fox Infotech</b>.
        And this line contains the italic text <i>Vinish Kapoor's Blog</i>
        This file would be converted into the plain text by using the sed command.
    </p>
</body>

</html>

HTML tags are delimited by the less-than (<) and greater-than (>) symbols. Most HTML tags come in pairs: an opening tag starts an element (for example, <p> starts a paragraph), and a closing tag ends it (for example, </p> ends the paragraph).
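
To see which parts of a file count as tags under this definition, the sketch below lists every match of the regular expression `<[^>]*>`: a `<`, then any run of characters other than `>`, then a `>`. It assumes a grep that supports the `-o` option (GNU and BSD grep do, although POSIX does not require it), and `sample.html` is a throwaway file name used only for this illustration.

```shell
# Create a small throwaway sample (hypothetical file name).
printf '%s\n' '<p>' 'Some <b>bold</b> text' '</p>' > sample.html

# -o prints each match on its own line instead of the whole matching line.
tags=$(grep -o '<[^>]*>' sample.html)
printf '%s\n' "$tags"
```

Each opening and closing tag is printed on its own line, while the text between the tags is not matched.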

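The following sed command removes the HTML tags from a file. The first expression, s/<[^>]*>//g, deletes everything from each < up to the next >, and the second, /^$/d, deletes any empty lines.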

Remove HTML Tags from a File in Linux

sed 's/<[^>]*>//g ; /^$/d' htmlpage.html

Output

Web Page Title
This line contains a bold element Fox Infotech.
And this line contains the italic text Vinish Kapoor's Blog
This file would be converted into the plain text by using the sed command.
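
One caveat with indented input: `/^$/d` only deletes lines that are completely empty. A line that held nothing but an indented tag is left behind as a line of spaces, and the indentation of the text lines is kept. The following is a minimal sketch of a variant that also trims leading and trailing whitespace before dropping empty lines. It assumes a sed that supports POSIX character classes such as `[[:space:]]` (GNU and BSD sed do), and `indented.html` is a hypothetical file name used only here.

```shell
# Build a small indented sample (hypothetical file name).
cat > indented.html <<'EOF'
<body>
    <p>
        Some <b>bold</b> text
    </p>
</body>
EOF

# 1. strip the tags
# 2. trim leading whitespace
# 3. trim trailing whitespace
# 4. delete the lines that are now empty
clean=$(sed -e 's/<[^>]*>//g' \
            -e 's/^[[:space:]]*//' \
            -e 's/[[:space:]]*$//' \
            -e '/^$/d' indented.html)
printf '%s\n' "$clean"
```

The order matters: the whitespace has to be trimmed before `/^$/d` runs, otherwise the lines that contain only spaces would not yet be empty.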

Convert HTML to Text in Linux

The following sed command removes the HTML tags and redirects the output to a text file, output.txt.

sed 's/<[^>]*>//g ; /^$/d' htmlpage.html > output.txt

Check output.txt

$ cat output.txt
Web Page Title
This line contains a bold element Fox Infotech.
And this line contains the italic text Vinish Kapoor's Blog
This file would be converted into the plain text by using the sed command.
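
Because sed processes its input one line at a time, a tag whose < and > sit on different lines is not matched by the pattern `<[^>]*>`. As a rough sketch, the whole file can be read into the pattern space first so that the substitution also sees the line breaks. This assumes GNU sed (other implementations may treat a newline inside a bracket expression differently), and `wrapped.html` is a hypothetical file name used only for this illustration.

```shell
# A tag that is split across two lines (hypothetical sample file).
cat > wrapped.html <<'EOF'
<a href="page.html"
   class="link">Example</a> site
EOF

# ':a' defines a label. On every line except the last ('$!'), 'N' appends
# the next input line to the pattern space and 'ba' jumps back to the label.
# The substitution therefore runs once, over the whole file.
joined=$(sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
             -e 's/<[^>]*>//g' wrapped.html)
printf '%s\n' "$joined"
```

This approach holds the entire file in memory, so it is best suited to files of modest size.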