Coder Perfect

Select all text between tags using regex

Problem

What is the easiest way to select all text between two tags, for example the text between every <pre> and </pre> on the page?

Asked by basheps

Solution #1

Use "<pre>(.*?)</pre>" (replacing pre with whatever tag name you want) and extract the first group. For more specific instructions, specify a language. Note that this assumes you have very simple and valid HTML.
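As a minimal sketch of extracting the first group, here is how the pattern might be applied in Python. The language choice and the sample string are assumptions for illustration; the answer itself does not name a language.

```python
import re

# Sample input (made up for this example)
html = "<pre>first block</pre><p>ignored</p><pre>second block</pre>"

# The lazy quantifier .*? stops at the nearest closing tag
pattern = re.compile(r"<pre>(.*?)</pre>")

# With one capture group, findall returns the group contents of every match
blocks = pattern.findall(html)
print(blocks)  # ['first block', 'second block']
```

Without the question mark, the greedy .* would run from the first <pre> to the last </pre> and return a single oversized match.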

If you’re doing something complicated, as several commenters have advised, use an HTML parser.
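One way to follow the parser advice, sketched with Python's standard-library html.parser module. The class name PreCollector and the helper extract_pre are hypothetical names introduced for this example, not something the answer prescribes.

```python
from html.parser import HTMLParser


class PreCollector(HTMLParser):
    """Collects the text found inside each <pre> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a <pre>
        self.blocks = []  # finished <pre> contents
        self._buf = []    # text pieces of the <pre> being read

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <PRE> arrives as "pre"
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append("".join(self._buf))
                self._buf = []

    def handle_data(self, data):
        if self.depth > 0:
            self._buf.append(data)


def extract_pre(html):
    """Return the text of every <pre> element in html."""
    parser = PreCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Unlike the regex, this copes with attributes on the tag, mixed case, and content that spans several lines, for example extract_pre('<PRE class="x">line 1\nline 2</PRE>').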

Answered by PyKing

Solution #2

The closing tag may appear on a later line than the opening tag. This is why \n must be included.

<PRE>(.|\n)*?<\/PRE>
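A short Python sketch of why the \n alternative matters. The sample string is made up, and the re.DOTALL variant at the end is an alternative specific to Python, not part of the answer.

```python
import re

# Sample input whose content spans two lines
html = "<PRE>line one\nline two</PRE>"

# By default "." does not match a newline, so this finds nothing
single_line = re.search(r"<PRE>.*?</PRE>", html)

# Adding \n as an alternative lets the match cross line breaks
multi_line = re.search(r"<PRE>(.|\n)*?</PRE>", html)

# Python alternative (assumption): re.DOTALL makes "." match newlines too
dotall = re.search(r"<PRE>(.*?)</PRE>", html, re.DOTALL)

print(single_line)          # None
print(multi_line.group(0))  # the whole <PRE>...</PRE> block
print(dotall.group(1))      # the two lines between the tags
```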

Answered by zac

Solution #3

(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</pre>))

Basically, it accomplishes the following:

(?<=(<pre>)) The selection must be preceded by the <pre> tag.

(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| ) This is simply the regular expression I want to apply. In this case, it matches a letter, a digit, a newline character, a space, or one of the special characters listed in the square brackets. The pipe symbol | stands for "or."

+? The plus sign matches one or more of the above, in any order. The question mark changes the default behavior from greedy to non-greedy (lazy).

(?=(</pre>)) The selection must be followed by the </pre> tag.

Depending on your use case, you might need to add modifiers such as i (case-insensitive) or m (multiline).

I ran this search in Sublime Text, so I did not need to use modifiers in my regex.
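As an illustration of the i modifier, here is a Python sketch in which re.IGNORECASE plays that role. The character alternation is deliberately cut down to letters, digits, newlines, and spaces to keep the example short; that simplification and the sample string are assumptions, not the answer's full pattern.

```python
import re

# Sample input with upper-case tags
html = "<PRE>Hello World 42</PRE>"

# Simplified version of the pattern: lookbehind, lazy alternation, lookahead
pattern = r"(?<=<pre>)(?:\w|\n| )+?(?=</pre>)"

# Without a modifier the lower-case <pre> in the pattern does not match <PRE>
strict = re.search(pattern, html)

# re.IGNORECASE corresponds to the i modifier
relaxed = re.search(pattern, html, re.IGNORECASE)

print(strict)            # None
print(relaxed.group(0))  # Hello World 42
```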

For non-capturing parentheses, see the JavaScript regex documentation.

Answered by DevWL

Solution #4

To exclude the delimiting tags from the match, use:

(?<=<pre>)(.*?)(?=</pre>)

(?<=<pre>) looks for text that comes after <pre>.

(?=</pre>) looks for text that comes before </pre>.

The result is the text inside the pre tag.
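The difference between matching the tags and merely looking around them can be sketched in Python, whose re module supports fixed-width lookbehind. The sample string is made up for the example.

```python
import re

# Sample input (made up for this example)
html = "<pre>alpha</pre> and <pre>beta</pre>"

# Plain pattern: the whole match (group 0) includes the tags
with_tags = [m.group(0) for m in re.finditer(r"<pre>(.*?)</pre>", html)]

# Lookaround pattern: the tags are checked but not consumed,
# so the whole match is only the text between them
without_tags = [
    m.group(0) for m in re.finditer(r"(?<=<pre>)(.*?)(?=</pre>)", html)
]

print(with_tags)     # ['<pre>alpha</pre>', '<pre>beta</pre>']
print(without_tags)  # ['alpha', 'beta']
```

This is mainly useful in tools that only expose the whole match, such as an editor's find dialog, where a capture group cannot be picked out separately.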

Answered by Jean-Simon Collard

Solution #5

To get content between elements, use the pattern below. Replace [tag] with the name of the element from which you want to extract the content.

<[tag]>(.+?)</[tag]>

When tags have attributes, such as the href attribute on an anchor tag, use the pattern below.

<[tag][^>]*>(.+?)</[tag]>
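A minimal Python sketch of substituting a tag name into the attribute-tolerant pattern. The helper name text_between and the sample HTML are hypothetical, introduced only for this example.

```python
import re


def text_between(tag, html):
    """Return the content of every <tag ...>...</tag> pair (hypothetical helper)."""
    name = re.escape(tag)  # guard against regex metacharacters in the name
    # [^>]* skips any attributes that follow the tag name
    pattern = rf"<{name}[^>]*>(.+?)</{name}>"
    return re.findall(pattern, html)


html = '<a href="https://example.com">Example</a> and <a>bare link</a>'
links = text_between("a", html)
print(links)  # ['Example', 'bare link']
```

Because [^>]* starts directly after the tag name, a short name such as a would also match the opening tag of a longer name that begins with the same letters, so this sketch suits simple, known input only.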

Answered by Shravan Ramamurthy

Post is based on https://stackoverflow.com/questions/7167279/regex-select-all-text-between-tags