Fix HTML attribute values that contain double quotes

I have a set of HTML files with illegal syntax in the attributes of tags such as <a>. For instance,

<a name="Conductor, "neutral""></a>

or

<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />

or

<b>Table of Contents:</b><ul class="xoxo"><li><a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">What are "series" and "parallel" circuits?</a>

I am trying to process the files with the Perl module XML::Twig, using parsefile_html($file_name). When it reads a file with this syntax, it gives this error:

x has an invalid attribute name 'y""' at C:/strawberry/perl/site/lib/XML/Twig.pm line 893

I need either a way to force the module to accept the bad syntax and process it, or a regular expression to search for and replace double quotes in attributes with single quotes.

+3
2 answers

Given your HTML sample, the code below works. Note that it removes the inner double quotes rather than replacing them with single quotes:

use Modern::Perl;

my $html = <<end;
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, "neutral""></a>
end

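# Strip the double quotes found inside each content="..." and name="..." value.
# The lookarounds pin the match between the opening quote and the quote that ends the tag.
$html =~ s/(?<=content=")(.*?)(?="\s*\/>)/do{my $capture = $1; $capture =~ s|"||g;$capture}/eg;
$html =~ s/(?<=name=")(.*?)(?="\s*>)/do{my $capture = $1; $capture =~ s|"||g;$capture}/eg;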

say $html;

Output:

<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, neutral"></a>


+2

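The code below matches each tag, then uses a nested substitution with the expression modifier (/e) to strip the double quotes from inside its attribute values.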

use strict;
use warnings;

my $html = <<'HTML';
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, "neutral""></a>
<a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">
HTML

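# Visit every tag; text between tags is left untouched.
$html =~ s{(<[^>]+>)}{

  my $tag = $1;

  # An attribute value ends at a quote followed by another attribute
  # or by the end of the tag; delete any quotes found before that point.
  $tag =~ s{ \w+= " \K ( [^=<>]+ ) (?= " (?: \s+\w+= | \s*/?> )) }
  {
    (my $attr = $1) =~ tr/"//d;
    $attr;
  }egx;

  $tag;
}eg;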

print $html;

Output:

<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, neutral"></a>
<a href="1.html" title="Page 1: What are series and parallel circuits?">
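
The question asked for the inner double quotes to be replaced with single quotes rather than deleted. A minimal sketch of that variant follows, assuming the same tag-matching pattern is acceptable; the helper name fix_attribute_quotes is hypothetical and only used for illustration. Like the pattern it reuses, it does not handle attribute values that contain =, < or >.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): swap stray double
# quotes inside attribute values for single quotes instead of deleting them.
sub fix_attribute_quotes {
    my ($html) = @_;

    # Visit every tag; text between tags is left untouched.
    $html =~ s{(<[^>]+>)}{
        my $tag = $1;

        # A value ends at a quote followed by another attribute
        # or by the end of the tag.
        $tag =~ s{ \w+= " \K ( [^=<>]+ ) (?= " (?: \s+\w+= | \s*/?> )) }
        {
            (my $attr = $1) =~ tr/"/'/;
            $attr;
        }egx;

        $tag;
    }eg;

    return $html;
}

print fix_attribute_quotes(qq{<a name="Conductor, "neutral""></a>\n});
```

For the sample line this prints <a name="Conductor, 'neutral'"></a>. The only change from the deleting version is the tr/// call, which maps each inner double quote to a single quote.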
+1