How to extract file names from a field containing HTML content in SQL Server?

We have a CMS that writes blocks of HTML content to a SQL Server database. I know the name of the table and the name of the column where these HTML blocks are stored. Some of the HTML contains links (<a> tags) to PDF files. Here is a snippet:

<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>

I need to extract the PDF file names from all such HTML content blocks. In the end I want a list like this:

Tuition-Reimbursement-Deferred.pdf
Some-other-file.pdf

that is, every PDF file name found in this field.

Any help is appreciated. Thanks.

UPDATE

I got a lot of answers, thank you very much, but I forgot to mention that we are still using SQL Server 2000 here. Therefore, this has to be done with T-SQL that SQL Server 2000 supports.

4 answers

Here is one way to do it in plain Transact-SQL:

SELECT CASE WHEN CHARINDEX('.pdf', html) > 0
            THEN SUBSTRING(
                     html,
                     CHARINDEX('.pdf', html) -
                     PATINDEX(
                         '%["/]%',
                         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 1,
                     PATINDEX(
                         '%["/]%',
                         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 3)
            ELSE NULL
       END AS filename
FROM mytable

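The pattern ["/] matches the character that precedes the file name (a double quote or a slash), so the file name is everything between that character and the end of ".pdf".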

SQL Fiddle demo
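The query above returns only the first ".pdf" match in each row. As a rough sketch of how the same PATINDEX/REVERSE idea could be looped to collect every match while staying within SQL Server 2000 syntax, something like the following might work. The variable names (@html, @head, @cut), the table variable @files, and its column pdf_name are made up for illustration, and the sample string is an assumption. It also assumes the value fits in varchar(8000); text/ntext columns would need extra handling that is not shown.

```sql
-- Sketch only: collect every PDF file name from one HTML string.
-- Uses only features available in SQL Server 2000 (no varchar(max), no APPLY).
DECLARE @html varchar(8000), @head varchar(8000), @pos int, @cut int
DECLARE @files TABLE (pdf_name varchar(255))

-- Hypothetical sample value standing in for one row of the HTML column
SET @html = '<a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf">list</a> '
          + '<a href=''/docs/Some-other-file.pdf''>other</a>'

SET @pos = CHARINDEX('.pdf', @html)
WHILE @pos > 0
BEGIN
    -- Text before ".pdf"
    SET @head = LEFT(@html, @pos - 1)

    -- Distance from the end of @head back to the nearest delimiter:
    -- double quote, single quote, slash or equals sign
    SET @cut = PATINDEX('%["''/=]%', REVERSE(@head))

    -- Keep the part after the delimiter and put the extension back
    IF @cut > 0
        INSERT INTO @files (pdf_name)
        SELECT RIGHT(@head, @cut - 1) + '.pdf'

    -- Drop everything through ".pdf" and look for the next match
    SET @html = SUBSTRING(@html, @pos + 4, 8000)
    SET @pos = CHARINDEX('.pdf', @html)
END

SELECT pdf_name FROM @files
```

With the sample string above, the loop should collect Tuition-Reimbursement-Deferred.pdf and Some-other-file.pdf. To run it over a whole table on SQL Server 2000, the loop body would have to be driven by a cursor or wrapped in a stored procedure, since CROSS APPLY is not available there.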


You can use a table-valued function that returns every PDF file name found in a string:

create function dbo.extract_filenames_from_a_tags (@s nvarchar(max))
returns @res table (pdf nvarchar(max)) as
begin
-- assumes there are no single quotes or double quotes in the PDF filename
declare @i int, @j int, @k int, @tmp nvarchar(max);
set @i = charindex(N'.pdf', @s);
while @i > 0
begin
  select @tmp = left(@s, @i+3);
  select @j = charindex('/', reverse(@tmp)); -- directory delimiter
  select @k = charindex('"', reverse(@tmp)); -- start of href
  if @j = 0 or (@k > 0 and @k < @j) set @j = @k;
  select @k = charindex('''', reverse(@tmp)); -- start of href (single-quote*)
  if @j = 0 or (@k > 0 and @k < @j) set @j = @k;
  insert @res values (substring(@tmp, len(@tmp)-@j+2, len(@tmp)));
  select @s = stuff(@s, 1, @i+3, ''); -- remove everything up to and including ".pdf"
  set @i = charindex(N'.pdf', @s);
end
return
end
GO

Usage:

declare @t table (html varchar(max));
insert @t values
  ('
<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>'),
  ('
<p>A deferred tuition payment plan, 
or view the <a href="Two files here-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>And I use single quotes
   <a href=''/look/path/The second file.pdf''
target="_blank">list</a>');

select t.*, p.pdf
from @t t
cross apply dbo.extract_filenames_from_a_tags(html) p;

|HTML                  |                                       PDF |
--------------------------------------------------------------------
|<p>A deferred tui.... |        Tuition-Reimbursement-Deferred.pdf |
|<p>A deferred tui.... | Two files here-Reimbursement-Deferred.pdf |
|<p>A deferred tui.... |                       The second file.pdf |

SQL Fiddle Demo


What about handling the HTML as XML? This only works if the stored HTML is well-formed enough to be cast to the xml type.

declare @t table (html varchar(max));
insert @t 
    select '
    <p>A deferred tuition payment plan, 
    or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
    target="_blank">list</a>.</p>'
    union all
    select '
    <p>A deferred tuition payment plan, 
    or view the <a href="Two files here-Reimbursement-Deferred.pdf"
    target="_blank">list</a>.</p>And I use single quotes
       <a href=''/look/path/The second file.pdf''
    target="_blank">list</a>'

select  [filename] = reverse(left(reverse('/'+p.n.value('@href', 'varchar(100)')), charindex('/',reverse('/'+p.n.value('@href', 'varchar(100)')), 1) - 1))
from    (   select  cast(html as xml)
            from    @t
        ) x(doc)
cross
apply doc.nodes('//a') p(n);

Results:

filename
---------------------------------------------------------------
Tuition-Reimbursement-Deferred.pdf
Two files here-Reimbursement-Deferred.pdf
The second file.pdf

Try this option:

DECLARE @XML XML = 
'<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>'

SELECT 
      ref_text = t.p.value('./a[1]', 'NVARCHAR(50)')
    , ref_filename = REVERSE(
                        LEFT(REVERSE(t.p.value('./a[1]/@href', 'NVARCHAR(50)')), 
                        CHARINDEX('/',REVERSE(t.p.value('./a[1]/@href', 'NVARCHAR(50)')), 1) - 1))
FROM @XML.nodes('/p') t(p)
