Regex to strip all HTML except background style URLs
I have the following regex to find background style URLs in HTML. I'm trying to strip all of the HTML except the background image URLs. The goal is to extract a list of background image URLs from an HTML page.
Expression:
url\(\s*(['"]?)(.*?)\1\s*\)
Example HTML:
<a href="#"><img style="background-image: url(http://domain.com/2003-th.jpg)"></a>
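I'd like the NOT of this expression, i.e. a match for everything except the URLs.
I don't know the NetBeans IDE, so this is a guess only.
But beware: the expression searches for url(...) everywhere. It does not matter where the text occurs: in a CSS block, in HTML style attributes, in JavaScript, in plain text, or in comments!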
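To make that warning concrete, here is a small sketch in Python (an assumption; the question itself targets NetBeans). The sample text and the helper name all_css_urls are made up for illustration. It runs the original expression over a snippet that contains url(...) in a style block, a comment, plain text and a script.

```python
import re

# The expression from the question, unchanged.
URL_ANY = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""")

def all_css_urls(text):
    """Return the second capture group (the URL) of every url(...) match."""
    return [m.group(2) for m in URL_ANY.finditer(text)]

# Hypothetical sample: only the first url(...) is a real background image.
text = """
<style>.logo { background-image: url("logo.png"); }</style>
<!-- old: url(unused.png) -->
<p>Write url(foo) in CSS to reference a file.</p>
<script>var s = "url('from-js.png')";</script>
"""

print(all_css_urls(text))
```

All four occurrences are reported, not just the one that belongs to a background-image declaration.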
General modifications
If you want to include background images only, you should state that in the regex, too. It becomes
\bbackground-image\s*:\s*url\(\s*(['"]?)(.*?)\1\s*\)
To speed things up (at least in some implementations), try to avoid backreferences. In this case:
\bbackground-image\s*:\s*url\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)
It's a bit longer, but at least in Sublime Text it's worth it.
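As a rough sketch of how the backreference-free variant could be used to collect the URLs directly, assuming Python's re module rather than an editor's search dialog (the function name background_urls is hypothetical):

```python
import re

# Alternation variant: one capture group per quoting style
# (single-quoted, double-quoted, unquoted).
BG_URL = re.compile(
    r"""\bbackground-image\s*:\s*url\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)"""
)

def background_urls(html):
    """Return every background-image URL found in the given text."""
    urls = []
    for m in BG_URL.finditer(html):
        # Exactly one of the three groups takes part in each match.
        url = m.group(1) or m.group(2) or m.group(3)
        # The unquoted branch may capture trailing whitespace, so trim it.
        urls.append(url.strip())
    return urls

sample = '<a href="#"><img style="background-image: url(http://domain.com/2003-th.jpg)"></a>'
print(background_urls(sample))
```

With a programming language at hand, this avoids the "strip everything else" problem altogether, because only the captured groups are kept.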
Use
To replace everything but the URLs of the background images, use a single regex
[\s\S]*?\bbackground-image\s*:\s*url\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)|[\s\S]+
and replace each match with $1$2$3\n (the three groups followed by a newline). There will (almost) always be two \n at the end, but I think that should be no problem.
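A minimal sketch of this single-regex replacement, assuming Python's re module (where alternatives are tried in order and the replacement uses \1 instead of $1). The sample string and the name strip_to_urls are made up for illustration.

```python
import re

# Everything up to and including a background-image url(...), or, if no
# further one exists, all the remaining text.
STRIP_TO_BG_URLS = re.compile(
    r"""[\s\S]*?\bbackground-image\s*:\s*url\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)|[\s\S]+"""
)

def strip_to_urls(html):
    # Groups that did not take part in a match expand to "" (Python 3.5+).
    return STRIP_TO_BG_URLS.sub(r"\1\2\3\n", html)

sample = (
    '<a href="#"><img style="background-image: url(http://domain.com/2003-th.jpg)"></a>\n'
    "<div style=\"background-image:url('b.png')\">x</div>"
)

stripped = strip_to_urls(sample)
# Drop the empty line produced by the trailing [\s\S]+ alternative.
urls = [line for line in stripped.splitlines() if line]
print(urls)
```

The text after the last URL is matched by the second alternative and replaced by a bare newline, which is where the extra \n at the end comes from.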
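The single-regex approach won't work in regex engines where the length of the match, not the order of the alternatives, is decisive, because there the trailing [\s\S]+ alternative would swallow the whole text at once.
However, if that's a problem, you can try to use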
[\s\S]*?\bbackground-image\s*:\s*url\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)[\s\S]*?(?=\z|\bbackground-image\s*:\s*url\(\s*(?:'[^']+'|"[^"]+"|[^)]+)\s*\))
and replace each match with $1$2$3\n.
[\s\S] means every character (including \n)
\b is a word boundary
(?= ... ) is a positive lookahead: it has to match, but is not part of the result
\z is the end of the text
(Maybe you have to tweak the regex a bit to fit NetBeans.)
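The lookahead variant could be sketched like this, again assuming Python's re module, which spells the end-of-text anchor \Z. The sample string and the name strip_with_lookahead are hypothetical.

```python
import re

# One background-image url(...) plus everything after it, up to (but not
# including) the next background-image url(...) or the end of the text.
BG_URL_AND_REST = re.compile(
    r"""[\s\S]*?\bbackground-image\s*:\s*url\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)"""
    r"""[\s\S]*?(?=\Z|\bbackground-image\s*:\s*url\(\s*(?:'[^']+'|"[^"]+"|[^)]+)\s*\))"""
)

def strip_with_lookahead(html):
    # Unmatched groups expand to "" in the replacement (Python 3.5+).
    return BG_URL_AND_REST.sub(r"\1\2\3\n", html)

sample = (
    '<a href="#"><img style="background-image: url(http://domain.com/2003-th.jpg)"></a>\n'
    "<div style=\"background-image:url('b.png')\">x</div>"
)

print(strip_with_lookahead(sample))
```

Note that this pattern only matches where at least one background-image URL exists, so a text without any is returned unchanged.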
Anyway, not every regex implementation supports lookaheads. If NetBeans doesn't support them, you have to use a multi-step approach:
First step
Replace
[\s\S]*?\bbackground-image\s*:\s*url\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)
with >-bg-url:$1$2$3\n. The >-bg-url: prefix marks the extracted values and distinguishes them from the rest of the text.
Second step
Manually delete everything after the last match (in that case you don't need the >-bg-url: prefix in the first step), or replace
^>-bg-url:(.*)|^[\s\S]+
with $1 (here ^ has to match at the start of every line, i.e. multi-line mode).
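Both steps together might look like the following sketch. It assumes Python's re module, with re.MULTILINE so that ^ matches at each line start. The sample string and the name two_step are made up for illustration.

```python
import re

# Step 1: everything up to and including a background-image url(...).
STEP1 = re.compile(
    r"""[\s\S]*?\bbackground-image\s*:\s*url\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)"""
)
# Step 2: a marked line (keep the value), or the unmarked remainder (drop it).
STEP2 = re.compile(r"^>-bg-url:(.*)|^[\s\S]+", re.MULTILINE)

def two_step(html):
    # Mark each URL and put it on its own line; leading text is discarded.
    marked = STEP1.sub(r">-bg-url:\1\2\3\n", html)
    # Strip the marker from marked lines; the unmarked tail becomes "".
    return STEP2.sub(r"\1", marked)

sample = (
    '<a href="#"><img style="background-image: url(http://domain.com/2003-th.jpg)"></a>\n'
    "<div style=\"background-image:url('b.png')\">x</div>"
)

print(two_step(sample))
```

Because step 1 ends every marked value with a newline, the leftover text after the last URL starts on its own line, which is what lets the second alternative of step 2 remove it.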