How to extract href attributes from HTML pages using grep and regular expressions

You can grep href="..." attributes out of HTML with a regular expression like this:

grep-href.sh

grep -oP "(HREF|href)=\"\K.+?(?=\")"

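grep is run with -o (print only the matched part of each line instead of the whole line) and -P (use the Perl-compatible regular expression engine, which is what provides extra features such as \K and lookahead assertions). The regular expression is basically: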

regex.txt

href=".*"

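except that .+ (one or more characters, so empty href values are skipped) takes the place of .* and is used in non-greedy mode (.+?), so the match stops at the first closing quote instead of the last one on the line: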

regex-nongreedy.txt

href=".+?"
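
To make the difference between greedy and non-greedy matching concrete, here is a minimal sketch. The HTML line is made-up sample data, and the pattern is written in single quotes so the inner double quotes need no escaping.

```shell
# One line of HTML containing two links (sample data, not from a real page).
html='<a href="/a.html">A</a> <a href="/b.html">B</a>'

# Greedy: .+ runs on to the last double quote on the line.
greedy=$(printf '%s\n' "$html" | grep -oP 'href=".+"')

# Non-greedy: .+? stops at the first closing quote, so each link matches separately.
lazy=$(printf '%s\n' "$html" | grep -oP 'href=".+?"')

printf 'greedy:\n%s\n\nnon-greedy:\n%s\n' "$greedy" "$lazy"
```

The greedy variant reports a single match that swallows both links, while the non-greedy variant reports one match per link.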

This gives us matches like

example-link.html

href="/files/image.png"

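Since we only want the content inside the quotes (") and not the href="..." part, we can use \K. It works much like a positive lookbehind assertion: everything matched before \K is dropped from the reported match, which removes the href=" prefix: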

regex-lookbehind.txt

href=\"\K.+?\"

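But we also want to get rid of the trailing double quote. For that, we can use a positive lookahead assertion ((?=\")), which requires the quote to follow without including it in the match: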

regex-lookaround.txt

href=\"\K.+?(?=\")
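
The effect of each step can be checked on a single link. This is a small sketch using the example link from above wrapped in a made-up anchor tag; the variable names are only for illustration.

```shell
# Sample anchor tag (assumed markup around the example link).
html='<a href="/files/image.png">Image</a>'

# Plain pattern: the whole attribute is printed.
full=$(printf '%s\n' "$html" | grep -oP 'href=".+?"')

# \K drops the href=" prefix, but the closing quote is still part of the match.
no_prefix=$(printf '%s\n' "$html" | grep -oP 'href="\K.+?"')

# The lookahead (?=") requires a closing quote without consuming it.
value=$(printf '%s\n' "$html" | grep -oP 'href="\K.+?(?=")')

printf '%s\n%s\n%s\n' "$full" "$no_prefix" "$value"
```

Only the last variant prints the bare attribute value.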

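Next, we match both href and HREF to get some degree of case insensitivity (mixed-case spellings such as Href are still not matched):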

regex-case.txt

(href|HREF)=\"\K.+?(?=\")
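
If mixed-case attribute names matter, one possible alternative (not what this post uses) is grep's -i flag, which makes the whole pattern case-insensitive. A short sketch on made-up sample markup:

```shell
# Sample line with three spellings of the attribute name.
html='<A Href="/mixed.html">x</A> <a HREF="/upper.html">y</a> <a href="/lower.html">z</a>'

# Alternation only covers the two spellings listed in the pattern.
alternation=$(printf '%s\n' "$html" | grep -oP '(href|HREF)="\K.+?(?=")')

# -i ignores case for the entire pattern, so Href is matched as well.
insensitive=$(printf '%s\n' "$html" | grep -oiP 'href="\K.+?(?=")')

printf 'alternation:\n%s\n\n-i:\n%s\n' "$alternation" "$insensitive"
```

Keep in mind that -i applies to every literal in the pattern, so a file extension added later would also be matched regardless of case.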

Often we want to match only one specific file type. For example, we can match only .png links:

match-png.txt

(href|HREF)=\"\K.+?\.png(?=\")

To reduce overly long false matches on some pages, we use [^\"]+? instead of .+?:

match-png-safe.txt

(href|HREF)=\"\K[^\"]+?\.png(?=\")
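
A made-up line shows the kind of false match this guards against: a link to an HTML page followed by an image tag whose src ends in .png. This is a sketch, and the sample markup is an assumption.

```shell
# The href points to an .html page; only the later src attribute ends in .png.
html='<a href="/a.html">A</a> <img src="/b.png">'

# .+? may cross quotes, so the match runs from the href into the src attribute.
loose=$(printf '%s\n' "$html" | grep -oP '(href|HREF)="\K.+?\.png(?=")')

# [^"]+? cannot cross a quote, so nothing matches (grep exits non-zero, hence || true).
strict=$(printf '%s\n' "$html" | grep -oP '(href|HREF)="\K[^"]+?\.png(?=")' || true)

printf 'loose:  %s\nstrict: %s\n' "$loose" "$strict"
```

The loose pattern reports a string that is not a link at all, while the strict pattern correctly reports nothing.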

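This disallows matches containing a " character, which prevents the match from running past the end of the attribute value into the rest of the tag.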

Usage example:

wget-grep-png.sh

wget -qO- https://nasagrace.unl.edu/data/NASApublication/maps/ | grep -oP "(href|HREF)=\"\K[^\"]+?\.png(?=\")"

Output:

output.txt

/data/NASApublication/maps/GRACE_SFSM_20201026.png
[...]
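
Since the extracted paths are root-relative, a natural follow-up is to turn them into full URLs. The sketch below wraps the pattern in a function and adds the host with sed. The function names extract_png_links and to_absolute are hypothetical helpers, and the inline sample stands in for the downloaded page so the snippet runs offline.

```shell
# Hypothetical helper: reads HTML on stdin and prints the .png link targets.
extract_png_links() {
  grep -oP '(href|HREF)="\K[^"]+?\.png(?=")'
}

# Hypothetical helper: prefixes root-relative paths with the base URL given as $1.
# Assumes the base URL contains no | or & characters, which are special to sed here.
to_absolute() {
  sed "s|^/|$1/|"
}

# Offline sample standing in for the downloaded page.
sample='<a href="/data/NASApublication/maps/GRACE_SFSM_20201026.png">map</a>'

urls=$(printf '%s\n' "$sample" | extract_png_links | to_absolute 'https://nasagrace.unl.edu')
printf '%s\n' "$urls"

# With network access, the same pipeline could be fed from wget instead:
#   wget -qO- https://nasagrace.unl.edu/data/NASApublication/maps/ \
#     | extract_png_links | to_absolute 'https://nasagrace.unl.edu'
```

Links that are already absolute or relative to the current directory pass through to_absolute unchanged, so they would need separate handling.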
