Claude.AI helps me improve the IsHtml check | Scribbles for my memory

Let me re-analyze the problem:

HTML cases that need to be handled:

Simple tag: <div>text</div>
Tag with attributes: <a href="url">text</a>
Self-closing tag: <br/>
Self-closing tag with attributes: <img src="img.jpg"/>

Markdown link format that must be excluded:

Specifically the <https://...> autolink form
Its signature is a < immediately followed by http or https

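The problem with the current regex:

<\s*([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*([a-zA-Z]+)[^>]*/>

It accepts any run of letters after < as a tag name. For <https://abc.com/> the self-closing branch takes https as the name, lets [^>]* consume ://abc.com, and then matches the trailing />, which is the false positive.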
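To see the false positive concretely, here is a minimal sketch that runs the original pattern against the autolink. It assumes C# 9 or later (top-level statements); the variable names `original` and `m` are illustrative and not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

// The original pattern, copied from the text above, with the same options
// that the IsHtml method passes to Regex.IsMatch.
var original = new Regex(
    @"<\s*([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*([a-zA-Z]+)[^>]*/>",
    RegexOptions.Singleline | RegexOptions.IgnoreCase);

Match m = original.Match("<https://abc.com/>");

// Whether the Markdown autolink is mistaken for HTML.
Console.WriteLine(m.Success);

// Group 2 belongs to the self-closing branch; it holds what was taken as the tag name.
Console.WriteLine(m.Groups[2].Value);
```

Running this should print True followed by https, which shows that the self-closing branch is the one responsible for the mismatch.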

How to improve it:

Exclude tag names that start with http(s)
Use a negative lookahead, (?!https?)
Place it immediately before the tag-name group
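The effect of the lookahead can be checked in isolation. The sketch below uses only the opening part of the pattern (the `<`, the optional whitespace, the lookahead, and the tag-name group). It assumes C# 9 or later; `tagStart` and the sample strings are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Only the start of the pattern: "<", optional whitespace, a negative lookahead
// that rejects names beginning with "http", then the tag name itself.
var tagStart = new Regex(@"<\s*(?!https?)([a-zA-Z]+)", RegexOptions.IgnoreCase);

string[] openings = { "<div>", "<a href=\"url\">", "<https://abc.com/>", "<HTTP://ABC.COM/>" };

foreach (string sample in openings)
{
    Match m = tagStart.Match(sample);
    string result = m.Success ? "tag name: " + m.Groups[1].Value : "rejected";
    Console.WriteLine($"{sample} -> {result}");
}
```

Because the lookahead consumes no characters, the tag-name group still captures the full name for ordinary tags. With RegexOptions.IgnoreCase the check also covers upper-case URLs. One side effect: any element whose name begins with http would be rejected too, although no standard HTML element is named that way.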

The improved regular expression:

<\s*(?!https?)([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*(?!https?)([a-zA-Z]+)[^>]*/>

Verifying the result:

<div>text</div> ✓ match
<a href="url">text</a> ✓ match
<br/> ✓ match
<img src="img.jpg"/> ✓ match
<https://abc.com/> ✗ no match
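The checks listed above can be reproduced with a small harness. This is a sketch assuming C# 9 or later; `improved` and `samples` are illustrative names, while the pattern and options are the ones given in this post.

```csharp
using System;
using System.Text.RegularExpressions;

// The improved pattern with the negative lookahead in both branches.
var improved = new Regex(
    @"<\s*(?!https?)([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*(?!https?)([a-zA-Z]+)[^>]*/>",
    RegexOptions.Singleline | RegexOptions.IgnoreCase);

// The five inputs from the verification list.
string[] samples =
{
    "<div>text</div>",
    "<a href=\"url\">text</a>",
    "<br/>",
    "<img src=\"img.jpg\"/>",
    "<https://abc.com/>",
};

foreach (string s in samples)
{
    // Prints each input followed by True (treated as HTML) or False.
    Console.WriteLine($"{s} => {improved.IsMatch(s)}");
}
```

The first four inputs should report True and the autolink should report False, matching the list above.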

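This solution:

Keeps the basic structure of the original regex
Only adds a restriction on the tag name
Still matches the four HTML forms listed above
Excludes Markdown autolinks of the form <http(s)://...>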

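Suggested code change:

using System.Text.RegularExpressions;

// Extension method: it must be declared inside a static class.
public static bool IsHtml(this string input)
{
    // (?!https?) keeps Markdown autolinks such as <https://...> from being read as tags.
    return Regex.IsMatch(input,
        @"<\s*(?!https?)([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*(?!https?)([a-zA-Z]+)[^>]*/>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);
}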
