Select an HTML text element with a regular expression?

I want to find © in an HTML document and, essentially, get the entity that holds the copyright.

The copyright line appears in a few different forms:

 © 2011 The New York Times Company

or

  © 2011 The New York Times Company

or

 
Published since 1996
Copyright © CounterPunch
 All rights reserved.

I want to ignore the date and any intervening tags and just get "The New York Times Company" or "CounterPunch".

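I couldn't find much about using regular expressions with JavaScript or jQuery, but I get the impression that it can cause serious problems. If there is a better way, please let me know.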

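For a robust solution you will probably need to combine DOM navigation with some heuristics. Your examples can be solved with a regular expression, but there may be many more scenarios...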

 ©[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)

This works for your three samples, but only for those and similar cases.

See it on Rubular.

Explanation:

 ©                     // copyright symbol
 [\s\d]*               // followed by spaces or digits
 (?:<\/.+?>[^>]*>)?    // maybe followed by a closing tag and another opening one
 ([^<]*)               // then match anything up to the next tag
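To make the "only for them and similar cases" caveat concrete, here is a small sketch of two inputs the pattern does not handle. Both input strings and the name "Example Corp" are made up for illustration and are not from the question.

```javascript
// Same pattern as above, unchanged.
const copyrightPattern = /©[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/;

// Hypothetical input 1: the name is wrapped in its own tag with no closing
// tag before it. The optional group only accepts a closing tag first, so it
// is skipped and the capture group stops right away at "<".
const wrappedName = "© 2011 <b>Example Corp</b>".match(copyrightPattern);

// Hypothetical input 2: raw markup that still holds the entity &copy;
// instead of the literal character, so the pattern finds nothing.
const entityForm = "&copy; 2011 Example Corp".match(copyrightPattern);

console.log(JSON.stringify(wrappedName[1])); // prints ""
console.log(entityForm);                     // prints null
```

Cases like these are where DOM navigation plus heuristics is likely to hold up better than extending the pattern.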

See this answer for how to use regular expressions in JavaScript. Basically, you can use the match(/regex/) function:

 var result = string.match(/©[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/)
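As a rough sketch of how the result might be used: match returns null when nothing is found, and the holder name is in capture group 1. The helper name extractCopyrightHolder and the markup in samples are assumptions, since the question's original tags are not shown; they only approximate the three cases.

```javascript
// Hypothetical helper: returns the captured holder name, or null if the
// pattern does not match or captures nothing useful.
function extractCopyrightHolder(html) {
  const match = html.match(/©[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/);
  if (match === null) return null;
  const name = match[1].trim();
  return name === "" ? null : name;
}

// Assumed markup for the three cases described in the question.
const samples = [
  '<p>© 2011 The New York Times Company</p>',
  '<span>© 2011</span> <a href="#">The New York Times Company</a>',
  'Published since 1996<br>Copyright © CounterPunch<br> All rights reserved.',
];

const holders = samples.map(extractCopyrightHolder);
console.log(holders);
```

Trimming the capture and checking for null keeps the caller from having to deal with stray whitespace or a failed match.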

 // Keep only the innermost elements whose text contains ©
 $('*:contains(©)').filter(function () {
   return $(this).find('*:contains(©)').length == 0;
 }).text();

Test it here: http://jsfiddle.net/unloco/kGPYA/