Extracting page information in PHP with regular expressions

Published: 2023-07-19 15:53:02 · Posted by: guest · Category: HTML

1. Get the page title

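// Extract the title
preg_match('/<title\b[^>]*>(?<title>.*?)<\/title>/is', $html, $titleArr);

// null when the page has no <title> element
$title = isset($titleArr['title']) ? trim($titleArr['title']) : null;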
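As a usage sketch, the snippet below wraps the same pattern in a small helper and runs it on an inline HTML string. The name `extractTitle` and the sample markup are assumptions made for illustration; the `s` modifier is added so that a title spanning several lines still matches, and entity decoding is an optional extra.

```php
<?php
// Hypothetical helper (not from the original article): returns the text of
// the first <title> element, or null when the page has none.
function extractTitle(string $html): ?string
{
    if (preg_match('/<title\b[^>]*>(?<title>.*?)<\/title>/is', $html, $m)) {
        // Decode entities such as &amp; and strip surrounding whitespace.
        return trim(html_entity_decode($m['title'], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }
    return null;
}

$sample = "<html><head><TITLE>\n  Tom &amp; Jerry\n</TITLE></head><body></body></html>";
var_dump(extractTitle($sample));            // string(11) "Tom & Jerry"
var_dump(extractTitle('<p>no title</p>'));  // NULL
```

Returning null instead of reading `$m['title']` unconditionally avoids an undefined-index warning on pages without a title.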

2. Get the body content, and rewrite its image and background-image paths to point at another image URL

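/**
 * Get the content of the BODY region
 * @param string $html
 * @param string|null $urlRoot URL prefix that replaces the relative ../ segments
 * @return string|null
 */
function getBody($html, $urlRoot = null){

    // Extract the body, preferring the <!--body--> ... <!--body--> markers
    preg_match('/<!--body-->(.*?)<!--body-->/is', $html, $bodyArr);

    if(!$bodyArr){
        // Fall back to the <body> element
        preg_match('/<body\b.*?>(.*?)<\/body>/is', $html, $bodyArr);
    }

    if(!$bodyArr){
        // Neither the markers nor a <body> element were found
        return null;
    }

    $body = $bodyArr[1];

    // Rewrite <img> sources that point into the img folder
    $body = preg_replace('/(<img\b[^>]*?\bsrc=[\'"])(\.\.\/)*(img\/[^\'"]+)/i', '${1}' . $urlRoot . '${3}', $body);

    // Rewrite CSS background images inside the HTML
    $body = preg_replace('~\b(background(-image)?\s*:([^;{}]*?)\(\s*[\'"]?)(\.\.\/)*(img\/.*?)\s*\)~i', '${1}' . $urlRoot . '${5})', $body);

    return $body;

}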
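The `<img>` rewrite from this step can be tried in isolation. The sketch below uses `preg_replace_callback` rather than a replacement string, so a URL root that contains `$` or begins with a digit cannot be misread as a group reference; this is a swapped-in variant, not the article's original call. The function name `rewriteImgSrc` and the example URLs are assumptions.

```php
<?php
// Hypothetical helper: rewrites relative <img src="../img/..."> paths so
// they point at $urlRoot. Absolute URLs and other folders are left alone.
function rewriteImgSrc(string $html, string $urlRoot): string
{
    // Group 1: '<img ... src="'   Group 2: the path starting at the img/ folder
    $pattern = '~(<img\b[^>]*?\bsrc=["\'])(?:\.\./)*(img/[^"\']+)~i';

    return preg_replace_callback($pattern, function (array $m) use ($urlRoot) {
        return $m[1] . $urlRoot . $m[2];
    }, $html);
}

$html = '<IMG alt="logo" src="../../img/logo.png"><img src="https://cdn.example.com/a.png">';
echo rewriteImgSrc($html, 'https://example.com/static/'), "\n";
// <IMG alt="logo" src="https://example.com/static/img/logo.png"><img src="https://cdn.example.com/a.png">
```

As in the article, leading `../` segments are dropped rather than resolved, so the result is only correct when every image really lives under the `img/` folder of the URL root.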

3. Extract the page description

function getDescription($html){

    // Get the 'content' attribute value of <meta name="description" ... />

    $matches = array();

    // Search for <meta name="description" content="Buy my stuff" />
    // The back-references (\1, \2) require the closing quote to match the opening one
    if (preg_match('/<meta\b[^>]*?\bname=(["\'])description\1[^>]*?\bcontent=(["\'])((?:(?!\2).)*)\2/is', $html, $matches)) {
        return trim($matches[3]);
    }

    // Order of attributes could be swapped around: <meta content="Buy my stuff" name="description" />
    if (preg_match('/<meta\b[^>]*?\bcontent=(["\'])((?:(?!\1).)*)\1[^>]*?\bname=(["\'])description\3/is', $html, $matches)) {
        return trim($matches[2]);
    }

    // No match
    return null;

}
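For comparison, the same lookup can be done without regular expressions. The sketch below is an alternative, not the article's method: it uses PHP's bundled DOM extension, so attribute order and quote style do not matter. The function name `getDescriptionDom` and the sample markup are assumptions, and `loadHTML` may misread non-ASCII text when the page does not declare its charset.

```php
<?php
// Hypothetical alternative to the regex version: reads the description
// through DOMDocument (requires the dom extension, enabled by default).
function getDescriptionDom(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Compare case-insensitively: name="Description" also counts.
        if (strtolower($meta->getAttribute('name')) === 'description') {
            return trim($meta->getAttribute('content'));
        }
    }
    return null;
}

$a = '<html><head><meta name="description" content=" Buy my stuff "></head><body></body></html>';
$b = "<html><head><meta content='Buy my stuff' name='Description'></head><body></body></html>";
var_dump(getDescriptionDom($a)); // string(12) "Buy my stuff"
var_dump(getDescriptionDom($b)); // string(12) "Buy my stuff"
```

A parser is slower than a single `preg_match`, but it avoids maintaining one pattern per attribute order.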

4. Replace background-image paths in a CSS file

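/**
 * Get the CSS content
 * @param string $cssCnt
 * @param string|null $urlRoot
 * @return string
 */
function getCss($cssCnt, $urlRoot = null){

    // Match relative image paths inside the img folder (absolute paths are not matched)
    // The replacement is approximate: leading ../ segments are simply swapped for
    // $urlRoot, without resolving how many levels (../../ and so on) they climb
    $css = preg_replace('~\b(background(-image)?\s*:([^;{}]*?)\(\s*[\'"]?)(\.\.\/)*(img\/.*?)\s*\)~i', '${1}' . $urlRoot . '${5})', $cssCnt);

    // Add a CSS prefix: scope every selector under the .pat class
    // At-rules are left untouched; keyframe steps (from/to) are not special-cased
    $css = preg_replace_callback('~(\s*)([^{}]+)(?=\{[^{}]*\})~', function ($m) {
        $selector = trim($m[2]);
        if ($selector === '' || $selector[0] === '@') {
            return $m[0];
        }
        return $m[1] . '.pat ' . implode(', .pat ', array_map('trim', explode(',', $selector)));
    }, $css);

    // TODO: minify the CSS
    return $css;

}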
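The path rewrite can also be exercised on its own. The following is a minimal sketch under stated assumptions: `rewriteCssUrls` is a hypothetical name, the URLs are made up, and unlike the article's pattern it targets every `url(...)` reference rather than only `background` declarations.

```php
<?php
// Hypothetical helper: points url(...) references into the img/ folder at
// $urlRoot. Like the article's version, it simply drops any leading ../
// segments instead of resolving them against the stylesheet's location.
function rewriteCssUrls(string $css, string $urlRoot): string
{
    // Group 1: opening quote (may be empty)   Group 2: path starting at img/
    $pattern = '~url\(\s*(["\']?)(?:\.\./)*(img/[^"\')\s]+)\1\s*\)~i';

    return preg_replace_callback($pattern, function (array $m) use ($urlRoot) {
        return 'url(' . $m[1] . $urlRoot . $m[2] . $m[1] . ')';
    }, $css);
}

$css = <<<CSS
.banner { background: #fff url("../img/banner.jpg") no-repeat; }
.icon   { background-image: url( ../../img/icons/ok.png ); }
.remote { background: url(https://cdn.example.com/x.png); }
CSS;

echo rewriteCssUrls($css, 'https://example.com/static/'), "\n";
// .banner { background: #fff url("https://example.com/static/img/banner.jpg") no-repeat; }
// .icon   { background-image: url(https://example.com/static/img/icons/ok.png); }
// .remote { background: url(https://cdn.example.com/x.png); }
```

The back-reference `\1` keeps the closing quote consistent with the opening one, and the absolute URL in `.remote` is left untouched because it does not start with `img/`.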

If you repost this article, please credit the source: Extracting page information in PHP with regular expressions
Article URL: https://pptw.com/jishu/318610.html
