Day17 - More on Web Crawlers (Part 2) | 是 Ray 不是 Array

Introduction

This post picks up where the previous one left off and finishes the parts of the crawler we didn't get to.

More on Web Crawlers

To start, let's review what we wrote last time:

// index.js
const fs = require('fs');

const cheerio = require('cheerio');

// Fetch a URL and return the response body as text
// (uses the global fetch, available in Node.js 18+)
const getData = async (url) => {
  try {
    const response = await fetch(url);
    const data = await response.text();
    return data;
  } catch (error) {
    console.log(error);
  }
}

const crawler = async () => {
  // Fetch the first page once up front
  const html = await getData('https://ithelp.ithome.com.tw/2022ironman/signup/list');
  // Parse the HTML with cheerio
  const $ = cheerio.load(html);
  // Find the link to the last page
  const paginationInner = $('span.pagination-inner > a').last();
  // Read its text and convert the page number to a number
  const lastPage = Number(paginationInner.text());
  // Create an empty array to hold the results
  const data = [];

  // Loop over every page
  for(let i = 1; i <= lastPage; i++) {
    console.log(`Crawling page ${i}`);

    // Fetch the current page
    const html = await getData(`https://ithelp.ithome.com.tw/2022ironman/signup/list?page=${i}`);

    // Parse the HTML with cheerio
    const $ = cheerio.load(html);

    // Participant cards
    const listCard = $('.list-card');

    // Use each to visit every participant card
    listCard.each((index, element) => {
      // Extract the participant's name, category, title, and URL
      const name = $(element).find('.contestants-list__name').text();
      const category = $(element).find('.tag span').text();
      const title = $(element).find('.contestants-list__title').text();
      const url = $(element).find('.contestants-list__title').attr('href');
      data.push({
        name,
        category,
        title,
        url,
      });
    });

    // Pause so we don't put too much load on the server
    await new Promise((resolve) => {
      setTimeout(() => {
        resolve();
      }, 5000); // wait 5 seconds between requests
    })
  }

  // Write the results to data.json
  fs.writeFileSync('./data.json', JSON.stringify(data));
}

crawler();

This crawler basically gives us data like the following:

// data.json
[
  {
    "name":"Ray",
    "category":"Modern Web",
    "title":"終究都要學 React 何不現在學呢？",
    "url":"https://ithelp.ithome.com.tw/users/20119486/ironman/5111"
  },
  // ...other entries omitted
]

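But we are still missing the view, Like, and comment counts, so we need to go into each participant's article page to get them. We already have the URL of every article page; we just haven't fetched and parsed those pages yet.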

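So next we figure out how to parse the article pages. It's actually simple: the list has already been saved to data.json, so we only need to write a second crawler that reads each url from data.json and fetches the corresponding article page.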

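For this, please create another file named parse-list.js in the example-ithelp-crawler project folder from before. We want to keep the two crawlers separate, so each one lives in its own file. Also rename the earlier index.js to get-list.js.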

`touch parse-list.js`

Now straight to the code. First, let's look at the structure of the page that comes back from the request:

// parse-list.js
const data = require('./data.json');
const fs = require('fs');
const cheerio = require('cheerio');

// Fetch a URL and return the response body as text
const getData = async (url) => {
  try {
    const response = await fetch(url);
    const data = await response.text();
    return data;
  } catch (error) {
    console.log(error);
  }
}

const crawler = async () => {
  const length = data.length;
  console.log(length);
  // Use i < length: data[length] does not exist, so i <= length would throw
  for(let i = 0; i < length; i++) {
    console.log(`Crawling entry ${i + 1}`);
    const html = await getData(data[i].url);
    const $ = cheerio.load(html);
  }
}

crawler()

Once it runs, the only part of the response we need to locate is this:

<!-- other markup omitted -->
<div class="qa-list profile-list ir-profile-list">
  <!-- other markup omitted -->
</div>
<div class="qa-list profile-list ir-profile-list">
  <div class="profile-list__condition">
    <a class="qa-condition ">
      <span class="qa-condition__count">
        0
      </span>
      <span class="qa-condition__text">
        Like
      </span>
    </a>
    <a class="qa-condition ">
      <span class="qa-condition__count">
        0
      </span>
      <span class="qa-condition__text">
        留言
      </span>
    </a>
    <a class="qa-condition   qa-condition--change ">
      <span class="qa-condition__count">
        651
      </span>
      <span class="qa-condition__text">
        瀏覽
      </span>
    </a>
  </div>
  <div class="profile-list__content">
    <div class="ir-qa-list__status">
      <span class="ir-qa-list__days ir-qa-list__days--profile ">
        DAY 1
      </span>
    </div>
    <h3 class="qa-list__title">
      <a href="https://ithelp.ithome.com.tw/articles/10287240
            " class="qa-list__title-link">
        Day1-C語言的hello_world
      </a>
    </h3>
    <p class="qa-list__desc">
      系統:ubuntu-22.04 需要安裝套件如下(Command): sudo apt install build-essential C:
      #include...
    </p>
    <div class="qa-list__info">
      <a title="2022-09-01 20:29:49" class="qa-list__info-time">
        2022-09-01
      </a>
      ‧ 由
      <a href="https://ithelp.ithome.com.tw/users/20151652/profile" class="qa-list__info-link">
        Hello_world
      </a>
      分享
    </div>
  </div>
</div>
<div class="qa-list profile-list ir-profile-list">
  <!-- other markup omitted -->
</div>
<!-- other markup omitted -->

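So we can first confirm that the elements we want are .qa-list.profile-list.ir-profile-list, and the data we want from each one is Like, 留言 (comments), and 瀏覽 (views). After extracting the three values we still have to add them all up, so to keep things from getting too complicated, we start with only the first page: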

// parse-list.js
const data = require('./data.json');
const fs = require('fs');
const cheerio = require('cheerio');

// Fetch a URL and return the response body as text
const getData = async (url) => {
  try {
    const response = await fetch(url);
    const data = await response.text();
    return data;
  } catch (error) {
    console.log(error);
  }
}

const crawler = async () => {
  const length = data.length;
  // Use i < length: data[length] does not exist, so i <= length would throw
  for(let i = 0; i < length; i++) {
    console.log(`Crawling entry ${i + 1}`);
  
    const html = await getData(data[i].url);
    const $ = cheerio.load(html);

    // The block we want to scrape
    const qaList = $('.qa-list.profile-list.ir-profile-list > div.profile-list__condition');

    // Running totals
    let like = 0;
    let comment = 0;
    let view = 0;

    // Loop over every matched block
    qaList.each((index, element) => {
      // Wrap the element as a cheerio object
      const qaListElement = $(element);

      // Visit every a tag inside it
      qaListElement.find('a').each((index, element) => {
        // Wrap the a tag as a cheerio object as well
        const qaListElementA = $(element);
        // Read the text of the a tag
        const qaListElementAText = qaListElementA.text();

        // Check which label the text contains and add the number to the matching total
        if(qaListElementAText.includes('Like')) {
          like += Number(qaListElementA.find('.qa-condition__count').text());
        }
        if(qaListElementAText.includes('留言')) {
          comment += Number(qaListElementA.find('.qa-condition__count').text());
        }
        if(qaListElementAText.includes('瀏覽')) {
          view += Number(qaListElementA.find('.qa-condition__count').text());
        }
      });
      
    });

    console.log(like, comment, view)
  }
}

crawler()
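
The three if-checks above can also be written as a lookup from label to counter. The following is only a minimal sketch of that idea, not part of the original script: LABEL_TO_KEY and addCount are hypothetical names, and it assumes the label is read on its own (for example from .qa-condition__text) rather than from the whole a tag. It uses no cheerio, so it can be tried in isolation.

```javascript
// Hypothetical mapping from the label text on the page to our counter names.
const LABEL_TO_KEY = new Map([
  ['Like', 'like'],
  ['留言', 'comment'],
  ['瀏覽', 'view'],
]);

// Add one scraped label/count pair to the running totals.
// `labelText` and `countText` are raw strings as returned by .text(),
// so they may be surrounded by spaces and newlines.
const addCount = (totals, labelText, countText) => {
  const key = LABEL_TO_KEY.get(labelText.trim());
  const count = Number(countText); // Number() ignores surrounding whitespace
  // Ignore labels we do not know and counts that are not numbers.
  if (key === undefined || Number.isNaN(count)) {
    return totals;
  }
  return { ...totals, [key]: totals[key] + count };
};

let totals = { like: 0, comment: 0, view: 0 };
totals = addCount(totals, '\n        Like\n      ', '\n        0\n      ');
totals = addCount(totals, '\n        瀏覽\n      ', '\n        651\n      ');
console.log(totals); // { like: 0, comment: 0, view: 651 }
```

Supporting another label then only means adding one entry to the map.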

You should be able to fetch and tally the numbers without any surprises. Next we handle pagination, so let's look at the pagination markup:

<div class="profile-pagination">
  <ul class="pagination">
    <li class="disabled"><span>上一頁</span></li>
    <li class="active"><span>1</span></li>
    <li><a href="https://ithelp.ithome.com.tw/users/20129584/ironman/5891?page=2">2</a></li>
    <li><a href="https://ithelp.ithome.com.tw/users/20129584/ironman/5891?page=3">3</a></li>
    <li><a href="https://ithelp.ithome.com.tw/users/20129584/ironman/5891?page=2" rel="next">下一頁</a></li>
  </ul>
</div>

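The idea is much the same as before: we need the last page number, which is 3 here. But we can't hard-code it, because a participant may have only 1 page, or only 2, so this part has to be determined entirely at run time. The complete code is below, with comments added line by line: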

// parse-list.js
const data = require('./data.json');
const fs = require('fs');
const cheerio = require('cheerio');

// Fetch a URL and return the response body as text
const getData = async (url) => {
  try {
    const response = await fetch(url);
    const data = await response.text();
    return data;
  } catch (error) {
    console.log(error);
  }
}

const crawler = async () => {
  const length = data.length;

  for(let i = 0; i < length; i++) {
    console.log(`Crawling entry ${i + 1}, ${data[i].url}`);
  
    const html = await getData(data[i].url);
    const $ = cheerio.load(html);

    // Find the number of the last page:
    // take the last li (last()), step back to the previous li (prev()),
    // find the a tag inside it (find('a')), and read its text (text()).
    // If there is no such link (for example when there is only one page),
    // the text is empty, so fall back to 1.
    const page = Number($('.profile-pagination > ul > li').last().prev().find('a').text()) || 1;

    // Running totals
    let like = 0;
    let comment = 0;
    let view = 0;

    // Loop over the pages
    for(let j = 0; j < page; j++) {
      console.log(`Page ${j + 1}, ${data[i].url}?page=${j + 1}`);
      // Fetch the current page
      
      const html = await getData(`${data[i].url}?page=${j + 1}`);
      // Parse the page with cheerio
      const $ = cheerio.load(html);
      // The block we want to scrape
      const qaList = $('.qa-list.profile-list.ir-profile-list > div.profile-list__condition');
      
      // Loop over every matched block
      qaList.each((index, element) => {
        // Wrap the element as a cheerio object
        const qaListElement = $(element);

        // Visit every a tag inside it
        qaListElement.find('a').each((index, element) => {
          // Wrap the a tag as a cheerio object as well
          const qaListElementA = $(element);
          // Read the text of the a tag
          const qaListElementAText = qaListElementA.text();

          // Check which label the text contains and add the number to the matching total
          if(qaListElementAText.includes('Like')) {
            like += Number(qaListElementA.find('.qa-condition__count').text());
          }
          if(qaListElementAText.includes('留言')) {
            comment += Number(qaListElementA.find('.qa-condition__count').text());
          }
          if(qaListElementAText.includes('瀏覽')) {
            view += Number(qaListElementA.find('.qa-condition__count').text());
          }
        });
        
      });
    }
    console.log(like, comment, view)

    // Write the totals back onto the original entry
    data[i].like = like;
    data[i].comment = comment;
    data[i].view = view;

    // Pause so we don't put too much load on the server
    await new Promise((resolve) => {
      setTimeout(() => {
        resolve();
      }, 5000); // wait 5 seconds between entries
    })
  }

  // Write the tallied data to data2.json
  fs.writeFileSync('./data2.json', JSON.stringify(data));
}

crawler()
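
The page-count lookup is the step most dependent on the markup, so it may be worth isolating. Below is a minimal sketch, assuming the scraped text can come back empty or non-numeric when there is no page link to read (for example a series with a single page). parseLastPage is a hypothetical helper name, not something from the original script.

```javascript
// Hypothetical helper: turn the scraped pagination text into a page count.
const parseLastPage = (text) => {
  const page = Number.parseInt(text.trim(), 10);
  // Empty text, or anything that is not a positive number: treat as 1 page.
  return Number.isInteger(page) && page > 0 ? page : 1;
};

console.log(parseLastPage('3'));      // 3
console.log(parseLastPage(''));       // 1
console.log(parseLastPage('下一頁')); // 1
```

With a fallback of 1, the page loop always runs at least once, so a single-page series still gets counted.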

Note
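This code is for demonstration only, and I'd advise against casually pulling it down and running it: there is a lot of data, so it runs very slowly. You can practice on a site with less data instead, or change const length = data.length; to const length = 10; for testing.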
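
One caveat with hard-coding const length = 10;: if data.json happens to hold fewer than 10 entries, data[i] is undefined past the end and data[i].url throws. A small sketch of an alternative is to slice the list instead. takeSample is a hypothetical helper name, and the entries below are made-up placeholders.

```javascript
// Hypothetical helper: keep only the first `limit` entries for a trial run.
// slice() never reads past the end, so a short list is returned as-is.
const takeSample = (entries, limit = 10) => entries.slice(0, limit);

// Made-up placeholder entries standing in for data.json.
const entries = [{ url: 'a' }, { url: 'b' }, { url: 'c' }];
console.log(takeSample(entries, 2).length);  // 2
console.log(takeSample(entries, 10).length); // 3
```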

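With the code above, between the previous post and this one you should now have two files: one that fetches the participant list (get-list.js) and one that fetches each participant's article pages (parse-list.js). Why split them in two? Because the participant list doesn't change often, so it only needs to run once in a while, which is why it gets its own script.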

I won't spend time on the front end here, though; if I also walked through presenting the data on a page, this would never end QQ

Still, I trust you've noticed that once we know how to use a crawler, we can fetch the data we want and assemble it into whatever format we need.
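
As one small illustration of reshaping the result, here is a sketch that ranks entries by view count. It assumes records shaped like the ones written to data2.json (name, title, url, like, comment, view); topByViews is a hypothetical helper, and the sample records are made up for the example.

```javascript
// Hypothetical helper: return the `limit` most-viewed entries, highest first,
// keeping only the fields needed for a ranking.
const topByViews = (entries, limit = 3) =>
  [...entries] // copy first so the original array is not reordered
    .sort((a, b) => b.view - a.view)
    .slice(0, limit)
    .map(({ name, title, view }) => ({ name, title, view }));

// Made-up sample records in the shape of data2.json.
const sample = [
  { name: 'A', title: 'Series A', url: 'https://example.com/a', like: 1, comment: 0, view: 120 },
  { name: 'B', title: 'Series B', url: 'https://example.com/b', like: 5, comment: 2, view: 651 },
  { name: 'C', title: 'Series C', url: 'https://example.com/c', like: 0, comment: 0, view: 30 },
];

console.log(topByViews(sample, 2));
```

In a real run the input would be the parsed contents of data2.json rather than the sample array.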

That's it for this post. See you in the next one.
