Saturday, January 8, 2022

[SOLVED] How to avoid "module not found" error while calling scrapy project from crontab?

January 08, 2022 cron, python, pythonpath, scrapy, shell

Issue

I am currently building a small test project to learn how to use crontab on Linux (Ubuntu 20.04.2 LTS).

My crontab file looks like this:

* * * * * sh /home/path_to .../crontab_start_spider.sh >> /home/path_to .../log_python_test.log 2>&1

I want crontab to use the shell file below to start a Scrapy project. The output is stored in the file log_python_test.log.

My shell file (numbers are only for reference in this question):

0 #!/bin/bash

1 cd /home/luc/Documents/computing/tests/learning/morning
2 PATH=$PATH:/usr/local/bin
3 export PATH
4 PATH=$PATH:/home/luc/gen_env/lib/python3.7/site-packages
5 export PATH
6 scrapy crawl meteo

Some of you might be interested in the structure of my scrapy project, so here it is:

You might also want to see the code that I edited in the Scrapy project:

My spider: meteo.py

import scrapy
from morning.items import MorningItem


class MeteoSpider(scrapy.Spider):
    name = 'meteo'
    allowed_domains = ['meteo.gc.ca']
    start_urls = ['https://www.meteo.gc.ca/city/pages/qc-136_metric_f.html']

    def parse(self, response, **kwargs):
        # Extracting data from page
        condition = response.css('div.col-sm-4:nth-child(1) > dl:nth-child(1) > dd:nth-child(2)::text').get()
        pression = response.css('div.col-sm-4:nth-child(1) > dl:nth-child(1) > dd:nth-child(4)::text').get()
        temperature = response.css('div.brdr-rght-city:nth-child(2) > dl:nth-child(1) > dd:nth-child(2)::text').get()

        # Creating and filling the item
        item = MorningItem()
        item['condition'] = condition
        item['pression'] = pression
        item['temperature'] = temperature

        return item

My item: in items.py

import scrapy


class MorningItem(scrapy.Item):
    condition = scrapy.Field()
    pression = scrapy.Field()
    temperature = scrapy.Field()

My pipeline: in pipelines.py (this default pipeline is uncommented in settings.py)

import logging

from gtts import gTTS
import os
import random
from itemadapter import ItemAdapter


class MorningPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Message creation
        messages = ["Bon matin! J'èspère que vous avez bien dormi cette nuit. Voici le topo.",
                    "Bonjour Luc. Un bon petit café et on est parti.", "Saluto amigo. Voici ce que vous devez savoir."]
        message_of_the_day = random.choice(messages)

        # Add meteo to message
        message_of_the_day += f" Voici la météo. La condition: {adapter['condition']}. La pression: " \
                              f"{adapter['pression']} kilo-pascal. La température: {adapter['temperature']} celcius."

        if '-' in adapter['temperature']:
            message_of_the_day += " Vous devriez vous mettre un petit chandail."
        elif len(adapter['temperature']) == 3:
            if int(adapter['temperature'][0:2]) > 19:
                message_of_the_day += " Vous allez être bien en sandales."

        # Creating mp3
        language = 'fr-ca'
        output = gTTS(text=message_of_the_day, lang=language, slow=False)

        # Prepare output file emplacement and saving
        if os.path.exists("/home/luc/Music/output.mp3"):
            os.remove("/home/luc/Music/output.mp3")
        output.save("/home/luc/Music/output.mp3")

        # Playing mp3 and retrieving the output
        logging.info(f'First command output: {os.system("mpg123 /home/luc/Music/output.mp3")}')

        return item

I ran the project in my terminal without any problem (scrapy crawl meteo):

WARNING:gtts.lang:'fr-ca' has been deprecated, falling back to 'fr'. This fallback will be removed in a future version.
2021-06-04 12:18:21 [gtts.lang] WARNING: 'fr-ca' has been deprecated, falling back to 'fr'. This fallback will be removed in a future version.

...

stats:
{'downloader/request_bytes': 471,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 14325,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 21.002126,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 6, 4, 16, 18, 41, 658684),
 'item_scraped_count': 1,
 'log_count/DEBUG': 82,
 'log_count/INFO': 11,
 'log_count/WARNING': 1,
 'memusage/max': 60342272,
 'memusage/startup': 60342272,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 6, 4, 16, 18, 20, 656558)}
INFO:scrapy.core.engine:Spider closed (finished)
2021-06-04 12:18:41 [scrapy.core.engine] INFO: Spider closed (finished)

With only a small deprecation warning, I consider the scraping a success. The problem arises when it is run from crontab. Here is the output from log_python_test.log:

2021-06-04 12:00:02 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: morning)
2021-06-04 12:00:02 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.7 (default, May  6 2020, 14:51:16) - [GCC 9.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Linux-5.8.0-53-generic-x86_64-with-debian-bullseye-sid
2021-06-04 12:00:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-06-04 12:00:02 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'morning',
 'NEWSPIDER_MODULE': 'morning.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['morning.spiders']}
2021-06-04 12:00:02 [scrapy.extensions.telnet] INFO: Telnet Password: bf691c25dae7d218
2021-06-04 12:00:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-06-04 12:00:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-06-04 12:00:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
Unhandled error in Deferred:
2021-06-04 12:00:02 [twisted] CRITICAL: Unhandled error in Deferred:

Traceback (most recent call last):
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 192, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 196, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/home/luc/.local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/home/luc/.local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/home/luc/.local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 87, in crawl
    self.engine = self._create_engine()
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 101, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/core/scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/middleware.py", line 53, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 50, in load_object
    mod = import_module(module)
  File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import

  File "<frozen importlib._bootstrap>", line 983, in _find_and_load

  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked

  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked

  File "<frozen importlib._bootstrap_external>", line 728, in exec_module

  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed

  File "/home/luc/Documents/computing/tests/learning/morning/morning/pipelines.py", line 3, in <module>
    from gtts import gTTS
builtins.ModuleNotFoundError: No module named 'gtts'

2021-06-04 12:00:02 [twisted] CRITICAL:
Traceback (most recent call last):
  File "/home/luc/.local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 87, in crawl
    self.engine = self._create_engine()
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 101, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/core/scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/middleware.py", line 53, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 50, in load_object
    mod = import_module(module)
  File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/luc/Documents/computing/tests/learning/morning/morning/pipelines.py", line 3, in <module>
    from gtts import gTTS
ModuleNotFoundError: No module named 'gtts'

Suddenly, it can't find the gtts package anymore. It does not appear to be the only package it can't find: in a previous version I had from mutagen.mp3 import MP3 at the top of my pipelines.py, and importing it failed as well.
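One way to investigate this kind of discrepancy is to log which interpreter is running and where (or whether) it can find the module. The sketch below is only illustrative: describe_import is a hypothetical helper name, not part of Scrapy or gTTS, and it uses only the standard library. Running it once from the terminal and once from the cron job, then comparing the two outputs, should show whether both runs use the same Python.

```python
import importlib.util
import sys


def describe_import(module_name):
    """Report the running interpreter and where a top-level module would be loaded from."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,  # interpreter actually running this code
        "prefix": sys.prefix,          # environment root (differs inside a virtualenv)
        "found": spec is not None,     # False means an import would raise ModuleNotFoundError
        "origin": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    # 'gtts' is the module named in the traceback; any top-level module name works.
    for key, value in describe_import("gtts").items():
        print(f"{key}: {value}")
```

If the "executable" and "prefix" values differ between the terminal and the cron run, the two runs are not using the same environment, which would explain why a package installed in one is missing in the other.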

I wondered whether I had made a mistake when installing the gtts package, so I ran pip install gtts to make sure everything was right, and I got:

Requirement already satisfied: gtts in /home/luc/gen_env/lib/python3.7/site-packages (2.2.2)
Requirement already satisfied: six in /home/luc/gen_env/lib/python3.7/site-packages (from gtts) (1.15.0)
Requirement already satisfied: requests in /home/luc/gen_env/lib/python3.7/site-packages (from gtts) (2.24.0)
Requirement already satisfied: click in /home/luc/gen_env/lib/python3.7/site-packages (from gtts) (7.1.2)
Requirement already satisfied: chardet<4,>=3.0.2 in /home/luc/gen_env/lib/python3.7/site-packages (from requests->gtts) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /home/luc/gen_env/lib/python3.7/site-packages (from requests->gtts) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /home/luc/gen_env/lib/python3.7/site-packages (from requests->gtts) (2020.6.20)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/luc/gen_env/lib/python3.7/site-packages (from requests->gtts) (1.25.10)

gTTS also appears when I type pip list:

gTTS 2.2.2

I also made sure that I installed in the right environment. Respectively, here are the results of which python and which pip:

/home/luc/gen_env/bin/python

/home/luc/gen_env/bin/pip
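Note that the traceback in the cron log refers to /home/luc/.local/lib/python3.7/site-packages and /usr/local/lib/python3.7, not to /home/luc/gen_env, which suggests that cron starts a different Python than the one reported by which python. One possible way around this, sketched below, is to launch Scrapy through the environment's interpreter by absolute path. This assumes Scrapy is installed in gen_env, which the question does not confirm; build_crawl_command and run_crawl are hypothetical helper names.

```python
import subprocess

# Paths taken from the question; adjust them for another machine.
VENV_PYTHON = "/home/luc/gen_env/bin/python"
PROJECT_DIR = "/home/luc/Documents/computing/tests/learning/morning"


def build_crawl_command(spider_name, python=VENV_PYTHON):
    """Build a command that runs Scrapy with an explicit interpreter."""
    # "python -m scrapy" runs Scrapy's command line with that interpreter,
    # so imports resolve against that interpreter's own site-packages.
    return [python, "-m", "scrapy", "crawl", spider_name]


def run_crawl(spider_name, project_dir=PROJECT_DIR):
    """Run the spider from the project directory and return the exit code."""
    completed = subprocess.run(build_crawl_command(spider_name), cwd=project_dir)
    return completed.returncode


if __name__ == "__main__":
    # Only print the command here; call run_crawl("meteo") to actually start the spider.
    print(build_crawl_command("meteo"))
```

The design choice here is to avoid relying on whatever PATH cron happens to provide: naming the interpreter explicitly makes the job use the same packages as the interactive session.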

I thought I could resolve the problem by adding the fourth and fifth lines to my shell file, but without success (the output is the same). I'm pretty sure I have to add some path to PYTHONPATH or something like that, but I am not sure enough of what I am doing and I don't want to break anything.
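A likely reason lines 4 and 5 of the shell file have no effect is that PATH is only used to locate executables, while Python's import system consults sys.path, which can be extended through the PYTHONPATH environment variable. Here is a minimal sketch of the difference, using a child interpreter and a temporary directory as a stand-in for a site-packages folder (child_sys_path is a hypothetical helper name):

```python
import os
import subprocess
import sys
import tempfile


def child_sys_path(extra_env):
    """Start a fresh interpreter with extra environment variables and return its sys.path."""
    env = dict(os.environ)
    env.pop("PYTHONPATH", None)  # start from a clean slate for the comparison
    env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print('\\n'.join(sys.path))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as package_dir:
        longer_path = os.environ.get("PATH", "") + os.pathsep + package_dir
        via_path = child_sys_path({"PATH": longer_path})
        via_pythonpath = child_sys_path({"PYTHONPATH": package_dir})
        print("visible to imports when added to PATH:      ", package_dir in via_path)
        print("visible to imports when added to PYTHONPATH:", package_dir in via_pythonpath)
```

In a shell script, the equivalent change would be to export PYTHONPATH rather than PATH, though this only helps if the interpreter started by cron is compatible with the packages in that directory.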

Thanks in advance.

Solution

I found a solution to my problem. As I suspected, a directory was missing from Python's module search path: the directory that contains the gtts package.

Solution: If you have the same problem,

Find the package

I looked at that post

Add it to sys.path (this changes the module search path of the running interpreter only; the PYTHONPATH environment variable itself is not modified)

Add this code at the top of your script (in my case, the pipelines.py):

import sys
sys.path.append("/<the_path_to_your_package>")
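A slightly more defensive variant of the same idea is sketched below. add_package_dir is a hypothetical helper, and the directory used in the demonstration is a temporary stand-in for the real package location. The sketch also checks that changing sys.path affects only the running interpreter and leaves the PYTHONPATH environment variable as it was.

```python
import importlib
import os
import sys
import tempfile


def add_package_dir(path):
    """Append path to sys.path once; return True if it was added."""
    if path in sys.path:
        return False
    sys.path.append(path)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as package_dir:
        # Create a tiny module to stand in for an installed package.
        with open(os.path.join(package_dir, "demo_pkg_for_cron.py"), "w") as handle:
            handle.write("GREETING = 'bonjour'\n")

        pythonpath_before = os.environ.get("PYTHONPATH")
        add_package_dir(package_dir)
        importlib.invalidate_caches()  # make sure the newly written file is noticed

        demo = importlib.import_module("demo_pkg_for_cron")
        print("imported greeting:", demo.GREETING)
        print("PYTHONPATH unchanged:", os.environ.get("PYTHONPATH") == pythonpath_before)

        sys.path.remove(package_dir)  # clean up the demonstration entry
```

In the question's setup, the directory reported by pip was /home/luc/gen_env/lib/python3.7/site-packages, and the call would need to come before the from gtts import gTTS line in pipelines.py.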

Answered By - LukeRain

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
