* Mon Feb 18 2019 Hans-Peter Jansen
- Fix dependencies
- Enable tests
* Thu Feb 14 2019 Hans-Peter Jansen
- Update to 1.6.0 (2019-01-30):
  + Highlights:
    * better Windows support
    * Python 3.7 compatibility
    * big documentation improvements, including a switch from .extract_first() + .extract() API to .get() + .getall() API
    * feed exports, FilePipeline and MediaPipeline improvements
    * better extensibility: :signal:`item_error` and :signal:`request_reached_downloader` signals; from_crawler support for feed exporters, feed storages and dupefilters.
    * scrapy.contracts fixes and new features
    * telnet console security improvements, first released as a backport in :ref:`release-1.5.2`
    * clean-up of the deprecated code
    * various bug fixes, small new features and usability improvements across the codebase.
  + Selector API changes
    + While these are not changes in Scrapy itself, but rather in the parsel library which Scrapy uses for xpath/css selectors, these changes are worth mentioning here. Scrapy now depends on parsel >= 1.5, and Scrapy documentation is updated to follow recent parsel API conventions.
    + Most visible change is that .get() and .getall() selector methods are now preferred over .extract_first() and .extract(). We feel that these new methods result in more concise and readable code. See :ref:`old-extraction-api` for more details.
    + Note: There are currently no plans to deprecate .extract() and .extract_first() methods.
    + Another useful new feature is the introduction of Selector.attrib and SelectorList.attrib properties, which make it easier to get attributes of HTML elements. See :ref:`selecting-attributes`.
    + CSS selectors are cached in parsel >= 1.5, which makes them faster when the same CSS path is used many times. This is very common in case of Scrapy spiders: callbacks are usually called several times, on different pages.
    * If you're using custom Selector or SelectorList subclasses, a backwards incompatible change in parsel may affect your code.
      See parsel changelog for a detailed description, as well as for the full list of improvements.
  + Telnet console
    * Backwards incompatible: Scrapy's telnet console now requires username and password. See :ref:`topics-telnetconsole` for more details. This change fixes a security issue; see :ref:`release-1.5.2` release notes for details.
  + New extensibility features
    * from_crawler support is added to feed exporters and feed storages. This, among other things, allows accessing Scrapy settings from custom feed storages and exporters (:issue:`1605`, :issue:`3348`).
    * from_crawler support is added to dupefilters (:issue:`2956`); this allows accessing e.g. settings or a spider from a dupefilter.
    * :signal:`item_error` is fired when an error happens in a pipeline (:issue:`3256`)
    * :signal:`request_reached_downloader` is fired when Downloader gets a new Request; this signal can be useful e.g. for custom Schedulers (:issue:`3393`).
    * new SitemapSpider :meth:`~.SitemapSpider.sitemap_filter` method which allows selecting sitemap entries based on their attributes in SitemapSpider subclasses (:issue:`3512`).
    * Lazy loading of Downloader Handlers is now optional; this enables better initialization error handling in custom Downloader Handlers (:issue:`3394`).
  + New FilePipeline and MediaPipeline features
    * Expose more options for S3FilesStore: :setting:`AWS_ENDPOINT_URL`, :setting:`AWS_USE_SSL`, :setting:`AWS_VERIFY`, :setting:`AWS_REGION_NAME`. For example, this allows using alternative or self-hosted AWS-compatible providers (:issue:`2609`, :issue:`3548`).
    * ACL support for Google Cloud Storage: :setting:`FILES_STORE_GCS_ACL` and :setting:`IMAGES_STORE_GCS_ACL` (:issue:`3199`).
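As a rough illustration of the S3FilesStore options listed above, a project's settings.py could point FilesPipeline at a self-hosted S3-compatible service. The AWS_* setting names come from the release notes; the endpoint URL, bucket name and region are placeholder assumptions, not recommended values.

```python
# settings.py fragment (a sketch; every value below is a hypothetical placeholder)

# Where FilesPipeline stores downloads; the bucket name is made up.
FILES_STORE = "s3://example-bucket/files/"

# Options exposed for S3FilesStore in Scrapy 1.6.0.
AWS_ENDPOINT_URL = "http://localhost:9000"  # assumed self-hosted S3-compatible endpoint
AWS_USE_SSL = False                         # plain HTTP for the local endpoint assumed above
AWS_VERIFY = False                          # certificate verification is moot without SSL
AWS_REGION_NAME = "us-east-1"               # placeholder region
```

Credentials are deliberately left out of the sketch; how they are supplied depends on the deployment.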
  + scrapy.contracts improvements
    * Exceptions in contracts code are handled better (:issue:`3377`)
    * dont_filter=True is used for contract requests, which allows testing different callbacks with the same URL (:issue:`3381`)
    * request_cls attribute in Contract subclasses allows using different Request classes in contracts, for example FormRequest (:issue:`3383`).
    * Fixed errback handling in contracts, e.g. for cases where a contract is executed for a URL which returns a non-200 response (:issue:`3371`).
  + Usability improvements
    * more stats for RobotsTxtMiddleware (:issue:`3100`)
    * INFO log level is used to show telnet host/port (:issue:`3115`)
    * a message is added to IgnoreRequest in RobotsTxtMiddleware (:issue:`3113`)
    * better validation of url argument in Response.follow (:issue:`3131`)
    * non-zero exit code is returned from Scrapy commands when an error happens on spider initialization (:issue:`3226`)
    * Link extraction improvements: "ftp" is added to scheme list (:issue:`3152`); "flv" is added to common video extensions (:issue:`3165`)
    * better error message when an exporter is disabled (:issue:`3358`)
    * scrapy shell --help mentions syntax required for local files (./file.html) (:issue:`3496`).
    * Referer header value is added to RFPDupeFilter log messages (:issue:`3588`)
  + Bug fixes
    * fixed issue with extra blank lines in .csv exports under Windows (:issue:`3039`)
    * proper handling of pickling errors in Python 3 when serializing objects for disk queues (:issue:`3082`)
    * flags are now preserved when copying Requests (:issue:`3342`)
    * FormRequest.from_response clickdata shouldn't ignore elements with input[type=image] (:issue:`3153`).
    * FormRequest.from_response should preserve duplicate keys (:issue:`3247`)
  + Documentation improvements
    * Docs are re-written to suggest .get/.getall API instead of .extract/.extract_first.
      Also, :ref:`topics-selectors` docs are updated and re-structured to match latest parsel docs; they now contain more topics, such as :ref:`selecting-attributes` or :ref:`topics-selectors-css-extensions` (:issue:`3390`).
    * :ref:`topics-developer-tools` is a new tutorial which replaces old Firefox and Firebug tutorials (:issue:`3400`).
    * SCRAPY_PROJECT environment variable is documented (:issue:`3518`)
    * troubleshooting section is added to install instructions (:issue:`3517`)
    * improved links to beginner resources in the tutorial (:issue:`3367`, :issue:`3468`)
    * fixed :setting:`RETRY_HTTP_CODES` default values in docs (:issue:`3335`)
    * remove unused DEPTH_STATS option from docs (:issue:`3245`)
    * other cleanups (:issue:`3347`, :issue:`3350`, :issue:`3445`, :issue:`3544`, :issue:`3605`).
  + Deprecation removals
    + Compatibility shims for pre-1.0 Scrapy module names are removed (:issue:`3318`):
      * scrapy.command
      * scrapy.contrib (with all submodules)
      * scrapy.contrib_exp (with all submodules)
      * scrapy.dupefilter
      * scrapy.linkextractor
      * scrapy.project
      * scrapy.spider
      * scrapy.spidermanager
      * scrapy.squeue
      * scrapy.stats
      * scrapy.statscol
      * scrapy.utils.decorator
    + See :ref:`module-relocations` for more information, or use suggestions from Scrapy 1.5.x deprecation warnings to update your code.
    + Other deprecation removals:
      * Deprecated scrapy.interfaces.ISpiderManager is removed; please use scrapy.interfaces.ISpiderLoader.
      * Deprecated CrawlerSettings class is removed (:issue:`3327`).
      * Deprecated Settings.overrides and Settings.defaults attributes are removed (:issue:`3327`, :issue:`3359`).
  + Other improvements, cleanups
    * All Scrapy tests now pass on Windows; Scrapy testing suite is executed in a Windows environment on CI (:issue:`3315`).
    * Python 3.7 support (:issue:`3326`, :issue:`3150`, :issue:`3547`).
    * Testing and CI fixes (:issue:`3526`, :issue:`3538`, :issue:`3308`, :issue:`3311`, :issue:`3309`, :issue:`3305`, :issue:`3210`, :issue:`3299`)
    * scrapy.http.cookies.CookieJar.clear accepts "domain", "path" and "name" optional arguments (:issue:`3231`).
    * additional files are included to sdist (:issue:`3495`)
    * code style fixes (:issue:`3405`, :issue:`3304`)
    * unneeded .strip() call is removed (:issue:`3519`)
    * collections.deque is used to store MiddlewareManager methods instead of a list (:issue:`3476`)
- Update to 1.5.2 (2019-01-22):
  + Security bugfix: Telnet console extension can be easily exploited by rogue websites POSTing content to http://localhost:6023. We haven't found a way to exploit it from Scrapy, but it is very easy to trick a browser into doing so, which elevates the risk for local development environments.
  + The fix is backwards incompatible; it enables telnet user-password authentication by default with a randomly generated password. If you can't upgrade right away, please consider setting :setting:`TELNET_CONSOLE_PORT` to a non-default value.
  + See :ref:`telnet console` documentation for more info
  + Backport fix for CI build failure under GCE environment due to boto import error.
- Update to 1.5.1 (2018-07-12):
  + This is a maintenance release with important bug fixes, but no new features:
    * O(N^2) gzip decompression issue which affected Python 3 and PyPy is fixed (:issue:`3281`)
    * skipping of TLS validation errors is improved (:issue:`3166`)
    * Ctrl-C handling is fixed in Python 3.5+ (:issue:`3096`)
    * testing fixes (:issue:`3092`, :issue:`3263`)
    * documentation improvements (:issue:`3058`, :issue:`3059`, :issue:`3089`, :issue:`3123`, :issue:`3127`, :issue:`3189`, :issue:`3224`, :issue:`3280`, :issue:`3279`, :issue:`3201`, :issue:`3260`, :issue:`3284`, :issue:`3298`, :issue:`3294`).
- Adjust dependencies
- Add separate -doc package
* Tue Feb 13 2018 jacobwinskiAATTgmail.com
- Update spec file to singlespec
- Update to Scrapy 1.5.0
  * Backwards Incompatible Changes
    + Scrapy 1.5 drops support for Python 3.3.
    + Default Scrapy User-Agent now uses https link to scrapy.org (issue 2983). This is technically backwards-incompatible; override USER_AGENT if you relied on old value.
    + Logging of settings overridden by custom_settings is fixed; this is technically backwards-incompatible because the logger changes from [scrapy.utils.log] to [scrapy.crawler]. If you’re parsing Scrapy logs, please update your log parsers (issue 1343).
    + LinkExtractor now ignores m4v extension by default; this is a change in behavior.
    + 522 and 524 status codes are added to RETRY_HTTP_CODES (issue 2851)
  * New features
    + Support tags in Response.follow (issue 2785)
    + Support for ptpython REPL (issue 2654)
    + Google Cloud Storage support for FilesPipeline and ImagesPipeline (issue 2923).
    + New --meta option of the “scrapy parse” command allows passing additional request.meta (issue 2883)
    + Populate spider variable when using shell.inspect_response (issue 2812)
    + Handle HTTP 308 Permanent Redirect (issue 2844)
    + Add 522 and 524 to RETRY_HTTP_CODES (issue 2851)
    + Log versions information at startup (issue 2857)
    + scrapy.mail.MailSender now works in Python 3 (it requires Twisted 17.9.0)
    + Connections to proxy servers are reused (issue 2743)
    + Add template for a downloader middleware (issue 2755)
    + Explicit message for NotImplementedError when parse callback not defined (issue 2831)
    + CrawlerProcess got an option to disable installation of root log handler (issue 2921)
    + LinkExtractor now ignores m4v extension by default
    + Better log messages for responses over DOWNLOAD_WARNSIZE and DOWNLOAD_MAXSIZE limits (issue 2927)
    + Show warning when a URL is put to Spider.allowed_domains instead of a domain (issue 2250).
  * Bug fixes
    + Fix logging of settings overridden by custom_settings; this is technically backwards-incompatible because the logger changes from [scrapy.utils.log] to [scrapy.crawler], so please update your log parsers if needed (issue 1343)
    + Default Scrapy User-Agent now uses https link to scrapy.org (issue 2983). This is technically backwards-incompatible; override USER_AGENT if you relied on old value.
    + Fix PyPy and PyPy3 test failures, support them officially (issue 2793, issue 2935, issue 2990, issue 3050, issue 2213, issue 3048)
    + Fix DNS resolver when DNSCACHE_ENABLED=False (issue 2811)
    + Add cryptography for Debian Jessie tox test env (issue 2848)
    + Add verification to check if Request callback is callable (issue 2766)
    + Port extras/qpsclient.py to Python 3 (issue 2849)
    + Use getfullargspec under the hood for Python 3 to stop DeprecationWarning (issue 2862)
    + Update deprecated test aliases (issue 2876)
    + Fix SitemapSpider support for alternate links (issue 2853)
  * Docs
    + Added missing bullet point for the AUTOTHROTTLE_TARGET_CONCURRENCY setting. (issue 2756)
    + Update Contributing docs, document new support channels (issue 2762, issue 3038)
    + Include references to Scrapy subreddit in the docs
    + Fix broken links; use https:// for external links (issue 2978, issue 2982, issue 2958)
    + Document CloseSpider extension better (issue 2759)
    + Use pymongo.collection.Collection.insert_one() in MongoDB example (issue 2781)
    + Spelling mistake and typos (issue 2828, issue 2837, issue 2884, issue 2924)
    + Clarify CSVFeedSpider.headers documentation (issue 2826)
    + Document DontCloseSpider exception and clarify spider_idle (issue 2791)
    + Update “Releases” section in README (issue 2764)
    + Fix rst syntax in DOWNLOAD_FAIL_ON_DATALOSS docs (issue 2763)
    + Small fix in description of startproject arguments (issue 2866)
    + Clarify data types in Response.body docs (issue 2922)
    + Add a note about request.meta['depth'] to DepthMiddleware docs (issue 2374)
    + Add a note about request.meta['dont_merge_cookies'] to CookiesMiddleware docs (issue 2999)
    + Up-to-date example of project structure (issue 2964, issue 2976)
    + A better example of ItemExporters usage (issue 2989)
    + Document from_crawler methods for spider and downloader middlewares (issue 3019)
- Update to Scrapy 1.4.0
  * Deprecations and Backwards Incompatible Changes
    + Default to canonicalize=False in
      scrapy.linkextractors.LinkExtractor (issue 2537, fixes issue 1941 and issue 1982): warning, this is technically backwards-incompatible
    + Enable memusage extension by default (issue 2539, fixes issue 2187); this is technically backwards-incompatible so please check if you have any non-default MEMUSAGE_* options set.
    + EDITOR environment variable now takes precedence over EDITOR option defined in settings.py (issue 1829); Scrapy default settings no longer depend on environment variables. This is technically a backwards incompatible change.
    + Spider.make_requests_from_url is deprecated (issue 1728, fixes issue 1495).
  * New Features
    + Accept proxy credentials in proxy request meta key (issue 2526)
    + Support brotli-compressed content; requires optional brotlipy (issue 2535)
    + New response.follow shortcut for creating requests (issue 1940)
    + Added flags argument and attribute to Request objects (issue 2047)
    + Support Anonymous FTP (issue 2342)
    + Added retry/count, retry/max_reached and retry/reason_count/ stats to RetryMiddleware (issue 2543)
    + Added httperror/response_ignored_count and httperror/response_ignored_status_count/ stats to HttpErrorMiddleware (issue 2566)
    + Customizable Referrer policy in RefererMiddleware (issue 2306)
    + New data: URI download handler (issue 2334, fixes issue 2156)
    + Log cache directory when HTTP Cache is used (issue 2611, fixes issue 2604)
    + Warn users when project contains duplicate spider names (fixes issue 2181)
    + CaselessDict now accepts Mapping instances and not only dicts (issue 2646)
    + Media downloads, with FilesPipelines or ImagesPipelines, can now optionally handle HTTP redirects using the new MEDIA_ALLOW_REDIRECTS setting (issue 2616, fixes issue 2004)
    + Accept non-complete responses from websites using a new DOWNLOAD_FAIL_ON_DATALOSS setting (issue 2590, fixes issue 2586)
    + Optional pretty-printing of JSON and XML items via FEED_EXPORT_INDENT setting (issue 2456, fixes issue 1327)
    + Allow dropping fields in
      FormRequest.from_response formdata when None value is passed (issue 667)
    + Per-request retry times with the new max_retry_times meta key (issue 2642)
    + python -m scrapy as a more explicit alternative to scrapy command (issue 2740)
  * Bug fixes
    + LinkExtractor now strips leading and trailing whitespaces from attributes (issue 2547, fixes issue 1614)
    + Properly handle whitespaces in action attribute in FormRequest (issue 2548)
    + Buffer CONNECT response bytes from proxy until all HTTP headers are received (issue 2495, fixes issue 2491)
    + FTP downloader now works on Python 3, provided you use Twisted>=17.1 (issue 2599)
    + Use body to choose response type after decompressing content (issue 2393, fixes issue 2145)
    + Always decompress Content-Encoding: gzip at HttpCompressionMiddleware stage (issue 2391)
    + Respect custom log level in Spider.custom_settings (issue 2581, fixes issue 1612)
    + ‘make htmlview’ fix for macOS (issue 2661)
    + Remove “commands” from the command list (issue 2695)
    + Fix duplicate Content-Length header for POST requests with empty body (issue 2677)
    + Properly cancel large downloads, i.e.
      above DOWNLOAD_MAXSIZE (issue 1616)
    + ImagesPipeline: fixed processing of transparent PNG images with palette (issue 2675)
  * Cleanups & Refactoring
    + Tests: remove temp files and folders (issue 2570), fixed ProjectUtilsTest on OS X (issue 2569), use portable pypy for Linux on Travis CI (issue 2710)
    + Separate building request from _requests_to_follow in CrawlSpider (issue 2562)
    + Remove “Python 3 progress” badge (issue 2567)
    + Add a couple more lines to .gitignore (issue 2557)
    + Remove bumpversion prerelease configuration (issue 2159)
    + Add codecov.yml file (issue 2750)
    + Set context factory implementation based on Twisted version (issue 2577, fixes issue 2560)
    + Add omitted self arguments in default project middleware template (issue 2595)
    + Remove redundant slot.add_request() call in ExecutionEngine (issue 2617)
    + Catch more specific os.error exception in FSFilesStore (issue 2644)
    + Change “localhost” test server certificate (issue 2720)
    + Remove unused MEMUSAGE_REPORT setting (issue 2576)
  * Documentation
    + Binary mode is required for exporters (issue 2564, fixes issue 2553)
    + Mention issue with FormRequest.from_response due to bug in lxml (issue 2572)
    + Use single quotes uniformly in templates (issue 2596)
    + Document ftp_user and ftp_password meta keys (issue 2587)
    + Removed section on deprecated contrib/ (issue 2636)
    + Recommend Anaconda when installing Scrapy on Windows (issue 2477, fixes issue 2475)
    + FAQ: rewrite note on Python 3 support on Windows (issue 2690)
    + Rearrange selector sections (issue 2705)
    + Remove __nonzero__ from SelectorList docs (issue 2683)
    + Mention how to disable request filtering in documentation of DUPEFILTER_CLASS setting (issue 2714)
    + Add sphinx_rtd_theme to docs setup readme (issue 2668)
    + Open file in text mode in JSON item writer example (issue 2729)
    + Clarify allowed_domains example (issue 2670)
- Update to Scrapy 1.3.3
  * Bug fixes
    + Make SpiderLoader raise ImportError again by default for missing dependencies and wrong
      SPIDER_MODULES. These exceptions were silenced as warnings since 1.3.0. A new setting is introduced to toggle between warning or exception if needed; see SPIDER_LOADER_WARN_ONLY for details.
- Update to Scrapy 1.3.2
  * Bug fixes
    + Preserve request class when converting to/from dicts (utils.reqser) (issue 2510).
    + Use consistent selectors for author field in tutorial (issue 2551).
    + Fix TLS compatibility in Twisted 17+ (issue 2558)
- Update to Scrapy 1.3.1
  * New features
    + Support 'True' and 'False' string values for boolean settings (issue 2519); you can now do something like scrapy crawl myspider -s REDIRECT_ENABLED=False.
    + Support kwargs with response.xpath() to use XPath variables and ad-hoc namespaces declarations; this requires at least Parsel v1.1 (issue 2457).
    + Add support for Python 3.6 (issue 2485).
    + Run tests on PyPy (warning: some tests still fail, so PyPy is not supported yet).
  * Bug fixes
    + Enforce DNS_TIMEOUT setting (issue 2496).
    + Fix view command; it was a regression in v1.3.0 (issue 2503).
    + Fix tests regarding *_EXPIRES settings with Files/Images pipelines (issue 2460).
    + Fix name of generated pipeline class when using basic project template (issue 2466).
    + Fix compatibility with Twisted 17+ (issue 2496, issue 2528).
    + Fix scrapy.Item inheritance on Python 3.6 (issue 2511).
    + Enforce numeric values for components order in SPIDER_MIDDLEWARES, DOWNLOADER_MIDDLEWARES, EXTENSIONS and SPIDER_CONTRACTS (issue 2420).
  * Documentation
    + Reword Code of Conduct section and upgrade to Contributor Covenant v1.4 (issue 2469).
    + Clarify that passing spider arguments converts them to spider attributes (issue 2483).
    + Document formid argument on FormRequest.from_response() (issue 2497).
    + Add .rst extension to README files (issue 2507).
    + Mention LevelDB cache storage backend (issue 2525).
    + Use yield in sample callback code (issue 2533).
    + Add note about HTML entities decoding with .re()/.re_first() (issue 1704).
    + Typos (issue 2512, issue 2534, issue 2531).
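The 1.3.1 note on 'True' and 'False' string values matters because overrides passed with -s arrive as plain strings. The sketch below approximates how such a string could be mapped to a boolean; to_bool is a hypothetical helper written for illustration and is not Scrapy's own code.

```python
def to_bool(value):
    """Illustrative helper (an assumption, not Scrapy's implementation)."""
    try:
        # Integers and numeric strings: 0 or "0" is False, 1 or "1" is True.
        return bool(int(value))
    except ValueError:
        # Non-numeric strings: accept the spellings mentioned in the release note.
        if value in ("True", "true"):
            return True
        if value in ("False", "false"):
            return False
        raise ValueError(f"unsupported boolean value: {value!r}")


# `scrapy crawl myspider -s REDIRECT_ENABLED=False` hands over the string "False".
print(to_bool("False"))  # False
print(to_bool("True"))   # True
print(to_bool("0"))      # False
```

Without the explicit string check, bool("False") would be True, because any non-empty string is truthy in Python.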
  * Cleanups
    + Remove redundant check in MetaRefreshMiddleware (issue 2542).
    + Faster checks in LinkExtractor for allow/deny patterns (issue 2538).
    + Remove dead code supporting old Twisted versions (issue 2544).
- Update to Scrapy 1.3.0
  * New Features
    + MailSender now accepts single strings as values for to and cc arguments (issue 2272)
    + scrapy fetch url, scrapy shell url and fetch(url) inside scrapy shell now follow HTTP redirections by default (issue 2290); see fetch and shell for details.
    + HttpErrorMiddleware now logs errors with INFO level instead of DEBUG; this is technically backwards incompatible so please check your log parsers.
    + By default, logger names now use a long-form path, e.g. [scrapy.extensions.logstats], instead of the shorter “top-level” variant of prior releases (e.g. [scrapy]); this is backwards incompatible if you have log parsers expecting the short logger name part. You can switch back to short logger names using LOG_SHORT_NAMES set to True.
  * Dependencies & Cleanups
    + Scrapy now requires Twisted >= 13.1 which is the case for many Linux distributions already.
    + As a consequence, we got rid of scrapy.xlib.tx.* modules, which copied some of Twisted code for users stuck with an “old” Twisted version
    + ChunkedTransferMiddleware is deprecated and removed from the default downloader middlewares.
- Update to Scrapy 1.2.3
  * Packaging fix: disallow unsupported Twisted versions in setup.py
- Update to Scrapy 1.2.2
  * Bug fixes
    + Fix a cryptic traceback when a pipeline fails on open_spider() (issue 2011)
    + Fix embedded IPython shell variables (fixing issue 396 that re-appeared in 1.2.0, fixed in issue 2418)
    + A couple of patches when dealing with robots.txt:
      - handle (non-standard) relative sitemap URLs (issue 2390)
      - handle non-ASCII URLs and User-Agents in Python 2 (issue 2373)
  * Documentation
    + Document "download_latency" key in Request’s meta dict (issue 2033)
    + Remove page on (deprecated & unsupported) Ubuntu packages from ToC (issue 2335)
    + A few fixed typos (issue 2346, issue 2369, issue 2369, issue 2380) and clarifications (issue 2354, issue 2325, issue 2414)
  * Other changes
    + Advertise conda-forge as Scrapy’s official conda channel (issue 2387)
    + More helpful error messages when trying to use .css() or .xpath() on non-Text Responses (issue 2264)
    + startproject command now generates a sample middlewares.py file (issue 2335)
    + Add more dependencies’ version info in scrapy version verbose output (issue 2404)
    + Remove all *.pyc files from source distribution (issue 2386)
- Update to Scrapy 1.2.1
  * Bug fixes
    + Include OpenSSL’s more permissive default ciphers when establishing TLS/SSL connections (issue 2314).
    + Fix “Location” HTTP header decoding on non-ASCII URL redirects (issue 2321).
  * Documentation
    + Fix JsonWriterPipeline example (issue 2302).
    + Various notes: issue 2330 on spider names, issue 2329 on middleware methods processing order, issue 2327 on getting multi-valued HTTP headers as lists.
  * Other changes
    + Removed www.
      from start_urls in built-in spider templates (issue 2299).
- Update to Scrapy 1.2.0
  * New Features
    + New FEED_EXPORT_ENCODING setting to customize the encoding used when writing items to a file. This can be used to turn off \uXXXX escapes in JSON output. This is also useful for those wanting something else than UTF-8 for XML or CSV output (issue 2034).
    + startproject command now supports an optional destination directory to override the default one based on the project name (issue 2005).
    + New SCHEDULER_DEBUG setting to log requests serialization failures (issue 1610).
    + JSON encoder now supports serialization of set instances (issue 2058).
    + Interpret application/json-amazonui-streaming as TextResponse (issue 1503).
    + scrapy is imported by default when using shell tools (shell, inspect_response) (issue 2248).
  * Bug fixes
    + DefaultRequestHeaders middleware now runs before UserAgent middleware (issue 2088). Warning: this is technically backwards incompatible, though we consider this a bug fix.
    + HTTP cache extension and plugins that use the .scrapy data directory now work outside projects (issue 1581). Warning: this is technically backwards incompatible, though we consider this a bug fix.
    + Selector does not allow passing both response and text anymore (issue 2153).
    + Fixed logging of wrong callback name with scrapy parse (issue 2169).
    + Fix for an odd gzip decompression bug (issue 1606).
    + Fix for selected callbacks when using CrawlSpider with scrapy parse (issue 2225).
    + Fix for invalid JSON and XML files when spider yields no items (issue 872).
    + Implement flush() for StreamLogger avoiding a warning in logs (issue 2125).
  * Refactoring
    + canonicalize_url has been moved to w3lib.url (issue 2168).
  * Documentation
    + Grammar fixes: issue 2128, issue 1566.
    + Download stats badge removed from README (issue 2160).
    + New scrapy architecture diagram (issue 2165).
    + Updated Response parameters documentation (issue 2197).
    + Reworded misleading RANDOMIZE_DOWNLOAD_DELAY description (issue 2190).
    + Add StackOverflow as a support channel (issue 2257).
- Update to Scrapy 1.1.4
  * Packaging fix: disallow unsupported Twisted versions in setup.py
- Update to Scrapy 1.1.3
  * Bug fixes
    + Class attributes for subclasses of ImagesPipeline and FilesPipeline work as they did before 1.1.1 (issue 2243, fixes issue 2198)
  * Documentation
    + Overview and tutorial rewritten to use http://toscrape.com websites (issue 2236, issue 2249, issue 2252).
- Update to Scrapy 1.1.2
  * Bug fixes
    + Introduce a missing IMAGES_STORE_S3_ACL setting to override the default ACL policy in ImagesPipeline when uploading images to S3 (note that default ACL policy is “private” – instead of “public-read” – since Scrapy 1.1.0)
    + IMAGES_EXPIRES default value set back to 90 (the regression was introduced in 1.1.1)
- Update to Scrapy 1.1.1
  * Bug fixes
    + Add “Host” header in CONNECT requests to HTTPS proxies (issue 2069)
    + Use response body when choosing response class (issue 2001, fixes issue 2000)
    + Do not fail on canonicalizing URLs with wrong netlocs (issue 2038, fixes issue 2010)
    + a few fixes for HttpCompressionMiddleware (and SitemapSpider):
      - Do not decode HEAD responses (issue 2008, fixes issue 1899)
      - Handle charset parameter in gzip Content-Type header (issue 2050, fixes issue 2049)
      - Do not decompress gzip octet-stream responses (issue 2065, fixes issue 2063)
    + Catch (and ignore with a warning) exception when verifying certificate against IP-address hosts (issue 2094, fixes issue 2092)
    + Make FilesPipeline and ImagesPipeline backward compatible again regarding the use of legacy class attributes for customization (issue 1989, fixes issue 1985)
  * New features
    + Enable genspider command outside project folder (issue 2052)
    + Retry HTTPS CONNECT TunnelError by default (issue 1974)
  * Documentation
    + FEED_TEMPDIR setting at lexicographical position (commit 9b3c72c)
    + Use idiomatic .extract_first() in overview (issue 1994)
    + Update
      years in copyright notice (commit c2c8036)
    + Add information and example on errbacks (issue 1995)
    + Use “url” variable in downloader middleware example (issue 2015)
    + Grammar fixes (issue 2054, issue 2120)
    + New FAQ entry on using BeautifulSoup in spider callbacks (issue 2048)
    + Add notes about scrapy not working on Windows with Python 3 (issue 2060)
    + Encourage complete titles in pull requests (issue 2026)
  * Tests
    + Upgrade py.test requirement on Travis CI and Pin pytest-cov to 2.2.1 (issue 2095)
* Wed Mar 29 2017 jacobwinskiAATTgmail.com
- Update spec file: change python-pyasn1 to python2-pyasn1
* Sun Jun 05 2016 jacobwinskiAATTgmail.com
- Cleanup spec file
- Add Conflicts: python3-Scrapy since now Scrapy supports Python 3 and both create identically named executables
* Thu Jun 02 2016 jacobwinskiAATTgmail.com
- Update to 1.1.0
  * Most important features and bug fixes:
    + Scrapy 1.1 has beta Python 3 support (requires Twisted >= 15.5). See Beta Python 3 Support for more details and some limitations.
    + Hot new features:
      - Item loaders now support nested loaders (issue 1467).
      - FormRequest.from_response improvements (issue 1382, issue 1137).
      - Added setting AUTOTHROTTLE_TARGET_CONCURRENCY and improved AutoThrottle docs (issue 1324).
      - Added response.text to get body as unicode (issue 1730).
      - Anonymous S3 connections (issue 1358).
      - Deferreds in downloader middlewares (issue 1473). This enables better robots.txt handling (issue 1471).
      - HTTP caching now follows RFC2616 more closely, added settings HTTPCACHE_ALWAYS_STORE and HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS (issue 1151).
      - Selectors were extracted to the parsel library (issue 1409). This means you can use Scrapy Selectors without Scrapy and also upgrade the selectors engine without needing to upgrade Scrapy.
      - HTTPS downloader now does TLS protocol negotiation by default, instead of forcing TLS 1.0. You can also set the SSL/TLS method using the new DOWNLOADER_CLIENT_TLS_METHOD.
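Several of the 1.1.0 features above are driven by settings. A minimal settings.py sketch of how they might be combined follows; the chosen values are illustrative assumptions only, and the HTTPCACHE_ENABLED, HTTPCACHE_POLICY and AUTOTHROTTLE_ENABLED lines are not part of this changelog but are assumed here so that the new settings have an effect.

```python
# settings.py fragment (a sketch; the values are examples, not recommendations)

# Pin the TLS method instead of relying on the default protocol negotiation.
DOWNLOADER_CLIENT_TLS_METHOD = "TLSv1.2"

# AutoThrottle: aim for roughly two parallel requests per remote site.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# HTTP cache with the RFC2616 policy and the two settings added in 1.1.0.
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
HTTPCACHE_ALWAYS_STORE = True
HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = ["no-store"]
```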
    + These bug fixes may require your attention:
      - Don’t retry bad requests (HTTP 400) by default (issue 1289). If you need the old behavior, add 400 to RETRY_HTTP_CODES.
      - Fix shell files argument handling (issue 1710, issue 1550). If you try scrapy shell index.html it will try to load the URL http://index.html; use scrapy shell ./index.html to load a local file.
      - Robots.txt compliance is now enabled by default for newly-created projects (issue 1724). Scrapy will also wait for robots.txt to be downloaded before proceeding with the crawl (issue 1735). If you want to disable this behavior, update ROBOTSTXT_OBEY in settings.py file after creating a new project.
      - Exporters now work on unicode, instead of bytes by default (issue 1080). If you use PythonItemExporter, you may want to update your code to disable binary mode which is now deprecated.
      - Accept XML node names containing dots as valid (issue 1533).
      - When uploading files or images to S3 (with FilesPipeline or ImagesPipeline), the default ACL policy is now “private” instead of “public”. Warning: backwards incompatible! You can use FILES_STORE_S3_ACL to change it.
      - We’ve reimplemented canonicalize_url() for more correct output, especially for URLs with non-ASCII characters (issue 1947). This could change link extractors output compared to previous scrapy versions. This may also invalidate some cache entries you could still have from pre-1.1 runs. Warning: backwards incompatible!
  * Beta Python 3 Support with the following limitations:
    + Scrapy has not been tested on Windows with Python 3
    + Sending emails is not supported
    + FTP download handler is not supported
    + Telnet console is not supported
  * Additional New Features and Enhancements:
    + Scrapy now has a Code of Conduct (issue 1681).
    + Command line tool now has completion for zsh (issue 934).
    + Improvements to scrapy shell:
      - Support for bpython and configure preferred Python shell via SCRAPY_PYTHON_SHELL (issue 1100, issue 1444).
      - Support URLs without scheme (issue 1498). Warning: backwards incompatible!
      - Bring back support for relative file path (issue 1710, issue 1550).
    + Added MEMUSAGE_CHECK_INTERVAL_SECONDS setting to change default check interval (issue 1282).
    + Download handlers are now lazy-loaded on first request using their scheme (issue 1390, issue 1421).
    + HTTPS download handlers do not force TLS 1.0 anymore; instead, OpenSSL’s SSLv23_method()/TLS_method() is used, allowing it to negotiate the highest TLS protocol version the remote host supports (issue 1794, issue 1629).
    + RedirectMiddleware now skips the status codes from handle_httpstatus_list on spider attribute or in Request’s meta key (issue 1334, issue 1364, issue 1447).
    + Form submission:
      - now works with