by Paul Gerbino
From Wanderers To Web Robots: A History Of Content Scraping, The Damage It Is Doing, And What You Can Do About It.
Publishers have unknowingly been blind-sided – and many have just turned a blind eye – to the theft of their content and intellectual property by web scraping. From my conversations with media executives at industry conferences, I truly believe they do not have the facts or the history of that theft. This is an earnest attempt to fill that knowledge void and offer helpful advice.
Media companies are contributing to a vast ocean of information. Whether publishing daily, weekly, monthly, in collections, books, and other products and derivatives, or, in some cases, every minute, the content waves are constantly churning. All sorts of organizations that consume this content were asking early on: “How do we navigate its depths and unearth the data?” “How do we gather this data and make it actionable for the business growth we need?”
Enter content scraping, a technique as old as the web itself, evolving alongside the way we access and utilize online information.
The Early Days
It began in 1989 with the birth of the World Wide Web by Tim Berners-Lee. This revolutionary platform, designed for information sharing between researchers, laid the groundwork for web scraping. The Web contained three crucial elements for scraping: URLs (designating specific locations), hyperlinks (allowing navigation), and the HTML code that structures webpages.
However, the initial use case for web scraping wasn’t about data collection. In 1993, the “World Wide Web Wanderer” emerged – not to extract information, but to measure the size of the web at that time. Soon after came JumpStation, the first search engine that, unlike its human-curated counterparts, relied on an automated crawler to scour the web and index content.
The early 2000s witnessed a shift. While scraping remained essential for search engines, the rise of APIs (Application Programming Interfaces) offered a more structured way to access website data. Pioneered by companies like Salesforce and eBay, APIs allowed developers to request specific data sets programmatically, reducing the need for complex scraping techniques.
However, scraping didn’t disappear. It simply transformed. With the explosion of web data, businesses saw the potential for scraping to gain valuable insights – competitor pricing, market trends, and even customer sentiment. This fuelled the development of sophisticated web scraping tools. Unlike the early crawlers, these tools were designed for human users, often with drag-and-drop interfaces that allowed even non-programmers to extract data.
The Moreover.com Effect
Enter moreover.com (Moreover Technologies), a pivotal player in the web scraping landscape. Launched in 1998 and founded by Nick Denton, David Galbraith, and Angus Bankes, moreover.com offered a cloud-based scraping solution specifically designed for businesses. It boasted features like:
- Simplified scraping: Pre-built templates for popular websites and an intuitive interface for custom scraping scripts.
- Scalability: Being cloud-based allowed businesses to scale scraping operations without worrying about overloading their own servers.
- Data management: Capabilities for data cleaning, transformation, and storage made it very easy for businesses to analyze scraped data.
The arrival of moreover.com marked a turning point. Web scraping wasn’t just for tech-savvy individuals anymore; it became a viable option for businesses of all sizes, and a revenue generator for Moreover. This, however, also brought an ethical debate around scraping to the forefront: Moreover was profiting from media companies’ intellectual property (IP) while the media companies earned no royalties from the use of their content.
Moreover was later purchased by LexisNexis on October 20, 2014.
A Financial Analyst’s Secret Weapon
Financial markets are powerful, data-driven beasts, and they were among the earliest users of scraped data. Before scraping, analysts sifted through mountains of paper and information to identify trends, predict performance, and make informed investment decisions. Enter web scraping, a powerful tool that has become an analyst’s secret weapon.
Traditionally, analysts relied on manual data collection – a time-consuming and error-prone process. Web scraping automated this task, extracting valuable data points from financial websites. Stock prices, competitor analysis, economic indicators – you name it, scraping can get it.
How scraping empowers financial analysts:
- Real-time Insights: Markets move fast. Scraping allows analysts to gather up-to-the-minute data, providing a sharper view of current trends and potential opportunities.
- In-depth Research: Scraping opens doors to a wider range of data sources, enabling analysts to delve deeper into company financials, industry reports, and news articles – all crucial for comprehensive analysis.
- Competitive Edge: By scraping competitor data, analysts can track pricing strategies, identify emerging players, and gain valuable insights to inform investment decisions.
Business-to-business (B2B) media became a prime target for financial analysis. Because it covers industries of interest to the financial world and offers insight and expertise, B2B media content has been sought after by those producing analysis and derivative financial reports. B2B media companies have also been guilty of wearing blinders on the issue of scraping while their content was being stolen.
Media Monitoring Keeps Ears to the Web
In today’s digital age, where information flows at breakneck speed, staying on top of media coverage is crucial for businesses and organizations. This is where media monitoring and evaluation (MME) organizations come in. But how do they track mentions across a vast and ever-evolving online landscape? Enter web scraping, a vital tool in their arsenal.
MME organizations use web scraping to automate the collection of data from online sources, including news websites, social media platforms, and blogs. Here’s how scraping empowers them:
- Comprehensive Coverage: Scraping goes beyond traditional media outlets, allowing MME organizations to capture brand mentions on B2B media, forums, review sites, and other online spaces, providing a holistic view of brand sentiment.
- Real-time Alerts: By setting up scraping filters, MME organizations can receive instant notifications whenever their client’s brand is mentioned online. This allows for swift responses to potential crises or emerging trends.
- Sentiment Analysis: Scraping harvested data can be fed into sentiment analysis tools, helping MME organizations understand the tone of online conversations surrounding their clients. This provides valuable insights into brand perception and public opinion.
However, scraping for MME organizations comes with its own set of challenges. Respecting website terms of service (ToS) and robots.txt files is crucial to avoiding ethical and legal issues, and dealing with dynamic content and CAPTCHAs can require sophisticated scraping techniques. Yet many MME organizations, and the companies that do the scraping for them, believe they have a right to this content and do not respect robots.txt files or ToS language.
Despite all this, web scraping remains a cornerstone for MME organizations. As online platforms continue to evolve, scraping tools are also becoming more advanced, allowing MME organizations to stay ahead of the curve and deliver comprehensive media insights to their clients, often without proper permissions.
Scraping continues to evolve, and today it includes AI technologies, increased automation, and cloud-based scraping solutions. Machine learning plays a role in identifying patterns and adapting to anti-scraping measures. Regulations and legal frameworks around scraping are advancing at a far slower pace. The balance between data accessibility and website protection is not quite tipping in the publisher’s favor.
A Double-Edged Sword
Web scraping, the process of extracting data from websites, has become a powerful tool for businesses and researchers. But this convenience comes with a layer of complexity, raising both ethical and legal concerns.
Legality in the Grey Area
The legal landscape surrounding web scraping is far from clear-cut. Copyright laws protect the creative expression within a website’s content. Scraping large amounts of data, especially copyrighted material, can be a violation. However, scraping factual information is generally considered acceptable.
Further complicating things are website Terms of Service (ToS). If a website explicitly forbids scraping in its ToS, then doing so becomes unauthorized access, potentially falling under the trespass to chattels doctrine.
Ethical Considerations
Even when legal, scraping can be unethical. Scraping excessively can overload a website’s servers, hindering its performance for legitimate users. There’s also the issue of privacy. Scraping personal data, even publicly available information, can raise privacy concerns, especially when done without proper anonymization.
It also raises the issue of compensation. Media companies pay to create their content. Scraping bypasses the requirement to pay for the content being lifted from a site: those involved avoid having to license and pay for the content they use. That model is not sustainable and will eventually catch up with scrapers in the courts, just not as quickly as media companies need.
Third-Party Use: Ethical Burden
The responsibility doesn’t end with the scraper. Those who use scraped data also have ethical obligations. Using poorly obtained data, especially data that violates copyright or privacy laws, can lead to legal trouble down the line.
An Attempt by Publishers to Stop It
As scraping becomes more prevalent with AI, concerns are rising. Many websites were not built to handle excessive automated requests, which can overload servers and create potential security breaches. This has led to the rise of “anti-scraping” measures such as CAPTCHAs, dynamic content generation, and IP blocking, making scraping more challenging.
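To make the defensive side concrete: in practice this blocking usually happens at the CDN or web-server layer, but the core idea fits in a few lines of application code. Below is a minimal per-IP rate-limiting sketch in Python; the Flask framework, the request threshold, and the time window are illustrative assumptions, not a prescription.

```python
# A minimal sketch of per-IP rate limiting, one of the "anti-scraping"
# measures described above. Assumes a Flask application; the threshold
# and window values are illustrative only.
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

MAX_REQUESTS = 60      # allow at most 60 requests...
WINDOW_SECONDS = 60    # ...per rolling 60-second window, per client IP
hits = defaultdict(deque)  # client IP -> timestamps of recent requests


@app.before_request
def block_aggressive_clients():
    now = time.monotonic()
    recent = hits[request.remote_addr]
    # Drop timestamps that have aged out of the window.
    while recent and now - recent[0] > WINDOW_SECONDS:
        recent.popleft()
    recent.append(now)
    if len(recent) > MAX_REQUESTS:
        # 429 Too Many Requests; a real deployment might also log the
        # offender to feed the cease-and-desist workflow discussed later.
        abort(429)


@app.route("/")
def index():
    return "Hello, human readers."
```

Commercial bot-management services do far more than this, but even a crude limiter like the sketch above illustrates why scrapers turned to the evasion techniques described next.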
The response? The scraping community is becoming more sophisticated. Techniques to counteract anti-scraping measures, such as rotating IP addresses, using headless browsers (browsers without a visual interface), and parsing complex JavaScript, emerged to bypass these hurdles. The ethical debate around scraping is intensifying. While some see it as legitimate research and a necessary business tool, others, including Creative Licensing International (CLI), argue it is a form of unauthorized data extraction for which fair royalties need to be paid.
Today, even a paywall does not stop web scraping. The News Media Alliance published a white paper detailing the theft of its members’ content by AI developers who use media companies’ content to train AI systems without permission. NMA advocates for responsible AI development that acknowledges the role creators play and fairly compensates them.
One of Many Paths Forward
Transparency is key. Websites should clearly outline their scraping policies in their ToS. Scrapers should be respectful, adhering to robots.txt guidelines and avoiding overwhelming servers. Finally, those using scraped data should ensure its ethical and legal provenance.
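One concrete, if voluntary, expression of such a policy is the site’s robots.txt file. Here is a minimal sketch; the bot names and paths shown are only examples of what a publisher might disallow, and the file relies entirely on crawlers choosing to honor it.

```
# robots.txt served at the site root, e.g. https://example.com/robots.txt
# Disallow specific crawlers entirely (names shown are examples).
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow everything else, but keep automated agents out of
# subscriber-only paths (the /premium/ path is hypothetical).
User-agent: *
Disallow: /premium/
```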
Web scraping offers valuable opportunities, but navigating its ethical and legal complexities is crucial. By prioritizing responsible practices, all parties involved can reap the benefits while minimizing the risks.
A Conclusion…Or Is It?
Content scraping is an integral part of the web’s history, from its humble beginnings as a tool for measuring the web’s size to its current role as a powerful data collection technique. As the web continues to evolve, so too will web scraping, constantly adapting to new challenges and opportunities. Whether it remains a force for good or becomes a digital menace depends on how we, as users and developers, choose to wield this powerful tool.
As a publisher, here are some immediate suggestions:
- Clearly outline scraping policies in terms of service.
- Make sure your robots.txt file disallows scraping bots (a sample appears above).
- Set up monitoring technology to flag bots and scrapers (a minimal detection sketch follows this list) and create a workflow to respond to scraping activity with “cease and desist” notifications.
- Advocate for legal frameworks, individually or collectively with other publishers, to ensure fair compensation for your content.
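As a starting point for the monitoring suggestion above, here is a minimal Python sketch that scans a web-server access log for likely scrapers. The log path, request threshold, and user-agent keywords are hypothetical examples to adapt to your own environment.

```python
# Scan a web-server access log (combined log format assumed) for clients
# that look like scrapers, so they can be flagged for follow-up such as
# a cease-and-desist notification. All thresholds are illustrative.
import re
from collections import Counter

LOG_PATH = "access.log"          # assumed: combined log format
REQUEST_THRESHOLD = 1000         # flag IPs with more hits than this
SUSPECT_AGENTS = ("python-requests", "scrapy", "curl", "headless")

# Matches: client IP ... "request line" status size "referer" "user-agent"
LINE_RE = re.compile(r'^(\S+) .*?"[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"')

hits_per_ip = Counter()
flagged_agents = set()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.match(line)
        if not match:
            continue
        ip, user_agent = match.groups()
        hits_per_ip[ip] += 1
        if any(token in user_agent.lower() for token in SUSPECT_AGENTS):
            flagged_agents.add((ip, user_agent))

print("High-volume clients (possible scrapers):")
for ip, count in hits_per_ip.most_common():
    if count < REQUEST_THRESHOLD:
        break
    print(f"  {ip}: {count} requests")

print("\nClients announcing automation in their user agent:")
for ip, user_agent in sorted(flagged_agents):
    print(f"  {ip}: {user_agent}")
```

In practice publishers typically layer reports like this with commercial bot-management tools, but even a simple summary can drive the response workflow described in the list above.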
Don’t be a media company that loses control because it failed to stay abreast of web scraping and its protections. Advocate for legal frameworks and make sure your organization has dedicated resources around these important issues. We are part of the community coming together to fight for your rights and your money.
Paul Gerbino is the President of Creative Licensing International. He is an expert in digital and content strategy, licensing, product development, advertising, and copyright protections. He has an exemplary track record of transforming low-performing projects into highly profitable revenue streams, evident in the creation and launch of innovative digital media products and advertising programs for B2B, B2C, STM, and academic publishers. Paul is passionate about helping publishers improve their performance, productivity, and profitability in the evolving digital landscape.