Deep Web Data Scanning, Scraping and Analyzing

Confidentiality Notice: This case study is confidential and intended for internal reference or vetted prospect presentations only.

Background and Business Context

Organizations face an unprecedented volume of cyber threats. Cybercriminals increasingly use deep web platforms, including forums, encrypted marketplaces, and ransomware leak sites, to coordinate, trade stolen data, and distribute malware. Unlike the clearnet (the publicly accessible web indexed by search engines), deep web sites require specific protocols and tools to access, which makes them harder to monitor.

Threat intelligence teams need reliable tools for collecting, analyzing, and acting on data from these hidden sources. This case study highlights how our team built and maintained a comprehensive scraping and harvesting infrastructure to systematically gather intelligence from deep web ecosystems and integrated these feeds into a Security Operations Center (SOC) environment.

Data Analyzing / Scraping / Scanning

In-house Scraping Framework

We maintained and continuously improved an in-house framework built on Selenium and designed specifically for deep web sites. The framework enabled automated collection of intelligence from:

Deep Web forums (where cybercriminals exchange tactics, malware, or stolen credentials)
Marketplaces (selling illicit goods and services)
Ransomware sites (public shaming portals used by threat actors to pressure victims into payment)

The automation handled common barriers such as login forms, CAPTCHA challenges, and session management.

Reference: Automated web data extraction is a core method in cyber threat intelligence, enabling proactive detection of malicious activities.

Pythonic Harvester

For each targeted platform, we wrote Python-based harvester configurations, ensuring adaptability to different site structures and content formats. These configurations allowed:

Consistent data extraction across diverse deep web ecosystems
Flexible adjustments when sites changed layouts or defenses
Integration with the broader scraping framework
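
To illustrate the idea of per-site harvester configurations, here is a minimal sketch. It is not the framework's actual code: the HarvesterConfig class, its fields, the two sites, and their markup are all hypothetical. It also uses regular expressions on raw HTML only to stay dependency-free, whereas a Selenium-based framework would more likely query the rendered DOM.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class HarvesterConfig:
    """Per-site extraction rules (all field names here are hypothetical)."""
    site_name: str
    post_pattern: str  # regex with named groups: author, title


def harvest(config: HarvesterConfig, html: str) -> list[dict]:
    """Apply one site's pattern and return records in a common shape."""
    records = []
    for match in re.finditer(config.post_pattern, html, flags=re.DOTALL):
        record = {"source": config.site_name}
        record.update({k: v.strip() for k, v in match.groupdict().items()})
        records.append(record)
    return records


# Two made-up sites with different markup, harvested through the same function.
FORUM_A = HarvesterConfig(
    site_name="forum-a",
    post_pattern=r'<div class="post"><b>(?P<author>[^<]+)</b><h2>(?P<title>[^<]+)</h2>',
)
MARKET_B = HarvesterConfig(
    site_name="market-b",
    post_pattern=r'<li data-seller="(?P<author>[^"]+)">(?P<title>[^<]+)</li>',
)

sample_a = '<div class="post"><b>alice</b><h2>Fresh dump</h2></div>'
sample_b = '<ul><li data-seller="bob"> VPN access </li></ul>'

records = harvest(FORUM_A, sample_a) + harvest(MARKET_B, sample_b)
```

When a site changes its layout, only that site's configuration needs to change; the shared harvest function and the record shape stay the same.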

Harvesting Apps

Beyond our main framework, we also built standalone harvesting applications for specialized use cases, including:

Bluesky social network HTTP/XRPC harvester app – implemented in a producer-consumer architecture using Python and RabbitMQ. This extended coverage to emerging decentralized platforms.
Structured data pipelines for deep web Forums and Markets – again leveraging producer-consumer patterns (Python, RabbitMQ, AWS) to normalize, enrich, and distribute collected intelligence.

This modular approach allowed rapid scaling of new data sources while keeping system performance stable.
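
The producer-consumer pattern behind these apps can be sketched in a few lines. This example is an assumption-laden stand-in: it uses Python's in-process queue.Queue instead of RabbitMQ, and the function names and record fields are invented for illustration. The shape is the same, though: one stage fetches raw items, and another normalizes them independently.

```python
import queue
import threading

SENTINEL = None  # signals the consumer that no more work is coming


def producer(pages: list[str], out: queue.Queue) -> None:
    """Fetch stage: here it just forwards pre-made strings."""
    for page in pages:
        out.put(page)
    out.put(SENTINEL)


def consumer(inbox: queue.Queue, results: list[dict]) -> None:
    """Normalize stage: turn each raw item into a structured record."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            break
        text = item.strip()
        results.append({"text": text.lower(), "length": len(text)})


def run_pipeline(pages: list[str]) -> list[dict]:
    work = queue.Queue(maxsize=10)  # bounded queue gives back-pressure
    results: list[dict] = []
    worker = threading.Thread(target=consumer, args=(work, results))
    worker.start()
    producer(pages, work)
    worker.join()
    return results


processed = run_pipeline(["  Post ONE ", "Post two"])
```

With a message broker in place of the in-process queue, producers and consumers can run on separate machines and be scaled independently, which is the main reason to choose this layout.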

In-house Database Parsing Library

Many deep web leaks and breaches are distributed as SQL/CSV dumps. We maintained and refactored an internal database parsing library, enabling:

Parsing at scale across heterogeneous dump formats
Standardization into a unified schema
Integration into downstream analysis pipelines

This ensured that leaked data could be ingested quickly, mapped to organizational assets, and correlated with potential threats.
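
As a rough illustration of standardizing heterogeneous dumps, the sketch below maps differently named CSV columns onto one schema. The alias table, the schema fields, and the sample dumps are hypothetical, and the internal library certainly handles more formats (SQL dumps, for instance) than this example does.

```python
import csv
import io

# Hypothetical mapping from column names seen in dumps to one unified schema.
COLUMN_ALIASES = {
    "email": "email", "e-mail": "email", "mail": "email",
    "user": "username", "username": "username", "login": "username",
    "pass": "password", "password": "password", "pwd": "password",
}


def parse_dump(text: str, delimiter: str = ",") -> list[dict]:
    """Read a delimited dump and rename known columns to the unified schema."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for raw in reader:
        row = {}
        for column, value in raw.items():
            key = COLUMN_ALIASES.get((column or "").strip().lower())
            if key is not None:  # unknown columns are dropped in this sketch
                row[key] = (value or "").strip()
        rows.append(row)
    return rows


dump_a = "E-Mail,Login,Pwd\nalice@example.com,alice,hunter2\n"
dump_b = "user;mail;signup_ip\nbob;bob@example.com;203.0.113.7\n"

unified = parse_dump(dump_a) + parse_dump(dump_b, delimiter=";")
```

Both dumps end up with the same keys, so downstream correlation code only has to understand one record shape.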

In-house Utility Application

To streamline infrastructure management, we developed an internal scheduling application that replaced disparate cron jobs across servers. Benefits included:

Centralized job management
Error handling and retry logic
Easier debugging and monitoring
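
The retry logic mentioned above could look something like the following sketch. The function name, its parameters, and the exponential backoff policy are assumptions made for illustration; the actual scheduler's behavior is not described in detail here.

```python
import time
from typing import Callable


def run_with_retry(job: Callable[[], str], max_attempts: int = 3,
                   base_delay: float = 0.0) -> dict:
    """Run a job, retrying on any exception with exponential backoff."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = job()
            return {"ok": True, "result": result, "attempts": attempt, "errors": errors}
        except Exception as exc:  # a real scheduler would likely be more selective
            errors.append(str(exc))
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
    return {"ok": False, "result": None, "attempts": max_attempts, "errors": errors}


# A made-up job that fails twice before succeeding.
calls = {"count": 0}


def flaky_job() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError(f"attempt {calls['count']} failed")
    return "done"


outcome = run_with_retry(flaky_job)
```

Returning the collected error messages alongside the result is one way to support the debugging and monitoring goals: every failed attempt leaves a trace even when the job eventually succeeds.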

Port Scan System

To augment scraping with external attack-surface intelligence, we built a port-scanning and data propagation system:

Raw scans run with zmap, a high-speed network scanner
Python wrappers for orchestration and data management
Results uploaded to AWS S3 for centralized processing

This helped identify exposed services and potential vulnerabilities across monitored assets.
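
A Python wrapper around zmap might be structured roughly as below. This is a sketch under assumptions: the function names and record fields are hypothetical, the scan itself is not executed, and the upload to S3 is omitted. It uses only zmap's -p (target port), -r (packets per second), and -o (output file) options, and it tolerates a header row in the output because some zmap versions write one.

```python
import ipaddress


def build_zmap_command(port: int, targets: list[str], output_path: str,
                       rate_pps: int = 10000) -> list[str]:
    """Build an argv list for zmap; running it is left to subprocess.run."""
    return ["zmap", "-p", str(port), "-r", str(rate_pps), "-o", output_path, *targets]


def parse_zmap_output(text: str, port: int) -> list[dict]:
    """Turn one-address-per-line output into records, skipping any header."""
    records = []
    for line in text.splitlines():
        candidate = line.strip()
        try:
            ipaddress.ip_address(candidate)
        except ValueError:
            continue  # header row such as "saddr", or a blank line
        records.append({"ip": candidate, "port": port, "state": "open"})
    return records


command = build_zmap_command(443, ["192.0.2.0/24"], "scan-443.csv")
hosts = parse_zmap_output("saddr\n192.0.2.10\n192.0.2.44\n\n", port=443)
```

Building the command as a list rather than a shell string avoids quoting problems when it is later passed to subprocess.run, and validating each line with the ipaddress module keeps malformed rows out of the stored results.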

Global Crawler

Our global crawling system, initially based on a pub-sub model, was later migrated to Kafka for scalability and resilience.

Written in Python
Supported scanning via nuclei templates (GET requests against targets to identify vulnerabilities)
Fully cloud-native deployment on AWS

Parent DNS Project

As part of our discovery augmentation efforts, we built the Parent DNS service:

Implemented in Python
Consumes data from resolver topics
Caching layer built in PostgreSQL

This enabled enrichment of domain intelligence and faster resolution workflows.
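
The details of the Parent DNS service are not spelled out here, so the following is only one plausible reading: deriving the chain of parent domains for a hostname and caching the result. Everything in it is an assumption, including the class and function names and the table layout. sqlite3 stands in for PostgreSQL so that the example is self-contained, and the suffix handling is naive (a production version would need the Public Suffix List for names such as example.co.uk).

```python
import sqlite3


def parent_domains(fqdn: str) -> list[str]:
    """List each ancestor of a name, nearest first, stopping before the TLD."""
    labels = fqdn.rstrip(".").lower().split(".")
    return [".".join(labels[i:]) for i in range(1, len(labels) - 1)]


class ParentCache:
    """Tiny cache table; sqlite3 stands in for PostgreSQL in this sketch."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE parents (fqdn TEXT PRIMARY KEY, chain TEXT)")
        self.misses = 0

    def lookup(self, fqdn: str) -> list[str]:
        row = self.db.execute(
            "SELECT chain FROM parents WHERE fqdn = ?", (fqdn,)
        ).fetchone()
        if row is not None:
            return row[0].split(",") if row[0] else []
        self.misses += 1
        chain = parent_domains(fqdn)
        self.db.execute("INSERT INTO parents VALUES (?, ?)", (fqdn, ",".join(chain)))
        return chain


cache = ParentCache()
first = cache.lookup("api.eu.example.com")
second = cache.lookup("api.eu.example.com")  # served from the cache table
```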

TPI Scans (Third-Party Integrated Scans)

We implemented multiple third-party scans to broaden coverage:

SSL and HSTS crawls
DNS scans (DKIM, DMARC, MX, SOA records)
Log4j vulnerability checks via nuclei templates
Checks for acceptance of weak SSL/TLS versions

Tools used:

SSL scans with zgrab2
DNS scans with zdns
SSL verification written in Go
Wrappers and orchestration in Python

These scans provided both compliance and vulnerability visibility for monitored infrastructures.
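
To make the DNS compliance checks more concrete, here is a small sketch of how a DMARC TXT record, once retrieved by a scanner, could be assessed. The function names and the result fields are hypothetical, and the checks are deliberately simplified compared with the full DMARC specification.

```python
def parse_dmarc(txt_record: str) -> dict:
    """Split a DMARC TXT record into tag/value pairs (tags lowercased)."""
    tags = {}
    for part in txt_record.split(";"):
        if "=" not in part:
            continue
        tag, value = part.split("=", 1)
        tags[tag.strip().lower()] = value.strip()
    return tags


def assess_dmarc(txt_record: str) -> dict:
    """Flag records that are missing, malformed, or not enforcing."""
    tags = parse_dmarc(txt_record)
    valid = tags.get("v") == "DMARC1" and "p" in tags
    policy = tags.get("p", "").lower() if valid else None
    return {
        "valid": valid,
        "policy": policy,
        "enforcing": policy in ("quarantine", "reject"),
        "has_reporting": "rua" in tags,
    }


strict = assess_dmarc("v=DMARC1; p=reject; rua=mailto:dmarc@example.com")
monitor_only = assess_dmarc("v=DMARC1; p=none")
not_dmarc = assess_dmarc("v=spf1 -all")
```

A record with p=none is valid but only monitors mail, so a compliance report would typically treat it differently from quarantine or reject, which is why the sketch separates "valid" from "enforcing".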

Full Stack Development Contributions

Backoffice Platform

Our team developed and modernized a back-office management platform supporting centralized control for the ASI Module. Capabilities included:

User permission and workflow management
Legacy UI modernization with improved performance and database optimizations
Theme switcher (light/dark/auto) for UX improvement
Audit trail system for compliance and accountability
Checkmate integration for monitoring entities in one place
Automated scan scheduling

Stack: Laravel, Livewire, Alpine.js, Tailwind CSS, Docker, PostgreSQL, Git.

Checkmate Microservice

Checkmate validated consumer authorization for job execution based on defined business rules. Our contributions:

Entity management and retrieval functionality
API-only architecture cleanup (removal of default Laravel components)
Search and filtering capabilities for Constraint Rules
Development workflow improvements

Stack: Laravel, Docker, MySQL, Git.

Frontend Contributions

Asset Slideout Component

We delivered the Asset Slideout React component to replace the outdated AssetModal. It integrated with the new PAPI endpoint, providing:

Comprehensive asset views with multiple tables and nested modals
Faster performance due to optimized API calls
Improved navigation and UX for large, complex datasets

Stack: React, TypeScript, React Query, Styled Components, Context API, Figma.

Explorer Component Refactoring

We refactored the Explorer component, central to browsing and filtering assets.

Goals achieved:

Improved performance on large datasets
Cleaner, modular codebase with reusable components
Prepared foundation for new features

Stack: React, TypeScript, Styled Components, React Query, Context API, Figma, Storybook.

Results and Benefits

Proactive threat intelligence – Continuous scraping of deep web forums and marketplaces provided early detection of malicious campaigns and data leaks.
Streamlined operations – In-house frameworks, schedulers, and crawlers replaced fragmented tooling.
Improved SOC efficiency – Integrated pipelines fed directly into SOC monitoring, enabling faster triage and incident response.
Scalability – Kafka-based migration and modular harvesters allowed for rapid expansion as new platforms emerged.
Enhanced UX – Backoffice and frontend refactoring delivered improved performance and usability for analysts.

Challenges and Mitigation

Constantly evolving deep web sites – mitigated with adaptable Pythonic harvesters and modular frameworks.
Data volume and complexity – addressed with structured pipelines and optimized database management.
UI modernization risks – managed through gradual migration, strong QA, and collaboration across design, backend, and frontend teams.

Conclusion

This engagement demonstrates how combining staff augmentation with SOC engineering expertise enables clients to build robust threat intelligence pipelines. By monitoring and harvesting data from deep web ecosystems, enriching it with third-party scans, and modernizing analyst-facing applications, we helped strengthen detection capabilities and operational resilience.

Our work showcases the importance of multidisciplinary expertise in cybersecurity: from Python-based scraping, port scanning, and data pipelines, to full-stack application development and user experience optimization.


BlueGrid.io Content Team
