Confidentiality Notice: This case study is confidential and intended for internal reference or vetted prospect presentations only.
Background and Business Context
Organizations operating today in cyberspace face an unprecedented volume of threats. Cybercriminals increasingly leverage deep web platforms, including forums, encrypted marketplaces, and ransomware leak sites, to coordinate, trade stolen data, and distribute malware. Unlike the clearnet (the publicly accessible web indexed by search engines), deep web sites require specific protocols and tools to access, making monitoring them more complex.
Threat intelligence teams need reliable tools for collecting, analyzing, and acting on data from these hidden sources. This case study highlights how our team built and maintained a comprehensive scraping and harvesting infrastructure to systematically gather intelligence from deep web ecosystems and integrated these feeds into a Security Operations Center (SOC) environment.
Data Analysis / Scraping / Scanning
In-house Scraping Framework
We maintained and continuously improved an in-house framework built on Selenium, specifically designed for deep web websites. This framework enables automated collection of intelligence from:
- Deep Web forums (where cybercriminals exchange tactics, malware, or stolen credentials)
- Marketplaces (selling illicit goods and services)
- Ransomware sites (public shaming portals used by threat actors to pressure victims into payment)
The automation capabilities allowed us to get past common barriers such as login forms and CAPTCHA challenges, and to maintain authenticated sessions across page loads.
Reference: Automated web data extraction is a core method in cyber threat intelligence, enabling proactive detection of malicious activities.
Pythonic Harvester
For each targeted platform, we wrote Python-based harvester configurations, ensuring adaptability to different site structures and content formats. These configurations allowed:
- Consistent data extraction across diverse deep web ecosystems
- Flexible adjustments when sites changed layouts or defenses
- Integration with the broader scraping framework
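As a rough illustration of what a per-site harvester configuration could look like, the sketch below pairs a small config object with a generic extractor. It is not the framework's actual code: the class names, field names, and the choice of the standard-library HTML parser are all assumptions made for the example.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class HarvesterConfig:
    """Per-site settings. Every field name here is hypothetical."""
    site_name: str
    base_url: str
    post_tag: str            # HTML tag that wraps one post
    post_class: str          # CSS class that marks a post container
    requires_login: bool = False


class PostExtractor(HTMLParser):
    """Collects the text of every element matching the config's tag and class."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.posts = []
        self._depth = 0      # greater than 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Track nested elements of the same tag so we close at the right one.
            if tag == self.config.post_tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.config.post_tag and self.config.post_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == self.config.post_tag:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace before storing the post text.
                self.posts.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_posts(html, config):
    """Return the text of each post found in html, in document order."""
    parser = PostExtractor(config)
    parser.feed(html)
    parser.close()
    return parser.posts
```

With this shape, adapting to a layout change means editing one config entry (for example, a new post_class) rather than touching the extraction logic.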
Harvesting Apps
Beyond our main framework, we also built standalone harvesting applications for specialized use cases, including:
- Bluesky social network HTTP/XRPC harvester app – implemented in a producer-consumer architecture using Python and RabbitMQ. This extended coverage to emerging decentralized platforms.
- Structured data pipelines for deep web Forums and Markets – again leveraging producer-consumer patterns (Python, RabbitMQ, AWS) to normalize, enrich, and distribute collected intelligence.
This modular approach allowed rapid scaling of new data sources while keeping system performance stable.
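The producer-consumer pattern mentioned above can be sketched in a few lines. To keep the example self-contained, an in-process queue.Queue stands in for RabbitMQ, and the normalize step and record fields are hypothetical; the production apps would publish to and consume from a broker instead.

```python
import queue
import threading

SENTINEL = None  # placed on the queue to tell a consumer to stop


def producer(raw_posts, work_queue, consumer_count):
    """Push harvested items onto the queue, then one stop signal per consumer."""
    for post in raw_posts:
        work_queue.put(post)
    for _ in range(consumer_count):
        work_queue.put(SENTINEL)


def normalize(post):
    """Hypothetical normalization: trim fields and lowercase the author."""
    return {"author": post["author"].strip().lower(), "body": post["body"].strip()}


def consumer(work_queue, results, lock):
    """Pull items until the stop signal arrives, normalizing each one."""
    while True:
        item = work_queue.get()
        if item is SENTINEL:
            break
        record = normalize(item)
        with lock:
            results.append(record)


def run_pipeline(raw_posts, consumer_count=2):
    """Wire one producer to several consumers and return the normalized records."""
    work_queue = queue.Queue(maxsize=100)  # bounded, so a fast producer blocks
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=consumer, args=(work_queue, results, lock))
        for _ in range(consumer_count)
    ]
    for worker in workers:
        worker.start()
    producer(raw_posts, work_queue, consumer_count)
    for worker in workers:
        worker.join()
    return results
```

Because consumers run concurrently, the order of the returned records is not guaranteed. Decoupling the two sides this way is what lets harvesting and enrichment scale independently.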
In-house Database Parsing Library
Many deep web leaks and breaches are distributed as SQL/CSV dumps. We maintained and refactored an internal database parsing library, enabling:
- Parsing at scale across heterogeneous dump formats
- Standardization into a unified schema
- Integration into downstream analysis pipelines
This ensured that leaked data could be ingested quickly, mapped to organizational assets, and correlated with potential threats.
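To make the idea of standardizing heterogeneous dumps concrete, here is a minimal sketch for the CSV case. The alias table, the three-field unified schema, and the function name are assumptions for illustration; the internal library's real schema is not described in this document.

```python
import csv
import io

# Hypothetical mapping from column names seen in dumps to unified field names.
COLUMN_ALIASES = {
    "email": "email", "e-mail": "email", "mail": "email",
    "username": "username", "user": "username", "login": "username",
    "password": "password", "pass": "password", "passwd": "password",
}
UNIFIED_FIELDS = ("email", "username", "password")


def parse_csv_dump(text, delimiter=","):
    """Map each row of a CSV dump onto the unified schema.

    Unknown columns are ignored, and missing or empty values become None.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    records = []
    for row in reader:
        record = dict.fromkeys(UNIFIED_FIELDS)
        for column, value in row.items():
            if column is None:
                # Row had more values than the header; skip the overflow.
                continue
            field = COLUMN_ALIASES.get(column.strip().lower())
            if field and value:
                record[field] = value.strip()
        if record["email"]:
            record["email"] = record["email"].lower()
        records.append(record)
    return records
```

A dump with the header `E-Mail;Login;Passwd` and one with `mail,user,pass` would both produce records with the same three keys, which is what allows downstream correlation against organizational assets.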
In-house Utility Application
To streamline infrastructure management, we developed an internal scheduling application that replaced disparate cron jobs across servers. Benefits included:
- Centralized job management
- Error handling and retry logic
- Easier debugging and monitoring
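The retry behavior listed above might look something like the following sketch. The function name, the exponential backoff policy, and the default values are assumptions; the scheduler's actual retry strategy is not specified here.

```python
import time


def run_with_retries(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run job() and retry on failure with exponential backoff.

    Returns (result, attempts_used). Re-raises the last exception once
    max_attempts is exhausted. The sleep argument is injectable so the
    backoff can be observed in tests without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except Exception:
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** (attempt - 1))
```

Wrapping every scheduled job in one helper like this is part of what makes a central scheduler easier to debug than scattered cron entries, since failures and retry counts surface in one place.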
Port Scan System
To augment scraping with external attack-surface intelligence, we built a port-scanning and data propagation system:
- Raw scans run with zmap, a high-speed network scanner
- Python wrappers for orchestration and data management
- Results uploaded to AWS S3 for centralized processing
This helped identify exposed services and potential vulnerabilities across monitored assets.
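A Python wrapper around zmap could be as small as the sketch below, which only builds the command line and the S3 object key. The function names, default rate, and key layout are hypothetical. The flags used (-p for target port, -r for packets per second, -o for output file) are standard zmap options; actually running the scan and uploading the file are left as comments because they need network access and credentials.

```python
import ipaddress
from datetime import date


def build_zmap_command(target_cidr, port, output_path, rate_pps=10000):
    """Return the argv list for a single-port zmap scan of target_cidr."""
    ipaddress.ip_network(target_cidr)  # raises ValueError on a malformed CIDR
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return ["zmap", "-p", str(port), "-r", str(rate_pps), "-o", output_path, target_cidr]


def s3_key_for_scan(scan_date, port, prefix="port-scans"):
    """Hypothetical key layout: one object per scan date and port."""
    return f"{prefix}/{scan_date.isoformat()}/port-{port}.csv"


# Orchestration, not executed here (bucket name is an assumption):
#   subprocess.run(build_zmap_command("192.0.2.0/24", 443, "/tmp/out.csv"), check=True)
#   boto3.client("s3").upload_file("/tmp/out.csv", "example-bucket",
#                                  s3_key_for_scan(date.today(), 443))
```

Passing the command as a list rather than a shell string avoids quoting problems when targets or paths come from configuration.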
Global Crawler
Our global crawling system, initially based on a pub-sub model, was later migrated to Kafka for scalability and resilience.
- Written in Python
- Supported scanning via nuclei templates (GET requests against targets to identify vulnerabilities)
- Fully cloud-native deployment on AWS
Parent DNS Project
As part of our discovery augmentation efforts, we built the Parent DNS service:
- Implemented in Python
- Consumes data from resolver topics
- Caching layer built in PostgreSQL
This enabled enrichment of domain intelligence and faster resolution workflows.
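The caching layer can be pictured as a keyed table with an expiry check. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere; the table name, columns, and TTL policy are assumptions, and a PostgreSQL driver such as psycopg would use %s placeholders instead of ?.

```python
import sqlite3
import time


class ResolutionCache:
    """Stores one answer per domain and treats entries older than the TTL as misses."""

    def __init__(self, ttl_seconds=3600, clock=time.time):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE dns_cache ("
            "domain TEXT PRIMARY KEY, answer TEXT NOT NULL, stored_at REAL NOT NULL)"
        )
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting

    def put(self, domain, answer):
        """Insert or refresh the cached answer for domain."""
        self.conn.execute(
            "INSERT INTO dns_cache (domain, answer, stored_at) VALUES (?, ?, ?) "
            "ON CONFLICT(domain) DO UPDATE SET "
            "answer = excluded.answer, stored_at = excluded.stored_at",
            (domain.lower(), answer, self.clock()),
        )

    def get(self, domain):
        """Return the cached answer, or None if absent or expired."""
        row = self.conn.execute(
            "SELECT answer, stored_at FROM dns_cache WHERE domain = ?",
            (domain.lower(),),
        ).fetchone()
        if row is None or self.clock() - row[1] > self.ttl:
            return None
        return row[0]
```

A consumer of resolver messages would call get first and only trigger a fresh lookup on a miss, which is where the faster resolution workflows come from.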
TPI Scans (Third-Party Integrated Scans)
We implemented multiple third-party scans to broaden coverage:
- SSL and HSTS crawls
- DNS scans (DKIM, DMARC, MX, SOA records)
- Log4j vulnerability checks via nuclei templates
- Weak SSL version acceptance checks
Tools used:
- SSL scans with zgrab2
- DNS scans with zdns
- SSL verification written in Go
- Wrappers and orchestration in Python
These scans provided both compliance and vulnerability visibility for monitored infrastructures.
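As one example of turning raw DNS scan output into a finding, the sketch below classifies a DMARC TXT record. It assumes the TXT value has already been retrieved (for instance by a zdns run); the function names and the finding labels are hypothetical, and the parser is deliberately simplified compared with the full DMARC grammar.

```python
def parse_dmarc(txt_record):
    """Parse a DMARC TXT value into a dict of tag -> value.

    Returns None if the record is not a DMARC record (v tag is not DMARC1).
    """
    tags = {}
    for part in txt_record.split(";"):
        if "=" not in part:
            continue
        key, _, value = part.partition("=")
        tags[key.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        return None
    return tags


def dmarc_finding(txt_record):
    """Map a TXT value (or None when no record exists) to a coarse finding label."""
    tags = parse_dmarc(txt_record) if txt_record else None
    if tags is None:
        return "missing"
    policy = tags.get("p", "").lower()
    if policy == "none":
        return "monitor-only"
    if policy in ("quarantine", "reject"):
        return "enforced"
    return "invalid"
```

Labels like these are what feed the compliance view: a domain with a missing or monitor-only policy is more exposed to spoofing than one with an enforced policy.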
Full Stack Development Contributions
Backoffice Platform
Our team developed and modernized a back-office management platform supporting centralized control for the ASI Module. Capabilities included:
- User permission and workflow management
- Legacy UI modernization with improved performance and database optimizations
- Theme switcher (light/dark/auto) for UX improvement
- Audit trail system for compliance and accountability
- Checkmate integration for monitoring entities in one place
- Automated scan scheduling
Stack: Laravel, Livewire, Alpine.js, Tailwind CSS, Docker, PostgreSQL, Git.
Checkmate Microservice
Checkmate validated consumer authorization for job execution based on defined business rules. Our contributions:
- Entity management and retrieval functionality
- API-only architecture cleanup (removal of default Laravel components)
- Search and filtering capabilities for Constraint Rules
- Development workflow improvements
Stack: Laravel, Docker, MySQL, Git.
Frontend Contributions
Asset Slideout Component
We delivered the Asset Slideout React component to replace the outdated AssetModal. It integrated with the new PAPI endpoint, providing:
- Comprehensive asset views with multiple tables and nested modals
- Faster performance due to optimized API calls
- Improved navigation and UX for large, complex datasets
Stack: React, TypeScript, React Query, Styled Components, Context API, Figma.
Explorer Component Refactoring
We refactored the Explorer component, central to browsing and filtering assets.
Goals achieved:
- Improved performance on large datasets
- Cleaner, modular codebase with reusable components
- Prepared foundation for new features
Stack: React, TypeScript, Styled Components, React Query, Context API, Figma, Storybook.
Results and Benefits
- Proactive threat intelligence – Continuous scraping of deep web forums and marketplaces provided early detection of malicious campaigns and data leaks.
- Streamlined operations – In-house frameworks, schedulers, and crawlers replaced fragmented tooling.
- Improved SOC efficiency – Integrated pipelines fed directly into SOC monitoring, enabling faster triage and incident response.
- Scalability – The migration to Kafka and the modular harvesters allowed for rapid expansion as new platforms emerged.
- Enhanced UX – Backoffice and frontend refactoring delivered improved performance and usability for analysts.
Challenges and Mitigation
- Constantly evolving deep web sites – mitigated with adaptable Pythonic harvesters and modular frameworks.
- Data volume and complexity – addressed with structured pipelines and optimized database management.
- UI modernization risks – managed through gradual migration, strong QA, and collaboration across design, backend, and frontend teams.
Conclusion
This engagement demonstrates how combining staff augmentation with SOC engineering expertise enables clients to build robust threat intelligence pipelines. By monitoring and harvesting data from deep web ecosystems, enriching it with third-party scans, and modernizing analyst-facing applications, we helped strengthen detection capabilities and operational resilience.
Our work showcases the importance of multidisciplinary expertise in cybersecurity: from Python-based scraping, port scanning, and data pipelines, to full-stack application development and user experience optimization.
References
- Mavroeidis, V., & Bromander, S. (2017). Cyber Threat Intelligence Model: An Evaluation of Taxonomies, Sharing Standards, and Ontologies. International Conference on Information Systems Security and Privacy.
- Europol. (2023). Internet Organised Crime Threat Assessment (IOCTA).
- ProjectDiscovery. “Nuclei”.
- Rapid7. (2022). “Understanding Zmap and Mass Scanning Tools.”
- MITRE ATT&CK® Framework. (2025). Threat Intelligence Techniques.