Dataspot 🔥

Find data concentration patterns and dataspots in your datasets

Dataspot automatically discovers where your data concentrates, helping you identify patterns, anomalies, and insights in datasets. Originally developed for fraud detection at Frauddi, now available as open source.

✨ Why Dataspot?

Instead of clustering similar records, Dataspot looks for where your data concentrates: it walks the value combinations of the fields you choose and reports the ones that absorb a disproportionate share of records, as percentage-and-count patterns in a business-friendly hierarchy.

🚀 Quick Start

pip install dataspot

from dataspot import Dataspot
from dataspot.models.finder import FindInput, FindOptions

# Sample transaction data
data = [
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
    {"country": "US", "device": "mobile", "amount": "medium", "user_type": "premium"},
    {"country": "EU", "device": "desktop", "amount": "low", "user_type": "free"},
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
]

# Find concentration patterns
dataspot = Dataspot()
result = dataspot.find(
    FindInput(data=data, fields=["country", "device", "user_type"]),
    FindOptions(min_percentage=10.0, limit=5)
)

# Results show where data concentrates
for pattern in result.patterns:
    print(f"{pattern.path} → {pattern.percentage}% ({pattern.count} records)")

# Output:
# country=US > device=mobile > user_type=premium → 75.0% (3 records)
# country=US > device=mobile → 75.0% (3 records)
# device=mobile → 75.0% (3 records)
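
The percentages are plain record-count ratios: a pattern's percentage is the number of matching records divided by the total. In the four sample records above, three share country=US, device=mobile, and user_type=premium, so that pattern scores 3 / 4 = 75.0%. A quick standalone check, reusing the data list from the snippet above:

# Plain-Python sanity check of the 75.0% figure (no Dataspot API involved)
total = len(data)  # the 4 sample records defined above
matches = sum(
    1
    for row in data
    if (row["country"], row["device"], row["user_type"]) == ("US", "mobile", "premium")
)
print(f"{matches}/{total} = {matches / total * 100:.1f}%")  # 3/4 = 75.0%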

🎯 Real-World Use Cases

🚨 Fraud Detection

from dataspot.models.finder import FindInput, FindOptions

# Find suspicious transaction patterns
result = dataspot.find(
    FindInput(
        data=transactions,
        fields=["country", "payment_method", "time_of_day"]
    ),
    FindOptions(min_percentage=15.0, contains="crypto")
)

# Spot unusual concentrations that might indicate fraud
for pattern in result.patterns:
    if pattern.percentage > 30:
        print(f"⚠️ High concentration: {pattern.path}")

📊 Business Intelligence

from dataspot.models.analyzer import AnalyzeInput, AnalyzeOptions

# Discover customer behavior patterns
insights = dataspot.analyze(
    AnalyzeInput(
        data=customer_data,
        fields=["region", "device", "product_category", "tier"]
    ),
    AnalyzeOptions(min_percentage=10.0)
)

print(f"📈 Found {len(insights.patterns)} concentration patterns")
print(f"🎯 Top opportunity: {insights.patterns[0].path}")

🔍 Temporal Analysis

from dataspot.models.compare import CompareInput, CompareOptions

# Compare patterns between time periods
comparison = dataspot.compare(
    CompareInput(
        current_data=this_month_data,
        baseline_data=last_month_data,
        fields=["country", "payment_method"]
    ),
    CompareOptions(
        change_threshold=0.20,
        statistical_significance=True
    )
)

print(f"📊 Changes detected: {len(comparison.changes)}")
print(f"🆕 New patterns: {len(comparison.new_patterns)}")

🌳 Hierarchical Visualization

from dataspot.models.tree import TreeInput, TreeOptions

# Build hierarchical tree for data exploration
tree = dataspot.tree(
    TreeInput(
        data=sales_data,
        fields=["region", "product_category", "sales_channel"]
    ),
    TreeOptions(min_value=10, max_depth=3, sort_by="value")
)

print(f"🌳 Total records: {tree.value}")
print(f"📊 Main branches: {len(tree.children)}")

# Navigate the hierarchy
for region in tree.children:
    print(f"  📍 {region.name}: {region.value} records")
    for product in region.children:
        print(f"    📦 {product.name}: {product.value} records")

🤖 Auto Discovery

from dataspot.models.discovery import DiscoverInput, DiscoverOptions

# Automatically discover important patterns
discovery = dataspot.discover(
    DiscoverInput(data=transaction_data),
    DiscoverOptions(max_fields=3, min_percentage=15.0)
)

print(f"🎯 Top patterns discovered: {len(discovery.top_patterns)}")
for field_ranking in discovery.field_ranking[:3]:
    print(f"📈 {field_ranking.field}: {field_ranking.score:.2f}")

🛠️ Core Methods

| Method | Purpose | Input Model | Options Model | Output Model |
|--------|---------|-------------|---------------|--------------|
| find() | Find concentration patterns | FindInput | FindOptions | FindOutput |
| analyze() | Statistical analysis | AnalyzeInput | AnalyzeOptions | AnalyzeOutput |
| compare() | Temporal comparison | CompareInput | CompareOptions | CompareOutput |
| discover() | Auto pattern discovery | DiscoverInput | DiscoverOptions | DiscoverOutput |
| tree() | Hierarchical visualization | TreeInput | TreeOptions | TreeOutput |

Advanced Filtering Options

# Complex analysis with multiple criteria
result = dataspot.find(
    FindInput(
        data=data,
        fields=["country", "device", "payment"],
        query={"country": ["US", "EU"]}  # Pre-filter data
    ),
    FindOptions(
        min_percentage=10.0,      # Only patterns with >10% concentration
        max_depth=3,             # Limit hierarchy depth
        contains="mobile",       # Must contain "mobile" in pattern
        min_count=50,           # At least 50 records
        sort_by="percentage",   # Sort by concentration strength
        limit=20                # Top 20 patterns
    )
)

⚡ Performance

Dataspot's processing time grows roughly linearly with dataset size while memory usage stays nearly flat, as the benchmarks below show.

🚀 Real-World Performance

| Dataset Size | Processing Time | Memory Usage | Patterns Found |
|--------------|-----------------|--------------|----------------|
| 1,000 records | ~5ms | ~1.4MB | 12 patterns |
| 10,000 records | ~43ms | ~2.8MB | 12 patterns |
| 100,000 records | ~375ms | ~2.9MB | 20 patterns |
| 1,000,000 records | ~3.7s | ~3.0MB | 20 patterns |

Benchmark methodology: performance was measured over 5 iterations per dataset size on a MacBook Pro (M-series).
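
If you want to sanity-check these numbers on your own machine, a minimal timing sketch along these lines should do. It is not the project's benchmark harness; the synthetic fields, value sets, and thresholds are assumptions chosen to mirror the examples above:

import random
import time

from dataspot import Dataspot
from dataspot.models.finder import FindInput, FindOptions

def synthetic_records(n):
    # Hypothetical test data: low-cardinality categorical fields, as in the examples above
    return [
        {
            "country": random.choice(["US", "EU", "UK", "BR"]),
            "device": random.choice(["mobile", "desktop"]),
            "user_type": random.choice(["free", "premium"]),
        }
        for _ in range(n)
    ]

dataspot = Dataspot()
for size in (1_000, 10_000, 100_000):
    data = synthetic_records(size)
    start = time.perf_counter()
    result = dataspot.find(
        FindInput(data=data, fields=["country", "device", "user_type"]),
        FindOptions(min_percentage=5.0),
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{size:>9,} records: {elapsed_ms:.0f} ms, {len(result.patterns)} patterns")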

💡 Performance Tips

# Optimize for speed
result = dataspot.find(
    FindInput(data=large_dataset, fields=fields),
    FindOptions(
        min_percentage=10.0,    # Skip low-concentration patterns
        max_depth=3,           # Limit hierarchy depth
        limit=100             # Cap results
    )
)

# Memory efficient processing
from dataspot.models.tree import TreeInput, TreeOptions

tree = dataspot.tree(
    TreeInput(data=data, fields=["country", "device"]),
    TreeOptions(min_value=10, top=5)  # Simplified tree
)

📈 What Makes Dataspot Different?

| Traditional Clustering | Dataspot Analysis |
|------------------------|-------------------|
| Groups similar data points | Finds concentration patterns |
| Equal-sized clusters | Identifies where data accumulates |
| Distance-based | Percentage and count based |
| Hard to interpret | Business-friendly hierarchy |
| Generic approach | Built for real-world analysis |
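
In concrete terms, "percentage and count based" means ranking field-value combinations by how much of the dataset they absorb. You can approximate a single depth level with nothing but the standard library (illustrative only; Dataspot additionally builds the full multi-level hierarchy and the surrounding statistics):

from collections import Counter

rows = [
    {"country": "US", "device": "mobile"},
    {"country": "US", "device": "mobile"},
    {"country": "US", "device": "desktop"},
    {"country": "EU", "device": "mobile"},
]

# Count how often each (country, device) combination occurs and rank by share of all rows
combos = Counter((row["country"], row["device"]) for row in rows)
for combo, count in combos.most_common():
    print(combo, f"{count / len(rows) * 100:.0f}%")
# ('US', 'mobile') 50%  <- the concentration, expressed as count and percentage rather than distance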

🎬 Dataspot in Action

[Demo: Dataspot in action, finding data concentration patterns]

See Dataspot discover concentration patterns and dataspots in real time, with hierarchical analysis and statistical insights.

📊 API Structure

Input Models: FindInput, AnalyzeInput, CompareInput, DiscoverInput, TreeInput

Options Models: FindOptions, AnalyzeOptions, CompareOptions, DiscoverOptions, TreeOptions

Output Models: FindOutput, AnalyzeOutput, CompareOutput, DiscoverOutput, TreeOutput
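
The input and options models live in method-specific modules, matching the imports used throughout the examples above; the output models listed in the Core Methods table are what the corresponding methods return:

from dataspot.models.finder import FindInput, FindOptions
from dataspot.models.analyzer import AnalyzeInput, AnalyzeOptions
from dataspot.models.compare import CompareInput, CompareOptions
from dataspot.models.discovery import DiscoverInput, DiscoverOptions
from dataspot.models.tree import TreeInput, TreeOptions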

🔧 Installation & Requirements

# Install from PyPI
pip install dataspot

# Development installation
git clone https://github.com/frauddi/dataspot.git
cd dataspot
pip install -e ".[dev]"

Requirements: Python 3.9 or newer.

🛠️ Development Commands

| Command | Description |
|---------|-------------|
| make lint | Check code for style and quality issues |
| make lint-fix | Automatically fix linting issues where possible |
| make tests | Run all tests with coverage reporting |
| make check | Run both linting and tests |
| make clean | Remove cache files, build artifacts, and temporary files |
| make install | Create virtual environment and install dependencies |

📚 Documentation & Examples

🌟 Why Open Source?

Dataspot was born from real-world fraud detection needs at Frauddi. We believe powerful pattern analysis shouldn't be locked behind closed doors, which is why Dataspot is now open source for anyone to use and improve.

🤝 Contributing

We welcome contributions! See our Contributing Guide for details.

📄 License

MIT License - see LICENSE file for details.

πŸ™ Acknowledgments


Find your data's dataspots. Discover what others miss. Built with ❤️ by Frauddi