Find data concentration patterns and dataspots in your datasets
Dataspot automatically discovers where your data concentrates, helping you identify patterns, anomalies, and insights in datasets. Originally developed for fraud detection at Frauddi, now available as open source.
```bash
pip install dataspot
```
```python
from dataspot import Dataspot
from dataspot.models.finder import FindInput, FindOptions

# Sample transaction data
data = [
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
    {"country": "US", "device": "mobile", "amount": "medium", "user_type": "premium"},
    {"country": "EU", "device": "desktop", "amount": "low", "user_type": "free"},
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
]

# Find concentration patterns
dataspot = Dataspot()
result = dataspot.find(
    FindInput(data=data, fields=["country", "device", "user_type"]),
    FindOptions(min_percentage=10.0, limit=5)
)

# Results show where data concentrates
for pattern in result.patterns:
    print(f"{pattern.path} → {pattern.percentage}% ({pattern.count} records)")

# Output:
# country=US > device=mobile > user_type=premium → 75.0% (3 records)
# country=US > device=mobile → 75.0% (3 records)
# device=mobile → 75.0% (3 records)
```
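A pattern's percentage is simply its record count over the dataset total: 3 of the 4 sample records are US/mobile/premium, hence 75%. As a rough, library-free sketch of that idea (plain `collections.Counter`, not Dataspot's actual algorithm):

```python
from collections import Counter

records = [
    {"country": "US", "device": "mobile", "user_type": "premium"},
    {"country": "US", "device": "mobile", "user_type": "premium"},
    {"country": "EU", "device": "desktop", "user_type": "free"},
    {"country": "US", "device": "mobile", "user_type": "premium"},
]

# Count each (country, device, user_type) combination
counts = Counter((r["country"], r["device"], r["user_type"]) for r in records)

for combo, count in counts.most_common():
    pct = 100.0 * count / len(records)
    print(f"{combo} -> {pct:.1f}% ({count} records)")
# ('US', 'mobile', 'premium') -> 75.0% (3 records)
```

Dataspot does this across every level of the field hierarchy at once, which is why the real output above also reports the shallower `country=US > device=mobile` and `device=mobile` concentrations.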
```python
from dataspot.models.finder import FindInput, FindOptions

# Find suspicious transaction patterns
result = dataspot.find(
    FindInput(
        data=transactions,
        fields=["country", "payment_method", "time_of_day"]
    ),
    FindOptions(min_percentage=15.0, contains="crypto")
)

# Spot unusual concentrations that might indicate fraud
for pattern in result.patterns:
    if pattern.percentage > 30:
        print(f"⚠️ High concentration: {pattern.path}")
```
```python
from dataspot.models.analyzer import AnalyzeInput, AnalyzeOptions

# Discover customer behavior patterns
insights = dataspot.analyze(
    AnalyzeInput(
        data=customer_data,
        fields=["region", "device", "product_category", "tier"]
    ),
    AnalyzeOptions(min_percentage=10.0)
)

print(f"Found {len(insights.patterns)} concentration patterns")
print(f"Top opportunity: {insights.patterns[0].path}")
```
```python
from dataspot.models.compare import CompareInput, CompareOptions

# Compare patterns between time periods
comparison = dataspot.compare(
    CompareInput(
        current_data=this_month_data,
        baseline_data=last_month_data,
        fields=["country", "payment_method"]
    ),
    CompareOptions(
        change_threshold=0.20,
        statistical_significance=True
    )
)

print(f"Changes detected: {len(comparison.changes)}")
print(f"New patterns: {len(comparison.new_patterns)}")
```
```python
from dataspot.models.tree import TreeInput, TreeOptions

# Build a hierarchical tree for data exploration
tree = dataspot.tree(
    TreeInput(
        data=sales_data,
        fields=["region", "product_category", "sales_channel"]
    ),
    TreeOptions(min_value=10, max_depth=3, sort_by="value")
)

print(f"Total records: {tree.value}")
print(f"Main branches: {len(tree.children)}")

# Navigate the hierarchy
for region in tree.children:
    print(f"  {region.name}: {region.value} records")
    for product in region.children:
        print(f"    {product.name}: {product.value} records")
```
```python
from dataspot.models.discovery import DiscoverInput, DiscoverOptions

# Automatically discover important patterns
discovery = dataspot.discover(
    DiscoverInput(data=transaction_data),
    DiscoverOptions(max_fields=3, min_percentage=15.0)
)

print(f"Top patterns discovered: {len(discovery.top_patterns)}")
for field_ranking in discovery.field_ranking[:3]:
    print(f"{field_ranking.field}: {field_ranking.score:.2f}")
```
| Method | Purpose | Input Model | Options Model | Output Model |
|---|---|---|---|---|
| `find()` | Find concentration patterns | `FindInput` | `FindOptions` | `FindOutput` |
| `analyze()` | Statistical analysis | `AnalyzeInput` | `AnalyzeOptions` | `AnalyzeOutput` |
| `compare()` | Temporal comparison | `CompareInput` | `CompareOptions` | `CompareOutput` |
| `discover()` | Auto pattern discovery | `DiscoverInput` | `DiscoverOptions` | `DiscoverOutput` |
| `tree()` | Hierarchical visualization | `TreeInput` | `TreeOptions` | `TreeOutput` |
```python
from dataspot.models.finder import FindInput, FindOptions

# Complex analysis with multiple criteria
result = dataspot.find(
    FindInput(
        data=data,
        fields=["country", "device", "payment"],
        query={"country": ["US", "EU"]}  # Pre-filter data
    ),
    FindOptions(
        min_percentage=10.0,   # Only patterns with >10% concentration
        max_depth=3,           # Limit hierarchy depth
        contains="mobile",     # Must contain "mobile" in the pattern path
        min_count=50,          # At least 50 records
        sort_by="percentage",  # Sort by concentration strength
        limit=20               # Top 20 patterns
    )
)
```
Dataspot delivers consistent, predictable performance with exceptionally efficient memory usage and linear scaling.
| Dataset Size | Processing Time | Memory Usage | Patterns Found |
|---|---|---|---|
| 1,000 records | ~5ms | ~1.4MB | 12 patterns |
| 10,000 records | ~43ms | ~2.8MB | 12 patterns |
| 100,000 records | ~375ms | ~2.9MB | 20 patterns |
| 1,000,000 records | ~3.7s | ~3.0MB | 20 patterns |
**Benchmark Methodology:** Performance measured over 5 iterations per dataset size on a MacBook Pro (M-series). Test data specifications:

- JSON Size: ~164 bytes per record (~0.16 KB each)
- JSON Structure: 8 keys per record (`country`, `device`, `payment_method`, `amount`, `user_type`, `channel`, `status`, `id`)
- Analysis Scope: 4 fields analyzed simultaneously (`country`, `device`, `payment_method`, `user_type`)
- Configuration: `min_percentage=5.0`, `limit=50` patterns
- Results: stable pattern counts (12-20 patterns, see table above) across dataset sizes
- Variance: minimal timing variance (±1-6ms), demonstrating algorithmic stability
- Memory Efficiency: near-constant memory usage regardless of dataset size
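A harness along these lines can reproduce this kind of measurement. The record generator and field values below are illustrative assumptions, not the project's actual benchmark code:

```python
import random
import time

def generate_record(i):
    # 8-key records roughly matching the benchmark's data shape
    return {
        "id": i,
        "country": random.choice(["US", "EU", "MX", "BR"]),
        "device": random.choice(["mobile", "desktop"]),
        "payment_method": random.choice(["card", "crypto", "wire"]),
        "amount": random.choice(["low", "medium", "high"]),
        "user_type": random.choice(["free", "premium"]),
        "channel": random.choice(["web", "app"]),
        "status": "completed",
    }

def bench(analyze, n, iterations=5):
    """Time `analyze` over a generated dataset; report the best of `iterations` runs."""
    data = [generate_record(i) for i in range(n)]
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        analyze(data)  # e.g. a call into dataspot.find(...) with fixed options
        timings.append(time.perf_counter() - start)
    return min(timings)
```

Taking the minimum over several iterations, as above, is a common way to reduce timer noise and scheduler jitter in micro-benchmarks.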
```python
# Optimize for speed
result = dataspot.find(
    FindInput(data=large_dataset, fields=fields),
    FindOptions(
        min_percentage=10.0,  # Skip low-concentration patterns
        max_depth=3,          # Limit hierarchy depth
        limit=100             # Cap results
    )
)
```
```python
# Memory-efficient processing
from dataspot.models.tree import TreeInput, TreeOptions

tree = dataspot.tree(
    TreeInput(data=data, fields=["country", "device"]),
    TreeOptions(min_value=10, top=5)  # Simplified tree
)
```
| Traditional Clustering | Dataspot Analysis |
|---|---|
| Groups similar data points | Finds concentration patterns |
| Equal-sized clusters | Identifies where data accumulates |
| Distance-based | Percentage and count based |
| Hard to interpret | Business-friendly hierarchy |
| Generic approach | Built for real-world analysis |
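To make the contrast concrete: instead of assigning each point to a cluster, concentration analysis counts how records pile up along a field hierarchy. A minimal illustration of that output shape (plain Python, not the library's implementation):

```python
from collections import defaultdict

records = [
    {"region": "NA", "device": "mobile"},
    {"region": "NA", "device": "mobile"},
    {"region": "NA", "device": "desktop"},
    {"region": "EU", "device": "mobile"},
]

# Two-level hierarchy of counts: region -> device -> count
tree = defaultdict(lambda: defaultdict(int))
for r in records:
    tree[r["region"]][r["device"]] += 1

total = len(records)
for region, devices in sorted(tree.items()):
    region_count = sum(devices.values())
    print(f"region={region}: {100 * region_count / total:.0f}% ({region_count} records)")
    for device, count in sorted(devices.items()):
        print(f"  device={device}: {100 * count / total:.0f}% of all records")
```

Each node reports a share of the whole dataset, so every line is directly readable as a business statement ("75% of records are NA"), whereas a cluster assignment still needs interpretation.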
See Dataspot discover concentration patterns and dataspots in real time, with hierarchical analysis and statistical insights.
- `FindInput` - Data and fields for pattern finding
- `AnalyzeInput` - Statistical analysis configuration
- `CompareInput` - Current vs baseline data comparison
- `DiscoverInput` - Automatic pattern discovery
- `TreeInput` - Hierarchical tree visualization

- `FindOptions` - Filtering and sorting for patterns
- `AnalyzeOptions` - Statistical analysis parameters
- `CompareOptions` - Change detection thresholds
- `DiscoverOptions` - Auto-discovery constraints
- `TreeOptions` - Tree structure customization

- `FindOutput` - Pattern discovery results with statistics
- `AnalyzeOutput` - Enhanced analysis with insights and confidence scores
- `CompareOutput` - Change detection results with significance tests
- `DiscoverOutput` - Auto-discovery findings with field rankings
- `TreeOutput` - Hierarchical tree structure with navigation

```bash
# Install from PyPI
pip install dataspot

# Development installation
git clone https://github.com/frauddi/dataspot.git
cd dataspot
pip install -e ".[dev]"
```
Requirements:
| Command | Description |
|---|---|
| `make lint` | Check code for style and quality issues |
| `make lint-fix` | Automatically fix linting issues where possible |
| `make tests` | Run all tests with coverage reporting |
| `make check` | Run both linting and tests |
| `make clean` | Remove cache files, build artifacts, and temporary files |
| `make install` | Create virtual environment and install dependencies |
- `01_basic_query_filtering.py` - Query and filtering basics
- `02_pattern_filtering_basic.py` - Pattern-based filtering
- `06_real_world_scenarios.py` - Business use cases
- `08_auto_discovery.py` - Automatic pattern discovery
- `09_temporal_comparison.py` - A/B testing and change detection
- `10_stats.py` - Statistical analysis

Dataspot was born from real-world fraud detection needs at Frauddi. We believe powerful pattern analysis shouldn't be locked behind closed doors. By open-sourcing Dataspot, we hope to:
We welcome contributions! Whether you're:
See our Contributing Guide for details.
MIT License - see LICENSE file for details.
Find your data's dataspots. Discover what others miss. Built with ❤️ by Frauddi