Unified Search Implementation Guide

Overview

The Unified Search feature provides a single search interface that simultaneously searches across both the Entrepreneur and Company Elasticsearch indexes, with intelligent result merging, relevance ranking, and sole proprietorship handling.

Key Features:

Single search query across 30M entrepreneurs + 27M companies
Real-time result merging with relevance scoring
Sole proprietorship deduplication
Sub-second response times (P95 < 500ms)
Rate limiting and caching for performance
React-based search UI with live results

Architecture

Backend Components

1. UnifiedSearchService (app/services/unified_search_service.rb)

Coordinates multi-index search via Elasticsearch msearch API
Merges and ranks results by relevance score
Handles sole proprietorship deduplication
Implements caching (5-minute TTL)

2. API Controller (app/controllers/api/search_controller.rb)

Endpoint: POST /api/search/unified
Rate limiting: 30 requests/minute per IP
Error handling with partial results support

3. Background Job (app/jobs/reindex_relationships_job.rb)

Bulk reindexing of relationship fields
Progress tracking via Redis
Batch processing (1000 records/batch)
Error recovery and logging
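
The batch processing, progress tracking, and error recovery listed for the reindex job could look roughly like the sketch below. It is an illustration only: `reindex_in_batches` is a hypothetical name, a plain Hash stands in for the Redis progress counters, and the block passed in stands in for the real bulk-index call.

```ruby
BATCH_SIZE = 1000 # batch size stated for the job

# Hypothetical sketch: walks the IDs in batches and records progress.
# `progress` stands in for Redis; the caller's block does the indexing.
def reindex_in_batches(ids, progress: Hash.new(0))
  progress[:total] = ids.size
  ids.each_slice(BATCH_SIZE) do |batch|
    begin
      yield batch
      progress[:processed] += batch.size
    rescue StandardError => e
      # Record the failed batch and keep going with the next one.
      progress[:errors] += batch.size
      warn "Reindex batch failed: #{e.message}"
    end
  end
  progress
end
```

Counting a whole batch as failed when the block raises is a simplification; a real job might retry records individually.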

Frontend Components

1. Search Page (admin/src/pages/Search.tsx)

Full-page search interface
Debounced search input (500ms)
Paginated results display
Type indicators (entrepreneur vs company)
Clickable results → detail pages

2. Search Service (admin/src/services/search.ts)

TypeScript API client
Request/response type definitions
Error handling

3. Custom Hook (admin/src/hooks/index.ts::useUnifiedSearch)

React Query integration
Automatic caching (3-minute stale time)
Loading and error states

Database Schema Changes

Entrepreneur Index

# New fields added to entrepreneurs_#{Rails.env}
indexes :company_ids, type: :integer  # Array of related company IDs
indexes :has_sole_proprietorship, type: :boolean  # Pre-computed flag

Company Index

# New fields added to companies_#{Rails.env}
indexes :entrepreneur_ids, type: :integer  # Array of stakeholder IDs
indexes :is_sole_proprietorship, type: :boolean  # Pre-computed flag

Indexing Methods:

Entrepreneur#as_indexed_json - Populates company_ids and has_sole_proprietorship
Company#as_indexed_json - Populates entrepreneur_ids and is_sole_proprietorship

API Reference

POST /api/search/unified

Request Body:

{
  "query": "Jean Dupont",
  "page": 1,
  "per_page": 20,
  "filters": {
    "entity_type": "all",
    "city_id": null,
    "department_id": null,
    "naf_id": null
  }
}

Response:

{
  "results": [
    {
      "type": "entrepreneur",
      "id": 12345,
      "score": 15.8,
      "full_name": "Jean Dupont",
      "prenoms": "Jean",
      "nom": "Dupont",
      "role": "Gérant",
      "city_id": 75001,
      "company_ids": [678, 789],
      "sole_proprietorship_company": {
        "id": 678,
        "name": "Jean Dupont",
        "siren": "123456789"
      }
    },
    {
      "type": "company",
      "id": 999,
      "score": 12.3,
      "name": "Dupont & Associates",
      "siren": "987654321",
      "legal_status_id": 5,
      "naf_id": 123,
      "city_id": 75002,
      "entrepreneur_ids": [12345, 54321],
      "is_sole_proprietorship": false
    }
  ],
  "meta": {
    "total": 145,
    "page": 1,
    "per_page": 20,
    "total_pages": 8,
    "counts": {
      "entrepreneurs": 67,
      "companies": 78
    },
    "query_time_ms": 145
  }
}
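
The `meta` block in the response can be derived from the per-index counts. The sketch below shows one way to do that; `build_meta` is a hypothetical helper, not necessarily how the service computes it.

```ruby
# Hypothetical helper: derives the meta block from per-index hit counts.
def build_meta(entrepreneur_count, company_count, page:, per_page:)
  total = entrepreneur_count + company_count
  {
    total: total,
    page: page,
    per_page: per_page,
    # Round up so a partially filled last page still counts as a page.
    total_pages: (total.to_f / per_page).ceil,
    counts: { entrepreneurs: entrepreneur_count, companies: company_count }
  }
end

meta = build_meta(67, 78, page: 1, per_page: 20)
# meta[:total] is 145 and meta[:total_pages] is 8, as in the example response
```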

Error Response:

{
  "results": [],
  "meta": {
    "total": 0,
    "page": 1,
    "per_page": 20,
    "total_pages": 0,
    "counts": {
      "entrepreneurs": 0,
      "companies": 0
    },
    "errors": ["Search failed: connection timeout"]
  }
}

HTTP Status Codes:

200 OK - Successful search
206 Partial Content - Search completed with partial errors
400 Bad Request - Invalid query parameters
429 Too Many Requests - Rate limit exceeded
503 Service Unavailable - Search service down
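
One possible way to choose among these status codes is sketched below. The decision rules are assumptions made for illustration (in particular, treating "errors plus some results" as 206 and "errors with no results" as 503), and `search_status` is a hypothetical helper.

```ruby
# Hypothetical mapping from a service result to the status codes listed above.
def search_status(result, rate_limited: false, valid_params: true)
  return 429 if rate_limited
  return 400 unless valid_params

  errors = Array(result.dig(:meta, :errors))
  return 200 if errors.empty?

  # Assumption: errors with some results are partial, errors with none are an outage.
  result[:results].to_a.empty? ? 503 : 206
end
```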

Relevance Ranking Algorithm

Elasticsearch Scoring

Entrepreneur Queries:

{
  should: [
    { match: { "full_name.keyword": { query: "...", boost: 10.0 } } },  # Exact match
    { match_phrase: { full_name: { query: "...", boost: 5.0 } } },      # Phrase match
    { multi_match: { query: "...", fields: ["full_name^3", "prenoms^2", "nom^2"], boost: 3.0 } }, # Multi-field
    { multi_match: { query: "...", fields: ["*.autocomplete"], boost: 1.0 } }  # Prefix
  ]
}

Company Queries:

{
  should: [
    { match: { "name.keyword": { query: "...", boost: 10.0 } } },       # Exact match
    { match_phrase: { name: { query: "...", boost: 5.0 } } },           # Phrase match
    { match: { name: { query: "...", boost: 3.0 } } },                  # Full-text
    { match: { "name.autocomplete": { query: "...", boost: 1.0 } } },   # Prefix
    { term: { siren: { value: "...", boost: 15.0 } } }                  # SIREN exact
  ]
}
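
Both query sets are sent in a single msearch request. The sketch below shows how such a request body might be assembled; `build_msearch_body` is a hypothetical helper, and the index names follow the `entrepreneurs_#{Rails.env}` / `companies_#{Rails.env}` pattern used in this guide. The Multi Search API expects alternating header and body entries, which is the shape produced here.

```ruby
# Hypothetical helper: pairs each index header with its search body.
# The queries are passed in already built (for example, the bool/should
# clauses shown above wrapped in a `bool` query).
def build_msearch_body(entrepreneur_query, company_query, env: "production", limit_per_index: 30)
  [
    { index: "entrepreneurs_#{env}" },
    { query: entrepreneur_query, size: limit_per_index },
    { index: "companies_#{env}" },
    { query: company_query, size: limit_per_index }
  ]
end

body = build_msearch_body(
  { match: { full_name: "Jean Dupont" } },
  { match: { name: "Jean Dupont" } }
)
# With the elasticsearch-ruby client this would be sent as:
#   client.msearch(body: body)
```

The responses come back in the same order as the requests, so the first entry belongs to the entrepreneur index and the second to the company index.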

Result Merging

Fetch: Top 30 results from each index (over-fetching for better merging)
Deduplicate: Remove duplicate sole proprietorships
Sort: By Elasticsearch _score (descending)
Paginate: Apply requested page/per_page
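
The sort and paginate steps could be sketched as follows. This is a simplified illustration: `merge_results` is a hypothetical name, hits are plain hashes with a `:score` key, and the deduplication step is left out.

```ruby
# Hypothetical sketch: merge hits from both indexes, rank by score, paginate.
def merge_results(entrepreneur_hits, company_hits, page: 1, per_page: 20)
  merged = (entrepreneur_hits + company_hits).sort_by { |hit| -hit[:score] }
  offset = (page - 1) * per_page
  # Array#[] returns nil when the offset is past the end, so fall back to [].
  merged[offset, per_page] || []
end
```

Note that scores from two different indexes are only roughly comparable, and that serving pages beyond the first requires fetching at least `page * per_page` hits from each index rather than a fixed 30.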

Sole Proprietorship Logic

Detection Criteria

A company is marked as a sole proprietorship when:

Single entrepreneur: entrepreneur_ids.length == 1
Legal status: Legal status ID is 2 (EI) or 42 (EIRL)
Name match: Company name == Entrepreneur name (case-insensitive)
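
These criteria can be expressed as a single predicate. The sketch below assumes all three conditions must hold together, which the list above does not state explicitly; `sole_proprietorship?` and the hash keys are illustrative names.

```ruby
SOLE_PROPRIETORSHIP_STATUS_IDS = [2, 42].freeze # EI and EIRL, per the criteria above

# Hypothetical predicate: true when a company looks like a sole proprietorship.
def sole_proprietorship?(company, entrepreneur_name)
  company[:entrepreneur_ids].length == 1 &&
    SOLE_PROPRIETORSHIP_STATUS_IDS.include?(company[:legal_status_id]) &&
    # casecmp? does a Unicode-aware case-insensitive comparison.
    company[:name].casecmp?(entrepreneur_name)
end
```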

Deduplication Strategy

When both an entrepreneur and their sole proprietorship company appear in results:

Keep the entrepreneur result
Add sole_proprietorship_company embedded data
Remove the company result from the list

Why? Users searching for "Jean Dupont" want to see the person first, with easy access to their company details.
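
A minimal sketch of this strategy, operating on result hashes shaped like the API response, might look like this. `deduplicate` is a hypothetical name, and matching a company to its entrepreneur through `entrepreneur_ids` is an assumption.

```ruby
# Hypothetical sketch: fold sole-proprietorship companies into their entrepreneur.
def deduplicate(results)
  absorbed_company_ids = []

  enriched = results.map do |result|
    next result unless result[:type] == "entrepreneur"

    company = results.find do |candidate|
      candidate[:type] == "company" &&
        candidate[:is_sole_proprietorship] &&
        candidate[:entrepreneur_ids] == [result[:id]]
    end
    next result unless company

    absorbed_company_ids << company[:id]
    result.merge(sole_proprietorship_company: company.slice(:id, :name, :siren))
  end

  # Drop the companies that were embedded into an entrepreneur result.
  enriched.reject { |r| r[:type] == "company" && absorbed_company_ids.include?(r[:id]) }
end
```

A sole proprietorship whose entrepreneur is not in the result set is left in place, since there is nothing to merge it into.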

Reindexing Operations

Initial Setup

After deploying this feature, reindex all records to populate relationship fields:

# Reindex all entrepreneurs
rails elasticsearch:relationships:entrepreneurs

# Reindex all companies
rails elasticsearch:relationships:companies

# Or reindex both in parallel
rails elasticsearch:relationships:all

Progress Monitoring

# Check progress
rails elasticsearch:relationships:progress[Entrepreneur]
rails elasticsearch:relationships:progress[Company]

# Example output:
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
#   Entrepreneur Reindex Progress
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Total records:         30000000
# Processed:             15000000
# Errors:                     127
# Progress:                 50.0%
#
# Started:           2025-11-13 10:00:00
# Last updated:      2025-11-13 12:30:15
# Elapsed:           2h 30m 15s
# Est. remaining:    2h 30m 15s
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
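
The percentage and remaining-time figures in this output can be derived from the counters alone. The sketch below assumes a simple linear extrapolation from the average rate so far; `reindex_progress` is a hypothetical helper and the task may compute its estimate differently.

```ruby
# Hypothetical helper: percent complete and estimated seconds remaining.
def reindex_progress(total:, processed:, elapsed_seconds:)
  percent = total.zero? ? 0.0 : (processed * 100.0 / total).round(1)
  rate = elapsed_seconds.zero? ? 0.0 : processed / elapsed_seconds.to_f
  # No estimate is possible until at least one record has been processed.
  remaining = rate.zero? ? nil : ((total - processed) / rate).round
  { percent: percent, remaining_seconds: remaining }
end

# 2h 30m 15s elapsed is 9015 seconds
reindex_progress(total: 30_000_000, processed: 15_000_000, elapsed_seconds: 9015)
```

At the halfway point the estimate equals the elapsed time, which is why the example output shows the same value for both.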

Incremental Reindexing

For reindexing specific ID ranges (useful for parallelization):

# Reindex entrepreneurs with IDs 1M to 5M
rails elasticsearch:relationships:range[Entrepreneur,1000000,5000000]

# Reindex companies with IDs 10M to 15M
rails elasticsearch:relationships:range[Company,10000000,15000000]
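
To parallelize, an ID span can be split into one range per worker. The sketch below prints one `range` command per worker; `id_ranges` is a hypothetical helper, and it assumes both bounds of the task's range are inclusive.

```ruby
# Hypothetical helper: splits [first_id, last_id] into contiguous inclusive ranges.
def id_ranges(first_id, last_id, workers)
  span = ((last_id - first_id + 1) / workers.to_f).ceil
  ranges = (0...workers).map do |i|
    from = first_id + i * span
    to = [from + span - 1, last_id].min
    [from, to]
  end
  # Drop empty ranges when there are more workers than IDs.
  ranges.reject { |from, to| from > to }
end

id_ranges(1_000_000, 5_000_000, 4).each do |from, to|
  puts "rails elasticsearch:relationships:range[Entrepreneur,#{from},#{to}]"
end
```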

Performance

Expected throughput:

~500-1000 records/second (depending on DB/ES load)
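Total time for full reindex: ~8-16 hours when entrepreneurs and companies are reindexed in parallel, assuming each job sustains the rate above (30M records at 500-1000 records/second is roughly 8-17 hours; a sequential run over all 57M records would take roughly 16-32 hours)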
Memory usage: ~100MB per worker

Performance Optimization

Query-Level Optimizations

1. Result Limiting

limit_per_index = 30  # Fetch top 30 from each index
# After merging, return only requested per_page (typically 20)

2. Source Filtering

_source: ["full_name", "city_id", "company_ids"]  # Only needed fields
# Reduces network payload by 60-80%

3. Timeout Protection

timeout: "500ms"  # Best-effort time limit for slow queries
# On timeout, Elasticsearch returns the hits collected so far (timed_out: true), which prevents queue backup

4. Disable Total Count (for deep pagination)

track_total_hits: false  # For pages beyond 1
# Skips exact hit counting, so Elasticsearch does not have to visit every matching document
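
Taken together, the four options could be assembled per request as in the sketch below. `search_options` is a hypothetical helper; the values are the ones given in this section.

```ruby
# Hypothetical helper: per-index search options for a given page.
def search_options(page:, source_fields:, limit_per_index: 30)
  {
    size: limit_per_index,
    _source: source_fields,
    timeout: "500ms",
    # Exact totals are only requested for the first page.
    track_total_hits: page == 1
  }
end

search_options(page: 2, source_fields: %w[full_name city_id company_ids])
```

When `track_total_hits` is false the response carries no total, so the total shown on later pages has to come from elsewhere, for example a cached first-page response.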

Caching Strategy

1. Redis Query Cache

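# Note: this key does not include filters; if filters change the results, they must be part of the key
cache_key = "unified_search:#{query}:#{page}:#{per_page}"
Rails.cache.fetch(cache_key, expires_in: 5.minutes) { ... }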

Popular queries cached for 5 minutes
Estimated hit rate: 30-40%

2. Elasticsearch Request Cache

settings index: {
  requests: { cache: { enable: true } }
}

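Caches identical requests at shard level (by default only requests with size: 0, so totals and aggregations rather than hits)
Automatic invalidation when a shard refreshes with changed data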
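
For the Redis query cache, a key that also accounts for filters might be built as sketched below. `unified_search_cache_key` is a hypothetical helper; lowercasing and trimming the query is an assumption suggested by the lowercase key in the Check Cache example under Debug Queries, and the digest suffix is an illustrative choice.

```ruby
require "digest"
require "json"

# Hypothetical helper: cache key covering query, pagination, and active filters.
def unified_search_cache_key(query, page, per_page, filters = {})
  normalized = query.to_s.strip.downcase
  key = "unified_search:#{normalized}:#{page}:#{per_page}"

  # Ignore unset filters and sort keys so equal filter sets give equal keys.
  active = filters.compact.sort_by { |name, _| name.to_s }.to_h
  return key if active.empty?

  "#{key}:#{Digest::SHA256.hexdigest(JSON.generate(active))[0, 12]}"
end

unified_search_cache_key("Jean Dupont", 1, 20)
# returns "unified_search:jean dupont:1:20"
```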

Rate Limiting

RATE_LIMIT_MAX_REQUESTS = 30  # per minute per IP
RATE_LIMIT_WINDOW = 1.minute

Bypass:

# Disable rate limiting (for example, during local testing)
DISABLE_RATE_LIMIT=true rails s
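
A fixed-window limiter with these settings could be sketched as follows. This is not the controller's actual code: `FixedWindowLimiter` is a hypothetical class, and a Hash stands in for `Rails.cache`; only the `search_rate_limit:#{ip}` key format and the 30-per-minute limit come from this guide.

```ruby
# Hypothetical sketch of a fixed-window rate limiter keyed by client IP.
class FixedWindowLimiter
  def initialize(max_requests: 30, window_seconds: 60, store: {})
    @max_requests = max_requests
    @window_seconds = window_seconds
    @store = store
  end

  # Returns true if the request is allowed, false if the limit is exceeded.
  def allow?(ip, now: Time.now.to_i)
    key = "search_rate_limit:#{ip}"
    window_start, count = @store[key]
    # Start a fresh window on first use or once the old one has expired.
    if window_start.nil? || now - window_start >= @window_seconds
      window_start = now
      count = 0
    end
    @store[key] = [window_start, count + 1]
    count + 1 <= @max_requests
  end
end
```

A fixed window allows short bursts around the window boundary; a sliding window is stricter at the cost of more bookkeeping.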

Pagination Limits

Max page: 100 (to prevent deep pagination performance issues)
Max per_page: 100
For results beyond page 100, use search_after cursor (future enhancement)
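
These limits could be applied when reading request parameters, as in the sketch below. Whether the API clamps out-of-range values or rejects them with 400 is not stated here, so clamping is an assumption, as are the `normalize_pagination` name and the default of 20 (taken from the example request).

```ruby
MAX_PAGE = 100
MAX_PER_PAGE = 100
DEFAULT_PER_PAGE = 20 # assumption, based on the example request

# Hypothetical helper: coerces raw params into values within the limits.
def normalize_pagination(page, per_page)
  page = page.to_i.clamp(1, MAX_PAGE)
  per_page = per_page.to_i
  per_page = DEFAULT_PER_PAGE if per_page <= 0
  [page, [per_page, MAX_PER_PAGE].min]
end
```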

Monitoring & Alerts

Performance Targets

Metric               Target     Alert Threshold
P50 Response Time    < 150ms    > 300ms
P95 Response Time    < 500ms    > 1s
P99 Response Time    < 1s       > 2s
Cache Hit Rate       > 30%      < 20%
ES Timeout Rate      < 1%       > 5%
Error Rate           < 0.1%     > 1%
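
The latency rows of this table can be checked against a sample of response times. The sketch below uses the nearest-rank percentile method, which is one of several common definitions; `percentile` and `latency_alerts` are hypothetical helpers, and a monitoring system would normally do this for you.

```ruby
# Nearest-rank percentile over a list of response times in milliseconds.
def percentile(samples, pct)
  sorted = samples.sort
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Alert thresholds from the table above, in milliseconds.
LATENCY_ALERT_MS = { p50: 300, p95: 1000, p99: 2000 }.freeze

# Returns a message for every percentile that exceeds its alert threshold.
def latency_alerts(samples_ms)
  LATENCY_ALERT_MS.each_with_object([]) do |(name, limit), alerts|
    value = percentile(samples_ms, name.to_s.delete("p").to_i)
    alerts << "#{name}=#{value}ms exceeds #{limit}ms" if value > limit
  end
end
```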

Key Metrics to Track

1. Search Performance

# Log in controller (inspect quotes the query and escapes control characters)
Rails.logger.info(
  "Search: query=#{params[:query].inspect} " \
  "time=#{elapsed}ms " \
  "results=#{data[:meta][:total]} " \
  "page=#{data[:meta][:page]} " \
  "cache_hit=#{cache_hit}"
)

2. Elasticsearch Health

# Monitor cluster health
GET /_cluster/health

# Monitor index stats
GET /entrepreneurs_production,companies_production/_stats

3. Rate Limiting

# Monitor rate limit hits
Rails.cache.read("search_rate_limit:#{ip}")

Logging

Search Queries:

[INFO] Search: query="Jean Dupont" time=145ms results=67 page=1 cache_hit=false

Errors:

[ERROR] UnifiedSearchService error: Connection timeout
/app/services/unified_search_service.rb:45

Reindexing Progress:

[INFO] Reindex progress for Entrepreneur: 15000000/30000000 (50.0%)

Testing

RSpec Tests

Service Tests (spec/services/unified_search_service_spec.rb)

Search across both indexes
Result merging and ranking
Sole proprietorship deduplication
Error handling
Caching behavior

Controller Tests (spec/controllers/api/search_controller_spec.rb)

Endpoint response format
Rate limiting
Error responses
Authentication

Job Tests (spec/jobs/reindex_relationships_job_spec.rb)

Batch processing
Progress tracking
Error recovery

Manual Testing

1. Test Search Query

curl -X POST http://localhost:3000/api/search/unified \
  -H "Content-Type: application/json" \
  -d '{"query": "Jean Dupont", "page": 1, "per_page": 20}'

2. Test Rate Limiting

# Send 31 requests rapidly and print only the HTTP status of each
for i in {1..31}; do
  curl -s -o /dev/null -w "%{http_code}\n" -X POST http://localhost:3000/api/search/unified \
    -H "Content-Type: application/json" \
    -d '{"query": "test"}'
done
# The 31st request should print 429 (Too Many Requests)

3. Test React UI

cd admin/
npm run dev
# Navigate to http://localhost:3001/search
# Type a query and verify results display correctly

Troubleshooting

Common Issues

1. No results found

Check: Elasticsearch indexes exist and have data
Fix: Run rails elasticsearch:relationships:all to reindex

2. Slow queries (> 1s)

Check: Elasticsearch cluster health GET /_cluster/health
Fix: Increase cluster resources or add replicas

3. Rate limit errors

Check: Multiple users sharing same IP?
Fix: Adjust RATE_LIMIT_MAX_REQUESTS or implement user-based limiting

4. Sole proprietorships showing twice

Check: has_sole_proprietorship and is_sole_proprietorship flags correct
Fix: Re-run reindexing for affected records

5. Frontend errors

Check: API endpoint returning correct data format
Fix: Verify TypeScript types match API response

Debug Queries

View Elasticsearch Query:

# In development, queries are logged to console
rails s
# Search from UI, check terminal for pretty-printed JSON query

Test Elasticsearch Directly:

curl -X GET "localhost:9200/entrepreneurs_production/_search?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
      "match": { "full_name": "Jean Dupont" }
    }
  }'

Check Cache:

rails console
> Rails.cache.read("unified_search:jean dupont:1:20")

Future Enhancements

Phase 2 Features

Advanced Filters
- Filter by entrepreneur role
- Filter by company legal status
- Filter by NAF code
- Filter by city/department
Search Analytics
- Track popular queries
- Measure search quality (clicks on results)
- A/B test ranking algorithms
Performance Improvements
- Implement search_after for deep pagination
- Add search result highlighting
- Prefetch related data (companies for entrepreneurs)
User Experience
- Autocomplete suggestions
- Recent searches
- Saved searches
- Export results to CSV

Support

For questions or issues:

Check Elasticsearch cluster health
Review application logs
Monitor Redis cache
Check reindexing job progress
Test API endpoint directly

Performance issues? Check the Performance Optimization section above.
