
Unified Search Implementation Guide

Overview

The Unified Search feature provides a single search interface that simultaneously searches across both the Entrepreneur and Company Elasticsearch indexes, with intelligent result merging, relevance ranking, and sole proprietorship handling.

Key Features:

  • Single search query across 30M entrepreneurs + 27M companies
  • Real-time result merging with relevance scoring
  • Sole proprietorship deduplication
  • Sub-second response times (P95 < 500ms)
  • Rate limiting and caching for performance
  • React-based search UI with live results

Architecture

Backend Components

1. UnifiedSearchService (app/services/unified_search_service.rb)

  • Coordinates multi-index search via Elasticsearch msearch API
  • Merges and ranks results by relevance score
  • Handles sole proprietorship deduplication
  • Implements caching (5-minute TTL)

2. API Controller (app/controllers/api/search_controller.rb)

  • Endpoint: POST /api/search/unified
  • Rate limiting: 30 requests/minute per IP
  • Error handling with partial results support

3. Background Job (app/jobs/reindex_relationships_job.rb)

  • Bulk reindexing of relationship fields
  • Progress tracking via Redis
  • Batch processing (1000 records/batch)
  • Error recovery and logging

Frontend Components

1. Search Page (admin/src/pages/Search.tsx)

  • Full-page search interface
  • Debounced search input (500ms)
  • Paginated results display
  • Type indicators (entrepreneur vs company)
  • Clickable results → detail pages

2. Search Service (admin/src/services/search.ts)

  • TypeScript API client
  • Request/response type definitions
  • Error handling

3. Custom Hook (admin/src/hooks/index.ts::useUnifiedSearch)

  • React Query integration
  • Automatic caching (3-minute stale time)
  • Loading and error states

Database Schema Changes

Entrepreneur Index

# New fields added to entrepreneurs_#{Rails.env}
indexes :company_ids, type: :integer # Array of related company IDs
indexes :has_sole_proprietorship, type: :boolean # Pre-computed flag

Company Index

# New fields added to companies_#{Rails.env}
indexes :entrepreneur_ids, type: :integer # Array of stakeholder IDs
indexes :is_sole_proprietorship, type: :boolean # Pre-computed flag

Indexing Methods:

  • Entrepreneur#as_indexed_json - Populates company_ids and has_sole_proprietorship
  • Company#as_indexed_json - Populates entrepreneur_ids and is_sole_proprietorship

API Reference

POST /api/search/unified

Request Body:

{
"query": "Jean Dupont",
"page": 1,
"per_page": 20,
"filters": {
"entity_type": "all",
"city_id": null,
"department_id": null,
"naf_id": null
}
}

Response:

{
"results": [
{
"type": "entrepreneur",
"id": 12345,
"score": 15.8,
"full_name": "Jean Dupont",
"prenoms": "Jean",
"nom": "Dupont",
"role": "Gérant",
"city_id": 75001,
"company_ids": [678, 789],
"sole_proprietorship_company": {
"id": 678,
"name": "Jean Dupont",
"siren": "123456789"
}
},
{
"type": "company",
"id": 999,
"score": 12.3,
"name": "Dupont & Associates",
"siren": "987654321",
"legal_status_id": 5,
"naf_id": 123,
"city_id": 75002,
"entrepreneur_ids": [12345, 54321],
"is_sole_proprietorship": false
}
],
"meta": {
"total": 145,
"page": 1,
"per_page": 20,
"total_pages": 8,
"counts": {
"entrepreneurs": 67,
"companies": 78
},
"query_time_ms": 145
}
}

Error Response:

{
"results": [],
"meta": {
"total": 0,
"page": 1,
"per_page": 20,
"total_pages": 0,
"counts": {
"entrepreneurs": 0,
"companies": 0
},
"errors": ["Search failed: connection timeout"]
}
}

HTTP Status Codes:

  • 200 OK - Successful search
  • 206 Partial Content - Search completed with partial errors
  • 400 Bad Request - Invalid query parameters
  • 429 Too Many Requests - Rate limit exceeded
  • 503 Service Unavailable - Search service down

Relevance Ranking Algorithm

Elasticsearch Scoring

Entrepreneur Queries:

{
should: [
{ match: { "full_name.keyword": { query: "...", boost: 10.0 } } }, # Exact match
{ match_phrase: { full_name: { query: "...", boost: 5.0 } } }, # Phrase match
{ multi_match: { query: "...", fields: ["full_name^3", "prenoms^2", "nom^2"], boost: 3.0 } }, # Multi-field
{ multi_match: { query: "...", fields: ["*.autocomplete"], boost: 1.0 } } # Prefix
]
}

Company Queries:

{
should: [
{ match: { "name.keyword": { query: "...", boost: 10.0 } } }, # Exact match
{ match_phrase: { name: { query: "...", boost: 5.0 } } }, # Phrase match
{ match: { name: { query: "...", boost: 3.0 } } }, # Full-text
{ match: { "name.autocomplete": { query: "...", boost: 1.0 } } }, # Prefix
{ term: { siren: { value: "...", boost: 15.0 } } } # SIREN exact
]
}
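
The two query shapes above could be sent in one round trip through the msearch API. The following is a minimal sketch, not the service's actual code: the helper names (`entrepreneur_query`, `company_query`, `build_msearch_body`) are hypothetical, the clause lists are abbreviated, and wrapping the `should` clauses in a `bool` query is an assumption about how they are used. Index names follow the `entrepreneurs_#{Rails.env}` convention from this guide.

```ruby
# Hypothetical helpers: build the alternating header/body array that the
# Elasticsearch _msearch API expects. Clause lists are abbreviated.
def entrepreneur_query(text)
  { bool: { should: [
    { match: { "full_name.keyword" => { query: text, boost: 10.0 } } },
    { match_phrase: { full_name: { query: text, boost: 5.0 } } }
  ] } }
end

def company_query(text)
  { bool: { should: [
    { match: { "name.keyword" => { query: text, boost: 10.0 } } },
    { term: { siren: { value: text, boost: 15.0 } } }
  ] } }
end

# env stands in for Rails.env so the sketch runs outside Rails.
def build_msearch_body(text, env: "development", size: 30)
  [
    { index: "entrepreneurs_#{env}" },
    { query: entrepreneur_query(text), size: size },
    { index: "companies_#{env}" },
    { query: company_query(text), size: size }
  ]
end
```

With the elasticsearch-ruby client, a body like this would typically be passed as `client.msearch(body: build_msearch_body("Jean Dupont"))`; the responses come back in the same order as the headers.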

Result Merging

  1. Fetch: Top 30 results from each index (over-fetching for better merging)
  2. Deduplicate: Remove duplicate sole proprietorships
  3. Sort: By Elasticsearch _score (descending)
  4. Paginate: Apply requested page/per_page
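
The sort and paginate steps above can be sketched as follows. This is an illustration under assumptions: `merge_and_paginate` is a hypothetical name, hits are plain hashes with `:type`, `:id` and `:score` keys, deduplication (step 2) is assumed to have run already, and `total` here counts only the fetched hits rather than the index-wide totals the API reports.

```ruby
# Hypothetical sketch of steps 3 and 4: sort merged hits by score, then slice
# out the requested page.
def merge_and_paginate(entrepreneur_hits, company_hits, page: 1, per_page: 20)
  merged = (entrepreneur_hits + company_hits).sort_by { |hit| -hit[:score] }
  offset = (page - 1) * per_page
  {
    # Array#[] returns nil when offset is past the end, hence the fallback.
    results: merged[offset, per_page] || [],
    meta: {
      total: merged.size,
      page: page,
      per_page: per_page,
      total_pages: (merged.size.to_f / per_page).ceil
    }
  }
end
```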

Sole Proprietorship Logic

Detection Criteria

A company is marked as a sole proprietorship when:

  1. Single entrepreneur: entrepreneur_ids.length == 1
  2. Legal status: Legal status ID is 2 (EI) or 42 (EIRL)
  3. Name match: Company name == Entrepreneur name (case-insensitive)
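
As a rough illustration, the criteria above could be expressed as a single predicate. This sketch assumes all three criteria must hold together (the list does not say whether they are combined with AND or OR), and the method and constant names are hypothetical.

```ruby
# Legal status IDs taken from this guide: 2 (EI) and 42 (EIRL).
SOLE_PROPRIETORSHIP_STATUS_IDS = [2, 42].freeze

# Hypothetical predicate. Assumption: all three criteria are required.
# company is a hash with :name, :legal_status_id and :entrepreneur_ids.
def sole_proprietorship?(company, entrepreneur_name)
  company[:entrepreneur_ids].length == 1 &&
    SOLE_PROPRIETORSHIP_STATUS_IDS.include?(company[:legal_status_id]) &&
    company[:name].to_s.strip.casecmp?(entrepreneur_name.to_s.strip)
end
```

`String#casecmp?` applies Unicode case folding, so accented names compare case-insensitively as well.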

Deduplication Strategy

When both an entrepreneur and their sole proprietorship company appear in results:

  1. Keep the entrepreneur result
  2. Add sole_proprietorship_company embedded data
  3. Remove the company result from the list

Why? Users searching for "Jean Dupont" want to see the person first, with easy access to their company details.
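
The deduplication strategy above might look like the sketch below. It assumes results are hashes shaped like the API response, that the pre-computed `is_sole_proprietorship` flag is trusted, and that a sole proprietorship's only entry in `entrepreneur_ids` is its owner. The method name is hypothetical.

```ruby
# Hypothetical sketch: fold each sole proprietorship company into its owner's
# entrepreneur result when both are present, without mutating the input.
def deduplicate_sole_proprietorships(results)
  entrepreneur_ids = results.select { |r| r[:type] == "entrepreneur" }.map { |r| r[:id] }

  # owner id => company, only for owners that also appear in the results
  absorbed = results.each_with_object({}) do |r, acc|
    next unless r[:type] == "company" && r[:is_sole_proprietorship]
    owner_id = r[:entrepreneur_ids].first
    acc[owner_id] = r if entrepreneur_ids.include?(owner_id)
  end

  results.filter_map do |r|
    if r[:type] == "entrepreneur" && absorbed.key?(r[:id])
      r.merge(sole_proprietorship_company: absorbed[r[:id]].slice(:id, :name, :siren))
    elsif r[:type] == "company" && absorbed.value?(r)
      nil # dropped: now embedded in the entrepreneur result
    else
      r
    end
  end
end
```

A sole proprietorship whose owner is not in the result set is kept as a normal company result, since there is nothing to merge it into.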


Reindexing Operations

Initial Setup

After deploying this feature, reindex all records to populate relationship fields:

# Reindex all entrepreneurs
rails elasticsearch:relationships:entrepreneurs

# Reindex all companies
rails elasticsearch:relationships:companies

# Or reindex both in parallel
rails elasticsearch:relationships:all

Progress Monitoring

# Check progress
rails elasticsearch:relationships:progress[Entrepreneur]
rails elasticsearch:relationships:progress[Company]

# Example output:
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Entrepreneur Reindex Progress
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Total records: 30000000
# Processed: 15000000
# Errors: 127
# Progress: 50.0%
#
# Started: 2025-11-13 10:00:00
# Last updated: 2025-11-13 12:30:15
# Elapsed: 2h 30m 15s
# Est. remaining: 2h 30m 15s
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
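
The percentage and remaining-time figures in the example output are consistent with a simple linear extrapolation. The sketch below shows that arithmetic; how the rake task actually computes its estimate is not documented here, so the helper names and the constant-throughput assumption are illustrative only.

```ruby
# Hypothetical helper mirroring the fields of the example output above.
def reindex_progress(total:, processed:, elapsed_seconds:)
  percent = total.zero? ? 0.0 : (processed * 100.0 / total).round(1)
  # Linear extrapolation: assumes the throughput so far stays constant.
  remaining =
    if processed.zero?
      nil # no estimate possible before the first batch completes
    else
      (elapsed_seconds * (total - processed) / processed.to_f).round
    end
  { percent: percent, remaining_seconds: remaining }
end

# Formats a number of seconds as "2h 30m 15s".
def format_duration(seconds)
  hours, rest = seconds.divmod(3600)
  minutes, secs = rest.divmod(60)
  "#{hours}h #{minutes}m #{secs}s"
end
```

For the example above, 15,000,000 of 30,000,000 records after 9,015 seconds gives 50.0% and an estimated 9,015 seconds (2h 30m 15s) remaining.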

Incremental Reindexing

For reindexing specific ID ranges (useful for parallelization):

# Reindex entrepreneurs with IDs 1M to 5M
rails elasticsearch:relationships:range[Entrepreneur,1000000,5000000]

# Reindex companies with IDs 10M to 15M
rails elasticsearch:relationships:range[Company,10000000,15000000]
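
To parallelize a full reindex, the ID space can be split into contiguous ranges with one `range` invocation each. A minimal sketch, assuming the task treats both bounds as inclusive (not confirmed by this guide); `id_ranges` and `range_commands` are hypothetical helpers.

```ruby
# Hypothetical helper: split [min_id, max_id] into contiguous chunks.
# Assumption: the rake task's range bounds are inclusive.
def id_ranges(min_id, max_id, chunk_size)
  (min_id..max_id).step(chunk_size).map do |start|
    [start, [start + chunk_size - 1, max_id].min]
  end
end

# Builds one rake command string per range.
def range_commands(model, ranges)
  ranges.map do |from, to|
    "rails elasticsearch:relationships:range[#{model},#{from},#{to}]"
  end
end

# Example: six workers for 30M entrepreneurs, 5M IDs each.
puts range_commands("Entrepreneur", id_ranges(1, 30_000_000, 5_000_000))
```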

Performance

Expected throughput:

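  • ~500-1000 records/second per index (depending on DB/ES load)
  • Total time for full reindex: ~8-16 hours (30M + 27M records), assuming both indexes are reindexed in parallel; run sequentially, 57M records at this rate take roughly 16-32 hours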
  • Memory usage: ~100MB per worker

Performance Optimization

Query-Level Optimizations

1. Result Limiting

limit_per_index = 30  # Fetch top 30 from each index
# After merging, return only requested per_page (typically 20)

2. Source Filtering

_source: ["full_name", "city_id", "company_ids"]  # Only needed fields
# Reduces network payload by 60-80%

3. Timeout Protection

timeout: "500ms"  # Stop collecting hits after 500ms (best effort)
# Prevents queue backup; Elasticsearch returns the hits gathered so far and sets timed_out: true

4. Disable Total Count (for deep pagination)

track_total_hits: false  # For pages beyond 1
# Skips counting total hits, which is costly on large indexes
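
Taken together, the four options above might be assembled per index like this. It is a sketch under assumptions: `search_options` is a hypothetical helper, and on page 1 the `track_total_hits` key is simply omitted so that Elasticsearch's default applies.

```ruby
# Hypothetical helper combining the query-level options for one index.
def search_options(page:, source_fields:, limit_per_index: 30)
  options = {
    size: limit_per_index,   # 1. result limiting
    _source: source_fields,  # 2. source filtering
    timeout: "500ms"         # 3. timeout protection
  }
  # 4. skip hit counting on deeper pages only
  options[:track_total_hits] = false if page > 1
  options
end

search_options(page: 2, source_fields: %w[full_name city_id company_ids])
```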

Caching Strategy

1. Redis Query Cache

# NOTE: this key does not include filters; add them to the key if filtered searches are cached
cache_key = "unified_search:#{query}:#{page}:#{per_page}"
Rails.cache.fetch(cache_key, expires_in: 5.minutes) { ... }
  • Popular queries cached for 5 minutes
  • Estimated hit rate: 30-40%
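
A cache key only helps if equivalent requests map to the same key and different requests do not collide. The sketch below is one way to build such a key; the normalization rules, the filter digest, and the name `unified_search_cache_key` are assumptions rather than the service's actual behavior.

```ruby
require "digest"
require "json"

# Hypothetical key builder. Assumptions: queries are compared case- and
# whitespace-insensitively, and nil filters are equivalent to absent ones.
def unified_search_cache_key(query, page, per_page, filters = {})
  normalized = query.to_s.strip.downcase.squeeze(" ")
  key = "unified_search:#{normalized}:#{page}:#{per_page}"

  # Sort filters by name so hash ordering does not change the key.
  active = filters.compact.sort_by { |name, _| name.to_s }.to_h
  return key if active.empty?

  "#{key}:#{Digest::SHA256.hexdigest(JSON.generate(active))[0, 12]}"
end
```

Without filters, the result keeps the `unified_search:<query>:<page>:<per_page>` shape, so `unified_search_cache_key("Jean Dupont", 1, 20)` yields `unified_search:jean dupont:1:20`.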

2. Elasticsearch Request Cache

settings index: {
requests: { cache: { enable: true } }
}
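  • Caches identical requests at shard level
  • By default only size: 0 requests (counts, aggregations) are cached; searches that return hits need request_cache=true on the request
  • Automatic invalidation when the index refreshes with changed data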

Rate Limiting

RATE_LIMIT_MAX_REQUESTS = 30  # per minute per IP
RATE_LIMIT_WINDOW = 1.minute
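
For illustration, a fixed-window limiter with these settings could look like the sketch below. It is not the controller's implementation: it counts in an in-memory Hash rather than `Rails.cache`, appends the window number to the key instead of relying on a TTL, never evicts old windows, and takes an injectable clock so it can be exercised without waiting. The class name is hypothetical.

```ruby
# Hypothetical in-memory fixed-window rate limiter (sketch only).
class FixedWindowRateLimiter
  def initialize(max_requests: 30, window_seconds: 60, clock: -> { Time.now.to_f })
    @max_requests = max_requests
    @window_seconds = window_seconds
    @clock = clock
    @counters = Hash.new(0)
  end

  # Returns true if the request is allowed, false once the limit is exceeded.
  def allow?(ip)
    window = (@clock.call / @window_seconds).floor
    key = "search_rate_limit:#{ip}:#{window}"
    @counters[key] += 1
    @counters[key] <= @max_requests
  end
end

limiter = FixedWindowRateLimiter.new
limiter.allow?("203.0.113.7") # true for the first 30 calls in a window
```

A fixed window allows short bursts of up to twice the limit across a window boundary; a sliding window avoids that at the cost of more bookkeeping.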

Bypass:

# Disable rate limiting in test env
DISABLE_RATE_LIMIT=true rails s

Pagination Limits

  • Max page: 100 (to prevent deep pagination performance issues)
  • Max per_page: 100
  • For results beyond page 100, use search_after cursor (future enhancement)
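
These limits could be enforced by normalizing the incoming parameters before querying. The sketch below clamps out-of-range values; whether the API clamps or instead rejects them with 400 Bad Request is not stated in this guide, so treat the clamping behavior, the default of 20, and the names as assumptions.

```ruby
MAX_PAGE = 100
MAX_PER_PAGE = 100
DEFAULT_PER_PAGE = 20 # assumption, based on the examples in this guide

# Hypothetical helper: coerce and clamp pagination parameters.
def normalize_pagination(page, per_page)
  page = page.to_i         # nil and non-numeric strings become 0
  per_page = per_page.to_i
  per_page = DEFAULT_PER_PAGE if per_page <= 0
  {
    page: page.clamp(1, MAX_PAGE),
    per_page: [per_page, MAX_PER_PAGE].min
  }
end
```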

Monitoring & Alerts

Performance Targets

Metric              Target     Alert Threshold
P50 Response Time   < 150ms    > 300ms
P95 Response Time   < 500ms    > 1s
P99 Response Time   < 1s       > 2s
Cache Hit Rate      > 30%      < 20%
ES Timeout Rate     < 1%       > 5%
Error Rate          < 0.1%     > 1%
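
The percentile targets above are usually evaluated by a metrics system, but the calculation is simple enough to sketch. This example uses the nearest-rank method over a list of response times in milliseconds; the method and constant names are hypothetical, and the 1s alert threshold comes from the P95 row.

```ruby
P95_ALERT_THRESHOLD_MS = 1000 # from the P95 row: alert when > 1s

# Nearest-rank percentile: the smallest sample such that at least pct percent
# of all samples are less than or equal to it.
def percentile(samples, pct)
  return nil if samples.empty?
  sorted = samples.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical check for the P95 alert condition.
def p95_alert?(response_times_ms)
  p95 = percentile(response_times_ms, 95)
  !p95.nil? && p95 > P95_ALERT_THRESHOLD_MS
end
```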

Key Metrics to Track

1. Search Performance

# Log in controller
Rails.logger.info(
"Search: query=#{params[:query].inspect} " \
"time=#{elapsed}ms " \
"results=#{data[:meta][:total]} " \
"page=#{params[:page]} " \
"cache_hit=#{cache_hit}"
)

2. Elasticsearch Health

# Monitor cluster health
GET /_cluster/health

# Monitor index stats
GET /entrepreneurs_production,companies_production/_stats

3. Rate Limiting

# Monitor rate limit hits
Rails.cache.read("search_rate_limit:#{ip}")

Logging

Search Queries:

[INFO] Search: query="Jean Dupont" time=145ms results=67 page=1 cache_hit=false

Errors:

[ERROR] UnifiedSearchService error: Connection timeout
/app/services/unified_search_service.rb:45

Reindexing Progress:

[INFO] Reindex progress for Entrepreneur: 15000000/30000000 (50.0%)

Testing

RSpec Tests

Service Tests (spec/services/unified_search_service_spec.rb)

  • Search across both indexes
  • Result merging and ranking
  • Sole proprietorship deduplication
  • Error handling
  • Caching behavior

Controller Tests (spec/controllers/api/search_controller_spec.rb)

  • Endpoint response format
  • Rate limiting
  • Error responses
  • Authentication

Job Tests (spec/jobs/reindex_relationships_job_spec.rb)

  • Batch processing
  • Progress tracking
  • Error recovery

Manual Testing

1. Test Search Query

curl -X POST http://localhost:3000/api/search/unified \
-H "Content-Type: application/json" \
-d '{"query": "Jean Dupont", "page": 1, "per_page": 20}'

2. Test Rate Limiting

# Send 31 requests rapidly and print only the HTTP status of each
for i in {1..31}; do
curl -s -o /dev/null -w "%{http_code}\n" -X POST http://localhost:3000/api/search/unified \
-H "Content-Type: application/json" \
-d '{"query": "test"}'
done
# The 31st request should print 429 (Too Many Requests)

3. Test React UI

cd admin/
npm run dev
# Navigate to http://localhost:3001/search
# Type a query and verify results display correctly

Troubleshooting

Common Issues

1. No results found

  • Check: Elasticsearch indexes exist and have data
  • Fix: Run rails elasticsearch:relationships:all to reindex

2. Slow queries (> 1s)

  • Check: Elasticsearch cluster health (GET /_cluster/health)
  • Fix: Increase cluster resources or add replicas

3. Rate limit errors

  • Check: Multiple users sharing same IP?
  • Fix: Adjust RATE_LIMIT_MAX_REQUESTS or implement user-based limiting

4. Sole proprietorships showing twice

  • Check: has_sole_proprietorship and is_sole_proprietorship flags are correct
  • Fix: Re-run reindexing for affected records

5. Frontend errors

  • Check: API endpoint returning correct data format
  • Fix: Verify TypeScript types match API response

Debug Queries

View Elasticsearch Query:

# In development, queries are logged to console
rails s
# Search from UI, check terminal for pretty-printed JSON query

Test Elasticsearch Directly:

curl -X GET "localhost:9200/entrepreneurs_production/_search?pretty" \
-H 'Content-Type: application/json' \
-d '{
"query": {
"match": { "full_name": "Jean Dupont" }
}
}'

Check Cache:

rails console
> Rails.cache.read("unified_search:jean dupont:1:20")

Future Enhancements

Phase 2 Features

  1. Advanced Filters

    • Filter by entrepreneur role
    • Filter by company legal status
    • Filter by NAF code
    • Filter by city/department
  2. Search Analytics

    • Track popular queries
    • Measure search quality (clicks on results)
    • A/B test ranking algorithms
  3. Performance Improvements

    • Implement search_after for deep pagination
    • Add search result highlighting
    • Prefetch related data (companies for entrepreneurs)
  4. User Experience

    • Autocomplete suggestions
    • Recent searches
    • Saved searches
    • Export results to CSV


Support

For questions or issues:

  1. Check Elasticsearch cluster health
  2. Review application logs
  3. Monitor Redis cache
  4. Check reindexing job progress
  5. Test API endpoint directly

Performance issues? Check the Performance Optimization section above.