Unified Search Implementation Guide
Overview
The Unified Search feature provides a single search interface that simultaneously searches across both the Entrepreneur and Company Elasticsearch indexes, with intelligent result merging, relevance ranking, and sole proprietorship handling.
Key Features:
- Single search query across 30M entrepreneurs + 27M companies
- Real-time result merging with relevance scoring
- Sole proprietorship deduplication
- Sub-second response times (P95 < 500ms)
- Rate limiting and caching for performance
- React-based search UI with live results
Architecture
Backend Components
1. UnifiedSearchService (app/services/unified_search_service.rb)
- Coordinates multi-index search via Elasticsearch msearch API
- Merges and ranks results by relevance score
- Handles sole proprietorship deduplication
- Implements caching (5-minute TTL)
2. API Controller (app/controllers/api/search_controller.rb)
- Endpoint: POST /api/search/unified
- Rate limiting: 30 requests/minute per IP
- Error handling with partial results support
3. Background Job (app/jobs/reindex_relationships_job.rb)
- Bulk reindexing of relationship fields
- Progress tracking via Redis
- Batch processing (1000 records/batch)
- Error recovery and logging
Frontend Components
1. Search Page (admin/src/pages/Search.tsx)
- Full-page search interface
- Debounced search input (500ms)
- Paginated results display
- Type indicators (entrepreneur vs company)
- Clickable results → detail pages
2. Search Service (admin/src/services/search.ts)
- TypeScript API client
- Request/response type definitions
- Error handling
3. Custom Hook (admin/src/hooks/index.ts::useUnifiedSearch)
- React Query integration
- Automatic caching (3-minute stale time)
- Loading and error states
Database Schema Changes
Entrepreneur Index
# New fields added to entrepreneurs_#{Rails.env}
indexes :company_ids, type: :integer # Array of related company IDs
indexes :has_sole_proprietorship, type: :boolean # Pre-computed flag
Company Index
# New fields added to companies_#{Rails.env}
indexes :entrepreneur_ids, type: :integer # Array of stakeholder IDs
indexes :is_sole_proprietorship, type: :boolean # Pre-computed flag
Indexing Methods:
- Entrepreneur#as_indexed_json - Populates company_ids and has_sole_proprietorship
- Company#as_indexed_json - Populates entrepreneur_ids and is_sole_proprietorship
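As a rough illustration of what these methods might contribute to the indexed document, the sketch below builds the two relationship fields for an entrepreneur. It uses plain Struct stand-ins instead of the real ActiveRecord models; the `companies` association and the rule used to derive `has_sole_proprietorship` are assumptions, not the actual implementation.

```ruby
# Hypothetical stand-ins for the real models, which are ActiveRecord classes.
CompanyStub = Struct.new(:id, :is_sole_proprietorship)
EntrepreneurStub = Struct.new(:id, :full_name, :companies)

# Builds the relationship part of the entrepreneur document.
# Assumption: the flag is true when any related company is a sole proprietorship.
def entrepreneur_relationship_fields(entrepreneur)
  {
    company_ids: entrepreneur.companies.map(&:id),
    has_sole_proprietorship: entrepreneur.companies.any?(&:is_sole_proprietorship)
  }
end

jean = EntrepreneurStub.new(
  12_345, "Jean Dupont",
  [CompanyStub.new(678, true), CompanyStub.new(789, false)]
)
fields = entrepreneur_relationship_fields(jean)
# fields => { company_ids: [678, 789], has_sole_proprietorship: true }
```

In the real model these keys would presumably be merged into the hash returned by as_indexed_json alongside the existing fields.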
API Reference
POST /api/search/unified
Request Body:
{
"query": "Jean Dupont",
"page": 1,
"per_page": 20,
"filters": {
"entity_type": "all",
"city_id": null,
"department_id": null,
"naf_id": null
}
}
Response:
{
"results": [
{
"type": "entrepreneur",
"id": 12345,
"score": 15.8,
"full_name": "Jean Dupont",
"prenoms": "Jean",
"nom": "Dupont",
"role": "Gérant",
"city_id": 75001,
"company_ids": [678, 789],
"sole_proprietorship_company": {
"id": 678,
"name": "Jean Dupont",
"siren": "123456789"
}
},
{
"type": "company",
"id": 999,
"score": 12.3,
"name": "Dupont & Associates",
"siren": "987654321",
"legal_status_id": 5,
"naf_id": 123,
"city_id": 75002,
"entrepreneur_ids": [12345, 54321],
"is_sole_proprietorship": false
}
],
"meta": {
"total": 145,
"page": 1,
"per_page": 20,
"total_pages": 8,
"counts": {
"entrepreneurs": 67,
"companies": 78
},
"query_time_ms": 145
}
}
Error Response:
{
"results": [],
"meta": {
"total": 0,
"page": 1,
"per_page": 20,
"total_pages": 0,
"counts": {
"entrepreneurs": 0,
"companies": 0
},
"errors": ["Search failed: connection timeout"]
}
}
HTTP Status Codes:
- 200 OK - Successful search
- 206 Partial Content - Search completed with partial errors
- 400 Bad Request - Invalid query parameters
- 429 Too Many Requests - Rate limit exceeded
- 503 Service Unavailable - Search service down
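For scripts that need to call the endpoint outside the React UI, a minimal Ruby client could look like the sketch below. It relies only on the standard library; the helper names (`build_unified_search_request`, `unified_search`) and the error handling choices are illustrative assumptions, and any authentication the API requires is left out.

```ruby
require "json"
require "net/http"
require "uri"

# Builds the POST request without sending it, so it can be inspected or tested.
def build_unified_search_request(base_url, query, page: 1, per_page: 20, filters: {})
  uri = URI.join(base_url, "/api/search/unified")
  request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
  request.body = JSON.generate(query: query, page: page, per_page: per_page, filters: filters)
  [uri, request]
end

# Sends the request and maps the documented status codes to outcomes.
def unified_search(base_url, query, **options)
  uri, request = build_unified_search_request(base_url, query, **options)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end

  case response.code.to_i
  when 200, 206 then JSON.parse(response.body) # 206: check meta.errors for partial failures
  when 429 then raise "Rate limit exceeded, retry later"
  else raise "Search failed with HTTP #{response.code}"
  end
end
```

Calling `unified_search("http://localhost:3000", "Jean Dupont")` against a running server should return the parsed response hash with "results" and "meta" keys.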
Relevance Ranking Algorithm
Elasticsearch Scoring
Entrepreneur Queries:
{
  should: [
    { match: { "full_name.keyword": { query: "...", boost: 10.0 } } }, # Exact match
    { match_phrase: { full_name: { query: "...", boost: 5.0 } } }, # Phrase match
    { multi_match: { query: "...", fields: ["full_name^3", "prenoms^2", "nom^2"], boost: 3.0 } }, # Multi-field
    { multi_match: { query: "...", fields: ["*.autocomplete"], boost: 1.0 } } # Prefix
  ]
}
Company Queries:
{
  should: [
    { match: { "name.keyword": { query: "...", boost: 10.0 } } }, # Exact match
    { match_phrase: { name: { query: "...", boost: 5.0 } } }, # Phrase match
    { match: { name: { query: "...", boost: 3.0 } } }, # Full-text
    { match: { "name.autocomplete": { query: "...", boost: 1.0 } } }, # Prefix
    { term: { siren: { value: "...", boost: 15.0 } } } # SIREN exact
  ]
}
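To show how the two clause lists might be combined into one multi-search call, the sketch below assembles an msearch body as alternating header and query hashes. Wrapping the clauses in `bool` with `minimum_should_match: 1`, the method names, and the `env` parameter are assumptions; the boosts are the ones listed above.

```ruby
# Entrepreneur clauses wrapped in a bool query (the wrapper is an assumption).
def entrepreneur_query(text)
  { bool: { minimum_should_match: 1, should: [
    { match: { "full_name.keyword": { query: text, boost: 10.0 } } },
    { match_phrase: { full_name: { query: text, boost: 5.0 } } },
    { multi_match: { query: text, fields: ["full_name^3", "prenoms^2", "nom^2"], boost: 3.0 } },
    { multi_match: { query: text, fields: ["*.autocomplete"], boost: 1.0 } }
  ] } }
end

# Company clauses, including the high-boost exact SIREN match.
def company_query(text)
  { bool: { minimum_should_match: 1, should: [
    { match: { "name.keyword": { query: text, boost: 10.0 } } },
    { match_phrase: { name: { query: text, boost: 5.0 } } },
    { match: { name: { query: text, boost: 3.0 } } },
    { match: { "name.autocomplete": { query: text, boost: 1.0 } } },
    { term: { siren: { value: text, boost: 15.0 } } }
  ] } }
end

# msearch expects a header line followed by a body line for each search.
def unified_msearch_body(text, env:, size: 30)
  [
    { index: "entrepreneurs_#{env}" },
    { query: entrepreneur_query(text), size: size },
    { index: "companies_#{env}" },
    { query: company_query(text), size: size }
  ]
end

body = unified_msearch_body("Jean Dupont", env: "development")
# With the elasticsearch-ruby client this would be sent as client.msearch(body: body).
```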
Result Merging
- Fetch: Top 30 results from each index (over-fetching for better merging)
- Deduplicate: Remove duplicate sole proprietorships
- Sort: By Elasticsearch _score (descending)
- Paginate: Apply requested page/per_page
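The sort and paginate steps can be sketched in a few lines of Ruby. This is a simplified illustration, not the service's code: hits are assumed to be hashes with a `:score` key, deduplication is assumed to have happened already, and ties keep their input order because `sort_by` alone is not stable.

```ruby
# Merges hits from both indexes, sorts by score (descending), and returns one page.
def merge_and_paginate(entrepreneur_hits, company_hits, page: 1, per_page: 20)
  merged = (entrepreneur_hits + company_hits)
           .each_with_index
           .sort_by { |hit, index| [-hit[:score], index] } # index breaks ties deterministically
           .map(&:first)

  # Array#slice returns nil when the offset is past the end, hence the fallback.
  merged.slice((page - 1) * per_page, per_page) || []
end

entrepreneurs = [{ type: "entrepreneur", id: 12_345, score: 15.8 }]
companies = [{ type: "company", id: 999, score: 12.3 }]
first_page = merge_and_paginate(entrepreneurs, companies, page: 1, per_page: 20)
# first_page lists the entrepreneur (15.8) before the company (12.3)
```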
Sole Proprietorship Logic
Detection Criteria
A company is marked as a sole proprietorship when:
- Single entrepreneur: entrepreneur_ids.length == 1
- Legal status: Legal status ID is 2 (EI) or 42 (EIRL)
- Name match: Company name == Entrepreneur name (case-insensitive)
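A predicate for these criteria might look like the sketch below. It assumes all three conditions must hold together, which the list above does not state explicitly; the method and constant names are hypothetical.

```ruby
SOLE_PROPRIETORSHIP_LEGAL_STATUS_IDS = [2, 42].freeze # EI, EIRL

# Returns true when the company matches every detection criterion.
# Assumption: the criteria are combined with AND.
def sole_proprietorship?(company_name:, legal_status_id:, entrepreneur_ids:, entrepreneur_name:)
  return false unless entrepreneur_ids.length == 1
  return false unless SOLE_PROPRIETORSHIP_LEGAL_STATUS_IDS.include?(legal_status_id)

  # casecmp? applies Unicode case folding, so accented names compare correctly.
  company_name.to_s.strip.casecmp?(entrepreneur_name.to_s.strip) == true
end

sole_proprietorship?(
  company_name: "JEAN DUPONT", legal_status_id: 2,
  entrepreneur_ids: [12_345], entrepreneur_name: "Jean Dupont"
)
# => true
```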
Deduplication Strategy
When both an entrepreneur and their sole proprietorship company appear in results:
- Keep the entrepreneur result
- Add sole_proprietorship_company embedded data
- Remove the company result from the list
Why? Users searching for "Jean Dupont" want to see the person first, with easy access to their company details.
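One possible way to implement this strategy on the merged result list is sketched below. Results are assumed to be hashes shaped like the API response items with symbol keys; the method name is hypothetical. A sole proprietorship whose owner is not in the result list is left untouched.

```ruby
# Folds sole proprietorship companies into their owner's entrepreneur result.
def deduplicate_sole_proprietorships(results)
  # Index the sole proprietorship companies present in this result set by ID.
  sole_companies = results.each_with_object({}) do |result, by_id|
    by_id[result[:id]] = result if result[:type] == "company" && result[:is_sole_proprietorship]
  end

  absorbed_ids = []
  merged = results.map do |result|
    next result unless result[:type] == "entrepreneur"

    company = Array(result[:company_ids]).map { |id| sole_companies[id] }.compact.first
    next result unless company

    absorbed_ids << company[:id]
    result.merge(sole_proprietorship_company: company.slice(:id, :name, :siren))
  end

  # Drop only the companies that were embedded into an entrepreneur result.
  merged.reject { |result| result[:type] == "company" && absorbed_ids.include?(result[:id]) }
end
```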
Reindexing Operations
Initial Setup
After deploying this feature, reindex all records to populate relationship fields:
# Reindex all entrepreneurs
rails elasticsearch:relationships:entrepreneurs
# Reindex all companies
rails elasticsearch:relationships:companies
# Or reindex both in parallel
rails elasticsearch:relationships:all
Progress Monitoring
# Check progress
rails elasticsearch:relationships:progress[Entrepreneur]
rails elasticsearch:relationships:progress[Company]
# Example output:
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Entrepreneur Reindex Progress
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Total records: 30000000
# Processed: 15000000
# Errors: 127
# Progress: 50.0%
#
# Started: 2025-11-13 10:00:00
# Last updated: 2025-11-13 12:30:15
# Elapsed: 2h 30m 15s
# Est. remaining: 2h 30m 15s
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
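The percentage and time estimate in this output can be derived from three values: total, processed, and start time. The sketch below shows one way to compute them, assuming a constant processing rate; the method names are hypothetical and the Redis reads are left out.

```ruby
# Computes progress figures from raw counters (linear extrapolation for the estimate).
def reindex_progress(total:, processed:, started_at:, now:)
  elapsed = now - started_at # Time - Time gives seconds as a Float
  percent = total.zero? ? 0.0 : (processed.to_f / total * 100).round(1)
  remaining = processed.zero? ? nil : (elapsed / processed * (total - processed)).round
  { percent: percent, elapsed_seconds: elapsed.round, remaining_seconds: remaining }
end

# Formats a number of seconds as "2h 30m 15s".
def format_duration(seconds)
  "#{seconds / 3600}h #{(seconds % 3600) / 60}m #{seconds % 60}s"
end

progress = reindex_progress(
  total: 30_000_000, processed: 15_000_000,
  started_at: Time.utc(2025, 11, 13, 10, 0, 0),
  now: Time.utc(2025, 11, 13, 12, 30, 15)
)
format_duration(progress[:elapsed_seconds]) # => "2h 30m 15s"
```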
Incremental Reindexing
For reindexing specific ID ranges (useful for parallelization):
# Reindex entrepreneurs with IDs 1M to 5M
rails elasticsearch:relationships:range[Entrepreneur,1000000,5000000]
# Reindex companies with IDs 10M to 15M
rails elasticsearch:relationships:range[Company,10000000,15000000]
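To parallelize a full reindex with the range task, the ID space has to be split into non-overlapping chunks. The helper below is an illustrative sketch that generates one command per chunk; it assumes both range bounds are inclusive, which should be checked against the rake task before use.

```ruby
# Splits [min_id, max_id] into consecutive inclusive chunks of at most chunk_size IDs.
def id_ranges(min_id, max_id, chunk_size)
  raise ArgumentError, "chunk_size must be positive" unless chunk_size.positive?

  (min_id..max_id).step(chunk_size).map do |start|
    [start, [start + chunk_size - 1, max_id].min]
  end
end

# Builds one rake invocation per chunk (assumes inclusive bounds in the task).
def range_commands(model, min_id, max_id, chunk_size)
  id_ranges(min_id, max_id, chunk_size).map do |from, to|
    "rails elasticsearch:relationships:range[#{model},#{from},#{to}]"
  end
end

puts range_commands("Entrepreneur", 1, 30_000_000, 5_000_000)
```

Each printed command can then be run in its own worker or terminal.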
Performance
Expected throughput:
- ~500-1000 records/second (depending on DB/ES load)
- Total time for full reindex: ~8-16 hours with both indexes reindexed in parallel (30M + 27M records); roughly double if they run one after the other
- Memory usage: ~100MB per worker
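The arithmetic behind these estimates is easy to check. The snippet below converts a record count and a throughput into hours, using only the figures quoted above: the larger index alone takes about 8 to 17 hours, while all 57M records processed by a single job at the same rate take about 16 to 32 hours.

```ruby
# Hours needed to process `records` at a steady `records_per_second`.
def reindex_hours(records, records_per_second)
  (records / records_per_second.to_f / 3600).round(1)
end

reindex_hours(30_000_000, 1000) # => 8.3  (entrepreneurs, fast case)
reindex_hours(30_000_000, 500)  # => 16.7 (entrepreneurs, slow case)
reindex_hours(57_000_000, 1000) # => 15.8 (both indexes sequentially, fast case)
reindex_hours(57_000_000, 500)  # => 31.7 (both indexes sequentially, slow case)
```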
Performance Optimization
Query-Level Optimizations
1. Result Limiting
limit_per_index = 30 # Fetch top 30 from each index
# After merging, return only requested per_page (typically 20)
2. Source Filtering
_source: ["full_name", "city_id", "company_ids"] # Only needed fields
# Reduces network payload by 60-80%
3. Timeout Protection
timeout: "500ms" # Kill slow queries
# Prevents queue backup, returns partial results
4. Disable Total Count (for deep pagination)
track_total_hits: false # For pages beyond 1
# Skips hit counting; the response then carries no total hit count
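Taken together, the four options could be assembled per request roughly as follows. This is a sketch under assumptions: the method name and parameters are hypothetical, and `track_total_hits` is left at the server default on the first page so a total is still available there.

```ruby
# Builds the per-index search options described above.
def search_options(page:, source_fields:, limit_per_index: 30)
  options = {
    size: limit_per_index,  # over-fetch for merging
    _source: source_fields, # return only the fields the UI needs
    timeout: "500ms"        # give up on slow shards and return partial results
  }
  # Skip hit counting on later pages, where the total is already known.
  options[:track_total_hits] = false if page > 1
  options
end

search_options(page: 2, source_fields: %w[full_name city_id company_ids])
# => { size: 30, _source: ["full_name", "city_id", "company_ids"],
#      timeout: "500ms", track_total_hits: false }
```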
Caching Strategy
1. Redis Query Cache
cache_key = "unified_search:#{query}:#{page}:#{per_page}"
Rails.cache.fetch(cache_key, expires_in: 5.minutes) { ... }
- Popular queries cached for 5 minutes
- Estimated hit rate: 30-40%
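The key shown above does not include the request filters, so two searches that differ only by filter would share a cache entry. The sketch below is one hypothetical way to build a key that normalizes the query and appends a short digest of the active filters; the method name and the normalization rules are assumptions.

```ruby
require "digest"
require "json"

# Builds a cache key from the normalized query, pagination, and active filters.
def unified_search_cache_key(query, page: 1, per_page: 20, filters: {})
  normalized = query.to_s.strip.downcase.squeeze(" ")
  key = "unified_search:#{normalized}:#{page}:#{per_page}"

  # Ignore unset filters and sort by name so key order does not matter.
  active = filters.compact.map { |name, value| [name.to_s, value] }.sort_by(&:first)
  return key if active.empty?

  "#{key}:#{Digest::SHA1.hexdigest(JSON.generate(active))[0, 12]}"
end

unified_search_cache_key("  Jean  Dupont ")
# => "unified_search:jean dupont:1:20"
```

Without filters the result matches the key format used elsewhere in this guide, so existing cache inspection commands keep working.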
2. Elasticsearch Request Cache
settings index: {
requests: { cache: { enable: true } }
}
- Caches identical requests at shard level; by default only size: 0 requests (hit counts, aggregations) are cached, not the hits themselves
- Automatic invalidation on index updates
Rate Limiting
RATE_LIMIT_MAX_REQUESTS = 30 # per minute per IP
RATE_LIMIT_WINDOW = 1.minute
Bypass:
# Disable rate limiting in test env
DISABLE_RATE_LIMIT=true rails s
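A fixed-window counter is one simple way to enforce a limit like this. The sketch below keeps counters in an in-memory hash so it runs on its own; the class name is hypothetical, the real controller presumably stores counters in Rails.cache, and this version is not thread-safe.

```ruby
# Fixed-window rate limiter: at most max_requests per window per key.
class FixedWindowRateLimiter
  def initialize(max_requests: 30, window_seconds: 60, clock: -> { Time.now.to_f })
    @max_requests = max_requests
    @window_seconds = window_seconds
    @clock = clock  # injectable for tests
    @counters = {}  # key => [window_start, count]
  end

  # Records one request for the IP and returns whether it is within the limit.
  def allow?(ip)
    now = @clock.call
    key = "search_rate_limit:#{ip}"
    window_start, count = @counters[key]

    # Start a fresh window on first use or once the previous one has expired.
    window_start, count = now, 0 if window_start.nil? || now - window_start >= @window_seconds

    count += 1
    @counters[key] = [window_start, count]
    count <= @max_requests
  end
end

limiter = FixedWindowRateLimiter.new
limiter.allow?("203.0.113.7") # => true for the first 30 calls within a minute
```

A controller would typically respond with 429 Too Many Requests whenever `allow?` returns false.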
Pagination Limits
- Max page: 100 (to prevent deep pagination performance issues)
- Max per_page: 100
- For results beyond page 100, use a search_after cursor (future enhancement)
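These limits can be applied when request parameters are parsed. The sketch below clamps out-of-range values instead of rejecting them, which is an assumption: the controller may well return 400 Bad Request in some of these cases. The constant and method names are hypothetical.

```ruby
MAX_PAGE = 100
MAX_PER_PAGE = 100
DEFAULT_PER_PAGE = 20

# Parses page/per_page from raw params and clamps them to the documented limits.
def normalize_pagination(page, per_page)
  # Integer(..., exception: false) returns nil for blank or non-numeric input.
  page_number = Integer(page.to_s, 10, exception: false) || 1
  page_size = Integer(per_page.to_s, 10, exception: false) || DEFAULT_PER_PAGE

  { page: page_number.clamp(1, MAX_PAGE), per_page: page_size.clamp(1, MAX_PER_PAGE) }
end

normalize_pagination("250", "500") # => { page: 100, per_page: 100 }
normalize_pagination(nil, nil)     # => { page: 1, per_page: 20 }
```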
Monitoring & Alerts
Performance Targets
| Metric | Target | Alert Threshold |
|---|---|---|
| P50 Response Time | < 150ms | > 300ms |
| P95 Response Time | < 500ms | > 1s |
| P99 Response Time | < 1s | > 2s |
| Cache Hit Rate | > 30% | < 20% |
| ES Timeout Rate | < 1% | > 5% |
| Error Rate | < 0.1% | > 1% |
Key Metrics to Track
1. Search Performance
# Log in controller
Rails.logger.info(
  "Search: query=#{params[:query].inspect} " \
  "time=#{elapsed}ms " \
  "results=#{data[:meta][:total]} " \
  "page=#{params[:page]} " \
  "cache_hit=#{cache_hit}"
)
2. Elasticsearch Health
# Monitor cluster health
GET /_cluster/health
# Monitor index stats
GET /entrepreneurs_production,companies_production/_stats
3. Rate Limiting
# Monitor rate limit hits
Rails.cache.read("search_rate_limit:#{ip}")
Logging
Search Queries:
[INFO] Search: query="Jean Dupont" time=145ms results=67 page=1 cache_hit=false
Errors:
[ERROR] UnifiedSearchService error: Connection timeout
/app/services/unified_search_service.rb:45
Reindexing Progress:
[INFO] Reindex progress for Entrepreneur: 15000000/30000000 (50.0%)
Testing
RSpec Tests
Service Tests (spec/services/unified_search_service_spec.rb)
- Search across both indexes
- Result merging and ranking
- Sole proprietorship deduplication
- Error handling
- Caching behavior
Controller Tests (spec/controllers/api/search_controller_spec.rb)
- Endpoint response format
- Rate limiting
- Error responses
- Authentication
Job Tests (spec/jobs/reindex_relationships_job_spec.rb)
- Batch processing
- Progress tracking
- Error recovery
Manual Testing
1. Test Search Query
curl -X POST http://localhost:3000/api/search/unified \
-H "Content-Type: application/json" \
-d '{"query": "Jean Dupont", "page": 1, "per_page": 20}'
2. Test Rate Limiting
# Send 31 requests rapidly
for i in {1..31}; do
curl -X POST http://localhost:3000/api/search/unified \
-H "Content-Type: application/json" \
-d '{"query": "test"}'
done
# Should return 429 Too Many Requests on 31st request
3. Test React UI
cd admin/
npm run dev
# Navigate to http://localhost:3001/search
# Type a query and verify results display correctly
Troubleshooting
Common Issues
1. No results found
- Check: Elasticsearch indexes exist and have data
- Fix: Run rails elasticsearch:relationships:all to reindex
2. Slow queries (> 1s)
- Check: Elasticsearch cluster health (GET /_cluster/health)
- Fix: Increase cluster resources or add replicas
3. Rate limit errors
- Check: Multiple users sharing same IP?
- Fix: Adjust RATE_LIMIT_MAX_REQUESTS or implement user-based limiting
4. Sole proprietorships showing twice
- Check: has_sole_proprietorship and is_sole_proprietorship flags are correct
- Fix: Re-run reindexing for affected records
5. Frontend errors
- Check: API endpoint returning correct data format
- Fix: Verify TypeScript types match API response
Debug Queries
View Elasticsearch Query:
# In development, queries are logged to console
rails s
# Search from UI, check terminal for pretty-printed JSON query
Test Elasticsearch Directly:
curl -X GET "localhost:9200/entrepreneurs_production/_search?pretty" \
-H 'Content-Type: application/json' \
-d '{
"query": {
"match": { "full_name": "Jean Dupont" }
}
}'
Check Cache:
rails console
> Rails.cache.read("unified_search:jean dupont:1:20")
Future Enhancements
Phase 2 Features
- Advanced Filters
  - Filter by entrepreneur role
  - Filter by company legal status
  - Filter by NAF code
  - Filter by city/department
- Search Analytics
  - Track popular queries
  - Measure search quality (clicks on results)
  - A/B test ranking algorithms
- Performance Improvements
  - Implement search_after for deep pagination
  - Add search result highlighting
  - Prefetch related data (companies for entrepreneurs)
- User Experience
  - Autocomplete suggestions
  - Recent searches
  - Saved searches
  - Export results to CSV
Support
For questions or issues:
- Check Elasticsearch cluster health
- Review application logs
- Monitor Redis cache
- Check reindexing job progress
- Test API endpoint directly
Performance issues? Check the Performance Optimization section above.