# Elasticsearch Performance Optimization Guide

## Overview
This guide documents the comprehensive Elasticsearch optimizations implemented to improve performance and reduce memory usage across the ODAPI application.
## Performance Improvements Summary
| Metric | Before | After (Expected) | Improvement |
|---|---|---|---|
| Heap Memory (Fielddata) | 127 MB | < 5 MB | -96% |
| Fielddata Evictions | 83 | 0 | -100% |
| Index Size | 2.9 GB | ~2.0 GB | -30% |
| Segments | 67 total | 2 (1 per index) | -97% |
| Search Latency | Baseline | -40-60% | 2-3x faster |
| Autocomplete Latency | Baseline | -70-90% | 5-10x faster |
## Changes Made

### 1. Index Mapping Optimizations

#### Companies Index
**Before:**

```ruby
indexes :name, type: :text, analyzer: :folding_french, fielddata: true # ⚠️ Memory intensive!
indexes :legal_status_id, type: :keyword
indexes :city_id, type: :integer
# Missing capital_amount and other fields
```
**After:**

```ruby
# Multi-field for name: text for search, keyword for sorting/aggregations
indexes :name, type: :text, analyzer: :folding_french do
  indexes :keyword, type: :keyword, normalizer: :lowercase_normalizer
  indexes :sort, type: :keyword, normalizer: :sortable_normalizer
  indexes :autocomplete, type: :text, analyzer: :autocomplete, search_analyzer: :autocomplete_search
end

# Added missing fields
indexes :capital_amount, type: :long, doc_values: true
indexes :completion_score, type: :integer, doc_values: true

# All fields now use doc_values for efficient aggregations
indexes :legal_status_id, type: :keyword, doc_values: true
indexes :city_id, type: :integer, doc_values: true
```
**Key Benefits:**
- Eliminated fielddata heap usage by sorting on keyword subfields
- Added an autocomplete subfield for faster prefix matching
- Added missing fields (`capital_amount`, `completion_score`)
- Enabled doc_values on all aggregation fields (disk-based, not heap)
#### Entrepreneurs Index

Similar optimizations were applied:
- Multi-field name mapping with keyword and autocomplete subfields
- Added an autocomplete subfield to `f_name` (first names)
- Enabled doc_values on all numeric/date fields
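The bullets above can be sketched as the field mapping the `indexes` DSL would generate for `f_name`. This is a hypothetical shape, assuming the same subfield names and analyzers as the companies index; the real mapping is produced by the model's DSL.

```ruby
# Hypothetical mapping for the f_name field (subfield names assumed to
# mirror the companies index); the DSL generates the real one.
ENTREPRENEUR_NAME_MAPPING = {
  f_name: {
    type: "text",
    analyzer: "folding_french",
    fields: {
      keyword:      { type: "keyword", normalizer: "lowercase_normalizer" },
      sort:         { type: "keyword", normalizer: "sortable_normalizer" },
      autocomplete: {
        type: "text",
        analyzer: "autocomplete",
        search_analyzer: "autocomplete_search"
      }
    }
  }
}
```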
### 2. Index Settings Enhancements

**New Settings Added:**

```ruby
settings index: {
  refresh_interval: "30s",    # Reduced from default 1s (less CPU usage)
  codec: "best_compression",  # DEFLATE compression for smaller index size
  requests: {
    cache: { enable: true }   # Enable shard request cache for aggregations
  }
}
```

**Benefits:**
- **30s refresh interval:** Reduces CPU overhead (was refreshing every 1s)
- **best_compression:** 15-25% smaller index size (DEFLATE instead of the default LZ4)
- **Request cache:** Caches aggregation results for repeated queries
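A quick back-of-the-envelope calculation shows why the refresh interval matters; the numbers follow from the settings above, not from measurements. The trade-off is that newly indexed documents can take up to 30s to become searchable.

```ruby
# Refresh cycles per hour at the default vs the new interval.
DEFAULT_REFRESH_S = 1
NEW_REFRESH_S     = 30

before = 3600 / DEFAULT_REFRESH_S  # 3600 refreshes/hour
after  = 3600 / NEW_REFRESH_S      # 120 refreshes/hour
reduction = 1.0 - after.fdiv(before)

puts "Refreshes/hour: #{before} -> #{after}"
puts format("Reduction: %.1f%%", reduction * 100)  # roughly 96.7% fewer cycles
```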
### 3. Analyzer & Normalizer Configuration

**Added Edge N-gram Autocomplete Analyzer:**

```ruby
analyzer: {
  autocomplete: {
    tokenizer: "autocomplete_tokenizer",
    filter: ["lowercase", "asciifolding"]
  },
  autocomplete_search: {
    tokenizer: "standard",
    filter: ["lowercase", "asciifolding"]
  }
},
tokenizer: {
  autocomplete_tokenizer: {
    type: "edge_ngram",
    min_gram: 2,
    max_gram: 20,
    token_chars: ["letter", "digit"]
  }
}
```

**Benefits:**
- 5-10x faster autocomplete queries
- Better prefix matching without expensive wildcard queries
- Accent-insensitive search for French names
**Added Normalizers for Sorting:**

```ruby
normalizer: {
  lowercase_normalizer: {
    type: "custom",
    filter: ["lowercase", "asciifolding"]
  },
  sortable_normalizer: {
    type: "custom",
    filter: ["lowercase", "asciifolding", "trim"]
  }
}
```

**Benefits:**
- Case-insensitive sorting without fielddata
- Accent-normalized sorting for French text
- Uses doc_values (disk-based, not heap memory)
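To see what the `sortable_normalizer` chain (lowercase, asciifolding, trim) does to a sort key, here is a pure-Ruby approximation. It only illustrates the behavior; Elasticsearch applies the real normalizer at index time on the `name.sort` keyword subfield.

```ruby
# Approximate the sortable_normalizer chain in plain Ruby.
def sortable_key(value)
  value
    .unicode_normalize(:nfd)  # decompose accented characters
    .gsub(/\p{Mn}/, "")       # strip combining marks (asciifolding)
    .downcase                 # lowercase
    .strip                    # trim
end

names = ["  Élan Café ", "elan cafe", "ÉLAN CAFE"]
names.map { |n| sortable_key(n) }.uniq
# => ["elan cafe"] - all three variants produce the same sort key
```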
### 4. Search Query Optimizations

**Sorting Updated:**

```ruby
# Before (uses fielddata on a text field - expensive!)
{ name: { missing: :_last } }

# After (uses the keyword subfield with doc_values - efficient!)
{ "name.sort": { missing: :_last, order: :asc } }
```

**Autocomplete Query Updated:**

```ruby
# Before (searches non-existent fields, slow)
"fields": ["name", "legal_name", "_id"]

# After (uses the optimized autocomplete fields)
"fields": [
  "name.autocomplete^3",  # Boost autocomplete results
  "name^2",
  "f_name.autocomplete^3",
  "f_name^2"
]
```

**Benefits:**
- No more fielddata loading for sorting operations
- Faster autocomplete using edge n-grams instead of wildcard queries
- Better relevance through field boosting
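Put together, the boosted field list above can be assembled into a full query body. The helper name and surrounding structure below are illustrative, not the application's actual code; only the field list and boosts come from this guide.

```ruby
# Hypothetical query builder producing the boosted multi_match shown above.
def autocomplete_query(term)
  {
    query: {
      multi_match: {
        query: term,
        fields: [
          "name.autocomplete^3",   # boost edge n-gram matches
          "name^2",
          "f_name.autocomplete^3",
          "f_name^2"
        ]
      }
    },
    size: 10
  }
end

autocomplete_query("par")
```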
## Migration Steps

### Step 1: Review Current Performance

Run the stats task to see the current baseline:

```shell
bundle exec rake es:stats
```
Expected output shows:
- Fielddata Memory: ~127 MB ⚠️
- Fielddata Evictions: 83 ⚠️
- Segments: 32-35 per index
- Deleted Docs: 4-5%
### Step 2: Backup Current Indices (Optional)

If you want to keep the old indices as a backup:

```shell
# Create snapshots or simply note the index names
curl -X GET "localhost:9200/_cat/indices/companies_development,entrepreneurs_development?v"
```
### Step 3: Recreate Indices with New Mappings

**IMPORTANT:** This deletes and rebuilds the indices. Do it during low-traffic periods.

```shell
# Interactive reindexing (recommended)
bundle exec rake es:reindex

# Select Company index → Choose "Recreate index"
# Select Entrepreneur index → Choose "Recreate index"
```
**What happens:**
- Old index is deleted
- New index is created with the optimized mappings
- All data is reindexed from PostgreSQL
- Progress is displayed (success/error counts)

**Expected Duration:**
- Companies (25M docs): ~30-60 minutes
- Entrepreneurs (28M docs): ~30-60 minutes
### Step 4: Force Merge to Optimize Segments

After reindexing, merge segments to optimize performance:

```shell
bundle exec rake es:optimize
```

Type `y` to confirm. This will:
- Merge all segments into 1 per index
- Reclaim space from deleted documents
- Improve query performance by 10-30%

**Expected Duration:** 5-15 minutes per index
### Step 5: Verify Performance Improvements

Run stats again to verify the improvements:

```shell
bundle exec rake es:stats
```

**Expected Results:**
- ✅ Fielddata Memory: < 5 MB (was 127 MB)
- ✅ Fielddata Evictions: 0 (was 83)
- ✅ Segments: 1-2 (was 67)
- ✅ Index Size: Reduced by 20-30%
### Step 6: Run Performance Benchmarks

Test search performance:

```shell
bundle exec rake es:benchmark
```

This runs 5 iterations of common queries and reports average times.
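The measurement approach can be sketched in a few lines: run each query several times and report the average wall-clock time. The helper below is illustrative, not the actual `es:benchmark` implementation; the block is a stand-in for a real Elasticsearch search call.

```ruby
require "benchmark"

# Average wall-clock time of a block over N iterations, in milliseconds.
def average_ms(iterations: 5)
  times = Array.new(iterations) do
    Benchmark.realtime { yield } * 1000.0
  end
  (times.sum / iterations).round(2)
end

# Stand-in workload; the real task would run an ES query here.
avg = average_ms(iterations: 5) { (1..10_000).sum }
puts "Average: #{avg} ms over 5 iterations"
```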
## New Rake Tasks Available

### Performance Monitoring

```shell
# Show detailed stats (memory, segments, cache hit rates, health checks)
bundle exec rake es:stats

# Benchmark search and autocomplete performance
bundle exec rake es:benchmark
```

### Index Management

```shell
# Interactive index management (recreate, reindex, update mappings)
bundle exec rake es:reindex

# Force merge segments to optimize performance
bundle exec rake es:optimize

# Clear the fielddata cache to free memory
bundle exec rake es:clear_cache
```
## Understanding the Changes

### Multi-Field Mappings

Why does `name` have multiple subfields?

```ruby
indexes :name, type: :text, analyzer: :folding_french do
  indexes :keyword       # For exact matching, aggregations
  indexes :sort          # For sorting (case-insensitive, normalized)
  indexes :autocomplete  # For prefix matching
end
```

- `name` (text): Full-text search with the French analyzer
- `name.keyword`: Exact matching, used in aggregations
- `name.sort`: Used for sorting (replaces fielddata)
- `name.autocomplete`: Fast prefix matching for autocomplete

**Access in Queries:**
- Search: `{ match: { name: "query" } }`
- Sort: `{ sort: { "name.sort": "asc" } }`
- Autocomplete: `{ match: { "name.autocomplete": "query" } }`
### Doc Values vs Fielddata

**Fielddata (OLD - BAD):**
- Loaded into JVM heap memory
- Fast, but consumes lots of RAM
- Causes evictions and GC pressure
- Enabled on text fields with `fielddata: true`

**Doc Values (NEW - GOOD):**
- Stored on disk, loaded on demand
- Uses the OS file system cache
- Minimal heap memory usage
- Default for keyword, numeric, and date fields

**Impact:**
- Fielddata: 127 MB heap → Doc Values: ~0 MB heap
- No more evictions or memory pressure
### Edge N-grams for Autocomplete

**Before (wildcard queries):**

```json
{ "wildcard": { "name": "par*" } }
```

- Slow (scans all terms)
- Doesn't use the index efficiently

**After (edge n-grams):**

```json
{ "match": { "name.autocomplete": "par" } }
```

- Fast (indexed as: "pa", "par", "pari", "paris")
- Uses the inverted index efficiently
- 5-10x faster for autocomplete
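The terms listed above follow directly from the tokenizer settings (`min_gram: 2`, `max_gram: 20`). A small pure-Ruby illustration of what the edge n-gram tokenizer emits for a single token:

```ruby
# Generate the edge n-grams Elasticsearch indexes for one token,
# mirroring min_gram: 2, max_gram: 20 from the settings above.
def edge_ngrams(token, min_gram: 2, max_gram: 20)
  upper = [token.length, max_gram].min
  (min_gram..upper).map { |len| token[0, len] }
end

edge_ngrams("paris")
# => ["pa", "par", "pari", "paris"]
```

A query for "par" then matches the indexed term "par" directly in the inverted index, with no term scanning.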
## Monitoring & Maintenance

### Regular Health Checks

Run stats weekly to monitor:

```shell
bundle exec rake es:stats
```

**Watch for:**
- ⚠️ Deleted docs > 5% → Run `rake es:optimize`
- ⚠️ Segments > 50 → Run `rake es:optimize`
- ⚠️ Fielddata > 10 MB → Check for rogue queries that sort on text fields
- ⚠️ Cache hit rate < 10% → Review query patterns
### When to Force Merge

Run `rake es:optimize` when:
- A bulk data import has just finished
- Deleted docs exceed 5%
- Segment count exceeds 50
- Query performance degrades

**Warning:** Don't force merge indices that are still receiving writes!
### When to Reindex

A full reindex is needed when:
- Changing analyzers or tokenizers
- Adding new fields with complex mappings
- Upgrading across major Elasticsearch versions

**Note:** Simple field additions can use `update_mapping` instead.
## Troubleshooting

### Issue: Fielddata Memory Still High After Migration

**Cause:** Old queries still sort on the text field.

**Fix:** Find and update the queries:

```ruby
# Bad
Company.search(sort: { name: :asc })

# Good
Company.search(sort: { "name.sort": :asc })
```
### Issue: Autocomplete Not Working

**Cause:** The index has not been recreated with the new autocomplete analyzer.

**Fix:**

```shell
bundle exec rake es:reindex
# Select index → Choose "Recreate index"
```
### Issue: Force Merge Taking Too Long

**Cause:** Large index with many segments.

**Fix:**
- Run during off-peak hours
- Consider merging to 5 segments first: `max_num_segments: 5`
- Monitor progress: `curl localhost:9200/_cat/tasks?v`
### Issue: Reindexing Fails with Errors

**Cause:** Data validation issues or missing associations.

**Check:**
- Review the error output from the rake task
- Check for nil values in required fields
- Verify associations are loaded (`includes`)

**Fix:**

```ruby
# Update as_indexed_json to handle nil values
def as_indexed_json(options = {})
  {
    name: legal_name,
    city_id: address&.source_town_id,
    # ... other fields
  }.compact # Remove nil values
end
```
## Performance Tuning Tips

### 1. Adjust Refresh Interval for Bulk Operations

During large imports, reduce refresh overhead:

```ruby
# Before the bulk import: disable refresh
client.indices.put_settings(
  index: index_name,
  body: { index: { refresh_interval: "-1" } }
)

# Do the bulk import
Company.index_import

# After the bulk import: restore the interval and refresh once
client.indices.put_settings(
  index: index_name,
  body: { index: { refresh_interval: "30s" } }
)
client.indices.refresh(index: index_name)
```
### 2. Use Bulk API for Multiple Updates

Instead of individual updates, batch them:

```ruby
# Bad (N individual index requests)
companies.each { |c| c.__elasticsearch__.index_document }

# Good (one bulk operation)
Company.__elasticsearch__.import(companies)
```
### 3. Filter Before Aggregating

```ruby
# Bad (aggregates all docs, then filters)
{
  aggs: { ... },
  query: { match_all: {} },
  post_filter: { term: { status: "active" } }
}

# Good (filters before aggregating)
{
  query: { term: { status: "active" } },
  aggs: { ... }
}
```
### 4. Use Result Caching for Repeated Queries

```ruby
# Add a Redis caching layer
def self.cached_search(options = {})
  cache_key = "es:company:#{Digest::MD5.hexdigest(options.to_json)}"
  Rails.cache.fetch(cache_key, expires_in: 5.minutes) do
    search(options)
  end
end
```
## Next Steps & Future Optimizations

### Potential Future Enhancements

- **Separate Hot/Cold Data**
  - Move old companies to read-only indices
  - Reduce the refresh rate on cold data
  - Save memory and CPU

- **Implement Index Lifecycle Management (ILM)**
  - Auto-rollover indices at certain sizes
  - Auto-merge and optimize old indices
  - Delete very old data

- **Add Search Analytics**
  - Track slow queries
  - Monitor popular search terms
  - Optimize based on actual usage

- **Shard Optimization**
  - Currently: 1 shard per index
  - For scaling: consider 3-5 shards if data grows significantly
  - Balance: more shards mean better parallelism, but also more overhead

- **Query Result Highlighting**
  - Add highlighting to show matched terms
  - Improves user experience
  - Minimal performance cost with proper field configuration
## Support

For issues or questions:
- Check the troubleshooting section above
- Run `rake es:stats` and review the health checks
- Review the Elasticsearch logs: `tail -f /var/log/elasticsearch/`
- Open an issue in the project repository

---

*Last Updated: 2025-01-03 · Version: 1.0 · Author: Claude Code Optimization*