4146 add troubleshooting section #4181
@@ -0,0 +1,42 @@
---
sidebar_position: 1
slug: /tips-and-tricks/community-wisdom
sidebar_label: 'Community Wisdom'
doc_type: 'overview'
keywords: [
  'database tips',
  'community wisdom',
  'production troubleshooting',
  'performance optimization',
  'database debugging',
  'clickhouse guides',
  'real world examples',
  'database best practices',
  'meetup insights',
  'production lessons',
  'interactive tutorials',
  'database solutions'
]
title: 'ClickHouse community wisdom'
description: 'Learn from the ClickHouse community with real world scenarios and lessons learned'
---

# ClickHouse community wisdom: tips and tricks from meetups {#community-wisdom}

*These interactive guides represent collective wisdom from hundreds of production deployments. Each runnable example helps you understand ClickHouse patterns using real GitHub events data - practice these concepts to avoid common mistakes and accelerate your success.*

Combine this collected knowledge with our [Best Practices](/best-practices) guide for the best ClickHouse experience.

## Problem-specific quick jumps {#problem-specific-quick-jumps}

| Issue | Document | Description |
|-------|----------|-------------|
| **Production issue** | [Debugging insights](./debugging-insights.md) | Community production debugging tips |
| **Slow queries** | [Performance optimization](./performance-optimization.md) | Optimize query performance |
| **Materialized views** | [MV double-edged sword](./materialized-views.md) | Avoid 10x storage amplification |
| **Too many parts** | [Too many parts](./too-many-parts.md) | Address the 'Too Many Parts' error and the slowdowns it causes |
| **High costs** | [Cost optimization](./cost-optimization.md) | Reduce infrastructure and storage costs |
| **Creative use cases** | [Success stories](./creative-usecases.md) | Examples of ClickHouse in 'outside the box' use cases |

**Last updated:** Based on community meetup insights through 2024-2025
**Contributing:** Found a mistake or have a new lesson? Community contributions are welcome
@@ -0,0 +1,100 @@
---
sidebar_position: 1
slug: /community-wisdom/cost-optimization
sidebar_label: 'Cost Optimization'
doc_type: 'how-to-guide'
keywords: [
  'cost optimization',
  'storage costs',
  'partition management',
  'data retention',
  'storage analysis',
  'database optimization',
  'clickhouse cost reduction',
  'storage hot spots',
  'ttl performance',
  'disk usage',
  'compression strategies',
  'retention analysis'
]
title: 'Lessons - cost optimization'
description: 'Battle-tested cost optimization strategies from ClickHouse community meetups with real production examples and verified techniques.'
---

# Cost optimization: battle-tested strategies {#cost-optimization}

*This guide is part of a collection of findings gained from community meetups. The findings on this page cover community wisdom related to optimizing cost while using ClickHouse. For more real-world solutions and insights you can [browse by specific problem](./community-wisdom.md).*

## The ContentSquare migration: 11x cost reduction {#contentsquare-migration}

ContentSquare's migration from Elasticsearch to ClickHouse shows the cost optimization potential of moving analytics workloads to ClickHouse: their platform serves over 1,000 enterprise customers and processes more than one billion page views daily. Before migration, ContentSquare ran 14 Elasticsearch clusters, each with 30 nodes, and struggled to make them bigger while keeping them stable. They were unable to host very large clients with high traffic, and frequently had to move clients between clusters as their traffic grew beyond cluster capacity.

ContentSquare took a phased approach to avoid disrupting business operations. They first tested ClickHouse on a new mobile analytics product, which took four months to ship to production. This success convinced them to migrate their main web analytics platform. The full web migration took ten months to port all endpoints, followed by careful client-by-client migration of 600 clients in batches to avoid performance issues. They built extensive automation for non-regression testing, allowing them to complete the migration with zero regressions.

After migration, the infrastructure became 11x cheaper while storing six times more data and delivering 10x faster performance on 99th-percentile queries. *"We are saving multiple millions per year using ClickHouse,"* the team noted. The performance improvements were particularly notable for their slowest queries - while fast queries (200ms on Elasticsearch) only improved to about 100ms on ClickHouse, their worst-performing queries went from over 15 seconds on Elasticsearch to under 2 seconds on ClickHouse.

Their current ClickHouse setup includes 16 clusters across four regions on AWS and Azure, with over 100 nodes total. Each cluster typically has nine shards with two replicas per shard. They process approximately 100,000 analytics queries daily with an average response time of 200 milliseconds, while also increasing data retention from 3 months to 13 months.

**Key Results:**
- 11x reduction in infrastructure costs
- 6x increase in data storage capacity
- 10x faster 99th-percentile query performance
- Multiple millions in annual savings
- Increased data retention from 3 months to 13 months
- Zero regressions during migration

## Compression strategy: LZ4 vs ZSTD in production {#compression-strategy}

When Microsoft Clarity needed to handle hundreds of terabytes of data, they discovered that compression choices have dramatic cost implications. At their scale, every bit of storage savings matters, and they faced a classic trade-off: performance versus storage costs. Microsoft Clarity handles massive volumes - two petabytes of uncompressed data per month across all accounts, processing around 60,000 queries per hour across eight nodes and serving billions of page views from millions of websites. At this scale, compression strategy becomes a critical cost factor.

They initially used ClickHouse's default [LZ4](/sql-reference/statements/create/table#lz4) compression but discovered significant cost savings were possible with [ZSTD](/sql-reference/statements/create/table#zstd). While LZ4 is faster, ZSTD provides better compression at the cost of slightly slower performance. After testing both approaches, they made a strategic decision to prioritize storage savings. The results were significant: 50% storage savings on large tables with manageable performance impact on ingestion and queries.

**Key Results:**
- 50% storage savings on large tables through ZSTD compression
- 2 petabytes monthly data processing capacity
- Manageable performance impact on ingestion and queries
- Significant cost reduction at hundreds of TB scale
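
The codec choice can be made per column. As a rough sketch of what such a change looks like (the table and column names below are illustrative, not Microsoft Clarity's schema), ZSTD can be declared directly in the table definition, and `system.columns` can be used to measure the effect:

```sql
-- Illustrative table: switch heavyweight String columns from the default LZ4 to ZSTD
CREATE TABLE page_views
(
    `event_time` DateTime CODEC(Delta, ZSTD(3)),
    `url` String CODEC(ZSTD(3)),
    `user_agent` String CODEC(ZSTD(3)),
    `duration_ms` UInt32 CODEC(T64, ZSTD(1))
)
ENGINE = MergeTree
ORDER BY (url, event_time);

-- Compare compressed vs. uncompressed bytes per column to verify the savings
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE table = 'page_views'
ORDER BY data_compressed_bytes DESC;
```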

## Column-based retention strategy {#column-retention}

One of the most powerful cost optimization techniques comes from analyzing which columns are actually being used. Microsoft Clarity implements sophisticated column-based retention strategies using ClickHouse's built-in telemetry capabilities. ClickHouse provides detailed metrics on storage usage by column as well as comprehensive query patterns: which columns are accessed, how frequently, query duration, and overall usage statistics.

This data-driven approach enables strategic decisions about retention policies and column lifecycle management. By analyzing this telemetry data, Microsoft can identify storage hot spots - columns that consume significant space but receive minimal queries. For these low-usage columns, they can implement aggressive retention policies, reducing storage time from 30 months to just one month, or delete the columns entirely if they're not queried at all. This selective retention strategy reduces storage costs without impacting user experience.

**Review comment:** Can we link resources on how to configure retention policies and column lifecycle management?

**The Strategy:**
- Analyze column usage patterns using ClickHouse telemetry
- Identify high-storage, low-query columns
- Implement selective retention policies
- Monitor query patterns for data-driven decisions
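
A minimal sketch of how this telemetry can be queried, and how a per-column retention policy can be expressed as a column TTL (the `page_views` table name is illustrative):

```sql
-- Find storage hot spots: the columns that occupy the most compressed space
SELECT
    table,
    name AS column,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = currentDatabase()
ORDER BY data_compressed_bytes DESC
LIMIT 20;

-- For a rarely queried column, keep only one month of values instead of the table's full retention
ALTER TABLE page_views
    MODIFY COLUMN user_agent String TTL event_time + INTERVAL 1 MONTH;
```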

## Partition-based data management {#partition-management}

Microsoft Clarity discovered that partitioning strategy impacts both performance and operational simplicity. Their approach: partition by date, order by hour. This strategy delivers multiple benefits beyond cleanup efficiency - it enables trivial data cleanup, simplifies billing calculations for their customer-facing service, and supports GDPR compliance requirements for row-based deletion.

**Review comment:** Link to resources on how to manage partitions in ClickHouse.

**Key Benefits:**
- Trivial data cleanup (drop partition vs row-by-row deletion)
- Simplified billing calculations
- Better query performance through partition elimination
- Easier operational management
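
A rough sketch of that layout (the table name and columns are illustrative, not Microsoft Clarity's schema): partitioning by date keeps deletion and billing cut-offs aligned with whole partitions, and dropping a day becomes a cheap metadata operation:

```sql
-- Partition by day, order by hour within the day
CREATE TABLE clarity_style_events
(
    `event_time` DateTime,
    `customer_id` UInt64,
    `payload` String
)
ENGINE = MergeTree
PARTITION BY toDate(event_time)
ORDER BY (toHour(event_time), customer_id);

-- Retiring an entire day of data: one DROP PARTITION instead of millions of row deletes
ALTER TABLE clarity_style_events DROP PARTITION '2024-01-15';
```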

## String-to-integer conversion strategy {#string-integer-conversion}

Analytics platforms often face a storage challenge with categorical data that appears repeatedly across millions of rows. Microsoft's engineering team encountered this problem with their search analytics data and developed an effective solution that achieved 60% storage reduction on affected datasets.

In Microsoft's web analytics system, search results trigger different types of answers - weather cards, sports information, news articles, and factual responses. Each query result was tagged with descriptive strings like "weather_answer," "sports_answer," or "factual_answer." With billions of search queries processed, these string values were being stored repeatedly in ClickHouse, consuming massive amounts of storage space and requiring expensive string comparisons during queries.

Microsoft implemented a string-to-integer mapping system using a separate MySQL database. Instead of storing the actual strings in ClickHouse, they store only integer IDs. When users run queries through the UI and request data for `weather_answer`, their query optimizer first consults the MySQL mapping table to get the corresponding integer ID, then converts the query to use that integer before sending it to ClickHouse.

**Review comment:** I wonder if the mapping solution could be implemented using a Dictionary here instead of MySQL. I understand we want to share the story as-is from the customer, but maybe we could suggest a "better" solution if one exists in ClickHouse.

This architecture preserves the user experience - people still see meaningful labels like `weather_answer` in their dashboards - while the backend storage and queries operate on much more efficient integers. The mapping system handles all translation transparently, requiring no changes to the user interface or user workflows.

**Key Benefits:**
- 60% storage reduction on affected datasets
- Faster query performance on integer comparisons
- Reduced memory usage for joins and aggregations
- Lower network transfer costs for large result sets
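
Picking up the reviewer's suggestion above, a similar mapping could also live inside ClickHouse as a dictionary rather than an external MySQL table. This is a hedged sketch, not Microsoft's implementation; every table, dictionary, and column name below is illustrative:

```sql
-- Small mapping table: one row per answer type
CREATE TABLE answer_type_map
(
    `answer_type_id` UInt64,
    `answer_type` String
)
ENGINE = MergeTree
ORDER BY answer_type_id;

-- Expose the mapping as an in-memory dictionary for cheap lookups at query time
CREATE DICTIONARY answer_type_dict
(
    `answer_type_id` UInt64,
    `answer_type` String
)
PRIMARY KEY answer_type_id
SOURCE(CLICKHOUSE(TABLE 'answer_type_map'))
LIFETIME(MIN 300 MAX 600)
LAYOUT(FLAT());

-- The fact table stores only the integer; the label is resolved on the way out
SELECT
    dictGet('answer_type_dict', 'answer_type', answer_type_id) AS answer_type,
    count() AS results
FROM search_results
GROUP BY answer_type_id;
```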

## Video sources {#video-sources}

- **[Microsoft Clarity and ClickHouse](https://www.youtube.com/watch?v=rUVZlquVGw0)** - Microsoft Clarity Team
- **[ClickHouse journey in Contentsquare](https://www.youtube.com/watch?v=zvuCBAl2T0Q)** - Doron Hoffman & Guram Sigua (ContentSquare)

*These community cost optimization insights represent strategies from companies processing hundreds of terabytes to petabytes of data, showing real-world approaches to reducing ClickHouse operational costs.*

**Review comment:** Not sure how we feel about it, but maybe this is a good place to remind people that the best way to reduce cost is to use Cloud :)
@@ -0,0 +1,90 @@
---
sidebar_position: 1
slug: /community-wisdom/creative-use-cases
sidebar_label: 'Creative Use Cases'
doc_type: 'how-to-guide'
keywords: [
  'clickhouse creative use cases',
  'clickhouse success stories',
  'unconventional database uses',
  'clickhouse rate limiting',
  'analytics database applications',
  'clickhouse mobile analytics',
  'customer-facing analytics',
  'database innovation',
  'clickhouse real-time applications',
  'alternative database solutions',
  'breaking database conventions',
  'production success stories'
]
title: 'Lessons - Creative Use Cases'
description: 'Community success stories of unconventional ClickHouse use cases, from rate limiting to customer-facing analytics.'
---

# Breaking the rules: success stories {#breaking-the-rules}

*This guide is part of a collection of findings gained from community meetups. For more real-world solutions and insights you can [browse by specific problem](./community-wisdom.md).*
*Need tips on debugging an issue in prod? Check out the [Debugging Insights](./debugging-insights.md) community guide.*

These stories showcase how companies found success by using ClickHouse for unconventional use cases, challenging traditional database categories and proving that sometimes the "wrong" tool becomes exactly the right solution.

## ClickHouse as rate limiter {#clickhouse-rate-limiter}

When Craigslist needed to add tier-one rate limiting to protect their users, they faced the same decision every engineering team encounters - follow conventional wisdom and use Redis, or explore something different. Brad Lhotsky, working at Craigslist, knew Redis was the standard choice - virtually every rate limiting tutorial and example online uses Redis for good reason: it has rich primitives for rate limiting operations, well-established patterns, and a proven track record. But Craigslist's experience with Redis wasn't matching the textbook examples. *"Our experience with Redis is not like what you've seen in the movies... there are a lot of weird maintenance issues that we've hit where we reboot a node in a Redis cluster and some latency spike hits the front end."* For a small team that values maintenance simplicity, these operational headaches were becoming a real problem.

So when Brad was approached with the rate limiting requirements, he took a different approach: *"I asked my boss, 'What do you think of this idea? Maybe I can try this with ClickHouse?'"* The idea was unconventional - using an analytical database for what's typically a caching layer problem - but it addressed their core requirements: fail open, impose no latency penalties, and be maintenance-safe for a small team. The solution leveraged their existing infrastructure, where access logs were already flowing into ClickHouse via Kafka. Instead of maintaining a separate Redis cluster, they could analyze request patterns directly from the access log data and inject rate limiting rules into their existing ACL API. The approach meant slightly higher latency than Redis, which *"is kind of cheating by instantiating that data set upfront"* rather than doing real-time aggregate queries, but the queries still completed in under 100 milliseconds.

**Key Results:**
- Dramatic improvement over Redis infrastructure
- Built-in TTL for automatic cleanup eliminated maintenance overhead
- SQL flexibility enabled complex rate limiting rules beyond simple counters
- Leveraged existing data pipeline instead of requiring separate infrastructure
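
A minimal sketch of the idea, assuming access logs already land in a ClickHouse table (the `access_logs` table, its columns, and the thresholds are all illustrative, not Craigslist's actual schema or rules):

```sql
-- Access logs with a table-level TTL, so old rows clean themselves up
CREATE TABLE access_logs
(
    `ts` DateTime,
    `client_ip` IPv4,
    `path` String
)
ENGINE = MergeTree
ORDER BY (client_ip, ts)
TTL ts + INTERVAL 1 DAY;

-- Periodically flag clients that exceeded a request budget in the last 10 minutes,
-- then feed the result into the ACL layer
SELECT
    client_ip,
    count() AS requests
FROM access_logs
WHERE ts >= now() - INTERVAL 10 MINUTE
GROUP BY client_ip
HAVING requests > 1000;
```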

## ClickHouse for customer analytics {#customer-analytics}

**Review comment:** Is it really a creative use case? It looks like standard RTA to me.

When ServiceNow needed to upgrade their mobile analytics platform, they faced a simple question: *"Why would we replace something that works?"* Amir Vaza from ServiceNow knew their existing system was reliable, but customer demands were outgrowing what it could handle. *"The motivation to replace an existing reliable model is actually from the product world,"* Amir explained. ServiceNow offered mobile analytics as part of their solution for web, mobile, and chatbots, but customers wanted analytical flexibility that went beyond pre-aggregated data.

Their previous system used about 30 different tables with pre-aggregated data segmented by fixed dimensions: application, app version, and platform. For custom properties - key-value pairs that customers could send - they created separate counters for each group. This approach delivered fast dashboard performance but came with a major limitation. *"While this is great for quick value breakdown, the limitation I mentioned leads to a lot of loss of analytical context,"* Amir noted. Customers couldn't perform complex customer journey analysis or ask questions like "how many sessions started with the search term 'research RSA token'" and then analyze what those users did next. The pre-aggregated structure destroyed the sequential context needed for multi-step analysis, and every new analytical dimension required engineering work to pre-aggregate and store.

So when the limitations became clear, ServiceNow moved to ClickHouse and eliminated these pre-computation constraints entirely. Instead of calculating every variable upfront, they broke metadata into data points and inserted everything directly into ClickHouse. They used ClickHouse's async insert queue, which Amir called *"actually amazing,"* to handle data ingestion efficiently. The approach meant customers could now create their own segments, slice data freely across any dimensions, and perform complex customer journey analysis that wasn't possible before.

**Key Results:**
- Dynamic segmentation across any dimensions without pre-computation
- Complex customer journey analysis became possible
- Customers could create their own segments and slice data freely
- No more engineering bottlenecks for new analytical requirements

**Review comment:** Maybe a wider issue, but I noticed the scroll stops working when hovering over the code editor.

```sql runnable editable
-- Challenge: Try different customer journey analysis - track user flows across multiple touchpoints
-- Experiment: Test complex segmentation that wasn't possible with pre-aggregated tables
SELECT
    'Dynamic Customer Journey Analysis' as feature,
    actor_login as user_id,
    arrayStringConcat(groupArray(event_type), ' -> ') as user_journey,
    count() as journey_frequency,
    toStartOfDay(min(created_at)) as journey_start_date,
    'Real-time multi-dimensional analysis' as capability
FROM github.github_events
WHERE created_at >= '2024-01-15'
  AND created_at < '2024-01-16'
  AND event_type IN ('WatchEvent', 'ForkEvent', 'IssuesEvent', 'PullRequestEvent')
GROUP BY user_id
HAVING journey_frequency >= 3
ORDER BY journey_frequency DESC
LIMIT 15;
```
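
The ingestion side of that story leans on ClickHouse's asynchronous inserts. As a rough sketch of the settings involved (the `mobile_events` table and its columns are illustrative, not ServiceNow's schema):

```sql
-- Illustrative events table for raw, un-aggregated data points
CREATE TABLE mobile_events
(
    `ts` DateTime,
    `session_id` UInt64,
    `property_key` String,
    `property_value` String
)
ENGINE = MergeTree
ORDER BY (session_id, ts);

-- async_insert lets ClickHouse batch many small client writes server-side
INSERT INTO mobile_events
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (now(), 42, 'search_term', 'research RSA token');
```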

### The pattern of innovation {#pattern-of-innovation}

Both success stories follow a similar pattern: teams that succeeded by questioning database orthodoxy rather than accepting conventional limitations. The breakthrough came when engineering leaders asked themselves whether the "right" tool was actually serving their specific needs.

Craigslist's moment came when Brad asked: *"What do you think of this idea? Maybe I can try this with ClickHouse?"* Instead of accepting Redis maintenance complexity, they found a path that leveraged existing infrastructure. ServiceNow's realization was similar - rather than accepting that analytics must be slow or pre-computed, they recognized that customers needed the ability to segment data and slice it dynamically without constraints.

Both teams succeeded because they designed around ClickHouse's unique strengths rather than trying to force it into traditional database patterns. They understood that sometimes the "analytical database" becomes the perfect operational solution when speed and SQL flexibility matter more than traditional OLTP guarantees. ClickHouse's combination of speed, SQL flexibility, and operational simplicity enables use cases that traditional database categories can't address - proving that the best tool is often the one that solves your specific problems, not the one that fits the textbook definition.

## Video sources {#video-sources}

- **[Breaking the Rules - Building a Rate Limiter with ClickHouse](https://www.youtube.com/watch?v=wRwqrbUjRe4)** - Brad Lhotsky (Craigslist)
- **[ClickHouse as an Analytical Solution in ServiceNow](https://www.youtube.com/watch?v=b4Pmpx3iRK4)** - Amir Vaza (ServiceNow)

*These stories demonstrate how questioning conventional database wisdom can lead to breakthrough solutions that redefine what's possible with analytical databases.*