
Data Engineering Weekly

Ananth Packkildurai
The Weekly Data Engineering Newsletter www.dataengineeringweekly.com

Available Episodes

5 of 18
  • Is Apache Iceberg the New Hadoop? Navigating the Complexities of Modern Data Lakehouses
    The modern data stack constantly evolves, with new technologies promising to solve age-old problems like scalability, cost, and data silos. Apache Iceberg, an open table format, has recently generated significant buzz. But is it truly revolutionary, or is it destined to repeat the pitfalls of past solutions like Hadoop?

    In a recent episode of the Data Engineering Weekly podcast, we delved into this question with Daniel Palma, Head of Marketing at Estuary and a seasoned data engineer with over a decade of experience. Danny authored a thought-provoking article comparing Iceberg to Hadoop, not on a purely technical level, but in terms of their hype cycles, implementation challenges, and the surrounding ecosystems. This blog post expands on that insightful conversation, offering a critical look at Iceberg's potential and the hurdles organizations face when adopting it.

    Hadoop: A Brief History Lesson
    For those unfamiliar with Hadoop's trajectory, it's crucial to understand the context. In the mid-2000s, Hadoop emerged as a groundbreaking solution for processing massive datasets. It promised to address key pain points:
    * Scaling: Handling ever-increasing data volumes.
    * Cost: Reducing storage and processing expenses.
    * Speed: Accelerating data insights.
    * Data Silos: Breaking down barriers between data sources.
    Hadoop achieved this through distributed processing and storage, using a framework called MapReduce and the Hadoop Distributed File System (HDFS). However, while the promise was alluring, the reality proved complex. Many organizations struggled with Hadoop's operational overhead, leading to high failure rates (Gartner famously estimated that 80% of Hadoop projects failed). The complexity stemmed from managing distributed clusters, tuning configurations, and dealing with issues like the "small file problem."

    Iceberg: The Modern Contender
    Apache Iceberg enters the scene as a modern table format designed for massive analytic datasets. Like Hadoop, it aims to tackle scalability, cost, speed, and data silos. However, Iceberg focuses specifically on the table format layer, offering features like:
    * Schema Evolution: Adapting to changing data structures without rewriting tables.
    * Time Travel: Querying data as it existed at a specific time.
    * ACID Transactions: Ensuring data consistency and reliability.
    * Partition Evolution: Changing data partitioning without breaking existing queries.
    Iceberg's design addresses Hadoop's shortcomings, particularly data consistency and schema evolution. But, as Danny emphasizes, an open table format alone isn't enough.

    The Ecosystem Challenge: Beyond the Table Format
    Iceberg, by itself, is not a complete solution. It requires a surrounding ecosystem to function effectively. This ecosystem includes:
    * Catalogs: Services that manage metadata about Iceberg tables (e.g., table schemas, partitions, and file locations).
    * Compute Engines: Tools that query and process data stored in Iceberg tables (e.g., Trino, Spark, Snowflake, DuckDB).
    * Maintenance Processes: Operations that optimize Iceberg tables, such as compacting small files and managing metadata.
    The ecosystem is where the comparison to Hadoop becomes particularly relevant. Hadoop also had a vast ecosystem (Hive, Pig, HBase, etc.), and managing this ecosystem was a significant source of complexity. Iceberg faces a similar challenge.
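    To make the features above concrete, here is a minimal, hypothetical PySpark sketch that exercises schema evolution and time travel against a local Iceberg table. The catalog name ("demo"), warehouse path, and package version are assumptions for illustration only, not anything discussed in the episode.

    ```python
    from pyspark.sql import SparkSession

    # Assumed local setup: a Hadoop-type Iceberg catalog named "demo" with a /tmp warehouse.
    spark = (
        SparkSession.builder.appName("iceberg-features-sketch")
        .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hadoop")
        .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate()
    )

    spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")
    spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, category STRING) USING iceberg")
    spark.sql("INSERT INTO demo.db.events VALUES (1, 'clicks'), (2, 'views')")

    # Schema evolution: add a column without rewriting existing data files.
    spark.sql("ALTER TABLE demo.db.events ADD COLUMN source STRING")

    # Time travel: list snapshots, then query the table as of the first snapshot.
    first_snapshot = spark.sql(
        "SELECT snapshot_id FROM demo.db.events.snapshots ORDER BY committed_at"
    ).first()[0]
    spark.sql(f"SELECT * FROM demo.db.events VERSION AS OF {first_snapshot}").show()
    ```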
    Operational Complexity: The Elephant in the Room
    Danny highlights operational complexity as a major hurdle for Iceberg adoption. While Iceberg itself simplifies some aspects of data management, the surrounding ecosystem introduces new challenges:
    * Small File Problem (Revisited): Like Hadoop, Iceberg can suffer from small file problems. Data ingestion tools often create numerous small files, which can degrade performance during query execution. Iceberg addresses this through table maintenance, specifically compaction (merging small files into larger ones; see the sketch after this list). However, many data ingestion tools don't natively support compaction, requiring manual intervention or dedicated Spark clusters.
    * Metadata Overhead: Iceberg relies heavily on metadata to track table changes and enable features like time travel. If not handled correctly, managing this metadata can become a bottleneck. Organizations need automated processes for metadata cleanup and compaction.
    * Catalog Wars: The catalog choice is critical, and the market is fragmented. Major data warehouse providers (Snowflake, Databricks) have released their own flavors of REST catalogs, leading to compatibility issues and potential vendor lock-in. The dream of a truly interoperable catalog layer, where you can seamlessly switch between providers, remains elusive.
    * Infrastructure Management: Setting up and maintaining an Iceberg-based data lakehouse requires expertise in infrastructure-as-code, monitoring, observability, and data governance. This maintenance demands a level of operational maturity that many organizations lack.
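    As a rough illustration of the table maintenance mentioned above, the open-source Iceberg Spark integration ships stored procedures for compaction and metadata cleanup. A minimal sketch, reusing the hypothetical `spark` session and `demo` catalog from the earlier example, with arguments left close to their defaults:

    ```python
    # Compact small data files into larger ones (the "small file problem" mitigation).
    spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events')")

    # Rewrite manifests and expire old snapshots so metadata growth stays bounded.
    spark.sql("CALL demo.system.rewrite_manifests(table => 'db.events')")
    spark.sql("CALL demo.system.expire_snapshots(table => 'db.events', retain_last => 5)")

    # Remove files no longer referenced by any snapshot (defaults to files older than 3 days).
    spark.sql("CALL demo.system.remove_orphan_files(table => 'db.events')")
    ```

    In practice, these calls run as scheduled jobs or are handled by a managed service, which is exactly the operational overhead the episode cautions about.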
    Key Considerations for Iceberg Adoption
    If your organization is considering Iceberg, Danny stresses the importance of careful planning and evaluation:
    * Define Your Use Case: Clearly articulate your specific needs. Are you prioritizing performance, cost, or both? What are your data governance and security requirements? Your answers will influence your choices for storage, computing, and cataloging.
    * Evaluate Compatibility: Ensure your existing infrastructure and tools (query engines, data ingestion pipelines) are compatible with Iceberg and your chosen catalog.
    * Consider Cloud Vendor Lock-in: Be mindful of potential lock-in, especially with catalogs. While Iceberg is open, cloud providers have tightly coupled implementations specific to their ecosystems.
    * Build vs. Buy: Decide whether you have the resources to build and maintain your own Iceberg infrastructure or whether a managed service is a better fit. Many organizations prefer to outsource table maintenance and catalog management to avoid operational overhead.
    * Talent and Expertise: Do you have the in-house expertise to manage Spark clusters (for compaction), configure query engines, and manage metadata? If not, consider partnering with consultants or investing in training.
    * Start the Data Governance Process: Don't wait until the last minute to build the data governance framework. You must create the framework and processes before jumping into adoption.

    The Catalog Conundrum: Beyond Structured Data
    The role of the catalog is evolving. Initially, catalogs focused on managing metadata for structured data in Iceberg tables. However, the vision is expanding to encompass unstructured data (images, videos, audio) and AI models. This "catalog of catalogs" or "uber catalog" approach aims to provide a unified interface for accessing all data types. The benefits of a unified catalog are clear: simplified data access, consistent semantics, and easier integration across different systems. However, building such a catalog is complex, and the industry is still grappling with the best approach.

    S3 Tables: A New Player?
    Amazon's recent announcement of S3 Tables raised eyebrows. These tables combine object storage with a table format, offering a highly managed solution. However, they are currently limited in terms of interoperability. They don't support external catalogs, making it difficult to integrate them into existing Iceberg-based data stacks. The jury is still out on whether S3 Tables will become a significant player in the open table format landscape.

    Query Engine Considerations
    Choosing the right query engine is crucial for performance and cost optimization. While some engines like Snowflake boast excellent performance with Iceberg tables (with minimal overhead compared to native tables), others may lag. Factors to consider include:
    * Performance: Benchmark different engines with your specific workloads.
    * Cost: Evaluate the cost of running queries on different engines.
    * Scalability: Ensure the engine can handle your anticipated data volumes and query complexity.
    * Compatibility: Verify compatibility with your chosen catalog and storage layer.
    * Use Case: Different engines excel at different tasks. Trino is popular for ad-hoc queries, while DuckDB is gaining traction for smaller-scale analytics.
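    As a small taste of engine interchangeability, here is a hedged sketch of querying the same hypothetical Iceberg table from DuckDB via its iceberg extension, without Spark involved. The table path comes from the earlier local example, and depending on your DuckDB version you may need to point iceberg_scan at a specific metadata JSON file instead of the table directory.

    ```python
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL iceberg; LOAD iceberg;")

    # Read the Iceberg table written by the earlier Spark sketch (path is an assumption).
    df = con.execute(
        "SELECT category, count(*) AS events "
        "FROM iceberg_scan('/tmp/iceberg-warehouse/db/events') "
        "GROUP BY category"
    ).fetchdf()
    print(df)
    ```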
    Is Iceberg Worth the Pain?
    The ultimate question is whether the benefits of Iceberg outweigh the complexities. For many organizations, especially those with limited engineering resources, fully managed solutions like Snowflake or Redshift might be a more practical starting point. These platforms handle the operational overhead, allowing teams to focus on data analysis rather than infrastructure management. However, Iceberg can be a compelling option for organizations with specific requirements (e.g., strict data residency rules, a need for a completely open-source stack, or a desire to avoid vendor lock-in). The key is to approach adoption strategically, with a clear understanding of the challenges and a plan to address them.

    The Future of Table Formats: Consolidation and Abstraction
    Danny predicts consolidation in the table format space. Managed service providers will likely bundle table maintenance and catalog management with their Iceberg offerings, simplifying the developer experience. The next step will be managing the compute layer, providing a fully end-to-end data lakehouse solution. Initiatives like Apache XTable aim to provide a standardized interface on top of different table formats (Iceberg, Hudi, Delta Lake). However, whether such abstraction layers will gain widespread adoption remains to be seen. Some argue that standardizing on a single table format is a simpler approach.

    Iceberg's Role in Event-Driven Architectures and Machine Learning
    Beyond traditional analytics, Iceberg has the potential to contribute significantly to event-driven architectures and machine learning. Its features, such as time travel, ACID transactions, and data versioning, make it a suitable backend for streaming systems and change data capture (CDC) pipelines.

    Unsolved Challenges
    Several challenges remain in the open table format landscape:
    * Simplified Data Ingestion: Writing data into Iceberg is still unnecessarily complex, often requiring Spark clusters. Simplifying this process is crucial for broader adoption.
    * Catalog Standardization: The lack of a standardized catalog interface hinders interoperability and increases the risk of vendor lock-in.
    * Developer-Friendly Tools: The ecosystem needs more developer-friendly tools for managing table maintenance, metadata, and query optimization.

    Conclusion: Proceed with Caution and Clarity
    Apache Iceberg offers a powerful approach to building modern data lakehouses. It addresses many limitations of previous solutions like Hadoop, but it's not a silver bullet. Organizations must carefully evaluate their needs, resources, and operational capabilities before embarking on an Iceberg journey. Start small, test thoroughly, automate aggressively, and prioritize data governance. Organizations can unlock Iceberg's potential by approaching adoption with caution and clarity while avoiding the pitfalls that plagued earlier data platform initiatives. The future of the data lakehouse is open, but the path to get there requires careful navigation.

    All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.dataengineeringweekly.com
    --------  
    41:43
  • The State of Lakehouse Architecture: A Conversation with Roy Hassan on Maturity, Challenges, and Future Trends
    Lakehouse architecture represents a major evolution in data engineering. It combines the flexibility of data lakes with the structured reliability of data warehouses, providing a unified platform for diverse data workloads ranging from traditional business intelligence to advanced analytics and machine learning. Roy Hassan, a product leader at Upsolver (now part of Qlik), offers a comprehensive reality check on Lakehouse implementations, shedding light on their maturity, challenges, and future directions.

    Defining Lakehouse Architecture
    A Lakehouse is not a specific product, tool, or service but an architectural framework. This distinction is critical because it allows organizations to tailor implementations to their needs and technological environments. For instance, Databricks users inherently adopt a Lakehouse approach by storing data in object storage, managing it with the Delta Lake format, and analyzing it directly on the data lake.

    Assessing the Maturity of Lakehouse Implementations
    The adoption and maturity of Lakehouse implementations vary across cloud platforms and ecosystems:
    • Databricks: Many organizations have built mature Lakehouse implementations using Databricks, leveraging its robust capabilities to handle diverse workloads.
    • Amazon Web Services (AWS): While AWS provides services like Athena, Glue, Redshift, and EMR to access and process data in object storage, many users still rely on traditional data lakes built on Parquet files. However, a growing number are adopting Lakehouse architectures with open table formats such as Iceberg, which has gained traction within the AWS ecosystem.
    • Azure Fabric: Built on the Delta Lake format, Azure Fabric offers a vertically integrated Lakehouse experience, seamlessly combining storage, cataloging, and computing resources.
    • Snowflake: Organizations increasingly use Snowflake in a Lakehouse-oriented manner, storing data in S3 and managing it with Iceberg. While new workloads favor Iceberg, most existing data remains within Snowflake’s internal storage.
    • Google BigQuery: The Lakehouse ecosystem in Google Cloud is still evolving. Many users prefer to keep their workloads within BigQuery due to its simplicity and integrated storage.
    Despite these differences in maturity, the industry-wide adoption of Lakehouse architectures continues to expand, and their implementation is becoming increasingly sophisticated.

    Navigating Open Table Formats: Iceberg, Delta Lake, and Hudi
    Discussions about open table formats often spark debate, but each format offers unique strengths and is backed by a dedicated engineering community:
    • Iceberg and Delta Lake share many similarities, with ongoing discussions about potential standardization.
    • Hudi specializes in streaming use cases, optimizing real-time data ingestion and processing. [Listen to The Future of Data Lakehouses: A Fireside Chat with Vinoth Chandar - Founder CEO Onehouse & PMC Chair of Apache Hudi]
    Most modern query engines support Delta Lake and Iceberg, reinforcing their prominence in the Lakehouse ecosystem. While Hudi and Paimon have smaller adoption, broader query engine support for all major formats is expected over time.

    Examining Apache XTable’s Role
    Apache XTable aims to improve interoperability between different table formats. While the concept is practical, its long-term relevance remains uncertain.
    As the industry consolidates around fewer preferred formats, converting between them may introduce unnecessary complexity, latency, and potential points of failure, especially at scale.

    Challenges and Criticisms of Lakehouse Architecture
    One common criticism of Lakehouse architecture is its lower level of abstraction compared to traditional databases. Developers often need to understand the underlying file system, whereas databases provide a more seamless experience by abstracting storage management. The challenge is to balance the flexibility of the Lakehouse with the ease of use of traditional databases.

    Best Practices for Lakehouse Adoption
    A successful Lakehouse implementation starts with a well-defined strategy that aligns with business objectives. Organizations should:
    • Establish a clear vision and end goals.
    • Design a scalable and efficient architecture from the outset.
    • Select the right open table format based on workload requirements.

    The Significance of Shared Storage
    Shared storage is a foundational principle of Lakehouse architecture. By storing data in a single location and transforming it once, organizations can analyze it using multiple tools and platforms. This approach reduces costs, simplifies data management, and enhances agility by allowing teams to choose the most suitable tool for each task.

    Catalogs: Essential Components of a Lakehouse
    Catalogs are crucial in Lakehouse implementations as metadata repositories describing data assets. These catalogs fall into two categories:
    • Technical catalogs, which focus on data management and organization.
    • Business catalogs, which provide a business-friendly view of the data landscape.
    A growing trend in the industry is the convergence of technical and business catalogs to offer a unified view of data across the organization. Innovations like the Iceberg REST catalog specification have advanced catalog management by enabling a decoupled and standardized approach.
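    To make the REST catalog idea tangible, here is a minimal, hedged sketch using the open-source PyIceberg client. The catalog URI, warehouse name, table identifier, and filter are placeholders; any engine that implements the same REST specification would resolve the same table over shared object storage.

    ```python
    from pyiceberg.catalog import load_catalog

    # Connect to a (hypothetical) Iceberg REST catalog endpoint.
    catalog = load_catalog(
        "lakehouse",
        **{
            "type": "rest",
            "uri": "https://catalog.example.com",
            "warehouse": "analytics",
        },
    )

    # The catalog resolves the table's current metadata; the data files stay in shared storage.
    table = catalog.load_table("marketing.events")

    # Pull a filtered projection into Arrow for local analysis.
    arrow_table = table.scan(
        row_filter="event_date >= '2025-01-01'",
        selected_fields=("user_id", "event_type"),
    ).to_arrow()
    print(arrow_table.num_rows)
    ```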
    The Future of Catalogs: AI and Machine Learning Integration
    In the coming years, AI and machine learning will drive the evolution of data catalogs. Automated data discovery, governance, and optimization will become more prevalent, allowing organizations to unlock new AI-powered insights and streamline data management processes.

    The Changing Role of Data Engineers in the AI Era
    The rise of AI is transforming the role of data engineers. Traditional responsibilities like building data pipelines are shifting towards platform engineering and enabling AI-driven data capabilities. Moving forward, data engineers will focus on:
    • Designing and maintaining AI-ready data infrastructure.
    • Developing tools that empower software engineers to leverage data more effectively.

    Final Thoughts
    Lakehouse architecture is rapidly evolving, with growing adoption across cloud ecosystems and advancements in open table formats, cataloging, and AI integration. While challenges remain, particularly around abstraction and complexity, the benefits of flexibility, cost efficiency, and scalability make it a compelling approach for modern data workloads. Organizations investing in a Lakehouse strategy should prioritize best practices, stay informed about emerging trends, and build architectures that support current and future data needs.

    All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.dataengineeringweekly.com
    --------  
    1:02:41
  • Beyond Kafka: Conversation with Jark Wu on Fluss - Streaming Storage for Real-Time Analytics
    Fluss is a compelling new project in the realm of real-time data processing. I spoke with Jark Wu, who leads the Fluss and Flink SQL team at Alibaba Cloud, to understand its origins and potential. Jark is a key figure in the Apache Flink community, known for his work in building Flink SQL from the ground up and creating Flink CDC and Fluss. You can read the Q&A version of the conversation here, and don’t forget to listen to the podcast.

    What is Fluss and its use cases?
    Fluss is a streaming storage system designed specifically for real-time analytics. It addresses many of Kafka's challenges in analytical infrastructure. The combination of Kafka and Flink is not a perfect fit for real-time analytics, and the integration of Kafka and the Lakehouse is very shallow. Fluss is an analytical Kafka that builds on top of the Lakehouse and integrates seamlessly with Flink to reduce costs, achieve better performance, and unlock new use cases for real-time analytics.

    How do you compare Fluss with Apache Kafka?
    Fluss and Kafka differ fundamentally in design principles. Kafka is designed for streaming events, but Fluss is designed for streaming analytics.

    Architecture Differences
    The first difference is the Data Model. Kafka is designed to be a black box that collects all kinds of data, so Kafka doesn't have built-in schemas or schema enforcement; this is the biggest problem when integrating with schematized systems like the Lakehouse. In contrast, Fluss adopts a Lakehouse-native design with structured tables, explicit schemas, and support for all kinds of data types; it directly mirrors the Lakehouse paradigm. Instead of Kafka's topics, Fluss organizes data into database tables with partitions and buckets. This Lakehouse-first approach eliminates the friction of using the Lakehouse as deep storage for Fluss.
    The second difference is the Storage Model. Fluss introduces Apache Arrow as its columnar log storage model for efficient analytical queries, whereas Kafka persists data as unstructured, row-oriented logs for efficient sequential scans. Analytics requires strong data-skipping ability in storage, so sequential scanning is not common; columnar pruning and filter pushdown are basic functionalities of analytical storage. Among the 20,000 Flink SQL jobs at Alibaba, only 49% of the columns of Kafka data are read on average.
    The third difference is Data Mutability. Fluss natively supports real-time updates (e.g., row-level modifications) through LSM tree mechanisms and provides read-your-writes consistency with millisecond latency and high throughput. Kafka primarily handles append-only streams; its compacted topics only provide weak update semantics, where compaction keeps at least one value for a key, not necessarily only the latest.
    The fourth difference is the Lakehouse Architecture. Fluss embraces the Lakehouse architecture. Fluss uses the Lakehouse as tiered storage, and data is periodically converted and tiered into the data lake; Fluss only retains a small portion of recent data. So you only need to store one copy of data for your streaming and Lakehouse workloads. But the true power of this architecture is that it provides a union view of streaming and Lakehouse data: whether it is a Kafka client or a query engine on the Lakehouse, all of them can access the streaming data and the Lakehouse data as a union view in a single table. It brings powerful analytics to streaming data users. On the other hand, it provides second-level data insights for Lakehouse users.
    Most importantly, you only need to store one copy of data for your streaming and Lakehouse workloads, which reduces costs. In contrast, Kafka's tiered storage only stores Kafka log segments in remote storage; it is only a storage cost optimization for Kafka and has nothing to do with the Lakehouse.
    The Lakehouse storage serves as the historical data layer for the streaming storage, optimized for storing long-term data with minute-level latencies. On the other hand, the streaming storage serves as the real-time data layer for the Lakehouse storage, optimized for storing short-term data with millisecond-level latencies. The data is shared and exposed as a single table. For streaming queries on the table, Fluss first uses the Lakehouse storage as historical data to achieve efficient catch-up read performance, then seamlessly transitions to the streaming storage for real-time data, ensuring no duplicate data is read. For batch queries on the table, the streaming storage supplements real-time data for the Lakehouse storage, enabling second-level freshness for Lakehouse analytics. This capability, termed Union Read, allows both layers to work in tandem for highly efficient and accurate data access.
    Confluent Tableflow can bridge Kafka and Iceberg data, but that is just data movement, which data integration tools like Fivetran or Airbyte can also achieve. Tableflow is a Lambda architecture that uses two separate systems (streaming and batch), leading to challenges like data inconsistency, dual storage costs, and complex governance. On the other hand, Fluss is a Kappa architecture; it stores one copy of data and presents it as a stream or a table, depending on the use case. Benefits:
    * Cost and Time Efficiency: no longer needing to move data between systems.
    * Data Consistency: reduces the occurrence of similar-yet-different datasets, leading to fewer data pipelines and simpler data management.
    * Analytics on Streams.
    * Freshness on the Lakehouse.

    When to use Kafka vs. Fluss
    Kafka is a general-purpose distributed event streaming platform optimized for high-throughput messaging and event sourcing. It excels in event-driven architectures and data pipelines. Fluss is tailored for real-time analytics. It works with stream processors like Flink and Lakehouse formats like Iceberg and Paimon.
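    The Union Read behavior described above can be pictured with a toy sketch in plain Python (an illustration of the idea, not Fluss internals): historical rows come from the lakehouse tier, recent rows from the streaming tier, and a log offset marks the hand-off so nothing is read twice.

    ```python
    def union_read(lakehouse_rows, streaming_log, tiered_up_to_offset):
        """Yield historical rows from the lakehouse tier, then only the streaming rows
        whose offsets have not yet been tiered, so the union contains no duplicates."""
        for row in lakehouse_rows:
            yield row
        for offset, row in streaming_log:
            if offset > tiered_up_to_offset:
                yield row


    # Offsets 0-2 were already tiered into the lakehouse; the stream still retains 1-4.
    historical = [{"offset": o, "user": f"u{o}"} for o in range(3)]
    stream_tail = [(o, {"offset": o, "user": f"u{o}"}) for o in range(1, 5)]

    rows = list(union_read(historical, stream_tail, tiered_up_to_offset=2))
    print([r["offset"] for r in rows])  # [0, 1, 2, 3, 4] -- each offset exactly once
    ```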
    How do you compare Fluss with OLAP Engines like Apache Pinot?
    Architecture: Pinot is an OLAP database that stores both offline and real-time data and supports low-latency analytical queries. In contrast, Fluss is storage for real-time streaming data and doesn't provide OLAP capabilities itself; it utilizes external query engines to process and analyze data, such as Flink and StarRocks/Spark/Trino (on the roadmap). Therefore, Pinot has additional query servers for OLAP serving, while Fluss has fewer components.
    Pinot is a monolithic architecture that provides complete capabilities from storage to computation. Fluss is used in a composable architecture that can plug multiple engines into different scenarios. The rise of Iceberg and the Lakehouse has proven the power of composable architecture: users use Parquet as the file format, Iceberg as the table format, Fluss on top of Iceberg as the real-time data layer, Flink for stream processing, and StarRocks/Trino for OLAP queries. Fluss in this architecture can augment the existing Lakehouse with millisecond-level fresh data insights.
    API: Fluss's API is an RPC protocol like Kafka's, with an SDK library, and query engines like Flink provide a SQL API on top. Pinot provides SQL for OLAP queries and BI tool integrations.
    Streaming reads and writes: Fluss provides comprehensive streaming reads and writes like Kafka, but Pinot doesn't natively support them. Pinot connects to external streaming systems to ingest data using a pull-based mechanism and doesn't support a push-based mechanism.

    When to use Fluss vs. Apache Pinot?
    If you want to build streaming analytics pipelines, use Fluss (usually together with Flink). If you want to build OLAP systems for low-latency complex queries, use Pinot. If you want to augment your Lakehouse with streaming data, use Fluss.

    How is Fluss integrated with Apache Flink?
    Fluss focuses on storing streaming data and does not offer stream processing capabilities. On the other hand, Flink is the de facto standard for stream processing. Fluss aims to be the best storage for Flink and real-time analytics. The vision behind the integration is to provide users with a seamless streaming warehouse or streaming database experience. This requires seamless integration and in-depth optimization from storage to computation. For instance, Fluss already supports all of Flink's connector interfaces, including catalog, source, sink, lookup, and pushdown interfaces. In contrast, Kafka can only implement the source and sink interfaces. Our team is the community's core contributor to Flink SQL; we have the most committers and PMC members. We are committed to advancing the deep integration and optimization of Flink SQL and Fluss.

    Can you elaborate on Fluss's internal architecture?
    A Fluss cluster consists of two main processes: the CoordinatorServer and the TabletServer. The CoordinatorServer is the central control and management component. It maintains metadata, manages tablet allocation, lists nodes, and handles permissions. The TabletServer stores data and provides I/O services directly to users. The Fluss architecture is similar to the Kafka broker and uses the same durability and leader-based replication mechanism.
    Consistency: A table creation request goes to the CoordinatorServer, which creates the metadata and assigns replicas to TabletServers (three replicas by default), one of which is the leader. The replica leader writes the incoming logs, and replica followers fetch logs from the leader. Once all replicas have replicated the log, the write is acknowledged as successful.
    Fault Tolerance: If a TabletServer fails, the CoordinatorServer assigns a new leader from the replica list, which then accepts new read/write requests. Once the failed TabletServer comes back, it catches up with the logs from the new leader.
    Scalability: Fluss can scale up linearly by adding TabletServers.

    How did Fluss implement the columnar storage?
    Let’s start with why we need columnar storage for streaming data. Fluss is designed for real-time analytics. In analytical queries, it's common that only a portion of the columns are read, and a filter condition can prune a significant amount of data. This applies to streaming analytics as well, such as a Flink SQL query on Kafka data. For example, among the 20,000 Flink SQL jobs at Alibaba, only 49% of the columns of Kafka data are read on average. Still, you must read 100% of the data and deserialize all the columns.
    We introduced Apache Arrow as our underlying log storage format. Apache Arrow is a columnar format that arranges data in columns. In the implementation, clients send Arrow batches to the Fluss server, and the Fluss server continuously appends the Arrow batches to log files. When a read requests specific columns, the server returns only the necessary column vectors to users, reducing networking costs and improving performance. In our benchmark, if you read only 10% of the columns, you get roughly a 10x increase in read throughput.
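    A small PyArrow illustration of the column-pruning point above (this is the general Arrow idea, not the Fluss wire format): because a batch is laid out column by column, a projection only touches the buffers of the requested columns instead of deserializing whole rows.

    ```python
    import pyarrow as pa

    # A toy "log batch" with a wide payload column that analytical queries rarely read.
    batch = pa.table(
        {
            "user_id": pa.array([1, 2, 3, 4], type=pa.int64()),
            "event": pa.array(["click", "view", "click", "buy"]),
            "payload": pa.array(["x" * 1024] * 4),
        }
    )

    # Column pruning: keep only the columns the query actually needs.
    pruned = batch.select(["user_id", "event"])
    print(f"full batch: {batch.nbytes} bytes, pruned projection: {pruned.nbytes} bytes")
    ```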
    How does Fluss manage Real-Time Updates and Changelogs?
    Fluss has two table types:
    * Log Table
    * Primary Key Table
    The Log Table only supports appending data, just like Kafka topics. The Primary Key Table has a primary key definition and thus supports real-time updates on the primary key. In the storage model, the Log Table uses a LogStore to store data in Arrow format. The Primary Key Table uses a LogStore to store changelogs and a KvStore to store the materialized view of the changelog. The KvStore leverages RocksDB to support real-time updates. RocksDB is an embedded key-value storage engine based on the LSM tree; the key is the primary key, and the value is the row.
    Write path: when an update request arrives at the TabletServer, it first looks up the previous row for the key in the KvStore, combines the previous row and the new row into a changelog entry, writes the changelog to the LogStore (which also serves as a WAL for KvStore recovery), and then writes the new row into the KvStore. Flink can consume changelogs from the table's LogStore to process streams.
    Partial updates: look up the previous row for the key, merge the previous row and the new row on the updated columns, and write the merged row back to the KvStore.
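    A toy, in-memory Python sketch of that write path (a plain dict stands in for the RocksDB-backed KvStore and a list for the LogStore; this illustrates the idea, not Fluss code):

    ```python
    kv_store = {}    # stands in for RocksDB: primary key -> current row
    changelog = []   # stands in for the LogStore / WAL: before/after images per update


    def upsert(key, new_row):
        """Look up the previous row, emit a changelog entry, then apply the write."""
        old_row = kv_store.get(key)
        changelog.append({"key": key, "before": old_row, "after": new_row})
        kv_store[key] = new_row


    def partial_update(key, updated_columns):
        """Merge only the updated columns into the previous row, then upsert."""
        merged = {**kv_store.get(key, {}), **updated_columns}
        upsert(key, merged)


    upsert("user-1", {"name": "Ada", "city": "Berlin"})
    partial_update("user-1", {"city": "Hamburg"})
    print(changelog[-1])  # before: the Berlin row, after: the Hamburg row
    ```

    Downstream, it is exactly these before/after changelog entries that Flink consumes from the LogStore, as described above.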
    How does Fluss handle high throughput and low latency?
    Fluss achieves high throughput and low latency through a combination of techniques. It utilizes end-to-end zero-copy operations, transferring data directly from the producer, through the network, to the server and filesystem, and back to the consumer without unnecessary data duplication. Data is processed in batches (defaulting to 1 MB), making the system latency-insensitive. Further efficiency is gained through zstd level 3 compression, reducing data size. Asynchronous writes allow multiple batches to be in transit simultaneously, eliminating delays from waiting for write confirmations. Finally, columnar pruning minimizes the amount of data transferred by only sending the necessary columns for a given query.

    How do Fluss fault tolerance and data recovery work?
    We utilize the same approach as Kafka: synchronous replication with the ISR (in-sync replicas) strategy. Recovery time is, like Kafka, within seconds. But for the Primary Key Table, it may take minutes, as it has to download snapshots of RocksDB from remote storage.

    What about the scalability of Fluss?
    The Fluss cluster can scale linearly by adding TabletServers. A table can scale up throughput by adding more buckets (the equivalent of Kafka's partition concept). We don't yet support data rebalancing across multiple nodes, but this is a work in progress.

    What is the future roadmap for Fluss?
    Fluss is undergoing significant Lakehouse refactoring to enhance its capabilities and flexibility. This includes making the data lake format pluggable and expanding beyond the current Paimon support to incorporate formats like Iceberg and Hudi through collaborations with companies like Bytedance and Onehouse. Support for additional query engines is also being developed, with Spark integration currently in progress and StarRocks planned for the near future. Finally, to ensure seamless integration with existing infrastructure, Fluss is being made compatible with Kafka, allowing Kafka clients and tools to interact directly with the platform.

    References:
    https://www.alibabacloud.com/blog/why-fluss-top-4-challenges-of-using-kafka-for-real-time-analytics_601879
    https://www.alibabacloud.com/blog/introducing-fluss-streaming-storage-for-real-time-analytics_601921

    All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.dataengineeringweekly.com
    --------  
    36:30
  • The Future of Data Lakehouses: A Fireside Chat with Vinoth Chandar - Founder CEO Onehouse & PMC Chair of Apache Hudi
    Exploring the Evolution of Lakehouse Technology: A Conversation with Vinoth Chandar, Founder and CEO of Onehouse
    In this episode, Ananth, author of Data Engineering Weekly, speaks with Vinoth Chandar, Founder and CEO of Onehouse and PMC Chair of Apache Hudi, about the latest developments in the Lakehouse technology space, particularly Apache Hudi, Iceberg, and Delta Lake. They discuss the intricacies of building high-scale data ecosystems, the impact of table format standardization, and technical advances in incremental processing and indexing. The conversation delves into the role of open source in shaping the future of data engineering and addresses community questions about integrating various databases and improving operational efficiency.

    00:00 Introduction and New Year Greetings
    01:19 Introduction to Apache Hudi and Its Impact
    02:22 Challenges and Innovations in Data Engineering
    04:16 Technical Deep Dive: Hudi's Evolution and Features
    05:57 Comparing Hudi with Other Data Formats
    13:22 Hudi 1.0: New Features and Enhancements
    20:37 Industry Perception and the Future of Data Formats
    24:29 Technical Differentiators and Project Longevity
    26:05 Open Standards and Vendor Games
    26:41 Standardization and Data Platforms
    28:43 Competition and Collaboration in Data Formats
    33:38 Future of Open Source and Data Community
    36:14 Technical Questions from the Audience
    47:26 Closing Remarks and Future Outlook

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.dataengineeringweekly.com
    --------  
    47:41
  • Agents of Change: Navigating 2025 with AI and Data Innovation
    In this episode of DEW, the hosts and guests discuss their predictions for 2025, focusing on the rise and impact of agentic AI. The conversation covers three main categories:
    1. The role of agentic AI.
    2. The future workforce dynamic involving humans and AI agents.
    3. Innovations in data platforms heading into 2025.
    Highlights include insights from Ashwin and our special guest, Rajesh, on building robust agent systems, strategies for data engineers and AI engineers to remain relevant, data quality and observability, and the evolving landscape of Lakehouse architectures. The discussion also covers the challenges of integrating multi-agent systems and the economic implications of AI sovereignty and data privacy.

    00:00 Introduction and Predictions for 2025
    01:49 Exploring Agentic AI
    04:44 The Evolution of AI Models
    16:36 Enterprise Data and AI Integration
    25:06 Managing AI Agents
    36:37 Opportunities in AI and Agent Development
    38:02 The Evolving Role of AI and Data Engineers
    38:31 Managing AI Agents and Data Pipelines
    39:05 The Future of Data Scientists in AI
    40:03 Multi-Agent Systems and Interoperability
    44:09 Economic Viability of Multi-Agent Systems
    47:06 Data Platforms and Lakehouse Implementations
    53:14 Data Quality, Observability, and Governance
    01:02:20 The Rise of Multi-Cloud and Multi-Engine Systems
    01:06:21 Final Thoughts and Future Outlook

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.dataengineeringweekly.com
    --------  
    1:10:37
