Issue
I want to build a data lake for myself without using any cloud service. I have a Debian server and want to create the data lake with Delta Lake, the Databricks solution.
All the samples I have found set up Delta Lake on a cloud service.
How can I do this on my own server?
Later I may want to create a cluster for storing data and doing machine learning, and I want to use only Python to create the Delta Lake.
Solution
It's a broad question. Delta Lake itself is just a library that allows you to work with data in a specific format. To use it you need a few things:
Compute layer - reads and writes the Delta Lake data. You can run Apache Spark on the local machine or on a Hadoop or Kubernetes cluster, or work with Delta files using the Python or Rust libraries (although you may not get all of the features). A full list of integrations is available on the Delta Lake website.
Storage layer - keeps your Delta Lake tables. With a single server you can use the local file system, but as the data grows you will need to think about distributed storage such as HDFS or MinIO.
Data access layer - how you will access the data. It could be Spark code or something similar, but you may also need to expose the data via JDBC/ODBC, in which case you may need to set up Spark's Thrift server or something like it.
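For the Python-only, single-server case, the three layers above can be sketched roughly as follows. This is a minimal sketch, not a definitive setup: it assumes the `deltalake` package (the Python bindings of the Rust library) and `pandas` are installed, for example with `pip install deltalake pandas`. The `events` table name, the sample columns, and the helper functions `table_uri` and `write_and_read` are hypothetical and exist only for illustration.

```python
from pathlib import Path


def table_uri(root: str, name: str) -> str:
    """Build the local directory for one Delta table (hypothetical layout)."""
    return str(Path(root) / name)


def write_and_read(root: str) -> int:
    """Write a small Delta table under `root`, append to it, return the row count."""
    # Imported inside the function so the path helper above still works
    # on a machine where the optional packages are not installed yet.
    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    uri = table_uri(root, "events")  # storage layer: a plain local directory
    df = pd.DataFrame({"id": [1, 2, 3], "label": ["a", "b", "c"]})

    # Compute layer: the deltalake library writes Parquet files plus the
    # _delta_log transaction log, with no Spark involved.
    write_deltalake(uri, df, mode="overwrite")  # creates or replaces the table
    write_deltalake(uri, df, mode="append")     # adds a second commit

    # Data access layer: load the current table state back into pandas.
    table = DeltaTable(uri)
    return len(table.to_pandas())
```

Calling `write_and_read("/srv/lake")` (the directory is an assumption) would leave an `events` folder containing Parquet files and a `_delta_log` directory. If the data later outgrows one server, the same table format can be read and written by Spark on a cluster.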
Answered By - Alex Ott Answer Checked By - Pedro (WPSolving Volunteer)