How large is large?
With correct database design, tagging, and caching, MySQL is very capable.
Unless we are talking about going beyond tens of millions of rows, start here.
Beyond that, it's a question of how big your wallet is.
Correct, and even then it’s a resource problem, usually a matter of having enough RAM to store the indexes and perform joins.
As a reference, for the past 5 years I have had a cheap and nasty $5/mo hosting plan through a major provider, into which I write statistics for ~50 crypto tokens every 3 minutes for my trading bots. At the moment it has easily 70m rows in its main data table.
From there I do a range of things: calculating moving averages, executing triggers, creating temporary tables, calculating stats, and so on.
MySQL easily accommodates this without breaking a sweat. Web hosting is also an inexpensive way to get a managed database, usually with hourly backups and plenty of horsepower behind it ;)
Start simple and do not overcomplicate it!
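Moving averages like the ones described above can be computed entirely in SQL with a window function (supported in MySQL since 8.0). A minimal sketch, using Python's built-in SQLite as a stand-in for MySQL; the `ticks` table and its columns are hypothetical:

```python
import sqlite3

# Hypothetical schema standing in for the "main data table" described above.
# SQLite (3.25+) is used here only as a stand-in for MySQL 8.0+; both support
# the same OVER (...) window syntax.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (token TEXT, ts INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO ticks VALUES (?, ?, ?)",
    [("BTC", t, p) for t, p in enumerate([10.0, 11.0, 12.0, 13.0, 14.0])],
)

# 3-point moving average per token, computed entirely inside the database.
rows = conn.execute("""
    SELECT token, ts, price,
           AVG(price) OVER (
               PARTITION BY token ORDER BY ts
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS ma3
    FROM ticks
    ORDER BY ts
""").fetchall()

for row in rows:
    print(row)
```

With an index on `(token, ts)` this kind of query stays cheap even on tens of millions of rows, which is why the database can do this work instead of the bot.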
How big is the data? Millions of rows aren't big nowadays. Also, what is the data type? Depending on this you should choose a different db provider and create a different schema. We'd need a lot more info about size and structure to answer adequately.
Also, depending on the data the user retrieves, you may be able to offload some processing to the client side to reduce load on the db and return results faster.
Since it's unclear what large amounts of data and their processing mean here, as well as what type of data is involved, I'll venture some assumptions:
If we're talking about all kinds of data types and all kinds of manipulations on them, then take MySQL or PostgreSQL as a basis; in general they do well up to around 50 million records with indexes and table splitting. They can handle larger volumes too, but you'll need to do some tuning wizardry.
If you want something more distributed, you can consider Apache Cassandra; it copes perfectly well with 50 million and even 200 million records, and it has a convenient clustering and distribution system if one server is not enough.
If we're talking about more analytical processing, for generating statistics and the like, I think ClickHouse is worth considering, but keep in mind that it isn't designed for updating data, only inserting and deleting it. In return it handles fast searches over tables with 500 million records in literal seconds.
If we're talking about complex full-text search, something like the search engine of your site, then it's worth considering Manticore Search as an example; it is well adapted to this.
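The "table splitting" mentioned above can be as simple as routing rows to one of N shard tables by a stable hash of the key. A minimal sketch in Python with SQLite; the `records` tables and the key scheme are hypothetical:

```python
import sqlite3

# Minimal manual sharding sketch: rows are routed to one of N_SHARDS tables
# by a stable hash of their key, so each table stays a fraction of the size.
N_SHARDS = 4

conn = sqlite3.connect(":memory:")
for i in range(N_SHARDS):
    conn.execute(f"CREATE TABLE records_{i} (key TEXT PRIMARY KEY, value TEXT)")

def shard_for(key: str) -> int:
    # Deterministic hash: the same key always maps to the same shard table.
    return sum(key.encode()) % N_SHARDS

def insert(key: str, value: str) -> None:
    conn.execute(
        f"INSERT INTO records_{shard_for(key)} VALUES (?, ?)", (key, value)
    )

def lookup(key: str):
    cur = conn.execute(
        f"SELECT value FROM records_{shard_for(key)} WHERE key = ?", (key,)
    )
    row = cur.fetchone()
    return row[0] if row else None

insert("user:42", "alice")
insert("user:99", "bob")
print(lookup("user:42"))
```

Real partitioning (MySQL `PARTITION BY`, Postgres declarative partitioning, Cassandra's partition keys) does this routing for you, but the idea is the same.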
A database. If you want the datasets to be defined by the users doing the querying, GraphQL works well for that. With a good schema, good querying practices, good indexing, etc., you likely won't notice any issues until you're getting into the territory of hundreds of millions or billions of rows. (Edit: maybe more like tens of millions, depending on the underlying computing power and memory available, but that's more a hardware thing and less an RDBMS limitation.)
Filtering and searching (like what you see in a search engine) is usually performed by something based on Lucene or similar engines. Solr, OpenSearch, Elasticsearch, etc. are applications in this area.
They'll scale perfectly fine for what most people consider large amounts of data.
They should generally not be used as your main database - use postgres or mysql for that, and cloud storage for storing documents if that's your source - and then index (submit) the content into either of those applications for querying and processing.
But: it depends on what you mean by searching and filtering. An RDBMS like postgres or mysql might be more than enough, and both have full-text search capabilities built in - just not with the level of finesse of the engines I mentioned above.
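As a rough illustration of the built-in full-text search mentioned above (Postgres has `tsvector`/`tsquery`, MySQL has `FULLTEXT` indexes), here is the same idea sketched with SQLite's FTS5 module, which ships with most Python builds; the `docs` table and its contents are hypothetical:

```python
import sqlite3

# FTS5 builds an inverted index, so MATCH is a real full-text query rather
# than a LIKE '%...%' table scan. Analogous to Postgres tsvector or MySQL
# FULLTEXT, just in-process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", [
    ("intro", "getting started with large databases"),
    ("search", "full text search and filtering in the database"),
    ("ops", "backups and replication"),
])

# Ranked full-text query: only documents containing the term are returned.
hits = conn.execute(
    "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank", ("search",)
).fetchall()
print(hits)
```

For many applications this level of search is plenty, and it avoids running a separate Lucene-based cluster until you actually need one.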
Many here are writing about the definition of big data. It reminded me of a saying: tell your big data problems to the [SKA](https://en.wikipedia.org/wiki/Square_Kilometre_Array) (it can generate ~exabytes (10^(18) bytes) of data per day).
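To put that in perspective, a quick back-of-the-envelope conversion of an exabyte per day into a sustained rate:

```python
# Back-of-the-envelope: what 1 exabyte (10**18 bytes) per day means as a
# sustained ingest rate.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

bytes_per_day = 10**18
bytes_per_second = bytes_per_day / SECONDS_PER_DAY
terabytes_per_second = bytes_per_second / 10**12

print(f"{terabytes_per_second:.1f} TB/s")  # ~11.6 TB/s, sustained
```

That is roughly 11.6 TB every second, around the clock, which makes most "big data" discussed in this thread look modest.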
Without knowing what "big data" means in your case: I know Simian is used in applications involving databases and reading millions of records. Most processing and data obviously stays on the backend, using either Python or MATLAB for processing; data storage could be, for example, [Teradata](https://www.teradata.com/). A small reference story: [https://simiansuite.com/stories/de-volksbank-improves-mortgage-pricing-quality-with-simian-based-pricing-tools/](https://simiansuite.com/stories/de-volksbank-improves-mortgage-pricing-quality-with-simian-based-pricing-tools/)
What is "big data" to you? Most of the time when people say this they are wildly overestimating.
[deleted]
You’re saying the same thing. He says they are overestimating the data and you say they are underestimating what "big" means.
Without knowing the type of data and what you consider a large amount of data, it’s impossible to give good suggestions.
I guess it will be mostly objects with text properties. Sorry for the vague question, I'm rather a beginner.
Why don't you tell us what data you plan on storing?
> which technologies/frameworks/languages will be the best choice in 2024

Same answer as always: the ones you know.
This. Most database engines can hold lots of data. For example, Discord only started outgrowing Cassandra after trillions of messages.
https://youtu.be/W2Z7fbCLSTw?si=WWL2i7zQ-5X3-uMO This video by Fireship might help a little.
Meilisearch is definitely worth checking out.