The entirety of Spotify is being scraped and archived via torrents

proper 🔩

Dec 21, 2025

Platinum

7mo

My point is that you wouldn't want to pick and choose albums, which is the complete opposite of what you are talking about. Feels like you aren't reading my messages, as I'm obviously not talking about your average music listener.

oh u guys are talking about some ai s*** lol nvm

idgaf about ai

Platinum

Dec 21, 2025

Oblivion X

7mo

Yeah, you don't need to download the whole thing. But if you're not downloading the whole thing, what's the difference from just using another music scraper where you can already download music in batches?

That scale is definitely feasible with other scraping tools. You can easily download whole discographies via torrents that add up to that size.

If you're super determined, yes, that's true, but my feeling is your average ML person is not going out of their way to download that many songs, with the thought needed into which discographies are good to search for, how big each discography is, etc. Whereas now, with the dataset readily available, it's an 'interesting problem' ready for someone to tackle.

So that's what I mean about less friction. It's like having a Kaggle competition ready for you to tackle. (I assume you work in the field based on your msgs, but if not then apologies if I'm using terms you're not familiar with.)

I also think close to 300 TB is not too much of an ask for an individual/team that really wants to tackle this. You won't be storing that data long term and will never need to store it all at once, as you'll just convert it into features to train your model on and then delete the audio files.

Anyway, in case this comes across as rambling now then sorry lol, was a good convo though.

Oblivion X

Dec 22, 2025

I mean they could just pick the top-listed artists and scrape their discogs; it would even be better, since it would be in higher quality than this database.

While yeah, having it all in one place makes it easier to download, I don't think the difference is big enough to be against this.

For a team doing the full 300 TB, storing the data wouldn't be the biggest roadblock; it would be training on all that data and how much power it would take.

Hazyeyefreakmonstr

Dec 22, 2025

Lmao

Dipset Forever

Dec 22, 2025

suzuki

7mo

Slsk all u need

Wrong

Dipset Forever

Dec 22, 2025

F*** Spotify

Platinum

Dec 22, 2025

I don't really see much correlation between the 300 TB of files and the power needed to train a model. That 300 TB will be much reduced when you convert it into the data you will actually train on.

It's possible we are thinking of training on different things, though.

The point you make about it being higher quality on things like Soulseek is true, I agree. Ultimately though, I think what will probably happen is that some startup or existing company downloads the data and uses it in a model that's used behind the scenes in some type of service for labels/musicians.

Do you work/study in the field?

Oblivion X

Dec 22, 2025

Training larger models ends up using more power.

Not data science, but Electrical and Computer Engineering.

Platinum

Dec 22, 2025

Yes, that's true, but I meant that it won't be trained on literally 300 TB of data. (There is a correlation between dataset size and power needed to train, but 300 TB of audio files =/= a 300 TB dataset.)

And cool, nice. Electrical engineering is still such a good foundation to have.
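
For anyone who wants to sanity-check the "300 TB of audio =/= 300 TB dataset" point, here is a rough back-of-envelope sketch. Both bitrates are made-up placeholders, not figures from the archive or from any particular model: 160 kbps stands in for the average compressed audio bitrate, and 6 kbps stands in for some compact feature/token representation. The function name is hypothetical too.

```python
TB = 10**12  # decimal terabyte, in bytes


def feature_dataset_bytes(audio_bytes, audio_kbps, feature_kbps):
    """Estimate the size of a derived feature set from the size of the audio.

    audio_kbps and feature_kbps are ASSUMED average bitrates (kilobits/s);
    they are illustrative inputs, not measured values.
    """
    # Total seconds of audio implied by the file size and average bitrate.
    seconds = audio_bytes * 8 / (audio_kbps * 1000)
    # Bytes needed to store the features for that many seconds.
    return seconds * feature_kbps * 1000 / 8


# Hypothetical numbers: 300 TB of 160 kbps audio, features at 6 kbps.
size = feature_dataset_bytes(300 * TB, audio_kbps=160, feature_kbps=6)
print(f"{size / TB:.2f} TB")
```

With those placeholder numbers the features come out at a small fraction of the original audio size. The reduction depends entirely on the representation you pick, so swap in your own bitrates before drawing any conclusions.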