Jun 11th 2013, 0:26:53
To begin with, I was planning on using one of the shared unlimited hosts, or the DreamHost VPS with its unlimited disk/bandwidth, just so I can gauge what kind of numbers I'll be looking at. Once everything is up and running, I should have enough usage data to look for other options that provide better performance while still allowing enough bandwidth and disk space.
In terms of making the system distributed, the current setup has a separate console app/Windows process for each logically distinct scrape.
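If the scrapes move to Python, the same one-process-per-scrape layout could be kept with a small launcher. This is only a rough sketch: `run_scrapes` is a name made up for illustration and the script names in the usage note are hypothetical, not my actual scrapes.

```python
import subprocess
import sys


def run_scrapes(commands):
    """Start one OS process per scrape, then wait for all of them.

    `commands` maps a scrape name to the argv list that launches it.
    Returns a dict of scrape name -> process exit code.
    """
    # Launch everything first so the scrapes run side by side.
    procs = {name: subprocess.Popen(argv) for name, argv in commands.items()}
    # Then block until each process has finished and collect its exit code.
    return {name: proc.wait() for name, proc in procs.items()}
```

It would be called with something like `run_scrapes({"pricing": [sys.executable, "scrape_pricing.py"], "load": [sys.executable, "scrape_load.py"]})`. A non-zero exit code in the result flags a scrape that failed without taking the others down with it.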
I have worked with PHP, Java, C/C++, C#, and VB in the past, but was planning to go with Python because of some libraries available for it that I plan to use for some of the modelling. However, if one language would be more efficient than another, I'm okay with having the scrapes in one language and the models in another.
There will be maybe a dozen pages of the high-frequency type, but if that becomes a significant issue, I can just run something on one of my boxes at home to handle those. I was just hoping to have everything on a hosted server for security and accessibility.
Here's an example of something that I scrape:
www dot spp dot org/XML/LIP-Pricing dot xml
This is not one of the frequently updated sources, though; it is only revised every five minutes. And yes, I did replace . with dot lol. I just don't need this thread showing up in a Google search result if a competitor searches for the same URL.
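For a source like this, a simple polling loop would probably be enough. Here is a minimal sketch in Python, assuming the feed is fetched over plain HTTP and that a new revision can be detected by hashing the response body. `FEED_URL` is a placeholder rather than the real address, and `poll` and `handle` are names made up for illustration.

```python
import hashlib
import time
import urllib.request

FEED_URL = "http://example.com/feed.xml"  # placeholder, not the real source
POLL_SECONDS = 300  # the source is only revised every five minutes


def fetch(url):
    """Download the raw bytes of the feed."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def digest(body):
    """Fingerprint a response body so revisions can be compared cheaply."""
    return hashlib.sha256(body).hexdigest()


def poll(fetch_fn, handle, interval=POLL_SECONDS, max_polls=None, sleep=time.sleep):
    """Call fetch_fn repeatedly and pass the body to handle only when it changes.

    max_polls=None means run forever; sleep is injectable so the loop
    can be exercised without actually waiting.
    """
    last = None
    polls = 0
    while max_polls is None or polls < max_polls:
        body = fetch_fn()
        current = digest(body)
        if current != last:  # first fetch, or the feed was revised
            handle(body)
            last = current
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval)
```

Running it would look like `poll(lambda: fetch(FEED_URL), save_revision)`, where `save_revision` is whatever hypothetical function stores or parses the XML. Hashing the body means an unchanged feed costs one download but no reprocessing.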