Meanwhile in Switzerland I participated in a big semester project at the Globis research group. It was a great opportunity (offered this year only to the first 16 people who enrolled) to look at real-world projects and take part in solving interesting problems.
Each of the four teams had to complete a compound project closely connected to cutting-edge research problems important to the Globis group. Our project was called Content-driven Database Refactoring, Cleansing and Integration and, as you might guess, it was all about databases and data. The idea behind the problem was the following. Imagine someone who owns a database and uses it, let’s say, for a web site. All the necessary data is extracted from the database and placed on the web site by multiple scripts, and the web site content looks quite nice. More importantly, those scripts take the meaning of the data into account, because the script writer knew how and where to put fields from the database on the web site to make them easy for visitors to understand. At some point a problem arises: the old database schema no longer meets the requirements, or the owner simply doesn’t like it anymore. What to do? The obvious solution is to look at the old database schema and try to create a new one based on it, with some extras. But what if the owner could use the characteristics and features of the data itself to build a schema? The data knows best how it should be organized! So a person facing the problem of an old schema goes to his own web site and uses the fact that his data is nicely represented there (and nobody knows how much effort went into that neat representation!). He collects the data according to how it is laid out on the web site and can now automatically generate a schema from it! Of course, this scenario isn’t the only one that can motivate you to look at your database from the data point of view. If you have another one that suits your situation better, please strike out my explanation and add yours.
Originally the flow of the project was proposed by our supervisors, but eventually, due to our ideas and work style, the project structure turned into two almost separate stages, which were nevertheless connected by their goal and the database itself. Apart from the project flow, we were given precise project goals, though we could extend the scope of our work if we had valuable ideas. So what did we eventually have to do? First, we had to work with JSON files and automatically generate a database schema based on their structure (ideally with a user interface that lets the user modify the schema if he wants something different from what the system suggested); this became the first part of our project. Then we needed to cleanse the data based on the generated schema. Afterwards we had to integrate other sources into our newly created and cleansed database and support adding new records through a nice interface with validation at run time; all these things went into the second part of our project. It is worth mentioning that some parts of our system took the data domain into account, but the same approach could be implemented for another domain or even for generic data.
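To give a feeling for what "generating a schema from JSON structure" can mean, here is a minimal sketch in Python. It is only an illustration, not the code we wrote: the function names, the table name `item`, the sample records and the type-widening rules are all my assumptions, and it only handles flat JSON objects.

```python
import json

# Map Python types (as produced by json.loads) to SQL column types.
# Anything else (nested lists or objects) falls back to TEXT in this sketch.
SQL_TYPES = {bool: "BOOLEAN", int: "INTEGER", float: "REAL", str: "TEXT"}


def infer_columns(records):
    """Infer {field: (sql_type, nullable)} from a list of flat JSON objects.

    Hypothetical helper: a field is nullable if it is missing or null in at
    least one record. Fields that are null everywhere are skipped.
    """
    types = {}
    seen = {}
    for record in records:
        for field, value in record.items():
            if value is None:
                continue
            seen[field] = seen.get(field, 0) + 1
            sql_type = SQL_TYPES.get(type(value), "TEXT")
            previous = types.get(field)
            if previous is None or previous == sql_type:
                types[field] = sql_type
            elif {previous, sql_type} == {"INTEGER", "REAL"}:
                types[field] = "REAL"  # widen mixed numbers
            else:
                types[field] = "TEXT"  # conflicting types: keep as text
    return {f: (t, seen[f] < len(records)) for f, t in types.items()}


def create_table_sql(table, columns):
    """Render the inferred columns as a CREATE TABLE statement."""
    lines = [
        f"  {field} {sql_type}" + ("" if nullable else " NOT NULL")
        for field, (sql_type, nullable) in columns.items()
    ]
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n);"


if __name__ == "__main__":
    # Made-up sample data, only to exercise the sketch.
    records = json.loads(
        '[{"title": "A", "rating": 4},'
        ' {"title": "B", "rating": 4.5, "venue": null},'
        ' {"title": "C", "rating": 5, "pages": 10}]'
    )
    print(create_table_sql("item", infer_columns(records)))
```

A suggested schema like this is only a starting point, which is exactly why a user interface for correcting the system's suggestion was part of the goal.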
At the beginning we were quite overwhelmed by the number of goals we had, the number of technologies we had to use and a complete lack of understanding of how to start. It took us several meetings with our supervisors, and even more within the team, to fully understand the what, how and when. I have to say that our meetings were sometimes a lot of fun, because the discussions inevitably jumped to non-project topics. It cost me a fair number of nerve cells to cope with the constant laughter and the ocean of boiling ideas during team meetings (hello, my dear teammates! 🙂 ). Ultimately we settled on our architecture, divided the work within the group and started development.
At the same time, work on the second part of the project was in full swing. The guys were creating scripts that would allow us to integrate external data sources into the cleansed database, brainstorming rules for database cleansing and trying to understand how to work with Drools, the system we used for online validation of newly added content.
The more we worked, the more we and our supervisors realized how much room for further work the project contained, and we kept slowly adding more and more functionality to our system. At some point we decided to stop, otherwise we would probably still be working on it. Nevertheless, we put all our ideas into a huge final report!
I can’t say that this post contains a lot of technical details about our project; it’s just a quick overview of what we did over approximately 14 weeks. This project was an important experience for me, as I now understand better how team development works (slowly and with a lot of jokes) and how multifunctional, complicated information systems are built (with a lot of brainstorming).